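After many years of embedded development I figured it was time to follow the trend a little, so I got a machine with a 1080 Ti to play with machine learning. There are not many articles on installing TensorFlow on CentOS 8, so this post summarizes the steps I took. If you repost it, please credit Oopsdump.com. Thanks.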

CentOS 8 ships with Python 3.6.8. If Python is not installed, install it with:

yum -y install python36

Install the other required dependencies:

sudo yum -y install gcc gcc-c++ python3-pip python36-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel
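Before moving on, it can help to confirm that the tools those packages provide are actually on PATH. The following is a minimal sketch, not part of my original procedure: the helper name missing_cmds and the list of commands checked are my own assumptions.

```shell
# Print every command from the argument list that is NOT found on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed list of tools that the packages above are expected to provide.
missing="$(missing_cmds python3 pip3 gcc g++ gfortran)"
if [ -n "$missing" ]; then
    echo "Still missing:" $missing
else
    echo "All build tools found."
fi
```

If anything is reported as missing, rerun the yum command above and check its output for packages that failed to install.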

The NVIDIA graphics driver can be downloaded from the official site:

https://www.nvidia.com/Download/index.aspx?lang=en-us

You can also use the link I downloaded from: http://us.download.nvidia.com/XFree86/Linux-x86_64/440.44/NVIDIA-Linux-x86_64-440.44.run

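Problems you may run into during installation:

Installer error: You appear to be running an X server; please exit X before installing

Fix: from a text console, stop the X server as root, then run the driver installer again.

First log out of the current account. At the login screen, press Ctrl+Alt+F1 to switch to a text-only console.

Use the su command to become root.

Run systemctl stop gdm.service to stop the X server, then run the driver installer again. Warnings that appear along the way can be ignored. When the installer asks whether to support the X server, change the answer from No to Yes. Once the installation finishes, enter reboot to restart into the graphical interface.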

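Installer error: ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver...

Fix: disable the stock Nouveau driver.

Nouveau is an open-source 3D driver for NVIDIA cards developed by a third party. It has to be disabled before the new driver can be loaded.

Open /etc/modprobe.d/blacklist.conf and add: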

blacklist nouveau

Open /usr/lib/modprobe.d/dist-blacklist.conf and add these two lines:

blacklist nouveau
options nouveau modeset=0
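Editing the two files by hand works, but if you repeat the setup on several machines it is easy to add the same line twice. Below is a rough sketch of an idempotent append. The helper name ensure_line is hypothetical, and the demo writes to a scratch file rather than the real configuration files.

```shell
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    el_file="$1"; el_line="$2"
    touch "$el_file"
    grep -qxF -- "$el_line" "$el_file" || printf '%s\n' "$el_line" >> "$el_file"
}

# Demonstrated on a scratch file; as root, point CONF at the files
# named above instead.
CONF="$(mktemp)"
ensure_line "$CONF" "blacklist nouveau"
ensure_line "$CONF" "options nouveau modeset=0"
ensure_line "$CONF" "blacklist nouveau"   # second call changes nothing
cat "$CONF"
```

The -x and -F options make grep match the whole line literally, so a commented-out "# blacklist nouveau" does not count as already present.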

Back up the original initramfs image (the one that still contains nouveau), then rebuild it:

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
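After the next reboot, one way to confirm that the blacklist took effect is to check whether the nouveau module is still loaded. This is a small sketch under my own assumptions: the helper name module_loaded is made up, and it simply reads the kernel's module list from /proc/modules (the same data lsmod shows).

```shell
# Succeed if the named kernel module appears in a /proc/modules-style listing.
# The second argument is optional and defaults to /proc/modules.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if module_loaded nouveau 2>/dev/null; then
    echo "nouveau is still loaded; recheck the blacklist files and the initramfs"
else
    echo "nouveau is not loaded"
fi
```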

Install dkms:

yum install kernel-devel
yum -y install epel-release
yum -y install dkms

Install libglvnd:

dnf groupinstall "Development Tools"
dnf install libglvnd-devel elfutils-libelf-devel

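Reboot, then run the NVIDIA driver installer (use the name of the file you actually downloaded): ./NVIDIA-Linux-x86_64-384.90-1080ti.run

(If the screen stays blank, try Ctrl+Alt+F2 or log in over SSH.) If the login screen still does not appear, try running nvidia-xconfig.

Verify that the installation succeeded: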

# nvidia-smi

Wed Dec  4 04:35:16 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:02:00.0  On |                  N/A |
| 62%   66C    P0    N/A /  95W |   4024MiB /  4039MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1470      G   /usr/libexec/Xorg                             39MiB |
|    0      1840      G   /usr/bin/gnome-shell                          42MiB |
|    0      6996      C   .../tensorflow-gpu-1.15.0/venv/bin/python3  3925MiB |
+-----------------------------------------------------------------------------+
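If you want to check the driver version from a script instead of reading the table, the banner line can be parsed. The sketch below is only an illustration: the function name driver_version is hypothetical, and the sample line is copied from the output above so the example runs without a GPU.

```shell
# Extract the "Driver Version" field from nvidia-smi banner text on stdin.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Real use (needs the driver installed):  nvidia-smi | driver_version
# Here the banner line from the output above stands in for a live GPU.
echo '| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |' | driver_version
```

An empty result means the banner was not found, which usually indicates that nvidia-smi itself failed.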

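To install CUDA 10.2, just follow the steps shown on the official site.

You can also use the link I downloaded from: https://developer.download.nvidia.cn/compute/cuda/10.2/Prod/local_installers/cuda-repo-rhel8-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm

Verify that the installation succeeded: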

cd /usr/local/cuda/samples
make
cd 1_Utilities/deviceQuery
make
ls
  deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml  readme.txt
(venv) [root@localhost deviceQuery]# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1050 Ti"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Capability Major/Minor version number:    6.1
Total amount of global memory:                 4040 MBytes (4235919360 bytes)
( 6) Multiprocessors, (128) CUDA Cores/MP:     768 CUDA Cores
GPU Max Clock rate:                            1493 MHz (1.49 GHz)
Memory Clock rate:                             3504 Mhz
Memory Bus Width:                              128-bit
L2 Cache Size:                                 1048576 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     Yes
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
Device supports Unified Addressing (UVA):      Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
(venv) [root@localhost deviceQuery]# pwd
/usr/local/cuda/samples/1_Utilities/deviceQuery
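The line that matters in all of that output is the final "Result = PASS". As a rough sketch (the function name device_query_passed is my own invention), the check can be reduced to an exit status that is easy to use in a script:

```shell
# Succeed only if deviceQuery output on stdin contains a "Result = PASS" line.
device_query_passed() {
    grep -q '^Result = PASS'
}

# Real use:  ./deviceQuery | device_query_passed && echo "CUDA OK"
# Two sample lines stand in for the real output here.
printf 'Detected 1 CUDA Capable device(s)\nResult = PASS\n' | device_query_passed && echo "CUDA OK"
```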

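The display driver bundled in the package above is fairly old. If the display fails to start after installing it, log in over SSH, install the display driver again, and run nvidia-xconfig.

cuDNN has to be downloaded and installed manually:

https://developer.nvidia.com/rdp/cudnn-download

Registration may be required. If you would rather not register, you can use the link I downloaded from:

https://developer.download.nvidia.cn/compute/redist/cudnn/v7.6.5/cudnn-10.2-linux-x64-v7.6.5.32.tgz

Installation: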

$ cd /usr/local/cuda/
$ tar -xzvf cudnn-10.2-linux-x64-v7.6.5.32.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
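To confirm which cuDNN version ended up in place, the version macros in the copied header can be read back. This is a sketch that assumes the cuDNN 7.x layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in cudnn.h. The function name cudnn_version is hypothetical, and the demo uses a scratch header so it runs anywhere.

```shell
# Print MAJOR.MINOR.PATCHLEVEL from a cuDNN 7.x style cudnn.h.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { if (major != "") print major "." minor "." patch }' "$1"
}

# Real use:  cudnn_version /usr/local/cuda/include/cudnn.h
# A scratch header with the three macros stands in for the real file.
HDR="$(mktemp)"
printf '#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 6\n#define CUDNN_PATCHLEVEL 5\n' > "$HDR"
cudnn_version "$HDR"
```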

Install virtualenv to keep TensorFlow isolated from the packages that belong to CentOS itself:

pip3 install --upgrade virtualenv

Create a shared virtual environment:

mkdir -p /var/venvs/
virtualenv --system-site-packages  /var/venvs/tensorflow

Apply the following diff to /var/venvs/tensorflow/bin/activate:

        unset _OLD_VIRTUAL_PYTHONHOME
    fi
+   if ! [ -z "${_OLD_VIRTUAL_LIBPATH_OPPSDUMP_COM+_}" ] ; then
+       LD_LIBRARY_PATH="$_OLD_VIRTUAL_LIBPATH_OPPSDUMP_COM"
+       export LD_LIBRARY_PATH
+       unset _OLD_VIRTUAL_LIBPATH_OPPSDUMP_COM
+   fi

    # This should detect bash and zsh, which have a hash command that must

  _OLD_VIRTUAL_PATH="$PATH"
- PATH="$VIRTUAL_ENV/bin:$PATH"
+ PATH="$VIRTUAL_ENV/bin:/usr/local/cuda/bin:$PATH"
+ export PATH
+ _OLD_VIRTUAL_LIBPATH_OPPSDUMP_COM="$LD_LIBRARY_PATH"
+ LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
+ export LD_LIBRARY_PATH
+ export CUDADIR=/usr/local/cuda

# unset PYTHONHOME if set
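The idea behind the diff is a save and restore pair: remember LD_LIBRARY_PATH when the environment is activated, prepend the CUDA library directory, and put the old value back on deactivate. The standalone sketch below shows just that pattern. The function and variable names are made up for illustration and are not what virtualenv itself uses.

```shell
CUDA_HOME_SKETCH=/usr/local/cuda   # assumed install prefix, as in the diff

# Save the current LD_LIBRARY_PATH once, then prepend the CUDA lib directory.
cuda_env_on() {
    if [ -n "${_OLD_LIBPATH_SKETCH_SET-}" ]; then
        return 0                   # already active; do not stack entries
    fi
    _OLD_LIBPATH_SKETCH="${LD_LIBRARY_PATH-}"
    _OLD_LIBPATH_SKETCH_SET=1
    LD_LIBRARY_PATH="$CUDA_HOME_SKETCH/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Restore the saved value, even when it was empty.
cuda_env_off() {
    if [ -n "${_OLD_LIBPATH_SKETCH_SET-}" ]; then
        LD_LIBRARY_PATH="$_OLD_LIBPATH_SKETCH"
        export LD_LIBRARY_PATH
        unset _OLD_LIBPATH_SKETCH _OLD_LIBPATH_SKETCH_SET
    fi
}

cuda_env_on;  echo "on:  ${LD_LIBRARY_PATH-}"
cuda_env_off; echo "off: ${LD_LIBRARY_PATH-}"
```

Tracking "was saved" in a separate flag matters because the saved value can legitimately be empty. Without that, the restore is skipped and the CUDA directory piles up in LD_LIBRARY_PATH on every activation.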

Whenever you want to use TensorFlow, first run:

source /var/venvs/tensorflow/bin/activate

Install TensorFlow:

# CPU version
pip install --upgrade tensorflow
# GPU version
pip install --upgrade tensorflow-gpu
# A specific older version
pip install tensorflow=={package_version}

Test command:

python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

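You may see output like the following, which means TensorFlow could not find the CUDA libraries under the version names it expects: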

2019-12-04 09:47:43.083342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-12-04 09:47:43.083399: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:
2019-12-04 09:47:43.083440: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:
2019-12-04 09:47:43.083478: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:
2019-12-04 09:47:43.083516: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:
2019-12-04 09:47:43.083553: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:
2019-12-04 09:47:43.083593: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:
2019-12-04 09:47:43.083599: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

To fix this, create symlinks in /usr/local/cuda/lib64/:

cd /usr/local/cuda/lib64/
ln -s libcudart.so.10.2 libcudart.so.10.0
ln -s libcufft.so.10.1.2.89 libcufft.so.10.0
ln -s libcurand.so.10.1.2.89 libcurand.so.10.0
ln -s libcusolver.so.10.3.0.89 libcusolver.so.10.0
ln -s libcusparse.so.10.3.1.89 libcusparse.so.10.0

One more symlink is needed in /usr/lib64:

cd /usr/lib64
ln -s libcublas.so.10.2.2.89 libcublas.so.10.0
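Running ln -s a second time fails with "File exists", and a typo in the target name silently creates a dangling link. The sketch below wraps the same operation with two checks. The helper name link_compat is hypothetical, and the demo runs in a scratch directory instead of the real library directories.

```shell
# Create LINK -> TARGET inside DIR, but only when TARGET exists and LINK
# does not, so re-running is harmless and nothing real is overwritten.
link_compat() {
    lc_dir="$1"; lc_target="$2"; lc_link="$3"
    if [ ! -e "$lc_dir/$lc_target" ]; then
        echo "skip: $lc_dir/$lc_target not found" >&2
        return 1
    fi
    if [ -e "$lc_dir/$lc_link" ] || [ -L "$lc_dir/$lc_link" ]; then
        return 0                   # link (or a real file) is already there
    fi
    ln -s "$lc_target" "$lc_dir/$lc_link"
}

# Scratch directory with one empty stand-in library file.
DEMO_DIR="$(mktemp -d)"
: > "$DEMO_DIR/libcudart.so.10.2"
link_compat "$DEMO_DIR" libcudart.so.10.2 libcudart.so.10.0
ls -l "$DEMO_DIR"
```

On the real system the first argument would be /usr/local/cuda/lib64 or /usr/lib64, with the same target and link names as in the commands above.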

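Note: if SELinux is enabled, relabel the symlinks with chcon -u system_u (add -h so that the label is applied to the link itself rather than to its target).

This post drew on the following article, with thanks: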

https://blog.csdn.net/happyfreeangel/article/details/103392787
