Friday, December 28, 2018
TensorFlow on sherbert
The issues which prevented me from using Open Live Writer have been resolved. Glory be!
Anyway, building TensorFlow using Debian’s CUDA package is a freakin’ nightmare. TensorFlow uses an exotic build system called Bazel. Bazel cannot successfully use Debian’s CUDA package because … (dramatic pause) … it creates a shell script that copies the entire contents of /usr/include into its own private directory, file by file, on a single line within the shell script, thereby exceeding the maximum line length allowed by the shell.
This is so mind-blowingly stupid that I can’t bring myself to file a bug report. Makes me want to use Caffe2. But I digress.
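For the curious, here is a rough sketch of how to estimate how long such a single-line copy command would get. The `cp "src" "dst" &&` template and the `bazel-private/include` destination are made up for illustration and are not what Bazel actually generates. The sketch also assumes the limit being hit is the Linux kernel's cap on one argument string (32 pages, usually 128 KiB), which may or may not be the exact limit involved here.

```python
import os

# Assumption: Linux caps a single argument string at 32 pages (MAX_ARG_STRLEN),
# which is 128 KiB with the usual 4 KiB page size.
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
MAX_SINGLE_ARG = 32 * PAGE_SIZE


def estimate_copy_command_length(include_root, dest_root="bazel-private/include"):
    """Estimate the length of one 'cp src dst && ...' line covering every file.

    dest_root is a hypothetical destination, used only to get realistic lengths.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(include_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_root, os.path.relpath(src, include_root))
            total += len('cp "{}" "{}" && '.format(src, dst))
    return total


if __name__ == "__main__":
    length = estimate_copy_command_length("/usr/include")
    print("estimated command length: {}".format(length))
    print("single-argument limit:    {}".format(MAX_SINGLE_ARG))
    print("total argv+env limit:     {}".format(os.sysconf("SC_ARG_MAX")))
```

On a machine with plenty of -dev packages installed, the first number will likely dwarf the second.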
So, anyway, here’s how TensorFlow can be built using the Debian-managed NVIDIA driver, although not Debian-managed CUDA.
- Use Debian nvidia driver: as of this writing, version 390.87-2~bpo9+1
- Install NVIDIA CUDA, 9.1.85.3. Install into /usr/local
- Install NVIDIA libcudnn and libcudnn-dev 7.4.1.5; these get installed into /usr/lib/x86_64-linux-gnu and /usr/include/x86_64-linux-gnu.
- Unzip NVIDIA nccl 2.3.7-1+cuda9.0 into /usr/local/nccl_2.3.7
- Install bazel 0.19.2 from downloaded dpkg
- Check out tensorflow into $HOME/tf
- Configure bazel options:
- /usr/bin/python
- ?? /usr/lib/python2.7/dist-packages <- What does this mean? Is it where my existing packages from Debian are?
- XLA JIT: Y
- OpenCL SYCL: N
- ROCm: N
- CUDA: Y
- CUDA SDK: 9.1
- /usr/local/cuda
- /usr/local/cuda-9.1
- (Empty for cuDNN 7)
- cuDNN 7: /usr
- TensorRT: N
- NCCL: 2.3.7
- /usr/local/nccl_2.3.7
- CUDA compute: 6.1
- clang: N
- (Empty for /usr/bin/gcc) <- Debian default is 6.3.0
- MPI: N
- (Empty for -march=native -Wno-sign-compare)
- WORKSPACE: N
- “Bazel build”
- Wait … build succeeds
- “Build the package”
- Pip Install
- Add cuda and nccl to /etc/ld.so.conf.d
- Test https://www.tensorflow.org/tutorials/
- Mostly works, with the following caveats:
- model = tf.keras.models.Sequential yields
- WARNING:tensorflow:From /PHShome/gcs6/.local/lib/python2.7/site-packages/tensorflow/python/ops/init_ops.py:1253: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
- model.compile(optimizer='adam' yields
- WARNING:tensorflow:From /PHShome/gcs6/.local/lib/python2.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:439: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
- model.fit yields
- 2018-12-28 14:03:34.416116: W tensorflow/compiler/xla/service/platform_util.cc:253] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
- A random web search suggests the following method to choose the correct GPU:
- import os
- os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
- os.environ["CUDA_VISIBLE_DEVICES"]="1"
- import tensorflow as tf
- sess = tf.Session()
- Success!
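
The GPU-selection trick above can be wrapped in a small helper. This is only a sketch: `select_gpu` is a hypothetical name, not part of TensorFlow or CUDA. `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes CUDA number the devices the same way nvidia-smi does, and `CUDA_VISIBLE_DEVICES` hides everything except the chosen card. Refusing to run after tensorflow has been imported is a conservative choice on my part, matching the order used in the snippet above, not a documented requirement.

```python
import os
import sys


def select_gpu(index, environ=os.environ):
    """Expose a single GPU to CUDA, numbered in PCI bus order.

    Hypothetical helper; it only sets the two environment variables
    used in the snippet above.
    """
    # Conservative guard: set the variables before TensorFlow is loaded,
    # as in the original snippet.
    if "tensorflow" in sys.modules:
        raise RuntimeError("call select_gpu() before importing tensorflow")
    environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return environ["CUDA_VISIBLE_DEVICES"]
```

Usage would be `select_gpu(1)`, followed by `import tensorflow as tf` and `sess = tf.Session()` as before.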