Greg's House of Treats

Friday, December 28, 2018

TensorFlow on sherbert

The issues which prevented me from using Open Live Writer have been resolved. Glory be!

Anyway, building TensorFlow using Debian’s CUDA package is a freakin’ nightmare. TensorFlow uses an exotic build system called Bazel. Bazel cannot successfully use Debian’s CUDA package because … (dramatic pause) … it creates a shell script that copies the entire contents of /usr/include into its own private directory, file by file, on a single line within the shell script, thereby exceeding the maximum line length allowed by the shell.

This is so mind-blowingly stupid that I can’t bring myself to file a bug report. Makes me want to use Caffe2. But I digress.
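To get a feel for the scale of the problem, here is a rough sketch that builds one giant copy command naming every file under a directory and reports how long that line would be. The "cp ... dest/" command is only a stand-in for whatever Bazel actually generates, and SC_ARG_MAX is just one limit the line might run into; I have not checked which limit is really being hit.

```python
import os

def single_line_copy_length(root):
    """Return (file_count, length) of a one-line copy command naming every file under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Hypothetical stand-in for the generated script line: "cp <file1> <file2> ... dest/"
    line = "cp " + " ".join(paths) + " dest/"
    return len(paths), len(line)

def arg_max():
    """Kernel limit on total exec argument size, or None if it cannot be queried."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (ValueError, OSError, AttributeError):
        return None

if __name__ == "__main__":
    count, length = single_line_copy_length("/usr/include")
    print("files: %d, line length: %d, ARG_MAX: %s" % (count, length, arg_max()))
```

On a box with a lot of -dev packages installed, the file count under /usr/include is in the thousands, which is why the single-line approach falls over.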

So, anyway, here’s how TensorFlow can be built using the Debian-managed NVIDIA driver, although not Debian-managed CUDA.

Use the Debian nvidia driver: as of this writing, version 390.87-2~bpo9+1
Install NVIDIA CUDA 9.1.85.3 into /usr/local
Install NVIDIA libcudnn and libcudnn-dev 7.4.1.5; these get installed into /usr/lib/x86_64-linux-gnu and /usr/include/x86_64-linux-gnu.
Unzip NVIDIA nccl 2.3.7-1+cuda9.0 into /usr/local/nccl_2.3.7
Install bazel 0.19.2 from the downloaded dpkg
Check out tensorflow into $HOME/tf
Configure bazel options:

/usr/bin/python
?? /usr/lib/python2.7/dist-packages <- What does this mean? Is it where my existing packages from Debian are?
XLA JIT: Y
OpenCL SYCL: N
ROCm: N
CUDA: y
CUDA SDK: 9.1
/usr/local/cuda
/usr/local/cuda-9.1
(Empty for cuDNN 7)
cuDNN 7: /usr
TensorRT: N
NCCL: 2.3.7
/usr/local/nccl_2.3.7
CUDA compute: 6.1
clang: N
(Empty for /usr/bin/gcc) <- Debian default is 6.3.0
MPI: N
(Empty for -march=native -Wno-sign-compare)
WORKSPACE: N
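The answers above can probably be fed to ./configure non-interactively through environment variables. Below is a sketch of that mapping. The variable names are what I recall from TensorFlow's configure.py of this era and should be checked against the configure.py in your checkout; as_exports is a made-up helper. The list above shows both /usr/local/cuda and /usr/local/cuda-9.1 for the CUDA path, and picking the versioned one here is my assumption. As for the "??" line: that prompt asks for the Python library path, and on Debian dist-packages is where the system Python keeps packages installed through apt.

```python
# Answers from the list above, as (variable, value) pairs.
# Names are assumptions recalled from configure.py; verify before relying on them.
CONFIGURE_ENV = [
    ("PYTHON_BIN_PATH", "/usr/bin/python"),
    ("PYTHON_LIB_PATH", "/usr/lib/python2.7/dist-packages"),
    ("TF_ENABLE_XLA", "1"),
    ("TF_NEED_OPENCL_SYCL", "0"),
    ("TF_NEED_ROCM", "0"),
    ("TF_NEED_CUDA", "1"),
    ("TF_CUDA_VERSION", "9.1"),
    ("CUDA_TOOLKIT_PATH", "/usr/local/cuda-9.1"),  # assumption: versioned path
    ("TF_CUDNN_VERSION", "7"),
    ("CUDNN_INSTALL_PATH", "/usr"),
    ("TF_NEED_TENSORRT", "0"),
    ("TF_NCCL_VERSION", "2.3.7"),
    ("NCCL_INSTALL_PATH", "/usr/local/nccl_2.3.7"),
    ("TF_CUDA_COMPUTE_CAPABILITIES", "6.1"),
    ("TF_CUDA_CLANG", "0"),
    ("GCC_HOST_COMPILER_PATH", "/usr/bin/gcc"),
    ("TF_NEED_MPI", "0"),
    ("CC_OPT_FLAGS", "-march=native -Wno-sign-compare"),
    ("TF_SET_ANDROID_WORKSPACE", "0"),
]

def as_exports(pairs):
    """Render the pairs as shell 'export' lines (hypothetical helper)."""
    return "\n".join('export %s="%s"' % (name, value) for name, value in pairs)

if __name__ == "__main__":
    print(as_exports(CONFIGURE_ENV))
```

Sourcing the printed lines in the shell before running ./configure in $HOME/tf should skip the matching prompts, assuming the names are right for your version.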

“Bazel build”

Wait … build succeeds

“Build the package”
Pip Install
Add cuda and nccl to /etc/ld.so.conf.d
Test https://www.tensorflow.org/tutorials/
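Before running the tutorials, it may be worth checking that the dynamic loader can actually find the CUDA, cuDNN and NCCL libraries after the ld.so.conf.d change. A small sketch using ctypes follows; the sonames are my guesses from the versions listed above, and the directories in the comment are assumptions about where the libraries ended up.

```python
import ctypes

# Assumed contents of a file under /etc/ld.so.conf.d (run ldconfig afterwards):
#   /usr/local/cuda-9.1/lib64
#   /usr/local/nccl_2.3.7/lib

def can_load(libname):
    """Return True if the dynamic loader can find and load libname."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

# Sonames are assumptions based on CUDA 9.1, cuDNN 7 and NCCL 2.
for lib in ("libcudart.so.9.1", "libcudnn.so.7", "libnccl.so.2"):
    print("%-20s %s" % (lib, "ok" if can_load(lib) else "NOT FOUND"))
```

Anything reported as NOT FOUND here is likely to turn into an ImportError when tensorflow is imported.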

Mostly works, with the following caveats:
model = tf.keras.models.Sequential yields

WARNING:tensorflow:From /PHShome/gcs6/.local/lib/python2.7/site-packages/tensorflow/python/ops/init_ops.py:1253: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.

model.compile(optimizer='adam' yields

WARNING:tensorflow:From /PHShome/gcs6/.local/lib/python2.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:439: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.

model.fit yields

2018-12-28 14:03:34.416116: W tensorflow/compiler/xla/service/platform_util.cc:253] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

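A random web search suggests the following method to choose the correct GPU:

import os
# Number GPUs by PCI bus ID, the same order nvidia-smi uses.
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# Expose only GPU 1; set this before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import tensorflow as tf
sess = tf.Session()

Success!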
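To decide which number belongs in CUDA_VISIBLE_DEVICES, it helps to see the GPUs in PCI bus order, which is how nvidia-smi lists them. The sketch below asks nvidia-smi for index, name and bus ID and parses the CSV output. parse_gpu_list and list_gpus are made-up helper names, and the code assumes GPU names contain no commas.

```python
import subprocess

def parse_gpu_list(csv_text):
    """Parse 'index, name, pci.bus_id' CSV lines into (int, str, str) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, bus_id = [field.strip() for field in line.split(",", 2)]
        gpus.append((int(index), name, bus_id))
    return gpus

def list_gpus():
    """Query nvidia-smi; return an empty list if it is missing or fails."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,name,pci.bus_id",
             "--format=csv,noheader"])
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(out.decode("utf-8"))

if __name__ == "__main__":
    for index, name, bus_id in list_gpus():
        print("GPU %d: %s at %s" % (index, name, bus_id))
```

With CUDA_DEVICE_ORDER set to PCI_BUS_ID, the index printed here should be the one to put in CUDA_VISIBLE_DEVICES.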

# posted by Greg : 2:16 PM
