TensorFlow
Legacy documentation
This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; for those, please visit https://docs.computecanada.ca.
Introduction
The TensorFlow developers expect users to run a specific, very recent version of Ubuntu in order to use their binaries or to compile the source easily. This is not an option on a shared cluster, so on ACENET TensorFlow always needs to be compiled from source, with some tweaks that depend on the version.
Compilation
The following instructions compile TensorFlow 1.0 for Python 2.7.10. The same instructions should work with other versions of Python: load a different Python modulefile at the beginning, and expect a different wheel name at the end. The 'sed' line adds an extra link option (-lrt) to the build configuration. The file protobuf.bzl also has to be edited by hand, as described in the comments.
module purge
module load gcc/4.8.3 java/8 python/2.7.10
mkdir -p $HOME/tensorflow
cd $HOME/tensorflow/

# Bazel
mkdir bazel
cd bazel/
wget https://github.com/bazelbuild/bazel/releases/download/0.4.5/bazel-0.4.5-dist.zip
unzip bazel-0.4.5-dist.zip
./compile.sh
cd ../
export PATH=$HOME/tensorflow/bazel/output:$PATH

# SWIG
wget http://ufpr.dl.sourceforge.net/project/swig/swig/swig-3.0.10/swig-3.0.10.tar.gz
tar xzf swig-3.0.10.tar.gz
cd swig-3.0.10/
./configure --prefix=$HOME/tensorflow/swig
make clean && make && make install
cd ../
export SWIG_PATH=$HOME/tensorflow/swig/bin/swig

# Tensorflow
git clone -b r1.0 https://github.com/tensorflow/tensorflow
cd tensorflow/
./configure
# Choose suggested defaults, which are usually No.
# Ensure that you answer No to jemalloc.
bazel clean
# Link librt on RHEL6 explicitly.
# The hash symbols inside the sed command below are not comments; they are part of the command and must be typed.
sed -i -e 's/return \[\] # No extension link/return \["-lrt"\] # No extension link/g' tensorflow/tensorflow.bzl
bazel fetch --config=opt //tensorflow/tools/pip_package:build_pip_package
# Edit protobuf.bzl in the Bazel cache directory. The path will look something like this:
# $HOME/.cache/bazel/_bazel_$USER/298d6b5580e10224d4c831585ffd72a9/external/protobuf/protobuf.bzl
# Locate the ctx.action section and add the following line there (without quotes): "env=ctx.configuration.default_shell_env,"
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package $HOME/tensorflow/tensorflow_pkg
pip install --user $HOME/tensorflow/tensorflow_pkg/tensorflow-1.0.1-cp27-cp27mu-linux_x86_64.whl
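The hash-named directory in the protobuf.bzl path differs from one build to the next, so the file has to be located before it can be edited. The following is a minimal sketch of one way to find it, assuming the cache lives under $HOME/.cache/bazel/_bazel_$USER as in the instructions above. The function name find_protobuf_bzl is hypothetical and not part of Bazel or TensorFlow. The sketch only lists candidate files; the edit itself is still made by hand.

```python
import glob
import os


def find_protobuf_bzl(cache_root=None):
    """Return a sorted list of protobuf.bzl files under the Bazel user cache.

    cache_root defaults to $HOME/.cache/bazel/_bazel_$USER, which is an
    assumption based on the example path in the instructions above.
    """
    if cache_root is None:
        cache_root = os.path.join(
            os.path.expanduser("~"),
            ".cache",
            "bazel",
            "_bazel_" + os.environ.get("USER", ""),
        )
    # The "*" matches the hash-named output directory, which varies per build.
    pattern = os.path.join(cache_root, "*", "external", "protobuf", "protobuf.bzl")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in find_protobuf_bzl():
        print(path)
```

If more than one path is printed, the cache probably holds several source trees, and the one belonging to the current build needs to be picked manually.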
Running
TensorFlow appears to use all of the CPU resources available on the machine it runs on (much as Java does), which is not suitable for a shared computing environment. The more CPUs it uses, the more memory it uses, so it can, counter-intuitively, crash on a machine with more resources. To match the number of TensorFlow threads to the number of slots requested in the submission script, start the TensorFlow session in your Python script with an explicit thread count. Here is an example for a serial job with a single thread.
import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                        intra_op_parallelism_threads=1,
                                        use_per_session_threads=True))
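For a job that requests more than one slot, the thread count could be read from the scheduler instead of being hard-coded. The following is a minimal sketch, assuming the job runs under Grid Engine, which sets the NSLOTS environment variable inside a job. The helper names slots_from_scheduler, session_config_kwargs and make_session are hypothetical. Giving all slots to intra-op parallelism while keeping inter-op at 1 is only one possible split and is an assumption, not a recommendation from the original instructions.

```python
import os


def slots_from_scheduler(default=1):
    """Return the number of slots granted to the job, or `default` if unknown."""
    # NSLOTS is set by Grid Engine inside a running job; it is absent elsewhere.
    try:
        slots = int(os.environ.get("NSLOTS", default))
    except ValueError:
        slots = default
    return max(1, slots)


def session_config_kwargs(slots):
    """Build keyword arguments for tf.ConfigProto for a given slot count."""
    return {
        # Assumption: run independent ops one at a time ...
        "inter_op_parallelism_threads": 1,
        # ... and let each op use up to `slots` threads.
        "intra_op_parallelism_threads": slots,
        "use_per_session_threads": True,
    }


def make_session():
    """Create a TensorFlow 1.x session limited to the scheduler's slot count."""
    import tensorflow as tf  # imported here so the helpers work without TensorFlow

    kwargs = session_config_kwargs(slots_from_scheduler())
    return tf.Session(config=tf.ConfigProto(**kwargs))
```

In a job script, sess = make_session() would then take the place of the explicit tf.Session(...) call. Outside a job, where NSLOTS is not set, the sketch falls back to a single thread.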
Here is an example of a submission script:
#$ -cwd
#$ -l h_rt=01:00:00
module purge
module load gcc/4.8.3 python/2.7.10
python script.py