TensorFlow

From ACENET
Legacy documentation

This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; please visit https://docs.computecanada.ca.

Introduction

The Tensorflow developers expect users to run a very specific and very recent version of Ubuntu in order to use their binaries or to compile the source easily. That is not an option on a shared cluster, so Tensorflow must always be compiled from source on ACENET, with some tweaks depending on the version.

Compilation

Here are instructions for compiling Tensorflow 1.0 for Python 2.7.10. The same instructions should work with other versions of Python; only the Python modulefile loaded at the beginning and the name of the wheel at the end will differ. The 'sed' line adds an extra link option to the build, and a small change to protobuf.bzl is also required.

module purge
module load gcc/4.8.3 java/8 python/2.7.10
mkdir -p $HOME/tensorflow
cd $HOME/tensorflow/

# Bazel
mkdir bazel
cd bazel/
wget https://github.com/bazelbuild/bazel/releases/download/0.4.5/bazel-0.4.5-dist.zip
unzip bazel-0.4.5-dist.zip
./compile.sh
cd ../
export PATH=$HOME/tensorflow/bazel/output:$PATH

# SWIG
wget http://ufpr.dl.sourceforge.net/project/swig/swig/swig-3.0.10/swig-3.0.10.tar.gz
tar xzf swig-3.0.10.tar.gz
cd swig-3.0.10/
./configure --prefix=$HOME/tensorflow/swig
make clean && make && make install
cd ../
export SWIG_PATH=$HOME/tensorflow/swig/bin/swig

# Tensorflow
git clone -b r1.0 https://github.com/tensorflow/tensorflow
cd tensorflow/
./configure
# Choose suggested defaults, which are usually No.
# Ensure that you answer No to jemalloc.
bazel clean
# Link librt explicitly on RHEL6. The hash symbols in the sed line below do not designate comments and must be typed as part of the command.
sed -i -e 's/return \[\]  # No extension link/return \["-lrt"\]  # No extension link/g' tensorflow/tensorflow.bzl
bazel fetch --config=opt //tensorflow/tools/pip_package:build_pip_package
# Edit protobuf.bzl in the Bazel cache directory; its path will look something like
# $HOME/.cache/bazel/_bazel_$USER/298d6b5580e10224d4c831585ffd72a9/external/protobuf/protobuf.bzl
# Locate the ctx.action section and add the line "env=ctx.configuration.default_shell_env," (without the quotes) to it.
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package $HOME/tensorflow/tensorflow_pkg
pip install --user $HOME/tensorflow/tensorflow_pkg/tensorflow-1.0.1-cp27-cp27mu-linux_x86_64.whl
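For reference, the protobuf.bzl edit mentioned above looks roughly like this. This is only a sketch: protobuf.bzl is a Bazel build file written in Bazel's Python-like language, and the exact arguments passed to ctx.action differ between protobuf versions, but the added env= line is the same.

```python
# Sketch of the relevant section of protobuf.bzl (Bazel's Python-like
# build language); the surrounding arguments vary by protobuf version.
ctx.action(
    inputs=inputs,
    outputs=ctx.outputs.outs,
    arguments=args,
    executable=ctx.executable.protoc,
    env=ctx.configuration.default_shell_env,  # add this line
)
```

Without this line, the sandboxed protoc invocation does not inherit the shell environment and the build can fail on older systems.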

Running

Tensorflow appears to use all of the CPU resources available on the machine it runs on (much as Java does), which is not suitable for a shared computing environment. The more CPUs it uses, the more memory it uses, so it can, counter-intuitively, crash on a machine with more resources. To match the number of Tensorflow threads to the number of slots requested in the submission script, start the Tensorflow session in your Python script with an explicit request for the required number of threads. Here is an example for a serial job using a single thread.

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                        intra_op_parallelism_threads=1,
                                        use_per_session_threads=True))
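Rather than hard-coding the thread count, the value can be taken from Grid Engine's NSLOTS environment variable, which holds the number of slots granted to the job. A minimal sketch, falling back to 1 for serial jobs or runs outside the scheduler:

```python
import os

# Grid Engine sets NSLOTS to the number of slots granted to the job;
# default to 1 for serial jobs or runs outside the scheduler.
n_threads = int(os.environ.get("NSLOTS", "1"))

# Pass the same value to the session configuration, e.g.:
# sess = tf.Session(config=tf.ConfigProto(
#     inter_op_parallelism_threads=n_threads,
#     intra_op_parallelism_threads=n_threads,
#     use_per_session_threads=True))
```

This keeps the thread count consistent with whatever the submission script requests, so the script does not need editing when the slot count changes.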

Here is an example of a submission script:

#$ -cwd
#$ -l h_rt=01:00:00

module purge
module load gcc/4.8.3 python/2.7.10
python script.py
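For a multi-threaded run, request a parallel environment in the submission script so that the slot count matches the thread count used in the Python session. A sketch, assuming a shared-memory parallel environment named "openmp" (parallel environment names vary by site; list them with qconf -spl):

```
#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 4

module purge
module load gcc/4.8.3 python/2.7.10
python script.py
```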