TensorFlow/Keras

What are TensorFlow and Keras?

TensorFlow is a popular open-source machine learning library written in C++ with bindings for Python. Keras is a higher-level library that provides a simpler interface specifically for deep learning tasks. In this tutorial we will use TensorFlow as the backend for Keras.

Before starting this guide, please read the pages on USING YOUR TAKI ACCOUNT, HOW TO RUN PYTHON ON TAKI, and HOW TO RUN ON THE GPUS.

What modules do I use?

There are several options for loading Keras. Loading one of the individual Keras modules automatically pulls in its prerequisite modules as well. For example:

[bajona1@taki-usr1 kerasExample]$ module load Keras/2.2.4-foss-2018b-Python-3.6.6 

The following have been reloaded with a version change:
  1) GCCcore/8.2.0 => GCCcore/7.3.0
  2) GMP/6.1.2-GCCcore-8.2.0 => GMP/6.1.2-GCCcore-7.3.0
  3) HDF5/1.10.2-intel-2018b => HDF5/1.10.2-foss-2018b
  4) Python/3.7.6-intel-2019a => Python/3.6.6-foss-2018b
  5) SQLite/3.27.2-GCCcore-8.2.0 => SQLite/3.24.0-GCCcore-7.3.0
  6) Tcl/8.6.9-GCCcore-8.2.0 => Tcl/8.6.8-GCCcore-7.3.0
  7) XZ/5.2.4-GCCcore-8.2.0 => XZ/5.2.4-GCCcore-7.3.0
  8) bzip2/1.0.6-GCCcore-8.2.0 => bzip2/1.0.6-GCCcore-7.3.0
  9) libffi/3.2.1-GCCcore-8.2.0 => libffi/3.2.1-GCCcore-7.3.0
 10) libreadline/8.0-GCCcore-8.2.0 => libreadline/7.0-GCCcore-7.3.0
 11) ncurses/6.1-GCCcore-8.2.0 => ncurses/6.1-GCCcore-7.3.0
 12) zlib/1.2.11-GCCcore-8.2.0 => zlib/1.2.11-GCCcore-7.3.0

There is also currently one Python module on taki that includes Keras automatically, namely

module load Python/3.7.6-intel-2019a

Running Keras code on taki

Data acquisition

As an example of how Keras can be used, we will create a neural network that classifies handwritten digits. The dataset we use is the handwritten digits dataset that ships with scikit-learn, a popular library of machine learning utilities; it is a small 8x8-pixel relative of the classic MNIST set.


Download: ../code-2020/keras/helpers.py
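
The linked file does the data loading; the following is only a rough sketch of what such a helper might look like, assuming a load_data() function (the function name and the normalization are illustrative, not taken from the linked file):

# helpers.py (hypothetical sketch, not the linked file)
from sklearn.datasets import load_digits

def load_data():
    """Return flattened 8x8 digit images and their integer labels."""
    digits = load_digits()
    # digits.data has shape (1797, 64): each image flattened to 64
    # features with pixel values in 0..16, scaled here to 0..1.
    return digits.data / 16.0, digits.target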

Neural network setup

The network we create is fully connected, with two hidden layers using the ReLU activation function and 10 output nodes using the softmax activation function. We use the Adam optimizer and the categorical cross-entropy loss function.


Download: ../code-2020/keras/network.py
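
Again, the actual definition is in the linked file; a minimal sketch of a network matching the description above might look as follows (the function name and the hidden-layer width are illustrative assumptions):

# network.py (hypothetical sketch, not the linked file)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_network(input_dim=64, hidden_units=64):
    """Fully connected net: two ReLU hidden layers, 10-way softmax output."""
    return Sequential([
        Dense(hidden_units, activation="relu", input_shape=(input_dim,)),
        Dense(hidden_units, activation="relu"),
        Dense(10, activation="softmax"),  # one output node per digit class
    ])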

Training the neural network

We now train the network defined above. We first import the functions from the previous two files and set values for some standard hyperparameters. We then compile the network with the optimizer and loss function chosen above. Next we load our dataset and create categorical (one-hot) labels for it. Finally we train the network using model.fit() and save it to disk.


Download: ../code-2020/keras/train.py
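
The following sketch walks through the steps described above; the hyperparameter values are illustrative (batch_size=64 is a guess consistent with the 29 steps per epoch visible in the output below), and it reuses the hypothetical helpers sketched earlier:

# train.py (hypothetical sketch, not the linked file)
from tensorflow.keras.utils import to_categorical

from helpers import load_data      # hypothetical helper sketched above
from network import build_network  # hypothetical builder sketched above

epochs = 8       # matches the run shown below
batch_size = 64  # illustrative guess

# Compile with the Adam optimizer and categorical cross-entropy loss.
model = build_network()
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Load the data and convert integer labels to one-hot ("categorical") form.
x, y = load_data()
y = to_categorical(y, num_classes=10)

# Train, then save the network to disk (written as a directory).
model.fit(x, y, epochs=epochs, batch_size=batch_size)
model.save("digit_prediction")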

Slurm file

In order to train our network on the GPUs, we need to create a slurm file. The following slurm file requests a single GPU from the gpu partition, loads the Python bundle we'd like to use, and then runs train.py with Python via srun. Note that when using the GPU nodes on taki, it is best practice to request CPUs and GPUs in the same proportion: for instance, if we need half the GPUs in a node, we should also request half of the CPUs it contains.


Download: ../code-2020/keras/run.slurm
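
Since the actual file is linked above, the following is only a hedged sketch of what such a slurm file might contain (the job name and task counts are illustrative):

#!/bin/bash
# run.slurm (hypothetical sketch, not the linked file)
#SBATCH --job-name=keras_digits   # illustrative name
#SBATCH --partition=gpu           # the gpu partition described above
#SBATCH --gres=gpu:1              # request a single GPU
#SBATCH --ntasks=1
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
# Remember to also request CPUs in the same proportion as GPUs
# (e.g., via --cpus-per-task); the right count depends on the node.

module load Python/3.7.6-intel-2019a
srun python train.py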

Submitting the job

Once the slurm file is created, we can submit the job using sbatch.

[bajona1@taki-usr1 kerasExample]$ sbatch run.slurm
Submitted batch job 3037670

After the run finishes there will be an error file, an output file, the directory containing the network, and possibly a __pycache__ directory.

[bajona1@taki-usr1 kerasExample]$ ll
total 56
drwxr-s--- 4 bajona1 pi_gobbert    5 Jul 31 13:20 digit_prediction/
-rw-rw---- 1 bajona1 pi_gobbert  244 Jul 31 13:06 helpers.py
-rw-rw---- 1 bajona1 pi_gobbert  547 Jul 27 11:32 network.py
drwxrws--- 2 bajona1 pi_gobbert    4 Jul 31 13:20 __pycache__/
-rw-rw---- 1 bajona1 pi_gobbert  313 Jul 27 11:21 run.slurm
-rw-rw---- 1 bajona1 pi_gobbert 6735 Jul 31 13:20 slurm.err
-rw-rw---- 1 bajona1 pi_gobbert 3424 Jul 31 13:20 slurm.out
-rw-rw---- 1 bajona1 pi_gobbert  575 Jul 27 11:40 train.py

Note that the error file will not be empty: both taki's module system and TensorFlow write routine log messages to it.

[bajona1@taki-usr1 kerasExample]$ more slurm.err

The following have been reloaded with a version change:
  1) GCCcore/7.3.0 => GCCcore/8.2.0
  2) binutils/2.30-GCCcore-7.3.0 => binutils/2.31.1-GCCcore-8.2.0
  3) bzip2/1.0.6-GCCcore-7.3.0 => bzip2/1.0.6-GCCcore-8.2.0
  4) icc/2018.3.222-GCC-7.3.0-2.30 => icc/2019.1.144-GCC-8.2.0-2.31.1
  5) iccifort/2018.3.222-GCC-7.3.0-2.30 => iccifort/2019.1.144-GCC-8.2.0-2.31.1
  6) ifort/2018.3.222-GCC-7.3.0-2.30 => ifort/2019.1.144-GCC-8.2.0-2.31.1
  7) iimpi/2018b => iimpi/2019a
  8) imkl/2018.3.222-iimpi-2018b => imkl/2019.1.144-iimpi-2019a
  9) impi/2018.3.222-iccifort-2018.3.222-GCC-7.3.0-2.30 => impi/2018.4.274-iccifort-2019.1.144-GCC-8.2.0-2.31.1
 10) intel/2018b => intel/2019a
 11) ncurses/6.1-GCCcore-7.3.0 => ncurses/6.1-GCCcore-8.2.0
 12) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-8.2.0

2020-07-31 13:20:51.682416: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-07-31 13:20:54.314390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1

...

The output file contains information about the network's performance during training. In this example, after eight epochs the network reaches an accuracy of 0.9989 on the data it was trained on.

[bajona1@taki-usr1 kerasExample]$ more slurm.out
Epoch 1/8
 1/29 [>.............................] - ETA: 0s - loss: 6.0164 - accuracy: 0.12
25/29 [========================>.....] - ETA: 0s - loss: 1.1697 - accuracy: 0.69
29/29 [==============================] - 0s 2ms/step - loss: 1.0750 - accuracy: 0.7229
Epoch 2/8
 1/29 [>.............................] - ETA: 0s - loss: 0.1359 - accuracy: 0.95
26/29 [=========================>....] - ETA: 0s - loss: 0.1254 - accuracy: 0.95
29/29 [==============================] - 0s 2ms/step - loss: 0.1249 - accuracy: 0.9599
Epoch 3/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0619 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0594 - accuracy: 0.98
29/29 [==============================] - 0s 2ms/step - loss: 0.0625 - accuracy: 0.9861
Epoch 4/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0397 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0470 - accuracy: 0.98
29/29 [==============================] - 0s 2ms/step - loss: 0.0454 - accuracy: 0.9894
Epoch 5/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0273 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0261 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0262 - accuracy: 0.9939
Epoch 6/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0137 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0163 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0162 - accuracy: 0.9989
Epoch 7/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0360 - accuracy: 0.98
26/29 [=========================>....] - ETA: 0s - loss: 0.0118 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0119 - accuracy: 0.9983
Epoch 8/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0055 - accuracy: 1.00
25/29 [========================>.....] - ETA: 0s - loss: 0.0094 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0093 - accuracy: 0.9989
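
The trained network was saved to the digit_prediction/ directory seen in the listing above, so it can be reloaded in a later session. A minimal sketch, assuming the hypothetical load_data() helper sketched earlier:

from tensorflow.keras.models import load_model

from helpers import load_data  # hypothetical helper sketched earlier

model = load_model("digit_prediction")       # restore the trained network
x, y = load_data()
predicted = model.predict(x).argmax(axis=1)  # most likely digit per sample
print("training accuracy:", (predicted == y).mean())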

For more information, you can consult the documentation of Keras and TensorFlow.