What are TensorFlow and Keras?
TensorFlow is a popular open-source machine learning library written in C++ with bindings for Python. Keras is a higher-level library that provides a simpler interface specifically for deep learning tasks. For this tutorial we will use TensorFlow as the backend for Keras.
Before starting this guide, please read the pages on USING YOUR TAKI ACCOUNT, HOW TO RUN PYTHON ON TAKI, and HOW TO RUN ON THE GPUS.
What modules do I use?
There are several options for loading Keras. If one of the individual Keras modules is loaded, its prerequisite modules are loaded automatically as well. For example:
[bajona1@taki-usr1 kerasExample]$ module load Keras/2.2.4-foss-2018b-Python-3.6.6

The following have been reloaded with a version change:
  1) GCCcore/8.2.0 => GCCcore/7.3.0
  2) GMP/6.1.2-GCCcore-8.2.0 => GMP/6.1.2-GCCcore-7.3.0
  3) HDF5/1.10.2-intel-2018b => HDF5/1.10.2-foss-2018b
  4) Python/3.7.6-intel-2019a => Python/3.6.6-foss-2018b
  5) SQLite/3.27.2-GCCcore-8.2.0 => SQLite/3.24.0-GCCcore-7.3.0
  6) Tcl/8.6.9-GCCcore-8.2.0 => Tcl/8.6.8-GCCcore-7.3.0
  7) XZ/5.2.4-GCCcore-8.2.0 => XZ/5.2.4-GCCcore-7.3.0
  8) bzip2/1.0.6-GCCcore-8.2.0 => bzip2/1.0.6-GCCcore-7.3.0
  9) libffi/3.2.1-GCCcore-8.2.0 => libffi/3.2.1-GCCcore-7.3.0
 10) libreadline/8.0-GCCcore-8.2.0 => libreadline/7.0-GCCcore-7.3.0
 11) ncurses/6.1-GCCcore-8.2.0 => ncurses/6.1-GCCcore-7.3.0
 12) zlib/1.2.11-GCCcore-8.2.0 => zlib/1.2.11-GCCcore-7.3.0
There is also currently one Python module on taki that automatically includes Keras, specifically
module load Python/3.7.6-intel-2019a
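Once a suitable module is loaded, you can confirm that Keras is importable before submitting any jobs. This one-liner is just a sanity check; it prints whichever Keras version the bundle provides.

[bajona1@taki-usr1 kerasExample]$ python -c "import keras; print(keras.__version__)"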
Running Keras code on taki
Data acquisition
As an example of how Keras can be used, we will create a neural network that classifies handwritten digits. The dataset we use is the digits dataset included with scikit-learn, a popular library of machine learning utilities; it is a small, MNIST-style collection of 8x8 grayscale images of handwritten digits.
Download: ../code-2020/keras/helpers.py
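As a rough sketch of what such a data-loading helper might contain, assuming the scikit-learn digits dataset (the function name load_data and the scaling step are assumptions for illustration; the downloaded file is authoritative):

# helpers.py -- hypothetical sketch of the data-loading helper
from sklearn.datasets import load_digits

def load_data():
    """Return the scikit-learn digits images (flattened) and their labels."""
    digits = load_digits()
    # digits.data is a (1797, 64) array of pixel intensities in [0, 16];
    # scaling to [0, 1] helps the network train smoothly.
    return digits.data / 16.0, digits.target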
Neural network setup
The network we create is fully connected with two hidden layers using ReLU activators, and 10 output nodes using the softmax activation function. The optimizer we use is Adam, and the loss function we use is categorical cross-entropy.
Download: ../code-2020/keras/network.py
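A model matching this description could be built with the Keras Sequential API roughly as follows. The hidden-layer sizes and the function name build_network are assumptions for illustration, and depending on which module you loaded the import prefix may be keras rather than tensorflow.keras; the downloaded network.py is authoritative.

# network.py -- hypothetical sketch of the model definition
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def build_network(input_dim=64, num_classes=10):
    """Fully connected network: two hidden ReLU layers and a softmax output."""
    return Sequential([
        Dense(64, activation="relu", input_shape=(input_dim,)),  # hidden layer 1 (size assumed)
        Dense(32, activation="relu"),                            # hidden layer 2 (size assumed)
        Dense(num_classes, activation="softmax"),                # one output node per digit class
    ])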
Training the neural network
We now train the network defined above. First we load the functions from the previous two files and set values for some standard hyperparameters. We then compile the network with the optimizer and loss function chosen above. Next we load our dataset and create categorical (one-hot) labels for it. Finally we train the network using model.fit() and save it to disk.
Download: ../code-2020/keras/train.py
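Putting the pieces together, train.py might look roughly like the sketch below. The hyperparameter values, the helper names load_data and build_network, and the save path digit_prediction are assumptions inferred from the surrounding text and the directory listing shown later; the downloaded train.py is authoritative.

# train.py -- hypothetical sketch of the training driver
from tensorflow.keras.utils import to_categorical

from helpers import load_data
from network import build_network

EPOCHS = 8       # matches the eight epochs in slurm.out below
BATCH_SIZE = 64  # assumed; 1797 samples / 64 gives the 29 batches per epoch seen below

# Compile the network with the Adam optimizer and categorical cross-entropy loss.
model = build_network()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Load the dataset and one-hot encode the labels for the softmax output layer.
x, y = load_data()
y = to_categorical(y, num_classes=10)

# Train the network, then save it (architecture and weights) to disk.
model.fit(x, y, epochs=EPOCHS, batch_size=BATCH_SIZE)
model.save("digit_prediction")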
Slurm file
To train our network on the GPUs, we need to create a slurm file. The following slurm file requests a single GPU from the gpu partition, loads the Python bundle we'd like to use, and then runs Python on train.py using srun. Note that when using the GPU nodes on taki, it is best practice to request the same proportion of CPUs as GPUs. For instance, if we needed half the GPUs in a node, we should also request half of the CPUs it contains.
Download: ../code-2020/keras/run.slurm
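A slurm file along these lines might look like the following sketch. The job name and the choice of the Python/3.7.6-intel-2019a bundle are assumptions; check the downloaded run.slurm for the exact resource requests used here, and adjust them to your own allocation.

#!/bin/bash
#SBATCH --job-name=kerasExample   # job name shown by squeue (assumed)
#SBATCH --partition=gpu           # the GPU partition on taki
#SBATCH --gres=gpu:1              # request a single GPU
#SBATCH --output=slurm.out        # matches the file names in the listing below
#SBATCH --error=slurm.err

module load Python/3.7.6-intel-2019a   # Python bundle that includes Keras

srun python train.py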
Submitting the job
Once the slurm file is created we can submit the job using sbatch.
[bajona1@taki-usr1 kerasExample]$ sbatch run.slurm
Submitted batch job 3037670
After the run finishes, there will be an error file, an output file, the directory containing the saved network, and possibly a __pycache__ directory.
[bajona1@taki-usr1 kerasExample]$ ll
total 56
drwxr-s--- 4 bajona1 pi_gobbert    5 Jul 31 13:20 digit_prediction/
-rw-rw---- 1 bajona1 pi_gobbert  244 Jul 31 13:06 helpers.py
-rw-rw---- 1 bajona1 pi_gobbert  547 Jul 27 11:32 network.py
drwxrws--- 2 bajona1 pi_gobbert    4 Jul 31 13:20 __pycache__/
-rw-rw---- 1 bajona1 pi_gobbert  313 Jul 27 11:21 run.slurm
-rw-rw---- 1 bajona1 pi_gobbert 6735 Jul 31 13:20 slurm.err
-rw-rw---- 1 bajona1 pi_gobbert 3424 Jul 31 13:20 slurm.out
-rw-rw---- 1 bajona1 pi_gobbert  575 Jul 27 11:40 train.py
Note that the error file will not be empty: both taki's module system and TensorFlow write their log messages to it.
[bajona1@taki-usr1 kerasExample]$ more slurm.err

The following have been reloaded with a version change:
  1) GCCcore/7.3.0 => GCCcore/8.2.0
  2) binutils/2.30-GCCcore-7.3.0 => binutils/2.31.1-GCCcore-8.2.0
  3) bzip2/1.0.6-GCCcore-7.3.0 => bzip2/1.0.6-GCCcore-8.2.0
  4) icc/2018.3.222-GCC-7.3.0-2.30 => icc/2019.1.144-GCC-8.2.0-2.31.1
  5) iccifort/2018.3.222-GCC-7.3.0-2.30 => iccifort/2019.1.144-GCC-8.2.0-2.31.1
  6) ifort/2018.3.222-GCC-7.3.0-2.30 => ifort/2019.1.144-GCC-8.2.0-2.31.1
  7) iimpi/2018b => iimpi/2019a
  8) imkl/2018.3.222-iimpi-2018b => imkl/2019.1.144-iimpi-2019a
  9) impi/2018.3.222-iccifort-2018.3.222-GCC-7.3.0-2.30 => impi/2018.4.274-iccifort-2019.1.144-GCC-8.2.0-2.31.1
 10) intel/2018b => intel/2019a
 11) ncurses/6.1-GCCcore-7.3.0 => ncurses/6.1-GCCcore-8.2.0
 12) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-8.2.0
2020-07-31 13:20:51.682416: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-07-31 13:20:54.314390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
...
The output file will contain information about the network's performance during training. We see in this example that after eight epochs the network reaches an accuracy of roughly 0.999 on the data it was trained on.
[bajona1@taki-usr1 kerasExample]$ more slurm.out
Epoch 1/8
 1/29 [>.............................] - ETA: 0s - loss: 6.0164 - accuracy: 0.12
25/29 [========================>.....] - ETA: 0s - loss: 1.1697 - accuracy: 0.69
29/29 [==============================] - 0s 2ms/step - loss: 1.0750 - accuracy: 0.7229
Epoch 2/8
 1/29 [>.............................] - ETA: 0s - loss: 0.1359 - accuracy: 0.95
26/29 [=========================>....] - ETA: 0s - loss: 0.1254 - accuracy: 0.95
29/29 [==============================] - 0s 2ms/step - loss: 0.1249 - accuracy: 0.9599
Epoch 3/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0619 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0594 - accuracy: 0.98
29/29 [==============================] - 0s 2ms/step - loss: 0.0625 - accuracy: 0.9861
Epoch 4/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0397 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0470 - accuracy: 0.98
29/29 [==============================] - 0s 2ms/step - loss: 0.0454 - accuracy: 0.9894
Epoch 5/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0273 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0261 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0262 - accuracy: 0.9939
Epoch 6/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0137 - accuracy: 1.00
26/29 [=========================>....] - ETA: 0s - loss: 0.0163 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0162 - accuracy: 0.9989
Epoch 7/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0360 - accuracy: 0.98
26/29 [=========================>....] - ETA: 0s - loss: 0.0118 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0119 - accuracy: 0.9983
Epoch 8/8
 1/29 [>.............................] - ETA: 0s - loss: 0.0055 - accuracy: 1.00
25/29 [========================>.....] - ETA: 0s - loss: 0.0094 - accuracy: 0.99
29/29 [==============================] - 0s 2ms/step - loss: 0.0093 - accuracy: 0.9989
For more information, you can consult the documentation of Keras and TensorFlow.