Users can execute two types of jobs on ADA Clusters: Interactive jobs and Batch jobs.
Batch jobs are submitted to the SLURM workload manager, which uses a job submission file (an SBATCH file) to run the job on available nodes. Unlike interactive jobs, their output is written to a log file instead of being displayed on the terminal, so the jobs continue to run even if the user disconnects from the cluster. Batch jobs are typically designed to run more than one script.
Understanding how to submit jobs to the clusters using SLURM is the first step toward taking advantage of ADA clusters. The majority of HPC jobs are executed by creating and submitting batch scripts.
There are two aspects to a job: resource requests and job steps.
- Resource requests describe the amount of computational resources (GPUs, RAM, run time, etc.) that the job will need.
- Job steps define the tasks that must be executed.
The best way to manage these two parts is within a single SBATCH file (job submission script) that SLURM uses to allocate resources and process your job steps, as sketched below.
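As a minimal sketch (the resource values here are only illustrative), the two parts sit together in one file like this:

```bash
#!/bin/bash
# --- Resource requests: parsed by SLURM, ignored by bash ---
#SBATCH --mem=1000        # 1000 MB of RAM
#SBATCH --time=0:10:00    # 10 minutes of run time

# --- Job steps: ordinary shell commands run on the allocated node ---
echo "Running on $(hostname)"
```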
Creating a Job Submission Script
To create a batch script, open a file in your favorite text editor (e.g., vim) and fill it with both SLURM instructions and job instructions. The script is divided into three sections: the shebang, the directives, and the commands.
- The shebang appears on the first line of the script and specifies the program that executes it; this is generally '#!/bin/bash'.
- The directives are SLURM-specific lines that specify resource requirements for the job, including time, memory, and nodes. These lines must be placed before any other commands or job steps, or they will be ignored.
- The commands set up the environment (e.g., loading modules and setting variables) and run the application you want to execute.
Below is a simple example of a submission script:
```bash
#!/bin/bash
#SBATCH --job-name=test_job            # Job name
#SBATCH --mail-user=email@example.com  # Where to send mail
#SBATCH --mem=2000                     # Job memory request (MB)
#SBATCH --gres=gpu:1                   # Number of requested GPU(s)
#SBATCH --time=1:00:00                 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_2080          # Specific hardware constraint
#SBATCH --error=slurm.err              # Error file name
#SBATCH --output=slurm.out             # Output file name

module load python/2.7                 # load up the correct modules, if required
python data/MNIST.py                   # launch the code
```
The “shebang” line must be the very first line and simply informs the system that the file is a bash shell script. Following it are a number of SBATCH directives that specify resource requirements and other job-related data. A line beginning with #SBATCH is treated as a comment by bash but read by SLURM as a directive; all such lines must appear at the top of the file, before any job steps. The script above requests one GPU for 1 hour, along with 2000 MB of memory and the hardware constraint 'rtx_2080'. Remember that these are just a few of the many #SBATCH directives available; for a detailed list, run 'man sbatch'. The last two lines are the job steps: they load the required module and run a Python script.
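As an illustration, a few other commonly used directives are sketched below; the partition name 'batch' is an assumption, since partition names are site-specific:

```bash
#SBATCH --nodes=1             # Number of nodes to allocate
#SBATCH --ntasks=1            # Number of tasks (processes) to run
#SBATCH --cpus-per-task=4     # CPU cores per task
#SBATCH --partition=batch     # Partition (queue) name -- assumed; check your site's docs
#SBATCH --mail-type=END,FAIL  # Events that trigger a notification email
```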
Submitting a Job
This script can now be submitted to SLURM with the 'sbatch' command, which on success returns the job ID assigned to the job:
```
(base)[uw76577@ada ~]$ sbatch myscript.sh
Submitted batch job 33172
```
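As an aside, if you need the job ID programmatically (for example, to chain dependent jobs), sbatch's --parsable option prints only the ID. The follow-up script in this sketch is hypothetical:

```bash
jobid=$(sbatch --parsable myscript.sh)               # prints only the job ID, e.g. 33172
sbatch --dependency=afterok:"$jobid" postprocess.sh  # hypothetical script that runs after success
```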
Once the job is submitted, it enters the queue in the PENDING state. When resources become available and the job is determined to be the highest priority, an allocation is created for it and it moves to the RUNNING state. If the job completes successfully, it is set to the COMPLETED state; otherwise, it is set to the FAILED state.
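You can watch a job move through these states with SLURM's standard monitoring commands; for example, using the job ID from the submission above:

```bash
squeue -u $USER                                      # your pending and running jobs, with a state column
sacct -j 33172 --format=JobID,JobName,State,Elapsed  # accounting record, including the final state
```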
A Simple Job from the Command Line
```bash
module load python/2.7
sbatch --job-name=test_job --mem=2000 --gres=gpu:1 --time=1:00:00 --constraint=rtx_2080 \
       --mail-user=firstname.lastname@example.org --error=slurm.err --output=slurm.out \
       --wrap="python data/test.py"
```
is equivalent to