Users can execute two types of jobs on the ADA cluster: interactive jobs and batch jobs. Batch jobs are submitted to the SLURM workload manager, which uses a job submission file (SBATCH file) to run the job on available cluster nodes. Unlike interactive jobs, the output of a batch job is written to a log file instead of being displayed on the terminal, and the job continues to run even if the user disconnects from the cluster. Batch jobs are typically designed to run more than one script.
SBATCH FILE
Understanding how to submit jobs using SLURM is the first step toward taking advantage of the ADA cluster. The majority of HPC jobs are executed by creating and submitting batch scripts.
There are two aspects to a job: resource requests and job steps.
- Resource requests describe the computational resources (GPUs, RAM, run time, etc.) that the job will need to run.
- Job steps define the tasks that must be executed.
The best way to manage these two parts is within a single SBATCH file (job submission script) that SLURM uses to allocate resources and process your job steps.
Creating a Job Submission Script
To create a batch script, use your favorite text editor (e.g., nano or vim) and create a file that contains both SLURM instructions and job instructions. The script is divided into three sections: the hashbang, the directives, and the commands.
- The hashbang appears on the first line of the script and specifies the program that executes the script. This is generally #!/bin/bash.
- The directives are SLURM-specific lines (prefixed with #SBATCH) that specify resource requirements for the job, including time, memory, nodes, etc. These lines must be placed before any other commands or job steps, or they will be ignored.
- The commands set up the environment (loading modules, setting variables) and run the application you want to execute.
Below is a simple example of a submission script:
myscript.sh
#!/bin/bash
#SBATCH --job-name=test_job # Job name
#SBATCH --mail-user=email@umbc.edu # Where to send mail
#SBATCH --mem=2000 # Job memory request (in MB)
#SBATCH --gres=gpu:2 # Number of requested GPU(s)
#SBATCH --time=1:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_2080 # Specific hardware constraint
#SBATCH --error=slurm.err # Error file name
#SBATCH --output=slurm.out # Output file name
module load Python/3.9.6-GCCcore-11.2.0 # load up the correct modules
python data/test.py # launch the code
The "shebang" line must be the very first line; it simply informs the system that the file is a bash shell script. Following that are a number of SBATCH directives that specify resource requirements and other job-related data. Because these lines start with #, bash treats them as comments, while the #SBATCH prefix tells SLURM to interpret them as directives. All of them must appear at the top of the file, prior to any job steps. The above script requests 2 GPUs for 1 hour, along with 2000 MB of memory and the hardware constraint 'rtx_2080'. Remember that these are just a few of the many #SBATCH directives available; for a detailed list, run 'man sbatch'. The last two lines are job steps that execute the job by loading the required module and running a Python script.
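As a rough sketch of other directives that are commonly useful, the fragment below could be added to the same script; the partition name here is only a placeholder, so check which partitions and mail-type values your cluster actually supports.
#SBATCH --nodes=1 # Number of nodes to request
#SBATCH --ntasks=1 # Number of tasks (processes) to run
#SBATCH --cpus-per-task=4 # CPU cores per task
#SBATCH --partition=batch # Partition (queue) to submit to; placeholder name
#SBATCH --mail-type=END,FAIL # When to send job status mail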
Submitting a job
This script can now be submitted to SLURM using the sbatch command. Upon success, it returns the job ID that has been assigned to the job.
(base)[uw76577@ada ~]$ sbatch myscript.sh
Submitted batch job 33172
Once the job is submitted, it enters the queue in the PENDING state. When resources become available and the job is determined to have the highest priority, an allocation is created for it and it moves to the RUNNING state. If the job completes successfully, it is set to the COMPLETED state; otherwise, it is set to the FAILED state.
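You can watch a job move through these states with the standard SLURM query commands; the job ID below is the one returned in the example above.
(base)[uw76577@ada ~]$ squeue -u $USER # your jobs that are still PENDING or RUNNING
(base)[uw76577@ada ~]$ sacct -j 33172 --format=JobID,JobName,State,Elapsed # state of a job after it leaves the queue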
________________________________________________________________________________________________________________________________________________________
A simple job from the command line
module load Python/3.9.6-GCCcore-11.2.0
sbatch --job-name=test_job --mem=2000 --gres=gpu:2 --time=1:00:00 --constraint=rtx_2080 \
--mail-user=email@umbc.edu --error=slurm.err --output=slurm.out --wrap="python data/test.py"
is equivalent to
sbatch myscript.sh
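Either way, once the job finishes, anything it printed ends up in the files named by the --output and --error directives rather than on your terminal, so you can inspect the results with, for example:
(base)[uw76577@ada ~]$ cat slurm.out
(base)[uw76577@ada ~]$ cat slurm.err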