Users can execute two types of jobs on ADA Clusters: Interactive jobs and Batch jobs.
Interactive jobs allow users to interact with applications in real time within a High-Performance Computing (HPC) environment. With these jobs, users can request one or more compute nodes in the HPC cluster via SLURM and then use them to run commands or scripts directly from the command line. This can be accomplished with the srun and salloc commands. The output of these jobs is displayed on the terminal screen; however, if the user disconnects from the cluster, the interactive job terminates. These jobs are useful for testing, debugging, and troubleshooting code.
Note: As a general rule, launching interactive jobs on the login node is not recommended, as it manages user sessions and has limited resources. Users should therefore always launch interactive jobs on compute nodes.
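A quick way to confirm whether a shell is on the login node or on a compute node is to check the hostname (or the prompt itself). The output below is only illustrative; actual hostnames depend on the cluster configuration:
(base) [uw76577@ada ~]$ hostname
ada
On a compute node, the hostname and prompt show the node name instead (for example, g11 or g08 in the sessions below).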
srun
The srun command launches an interactive session on a compute node by requesting resources such as memory, walltime, generic resources (e.g., GPUs), and node count via the SLURM workload manager. When the resources become available, a shell prompt provides an interactive session, and the user can work interactively on the node for the requested amount of time. The session does not start until SLURM can allocate an available node for the job.
The following srun command runs an interactive job on a compute node for ten minutes with 200 MB of memory and one GPU. The user can then perform the required tasks on the allocated node.
(base) [uw76577@ada ~]$ srun --time=00:10:00 --mem=200 --gres=gpu:1 --pty /bin/bash
(base) [uw76577@g11 ~]$ echo $SLURM_NODELIST
g11
It is worth noting that once the interactive job starts, the Linux prompt changes to indicate that you are on a compute node rather than a login node. The second command prints the node assigned to your interactive job; in this case, the job has been assigned to the g11 compute node. $SLURM_NODELIST is a SLURM environment variable that stores the list of nodes allocated to the current job.
The user should type exit when finished using the node to release the allocation.
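Besides $SLURM_NODELIST, SLURM exports other environment variables inside the job that can be used to verify what was allocated, such as $SLURM_JOB_ID and $SLURM_MEM_PER_NODE. The session below is a sketch; the values shown are illustrative:
(base) [uw76577@g11 ~]$ echo $SLURM_JOB_ID
34830
(base) [uw76577@g11 ~]$ echo $SLURM_MEM_PER_NODE
200
(base) [uw76577@g11 ~]$ exit
(base) [uw76577@ada ~]$
After exit, the prompt returns to the login node, indicating that the allocation has been released.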
For a full description of srun and its options, see here.
salloc
salloc is similar to srun, except that invoking it only creates a new resource allocation. Typically, it is used to allocate resources on compute nodes, which are then used to run an interactive session or a series of subsequent srun commands or scripts that launch parallel tasks. The allocation is released once the requested time has elapsed.
The salloc command below creates a SLURM job allocation with the specified resources, including GPUs, memory, and walltime.
(base) [uw76577@ada ~]$ salloc --time=01:00:00 --mem=500 --gres=gpu:2
salloc: Granted job allocation 34832
(base) [uw76577@ada ~]$ srun --pty /bin/bash
(base) [uw76577@g08 ~]$ hostname
g08
It is worth noting that the prompt does not change when salloc is invoked. This is because, while SLURM has granted your job an allocation, you are not yet interactively connected to the allocated node. Only when the srun command is used, or a script is submitted for execution, is an interactive session launched on the allocated compute node and the job executed.
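In addition to opening an interactive shell, subsequent srun commands inside a salloc allocation can launch parallel tasks directly on the allocated resources. The session below is a sketch assuming an allocation requested with --ntasks=2; the job ID and node names shown are illustrative:
(base) [uw76577@ada ~]$ salloc --time=00:30:00 --ntasks=2 --mem=500
salloc: Granted job allocation 34833
(base) [uw76577@ada ~]$ srun --ntasks=2 hostname
g08
g08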
The user should release the resources as soon as they are no longer needed, by typing exit or pressing Ctrl-d.
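For example, returning to the salloc session above (job allocation 34832), exiting the interactive shell on g08 and then exiting the salloc sub-shell releases the allocation; the exact messages may vary between SLURM versions:
(base) [uw76577@g08 ~]$ exit
(base) [uw76577@ada ~]$ exit
salloc: Relinquishing job allocation 34832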
For a full description of salloc and its options, see here.