Users can execute two types of jobs on ADA Clusters: Interactive jobs and Batch jobs.
Interactive jobs allow users to interact with applications in real time within a High-Performance Computing (HPC) environment. With these jobs, users can request one or more compute nodes in the HPC cluster via SLURM, and then use them to run commands or scripts directly from the command line. This is done with the srun and salloc commands. The output of these jobs is displayed on the terminal screen. However, if the user disconnects from the cluster, the interactive jobs terminate. These jobs are useful for testing, debugging, and troubleshooting code.
Note: As a general rule, running workloads directly on the login node is not recommended, because the login node manages user sessions and has limited resources. Users should always run their interactive work on compute nodes.
The srun command launches an interactive session on the compute nodes by requesting resources such as memory, time, generic resources, and node count via the SLURM workload manager. When the resources become available, a shell prompt provides an interactive session, and the user can then work interactively on the node for the specified amount of time. The session does not start until SLURM can allocate an available node for your job.
The following srun command runs an interactive job on a compute node for ten minutes with 200 MB of memory and one GPU. The user can then perform the required tasks on the allocated node.
(base) [uw76577@ada ~]$ srun --time=00:10:00 --mem=200 --gres=gpu:1 --pty /bin/bash
(base) [uw76577@g11 ~]$ echo $SLURM_NODELIST
g11
Note that once the interactive job starts, the Linux prompt changes to indicate that you are on a compute node rather than the login node. The second command prints the name of the node assigned to your interactive job. In this case, the job has been assigned to the g11 compute node.
$SLURM_NODELIST is a SLURM environment variable that stores the list of nodes allocated to the job. For a single-node job such as this one, it holds just the name of the current node.
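For jobs that span several nodes, SLURM stores the node list in a compressed form, so it can help to expand it into one name per line. The following is a minimal sketch, not a site-provided tool: job_nodes is a hypothetical helper name, and it assumes the scontrol utility that ships with SLURM is on the PATH (it falls back to printing the raw value otherwise).

```shell
# Hypothetical helper: list the nodes of the current job, one per line.
job_nodes() {
    # SLURM_JOB_NODELIST is the current name; SLURM_NODELIST is kept for compatibility.
    nodelist="${SLURM_JOB_NODELIST:-${SLURM_NODELIST:-}}"
    if [ -z "$nodelist" ]; then
        echo "not inside a SLURM job" >&2
        return 1
    fi
    if command -v scontrol >/dev/null 2>&1; then
        # Expands a compressed node list into individual hostnames.
        scontrol show hostnames "$nodelist"
    else
        # Assumption: without scontrol, just show the raw value.
        echo "$nodelist"
    fi
}
```

Calling job_nodes inside the interactive session from the example would be expected to print the single node name.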
The user should type exit when finished using the node to release the allocation.
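The options in the srun example can also be assembled from named settings, which makes it easier to see what each value controls. This is only an illustrative sketch: interactive_cmd is a hypothetical helper that prints the command line rather than running it, and the defaults simply mirror the example values.

```shell
# Hypothetical helper: print an srun command line for an interactive shell.
interactive_cmd() {
    walltime="${1:-00:10:00}"   # --time value, HH:MM:SS
    mem_mb="${2:-200}"          # --mem value; a bare number is read as megabytes
    gpus="${3:-0}"              # GPU count for --gres; 0 means no GPU request
    cmd="srun --time=$walltime --mem=$mem_mb"
    if [ "$gpus" -gt 0 ]; then
        cmd="$cmd --gres=gpu:$gpus"
    fi
    echo "$cmd --pty /bin/bash"
}

# Reproduces the command from the example: ten minutes, 200 MB, one GPU.
interactive_cmd 00:10:00 200 1
```

Printing the command first and running it only after a quick review is a deliberate choice here, since a mistyped time or memory value affects how long the job waits for resources.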
For a full description of srun and its options, see the SLURM documentation for srun.
salloc is similar to srun, except that invoking it only creates a new resource allocation; it does not by itself start a session on a compute node. Typically it is used to allocate resources on compute nodes so that an interactive session can be run through a series of subsequent srun commands or scripts that launch parallel tasks. The allocation is released when the requested time expires.
The following salloc command creates a SLURM job allocation with the specified resources, including GPUs, memory, and walltime.
(base) [uw76577@ada ~]$ salloc --time=01:00:00 --mem=500 --gres=gpu:2
salloc: Granted job allocation 34832
(base) [uw76577@ada ~]$ srun --pty /bin/bash
(base) [uw76577@g08 ~]$ hostname
g08
Note that the prompt does not change when salloc is invoked. Although SLURM has granted your job an allocation, you are not yet interactively connected to the allocated node. An interactive session is launched on the allocated compute node, and the job is executed, only when you run the srun command or submit a script for execution.
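Because the prompt looks the same before and after salloc, it is easy to lose track of whether an allocation is active. One way to check, sketched below, is to test for the SLURM_JOB_ID environment variable, which SLURM sets in the shell that salloc starts. The helper name run_in_allocation is hypothetical and is not part of SLURM.

```shell
# Hypothetical helper: run a command through srun only if an allocation exists.
run_in_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        # No job ID in the environment, so there is no allocation to use.
        echo "no allocation found: run salloc first" >&2
        return 1
    fi
    # Inside the salloc shell, srun launches the command on the allocated node.
    srun "$@"
}
```

After a successful salloc, a call such as run_in_allocation hostname would be expected to print the name of the allocated compute node rather than the login node.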
The user should release the resources as soon as they are no longer needed, by using
For a full description of salloc and its options, see the SLURM documentation for salloc.