Interactive Jobs

Users can execute two types of jobs on ADA Clusters: Interactive jobs and Batch jobs. 

Interactive jobs allow users to interact with applications in real time within a High-Performance Computing (HPC) environment. With these jobs, users request one or more compute nodes in the HPC cluster via SLURM and then use them to run commands or scripts directly from the command line. This is done with the srun and salloc commands. The output of these jobs is displayed on the terminal screen. However, if the user disconnects from the cluster, the interactive jobs terminate. These jobs are useful for testing, debugging, and troubleshooting code.

Note: As a general rule, running interactive workloads directly on the login node is not recommended, because the login node manages user sessions and has limited resources. Users should always run interactive jobs on compute nodes.

srun

The srun command launches an interactive session on the compute nodes by requesting resources such as memory, time, generic resources, and node count via the SLURM workload manager. When the resources become available, a shell prompt provides an interactive session. The user can then work interactively on the node for the specified amount of time. The session does not start until SLURM can allocate an available node for your job.

The following srun command runs an interactive job on a compute node for ten minutes with 200 MB of memory and one GPU. The user can then perform the required tasks on the allocated node.

(base) [uw76577@ada ~]$ srun  --time=00:10:00 --mem=200 --gres=gpu:1 --pty /bin/bash
(base) [uw76577@g11 ~]$ echo $SLURM_NODELIST
g11

Note that once the interactive job starts, the Linux prompt changes to indicate that you are on a compute node rather than the login node. The second command prints the node assigned to your interactive job; in this case, the job has been assigned to the g11 compute node. $SLURM_NODELIST is a SLURM environment variable that holds the list of nodes allocated to the job.
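Because SLURM sets variables such as SLURM_JOB_ID and SLURM_NODELIST only inside a job, they can also be used to confirm where a shell is running. The following is a minimal sketch, not part of the ADA tooling; the function name slurm_session_info and its messages are hypothetical.

```shell
# Hypothetical helper: report whether this shell is running inside a SLURM job.
# It relies only on SLURM_JOB_ID and SLURM_NODELIST, which SLURM sets inside a job.
slurm_session_info() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "Inside SLURM job ${SLURM_JOB_ID} on node(s): ${SLURM_NODELIST:-unknown}"
    else
        # No job ID means no allocation, which is typical of a login node.
        echo "Not inside a SLURM job (probably a login node)"
        return 1
    fi
}
```

Calling slurm_session_info before starting heavy work gives a quick check that the shell is on a compute node; its non-zero return status on the login node lets a script stop early.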

The user should type exit when finished using the node to release the allocation.
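The resource flags used in the srun example above can be assembled by a small helper, so that the request is visible before it is submitted. This is only a sketch: the function name build_interactive_cmd and its default values are assumptions, and it uses only the flags that already appear on this page (--time, --mem, --gres, --pty).

```shell
# Hypothetical helper: print an srun command line for an interactive shell.
# Arguments: walltime (HH:MM:SS), memory in MB, number of GPUs (0 for none).
build_interactive_cmd() {
    local walltime="${1:-00:10:00}"
    local mem_mb="${2:-200}"
    local gpus="${3:-0}"
    local cmd="srun --time=${walltime} --mem=${mem_mb}"
    # Request GPUs only when at least one is asked for.
    if [ "${gpus}" -gt 0 ]; then
        cmd="${cmd} --gres=gpu:${gpus}"
    fi
    echo "${cmd} --pty /bin/bash"
}
```

For example, build_interactive_cmd 00:10:00 200 1 prints the same command shown in the session above. The helper only prints the command; the user still reviews and runs it.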

For a full description of srun and its options, see the official SLURM srun documentation or run man srun.

salloc

salloc is similar to srun, except that invoking it only creates a new resource allocation; it does not itself start a session on a compute node. It is typically used to allocate resources on compute nodes and then run an interactive session through a series of subsequent srun commands or scripts that launch parallel tasks. The allocation is released when the requested time expires.

The salloc command below creates a SLURM job allocation with the specified resources: GPUs, memory, and walltime.

(base) [uw76577@ada ~]$ salloc  --time=01:00:00 --mem=500 --gres=gpu:2
salloc: Granted job allocation 34832

(base) [uw76577@ada ~]$ srun --pty /bin/bash
(base) [uw76577@g08 ~]$ hostname
g08

Note that the prompt does not change when salloc is invoked. Although SLURM has granted your job an allocation, you are not yet interactively connected to an allocated node. An interactive session is launched on an allocated compute node, and the job is executed, only when you run srun or submit a script for execution.
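The pattern of running several srun commands inside one salloc allocation can be sketched as a small loop. The function name run_job_steps and the LAUNCHER variable are hypothetical, and the nvidia-smi step in the usage note assumes that the GPU nodes provide NVIDIA tooling, which this page does not state.

```shell
# Hypothetical helper: run each argument as a separate step through srun.
# Intended for use inside a shell that already holds a salloc allocation.
# Set LAUNCHER="echo srun" for a dry run that only prints the commands.
run_job_steps() {
    local launcher="${LAUNCHER:-srun}"
    local step
    for step in "$@"; do
        # Unquoted on purpose: the launcher and the step are split into words.
        $launcher $step || return 1
    done
}
```

Inside an allocation, run_job_steps hostname "nvidia-smi -L" would launch two steps one after the other on the allocated resources and stop at the first step that fails.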

The user should release the resources as soon as they are no longer needed by using exit or Ctrl-d.

For a full description of salloc and its options, see the official SLURM salloc documentation or run man salloc.