
Monitoring Jobs

Monitoring the status of running batch jobs

Once your job has been submitted, you can check its status using the squeue command. squeue shows a listing of all currently queued jobs and their states. Common states include:

Code  State      Explanation
R     Running    Job has a resource allocation and is currently executing
PD    Pending    Job is awaiting resource allocation
CD    Completed  Job has completed and exited
F     Failed     Job terminated with a non-zero exit code
CA    Cancelled  Job was explicitly cancelled by the user or a system administrator

squeue

squeue provides information about jobs in the Slurm scheduling queue, and is best used for viewing job and job-step information for active jobs. For more details on squeue, refer to the squeue manual or run man squeue.

$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      33067       gpu     mean    user1  R       0:01      1 g05
      18956       gpu     calc    user2  R      48:38      1 g03
      18967       gpu     wrap    user1  R      14:25      1 g09

The most common arguments to squeue are -u $USER, which lists only your own jobs, and -j <jobid>, which lists the job with the given job ID.

To view current user jobs:

squeue -u $USER

To filter by job ID, use the -j option followed by the job ID.

squeue -j 2542
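The output of squeue also combines well with standard shell tools. As an illustrative sketch (the pipeline is an assumption, not part of the official documentation; it assumes squeue is available on the cluster), you can summarize your own jobs by state:

```shell
# Count your queued jobs per state. --noheader suppresses the header
# row and --format="%T" prints only the extended state name, one per
# job, so sort | uniq -c tallies jobs in each state.
squeue -u "$USER" --noheader --format="%T" | sort | uniq -c | sort -rn
```

The same pattern works with any of the squeue format fields, e.g. "%P" to count jobs per partition.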
Commands with options  Outcome
squeue --long  Provide more job information
squeue --user=USER_ID  Show jobs for a specific user ID
squeue --states=pending  Show pending jobs only
squeue --account=ACCOUNT_ID  Show jobs running under a given account ID
squeue --Format=jobid,prioritylong,feature,tres-alloc:50,state  Customize the output of squeue
squeue --help  Show all options

 

Checking finished jobs

We can retrieve the history of a completed job (one no longer in the queue) using the sacct command.

sacct

sacct reports job accounting information about active or completed jobs. For a complete list of sacct options, refer to the sacct manual or run man sacct.

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
       18963       wrap        gpu         pi          2  COMPLETED      0:0 
       18964      mean1        gpu         pi          1  COMPLETED      0:0 

To retrieve statistics for a particular job:

sacct -j <JobID> --format=User,JobID,JobName,Partition,State,Time,Start,End,Elapsed,MaxRSS,MaxVMSize,NNodes,NCPUS,NodeList
Commands with options  Outcome
sacct --starttime=2021-09-27 --endtime=2021-10-04  Show jobs from 27 Sept 2021 to 4 Oct 2021
sacct --format="JobID,User,Elapsed,MaxRSS,ReqMem,MaxVMSize"  Customize the output fields of sacct
sacct --accounts=ACCOUNT_ID  Show jobs for all users under a given account ID
sacct --help  Show all options
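sacct output can also be post-processed with shell tools. The pipeline below is an illustrative sketch (it is not part of the Slurm documentation and assumes GNU date and a cluster with sacct available) for spotting jobs that did not finish cleanly:

```shell
# List jobs from the last 7 days whose State is not COMPLETED.
# --noheader drops the header rows and --parsable2 emits clean
# pipe-delimited fields, which the awk filter then inspects.
sacct --starttime="$(date -d '7 days ago' +%F)" --noheader --parsable2 \
      --format=JobID,JobName,State,ExitCode |
  awk -F'|' '$3 != "COMPLETED" {print $1, $2, $3, $4}'
```

Jobs in states such as FAILED, CANCELLED, or TIMEOUT will be printed with their exit codes, which is a quick way to triage a batch of runs.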

 

Useful SLURM commands

sinfo

sinfo allows users to view information about Slurm nodes and partitions and their states.

 $ sinfo
       PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
       gpu*         up   infinite     10    mix g[01-02,05-12]
       gpu*         up   infinite      3   idle g[03-04,13]

scontrol

scontrol is used to monitor and modify queued jobs. It provides more detailed information about the system than squeue and sinfo.

$ scontrol show jobs
 JobId=32895 JobName=bash 
 UserId=user1(163392) GroupId=pi_hp(1136) MCS_label=N/A
 Priority=356 Nice=0 Account=pi_hp QOS=normal
 JobState=RUNNING Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
 RunTime=9-04:18:31 TimeLimit=10-00:00:00 TimeMin=N/A 
 SubmitTime=2021-11-12T14:39:33 EligibleTime=2021-11-12T14:39:33 AccrueTime=2021-11-12T14:39:33
 StartTime=2021-11-13T07:28:04 EndTime=2021-11-23T07:28:04 Deadline=N/A 
 SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-13T07:28:04
 Partition=gpu AllocNode:Sid=ada:17816 ReqNodeList=(null) ExcNodeList=(null) NodeList=g05 BatchHost=g05
 NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:*
 TRES=cpu=24,mem=150G,node=1,billing=69,gres/gpu=4 Features=rtx_6000 DelayBoot=00:00:00
 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*  MinCPUsNode=24 MinMemoryNode=150G MinTmpDiskNode=0
 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
 Command=/bin/bash
 WorkDir=/nfs/ada/hp/users/user1/code/projects/semi-sup-cmsf/constrained_msf 
 TresPerNode=gpu:4 MailUser=user1 MailType=NONE

To display detailed information about a specific job:

$ scontrol -d show job 3918
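Because scontrol prints KEY=VALUE pairs, a single field is easy to pull out with grep. A minimal sketch (the job ID is illustrative and assumes scontrol is available on the cluster):

```shell
# Extract just the current state of a job from scontrol's output.
# The -o flag makes grep print only the matching KEY=VALUE token.
scontrol show job 3918 | grep -o 'JobState=[A-Z]*'
```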

srun

The srun command is used to run a parallel job on a Slurm-managed cluster. When invoked directly from the command line, srun first creates a resource allocation in which to run the parallel job.

$ srun --time=00:05:00 --mem=200 --gres=gpu:1 hostname 
  g11

salloc

salloc is used to obtain a Slurm job allocation (a set of nodes), typically to execute a command, releasing the allocation when the command finishes. The command blocks until the allocation is granted, at which point an interactive shell is started on the assigned node(s). The user can then run normal commands and launch applications. This is useful for troubleshooting or debugging a program, or when a program requires user input.

To launch an interactive job requesting 1 GPU, 100 GB of memory, and 20 minutes of wall time:

 $ salloc --nodes=1 --time=0:20:00 --gres=gpu:1 --mem=100G
    salloc: Granted job allocation 33519
 $ srun hostname
    g11
Other SLURM commands
Command Meaning
sbatch Submit a batch script to Slurm
sdiag Display scheduling statistics and timing parameters
smap A curses-based tool for displaying jobs, partitions and reservations
sprio Display the factors that comprise a job’s scheduling priority
sreport Generate canned reports from job accounting data and machine utilization statistics
srun Launch one or more tasks of an application across requested resources
sshare Display the shares and usage for each charge account and user
sstat Display process statistics of a running job step
sview A graphical tool for displaying jobs, partitions and reservations