SLURM

What is SLURM?

On a local machine, the operating system decides when, and on which resources, each process runs. In a distributed compute environment, that coordination must happen across machines, so it is handled by an authoritative workload manager (WLM). This is SLURM. It coordinates all cluster resources, maximizing utilization while preventing any single user from monopolizing the cluster. It’s quite a balancing act!

System administrators have tuned SLURM so that it is:

  1. Aware of all cluster compute resources and hardware usage; and
  2. Able to prioritize new requests for compute resources according to the needs of the users of ada; and
  3. Able to allocate the requested resources and run each job across the compute nodes.
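
The cluster state that SLURM tracks can be inspected from the command line. The sketch below is illustrative only: `slurm_overview` is a hypothetical helper name, and it assumes the standard SLURM client commands `sinfo` and `squeue` are available on the login node. When they are not found, it prints a short note instead of failing.

```shell
#!/bin/bash
# Hypothetical helper: print a snapshot of cluster state, or a note when
# the SLURM client tools are not installed on this machine.
slurm_overview() {
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "SLURM client tools not found; run this on a login node."
    return 0
  fi
  sinfo                             # partitions and the state of their nodes
  squeue -u "${USER:-$(id -un)}"    # your own pending and running jobs
}

slurm_overview
```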

All that users need to do is formulate well-defined requests for resources and submit them. There are a few important notes on this and a few advanced topics. SLURM’s developers have also provided a convenient quick reference guide.
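
As a rough illustration of a well-defined request, a minimal batch script might look like the sketch below. The job name, resource amounts, and output file name are placeholder assumptions rather than recommended values for ada, and the cluster may require additional flags (for example, a partition) that are not shown here.

```shell
#!/bin/bash
#SBATCH --job-name=hello          # name shown in the queue (placeholder)
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=1         # CPU cores per task
#SBATCH --mem=1G                  # memory per node
#SBATCH --time=00:05:00           # wall-clock limit, HH:MM:SS
#SBATCH --output=hello_%j.out     # %j expands to the job ID

# SLURM sets SLURM_JOB_ID inside a job; fall back to "none" when run by hand.
job_id="${SLURM_JOB_ID:-none}"
echo "Job ${job_id} running on $(uname -n)"
```

Saved as, say, `hello.sh`, the script would be submitted from a login node with `sbatch hello.sh`. The `#SBATCH` lines are ordinary comments to bash, so the script can also be run directly as a quick syntax check.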

Simple code of conduct for using SLURM

  1. All batch jobs and interactive jobs must be submitted to the scheduler via the login node.
  2. Do not run computational jobs on login nodes; they are shared, so doing this has a negative impact on many users.
  3. To prevent users from starting jobs without using SLURM, ssh access to the compute nodes is disabled.
  4. SLURM jobs should not be run in your $HOME directory.
  5. Make sure you don’t exceed your disk quota. File system limits are usually the first to have a negative impact on your job.
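
Rule 4 can be checked before submitting. The following is a minimal sketch, not a site-provided tool: `is_under` is a hypothetical helper that reports whether one directory lies inside another, used here to warn when the current directory is inside `$HOME`.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when directory $1 is $2 or lies beneath it.
# Returns 2 if either directory cannot be resolved.
is_under() {
  local dir base
  dir="$(cd "$1" 2>/dev/null && pwd -P)" || return 2
  base="$(cd "$2" 2>/dev/null && pwd -P)" || return 2
  case "${dir}/" in
    "${base}/"*) return 0 ;;   # trailing slash avoids matching /home vs /homework
    *) return 1 ;;
  esac
}

# Warn before submitting a job from somewhere inside $HOME.
if is_under "$PWD" "${HOME:-/nonexistent}"; then
  echo "Warning: $PWD is inside \$HOME; move to another directory before submitting." >&2
fi
```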

Throughout this document, please consider any mention of SLURM to be specific to its implementation on the ada cluster environment.

Please step through the following pages using the right arrows at the top and bottom of each page of this tutorial:

  1. Requesting Resources (Flags)
  2. SBATCH File
  3. Monitoring Jobs
  4. Modifying Jobs
  5. Array Jobs
  6. Environment Variables
  7. Resource Contention
  8. Priority
  9. Preemption