What is SLURM?
On a local machine, the operating system decides exactly when and on which resources a process runs. In a distributed compute environment, this coordination across machines must be handled by an authoritative workload manager (WLM). This is SLURM. It coordinates all cluster resources, optimizing for overall utilization while preventing any single user from monopolizing them. It’s quite a balancing act!
System administrators have tuned SLURM so that it is:
- Aware of all cluster compute resources and hardware usage; and
- Able to prioritize new requests for compute resources according to the needs of the users of ada; and
- Able to allocate the requested resources for execution across the cluster's compute nodes.
All that the user is required to do is formulate well-defined requests for resources and submit them. There are a few important notes on this, as well as a few advanced topics.
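As a sketch of what such a well-defined request might look like, the batch script below asks for a single node with four CPU cores, 8 GB of memory, and a one-hour time limit. The job name, output pattern, and executable (`my_program`) are placeholders, and the exact partitions, accounts, and limits available on ada may differ.

```bash
#!/bin/bash
#SBATCH --job-name=example_job      # a short name for the job
#SBATCH --nodes=1                   # number of nodes requested
#SBATCH --ntasks=1                  # number of tasks (processes)
#SBATCH --cpus-per-task=4           # CPU cores per task
#SBATCH --mem=8G                    # memory per node
#SBATCH --time=01:00:00             # wall-clock time limit (HH:MM:SS)
#SBATCH --output=%x_%j.out          # stdout file (%x = job name, %j = job ID)

# Load any software modules your program needs (module names vary by cluster)
# module load python

srun ./my_program                   # replace with your actual executable
```

Saved as, say, `my_job.sh`, this script is handed to the scheduler with `sbatch my_job.sh`; SLURM then queues the job and runs it once the requested resources become available.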
Throughout this document, please consider any mention of SLURM to be specific to the implementation of SLURM on the ada cluster environment.
Simple code of conduct for using SLURM
- All batch jobs and interactive jobs must be submitted to the scheduler via the login node.
- Do not run computational jobs on login nodes; this has a negative impact on many users.
- To prevent users from starting jobs without using SLURM, `ssh` access to the compute nodes is disabled.
- SLURM jobs should not be run in your `$HOME` directory (a brief illustration follows below).
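The commands below sketch how these rules look in practice on the login node. The `/scratch/$USER` path is an assumption; use whatever scratch or work filesystem ada actually provides instead of `$HOME`.

```bash
# From the login node, work from a scratch/work directory rather than $HOME
cd /scratch/$USER/my_project        # assumed path; use ada's designated work area

# Submit a batch job to the scheduler
sbatch my_job.sh

# Or request an interactive session on a compute node via SLURM
# (do not ssh to compute nodes directly)
srun --ntasks=1 --cpus-per-task=2 --mem=4G --time=00:30:00 --pty bash
```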
Below is an outline of the SLURM topics we discuss on this site.