The ADA cluster is a huge, shared system that must have an accurate estimate of the resources your program(s) will use to effectively schedule jobs. Hence, to run any computational task on the HPC cluster, you must first request computing resources. This is accomplished through the use of the SLURM workload manager, whose sole purpose is to match compute resources in the cluster (GPUs, memory, etc.) with the user’s resource request.
In other words, jobs can only be run on HPC clusters by requesting resources via the SLURM workload manager.
Factors to consider when requesting resources:
- How long your job needs to run.
- How many GPUs, cores, and nodes your job will need.
- How much memory your job will need.
- If your job requires any special hardware features.
Requesting Memory (RAM)
Every job requires a certain amount of memory (RAM) to run. You must therefore request an appropriate amount of memory per node for your job using the
--mem option. Because SLURM strictly enforces the memory limit, your program may crash if insufficient memory is requested and allocated for your job; if too much memory is allocated, resources that other jobs could use are wasted. A common approach is to request somewhat more memory than you think you'll need for a specific job, then track how much it actually uses to fine-tune future requests for similar jobs.
To specify the amount of memory, use the following option, which sets an upper limit on the amount of memory required per node.
--mem=<M>
Note: the specified memory size must be in MB.
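Since --mem expects a value in MB, a quick shell calculation helps avoid unit mistakes when you think of your job's memory needs in GB. A minimal sketch (the 4 GB figure is purely illustrative):

```shell
# Convert an illustrative 4 GB memory requirement into the MB value
# that --mem expects (1 GB = 1024 MB).
MEM_GB=4
MEM_MB=$((MEM_GB * 1024))
echo "--mem=${MEM_MB}"   # prints --mem=4096
```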
Requesting GPUs
To run your job on the available GPU(s), you must request the specific generic resource using the
--gres=gpu flag along with the number of GPUs per node. The level of parallelization refers to how many GPUs, cores, and nodes your job requires. Hence, you must tell SLURM how many GPU(s), cores, and nodes your job will actually need.
To request GPU(s) for your job, use the following option, which specifies the number of GPUs per node.
--gres=gpu:<N>
Note: N specifies the number of GPUs per node.
For example, --gres=gpu:2 requests 2 GPUs on each allocated node; combined with a request for 2 nodes, this yields a total of 4 GPUs.
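Because --gres counts GPUs per node, the total GPU count is the number of nodes multiplied by the GPUs per node. A batch-script sketch of a 2-node, 4-GPU request; the script and program names here are placeholders, not part of the original text:

```shell
#!/bin/bash
#SBATCH --nodes=2        # request 2 nodes
#SBATCH --gres=gpu:2     # 2 GPUs on each node, so 4 GPUs in total
srun ./my_program        # my_program is a placeholder executable
```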
Requesting Time
When submitting a job, it is important to specify how long you expect it to run using the
--time option. Choose this limit carefully: if you underestimate it, the system will cancel your job before it finishes; if you overestimate it, your job may wait in the queue longer than necessary.
To specify your estimated runtime, use the following option, which sets an upper bound on the wall-clock time.
--time=<T>
Time can be expressed in any of the following ways:
- M:S (M minutes, S seconds)
- H:M:S (H hours, M minutes, S seconds)
- D-H (D days, H hours)
- D-H:M (D days, H hours, M minutes)
- D-H:M:S (D days, H hours, M minutes, S seconds)
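Because the D-H forms mix units, it can help to double-check a limit before submitting. A small sketch verifying that --time=1-12 covers the same 36 hours as --time=36:00:00:

```shell
# Sanity-check that the D-H form 1-12 equals 36 hours.
DAYS=1
HOURS=12
TOTAL_HOURS=$((DAYS * 24 + HOURS))
echo "--time=${DAYS}-${HOURS} covers ${TOTAL_HOURS} hours"   # prints 36 hours
```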
Additionally, if your job requires a specific hardware feature, you can include it as a constraint when submitting your job. Specific GPU cards can be requested by specifying the feature with the --constraint option.
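As a sketch of a constraint-based request, a specific GPU model might be selected like this; the feature name v100 and the script name job.sh are assumptions, so substitute the features your own cluster actually advertises:

```shell
# The feature name "v100" and the script name "job.sh" are assumptions;
# use the node features your cluster actually defines.
sbatch --constraint=v100 --gres=gpu:1 job.sh
```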
Example: requesting 2 GPUs per node, 100 MB of memory, and 24 hours of wall-clock time to run an executable.
srun --gres=gpu:2 --mem=100 --time=24:00:00 executable
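The same request can also be made from a batch script submitted with sbatch, which is easier to reuse and adjust. A minimal sketch mirroring the srun example above:

```shell
#!/bin/bash
#SBATCH --gres=gpu:2       # 2 GPUs per node
#SBATCH --mem=100          # 100 MB of memory per node
#SBATCH --time=24:00:00    # 24-hour wall-clock limit
srun executable            # same executable as in the srun example
```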