To run computational tasks on the Ada cluster, you must first request computing resources. This is done through the Slurm resource scheduler, whose purpose is to match the compute resources in the cluster (GPUs, memory, etc.) with user resource requests.
The Ada cluster is a large shared system, and the scheduler needs an accurate estimate of the resources your program(s) will use in order to schedule jobs effectively.
Factors to consider:
- How long your job must run – Estimate how long your job will run and choose the time limit carefully. If you underestimate it, the system will cancel your job before it finishes. If you overestimate it, your job may wait in the queue longer than necessary.
- How many GPUs, cores and nodes your job will need – The level of parallelization determines how many GPUs, cores and nodes your job requires. You must tell the Slurm scheduler each of these counts.
- How much memory your job will need – Every job requires a certain amount of memory (RAM) to run, so you need to determine how much to request from the scheduler. If you request too little, your program may crash; if you request too much, resources that other jobs could use are wasted.
- Whether your job requires special features – If your job needs specific hardware, such as a particular GPU model, you can specify it as a constraint when you submit the job.
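The factors above map onto Slurm options roughly as in the following batch-script sketch. All values are illustrative assumptions rather than recommendations for Ada, and the final echo is there only to show that Slurm exports variables such as SLURM_JOB_ID inside a running job.

```shell
#!/bin/bash
# Illustrative values only; adjust each request to what your job needs.
#SBATCH --time=02:00:00        # how long the job may run (wall-clock limit)
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks=1             # number of tasks (processes)
#SBATCH --cpus-per-task=4      # CPU cores per task
#SBATCH --gres=gpu:1           # GPUs per node
#SBATCH --mem=16G              # memory per node

# Slurm sets these variables inside a job; the fallbacks let the
# script also run outside Slurm for a quick syntax check.
job_summary="job=${SLURM_JOB_ID:-none} cpus=${SLURM_CPUS_PER_TASK:-unset}"
echo "$job_summary"
```

Saved under a name such as job.sh (a hypothetical file name), the script would be submitted with `sbatch job.sh`.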
Requesting Memory (RAM)
Slurm strictly enforces the amount of memory your job can use. Make sure you request the appropriate amount, either per core with --mem-per-cpu or per node with --mem. You can request more memory than you think you'll need for a specific job, then track how much it actually uses to fine-tune future requests for similar jobs.
--mem=<M> sets the amount of memory required per node
--mem-per-cpu=<M> sets the minimum amount of memory required per allocated CPU
Note: a memory size without a unit suffix is interpreted as megabytes; a suffix of K, M, G or T selects other units (for example, --mem=100G)
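To make the relationship between the two options concrete, the sketch below computes the per-node total implied by a per-CPU request. It assumes a single task with 4 CPUs on one node; the variable names and numbers are illustrative only.

```shell
# Assumed example: one task using 4 CPUs, 4096 MB per CPU.
cpus_per_task=4
mem_per_cpu_mb=4096

# A per-CPU request scales with the number of allocated CPUs.
total_mem_mb=$(( cpus_per_task * mem_per_cpu_mb ))

# Two ways of asking for the same amount of memory on the node.
per_cpu_opts="--ntasks=1 --cpus-per-task=${cpus_per_task} --mem-per-cpu=${mem_per_cpu_mb}"
per_node_opts="--ntasks=1 --cpus-per-task=${cpus_per_task} --mem=${total_mem_mb}"

echo "$per_cpu_opts"
echo "$per_node_opts"
```

Where job accounting is enabled, `sacct -j <jobid> --format=JobID,ReqMem,MaxRSS` reports the requested memory next to the peak actually used, which is one way to do the tracking described above.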
To run your job on a GPU, you must request GPUs as a generic resource using the --gres flag, together with the number of GPUs per node. Jobs are not allocated any generic resources unless they are specifically requested at submission time using the following option:
--gres=gpu:<count>
Note that this value is per node. For example, --nodes=2 --gres=gpu:2 requests 2 nodes with 2 GPUs each, for a total of 4 GPUs.
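The per-node arithmetic can be checked with a few lines of shell. This is a minimal sketch: the variable names are illustrative and ./executable stands in for your own program.

```shell
nodes=2
gpus_per_node=2

# --gres counts are per node, so the job-wide total is the product.
total_gpus=$(( nodes * gpus_per_node ))

gres_cmd="srun --nodes=${nodes} --gres=gpu:${gpus_per_node} ./executable"
echo "${gres_cmd}   (${total_gpus} GPUs in total)"
```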
When submitting a job, it is critical to specify how long you expect it to take. If you specify a time that is too short, the scheduler will cancel your job before it finishes. If you specify a time that is too long, your job may remain in the queue longer than necessary while the scheduler looks for resources that are free for that long.
To specify your estimated runtime, use the following parameter, which sets an upper bound on the wall-clock time needed:
--time=<time>
Time can be expressed in any of the following ways:
M (M minutes)
M:S (M minutes, S seconds)
H:M:S (H hours, M minutes, S seconds)
D-H (D days, H hours)
D-H:M (D days, H hours, M minutes)
D-H:M:S (D days, H hours, M minutes, S seconds)
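As a way to check how a given time string will be read, the sketch below converts a value in any of the formats above to seconds. slurm_time_to_seconds is a hypothetical helper written for this example, not a Slurm command; it also accepts a bare number of minutes, which Slurm accepts as well.

```shell
# Hypothetical helper (not part of Slurm): convert a --time value to seconds.
slurm_time_to_seconds() {
  spec=$1
  days=0 hours=0 mins=0 secs=0
  case $spec in
    *-*)    # D-H, D-H:M or D-H:M:S
      days=${spec%%-*}
      rest=${spec#*-}
      case $rest in
        *:*:*) hours=${rest%%:*}; rest=${rest#*:}; mins=${rest%%:*}; secs=${rest#*:} ;;
        *:*)   hours=${rest%%:*}; mins=${rest#*:} ;;
        *)     hours=$rest ;;
      esac ;;
    *:*:*)  # H:M:S
      hours=${spec%%:*}; rest=${spec#*:}; mins=${rest%%:*}; secs=${rest#*:} ;;
    *:*)    # M:S
      mins=${spec%%:*}; secs=${spec#*:} ;;
    *)      # bare minutes
      mins=$spec ;;
  esac
  # awk reads "08" as decimal 8, avoiding the shell's octal interpretation.
  awk -v d="$days" -v h="$hours" -v m="$mins" -v s="$secs" \
    'BEGIN { print d * 86400 + h * 3600 + m * 60 + s }'
}
```

For instance, `slurm_time_to_seconds 2-06:30` prints 196200, that is, 2 days, 6 hours and 30 minutes.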
You may want to run programs that require specific hardware. Specific GPU cards can be requested by specifying a "feature" with the --constraint flag.
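Feature names are defined by the cluster administrators, so they have to be looked up before they can be requested. The sketch below lists each node's features and generic resources with sinfo (when Slurm is installed) and then builds a request; gpu_model_a is a made-up feature name used purely for illustration, and ./executable stands in for your own program.

```shell
# List node names (%N), their features (%f) and generic resources (%G).
# Guarded so the snippet also runs on a machine without Slurm.
if command -v sinfo >/dev/null 2>&1; then
  sinfo -o "%N %f %G" || true
fi

# "gpu_model_a" is a hypothetical feature name; substitute one that sinfo reports.
feature="gpu_model_a"
constraint_cmd="srun --gres=gpu:1 --constraint=${feature} ./executable"
echo "$constraint_cmd"
```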
For example, to request 2 GPUs, 100 GB of memory, and 24 hours of wall-clock time to run an executable:
srun --gres=gpu:2 --mem=100G --time=24:00:00 executable