Job queue system
The resource manager SLURM is installed on qtech. SLURM manages a job queue in order to optimize the use of resources on the workstation.
The following is a quick reference for the most common use cases. The complete documentation is available at: https://slurm.schedmd.com.
Job submission
The most important commands to request resources are:
- srun: to submit a single command
- sbatch: to submit a script
- salloc: to allocate resources for an interactive shell
A series of arguments must be passed to these commands in order to specify the required resources. The most important ones are:
- -A <username>: the username of the user submitting the job
- -c <cores>: the desired number of cores
- --mem <MB>: the amount of RAM required
- --gres=gpu:1: to be used if the Tesla GPU is needed
Please indicate the required resources carefully, so that the queue system can schedule jobs efficiently. Always request the fewest resources possible, to maximize the probability that your job is executed sooner rather than later.
WARNING: If the job exceeds the specified limits, the resource manager will terminate it.
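For example, a minimal sketch of the two interactive commands with these arguments (the account, resource amounts, and program name ./my_program are only placeholders):
srun -A <username> -c 4 --mem 8000 ./my_program
salloc -A <username> -c 1 --mem 2000
The same arguments can be passed to sbatch, as described below.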
To receive updates on the status of the job by email, use the following arguments:
- --mail-user=<email>: the email address
- --mail-type=<type>: the type of events to be notified about: BEGIN, END, FAIL, REQUEUE, or ALL to receive all notifications. More than one type can be specified, separated by commas.
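For example, to be notified both when a job starts and when it ends (substitute your own email address for <email>):
sbatch --mail-user=<email> --mail-type=BEGIN,END scriptname.sh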
Status of the queue system and cancelling jobs
- A dashboard with the status of the queue system is available at https://qtech.fisica.unimi.it/status.
- Alternatively, you can use the squeue command from the terminal.
- To cancel a job, use the command scancel <job-id>, where job-id is the job number. You can retrieve it using the squeue command.
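For example, a typical sequence to find and cancel one of your own jobs (the job ID 12345 is hypothetical; the -u option restricts the listing to a single user):
squeue -u $USER
scancel 12345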
Sbatch and script submission
To submit a script to the queue system, prepare the script according to the instructions below and use the command
sbatch scriptname.sh
You must pass SLURM the required information (account, number of cores, memory, etc.) either via arguments to the sbatch command or with initial comments in the script starting with #SBATCH.
If not specified with the appropriate parameter --output, stdout and stderr will be redirected to the file slurm-JOBID.out in the directory from where sbatch was called, where JOBID is the numeric ID given to the job.
If the job fails or the outcome is not the one expected, please check the output file slurm-JOBID.out for errors or warnings.
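If you prefer a different file name, a sketch of the --output parameter (the file name myjob.out is just an illustration):
sbatch --output=myjob.out scriptname.sh
The same option can also be set inside the script with an initial #SBATCH --output=myjob.out comment.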
Example
Here is an example of a script using one core.
#!/bin/bash
#SBATCH -A qtech
#SBATCH -c 1
#SBATCH --job-name=example
#SBATCH --mail-user=matteo.rossi@unimi.it
#SBATCH --mail-type=END
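# The actual workload goes here; this example just prints a line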
echo "Example"
Tips and rules
Despite the goodwill and effort of the administrator, the system is far from robust. Given the limited amount of resources, following these rules helps ensure the proper execution of jobs for all users.
- Before running long and intensive tasks, please run a test case with small dimensions on qtech (that means short execution time and low resources), to check that everything works as expected. If it does, then increase the size of the problem to your needs.
- Indicate the needed resources precisely. Failure to do so might overload the workstation and make your and others’ jobs slow down or even crash. On the other hand, requesting more resources than needed might keep your job queued for a longer time, or prevent other jobs from running in parallel.
- Keep jobs as short as possible. If a job consists of different steps, or parameter sweeps, consider dividing it into several jobs. SLURM allows job arrays and job dependencies, for when a job depends on the completion of a previous one (see the sketch after this list). This allows finer control of the queue and, most importantly, prevents major data loss in case your job crashes.
- Problems of various kinds can cause your job to crash, be suspended or be cancelled. Add checkpoints to your code as often as possible in order to prevent data loss.
- Contact the system administrator in case of doubts, problems with the queue and so on.
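As an illustration of the point on job arrays and dependencies, here is a hedged sketch of a parameter sweep split into ten independent tasks; the account, job name, and program ./my_program are placeholders:
#!/bin/bash
#SBATCH -A <username>
#SBATCH -c 1
#SBATCH --job-name=sweep
#SBATCH --array=0-9
# Each of the ten array tasks receives its own index in SLURM_ARRAY_TASK_ID
./my_program --parameter $SLURM_ARRAY_TASK_ID
A job that must start only after another job has finished successfully can be submitted with sbatch --dependency=afterok:<job-id> scriptname.sh, where <job-id> is the ID of the job it depends on.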
Using MATLAB
To use MATLAB in batch mode, use the following script. This script evaluates the MATLAB script specified in SCRIPT_NAME and saves the workspace to the file $SCRIPT_NAME.mat.
The parameter -nodisplay starts MATLAB in text mode.
The command parpool starts as many workers as specified in the #SBATCH -c 6 line.
If you are not using parallel computation, set -c 1 and remove the parpool instruction.
#!/bin/sh
#SBATCH -A rossi
#SBATCH -c 6
#SBATCH --job-name=matlab_example
#SBATCH --mail-user=matteo.rossi@unimi.it
#SBATCH --mail-type=END
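# Name of the MATLAB script to evaluate (matlab_example.m), without the .m extension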
SCRIPT_NAME=matlab_example
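# Start a parallel pool with the allocated cores, run the script, save the workspace, then quit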
matlab -nodisplay -r "parpool('local',$SLURM_CPUS_PER_TASK), $SCRIPT_NAME, save $SCRIPT_NAME.mat; quit"
Using the GPU
You need to get exclusive access to the GPU by using the --gres=gpu:1 argument in the job submission command, or in the #SBATCH comments of the script.
Example:
#!/bin/sh
#SBATCH -A qtech
#SBATCH -c 1
#SBATCH --job-name=cuda
#SBATCH --mail-user=matteo.rossi@unimi.it
#SBATCH --mail-type=END
#SBATCH --gres=gpu:1
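# Run the CUDA n-body sample in benchmark mode on the GPU allocated above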
srun ~/NVIDIA_CUDA-7.5_Samples/5_Simulations/nbody/nbody -benchmark -numbodies=550144