Job queue system

The resource manager SLURM is installed on qtech. SLURM manages a job queue in order to optimize the use of resources on the workstation.

The following is a quick reference for the most common use cases. The complete documentation is available at https://slurm.schedmd.com.

Job submission

The most important commands to request resources are

  • srun to submit a single command
  • sbatch to submit a script
  • salloc to allocate resources for an interactive shell

A series of arguments must be passed to these commands in order to specify the required resources. The most important ones are

  • -A <account> The account name, i.e. the username of the user submitting the job
  • -c <cores> The desired number of cores
  • --mem=<MB> The amount of RAM required, in megabytes
  • --gres=gpu:1 Required if the job uses the Tesla GPU

Please specify the required resources carefully, so that the queue system can schedule jobs efficiently. Always request the fewest resources possible, to maximize the probability that your job starts sooner rather than later.

WARNING If the job exceeds the specified limits, the resource manager will terminate it.

To receive updates on the status of the job by email, use the following arguments (a complete submission example follows the list):

  • --mail-user=<email> Email address
  • --mail-type=<type> Type of events to be notified about: BEGIN, END, FAIL, REQUEUE, or ALL to receive all notifications. More than one type can be specified, separated by commas.
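
For example, a single command can be submitted with srun, and an interactive shell with the GPU can be requested with salloc, as in the sketch below (the program ./my_simulation and the resource amounts are only placeholders):

    # Run a single program on 4 cores with 8000 MB of RAM,
    # with email notifications when the job starts and ends
    srun -A <username> -c 4 --mem=8000 --mail-user=<email> --mail-type=BEGIN,END ./my_simulation

    # Allocate 1 core and the Tesla GPU for an interactive shell
    salloc -A <username> -c 1 --gres=gpu:1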

Status of the queue system and cancelling jobs

  • A dashboard with the status of the queue system is available at https://qtech.fisica.unimi.it/status.
  • Alternatively, you can use the squeue command from the terminal.
  • To cancel a job, use the command

      scancel <job-id>
    

where job-id is the job number. You can retrieve it using the squeue command, as in the example below.
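
For example, to list only your own jobs and then cancel one of them (the job ID 1234 is purely illustrative):

    # List only your own jobs in the queue
    squeue -u $USER

    # Cancel the job with ID 1234
    scancel 1234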

Sbatch and script submission

To submit a script to the queue system, prepare the script according to the instructions below and use the command

    sbatch scriptname.sh

You must pass SLURM the required information (account, number of cores, memory, etc.) either as arguments to the sbatch command or as initial comments in the script starting with #SBATCH.

Unless the --output option is specified, stdout and stderr are redirected to the file slurm-JOBID.out in the directory from which sbatch was called, where JOBID is the numeric ID assigned to the job.
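
For instance, resources and a custom output file can be requested directly on the command line, complementing or overriding the #SBATCH comments in the script (the values below are only an illustration):

    # Request 4 cores and 8000 MB of RAM, and write stdout/stderr
    # to a custom file containing the job ID (%j)
    sbatch -A <username> -c 4 --mem=8000 --output=myjob-%j.out scriptname.sh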

If the job fails or the outcome is not the expected one, check the output file slurm-JOBID.out for errors or warnings.

Example

Here is an example of a script using one core.

    #!/bin/bash
    #SBATCH -A qtech
    #SBATCH -c 1
    #SBATCH --job-name=example
    #SBATCH --mail-user=matteo.rossi@unimi.it
    #SBATCH --mail-type=END

    echo "Example"

Tips and rules

Despite the goodwill and effort of the administrator, the system is far from robust. Given the limited resources, please follow these rules so that everyone's jobs run properly.

  • Before running long and intensive tasks, run a small test case on qtech (short execution time, low resource usage) to check that everything works as expected. If it does, increase the size of the problem to your needs.
  • Indicate the needed resources precisely. Failure to do so might overload the workstation and slow down, or even crash, your jobs and those of others. On the other hand, requesting more resources than needed might keep your job queued for longer, or prevent other jobs from running in parallel.
  • Keep jobs as short as possible. If a job consists of different steps or parameter sweeps, consider splitting it into several jobs. SLURM supports job arrays and job dependencies, for jobs that must wait for the completion of a previous one (see the sketch after this list). This gives finer control over the queue and, most importantly, limits data loss if your job crashes.
  • Problems of various kinds can cause your job to crash, or to be suspended or cancelled. Add checkpoints to your code as often as possible to prevent data loss.
  • Contact the system administrator in case of doubts, problems with the queue and so on.
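
As a minimal sketch of job arrays and dependencies (the script names sweep.sh and postprocess.sh, the array range, and the job ID are placeholders):

    # Submit a parameter sweep as a job array with indices 1-10;
    # inside sweep.sh, $SLURM_ARRAY_TASK_ID selects the parameter set
    sbatch --array=1-10 sweep.sh

    # Submit a job that starts only after job 1234 has completed successfully
    sbatch --dependency=afterok:1234 postprocess.sh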

Using MATLAB

To use MATLAB in batch mode, use the following script. It evaluates the MATLAB script specified in SCRIPT_NAME and saves the workspace to the file $SCRIPT_NAME.mat.

The -nodisplay option starts MATLAB in text mode. The parpool command starts as many workers as specified in the #SBATCH -c 6 line. If you are not using parallel computation, set -c 1 and remove the parpool instruction.

    #!/bin/sh
    #SBATCH -A rossi
    #SBATCH -c 6
    #SBATCH --job-name=matlab_example
    #SBATCH --mail-user=matteo.rossi@unimi.it
    #SBATCH --mail-type=END

    SCRIPT_NAME=matlab_example

    matlab -nodisplay -r "parpool('local',$SLURM_CPUS_PER_TASK), $SCRIPT_NAME, save $SCRIPT_NAME.mat; quit"

Using the GPU

You need to request exclusive access to the GPU by adding the --gres=gpu:1 option to the job submission command, or to the #SBATCH comments in the script.

Example:

    #!/bin/sh
    #SBATCH -A qtech
    #SBATCH -c 1
    #SBATCH --job-name=cuda
    #SBATCH --mail-user=matteo.rossi@unimi.it
    #SBATCH --mail-type=END
    #SBATCH --gres=gpu:1

    srun ~/NVIDIA_CUDA-7.5_Samples/5_Simulations/nbody/nbody -benchmark -numbodies=550144