What is Slurm?
Slurm (Simple Linux Utility for Resource Management) is a job scheduler that manages computational resources in a cluster. It allocates resources to jobs, dispatches them, monitors their execution, and cleans up after job completion.
Why use Slurm?
- Resource allocation: Once resources are allocated to your job, they're exclusively yours for the duration of execution, regardless of system load.
- Detached execution: No need to keep an open terminal session.
- Efficient resource use: Jobs start as soon as requested resources are available, even outside working hours.
- Fair scheduling: Jobs are prioritized based on requested resources, user's system share, and queue time.
Slurm Concepts
Before diving into Slurm usage, it's important to understand some key concepts.
Which Partition can I use?
You have three partitions available on the cluster.
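To find out which partitions exist and what state their nodes are in, you can use the standard sinfo command:

sinfo        # list partitions, their nodes, and node states
sinfo -s     # condensed summary, one line per partition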
Basic Usage
Loading Software as Modules
To use software that is not part of the system, you can load it as a module:
module avail
list all available modules
module load R/4.4.0
load R version 4.4.0
module list
list loaded modules
module unload module_name
unload the loaded module module_name
module purge
unload all loaded modules
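For example, a short session might look like this (R/4.4.0 stands in for whatever module your cluster actually provides):

module avail R        # show modules matching "R"
module load R/4.4.0   # load a specific version
module list           # confirm it is loaded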
Simple Job Submission
Prefix your command with srun:
srun myprogram
Run an interactive bash session
srun --pty bash
Note: This uses default settings, which may not always be suitable.
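If the defaults are not suitable, you can pass resource options to srun directly; for example (partition name and sizes are placeholders):

srun -p partition_name --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash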
Specifying a Partition
Use the -p option with srun:
srun -p partition_name myprogram
Running Detached Jobs (Batch Mode)
- Create a shell script (batch script) containing:
  - Slurm directives (lines starting with #SBATCH)
  - Any necessary preparatory steps (e.g., loading modules)
  - Your srun command
- Submit the script using sbatch:

sbatch myscript.sh
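A minimal myscript.sh might look like this sketch (job name, module, and program are placeholders):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Preparatory steps, e.g. loading modules
module load R/4.4.0

# The actual work
srun myprogram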
Using Conda
You can use conda inside your batch script.
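Loading and activating might look like this sketch (anaconda3 and myenv follow the full example below; your cluster's Conda module may have a different name):

# Load Conda
module load anaconda3
# Activate your environment
conda activate myenv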
Monitoring Jobs
Checking Job Status
Use squeue to see which jobs are running or queued:
squeue
To see only your jobs:
squeue -u yourusername
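If you check frequently, the shell variable $USER saves typing your username, and the standard watch utility refreshes the view, here every 30 seconds:

squeue -u $USER
watch -n 30 squeue -u $USER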
Viewing Job Details
Use scontrol:
scontrol show job <jobid>
Checking Job Output
Slurm captures console output to a file named slurm-<jobid>.out in the submission directory. You can examine this file while the job is running or after it finishes.
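To follow the output live, tail works well; <jobid> is the ID reported by sbatch:

tail -f slurm-<jobid>.out

You can also redirect output to custom file names with the #SBATCH --output and --error directives, as in the conda example below; %j in the file name expands to the job ID.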
Resource Requests
CPUs
To request multiple CPU threads:
#SBATCH --cpus-per-task=X
srun --cpus-per-task=X myprogram
Note: This argument must be given to both sbatch (via #SBATCH) and srun: the first one is for the job allocation, the second for the task execution.
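Putting the two together, a sketch of a batch script requesting 4 threads (myprogram is a placeholder):

#!/bin/bash
#SBATCH --cpus-per-task=4          # allocation: reserve 4 CPU threads for the job

srun --cpus-per-task=4 myprogram   # task: let the task actually use all 4 threads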
Other Resources
Specify them in your batch script using #SBATCH directives:
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --gres=gpu:1
These directives set the memory limit, the time limit (HH:MM:SS), and the number of GPUs, respectively.
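Combined into one sketch, a GPU job with memory and time limits could look like this (the values are illustrative, not recommendations):

#!/bin/bash
#SBATCH --mem=8G           # 8 GB of RAM for the job
#SBATCH --time=02:00:00    # wall-clock limit of 2 hours
#SBATCH --gres=gpu:1       # request one GPU

srun myprogram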
Example Batch Script
#!/bin/bash
#SBATCH --job-name=conda_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load Conda
module load anaconda3

# Activate your environment
conda activate myenv

# Run your Python script
python my_script.py
This example script uses conda; launch it with:
sbatch conda_job.sh
Useful Slurm Commands
- squeue: Show job queue information
- sinfo: Display node and partition information
- scancel <jobid>: Cancel a job
- sacct: View accounting data for jobs
- scontrol: Show detailed information about jobs, partitions, and configuration
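For example, to cancel a single job or all of your jobs at once (the job ID is a placeholder):

scancel 12345        # cancel the job with ID 12345
scancel -u $USER     # cancel all of your own jobs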