Minimal Slurm FAQ

What's an HPC system?

An HPC (High-Performance Computing) system is a network of computers (cluster) designed to process large amounts of data and perform complex calculations at high speeds. Key features include:

  • Parallel processing: Multiple computers work together on a single problem.
  • Powerful processors: Often uses specialized CPUs or GPUs for faster computations.
  • High-speed interconnects: Fast communication between nodes for efficient data sharing.
  • Large storage capacity: Handles vast amounts of data for scientific simulations or data analysis.

HPC systems are crucial for tasks like weather forecasting, molecular modeling, and artificial intelligence research.

    What is Slurm?

    Slurm (Simple Linux Utility for Resource Management) is a job scheduler that manages computational resources in a cluster. It allocates resources to jobs, dispatches them, monitors their execution, and cleans up after job completion.

    Why use Slurm?

    1. Resource allocation: Once resources are allocated to your job, they're exclusively yours for the duration of execution, regardless of system load.
    2. Detached execution: No need to keep an open terminal session.
    3. Efficient resource use: Jobs start as soon as requested resources are available, even outside working hours.
    4. Fair scheduling: Jobs are prioritized based on requested resources, user's system share, and queue time.

    Slurm Concepts

    Before diving into Slurm usage, it's important to understand some key concepts:
  • Node: A computer in the cluster.
  • Partition: A group of nodes with specific characteristics.
  • Job: A resource allocation request for a specific program or task.
  • Task: An instance of a running program within a job.
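    To make the job/task distinction concrete: a single job allocation can contain several tasks. A sketch, assuming access to a Slurm cluster (hostname is just a stand-in program):

```bash
# One job allocation containing 4 tasks: srun starts hostname
# once per task, possibly on different nodes of the cluster.
srun --ntasks=4 hostname
```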
    Which Partition can I use?


    There are three partitions on the cluster:
  • normal_prio: 1 day maximum running time, priority 100
  • normal_prio_long: 15 days maximum running time, priority 50
  • high_prio: 1 day maximum running time, priority 1000
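    To check which partitions exist on your cluster, along with their time limits and node states, you can use sinfo (a standard Slurm command; the partition name below is one from the list above):

```bash
sinfo                       # overview of all partitions, time limits, and node states
sinfo -p normal_prio_long   # details for a single partition
```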
    Basic Usage

    Loading Software as modules

    To use software that is not part of the base system, you can load it as a module:

    module avail

    list all available modules

    module load R/4.4.0

    load R version 4.4.0

    module list

    list loaded modules

    module unload module_name

    unload the loaded module module_name

    module purge

    unload all loaded modules

    Simple Job Submission

    To run a program on a compute node, prefix your command with srun:

    srun myprogram

    Run an interactive bash session:

    srun --pty bash

    Note: This uses default settings, which may not always be suitable.

    Specifying a Partition

    Use the -p option with srun:

    srun -p partition_name myprogram

    Running Detached Jobs (Batch Mode)

    1. Create a shell script (batch script) containing:
      • Slurm directives (lines starting with #SBATCH)
      • Any necessary preparatory steps (e.g., loading modules)
      • Your srun command
    2. Submit the script using sbatch:
      sbatch myscript.sh
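    The steps above can be sketched as a minimal batch script; the partition, time limit, and program name (my_analysis.R) are placeholders for your own values:

```bash
#!/bin/bash
#SBATCH --job-name=example    # name shown in squeue
#SBATCH -p normal_prio        # partition to use (placeholder)
#SBATCH --time=00:10:00       # requested time limit (placeholder)

# Preparatory steps, e.g. loading modules
module load R/4.4.0

# The actual command, run as a task inside the allocation
srun Rscript my_analysis.R
```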

    Using Conda

    You can use Conda inside your batch script:

    # Load Conda
    module load anaconda3

    # Activate your environment
    conda activate myenv

    # Run your Python script
    python my_script.py
    Monitoring Jobs

    Checking Job Status

    Use squeue to see which jobs are running or queued:

    squeue

    To see only your jobs:

    squeue -u yourusername

    Viewing Job Details

    Use scontrol:

    scontrol show job <jobid>

    Checking Job Output

    Slurm captures console output to a file named slurm-<jobid>.out in the submission directory. You can examine this file while the job is running or after it finishes.
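    For example, assuming a job with ID 12345:

```bash
tail -f slurm-12345.out   # follow the output while the job is running
less slurm-12345.out      # page through the output after the job has finished
```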

    Resource Requests

    CPUs

    To request multiple CPU threads:

    #SBATCH --cpus-per-task=X
    srun --cpus-per-task=X myprogram

    Note: This argument must be given both to sbatch (via #SBATCH) and to srun: the first sets the job allocation, the second applies it to the task execution.

    Other Resources

    Specify in your batch script using #SBATCH directives:

    
    #SBATCH --mem=8G
    #SBATCH --time=02:00:00
    #SBATCH --gres=gpu:1

    These directives set the memory limit, the time limit, and the number of GPUs, respectively.

    Example batch script

    #!/bin/bash
    #SBATCH --job-name=conda_job
    #SBATCH --output=output_%j.log
    #SBATCH --error=error_%j.log
    #SBATCH --time=01:00:00
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G

    # Load Conda
    module load anaconda3

    # Activate your environment
    conda activate myenv

    # Run your Python script
    python my_script.py

    This example script uses Conda; launch it with:

    sbatch conda_job.sh

    Useful Slurm Commands
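    A quick reference of the commands used in this FAQ, plus scancel for cancelling jobs (all standard Slurm commands):

```bash
sinfo                      # list partitions and node states
squeue -u yourusername     # show your queued and running jobs
srun --pty bash            # start an interactive session
sbatch myscript.sh         # submit a batch script
scontrol show job <jobid>  # show details of a job
scancel <jobid>            # cancel a queued or running job
```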

    Best Practice

    Please:

  • Use resources carefully. Test your requirements in an interactive srun session before launching sbatch scripts.
  • Don't submit many jobs at once; be considerate of other users.
  • Use the appropriate partition for your job.
  • Send an email to the sysadmin if you have any doubts.