Slurm User Guide

Basic Guide to Using Slurm

Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduling system for Linux clusters. This guide will walk you through the basics of using Slurm to submit, manage, and monitor jobs on a cluster.

Table of Contents

  • Understanding Slurm Concepts
  • Basic Slurm Commands
  • Submitting Jobs
  • Monitoring Jobs
  • Managing Jobs
  • Resource Constraints
  • Job Dependencies
  • Best Practices

Understanding Slurm Concepts

Before diving into Slurm usage, it's important to understand some key concepts:

  • Node: A computer in the cluster.
  • Partition: A group of nodes with specific characteristics.
  • Job: A resource allocation request for a specific program or task.
  • Task: An instance of a running program within a job.

Basic Slurm Commands

Here are some essential Slurm commands you'll use frequently:

  • srun: Run a parallel job
  • sbatch: Submit a batch script
  • scancel: Cancel a job
  • squeue: View information about jobs in the queue
  • sinfo: View information about Slurm nodes and partitions
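
A quick way to get oriented on a new cluster is to combine these commands; for example (partitions, node names, and time limits will differ on your system):

    # List partitions and the state of their nodes
    sinfo

    # Show your own pending and running jobs
    squeue -u $USER

    # Run a single task interactively on the cluster
    srun --ntasks=1 --time=00:05:00 hostname
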
Submitting Jobs

For batch jobs, create a submission script and use `sbatch`:

1. Create a script (e.g., `job_script.sh`):

    
    #!/bin/bash  
    #SBATCH --job-name=my_job 
    #SBATCH --output=output_%j.log
    #SBATCH --error=error_%j.log
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00
    #SBATCH --mem=1G
    
    # Your commands here
    echo "Hello, Slurm!"
        

2. Submit the job:

    sbatch job_script.sh
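
If the submission succeeds, `sbatch` prints the ID assigned to the new job, and the `%j` placeholder in the output and error file names expands to that ID (the ID below is illustrative):

    $ sbatch job_script.sh
    Submitted batch job 12345

Options given on the command line override the matching `#SBATCH` directives in the script, e.g. `sbatch --time=02:00:00 job_script.sh`.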

Monitoring Jobs

To view information about your jobs in the queue:

    squeue -u $USER

To see detailed information about a specific job:

    scontrol show job job_id
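
For reference, the default `squeue` listing looks roughly like this (the job, user, and node names are made up; ST is the job state, e.g. R for running, PD for pending):

    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    12345     debug   my_job    alice  R       5:23      1 node001
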
Managing Jobs

To cancel a job:

    scancel job_id

To hold a job:

    scontrol hold job_id

To release a held job:

    scontrol release job_id
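
`scancel` also accepts filters, which helps when cleaning up several jobs at once; for example:

    # Cancel all of your own jobs
    scancel -u $USER

    # Cancel every job with a given name
    scancel --name=my_job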

Resource Constraints

Specify resource requirements in your job script:

    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --gres=gpu:2
        
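These directives only reserve resources; your program still has to use them. Below is a minimal sketch of a threaded job that reads its CPU allocation from the environment Slurm provides (`my_threaded_app` is a placeholder for your own program):

    #!/bin/bash
    #SBATCH --job-name=threaded_job
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=01:00:00

    # SLURM_CPUS_PER_TASK mirrors the --cpus-per-task request
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_threaded_app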

Job Dependencies

Create dependencies between jobs:

    sbatch --dependency=afterok:job_id next_job.sh

Here `afterok` means the new job starts only if the job with ID job_id finishes successfully; other conditions such as `afterany` (start once it ends, regardless of exit status) are also available.
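
To chain several jobs, you can capture each job ID with `--parsable` (which makes `sbatch` print just the ID) and pass it to the next submission; the script names below are placeholders:

    # Submit a three-stage pipeline in which each stage
    # runs only if the previous stage succeeded
    jid1=$(sbatch --parsable preprocess.sh)
    jid2=$(sbatch --parsable --dependency=afterok:$jid1 analyze.sh)
    sbatch --dependency=afterok:$jid2 postprocess.sh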

Best Practices

1. Estimate resources accurately: Request only the resources you need to avoid long queue times (see the accounting example after this list for how to check actual usage).
2. Set appropriate time limits: This helps the scheduler plan more effectively.
3. Use job names: Give your jobs meaningful names for easier management.
4. Monitor your jobs: Regularly check the status of your jobs and cancel them if they're not behaving as expected.
5. Use appropriate partitions: Choose the right partition based on your job's requirements.
6. Optimize your code: Well-optimized code can reduce resource usage and improve job throughput.
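
A good way to calibrate requests (practices 1 and 4) is to compare what a finished job actually used with what you asked for. Where job accounting is enabled, `sacct` can report this; the job ID is illustrative:

    # Show elapsed time, peak memory, and final state for a job
    sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State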