Slurm User Guide

Basic Guide to Using Slurm

Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduling system for Linux clusters. This guide will walk you through the basics of using Slurm to submit, manage, and monitor jobs on a cluster.

Table of Contents

  • Understanding Slurm Concepts
  • Basic Slurm Commands
  • Submitting Jobs
  • Monitoring Jobs
  • Managing Jobs
  • Resource Constraints
  • Job Dependencies
  • Best Practices

Understanding Slurm Concepts

Before diving into Slurm usage, it's important to understand some key concepts:

  • Node: A computer in the cluster.
  • Partition: A group of nodes with specific characteristics.
  • Job: A resource allocation request for a specific program or task.
  • Task: An instance of a running program within a job.

Basic Slurm Commands

Here are some essential Slurm commands you'll use frequently:

  • srun: Run a parallel job
  • sbatch: Submit a batch script
  • scancel: Cancel a job
  • squeue: View information about jobs in the queue
  • sinfo: View information about Slurm nodes and partitions
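
A quick way to get oriented on a new cluster is to combine these commands; for example (partitions, node names, and time limits will differ on your system):

    # List partitions and the state of their nodes
    sinfo

    # Show your own pending and running jobs
    squeue -u $USER

    # Run a single task interactively on the cluster
    srun --ntasks=1 --time=00:05:00 hostname
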
Submitting Jobs

For batch jobs, create a submission script and use `sbatch`:

1. Create a script (e.g., `job_script.sh`):

    
    #!/bin/bash  
    #SBATCH --job-name=my_job 
    #SBATCH --output=output_%j.log
    #SBATCH --error=error_%j.log
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00
    #SBATCH --mem=1G
    
    # Your commands here
    echo "Hello, Slurm!"
        

2. Submit the job:

    sbatch job_script.sh
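
If the submission succeeds, `sbatch` prints the ID assigned to the new job, and the `%j` placeholder in the output and error file names expands to that ID (the ID below is illustrative):

    $ sbatch job_script.sh
    Submitted batch job 12345

Options given on the command line override the matching `#SBATCH` directives in the script, e.g. `sbatch --time=02:00:00 job_script.sh`.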

Monitoring Jobs

To view information about your jobs in the queue:

    squeue -u $USER

To see detailed information about a specific job:

    scontrol show job job_id
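
For reference, the default `squeue` listing looks roughly like this (the job, user, and node names are made up; ST is the job state, e.g. R for running, PD for pending):

    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    12345     debug   my_job    alice  R       5:23      1 node001
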
Managing Jobs

To cancel a job:

    scancel job_id

To hold a job:

    scontrol hold job_id

To release a held job:

    scontrol release job_id
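
`scancel` also accepts filters, which helps when cleaning up several jobs at once; for example:

    # Cancel all of your own jobs
    scancel -u $USER

    # Cancel every job with a given name
    scancel --name=my_job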

Resource Constraints

Specify resource requirements in your job script:

    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --gres=gpu:2
        
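These directives only reserve resources; your program still has to use them. Below is a minimal sketch of a threaded job that reads its CPU allocation from the environment Slurm provides (`my_threaded_app` is a placeholder for your own program):

    #!/bin/bash
    #SBATCH --job-name=threaded_job
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=01:00:00

    # SLURM_CPUS_PER_TASK mirrors the --cpus-per-task request
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_threaded_app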

Job Dependencies

Create dependencies between jobs:

    sbatch --dependency=afterok:job_id next_job.sh

Here `afterok` means the new job starts only if the job with ID job_id finishes successfully; other conditions such as `afterany` (start once it ends, regardless of exit status) are also available.
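
To chain several jobs, you can capture each job ID with `--parsable` (which makes `sbatch` print just the ID) and pass it to the next submission; the script names below are placeholders:

    # Submit a three-stage pipeline in which each stage
    # runs only if the previous stage succeeded
    jid1=$(sbatch --parsable preprocess.sh)
    jid2=$(sbatch --parsable --dependency=afterok:$jid1 analyze.sh)
    sbatch --dependency=afterok:$jid2 postprocess.sh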

Best Practices

1. Estimate resources accurately: Request only the resources you need to avoid long queue times (see the accounting example after this list for how to check actual usage).
2. Set appropriate time limits: This helps the scheduler plan more effectively.
3. Use job names: Give your jobs meaningful names for easier management.
4. Monitor your jobs: Regularly check the status of your jobs and cancel them if they're not behaving as expected.
5. Use appropriate partitions: Choose the right partition based on your job's requirements.
6. Optimize your code: Well-optimized code can reduce resource usage and improve job throughput.
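
A good way to calibrate requests (practices 1 and 4) is to compare what a finished job actually used with what you asked for. Where job accounting is enabled, `sacct` can report this; the job ID is illustrative:

    # Show elapsed time, peak memory, and final state for a job
    sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State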