Using the Module System in Slurm Jobs
The module system is a software environment management tool widely used in HPC environments. It allows users to dynamically modify their shell environment to access different software packages and versions. Here's how to effectively use the module system in your Slurm jobs:
1. Basic Module Commands
Before diving into Slurm-specific usage, let's review some basic module commands (a short example session follows the list):
- `module avail`: List all available modules
- `module list`: Show currently loaded modules
- `module load <module_name>`: Load a specific module
- `module unload <module_name>`: Unload a specific module
- `module purge`: Unload all currently loaded modules
- `module show <module_name>`: Display information about a module
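A typical interactive session might look like the following; the module names and versions are placeholders, so substitute ones from your site's `module avail` output:

```bash
# See what the site provides (output format varies between sites)
module avail

# Load a compiler and confirm it is active
module load gcc/9.3.0   # placeholder version
module list

# Inspect what a module changes before relying on it
module show gcc/9.3.0

# Return to a clean environment
module purge
```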
2. Using Modules in Slurm Job Scripts
Here's an example of how to use modules in a Slurm job script:
```bash
#!/bin/bash
#SBATCH --job-name=module_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Purge all loaded modules
module purge

# Load required modules
module load gcc/9.3.0
module load python/3.8.5
module load openmpi/4.0.4

# Your job commands here
python my_script.py
```
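Save the script (for example as module_job.sh; the filename is arbitrary) and submit it to the scheduler:

```bash
sbatch module_job.sh
```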
3. Loading Software Stacks
Sometimes you might need to load a complete software stack. Many HPC systems provide meta-modules for this purpose, such as the EasyBuild `foss` toolchain, which bundles GCC, OpenMPI, and common math libraries:
```bash
#!/bin/bash
#SBATCH --job-name=stack_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load a complete software stack
module load foss/2020a

# Load additional modules as needed
module load python/3.8.5

# Your job commands here
```
4. Module Dependencies
Some modules may have dependencies or conflicts. The module system often handles these automatically, but it's good to be aware of them:
```bash
#!/bin/bash
#SBATCH --job-name=dep_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load a module with dependencies
module load tensorflow/2.4.1-cuda11.0-python3
# The above might automatically load CUDA, cuDNN, and Python modules

# Your job commands here
python my_tensorflow_script.py
```
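To see exactly which dependencies were pulled in, compare the loaded modules before and after; this uses the same site-specific tensorflow module as the script above:

```bash
module purge
module load tensorflow/2.4.1-cuda11.0-python3

# Anything listed here besides tensorflow itself was loaded as a dependency
module list
```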
5. Using Module Collections
If you frequently use the same set of modules, you can create a module collection:
First, interactively save the modules you currently have loaded:

```bash
# Save the currently loaded modules as a named collection
module save my_collection
```

Then restore it in your Slurm script:

```bash
#!/bin/bash
#SBATCH --job-name=collection_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load your module collection
module restore my_collection

# Your job commands here
```
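Both Lmod and Environment Modules can list the collections you have saved (subcommand support varies slightly between implementations):

```bash
# List your saved module collections
module savelist
```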
6. Best Practices for Using Modules with Slurm
- Purge before loading: Start your script with `module purge` to ensure a clean environment.
- Be specific: Use full module names including versions to ensure reproducibility.
- Check for conflicts: Use `module show` to check for potential conflicts before loading modules.
- Use module collections: For complex environments, create and use module collections.
- Document your modules: Comment your Slurm script to explain why each module is needed.
- Use module load in job scripts: Don't rely on modules loaded in your login environment; explicitly load them in your job script (a sketch applying these practices follows this list).
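A minimal job script combining these practices might look like this; the module names, versions, and script name are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=best_practice_job
#SBATCH --output=output_%j.log
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --mem=2G

# Start clean so modules from the login shell can't leak in
module purge

# Pin exact versions for reproducibility, and say why each module is needed
module load gcc/9.3.0      # compiler matching the prebuilt libraries
module load python/3.8.5   # interpreter for the analysis script

# Record the active environment in the job log for later debugging
module list

python my_script.py
```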
7. Troubleshooting Module Issues in Slurm Jobs
If you encounter module-related issues:
- Check your Slurm output and error logs for module-related errors.
- Ensure the modules you're trying to load are available on the compute nodes; they can differ from those on the login nodes (see the snippet after this list).
- Use `module show <module_name>` to verify module details and dependencies.
- If a module isn't found, check whether you need to load a specific compiler or MPI implementation first.
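One way to check what the compute nodes actually see is to run the module commands through Slurm itself. The sketch below assumes your shell initialization defines the `module` function in login shells (as it does on most sites) and uses a placeholder partition name; note that `module avail` typically writes to stderr, hence the redirect:

```bash
# List python-related modules as seen from a compute node
srun --partition=compute --time=00:05:00 bash -lc 'module avail 2>&1 | grep -i python'
```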
8. Advanced Module Usage
Some advanced module features:
- Module versioning: `module load <module_name>/<version>` pins an exact version; omitting the version loads the site-defined default.
- Swapping modules: `module swap <old_module> <new_module>` replaces one loaded module with another.
- Module aliases: sites can define short aliases for modules in their modulefiles; on systems running Environment Modules, `module aliases` lists them.
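For example, swapping lets you switch compiler versions in one step instead of unloading and reloading; the versions here are placeholders:

```bash
module load gcc/9.3.0

# Replace the loaded gcc with a newer one; dependent modules may be reloaded
module swap gcc/9.3.0 gcc/10.2.0
```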