Chapter 2 Submitting Batch Jobs
Programs that take a very long time to complete are impractical to run interactively, whether on your local machine or on a cluster. Instead, you can submit them as batch jobs to a High-Performance Computing (HPC) cluster. This tutorial walks you through creating a SLURM job script and submitting it as a batch job with the sbatch command.
- Create a SLURM Job Script:
A SLURM job script is a Bash script that contains directives for the SLURM workload manager. These directives specify resources such as the number of nodes, CPU cores, memory, job duration, and more. Below is a sample SLURM job script:
YourFileName.slurm:
#!/bin/bash
#SBATCH -N 3 # Requests 3 nodes for the job
#SBATCH -c 24 # Requests 24 CPU cores per task
#SBATCH --mem-per-cpu=128G # Allocates 128 GB of memory per CPU core
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --output=my.stdout # Directs the standard output to a file named "my.stdout"
#SBATCH --error=my.stderr # Directs the standard error to a file named "my.stderr"
#SBATCH --mail-user=abac123@case.edu # Specifies the email address to receive job notifications.
#SBATCH --mail-type=ALL # Sends email notifications for all events (job start, end, fail, etc.)
#SBATCH --job-name="just_a_test" # Names the job "just_a_test"
# Put commands for executing job below this line
# example:
module load Python
python --version
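To see how much memory the sample directives add up to, the arithmetic can be sketched in Bash. Because --mem-per-cpu is multiplied by the number of CPU cores, the request grows quickly. The helper name total_mem_gb is hypothetical, and the second call assumes SLURM starts one task per node when only -N is given, which may differ on your cluster.

```shell
#!/bin/bash
# Hypothetical helper: total memory in GB implied by --mem-per-cpu.
# Arguments: memory per CPU (GB), CPUs per task, number of tasks.
total_mem_gb() {
    local mem_per_cpu_gb=$1 cpus_per_task=$2 ntasks=$3
    echo $(( mem_per_cpu_gb * cpus_per_task * ntasks ))
}

# Memory for a single task with -c 24 and --mem-per-cpu=128G
total_mem_gb 128 24 1

# Memory across 3 tasks, assuming one task on each of the 3 nodes
total_mem_gb 128 24 3
```

If the total exceeds what the cluster's nodes provide, the job may be rejected or left pending, so adjust the values to match what your program actually needs.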
- Save the Job Script:
Save the script with a .slurm extension. For example, save it as YourFileName.slurm.
- Access the HPC Cluster:
Connect to the HPC cluster using cluster/_pioneer Shell Access.
- Navigate to the Directory Containing Your SLURM Script:
Use the cd command to navigate to the directory where you saved YourFileName.slurm.
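Before submitting, it can help to confirm that the script is really in the current directory and parses as valid Bash. The following is a minimal sketch; the function name check_job_script is hypothetical, and bash -n only checks shell syntax, not whether the #SBATCH directives are valid.

```shell
#!/bin/bash
# Hypothetical helper: check that a job script exists, starts with a
# Bash shebang, and passes a syntax-only parse (bash -n runs nothing).
check_job_script() {
    local script="$1"
    [ -f "$script" ] || { echo "missing: $script"; return 1; }
    head -n 1 "$script" | grep -q '^#!/bin/bash' \
        || { echo "no bash shebang: $script"; return 1; }
    bash -n "$script" || { echo "syntax error: $script"; return 1; }
    echo "ok: $script"
}
```

After changing into the directory with cd, running check_job_script YourFileName.slurm should print "ok: YourFileName.slurm" if the file is in place.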
- Submit the SLURM Job Script:
Use the sbatch command to submit your job script to the SLURM scheduler:
sbatch YourFileName.slurm
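On success, sbatch normally prints a confirmation line of the form "Submitted batch job <id>". The sketch below submits the script and keeps the job ID for later use; the helper name extract_job_id is hypothetical, and the submission is skipped when sbatch or the script is not available.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's
# confirmation line ("Submitted batch job <id>") read from stdin.
extract_job_id() {
    awk '/^Submitted batch job/ {print $4}'
}

# Submit only where sbatch exists (on the cluster) and the script is present.
if command -v sbatch >/dev/null 2>&1 && [ -f YourFileName.slurm ]; then
    job_id=$(sbatch YourFileName.slurm | extract_job_id)
    echo "Job ID: ${job_id}"
fi
```

Keeping the job ID around makes it easier to look up that specific job later.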
- Monitor the Job:
You can check the progress of the job in the Job/Active Jobs section.
- Check Job Output:
Once the job completes, check the output file (my.stdout in this example) for the results of your job. If the job failed, check the my.stderr file for the reason.
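The same checks can also be made from the command line. The sketch below lists your jobs with squeue when the SLURM client tools are installed, and defines a hypothetical helper, show_job_output, that prints the two files named in the sample script. The default file names are taken from that script; change them if your --output and --error directives differ.

```shell
#!/bin/bash
# List your queued and running jobs, if squeue is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}"
fi

# Hypothetical helper: print the error file if it is non-empty,
# then the output file if it exists.
show_job_output() {
    local out="${1:-my.stdout}" err="${2:-my.stderr}"
    if [ -s "$err" ]; then
        echo "Errors reported in $err:"
        cat "$err"
    fi
    if [ -f "$out" ]; then
        echo "Output from $out:"
        cat "$out"
    else
        echo "No output file $out yet; the job may still be running."
    fi
}
```

Calling show_job_output with no arguments from the directory where the job was submitted reads my.stdout and my.stderr.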