Batch Jobs

Batch Job Submission

The usual way to use the computing power of bwGRiD is to submit batch jobs. You write a job control file, a shell script containing all necessary Linux commands, and submit it to the job control system, which decides when and on which nodes the job will run. The bwGRiD project uses the Portable Batch System (PBS) in its Torque implementation, together with the job scheduler Moab. The command for queuing a batch job is

qsub job-control-file

Example for a job control file:

#!/bin/sh
# The first line sets the shell. Instead of the default shell sh, which is
# identical to the Bourne Again Shell bash, you can choose another one: ksh, tcsh, zsh.
# Even /usr/bin/perl is allowed. 

# Lines starting with #PBS are options for the qsub command. 

# Number of nodes and number of processors per node (cores per node)  
#PBS -l nodes=4:ppn=8 

# Walltime limit for the job
#PBS -l walltime=1:30:00

# Output of some information:
echo Running on host `hostname` 
echo This job runs on the following processors: 
echo `cat $PBS_NODEFILE` 

# Change to the directory where the qsub command was executed 
cd $PBS_O_WORKDIR 
echo Directory is `pwd` 

# Load software modules 

# Start your application 


In the example, 32 cores (4 nodes, nodes=4, with 8 cores each, ppn=8) are requested for 1 hour and 30 minutes (walltime=1:30:00). Choose the walltime with care. If it is too short, the job may be aborted before completion. If it is too long, the scheduler must reserve a larger time frame for your job, which may lead to a long waiting time in the queue, depending on the utilization of the system.
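
One way to arrive at a walltime value is to measure a test run and add a safety margin. The following is a minimal sketch of that arithmetic. The function name walltime_with_margin and the default margin of 20 percent are assumptions chosen for illustration, not bwGRiD recommendations.

```shell
# Convert an estimated runtime in seconds into an H:MM:SS walltime string.
# The second argument is a safety margin in percent (assumed default: 20).
walltime_with_margin() {
    runtime_s=$1
    margin_pct=${2:-20}
    total_s=$(( runtime_s + runtime_s * margin_pct / 100 ))
    printf '%d:%02d:%02d\n' \
        $(( total_s / 3600 )) $(( total_s % 3600 / 60 )) $(( total_s % 60 ))
}

# A test run of 75 minutes (4500 s) plus a 20 percent margin:
walltime_with_margin 4500   # prints 1:30:00
```

The result can be used as the value of the walltime option in the job control file.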

When a job starts, the host names of the assigned nodes are written to a file, one line per allocated core. The name of this file is stored in the variable PBS_NODEFILE. The command cat $PBS_NODEFILE shows the content of this file. If you ask for more than one core per node, each node is listed once per requested core. For example, if you request ppn=8, each node is listed 8 times. Some programs use the information in PBS_NODEFILE to start processes on the allocated cores. The command hostname displays the name of the first node (head node).
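
As a small illustration of the node file layout, the following sketch counts the allocated cores (lines) and the distinct nodes. The helper names count_cores and count_nodes and the node names nodeA and nodeB are hypothetical. Outside a job, where PBS_NODEFILE is not set, the script builds a stand-in file so that it can be tried on any machine.

```shell
# Count allocated cores (lines) and distinct nodes in a PBS node file.
count_cores() { wc -l < "$1" | tr -d ' '; }
count_nodes() { sort -u "$1" | wc -l | tr -d ' '; }

# Inside a job, PBS_NODEFILE is set by PBS. Outside a job, this demo
# builds a stand-in file with two hypothetical node names, 8 lines each.
if [ -n "${PBS_NODEFILE:-}" ]; then
    nodefile=$PBS_NODEFILE
    cleanup=no
else
    nodefile=$(mktemp)
    cleanup=yes
    for node in nodeA nodeB; do
        i=0
        while [ "$i" -lt 8 ]; do
            echo "$node"
            i=$((i + 1))
        done
    done > "$nodefile"
fi

echo "cores: $(count_cores "$nodefile"), nodes: $(count_nodes "$nodefile")"

# Remove the stand-in file if this script created it.
if [ "$cleanup" = yes ]; then
    rm -f "$nodefile"
fi
```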

The nodes are always allocated exclusively, no matter how many cores per node you request with ppn. You are the only user of the assigned nodes and you are responsible for using the available 8 cores per node efficiently. Using fewer than 8 cores per node is reasonable if the memory requirement of 8 processes exceeds the main memory of a node. When estimating the memory requirement of your job, you should not plan for more than 14 GB per node, since the operating system needs some of the available 16 GB. How to employ all 8 cores with scalar programs is described in the section "Many Scalar Tasks in One Batch Job" below.
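
The memory estimate can be turned into a number of processes per node with simple integer arithmetic. The sketch below assumes that the 14 GB correspond to 14 × 1024 = 14336 MB and that the memory need per process is known in MB. The function name procs_per_node is hypothetical.

```shell
# How many processes fit on one node, given the memory need per process?
# $1: memory per process in MB
# $2: usable memory per node in MB (assumed default: 14336, i.e. 14 GB)
# $3: cores per node (default: 8)
procs_per_node() {
    mem_per_proc_mb=$1
    usable_mb=${2:-14336}
    cores=${3:-8}
    n=$(( usable_mb / mem_per_proc_mb ))
    if [ "$n" -gt "$cores" ]; then
        n=$cores        # never start more processes than there are cores
    fi
    echo "$n"
}

procs_per_node 3000   # prints 4: only four 3000 MB processes fit into 14 GB
procs_per_node 1000   # prints 8: limited by the number of cores
```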

Another useful variable is PBS_O_WORKDIR. This variable stores the path to the directory from which you executed the qsub command. A job script always starts in the home directory. With cd $PBS_O_WORKDIR you can easily change to your working directory.

When your job is completed, you will find two output files in your working directory. If the job number is 1234, for instance, you will receive the file job-control-file.o1234 for standard output and the file job-control-file.e1234 for standard error.
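
The naming scheme makes it easy to check the result of a job from a script. The following sketch builds both file names from the script name and the job number and tests whether anything was written to standard error. The helper names stdout_file and stderr_file are hypothetical.

```shell
# Build the names of the two output files of a job.
# $1: name of the job control file, $2: job number
stdout_file() { echo "$1.o$2"; }
stderr_file() { echo "$1.e$2"; }

errfile=$(stderr_file job-control-file 1234)
# -s is true only if the file exists and is not empty.
if [ -s "$errfile" ]; then
    echo "The job wrote messages to standard error, see $errfile"
else
    echo "No error output found in $errfile"
fi
```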

More information on PBS variables and PBS options is available with the command man qsub.

Batch Job Status

Status information about your jobs is shown with the command

qstat

The most important status abbreviations are

 

Q Queued (waiting)
T Transfer (on the way to run)
R Running
C Complete (finished or canceled)
H Hold (by PBS command qhold)
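
If you process job states in your own scripts, the abbreviations above can be translated with a case statement. This is only a sketch. The function name describe_state is hypothetical, and how the status letter is obtained from the qstat output is left open here.

```shell
# Translate a status abbreviation into a short description.
describe_state() {
    case $1 in
        Q) echo "queued (waiting)" ;;
        T) echo "transfer (on the way to run)" ;;
        R) echo "running" ;;
        C) echo "complete (finished or canceled)" ;;
        H) echo "hold" ;;
        *) echo "unknown state: $1" ;;
    esac
}

describe_state R   # prints running
```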

 

The command

showq

displays information about the jobs of all users.

Node Status

Information about the status of all nodes is obtained with the command

pbsnodes

The number of available nodes can be inquired by

freenodes

To learn which nodes are used by your running jobs type

qstat -n

The name pattern of the nodes is described here.

Batch Classes

Currently, there are two batch classes, single and normal. Jobs are automatically assigned to a class according to the requested walltime and number of nodes. The class single is for jobs that need only a single node. The walltime limit for this class is 120 h. The class normal accepts jobs with up to 64 nodes and a walltime of up to 48 h.
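
The limits above can be written down as a small decision rule, for example to check a request before submitting it. The sketch below only mirrors the limits stated in this section. The function name batch_class is hypothetical, the walltime is given in whole hours, and the actual assignment is made by the batch system and may follow further rules.

```shell
# Which class would accept a request, according to the limits stated above?
# $1: number of nodes, $2: walltime in whole hours
batch_class() {
    nodes=$1
    hours=$2
    if [ "$nodes" -eq 1 ] && [ "$hours" -le 120 ]; then
        echo single
    elif [ "$nodes" -ge 1 ] && [ "$nodes" -le 64 ] && [ "$hours" -le 48 ]; then
        echo normal
    else
        echo none    # request exceeds the limits of both classes
    fi
}

batch_class 1 100   # prints single
batch_class 4 2     # prints normal
batch_class 4 72    # prints none
```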

To get a list of all existing batch classes type

qstat -q

The class batch is used by PBS for new jobs before they are assigned to an appropriate class. The other classes are reserved for tests and special user groups.

Deleting Batch Jobs

You can delete your own queued and running jobs with

qdel job_number

The job number can be taken from the output of the qstat command (column 'Job id'). For example, if the job-id is 414.intern1, the job number is 414.
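
In a script, the job number can be cut out of the job id with a shell parameter expansion that removes everything from the first dot onward. The function name job_number is hypothetical.

```shell
# Strip the server suffix from a job id, e.g. 414.intern1 -> 414
job_number() {
    echo "${1%%.*}"
}

job_number 414.intern1   # prints 414
```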

Many Scalar Tasks in One Batch Job

You can run several scalar tasks at the same time within a single job. With pbsdsh you can distribute your tasks to nodes and cores. pbsdsh executes (spawns) a Unix/Linux program on one or more cores under the control of PBS. Here is an example of a job on 16 cores.

#!/bin/sh 
#PBS -l nodes=2:ppn=8 
pbsdsh $PBS_O_WORKDIR/myscript.sh

Since the same shell script myscript.sh is executed on each core, the script has to determine its own role. Unless all processes are meant to do the same work, the cores or processes must be distinguished. The environment variable PBS_VNODENUM serves this purpose. For n requested cores, it numbers the cores and takes a value from 0 to n-1. You can use PBS_VNODENUM

  • to pass it to the same program as an argument,
  • to start a different program in each process or
  • to read different input files.

The following three examples show the shell script myscript.sh for each of these three cases.

Example: Pass PBS_VNODENUM as an Argument

#!/bin/sh 
cd $PBS_O_WORKDIR 
PATH=$PBS_O_PATH 
./myprogram $PBS_VNODENUM

Setting the current directory and the environment variable PATH is necessary since only a very basic environment is defined by default.

Example: Start Different Programs

#!/bin/sh 
cd $PBS_O_WORKDIR 
PATH=$PBS_O_PATH 
./myprogram.$PBS_VNODENUM

Example: Read Different Input Files

#!/bin/sh 
cd $PBS_O_WORKDIR 
PATH=$PBS_O_PATH 
./myprogram < mydata.$PBS_VNODENUM
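
A variation of the last case, shown here only as a sketch: instead of numbering the input files, the script can look up its input in a task list with one entry per line and use PBS_VNODENUM as a zero-based line index. The function name task_for and the file contents are hypothetical. Outside a job, where PBS_VNODENUM is not set, the demo falls back to the value 1.

```shell
# Print the entry for one process from a task list (one entry per line).
# $1: task list file, $2: value of PBS_VNODENUM (counted from 0)
task_for() {
    sed -n "$(( $2 + 1 ))p" "$1"
}

# Demo with a stand-in task list. In a real job the list would be a file
# in $PBS_O_WORKDIR and PBS_VNODENUM would be set for each spawned process.
tasks=$(mktemp)
printf '%s\n' 'alpha.dat' 'beta.dat' 'gamma.dat' > "$tasks"
vnode=${PBS_VNODENUM:-1}
echo "process $vnode handles: $(task_for "$tasks" "$vnode")"
rm -f "$tasks"
```

With this layout, adding a task only requires a new line in the list, and a process whose number exceeds the length of the list receives an empty entry.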

 

responsible: Sabine Richling
Latest Revision: 2011-12-19