Creating a local queuing system on Linux Debian using Torque

2015-03-03

Introduction

Imagine you would like to perform a large series of calculations. Obviously, you would not run the complete series at the same time. Instead, you would like to start as many jobs as the processors in your computer can handle and start the next job in the series as soon as a previous job has finished. You could write your own program for this, but many such programs are already available. In this blog post, I will show you how to install and use Torque. Although Torque can be set up to run on a computer cluster, I will show you how to install and use it on just a single machine. This tutorial is written for Linux Debian, but should in principle also work for Linux Ubuntu and, with (hopefully) small modifications, for other distributions.

Compiling Torque

Download the tarball from the website. Extract, compile and install it on your machine.


wget "http://www.adaptivecomputing.com/index.php?wpfb_dl=2868" -O torque-5.1.0.tar.gz
tar -xzf torque-5.1.0.tar.gz
cd torque-5.1.0
./configure --prefix=/opt/gcc-4.7.2/torque-5.1.0
make -j5
sudo make install

Installing the daemons

Torque uses four daemons that have to be started at boot time. From the source directory, copy their init.d scripts to the /etc/init.d folder like so


sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched

and add these to the boot procedure


sudo update-rc.d pbs_mom defaults
sudo update-rc.d pbs_server defaults
sudo update-rc.d pbs_sched defaults
sudo update-rc.d trqauthd defaults

Configuring the server and the nodes

Next, we set up our machine as both the server and the (only) node.

Start the trqauthd daemon.


sudo /etc/init.d/trqauthd start

Become root, add the binary folders of Torque to the PATH, and run the torque.setup script from the source directory.


su
export PATH=/opt/gcc-4.7.2/torque-5.1.0/sbin:/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH
./torque.setup root

If you get an error like the following


qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host:

check your /etc/hosts file and adjust the entries so that the hostname of your machine is in the second column, directly after its IP address. Torque only reads the first two columns to match the hostname with an IP address.
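As a rough illustration, the sketch below tests whether a hosts file lists a given name in its second column. The function name check_hosts_entry is made up for this example, and the rule it checks is simply the behaviour described above, not something taken from the Torque sources.

```shell
#!/bin/bash
# Hypothetical helper: succeed if NAME appears in the second column
# of a non-comment line in the given hosts file.
check_hosts_entry() {
    local hosts_file="$1" name="$2"
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }          # skip comment lines
        $2 == name       { found = 1 }     # IP in column 1, name in column 2
        END              { exit !found }   # exit status 0 only when found
    ' "$hosts_file"
}

# Check the real file for the name of this machine.
if check_hosts_entry /etc/hosts "$(hostname)"; then
    echo "hosts entry looks fine"
else
    echo "no line with $(hostname) in the second column"
fi
```

A line of the form "IP address, then hostname" passes the check; a line where the hostname only appears as a later alias does not.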

Add your machine to the list of nodes by editing /var/spool/torque/server_priv/nodes. You have to specify the number of cores in your machine after the np= directive.


localhost np=6
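If you are unsure about the number of cores, the nproc tool from coreutils can report it. The sketch below only prints a candidate line; it assumes the node is registered as localhost, as in the example above, and it counts logical processors, which may be more than the number of physical cores.

```shell
#!/bin/bash
# Print a nodes-file line using the processor count reported by nproc.
# Assumption: the node is called "localhost", as in the example above.
cores=$(nproc)
node_line="localhost np=${cores}"
echo "$node_line"
```

After checking the printed line, copy it into /var/spool/torque/server_priv/nodes as root.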

Edit /var/spool/torque/mom_priv/config and set your machine as the $pbsserver. Also set $logevent, the bitmask that selects which event types are logged.


$pbsserver      ST-A1771
$logevent       225

Please note that in the above file, ST-A1771 should be replaced by the name of your local machine. Moreover, this name must resolve to an IP address, which you can configure in /etc/hosts. (Thanks to danielmejia55_at_gmail_dot_com for mentioning this, see comments below.)
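To avoid typing the machine name by hand, the two lines can be generated. This is a minimal sketch: mom_config is a hypothetical helper, and it assumes that the name printed by the hostname command is the one listed in /etc/hosts.

```shell
#!/bin/bash
# Hypothetical helper: print a mom_priv/config for a single-machine setup.
# The single quotes keep $pbsserver and $logevent literal.
mom_config() {
    local server_name="$1"
    printf '$pbsserver      %s\n' "$server_name"
    printf '$logevent       225\n'
}

# Assumption: `hostname` returns the name that /etc/hosts maps to an IP.
mom_config "$(hostname)"
```

Once the output looks right, redirect it into /var/spool/torque/mom_priv/config as root.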

Start pbs_mom.


/etc/init.d/pbs_mom start

Every user who wants to submit jobs to the queuing system and check the current status of the queue needs the bin folder of Torque in their $PATH variable. As such, add the Torque binaries folder to the PATH in /etc/profile


echo 'export PATH=/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH' >> /etc/profile

Finally, log out of the root shell (CTRL+D).

Checking the installation

To check that everything is correctly configured, run


pbsnodes -a

The output should look similar to the following. If it does not, try restarting pbs_server (see below).


     state = free
     power_state = Running
     np = 6
     ntype = cluster
     status = rectime=1424867568,cpuclock=OnDemand:1998MHz,varattr=,jobs=,state=free,netload=6890398942,gres=,loadave=0.00,ncpus=4,physmem=8066840kb,availmem=11324404kb,totmem=11970324kb,idletime=240,nusers=1,nsessions=2,sessions=3336 27395,uname=Linux ST-A1771 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u1 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
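The check can also be scripted. The sketch below looks for a "state = free" line in the pbsnodes output; has_free_node is a hypothetical helper name, and the pattern is based only on the sample output shown above.

```shell
#!/bin/bash
# Hypothetical helper: read `pbsnodes -a` output on stdin and succeed
# only if some node reports "state = free" on a line of its own.
has_free_node() {
    grep -Eq '^[[:space:]]*state = free[[:space:]]*$'
}

# Only query Torque when pbsnodes is actually on the PATH.
if command -v pbsnodes >/dev/null 2>&1; then
    if pbsnodes -a | has_free_node; then
        echo "at least one node is free"
    else
        echo "no free node reported; consider restarting pbs_server"
    fi
fi
```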

To restart pbs_server, run


sudo /etc/init.d/pbs_server restart

Also start the scheduler


sudo /etc/init.d/pbs_sched start

And test a job by running


echo "sleep 30" | qsub

When you run


qstat

you should see something like


Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.ST-A1771                 STDIN            ivo                    0 R batch

Note: danielmejia55_at_gmail_dot_com has noted (see comments below) that if you encounter an error about missing queue directives, you need to set these first. He has kindly provided the settings he used.
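Those settings can be kept in a small script instead of being typed at the Qmgr prompt. The sketch below only prints them; queue_commands is a hypothetical helper, the values are the commenter's, and the list stops where his comment does.

```shell
#!/bin/bash
# Hypothetical helper: print the queue settings from the comments as
# qmgr directives, one per line. The values (6 running jobs, 8 CPUs,
# 24 h wall time) are the commenter's; adjust them to your machine.
queue_commands() {
    cat <<'EOF'
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 6
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.walltime = 24:00:00
set queue batch max_user_run = 6
EOF
}

# Review the directives first; nothing is sent to the server here.
queue_commands
```

To apply them, the output can be piped into qmgr as root (queue_commands | qmgr), since qmgr reads directives from standard input when no -c option is given. If a queue named batch already exists, the create line may be rejected while the set lines still apply.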

Submission files

Below, an example submission file for a multiprocessor job is given. In the submission file, you specify the name of the job after the #PBS -N directive and, after the #PBS -l directive, the number of nodes, the number of processors per node (ppn) and the maximum time the job is allowed to run. Typically, you would like to run the job in the folder from which it was submitted. To do so, you can use the $PBS_O_WORKDIR variable. Furthermore, you can use the $PBS_NP variable to pass the number of processes to the mpirun program.


#!/bin/bash
#
#This is an example script example.sh
#
#These commands set up the Torque Environment for your job:
#PBS -N TestJob
#PBS -l nodes=1:ppn=4,walltime=00:12:00

#show the start directory, then move to the directory the job was submitted from
pwd
cd "$PBS_O_WORKDIR"
pwd

#print the time and date
date

mpirun -np $PBS_NP ./testjob

#print the time and date again
date
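Coming back to the series of calculations from the introduction, one job per input can be queued with a short loop. This is a sketch under assumptions: make_jobfile is a hypothetical helper, calc_*.in and ./testjob are placeholder names, and each job is given a single core. It relies on qsub reading the script from standard input, as in the sleep test above.

```shell
#!/bin/bash
# Hypothetical helper: print a Torque submission script that runs
# COMMAND on one core under the job name NAME.
make_jobfile() {
    local name="$1" command="$2"
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -l nodes=1:ppn=1,walltime=00:12:00
cd \$PBS_O_WORKDIR
${command}
EOF
}

# Queue one job per input file; calc_*.in and ./testjob are placeholders.
for input in calc_*.in; do
    [ -e "$input" ] || continue                      # no matching files
    jobname="job_${input%.in}"
    if command -v qsub >/dev/null 2>&1; then
        make_jobfile "$jobname" "./testjob $input" | qsub
    else
        make_jobfile "$jobname" "./testjob $input"   # no Torque: just print
    fi
done
```

Torque then starts as many of these jobs as the np setting of the node allows and holds the rest in the queue until a core becomes free.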

If you have questions or comments, feel free to drop a line!

Comments

danielmejia55_at_gmail_dot_com
2015-05-14 01:45:56
First of all, thank you for this guide. I've been having some issues with it, though.
* It is worth mentioning that "ST-A1771" stands for the name of your local machine, so it must already be defined in the "/etc/hosts" file and used with the same name in "/var/spool/torque/mom_priv/config".
* When testing the installation with "echo "sleep 30" | qsub", it reports that some "queue" directives for the server are missing. Those are set with "qmgr servername" and are needed to run the tests.
ivo_at_ivofilot_dot_nl
2015-05-18 11:09:35
Hi danielmejia55_at_gmail_dot_com,

Thank you for your comments and helpful suggestions!

I didn't get the notification about the missing queue directives. I am not sure why. I will add a small remark to the post that points people who run into this to your comments.

Thanks again!
danielmejia55_at_gmail_dot_com
2015-05-14 01:55:09
The settings I've used were:
# qmgr my_torque_server_name
Qmgr: create queue batch
Qmgr: set queue batch queue_type = Execution
Qmgr: set queue batch max_running = 6
Qmgr: set queue batch resources_max.ncpus = 8
Qmgr: set queue batch resources_max.nodes = 1
Qmgr: set queue batch resources_default.ncpus = 1
Qmgr: set queue batch resources_default.neednodes = 1:ppn=1
Qmgr: set queue batch resources_default.walltime = 24:00:00
Qmgr: set queue batch max_user_run = 6
...