2015-03-03

# Introduction

Imagine you would like to perform a large series of calculations. Obviously, you would not run the complete series at the same time. Instead, you would like to start as many jobs as the processors in your computer can handle and launch the next job in the series whenever a previous one has finished. You could write your own program for this, but many such programs are already available. In this blog post, I will show you how to install and use Torque. Although Torque can be set up to run on a computer cluster, I will show you how to install and use it on just a single machine. This tutorial is written for Debian Linux, but should in principle also work on Ubuntu and, with (hopefully) small modifications, on other distributions.

# Compiling Torque

Unpack the Torque source tarball (here version 5.1.0), then configure, build and install it


cd torque-5.1.0
./configure --prefix=/opt/gcc-4.7.2/torque-5.1.0
make -j5
sudo make install
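
If the build succeeded, the Torque client and server binaries now live under the install prefix. The helper below is just a convenience sketch (the function name is my own invention); it reports any expected binary that did not get installed.

```shell
# check_torque_bins PREFIX -- sketch of a post-install sanity check.
# Reports Torque binaries missing under PREFIX and prints a
# confirmation when all of them are present and executable.
check_torque_bins() {
    prefix=$1
    missing=0
    for rel in bin/qsub bin/qstat sbin/pbs_server sbin/pbs_mom \
               sbin/pbs_sched sbin/trqauthd; do
        if [ ! -x "$prefix/$rel" ]; then
            echo "missing: $prefix/$rel"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "all Torque binaries found under $prefix"
    fi
    return "$missing"
}

# After the make install above, you would run:
#   check_torque_bins /opt/gcc-4.7.2/torque-5.1.0
```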



# Installing the daemons

Torque uses four daemons that have to be loaded at boot time. Copy their init.d scripts to the /etc/init.d folder like so


sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched



and add these to the boot procedure


sudo update-rc.d pbs_mom defaults
sudo update-rc.d pbs_server defaults
sudo update-rc.d pbs_sched defaults
sudo update-rc.d trqauthd defaults



# Configuring the server and the nodes

Next, we would like to configure our machine as both the server and the (only) node.

Start the trqauthd daemon


sudo /etc/init.d/trqauthd start




Become root, put the Torque binaries on your PATH, and run the torque.setup script


su
export PATH=/opt/gcc-4.7.2/torque-5.1.0/sbin:/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH
./torque.setup root



If you get an error like the following


qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host:


check your /etc/hosts file and change the entries in there. Torque only reads the first two columns to match the hostname with an IP address.

Add your machine to the list of nodes by editing /var/spool/torque/server_priv/nodes. You have to specify the number of cores of your machine after the np= directive.


localhost np=6


Edit /var/spool/torque/mom_priv/config and set your machine as the $pbsserver. Also configure the bitmap for the logging events.


$pbsserver ST-A1771
$logevent 225



Please note that in the above file, ST-A1771 should be replaced by the name of your local machine. Moreover, this name should match an IP address, which can be configured in /etc/hosts (thanks to danielmejia55_at_gmail_dot_com for mentioning this; see the comments below).
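
To see which name Torque would match for each entry, you can mimic its behaviour with a quick awk one-liner that keeps only the first two columns of each hosts entry. The sample file below is made up for illustration; run the same awk command against your real /etc/hosts.

```shell
# Torque matches hostnames against the first two columns of /etc/hosts,
# so the machine name must appear in column two. This prints, for each
# entry, the name Torque sees and the address it maps to.
cat > hosts.sample <<'EOF'
127.0.0.1    localhost
192.168.1.10 ST-A1771
EOF
awk '!/^[[:space:]]*#/ && NF >= 2 { print $2, "->", $1 }' hosts.sample
# prints:
#   localhost -> 127.0.0.1
#   ST-A1771 -> 192.168.1.10
```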

Start pbs_mom.


/etc/init.d/pbs_mom start



In order for every user to be able to submit jobs to the queuing system and check the current status of the queue, every user should have the bin folder of Torque in their $PATH variable. To arrange this, append the Torque binaries folder to the PATH in /etc/profile


echo 'export PATH=/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH' >> /etc/profile
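
Note that the echo above appends the line unconditionally, so re-running the setup leaves duplicates in /etc/profile. A slightly more careful variant, sketched here as a small function (the name is my own), only appends when the line is not yet present:

```shell
# add_torque_path PROFILE_FILE -- append the Torque bin dir to PATH in
# the given profile file, but only if that exact line is not already
# there (idempotent, safe to re-run).
add_torque_path() {
    profile=$1
    line='export PATH=/opt/gcc-4.7.2/torque-5.1.0/bin:$PATH'
    grep -qxF "$line" "$profile" 2>/dev/null || echo "$line" >> "$profile"
}

# On the real system you would run (as root):
#   add_torque_path /etc/profile
```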



Finally, log out as root (CTRL+D).

# Checking the installation

To check that everything is correctly configured, run


pbsnodes -a



If you do not get output similar to the following, you can try restarting pbs_server (see below).


state = free
power_state = Running
np = 6
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003



To restart pbs_server, run


sudo /etc/init.d/pbs_server restart



Also start the scheduler


sudo /etc/init.d/pbs_sched start



And test a job by running


echo "sleep 30" | qsub



When you run


qstat



you should see something like


Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.ST-A1771                 STDIN            ivo                    0 R batch



Note: danielmejia55_at_gmail_dot_com has noted (see the comments below) that if you encounter an error about certain queue directives missing, you need to set these first. He has kindly provided the settings he used.
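
If qsub reports that no default queue is defined, such a queue can be created by feeding directives to qmgr as root (for instance, qmgr < queue.conf). The fragment below is a minimal sketch along the lines of the settings in the comments; the attribute values are assumptions that you should adapt to your own machine.

```
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.walltime = 24:00:00
set queue batch enabled = true
set queue batch started = true
set server default_queue = batch
set server scheduling = true
```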

# Submission files

Below, an example submission file for a multiprocessor job is given. In the submission file, you specify the name of the job after the PBS -N directive, the number of nodes, the number of processors per node, and finally the maximum time the job is allowed to run. Typically, you would like to run the job in the same folder where the job file resides. To do so, you can use the $PBS_O_WORKDIR variable. Furthermore, you can use the $PBS_NP variable to pass the number of processes to the mpirun program.


#!/bin/bash
#
#This is an example script example.sh
#
#These commands set up the Torque Environment for your job:
#PBS -N TestJob
#PBS -l nodes=1:ppn=4,walltime=00:12:00

pwd
cd $PBS_O_WORKDIR
pwd

#print the time and date
date

mpirun -np $PBS_NP ./testjob

#print the time and date again
date
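
Outside of Torque, the two PBS variables are unset, which makes job scripts awkward to debug. One way to dry-run the logic is to fake the variables the scheduler would set; the values below are of course placeholders for what Torque provides at run time.

```shell
# Simulate the environment Torque sets up for a job: PBS_O_WORKDIR is
# the directory qsub was invoked from, PBS_NP the total process count
# (nodes * ppn). With nodes=1:ppn=4 the job would get PBS_NP=4.
export PBS_O_WORKDIR=$(pwd)
export PBS_NP=4

cd "$PBS_O_WORKDIR"
echo "would run: mpirun -np $PBS_NP ./testjob"
# prints: would run: mpirun -np 4 ./testjob
```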



#### Drop a line

danielmejia55_at_gmail_dot_com
2015-05-14 01:45:56
First of all, Thank you for this guide. Though, I've been having some issues with it.
* It is worth to mention that the name "ST-A1771 " stands for the name of your local machine, so it must be already defined in the "/etc/hosts" file and in "/var/spool/torque/mom_priv/config" with the same name
* When trying to test the installation "echo "sleep 30" | qsub" it notifies that some "queue" directives for the server are missing. Those are implemented with "qmgr servername" and needed to run tests

ivo_at_ivofilot_dot_nl
2015-05-18 11:09:35
Hi danielmejia55_at_gmail_dot_com,

I didn't get the notification about the missing queue directives. I am not sure why. I will write a small remark in the post to read your nice comments for people who might encounter them.

Thanks again!

danielmejia55_at_gmail_dot_com
2015-05-14 01:55:09
The settings I've used where
# qmgr my_torque_server_name
Qmgr: create queue batch
Qmgr: set queue batch queue_type = Execution
Qmgr: set queue batch max_running = 6
Qmgr: set queue batch resources_max.ncpus = 8
Qmgr: set queue batch resources_max.nodes = 1
Qmgr: set queue batch resources_default.ncpus = 1
Qmgr: set queue batch resources_default.neednodes = 1:ppn=1
Qmgr: set queue batch resources_default.walltime = 24:00:00
Qmgr: set queue batch max_user_run = 6
...
