Friday, June 29, 2012

PBS/Torque New Installation

I just installed PBS/Torque (with the default pbs_sched scheduler) on a head node and four compute nodes. As far as I can tell, everything is configured correctly: when I run pbsnodes, all of my compute nodes come back as "state=free". The problem is that, of the hostnames I list in server_priv/nodes, only the first one runs its jobs and returns the results to the server. If I reorder the hostnames in the list and restart pbs_server, the new first hostname becomes the only one that runs its jobs.
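For reference, a server_priv/nodes file generally has one compute node per line, optionally followed by a processor count. A minimal sketch of that layout, where the hostnames node01 through node04 and the np values are placeholders assumed for illustration and not the actual entries from this cluster:

```
# server_priv/nodes (hypothetical hostnames and core counts)
node01 np=4
node02 np=4
node03 np=4
node04 np=4
```

Reordering these lines and restarting pbs_server is the change described above.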

Jobs on the first hostname run to completion, return their .e and .o files, and show a "C" state in qstat. When I submit enough jobs that some land on the other compute nodes, those jobs show "R", stay that way for a few minutes, and then disappear. If I log into one of those nodes and check /var/spool/torque/spool, I can see the OU/ER files for the jobs that ran there. Running tracejob on that node gives:

06/28/2012 08:14:19 M scan_for_terminated: job 19.HEADNODE task 1 terminated, sid=12337

06/28/2012 08:14:19 M job was terminated

06/28/2012 08:14:19 M obit sent to server

06/28/2012 08:14:19 M server rejected job obit - 15001

06/28/2012 08:14:19 M removed job script

I don't know what to do next. Please help!

