Using
Our cluster runs the open-source Torque/Maui batch scheduling system,
an implementation of the Portable Batch System (PBS). A batch scheduler
takes user-submitted jobs and distributes them across the cluster in an
intelligent manner, so users don't need to worry about sharing resources
fairly or SSHing into compute nodes to start their jobs. Users submit
jobs to the queue using qsub. I've compiled my own brief intro to qsub,
and there are lots more floating about the internet.
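As a rough sketch of that workflow, the following writes a small job script and hands it to qsub. The file name hello.pbs, the job name, and the resource requests are illustrative assumptions, not our cluster's actual settings.

```shell
#!/bin/sh
# Write a tiny PBS job script. The file name, job name, and resource
# requests are hypothetical; adjust them for your own job.
cat > hello.pbs <<'EOF'
#!/bin/sh
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
# Jobs start in the home directory; return to where qsub was run.
cd "$PBS_O_WORKDIR"
echo "Hello from $(hostname)"
EOF

# Submit only if qsub is on the PATH (i.e. we are on the cluster).
if command -v qsub >/dev/null 2>&1; then
    qsub hello.pbs
else
    echo "qsub not found; on the cluster you would run: qsub hello.pbs"
fi
```

On the cluster, qsub prints the new job's identifier, and qstat shows where the job sits in the queue.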
While PBS queues are great for distributing embarrassingly parallel jobs across the cluster, your application may need processes running on separate compute nodes to share data. A common approach is to use the Message Passing Interface (MPI). Our cluster uses the mpich2 implementation. Cluster-aware applications written in MPI can be started through Torque using an alternative mpiexec from the Ohio Supercomputer Center. There is a nice, brief introduction by Kristina Wanous at the University of Northern Iowa.
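A job script for an MPI program might look something like the sketch below. The binary ./hello_mpi, the job name, and the node counts are hypothetical placeholders. As far as I know, the Ohio Supercomputer Center mpiexec gets the allocation from Torque and by default starts one process per allocated processor, so no machine file or process count is passed; treat that as an assumption and check it against your installed version.

```shell
#!/bin/sh
# Sketch of an MPI job script. ./hello_mpi is a hypothetical MPI binary,
# and the resource requests are examples only.
cat > mpi_hello.pbs <<'EOF'
#!/bin/sh
#PBS -N mpi_hello
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
#PBS -j oe
cd "$PBS_O_WORKDIR"
# Torque lists the processors allocated to this job in $PBS_NODEFILE.
sort "$PBS_NODEFILE" | uniq -c
# Assumption: mpiexec learns the allocation from Torque, so no host
# list or process count is given here.
mpiexec ./hello_mpi
EOF

# Submit only if qsub is available; otherwise just report the command.
if command -v qsub >/dev/null 2>&1; then
    qsub mpi_hello.pbs
else
    echo "qsub not found; on the cluster you would run: qsub mpi_hello.pbs"
fi
```

Requesting nodes=2:ppn=2 asks Torque for two processors on each of two nodes, which suits dual-core machines.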
Managing
Our cluster (9 dual-core nodes) runs Debian. The compute nodes all boot from NFS root filesystems served by the server node. Once that hurdle was cleared, setting up Torque, Maui, mpich2, and mpiexec was pretty simple, mostly the usual:
wget ...
tar ...
configure ...
make
make install
with a bit of configuring for our setup. I'll put up some more detailed notes and our config options when I get the time.