Using

Our cluster runs the open-source Torque/Maui Portable Batch System (PBS). A batch scheduler takes user-submitted jobs and distributes them across the cluster in an intelligent manner, so users don't need to worry about sharing resources fairly or sshing into compute nodes to start their jobs. Users submit jobs to the queue using qsub. I've compiled my own brief intro to qsub, and there are lots more floating about the internet.
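To make the qsub workflow concrete, here is a minimal sketch of writing and submitting a job. The file name hello.pbs, the job name, and the resource requests are made up for illustration and are not our cluster's defaults; adjust them to whatever your job needs.

```shell
#!/bin/sh
# Write a minimal Torque job script (hello.pbs is a hypothetical name).
cat > hello.pbs <<'EOF'
#!/bin/sh
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
# Torque starts jobs in $HOME; return to the directory qsub was run from.
cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

# Submit only if qsub is on the PATH; on success qsub prints the new job's ID.
if command -v qsub >/dev/null 2>&1; then
    qsub hello.pbs
else
    echo "qsub not found; job script written to hello.pbs"
fi
```

Once the job is queued, qstat shows its state, and with -j oe the combined stdout and stderr should end up in a file named after the job in the submission directory.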

While PBS queues are great for distributing embarrassingly parallel jobs across the cluster, your application may need processes running on separate compute nodes to share data. A common approach is to use the Message Passing Interface (MPI). Our cluster uses the mpich2 implementation. Cluster-aware applications written with MPI can be started through Torque using an alternate mpiexec from the Ohio Supercomputer Center. There is a nice, brief introduction by Kristina Wanous at the University of Northern Iowa.
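As a rough sketch of how an MPI program gets launched through Torque, the job script below requests two nodes and hands the program to mpiexec. The program name ./hello_mpi, the script name mpi_hello.pbs, and the node counts are assumptions for illustration. I'm also assuming mpiexec was built to talk to mpich2 by default, so no extra flags are shown.

```shell
#!/bin/sh
# Write a Torque job script that launches an MPI program
# (mpi_hello.pbs and ./hello_mpi are hypothetical names).
cat > mpi_hello.pbs <<'EOF'
#!/bin/sh
#PBS -N mpi_hello
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
#PBS -j oe
cd "$PBS_O_WORKDIR"
# $PBS_NODEFILE has one line per processor slot Torque allocated to the job.
echo "Allocated slots: $(wc -l < "$PBS_NODEFILE")"
# mpiexec learns the allocated nodes from Torque, so no machine file is passed.
mpiexec ./hello_mpi
EOF

# Submit only if qsub is on the PATH.
if command -v qsub >/dev/null 2>&1; then
    qsub mpi_hello.pbs
else
    echo "qsub not found; job script written to mpi_hello.pbs"
fi
```

The point of going through Torque is that the scheduler decides which nodes the processes land on, so nobody has to maintain a host list or ssh around by hand.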

Managing

Our cluster (9 dual-core nodes) runs Debian. The compute nodes all boot to NFS roots off the server node. Once that hurdle was passed, setting up Torque, Maui, mpich2, and mpiexec was pretty simple, mostly the usual:

wget ...
tar ...
./configure ...
make
make install

with a bit of configuring for our setup. I'll put up some more detailed notes and our config options when I get the time.