KitchinHUB/gilgamesh

Documentation and scripts that are useful on gilgamesh

About gilgamesh

gilgamesh.cheme.cmu.edu is a computing cluster housed in the Department of Chemical Engineering at CMU.

The cluster consists of 31 nodes:

  • 2 nodes with 32 cores and 256 GB RAM/node (nodes n00-n01)
  • 6 nodes with 32 cores and 128 GB RAM/node (nodes n02-n07)
  • 12 nodes with 32 cores and 64 GB RAM/node (nodes n08-n19)
  • 11 nodes with 48 cores and 128 GB RAM/node (nodes n20-n30)

Nodes n11 and n12 are currently dead.

The CPUs are from AMD, and the nodes are connected by Gigabit Ethernet and InfiniBand networks.

Accounts on gilgamesh

Accounts are usually reserved for members of the research groups supporting gilgamesh (Kitchin, McGaughey, Viswanathan). To request an account, send an email to [email protected] with your name, advisor, and Andrew ID.

Note: There is no backup on gilgamesh. You are responsible for backing up your data.

Accessing gilgamesh.cheme.cmu.edu

gilgamesh can only be accessed by ssh from IP addresses in the CMU domain. Therefore, your options for logging in are:

  • from a computer on campus.
  • through a VPN (e.g. CiscoVPN with the campus certificate).
  • from a remote machine that is logged into a campus machine.
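
Once you are on a campus network or the VPN, log in with ssh (yourid is a placeholder for your gilgamesh username):

ssh [email protected]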

Setting up your account

We make extensive use of modules to set up the compilers, executables, and libraries that are available to you. In your setup file (.cshrc or .bashrc) you simply load the modules that you need, e.g.:

module load netcdf-3.6.1-gcc

That command will put the netcdf libraries and commands in your paths. You can load modules in your setup file, or at the command line, or in your scripts.
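
For example, a short session at the command line might look like this (using the netcdf module above; substitute the modules you actually need):

module load netcdf-3.6.1-gcc     # put the netcdf libraries and commands on your paths
module list                      # show which modules are currently loaded
module unload netcdf-3.6.1-gcc   # remove them again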

To find out which modules are available, use the command:

module avail

Note for bash users: you must put the line “source /etc/profile.d/modules.sh” near the top of your .bashrc file (tcsh users source the corresponding csh version in .cshrc).

Here is an example .bashrc file that loads a basic environment with Python:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
	. /etc/bashrc
fi

source /etc/profile.d/modules.sh

# User specific aliases and functions
module load molsim-06-640-s16

#end

Running jobs on gilgamesh

ALL compute intensive jobs should be submitted through our queue system. We utilize Torque and Maui for scheduling and running jobs. The standard Torque commands are available at the command line. To use the Maui commands you must first load the maui module (module load maui) at the command line or in your startup file.
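
For example, after loading the maui module, showq gives an overview of the queue:

module load maui
showq            # Maui's summary of active, eligible, and blocked jobs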

The queue system is set up to enable fair usage of the entire cluster by all users. Jobs are run with different priorities according to the resources they use. Short jobs (up to 24 hours) are given higher priority than long jobs (up to 168 hours). There are limits that prevent any group or user from using more than half of the cluster.

Temporary exceptions to these queues may be considered. Exceptions must be requested with a reason and justification, and must be approved by your advisor.

Preparation of jobscripts

Job scripts are ordinary scripts or executables that run the commands for your simulation. Usually you need to remember to have this line in your script:

cd $PBS_O_WORKDIR

or:

cd /path/to/your/jobscript

which tells your script to change directory from $HOME (where it starts) to the working directory where the calculation is supposed to run. It is not necessary for your jobscript to be executable, but any commands in the script must exist on your $PATH and must be executable.
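
As a concrete sketch, a minimal serial jobscript might look like this (myprogram is a placeholder for your own executable):

#!/bin/bash
# minimal serial jobscript (sketch)
cd $PBS_O_WORKDIR                        # start in the directory the job was submitted from
echo "Running on $(hostname) in $(pwd)"  # record where the job actually ran
./myprogram > output.txt                 # replace with your own program or commands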

Once you create your script, you must submit it to the queue. To do this you use the qsub command along with options that request resources, e.g. how long you would like to run, how many nodes you need, how much memory, etc. These options must be given on the command line; any #PBS directives inside scripts are ignored. Here are some examples of submitting job scripts.

To request an interactive session on a node to run scripts from the command line for up to 5 hours:

qsub -I -l walltime=5:00:00

To run a serial job for up to 48 hours:

qsub -l walltime=48:00:00 -joe jobscript.sh

To request 32 cores on one node for up to 168 hours:

qsub -l walltime=168:00:00,nodes=1:ppn=32 -joe jobscript.sh

Note that your jobscript must know how to start a parallel process, e.g. using mpirun. Simply asking for the nodes will not make it run in parallel!

To request any 32 cores on any node for 24 hours:

qsub -l walltime=24:00:00,nodes=32 -joe jobscript

To request any 32 cores on any node with 128 GB of memory for 24 hours:

qsub -l walltime=24:00:00,nodes=32,mem=128gb -joe jobscript
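
For the multi-core requests above, here is a sketch of a jobscript that actually launches a parallel run (the openmpi module name and myparallelprogram executable are placeholders; check module avail for what is installed):

#!/bin/bash
# parallel jobscript sketch; submit with, e.g.,
#   qsub -l walltime=168:00:00,nodes=1:ppn=32 -joe parallel.sh
cd $PBS_O_WORKDIR
module load openmpi                              # placeholder module name
mpirun -np 32 ./myparallelprogram > output.txt   # the script, not qsub, starts the parallel processes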

Once you have submitted your jobs you may want to monitor them. To find out the status of your jobs use the command:

qstat | grep $USER

This will list your jobs, tell you whether they are running or queued, and give you the jobid of each job. To get details about a particular job, use:

qstat -f jobid
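
For example, to pull out just the state, execution host, and resource usage of a job (12345 is a placeholder jobid):

qstat -f 12345 | grep -E 'job_state|exec_host|resources_used'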

If you need to delete a job, use the qdel command:

qdel jobid

Getting help on gilgamesh

When asking for help it is critical that you provide as much information about your problem as you can. Simply saying “it doesn’t work” will not get you an answer. You should provide all error messages that you observed, and anything you did that led up to the error.

Use Google! I have often found solutions by googling the error message, or parts of the error message.

If the problem is related to a job in the queue, examine the jobscript.ojobid and jobscript.ejobid files (the output and error files from the queue). Use tracejob jobid to see what the queue knows about the job and its history. Try running the job through an interactive queue session from the command line (qsub -I); sometimes it is easier to figure out what is happening that way.
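
For example (12345 is a placeholder jobid):

cat jobscript.o12345 jobscript.e12345    # the queue's output and error files
tracejob 12345                           # the queue's record of the job
qsub -I -l walltime=1:00:00              # then try the failing commands by hand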

Search the archives of [email protected] at https://lists.andrew.cmu.edu/pipermail/gilgamesh-users/ for similar problems. If you don’t find anything, send an email to [email protected] to see if anyone else has had the problem, and when you finally get the solution, send it to the mailing list so that others can see the solution too.

Check with other users and your advisor to see if they know anything about the problem.

Software

In general you can find a list of installed software packages by typing:

module avail

Members of the Kitchin and McGaughey groups may install software for their groups in /opt/kitchingroup or /opt/mcgaugheygroup.

If RPMs exist in known repositories, you may request that additional software be installed on the cluster, provided it does not conflict with existing software and all dependencies on other packages are resolved.

If new software requires compiling and root privileges, you will need to discuss the installation with Professor Kitchin and provide him with detailed information about the compilation.

Matlab

Using Matlab on gilgamesh

Matlab should be automatically available on all the nodes on gilgamesh. Here is how you set up your m-file to run as a batch job. Make sure you put exit at the end, like this:

a = 4;
b = 5;
a + b
exit

To submit this job to the queue, use this script:

#!/bin/bash
cd $PBS_O_WORKDIR
unset DISPLAY
pwd
matlab  -nojvm -nosplash -nodisplay -r myscript

#end
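
If you saved the script above as, say, runmatlab.sh (the name is arbitrary), submit it like any other job:

qsub -l walltime=3:00:00 -joe runmatlab.sh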

Another way to submit the job that doesn’t require writing a script is:

qsub -l walltime=3:00:00 <<'EOF'
cd $PBS_O_WORKDIR
unset DISPLAY
matlab  -nojvm -nosplash -nodisplay -r myscript
EOF

(The quoted 'EOF' keeps your login shell from expanding $PBS_O_WORKDIR before the script reaches the queue.)

Running Matlab interactively

You may prefer to run Matlab interactively. In this case, request an interactive job through the queue:

qsub -I -l walltime=4:00:00

Note it is hard to tell what the options above are in some browsers: the first one is a capital i, the second is a lowercase L. Wait to get a prompt, and then run your script. Note that the nodes do not have a display, so you won’t be able to create graphs or use the graphical desktop.
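
Once you get a prompt on a compute node, a typical session might look like this (the path and script name are placeholders):

cd /path/to/your/work                            # wherever your m-file lives
matlab -nojvm -nosplash -nodisplay -r myscript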

Gilgamesh in emacs

There is a gilgamesh.el file in this directory that provides some useful functions to interact with gilgamesh.

(load-file "gilgamesh.el")
(load-file "ivy-gilgamesh.el")

elisp:qstat is a helm interface to qstat. You can get info on and delete jobs as actions.

elisp:gilgamesh-net will show a buffer with the net traffic on each node.

elisp:gilgamesh-pbsnodes should prompt for a node, then show you a buffer of the jobs on the node.

elisp:ivy-top Runs top with an ivy selection buffer.

elisp:ivy-qstat Runs qstat with an ivy selection buffer.

elisp:ivy-pbsnodes Runs pbsnodes via ivy. It shows the node and load in the selection buffer.

elisp:ivy-node-ps will prompt you for a node, then show the processes on the node.

elisp:ivy-net will show a list of nodes with net-traffic. The default action is to show the processes on the node.

Administrative notes

Adding new users

Use the GUI tool (system-config-users) to add new users and configure groups. From Emacs you can run it with sudo using:

(load-file "gilgamesh-cmds.el")
(sudo "/usr/bin/system-config-users")

Use the command

chage -d 0 userid

to force the users to change their passwords on login.

I recommend you use a site like http://makeagoodpassword.com to generate reasonable passwords.

Restarting torque daemons

All of the startup scripts for the compute nodes are located in /etc/beowulf/init.d on the head node. To restart the Torque daemon (pbs_mom) on a particular compute node, run the following command:

NODE=15 /etc/beowulf/init.d/90torque

where ‘15’ is the node number. You may get some warnings that some mounts are already there, but the net result should be that pbs_mom is restarted on that node.
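
To check that pbs_mom is running again, one option (a sketch using bpsh, which is described in the commands list below) is:

bpsh 15 ps aux | grep pbs_mom    # look for the pbs_mom process on node 15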

Restarting nodes

Most of the time you can go to http://gilgamesh.cheme.cmu.edu/scyld-imf to reboot nodes. You need the root password for this. Click on the Nodes tab, select the nodes to reboot, and then click IPMI -> IPMI Power Cycle to restart them. Nodes usually crash because they run out of memory.

Occasionally you have to go to the cluster room and manually power cycle the nodes.

Killing jobs on nodes that have crashed

Sometimes jobs get stuck in the queue when nodes crash. In that case I usually delete the job like this:

qdel -p jobid

You probably need to use sudo for this.

Restarting the cluster

When the power goes out, or you shut the cluster down for some reason, it is a little tricky to restart it. For some reason the InfiniBand adapter does not come up when we expect it to, so the NFS server hangs for several minutes until it times out. When that happens the home directories do not get mounted, so you have to log in as root on the head node and run:

mount -a

That should make it work. We don’t reboot often enough to fix this.

Commands that might be helpful

Check the man pages on these. I do not use them regularly.

ipmitool
Can be used to reboot nodes at the command line.
bpctl
Control the operational state and ownership of compute nodes.
beostat
Display raw data from the Beostat system.
beosetup
Configure, restart, and view information about a Scyld ClusterWare cluster.
beoconfig
Operate on Scyld ClusterWare cluster configuration files.
service beowulf {start, stop, restart, reload}
See /etc/init.d/beowulf.
bpstat
Show cluster node status and cluster process mapping.
beostatus
Display status information about the cluster.
bpsh
Run a program on a node. The nodes are set up so that we do not just ssh to a node to run a command; use bpsh instead.
bpcp
Copy files and/or directories between cluster machines.

The ps command shows all the processes on all the nodes.

To see what is happening on the login node use:

ps aux | bpstat -P master | sort -k 3 -n -r
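
Similarly, to check an individual compute node (node 5 is just an example number):

bpsh 5 uptime                             # load averages on node 5
bpsh 5 ps aux | sort -k 3 -n -r | head    # top CPU consumers on node 5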
