SACLA HPC system
This page describes the HPC system at SACLA. The author is not a member of the facility, so the information here can be outdated (please let me know!). For the official documentation, see http://xhpcfep.hpc.spring8.or.jp/ (unavailable from the internet; use VPN).
NOTE: The SACLA HPC system was upgraded in summer 2016. The job system and the numbers of nodes and cores have changed.
Linux users can use openconnect (http://www.infradead.org/openconnect/). Most Linux distributions ship it, together with a NetworkManager GUI, in their package systems. If you do not have root privileges on the client computer (e.g. the HPC system at your home institution), you can technically run it in port-forwarding mode or as a SOCKS proxy. See http://www.infradead.org/openconnect/nonroot.html for details. Of course, you should make sure this is allowed by the IT policy at your institution.
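As an illustration, a non-root setup might look like the following sketch, assuming the ocproxy helper is installed (the VPN host name and port numbers below are placeholders):
openconnect --script-tun --script "ocproxy -D 1080 -L 2222:xhpcfep02:22" vpn.example.ac.jp # no tun device; traffic goes through ocproxy
# afterwards, point ssh/rsync at localhost port 2222, or use the SOCKS proxy on port 1080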
The computing system is managed by PBS (TORQUE/Maui before the upgrade). Although some commands resemble those of SGE (UGE) and LSF (used at LCLS), the syntax may differ. Study the online manual (man qsub) for details.
You can submit jobs and check their status from fep, xhpcsmp-bl2, and xhpcsmp-bl3.
You can check the status of jobs with:
qstat -u $USER # status of your jobs
qstat # status of all jobs
qstat -q # status of queues
To submit a job, you use the "qsub" command. Typical usages are:
qsub job.sh # submit a job that uses one core
qsub -l nodes=1:ppn=14 job.sh # submit a job that occupies 14 cores in a node
qsub -I -X -l nodes=1:ppn=14 # start an interactive shell occupying half a node (14 cores)
# and enable X window
Note that the script file name must come after the arguments to qsub.
Unfortunately, the -d option to specify the working directory is no longer available. $PBS_O_WORKDIR points to the directory where qsub was executed.
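For reference, a minimal job script could look like this sketch (the program name is a placeholder):
#!/bin/bash
#PBS -l nodes=1:ppn=14 # same resource request as the qsub example above
#PBS -N myjob          # job name shown in qstat (optional)
cd $PBS_O_WORKDIR      # move to the directory where qsub was executed
./my_program > my_program.log 2>&1 # my_program is a placeholder for your own program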
When you run qsub without the -q option, your job is submitted to the serial queue, which consists of 18 worker nodes. Each node has 28 cores, but you can use at most 14 cores per job. You can submit many jobs, but at most 18 of your jobs run simultaneously, even when no other users have submitted jobs.
Each job has a time limit of 24 hours, after which the job will be killed.
Each node has 64 GB of RAM. Be careful not to exhaust the memory; if you run out of memory, the worker node might crash and you will have to ask the IT staff to restart it manually.
If you need more memory, you can use the smp queue (-q smp). This queue has only one node, but you can use up to 44 cores and 512 GB of RAM.
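For example, a submission to this queue might look like the following (a sketch; adjust ppn to what you actually need):
qsub -q smp -l nodes=1:ppn=44 job.sh # submit to the smp queue using all 44 cores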
During your beamtime, you can also use xhpcsmp-blN.hpc.spring8.or.jp (N is 2 or 3, depending on the beamline). It has 32 cores and 512 GB of RAM. This node is not managed by the job system; just ssh to it. This machine is not reachable from the VPN.
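For example, from within the facility network (yourname is a placeholder):
ssh yourname@xhpcsmp-bl3.hpc.spring8.or.jp # log in directly to the BL3 analysis node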
There are also parallel queues called psmall and plarge. They differ in the maximum number of nodes you can request. Some of them might be disabled when nodes are allocated to the priority queue (see below).
qsub -q psmall -l nodes=10:ppn=28 job.sh
This reserves 10 x 28 = 280 cores on 10 nodes. The script runs on one of the nodes. From the script, you can get the list of allocated nodes from $PBS_NODEFILE. It is your responsibility to divide your tasks and launch subprocesses on each node (using GNU parallel or ssh), as sketched below.
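As a sketch, a job script submitted with the command above could distribute tasks over the allocated nodes with GNU parallel like this (process.sh and the run*.h5 pattern are placeholders; passwordless ssh between the nodes and a shared filesystem are assumed):
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE > nodes.txt # PBS lists each node once per core; keep one line per node
parallel --sshloginfile nodes.txt --workdir $PBS_O_WORKDIR ./process.sh {} ::: run*.h5 # run process.sh once per input file, spread over the nodes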
You can apply for a priority queue (up to 10 nodes) during the beamtime. To apply, ask your SACLA local contact person at least 10 days BEFORE your experiment.
The hit images from a 48-hour beamtime rarely exceed 2 TB; even when the hit rate is very high, a 4 TB portable HDD should suffice. You can copy your data in the computer room on the second floor of the SACLA main building. You can also transfer your data over VPN using rsync. The throughput is about 10-25 MB/s (~1.5 TB/day) within Japan. Overseas transfers can suffer from high latency.
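A typical invocation might look like this sketch (the paths and the user name are placeholders; xhpcfep02 is the login node used in the bbcp example below):
rsync -avP yourname@xhpcfep02:/path/to/your/data/ /path/to/local/destination/ # archive mode, resumable, with progress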
Sometimes bbcp helps. Unfortunately, SACLA does not have a Globus endpoint.
To use bbcp, you have to install it on your client (your local computer). It is available from http://www.slac.stanford.edu/~abh/bbcp/.
On the server (SACLA) side, bbcp is installed in ~sacla_sfx_app/local/bin. To enable it, add
source ~sacla_sfx_app/setup.sh
to your .bashrc.
To transfer files, first make a file list using find. Use an absolute path so that the entries become full paths, as in the example list below. For example:
find $PWD -name "run*.h5" > files.list
Prepend the server name with sed:
sed -i -e 's,^,yourname@xhpcfep02:,' files.list
This makes a list like:
yourname@xhpcfep02:/path/to/file1.h5
yourname@xhpcfep02:/path/to/file2.h5
yourname@xhpcfep02:/path/to/file3.h5
Copy this list to your local computer using rsync (or scp or FileZilla, whatever).
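For example (the remote path is a placeholder):
scp yourname@xhpcfep02:/path/to/files.list . # fetch the file list into the current directory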
Then run bbcp:
bbcp -P 2 -s 16 -w 100M -I files.list /path/to/local/destination
For parameter tuning, study the bbcp manual.