
SACLA HPC System

This page describes the HPC system at SACLA. The author is not a member of the facility, so the information here may be outdated (please let me know!). For official documentation, see http://xhpcfep.hpc.spring8.or.jp/ (not reachable from the public internet; use the VPN).

NOTE: The SACLA HPC system was upgraded in summer 2016. The job system and the numbers of nodes and cores have changed since then.

VPN client for Linux

Linux users can use openconnect (http://www.infradead.org/openconnect/). Most Linux distributions ship it, together with a NetworkManager GUI plugin, in their package repositories. If you do not have root privileges on the client computer (e.g. the HPC system at your home institution), you can technically run it in port-forwarding mode or as a SOCKS proxy; see http://www.infradead.org/openconnect/nonroot.html for details. Of course, you should make sure this is allowed by the IT policy of your institution.
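
If you run openconnect directly, a typical invocation looks like the sketch below. The gateway hostname is a placeholder; use the VPN address provided by SACLA.

sudo openconnect https://vpn.example.spring8.or.jp  # placeholder gateway; replace with the real VPN address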

Job system

The computing system is managed by PBS (TORQUE/Maui before the upgrade). Although some commands resemble those of SGE (UGE) and LSF (used at LCLS), the syntax can differ. See the online manual (man qsub) for details.

You can submit jobs and check their status from fep, xhpcsmp-bl2 and xhpcsmp-bl3.

Checking the status

You can check the status of jobs with:

qstat -u $USER # status of your jobs
qstat          # status of all jobs
qstat -q       # status of queues

Submitting a job

To submit a job, use the qsub command. Typical usages are:

qsub job.sh                   # submit a job that uses one core
qsub -l nodes=1:ppn=14 job.sh # submit a job that occupies 14 cores in a node
qsub -I -X -l nodes=1:ppn=14  # start an interactive shell occupying half a node (14 cores)
                              # and enable X window

Note that the script file name must come after all options to qsub.

Unfortunately, the -d option to specify the working directory is no longer available. Instead, $PBS_O_WORKDIR points to the directory where qsub was executed, so cd there at the start of your job script, as in the sketch below.
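
A minimal job script along these lines (my_program is a placeholder for your own command):

#!/bin/bash
#PBS -l nodes=1:ppn=14        # same as passing -l on the qsub command line
cd $PBS_O_WORKDIR             # move to the directory where qsub was run
./my_program                  # placeholder for your actual command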

Queues

When you run qsub without the -q option, your job is submitted to the serial queue, which consists of 18 worker nodes. Each node has 28 cores, but a single job can use at most 14 of them. You can submit many jobs, but at most 18 of them run simultaneously, even when no other users have submitted jobs.

Each job has a time limit of 24 hours, after which it will be killed.

Each node has 64 GB of RAM. Be careful not to exceed it: if you run out of memory, the worker node might crash and you will have to ask the IT staff to restart it manually.

Fat memory node

If you need more memory, use the smp queue (-q smp). This queue has only one node, but it offers up to 44 cores and 512 GB of RAM.
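
For example, to take the whole node (assuming the same -l nodes/ppn syntax as the serial queue; adjust ppn to what you actually need):

qsub -q smp -l nodes=1:ppn=44 job.sh  # request all 44 cores on the SMP node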

During your beamtime, you can also use xhpcsmp-blN.hpc.spring8.or.jp (N is 2 or 3, depending on your beamline). It has 32 cores and 512 GB of RAM. This node is not managed by the job system; just ssh into it. This machine is not reachable over the VPN.
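
For example (yourname is a placeholder for your SACLA account):

ssh yourname@xhpcsmp-bl2.hpc.spring8.or.jp   # for BL2; use xhpcsmp-bl3 for BL3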

Parallel queues

There are also parallel queues called psmall and plarge. They differ in the maximum number of nodes you can request. Some of them might be disabled when nodes are allocated to the priority queue (see below).

qsub -q psmall -l nodes=10:ppn=28 job.sh

This reserves 10 x 28 = 280 cores on 10 nodes. The script runs on one of the allocated nodes. Inside the script, $PBS_NODEFILE points to a file listing the allocated nodes. It is your responsibility to divide your tasks and launch subprocesses on each node, e.g. with GNU parallel or ssh, as in the sketch below.
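
A minimal sketch using GNU parallel, assuming it is available, that passwordless ssh between the nodes works, and that the working directory is on a shared filesystem; task.sh and the run*.h5 input names are placeholders.

cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE > nodes.txt   # PBS_NODEFILE repeats each node once per core; keep unique names
parallel --sshloginfile nodes.txt --workdir $PBS_O_WORKDIR -j 28 ./task.sh {} ::: run*.h5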

Priority queue

You can apply for a priority queue (up to 10 nodes) for use during your beamtime. To apply, ask your SACLA local contact at least 10 days BEFORE your experiment.

Data size and transfer

The hit images from a 48-hour beamtime rarely exceed 2 TB, so even when the hit rate is very high, a 4 TB portable HDD should suffice. You can copy your data in the computer room on the second floor of the SACLA main building. You can also transfer your data over the VPN using rsync; the throughput is about 10-25 MB/s (~1.5 TB/day) within Japan. Overseas transfers can suffer from high latency; sometimes bbcp helps (see below). Unfortunately, SACLA does not have a Globus endpoint.
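
A typical rsync pull over the VPN looks like this (the remote path is a placeholder):

rsync -avP yourname@xhpcfep02:/path/to/your/data/ /path/to/local/destination/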

To use bbcp, you have to install it on your client (your local computer). It is available from http://www.slac.stanford.edu/~abh/bbcp/. On the server (SACLA) side, bbcp is already installed in ~sacla_sfx_app/local/bin. To enable it, add

source ~sacla_sfx_app/setup.sh

to your .bashrc on the SACLA side.
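
One way to do this from the shell:

echo 'source ~sacla_sfx_app/setup.sh' >> ~/.bashrc   # append to .bashrc on the SACLA side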

To transfer files, first make a file list with find, using absolute paths so that the entries remain valid on the server side. For example:

find $PWD -name "run*.h5" > files.list   # absolute paths, so they resolve correctly on the server

Then prepend the server name with sed:

sed -i -e 's,^,yourname@xhpcfep02:' files.list

This makes a list like:

yourname@xhpcfep02:/path/to/file1.h5
yourname@xhpcfep02:/path/to/file2.h5
yourname@xhpcfep02:/path/to/file3.h5

Copy this list to your local computer using rsync (or scp, FileZilla, etc.). Then run bbcp:

bbcp -P 2 -s 16 -w 100M -I files.list /path/to/local/destination
# -P 2: progress every 2 s; -s 16: 16 parallel streams; -w 100M: TCP window size; -I: read the source list

For parameter tuning, study the bbcp manual.