Running the pipeline

biochem_fan edited this page Oct 13, 2021 · 18 revisions

Introduction

Here you will process a lysozyme dataset collected at 7 keV. It is part of the data used for the S-SAD phasing reported in Nakane et al., Acta Cryst. D, 2015. Feel free to play with it. The full datasets (processed by Cheetah) are available at CXIDB #33.

If you are not familiar with CrystFEL, we recommend that you go through the CrystFEL tutorial first. In this tutorial, we focus on SACLA-specific issues and omit the basics common to LCLS.

Online and offline pipeline

As described in Nakane et al, J. Appl. Cryst., 2016, the data processing pipeline consists of the online (realtime) analysis and the offline analysis.

The online analysis monitors the hit rate and detector saturation in realtime. This is discussed in Online (realtime) analysis. The offline pipeline loads recorded images through the SACLA API, runs spot finding and converts hit images into the HDF5 format. The outputs can be processed by many programs, including CrystFEL, cppxfel and cctbx.xfel. On this page, we run the offline pipeline. On the next page, Running CrystFEL, we run CrystFEL and solve the structure.

The pipeline must be run on the SACLA HPC system. Once hit images have been extracted, you can download them to process them locally. If you do not have a SACLA HPC account, or just want to learn how to process the outputs from the pipeline, skip this page and read Running CrystFEL.

Data collection

At SACLA, raw images are grouped into runs. Typically, a run consists of 150 dark images without X-rays, followed by 5000 exposed images. The dark images are used to calculate the dark current of the detector. Runs are collected one after another for as long as the sample in the injector lasts. This is different from the standard practice at LCLS, where people tend to collect one long run with tens of thousands of images from a sample batch.

Run 1:  D1 D2 D3 ... D150 E1 E2 E3 ...................... E5000
Run 2:  D1 D2 D3 ... D150 E1 E2 E3 ...................... E5000
Run 3:  D1 D2 D3 ... D150 E1 E2 E3 ...................... E5000
...

(D: dark image, E: exposed image)

A run is identified by a run number, while an image is identified by a tag number.

You can change the number of dark and exposed images in a run in the "Run control GUI". However, we recommend that you stick to the defaults. Since images become available for processing only after the run has been completed, making a run bigger increases the latency of data processing. You need at least 50 dark images in each run to get a reliable estimate of the dark current.
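To get a feel for the latency cost of bigger runs, the arithmetic can be sketched as follows. The repetition rate of 30 Hz used here is an assumption for illustration only (it is not stated on this page), and run_duration_sec is a hypothetical helper name; substitute the rate of your own beamtime.

```shell
# Estimate how long one run takes to record, i.e. the minimum delay
# before its images become available for processing.
# ASSUMPTION: rate_hz is supplied by you; 30 below is only a placeholder.
run_duration_sec() {
    local n_dark=$1 n_exposed=$2 rate_hz=$3
    awk -v d="$n_dark" -v e="$n_exposed" -v r="$rate_hz" \
        'BEGIN { printf "%.1f\n", (d + e) / r }'
}

run_duration_sec 150 5000 30     # default run layout
run_duration_sec 150 20000 30    # four times as many exposed images
```

The second call shows how the waiting time grows almost linearly with the number of exposed images.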

Log in to the SACLA HPC

First, establish a VPN connection to SACLA. Then log in to the xhpcfep node (fep stands for front end processor). If you are on site, you can use xhpcsmp-bl3 (or xhpcsmp-bl2 for beamline BL2), which is more powerful.

ssh -Y yourname@xhpcfep # VPN
ssh -Y yourname@xhpcsmp-bl3 # on site

Once connected, you have to set up public key authentication. First run ssh xfer. If it does not ask you for a password, you don't need this step; skip to the next section. Otherwise, run the following.

mkdir -p ~/.ssh
ssh-keygen -t ed25519 # create a key pair
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys # add this to the allowed list

Run ssh xfer again and make sure it no longer asks you for a password.
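If you prefer a check that cannot hang on a password prompt, a minimal sketch is shown here. It relies on OpenSSH's BatchMode option, which makes ssh fail instead of prompting; check_passwordless_ssh is a hypothetical helper name, not part of the pipeline.

```shell
# Succeed only if "ssh HOST" works without any interactive prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_passwordless_ssh() {
    local host=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true 2>/dev/null; then
        echo "OK: $host accepts key authentication"
    else
        echo "FAILED: $host still asks for a password (or is unreachable)"
        return 1
    fi
}

# Usage on the SACLA HPC front end:
#   check_passwordless_ssh xfer
```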

Start the pipeline

Next, create and go to your work directory.

mkdir /work/yourname/cheetah-test
cd /work/yourname/cheetah-test

WARNING! Files under /work are automatically deleted one month after their last access. Copy important files to /UserData/yourname (accessible from xfer2) for long-term storage. /home has no time limit, but its quota is smaller; this is where you install your scripts and programs. Details are discussed in SACLA HPC system.
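To see which files are approaching the deletion limit, something like the following sketch may help. It assumes that "last access" corresponds to the access time (atime) reported by find, which may not match exactly how the cleanup is implemented; list_stale_files and the 20-day threshold are illustrative choices.

```shell
# List files whose access time is more than DAYS full days in the past,
# i.e. candidates for automatic deletion from /work.
# ASSUMPTION: the cleanup policy is based on atime.
list_stale_files() {
    local dir=$1 days=$2
    find "$dir" -type f -atime +"$days" -print
}

# Usage (path follows the example on this page):
#   list_stale_files /work/yourname 20
```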

Now you are ready to launch the "Cheetah dispatcher" GUI.

source ~sacla_sfx_app/setup.sh
cheetah-dispatcher

If this is the first time you launch Cheetah dispatcher in this directory, it asks you to set up a configuration file.

ERROR: Configuration file was not found!

You should copy /home/sacla_sfx_app/packages/tools/sacla-photon.ini into this directory
and confirm the settings.

Copy sacla-photon.ini as shown in the console. In most cases you don't have to edit it. If you want to change the spot finding parameters for Cheetah, you can do it here. But as written in Nakane et al., J. Appl. Cryst., 2016, the default parameters work well in almost all cases.

Note that we don't use Cheetah's spot list for indexing in CrystFEL, so the quality of Cheetah's spot finding is not very important; it only has to distinguish hits from non-hits. If you tune the parameters heavily, you might get 0.5 % more hits, but these images tend to be very weak and do not contribute much signal anyway. The author does not think it is worth the effort.

We use Cheetah's peakfinder6 algorithm for spot finding. Sometimes people from other beamlines ask if we should use the peakfinder8 algorithm, which uses radial averages for spot finding. Our answer is no. The reason is twofold: (1) Cheetah doesn't know the optimized geometry during data collection, and (2) we often have shadows from the injector or the vacuum suction, which are not rotationally symmetric. peakfinder8 has been implemented in CrystFEL 0.6.3, so you can use it for indexing regardless of Cheetah's configuration.

cp /home/sacla_sfx_app/packages/tools/sacla-photon.ini .
cheetah-dispatcher

If you have a priority queue allocated for your beamtime, you can specify it as --queue=occupancy, where occupancy is the name of the queue.

By default, beamline 2 is selected, since most SFX experiments will be performed there after the 2017B season. If your data were collected at beamline 3, make sure you choose BL3 in the GUI. You can also add --bl=3 to the command line to make it the default.

Submit jobs

Type "266711-266721" into the "Run ID" text box, choose "BL3" and click the "Submit" button. During your beamtime, you have to specify your own run numbers, of course.

You can omit the second number (e.g. "266711-"). Cheetah will then automatically detect and submit all runs from the first number onward. But do NOT do this now: it would submit more than 150000 runs, from 266711 to the latest run!
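Before submitting, it can be worth checking how many runs a range actually covers. The following is a small sketch; count_runs is a hypothetical helper and has nothing to do with how Cheetah dispatcher parses the "Run ID" box internally.

```shell
# Count the runs covered by a closed range such as "266711-266721".
# Open-ended ranges ("266711-") are rejected because their size is unknown.
count_runs() {
    local range=$1 first last
    case "$range" in
        *-) echo "open-ended range: $range" >&2; return 1 ;;
    esac
    first=${range%-*}    # text before the last "-"
    last=${range#*-}     # text after the first "-"
    if [ "$last" -lt "$first" ]; then
        echo "range is reversed: $range" >&2
        return 1
    fi
    echo $(( last - first + 1 ))
}

count_runs 266711-266721    # prints 11
```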

Screen shot of Cheetah dispatcher

Low-level filtering

As described in Nakane et al., J. Appl. Cryst., 2016, low-level filtering (LLF) is no longer used and has been removed from the GUI.

Monitor the progress

You can select multiple rows by clicking while holding the Ctrl key (as in Explorer or Finder). After selection, a right-click shows a menu. "Count sums" shows a summary of the selected runs.

Type: normal
Total: 54801
Processed: 54801
Accepted: 54801
Hits: 34250 (62.5% of accepted)
Indexed: 18621  (54.4% of hits)
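The percentages in this summary are simple ratios of the counts. A minimal sketch that recomputes them from the numbers above (the variable names are arbitrary):

```shell
# Percentages as reported by "Count sums", recomputed from the raw counts.
accepted=54801
hits=34250
indexed=18621

hit_rate=$(awk -v h="$hits" -v a="$accepted" 'BEGIN { printf "%.1f", 100 * h / a }')
index_rate=$(awk -v i="$indexed" -v h="$hits" 'BEGIN { printf "%.1f", 100 * i / h }')

echo "Hit rate:      ${hit_rate}% of accepted"   # 62.5
echo "Indexing rate: ${index_rate}% of hits"     # 54.4
```

Note that the indexing rate is relative to the number of hits, not to the total number of images.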

One job usually takes 5 to 10 minutes to complete. If it takes significantly longer, your runs might have been transferred to the tape archive. This happens automatically and transparently several weeks after data collection.

Check the cell parameters

"Check cell" in the right-click menu launches cell_explorer from the CrystFEL suite and shows histograms of the cell parameters. You can use the mouse wheel to zoom in or out. Drag on a histogram while holding the Shift key to select a region. Select the main peaks in the a, b, c, alpha, beta and gamma panels and press Ctrl-F (or choose [Tools]-[Fit cell] from the menu).

Screen shot of cell_explorer

Here we get the putative cell parameters 39.2 80.4 80.9 89.7 89.55 89.8. Note that these values are not very accurate because the diffraction geometry has not been optimized yet. Also note that CrystFEL does not apply Bravais lattice constraints and reports the cell in the primitive setting.

If you have a CCP4 license, you can use othercell by Phil Evans to explore the metric symmetry.

$ othercell
Type cell >>
39.2 80.4 80.9 89.7 89.55 89.8
Type lattice type >>
P
Type target cell, or eof (control/D) for no target >>


<<<<<<<<<<<

Input  cell:
39.200080.400080.9000   89.700089.550089.8000
 - Lattice Type P
Lattice point group: P 4 2 2
 within angular tolerance    3.0 (reset with eg -tol[erance] 5)

                      Lattice cell:   80.65  80.65  39.20     90.00  90.00  90.00
Lattice unit cell after reindexing:   80.40  80.90  39.20     89.55  89.80  89.70

Let's take tetragonal "80.65 80.65 39.20 90.00 90.00 90.00" as the first guess.
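The tetragonal guess is simply the fitted cell with the two nearly equal axes averaged and all angles set to 90 degrees. As a sketch, the helper below (tetragonal_cell is a hypothetical name) does this averaging and prints the result in CrystFEL's unit cell file format; whether you need such a file at this point depends on how you run CrystFEL later.

```shell
# Average the two nearly equal axes and print a tetragonal cell in
# CrystFEL's unit cell file format (unique axis c, angles fixed at 90).
tetragonal_cell() {
    local short=$1 long1=$2 long2=$3
    awk -v c="$short" -v a1="$long1" -v a2="$long2" 'BEGIN {
        a = (a1 + a2) / 2
        print "CrystFEL unit cell file version 1.0"
        print ""
        print "lattice_type = tetragonal"
        print "centering = P"
        print "unique_axis = c"
        print ""
        printf "a = %.2f A\n", a
        printf "b = %.2f A\n", a
        printf "c = %.2f A\n", c
        print "al = 90.00 deg"
        print "be = 90.00 deg"
        print "ga = 90.00 deg"
    }'
}

# Fitted primitive cell lengths from cell_explorer: 39.2 80.4 80.9
tetragonal_cell 39.2 80.4 80.9
```

Redirect the output to a file of your choice if you want to keep it.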

View diffraction images

"View hits" in the right-click menu launches the hdfsee viewer in CrystFEL with several extensions.

Screen shot of hdfsee

If indexing has been started, the stream file is loaded. Otherwise, only the image file is loaded. With the stream file, a table is displayed in which you can sort images by resolution or number of spots. Clicking a row displays the selected image.

Key bindings are as follows:

  • F3 or [View]-[Set binning]
    change binning. 2 by default. 1 shows the image in the original size
  • F5 or [View]-[Boost intensity]
    bigger numbers will darken the image
  • F8 or [View]-[Features]
    mark detected spots by black circles, predicted (integrated) spots by red circles
  • F9 or [View]-[Resolution Rings]
    show resolution rings

Because the diffraction geometry has not been optimized yet, the spots tend to be over-predicted to compensate for positional errors.

Note that even when multiple rows are selected, images and cell parameters can be examined for only a single run at a time. (TODO: fix this.)

Examine the output

In your work directory, you will find these files and directories:

  • 266702-0/
    • 266702.geom (CrystFEL geometry)
    • 209060-dark.h5 (Dark average)
    • 209060-geom.h5 (Cheetah geometry)
    • run266702-0.h5 (hit images)
    • ... (other log files)
  • 266702-1/
    • run266702-1.h5
    • ...
  • 266702-2/
    • run266702-2.h5
    • ...
  • 266703-0/
    • 266703.geom
    • run266703-0.h5
    • ...
  • 266703-1/
    • run266703-1.h5
    • ...
  • 266703-2/
    • run266703-2.h5
    • ...
  • 266704-0/...
  • 266704-1/...
  • 266704-2/...
  • ...

All hit images in a processing batch are packed into a big HDF5 file (runXXXXXX-Y.h5) to reduce file system overhead.
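For batch processing it is often convenient to gather these packed files into a single list. A minimal sketch, assuming the run*.h5 naming shown above (list_hit_files and files.lst are arbitrary names):

```shell
# Collect every packed hit file (runXXXXXX-Y.h5) below DIR, sorted by
# path, one per line. Dark averages and geometry files are not matched.
list_hit_files() {
    local dir=$1
    find "$dir" -type f -name 'run*.h5' | sort
}

# Usage from the work directory:
#   list_hit_files . > files.lst
#   wc -l < files.lst
```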

The images have had (1) the dark current subtracted and (2) the detector gain normalized. The normalization in (2) means that 1 photon corresponds to 10 counts; in other words, the detector gain is 10 by definition.
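With this convention, converting a pixel value to photons is a division by 10. A tiny sketch (counts_to_photons is a hypothetical helper; the pixel value is made up for illustration):

```shell
# Convert detector counts to photons using the gain stated on this page
# (10 counts per photon after normalization).
counts_to_photons() {
    awk -v adu="$1" -v gain=10 'BEGIN { printf "%.1f\n", adu / gain }'
}

counts_to_photons 250    # a pixel with 250 counts is about 25 photons
```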

Although geometry files are generated in each folder, they are all the same; you can use any of them. The detector geometry is constant during your beamtime.
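If you want to convince yourself that the geometry files really are identical, a byte-wise comparison is enough. The sketch below assumes the directory layout listed above (one .geom file per run directory); check_geoms_identical is a hypothetical helper name.

```shell
# Compare every */*.geom file below DIR against the first one found.
# Prints the files that differ and returns non-zero if there are any.
check_geoms_identical() {
    local dir=$1 ref="" f status=0
    for f in "$dir"/*/*.geom; do
        [ -e "$f" ] || continue      # no match: the glob stays literal
        if [ -z "$ref" ]; then
            ref=$f                   # first file is the reference
        elif ! cmp -s "$ref" "$f"; then
            echo "differs from $ref: $f"
            status=1
        fi
    done
    return "$status"
}

# Usage from the work directory:
#   check_geoms_identical . && echo "all geometry files are identical"
```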