Running the pipeline

biochem_fan edited this page Apr 21, 2017 · 18 revisions

Introduction

Here you will process a lysozyme dataset collected at 7 keV. It is part of the datasets used for the S-SAD phasing reported in Nakane et al., Acta Cryst. D, 2015. Feel free to play with it. The full datasets (processed by Cheetah) are available as CXIDB #33.

If you are not familiar with CrystFEL, we recommend that you go through the CrystFEL tutorial first. In this tutorial, we focus on SACLA-specific issues and omit the basics common to LCLS.

Online and offline pipeline

As described in Nakane et al, J. Appl. Cryst., 2016, the data processing pipeline consists of the online (realtime) analysis and the offline analysis.

The online analysis monitors the hit rate and detector saturation in realtime. This is discussed in Online (realtime) analysis. The offline pipeline loads recorded images through the SACLA API, runs spot finding and converts hit images into the HDF5 format. The outputs can be processed by many programs, including CrystFEL, cppxfel and cctbx.xfel. On this page, we run the offline pipeline. On the next page, Running CrystFEL, we run CrystFEL and solve the structure.

The pipeline must be run on the SACLA HPC system. Once hit images have been extracted, you can download them to process them locally. If you do not have a SACLA HPC account, or just want to learn how to process the outputs from the pipeline, skip this page and read Running CrystFEL.

Data collection

At SACLA, raw images are grouped into runs. Typically, a run consists of 150 dark images taken without X-rays, followed by 5000 exposed images. The dark images are used to calculate the dark current of the detector. Multiple runs are collected for as long as the sample in the injector lasts. This is different from the standard practice at LCLS, where people tend to collect one long run with tens of thousands of images from a sample batch.

Run 1:  D1 D2 D3 ... D150 E1 E2 E3 ...................... E5000
Run 2:  D1 D2 D3 ... D150 E1 E2 E3 ...................... E5000
Run 3:  D1 D2 D3 ... D150 E1 E2 E3 ...................... E5000
...

(D: dark image, E: exposed image)

A run is identified by a run number, while an image is identified by a tag number.

You can change the number of dark and exposed images in a run in the "Run control GUI". However, we recommend that you stick to the defaults. Since images become available for processing only after the run has been completed, making a run bigger increases the latency of data processing. You need at least 50 dark images in each run to get a reliable estimate of the dark current.
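To make the role of the dark images concrete, here is a minimal sketch of a per-pixel dark average followed by subtraction from an exposed frame. It is an illustration only, not Cheetah's implementation: the function names, the flattened-list image representation and the pixel values are all made up for this example.

```python
from typing import List, Sequence

Frame = List[float]  # one detector image, flattened to a list of pixel values


def average_dark(dark_frames: Sequence[Frame]) -> Frame:
    """Per-pixel mean over the dark frames of one run."""
    if not dark_frames:
        raise ValueError("need at least one dark frame")
    n = len(dark_frames)
    # zip(*frames) groups the values of the same pixel across all frames.
    return [sum(pixels) / n for pixels in zip(*dark_frames)]


def subtract_dark(exposed: Frame, dark_mean: Frame) -> Frame:
    """Remove the dark-current estimate from one exposed frame."""
    return [e - d for e, d in zip(exposed, dark_mean)]


# Toy run: 3 dark frames and 1 exposed frame, 4 pixels each (made-up numbers).
darks = [[10.0, 12.0, 11.0, 9.0],
         [12.0, 12.0, 10.0, 11.0],
         [11.0, 12.0, 12.0, 10.0]]
dark_mean = average_dark(darks)
corrected = subtract_dark([11.0, 62.0, 11.0, 10.0], dark_mean)
print(dark_mean)   # per-pixel dark estimate
print(corrected)   # only the second pixel keeps a signal
```

Averaging more dark frames reduces the noise of the estimate, which is why a minimum number of dark images per run is recommended.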

Start the pipeline

First, establish a VPN connection to SACLA. Then log in to the fep node (front end processor). If you are on site, you can use xhpcsmp-bl3 (or xhpcsmp-bl2 for beamline BL2), which is more powerful.

ssh -Y yourname@fep # VPN
ssh -Y yourname@xhpcsmp-bl3 # on site

Next, create and go to your work directory.

mkdir /work/yourname/cheetah-test
cd /work/yourname/cheetah-test

WARNING! Files under /work are automatically deleted one month after their last access. Copy important files to /UserData/yourname (accessible from xfer2) for long-term storage. /home has no time limit, but its quota is smaller; this is where you install your scripts and programs. Details are discussed in SACLA HPC system.
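If you want to check which files are approaching the deletion limit, a sketch like the following lists files whose recorded access time is older than a chosen number of days. The function name and the 21-day margin are hypothetical, and this relies on the access time reported by the file system, which may not match exactly how the SACLA HPC system decides what to delete.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def files_not_accessed_for(root: str, days: float,
                           now: Optional[float] = None) -> List[Path]:
    """Return files under root whose last access is older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.stat().st_atime < cutoff:
                stale.append(path)
    return sorted(stale)


# Example (hypothetical path and margin): candidates to copy to /UserData.
# for path in files_not_accessed_for("/work/yourname", days=21):
#     print(path)
```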

Now you are ready to launch the "Cheetah dispatcher" GUI.

source ~sacla_sfx_app/setup.sh
cheetah-dispatcher

If this is the first time you launch Cheetah dispatcher in this directory, it asks you to set up a configuration file.

ERROR: Configuration file was not found!

You should copy /home/sacla_sfx_app/packages/tools/sacla-photon.ini into this directory
and confirm the settings.

Copy sacla-photon.ini as shown in the console. Usually you do not have to edit it. If you want to change the spot finding parameters for Cheetah, you can do it here, but as written in Nakane et al, J. Appl. Cryst., 2016, the default parameters work well for almost all cases.

cp /home/sacla_sfx_app/packages/tools/sacla-photon.ini .
cheetah-dispatcher

If you have a priority queue allocated for your beamtime, you can specify it as --queue=occupancy, where occupancy is the name of the queue.

Submit jobs

Type "266711-266721" into the "Run ID" text box and click the "Submit" button. During your beamtime, you have to specify your own run numbers, of course.

You can omit the second number; Cheetah will then automatically detect and submit all runs from the first number onward (e.g. "266711-"). Do NOT do this now: it would submit more than 150000 runs, from 266711 to the latest run!
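The two forms of the run specification can be illustrated with a small parser. This is a sketch of the syntax described above, not the code Cheetah dispatcher actually uses; the function name and the choice to reject any other input are assumptions.

```python
from typing import Optional, Tuple


def parse_run_range(text: str) -> Tuple[int, Optional[int]]:
    """Parse "first-last" or the open-ended "first-" into (first, last).

    `last` is None for the open-ended form, meaning "all runs after first".
    """
    first, sep, last = text.strip().partition("-")
    if not sep or not first.isdigit():
        raise ValueError(f"expected 'first-last' or 'first-', got {text!r}")
    if last == "":
        return int(first), None
    if not last.isdigit() or int(last) < int(first):
        raise ValueError(f"bad last run number in {text!r}")
    return int(first), int(last)


first, last = parse_run_range("266711-266721")
print(first, last, "->", last - first + 1, "runs")
print(parse_run_range("266711-"))  # open-ended: last is None
```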

Screen shot of Cheetah dispatcher

Low-level filtering

The "MaxI threshold" text box controls the threshold for low-level filtering (LLF). Since we do not use LLF nowadays, just leave it at 0 (disabled).

LLF calculates the maximal pixel value within the ROI (region of interest). Cheetah can skip images whose LLF value is less than the threshold. This is similar to the veto system at LCLS. It was useful for accelerating the processing in 2014 but is no longer necessary because the performance has been improved.

For the details, read Nakane et al, J. Appl. Cryst., 2016.
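The filtering rule described above can be sketched as follows. This is only an illustration under stated assumptions: the image is a plain list of rows, the ROI is a hypothetical (row_start, row_stop, col_start, col_stop) tuple with exclusive stops, and a threshold of 0 is treated as "disabled". The real ROI definition is given in the paper.

```python
from typing import Sequence, Tuple

Image = Sequence[Sequence[float]]   # rows of pixel values
ROI = Tuple[int, int, int, int]     # (row_start, row_stop, col_start, col_stop)


def llf_value(image: Image, roi: ROI) -> float:
    """Maximal pixel value inside the region of interest."""
    r0, r1, c0, c1 = roi
    return max(max(row[c0:c1]) for row in image[r0:r1])


def passes_llf(image: Image, roi: ROI, threshold: float) -> bool:
    """True if the image should be processed further."""
    if threshold <= 0:      # 0 means the filter is disabled
        return True
    return llf_value(image, roi) >= threshold


# Toy 3x3 image (made-up values); the ROI covers the top-left 2x2 block.
image = [[1, 2, 3],
         [4, 50, 6],
         [7, 8, 900]]
roi = (0, 2, 0, 2)
print(llf_value(image, roi))
print(passes_llf(image, roi, threshold=100))
print(passes_llf(image, roi, threshold=0))
```

Note that the bright pixel outside the ROI (900) does not influence the decision.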

Monitor the progress

You can select multiple rows by Ctrl-clicking (as in Explorer or Finder). Right-clicking the selection opens a menu; "Count sums" shows a summary for the selected rows.

Type: normal
Total: 54801
Processed: 54801
Accepted: 54801
Hits: 34250 (62.5% of accepted)
Indexed: 18621  (54.4% of hits)
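The percentages in this summary are simple ratios: hits relative to accepted images, and indexed images relative to hits. A short sketch reproducing them from the counts above (the helper name is made up):

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 if whole is zero."""
    return 100.0 * part / whole if whole else 0.0


accepted, hits, indexed = 54801, 34250, 18621
print(f"Hits: {hits} ({percent(hits, accepted):.1f}% of accepted)")
print(f"Indexed: {indexed} ({percent(indexed, hits):.1f}% of hits)")
```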

Check the cell parameters

"Check cell" in the right-click menu launches cell_explorer from the CrystFEL suite and shows histograms of the cell parameters. You can use the mouse wheel to zoom in or out. You can drag on a histogram while holding the Shift key to select a region. Select the main peaks in the a, b, c, alpha, beta and gamma panels and press Ctrl-F (or [Tools]-[Fit cell] in the menu).

Screen shot of cell_explorer

Here we get the putative cell parameters 39.2 80.4 80.9 89.7 89.55 89.8. Note that these values are not very accurate because the diffraction geometry has not been optimized yet. Also note that CrystFEL does not apply Bravais lattice constraints and reports the cells in the primitive setting.

If you have a CCP4 license, you can use othercell by Phil Evans to explore the metric symmetry.

$ othercell
Type cell >>
39.2 80.4 80.9 89.7 89.55 89.8
Type lattice type >>
P
Type target cell, or eof (control/D) for no target >>


<<<<<<<<<<<

Input  cell:
39.200080.400080.9000   89.700089.550089.8000
 - Lattice Type P
Lattice point group: P 4 2 2
 within angular tolerance    3.0 (reset with eg -tol[erance] 5)

                      Lattice cell:   80.65  80.65  39.20     90.00  90.00  90.00
Lattice unit cell after reindexing:   80.40  80.90  39.20     89.55  89.80  89.70

Let's take tetragonal "80.65 80.65 39.20 90.00 90.00 90.00" as the first guess.
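The tetragonal guess is consistent with averaging the two nearly equal axes of the reindexed cell and setting all angles to 90 degrees. The sketch below shows that arithmetic only; it is not how othercell works internally. The function name and the 2% length tolerance are hypothetical, and the 3 degree angle tolerance is simply borrowed from the value printed above.

```python
from typing import Tuple

Cell = Tuple[float, float, float, float, float, float]


def idealize_tetragonal(cell: Cell, length_tol: float = 0.02,
                        angle_tol: float = 3.0) -> Cell:
    """Average a and b and set all angles to 90 degrees.

    Expects the unique axis last (a, b, c, alpha, beta, gamma).
    Raises ValueError if the cell is too far from tetragonal.
    """
    a, b, c, alpha, beta, gamma = cell
    mean_ab = (a + b) / 2.0
    if abs(a - b) / mean_ab > length_tol:
        raise ValueError("a and b differ too much for a tetragonal cell")
    if any(abs(angle - 90.0) > angle_tol for angle in (alpha, beta, gamma)):
        raise ValueError("angles deviate too much from 90 degrees")
    mean_ab = round(mean_ab, 2)
    return (mean_ab, mean_ab, c, 90.0, 90.0, 90.0)


# "Lattice unit cell after reindexing" from the othercell output above.
reindexed = (80.40, 80.90, 39.20, 89.55, 89.80, 89.70)
print(idealize_tetragonal(reindexed))
```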

View diffraction images

"View hits" in the right-click menu launches the hdfsee viewer in CrystFEL with several extensions.

Screen shot of hdfsee

If indexing has been started, the stream file is loaded. Otherwise, only the image file is loaded. With the stream file, a table is displayed. Here, you can sort images by resolution or number of spots. By clicking a row, the selected image can be displayed.

Key bindings are as follows:

  • F3 or [View]-[Set binning]
    change binning. 2 by default. 1 shows the image in the original size
  • F5 or [View]-[Boost intensity]
    bigger numbers will darken the image
  • F8 or [View]-[Features]
    mark detected spots by black circles, predicted (integrated) spots by red circles
  • F9 or [View]-[Resolution Rings]
    show resolution rings

Because the diffraction geometry has not been optimized yet, the spots tend to be over-predicted to compensate for positional errors.

Note that even when multiple rows are selected, you can examine the images and cell parameters of only a single run at a time.

TODO Fix this

Examine the output

In your work directory, you will find these files and directories:

  • 266702-0/
    • 266702.geom (CrystFEL geometry)
    • 209060-dark.h5 (Dark average)
    • 209060-geom.h5 (Cheetah geometry)
    • run266702-0.h5 (hit images)
    • ... (other log files)
  • 266702-1/
    • run266702-1.h5
    • ...
  • 266702-2/
    • run266702-2.h5
    • ...
  • 266703-0/
    • 266703.geom
    • run266703-0.h5
    • ...
  • 266703-1/
    • run266703-1.h5
    • ...
  • 266703-2/
    • run266703-2.h5
    • ...
  • 266704-0/...
  • 266704-1/...
  • 266704-2/...
  • ...

All hit images in a processing batch are packed into a big HDF5 file (runXXXXXX-Y.h5) to reduce file system overhead.
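Given this layout, the hit files can be collected with a short script, for example to prepare a file list for downstream processing. This is a sketch based only on the directory and file naming shown above; the function name is made up, and it assumes the script is pointed at the work directory.

```python
import re
from pathlib import Path
from typing import List

# Matches hit files such as run266702-0.h5 (run number, then batch number).
HIT_FILE = re.compile(r"^run(\d+)-(\d+)\.h5$")


def find_hit_files(work_dir: str) -> List[Path]:
    """Return hit-image HDF5 files sorted by run number, then batch number."""
    found = []
    for path in Path(work_dir).glob("*-*/run*-*.h5"):
        match = HIT_FILE.match(path.name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), path))
    found.sort(key=lambda item: item[:2])
    return [path for _run, _batch, path in found]


# Print one path per line for the current directory.
for path in find_hit_files("."):
    print(path)
```

The dark average and geometry files are skipped because their names do not start with "run".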

The images have had (1) the dark current subtracted and (2) the detector gains normalized. After normalization, 1 photon corresponds to 10 counts; in other words, the detector gain is 10 by definition.
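With this convention, converting a pixel value to an approximate photon count is a single division. A minimal sketch (the constant and function names are made up for illustration):

```python
COUNTS_PER_PHOTON = 10.0  # gain stated above: 1 photon corresponds to 10 counts


def counts_to_photons(counts: float) -> float:
    """Convert a dark-subtracted, gain-normalized pixel value to photons."""
    return counts / COUNTS_PER_PHOTON


print(counts_to_photons(250.0))  # a pixel value of 250 counts is about 25 photons
```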

Although geometry files are generated in each folder, they are all the same; you can use any of them. The detector geometry is constant during your beamtime.