
Subcommand: unchunkify

Lucas Czech edited this page Nov 30, 2022 · 19 revisions

Unchunkify a set of jplace files using abundance map files and create per-sample jplace files.

Usage: gappa prepare unchunkify [options]

Options

Input
--abundances-path Required. TEXT:PATH(existing)=[] ...
List of abundances files or directories to process. For directories, only files with the extension .json[.gz] are processed.
--jplace-path TEXT:PATH(existing)=[] ... Excludes: --chunk-list-file --chunk-file-expression
List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed.
--sequence-path TEXT:PATH(existing)=[] ...
List of sequence files or directories to process. For directories, only files with the extension .(fasta|fas|fsa|fna|ffn|faa|frn|phylip|phy)[.gz] are processed.
--chunk-list-file TEXT Excludes: --jplace-path --chunk-file-expression
If provided, needs to contain a list of chunk file paths in the numerical order that was produced by the chunkify command.
--chunk-file-expression TEXT Excludes: --jplace-path --chunk-list-file
If provided, the expression is used to load jplace files by replacing any '@' character with the chunk number.
Settings
--jplace-cache-size UINT=0
Cache size to determine how many jplace files are kept in memory. Default (0) means all. Use this if the command runs out of memory. However, it comes at the cost of longer runtime.
--hash-function TEXT:{SHA1,SHA256,MD5}=SHA1
Hash function that was used for re-naming and identifying sequences in the chunkify command.
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
--allow-file-overwriting FLAG
Allow overwriting existing output files instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.

Description

The command reverses the effects of the chunkify command (see there for details on the workflow). That is, it takes the abundance map files and the per-chunk placement files as input, and creates a placement file for each of the original input sequence files, with all abundances and sequence names correctly restored. The command is thus one of the steps of our data preprocessing pipeline for phylogenetic placements as described here.

The easiest way to input the placement files to the command is the --jplace-path option, which takes a list of files or a directory containing .jplace files. This option works in all cases, and can even handle cases where sequences were moved around between chunks, or chunks that were merged later, and so on. It simply uses the hash names of the sequences to identify them.

Optionally, when sequence file(s) containing the chunked data (with hashed sequence names) are supplied to --sequence-path, per-sample sequence files are additionally written to the output folder.
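As a minimal sketch of the simplest case, the call below passes whole directories to --abundances-path and --jplace-path. The directory names (abundances, placements, per_sample) are hypothetical placeholders, not names produced by gappa, and the `command -v` guard only serves to skip the call on machines where gappa is not installed.

```shell
# Hypothetical directory layout; adjust these to your own data.
abundances_dir="abundances"   # abundance map files (.json[.gz])
jplace_dir="placements"       # per-chunk placement results (.jplace[.gz])
out_dir="per_sample"          # destination for the per-sample jplace files

# Create the output directory up front (harmless if it already exists).
mkdir -p "$out_dir"

# Run the command only if gappa is available on the PATH.
if command -v gappa >/dev/null 2>&1; then
    gappa prepare unchunkify \
        --abundances-path "$abundances_dir" \
        --jplace-path "$jplace_dir" \
        --out-dir "$out_dir"
fi
```

Adding --sequence-path with the chunked sequence files to the same call would additionally request the per-sample sequence files.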

Details

For large datasets, using the --jplace-path option might need too much memory, as all files have to be scanned for the sequence hash names first. This is necessary if the jplace files do not correspond exactly to the chunk files. However, if each jplace file was created from one chunk file, there is no need to scan for hashes in other files. Thus, we offer two memory- and time-saving alternatives:

--chunk-list-file

The option takes a file, which needs to contain one jplace file path per line, in the order of the original chunks. For example, let's say the original sequence files were split into 13 chunks chunk_0.fasta to chunk_12.fasta by the chunkify command. Each of them was then placed on the reference tree, producing 13 jplace files. Then, the list file could look like this:

/path/to/chunk_0/result_0.jplace
/path/to/chunk_1/result_1.jplace
/path/to/chunk_2/result_2.jplace
/path/to/chunk_3/result_3.jplace
/path/to/chunk_4/result_4.jplace
/path/to/chunk_5/result_5.jplace
/path/to/chunk_6/result_6.jplace
/path/to/chunk_7/result_7.jplace
/path/to/chunk_8/result_8.jplace
/path/to/chunk_9/result_9.jplace
/path/to/chunk_10/result_10.jplace
/path/to/chunk_11/result_11.jplace
/path/to/chunk_12/result_12.jplace

That is, each line contains a path, in the original order of the chunks. Then, in order to create the placement entry for a sequence, the number n of the chunk in which the sequence was "chunkified" is used to find the correct jplace file by using the file in the n-th line of the list.
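For many chunks, writing the list by hand is tedious. The following sketch generates it with a numeric loop, assuming the hypothetical path layout from the example above and chunk numbers running from 0 to 12; the output name chunk_list.txt is likewise just an illustration. A numeric loop is used rather than a shell glob because a glob sorts lexicographically, which would place chunk_10 before chunk_2 and break the required order.

```shell
# Number of the last chunk (13 chunks, numbered 0..12, as in the example above).
last_chunk=12

# Write one jplace path per line, in numerical chunk order.
n=0
while [ "$n" -le "$last_chunk" ]; do
    printf '/path/to/chunk_%d/result_%d.jplace\n' "$n" "$n"
    n=$((n + 1))
done > chunk_list.txt
```

The resulting file can then be given to --chunk-list-file instead of using --jplace-path.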

--chunk-file-expression

Alternatively, if the per-chunk jplace files are named as straightforwardly as above, that is, the file names are simply numbered, an expression can be used instead of the list file, with the @ character serving as a placeholder for the chunk number:

--chunk-file-expression /path/to/chunk_@/result_@.jplace

This has the same effect as using the list file.
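To preview which files such an expression refers to before running the command, the substitution can be imitated in the shell. This is only an illustration of the placeholder replacement described above, not gappa's own code; the pattern, the chunk range 0 to 12, and the output name expanded_paths.txt are assumptions taken from the running example.

```shell
# Expression with '@' as the placeholder for the chunk number.
pattern='/path/to/chunk_@/result_@.jplace'
last_chunk=12

# Replace every '@' with the chunk number, one expanded path per line.
n=0
while [ "$n" -le "$last_chunk" ]; do
    printf '%s\n' "$pattern" | sed "s/@/$n/g"
    n=$((n + 1))
done > expanded_paths.txt

# Report any expanded path that does not exist on disk.
while IFS= read -r path; do
    [ -e "$path" ] || echo "missing: $path" >&2
done < expanded_paths.txt
```

Any path reported as missing indicates that the expression does not match the actual file naming, in which case a list file is the safer choice.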

Citation

When using this method, please do not forget to cite:

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767
