Subcommand: unchunkify

Unchunkify a set of jplace files using abundance map files and create per-sample jplace files.

Usage: gappa prepare unchunkify [options]

Options

Input
`--abundances-path`	Required. `TEXT:PATH(existing)=[] ...` List of abundances files or directories to process. For directories, only files with the extension `.json[.gz]` are processed.
`--jplace-path`	`TEXT:PATH(existing)=[] ... Excludes: --chunk-list-file --chunk-file-expression` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
`--sequence-path`	`TEXT:PATH(existing)=[] ...` List of sequence files or directories to process. For directories, only files with the extension `.(fasta\|fas\|fsa\|fna\|ffn\|faa\|frn\|phylip\|phy)[.gz]` are processed.
`--chunk-list-file`	`TEXT Excludes: --jplace-path --chunk-file-expression` If provided, needs to contain a list of chunk file paths in the numerical order that was produced by the chunkify command.
`--chunk-file-expression`	`TEXT Excludes: --jplace-path --chunk-list-file` If provided, the expression is used to load jplace files by replacing any '@' character with the chunk number.
Settings
`--jplace-cache-size`	`UINT=0` Cache size to determine how many jplace files are kept in memory. Default (0) means all. Set a limit if the command runs out of memory; this comes at the cost of longer runtime.
`--hash-function`	`TEXT:{SHA1,SHA256,MD5}=SHA1` Hash function that was used for re-naming and identifying sequences in the chunkify command.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command reverses the effects of the chunkify command (see there for details on the workflow). That is, it takes the abundance map files and the per-chunk placement files as input, and creates a placement file for each of the original input sequence files, with all abundances and sequence names correctly restored. The command is thus one of the steps of our data preprocessing pipeline for phylogenetic placements, as described in the publications cited below.

The easiest way to input the placement files to the command is the --jplace-path option, which takes a list of files or a directory containing .jplace files. This option works in all cases, and can even handle cases where sequences were moved around between chunks, or chunks that were merged later, and so on. It simply uses the hash names of the sequences to identify them.
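To illustrate why a lookup by name works no matter how the placements are distributed over files, here is a minimal Python sketch. It is not gappa's actual implementation. It assumes the jplace layout in which each entry of `placements` lists its names under `n` (or under `nm` as name/multiplicity pairs); the function `index_names`, the file names, and the hash names are hypothetical.

```python
import json

def index_names(jplace_docs):
    """Map each pquery name to the file(s) that contain it.

    jplace_docs: dict of file name -> jplace content as a JSON string.
    """
    index = {}
    for file_name, text in jplace_docs.items():
        doc = json.loads(text)
        for pquery in doc.get("placements", []):
            # "n" holds plain names; "nm" holds [name, multiplicity] pairs.
            names = list(pquery.get("n", []))
            names += [pair[0] for pair in pquery.get("nm", [])]
            for name in names:
                index.setdefault(name, []).append(file_name)
    return index

# Hypothetical, heavily shortened jplace contents with made-up hash names.
docs = {
    "a.jplace": json.dumps({"placements": [{"p": [[0, -1.0]], "n": ["hash_aaa"]}]}),
    "b.jplace": json.dumps({"placements": [{"p": [[1, -2.0]], "nm": [["hash_bbb", 1.0]]}]}),
}
name_index = index_names(docs)
print(name_index)
```

Building such an index requires reading every file once, which is the memory and runtime cost discussed under Details.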

Optionally, when sequence file(s) containing the chunked data (with hashed sequence names) are supplied to --sequence-path, per-sample sequence files are additionally written to the output folder.

Details

For large datasets, using the --jplace-path option might need too much memory, as all files have to be scanned for the sequence hash names first. This is necessary if the jplace files do not correspond exactly to the chunk files. However, if each jplace file was created from one chunk file, there is no need to scan for hashes in other files. Thus, we offer two memory- and time-saving alternatives:

`--chunk-list-file`

The option takes a file that needs to contain one jplace file path per line, in the order of the original chunks. For example, let's say the original sequence files were split into 13 chunks, chunk_0.fasta to chunk_12.fasta, by the chunkify command. Each of them was then placed on the reference tree, producing 13 jplace files. Then, the list file could look like this:

/path/to/chunk_0/result_0.jplace
/path/to/chunk_1/result_1.jplace
/path/to/chunk_2/result_2.jplace
/path/to/chunk_3/result_3.jplace
/path/to/chunk_4/result_4.jplace
/path/to/chunk_5/result_5.jplace
/path/to/chunk_6/result_6.jplace
/path/to/chunk_7/result_7.jplace
/path/to/chunk_8/result_8.jplace
/path/to/chunk_9/result_9.jplace
/path/to/chunk_10/result_10.jplace
/path/to/chunk_11/result_11.jplace
/path/to/chunk_12/result_12.jplace

That is, each line contains a path, in the original order of the chunks. Then, in order to create the placement entry for a sequence, the number n of the chunk in which the sequence was "chunkified" is used to find the correct jplace file by using the file in the n-th line of the list.
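The lookup described above can be sketched in a few lines of Python. This is an illustration only, not gappa's code. It assumes that chunk numbers are zero-based, which matches chunk_0 appearing on the first line of the example list; the helper names `read_chunk_list` and `jplace_for_chunk` are hypothetical.

```python
def read_chunk_list(text):
    """Return the non-empty lines of a chunk list file, in order."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def jplace_for_chunk(chunk_list, n):
    """Pick the jplace path for chunk number n (assumed zero-based)."""
    return chunk_list[n]

# Recreate the 13-line example list from above as a string.
list_text = "\n".join(
    f"/path/to/chunk_{i}/result_{i}.jplace" for i in range(13)
)
chunk_list = read_chunk_list(list_text)
selected = jplace_for_chunk(chunk_list, 10)
print(selected)
```

Because the chunk number points directly at one line, no other jplace file has to be opened to find a given sequence.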

`--chunk-file-expression`

Alternatively, if the naming of the per-chunk jplace files is as straightforward as above, that is, the file names are just numbered, it is also possible to use an expression instead of the list file, where the @ character is used as a placeholder for the number:

--chunk-file-expression /path/to/chunk_@/result_@.jplace

This has the same effect as using the list file.
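As a rough illustration of the placeholder substitution, the following Python sketch replaces every '@' in an expression with the chunk number. It is a sketch of the documented behavior, not gappa's implementation; the function name `expand_expression` is hypothetical, and the chunk numbers 0 to 12 are taken from the example above.

```python
def expand_expression(expression, n):
    """Replace every '@' in the expression with the chunk number n."""
    return expression.replace("@", str(n))

expression = "/path/to/chunk_@/result_@.jplace"

# Expanding for chunks 0..12 yields the same paths as the example list file.
paths = [expand_expression(expression, n) for n in range(13)]
for path in paths:
    print(path)
```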

Citation

When using this method, please do not forget to cite:

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767