Subcommand: dispersion

Calculate the Edge Dispersion between samples.

Usage: gappa analyze dispersion [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
Settings
`--mass-norm`	Required. `TEXT:{absolute,relative}=absolute` Set the per-sample normalization method. With `absolute`, the total mass is not changed, so that input jplace samples with more pqueries (more placed sequences) have a higher influence on the result. With `relative`, the total mass of each sample is normalized to 1.0, so that each sample has the same influence on the result, independent of its number of sequences and their abundances.
`--point-mass`	`FLAG` Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0.
`--ignore-multiplicities`	`FLAG` Set the multiplicity of each pquery to 1.0. The multiplicity is the equivalent of abundances for placements, so abundances are ignored when this flag is set.
`--edge-values`	`TEXT:{both,imbalances,masses}=both` Values per edge used to calculate the dispersion. Using `masses` focuses on per-branch dispersion, while using `imbalances` focuses on per-clade dispersion; see the paper for details.
`--method`	`TEXT:{all,cv,cv-log,sd,sd-log,var,var-log,vmr,vmr-log}=all` Method of dispersion. Either `all` (as far as they are applicable), or any of: coefficient of variation (`cv`, standard deviation divided by mean), coefficient of variation log-scaled (`cv-log`), standard deviation (`sd`), standard deviation log-scaled (`sd-log`), variance (`var`), variance log-scaled (`var-log`), variance to mean ratio (`vmr`, also called Index of Dispersion), variance to mean ratio log-scaled (`vmr-log`). It is typically useful to use `all`, in order to spot all patterns that can emerge from this method.
Color
`--color-list`	`TEXT=viridis` List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format `#rrggbb` using hex values, or by web color names.
`--reverse-color-list`	`FLAG` If set, the order of colors of the `--color-list` is reversed.
`--mask-color`	`TEXT=#dfdfdf` Color used to indicate masked or invalid values, such as infinities or NaNs. Color can be specified in the format `#rrggbb` using hex values, or by web color names.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Tree Output
`--write-newick-tree`	`FLAG` If set, the tree is written to a Newick file. This format cannot store color information.
`--write-nexus-tree`	`FLAG` If set, the tree is written to a Nexus file. This can for example be opened in FigTree.
`--write-phyloxml-tree`	`FLAG` If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx.
`--write-svg-tree`	`FLAG` If set, the tree is written to an SVG file. This gives a file for vector graphics editors.
Newick Tree Output
`--newick-tree-branch-length-precision`	`INT=6 Needs: --write-newick-tree` Number of digits to print for branch lengths in Newick format.
`--newick-tree-quote-invalid-chars`	`FLAG Needs: --write-newick-tree` If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and `:;()[],{}`) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools.
Svg Tree Output
`--svg-tree-shape`	`TEXT:{circular,rectangular}=circular Needs: --write-svg-tree` Shape of the tree.
`--svg-tree-type`	`TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree` Type of the tree, either using branch lengths (`phylogram`), or not (`cladogram`).
`--svg-tree-stroke-width`	`FLOAT=5 Needs: --write-svg-tree` Svg stroke width for the branches of the tree.
`--svg-tree-ladderize`	`FLAG Needs: --write-svg-tree` If set, the tree is ladderized.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting of existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command takes a set of jplace files, and calculates and visualizes the Edge Dispersion per edge of the reference tree. The files need to have the same reference tree.
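The per-edge masses that enter the dispersion calculation depend on the `--point-mass` and `--ignore-multiplicities` settings described above. The following Python sketch illustrates the idea for a single sample. It is not gappa's implementation: the pquery layout (a dict with `multiplicity` and a list of `(edge_id, lwr)` placements), the function name `edge_masses`, and the assumption that a placement contributes its LWR times the pquery multiplicity are all illustrative.

```python
from collections import defaultdict


def edge_masses(pqueries, point_mass=False, ignore_multiplicities=False):
    """Accumulate placement mass per edge for one sample.

    Hypothetical data layout: each pquery is a dict with a 'multiplicity'
    and a list of (edge_id, lwr) placements.
    """
    masses = defaultdict(float)
    for pq in pqueries:
        # With --ignore-multiplicities, every pquery counts once.
        mult = 1.0 if ignore_multiplicities else pq["multiplicity"]
        placements = pq["placements"]
        if point_mass:
            # With --point-mass, keep only the placement with the
            # highest LWR and set its LWR to 1.0.
            best_edge, _ = max(placements, key=lambda p: p[1])
            placements = [(best_edge, 1.0)]
        for edge_id, lwr in placements:
            # Assumption: mass contributed = LWR * multiplicity.
            masses[edge_id] += lwr * mult
    return dict(masses)


sample = [
    {"multiplicity": 3.0, "placements": [(0, 0.7), (1, 0.3)]},
    {"multiplicity": 1.0, "placements": [(1, 0.6), (2, 0.4)]},
]
print(edge_masses(sample))
print(edge_masses(sample, point_mass=True))
print(edge_masses(sample, ignore_multiplicities=True))
```

With `point_mass=True`, edge 2 receives no mass at all in this toy sample, because it never holds the most likely placement of a pquery.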

Edge Dispersion is explained and evaluated in detail in our article (doi:10.1371/journal.pone.0217050). The following figure and its caption are an example adapted from this article:

Figure: Dispersion Trees.

The command is useful as a first exploratory tool to detect placement heterogeneity across samples. Subfigure (a) shows the standard deviation of the edge masses, without any further processing. One outlier (marked with an arrow) dominates the variances, which hides the values on most other edges. Thus, in subfigure (b), we used logarithmic scaling, which reveals more details on the edges with lower placement mass variance. Subfigure (c) shows the Index of Dispersion of the edge masses, that is, the variance normalized by the mean. That means, edges with a higher number of placements can also have a higher variance. The subfigure again uses a logarithmic scale because of the outlier. The subfigure reveals more details on edges that exhibit a lower variance, which are shown in medium green colors. Lastly, subfigure (d) shows the variance of edge imbalances (instead of edge masses), and thus reveals information about whole clades of the tree.

Details

By default, the command creates dispersion trees using all valid combinations of variants of the method. The following two options change this behavior.

Edge Masses and Imbalances (`--edge-values`)

Controls whether to use masses or imbalances. By default, trees using both of them are created. Using masses highlights the dispersion on single edges, while using imbalances considers whole clades. See the article for details on the differences between these two variants.
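To make the difference between the two variants more concrete, the following Python sketch derives a per-edge imbalance from per-edge masses on a toy tree. It is only an illustration under stated assumptions, not gappa's code: the sign convention (mass on the side away from the root minus mass on the root side), the choice to leave out the mass on the edge itself, and the names `edge_imbalances`, `children`, and `mass` are all assumptions. The article gives the exact definition.

```python
def edge_imbalances(children, root, mass):
    """Return {edge: imbalance}; each edge is named by its distal node.

    Assumed convention: imbalance = mass below the edge (away from the
    root) minus mass on the root side, ignoring the edge's own mass.
    """
    total = sum(mass.values())
    below = {}

    def subtree_mass(node):
        # Sum of the masses on all edges strictly below `node`.
        s = 0.0
        for child in children.get(node, []):
            s += mass.get(child, 0.0) + subtree_mass(child)
        below[node] = s
        return s

    subtree_mass(root)
    imbalances = {}
    for edge, m in mass.items():
        distal = below[edge]
        proximal = total - distal - m
        imbalances[edge] = distal - proximal
    return imbalances


# Toy tree: R has children A and B; A has children C and D.
children = {"R": ["A", "B"], "A": ["C", "D"]}
mass = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
print(edge_imbalances(children, "R", mass))
```

Each imbalance summarizes the mass of a whole clade rather than a single branch, and it can be negative.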

Dispersion Method (`--method`)

Controls which method of dispersion is used for the visualization. By default, all valid ones are used, that is, trees for each of them are created.

When using edge masses (see `--edge-values`), the per-branch values can be scaled and normalized in different ways: simple variance (`var`) or standard deviation (`sd`), coefficient of variation (`cv`, that is, standard deviation divided by mean), or variance to mean ratio (`vmr`, also called the Index of Dispersion), and the logarithmically scaled versions of these (`var-log`, `sd-log`, `cv-log`, and `vmr-log`).

When using edge imbalances, however, only the variance and standard deviation are valid methods. This is because imbalances are not zero-based values, so dividing by the mean is not a reasonable operation.
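The measures named above can be sketched in a few lines of Python for the values of a single edge across all samples. Several choices here are assumptions rather than documented gappa behavior: the population variance is used (not the sample variance), the log-scaled variants are taken as the natural logarithm of the value, and the `zero_based` parameter is a hypothetical switch that mimics the restriction for imbalances.

```python
import math
import statistics


def dispersion(values, zero_based=True):
    """Dispersion measures for one edge across samples.

    Set zero_based=False for imbalances, where only var and sd apply.
    """
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)  # assumption: population variance
    result = {"var": var, "sd": math.sqrt(var)}
    # cv and vmr divide by the mean, which is only sensible for
    # zero-based values such as masses.
    if zero_based and mean > 0:
        result["cv"] = result["sd"] / mean
        result["vmr"] = var / mean
    # Log-scaled variants (assumption: natural log of the value);
    # skipped where the logarithm is undefined.
    for name, value in list(result.items()):
        if value > 0:
            result[name + "-log"] = math.log(value)
    return result


print(dispersion([1.0, 3.0, 5.0, 7.0]))                # edge masses
print(dispersion([-2.0, 0.0, 2.0], zero_based=False))  # edge imbalances
```

For the imbalance example the mean is zero, which shows directly why `cv` and `vmr` are not offered in that case.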

Normalization (`--mass-norm`)

As the command is meant to show differences across a set of jplace sample files, it matters how those samples are normalized. The option is therefore required.

If using `--mass-norm relative`, each sample (that is, each input jplace file) is normalized to unit mass 1.0, so that they all contribute equally to the result. Hence, the dispersion is measured relatively. That is, a branch exhibits a high dispersion if samples differ in the relative amount of placements on that branch (or in the clade, for imbalances) compared to the other placements in that sample.

On the other hand, if `--mass-norm absolute` is specified, the samples are not normalized. Thus, dispersion is measured absolutely. Branches then exhibit a high dispersion if samples differ in the absolute number of placements on that branch (or clade). This can differ vastly from the normalized result, as the dispersion then depends on the total number of pqueries in each sample, which in turn depends on things like amplification bias, rarefaction, and other factors that can change the total number of sequences per sample.
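The effect of the two modes can be seen in a small Python sketch with two toy samples that have the same relative composition but differ tenfold in total mass. The function names and the data are hypothetical, and the population variance stands in for whichever dispersion method is chosen.

```python
import statistics


def normalize(sample, mode):
    """Scale one sample's per-edge masses according to the chosen mode."""
    if mode == "relative":
        total = sum(sample)
        return [m / total for m in sample]
    if mode == "absolute":
        return list(sample)
    raise ValueError(f"unknown mode: {mode}")


def per_edge_variance(samples, mode):
    normalized = [normalize(s, mode) for s in samples]
    # zip(*...) regroups the data so that each tuple holds the values
    # of one edge across all samples.
    return [statistics.pvariance(edge) for edge in zip(*normalized)]


# Same proportions on three edges, but ten times more mass in sample 2.
samples = [[10.0, 30.0, 60.0], [100.0, 300.0, 600.0]]
print(per_edge_variance(samples, "absolute"))
print(per_edge_variance(samples, "relative"))
```

In absolute mode every edge shows a large variance that is driven purely by sample size, whereas in relative mode the variance vanishes because the two samples have identical proportions.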

The decision whether to use relative or absolute abundances depends on the use case and what each sample represents. See our article for details.

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Alexandros Stamatakis. Scalable Methods for Analyzing and Visualizing Phylogenetic Placement of Metagenomic Samples. PLOS ONE, 2019. doi:10.1371/journal.pone.0217050
