Skip to content

ABYSS P Source Code Overview

Ben Vandervalk edited this page Jul 12, 2013 · 1 revision

Overview

There are two main things that ABYSS-P does:

  1. build a distributed de Bruijn graph from the input reads, AND
  2. find unambiguous (non-branching) paths in the de Bruijn graph

Unambiguous paths in the de Bruijn represent subsection of the target genome that can be reconstructed with high confidence. Each unambiguous path corresponds to exactly one contig in the output FASTA file generated by ABYSS-P.

Btw, this is a good introductory de Bruijn graph paper to refer to when thinking about the ABYSS-P code: http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html

Important Source Files

  • Parallel/parallelAbyss.cpp -- "main" function for ABYSS-P
  • Parallel/NetworkSequenceCollection.cpp -- ABYSS-P pipeline logic
  • Assembly/AssemblyAlgorithms.cpp -- assembly code shared between ABYSS (single process) and ABYSS-P (MPI)

ABYSS-P Pipeline and State Machine

  • The ABYSS-P is implemented as a state machine, inside NetworkSequenceCollection.cpp.
  • Each MPI process has its own instance of NetworkSequenceCollection and the variable NetworkSequenceCollection.m_state indicates its current state (e.g. "NAS_LOADING")
  • The transitions between states in the master MPI process (rank 0) are handled by the large switch statement in NetworkSequenceCollection::runControl()
  • The transitions between states in the slave MPI processes (rank > 0) are handled by the large switch statement in NetworkSequenceCollection::run()
  • The code in run()/runControl() generally follows the "Bulk Synchronous Parallel" programming model: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel

ABYSS-P Assembly States

Here are the assembly states (in the order that they occur in the pipeline):

(1) NAS_LOADING:

Read the sequences from the input read files, break the sequences into kmers, load the kmers into the distributed hash table. The distributed hash table consists of one has hash table per MPI process. Each kmer of the input data is uniquely mapped to one MPI process according to the formula in NetworkSequenceCollection::computeNodeID.

(2) NAS_GEN_ADJ:

Generate the adjacency information (i.e. the edges of the de Bruijn graph). For each kmer in the distributed hash table, this stage store a bit vector indicating the existence of "adjacent" kmers within the distributed hash table. Here, "adjacent" means that either the first k-1 or the last k-1 characters overlap the kmer in question.

(3) NAS_ERODE:

Remove kmers from the tips of branches in the de Bruijn that have low "kmer multiplicity", where "kmer multiplicity" is the number of times a particular kmer occurs in the input data. This is a form of error correction. The idea is that kmers containing read errors will occur only a small number of times (often just once), whereas "normal" kmers will be repeated many times due to overlapping reads.

(4) NAS_TRIM:

Remove short branches in the de Bruijn graph. This is another form of error correction. Similarly to the erode step, the assumption is that short branches in the de Bruijn graph occur mainly due to read errors.

(5) NAS_COVERAGE:

Remove "unambiguous paths" in the de Bruijn graph that have low "coverage". Here, an "unambiguous path" means a sequence of nodes (kmers) that have only two adjacent kmers: one kmer overlapping the last k-1 bases (connected by an incoming edge) and one kmer overlapping the first k-1 bases (connected by an outgoing edge). "Coverage" means the average kmer multiplicity of the kmers in the unambiguous path. By chance, some areas of the genome will have low read coverage, and these areas of the genome cannot be assembled with as much confidence as high coverage areas.

(6) NAS_DISCOVER_BUBBLES, NAS_POP_BUBBLES

"Bubbles" are a pattern of alternate branches that occur in the de Bruijn graph, primarily due to SNPs. It looks like this:

        _ _ _ _ _ _
 _ _ _ /            \ _ _ _ _
       \_ _ _ _ _ _ /

where each dash is a node for a kmer. ABySS resolves bubbles by removing the branch with lower coverage (i.e. average kmer multiplicity.)

(7) NAS_SPLIT_AMBIGUOUS

Remove nodes that in the de Bruijn graph that are "ambiguous". Here, a node is "ambiguous" if has either: (i) more than one incoming edge, or (ii) more than one outgoing edge. This stage allows the following stage (NAS_ASSEMBLE) to efficient identify all unambiguous paths in the de Bruijn graph and construct contigs for them.

(8) NAS_ASSEMBLE

For each unambiguous path in the de Bruijn graph, output a contig sequence!