Docs Update: BAM Tags, changelog. (#20)
PB-DB authored May 12, 2022
1 parent 8f3aefa commit f8184bc
Showing 7 changed files with 215 additions and 15 deletions.
23 changes: 19 additions & 4 deletions docs/changelog.md
@@ -6,7 +6,22 @@ nav_order: 99

# Version changelog

* **3.4.0**
* **3.7.0**
* Adding `bcstats`, `correct`, and `groupdedup` to CLI
* `bcstats` emits frequency statistics for 10x barcodes
* `correct` uses a truth-set to correct sequencing errors in cell barcodes
* `groupdedup` provides substantial performance improvements over dedup
* Support SEGMENT read type

* 3.6.0
* Adding `tag` and `dedup` to CLI

* 3.5.0
* SMRT Link release 11.0
* Remove support for CLR data and disable `polish` step
* Enable `cluster --use-qvs` as always on

* 3.4.0
* SMRT Link release 10.0.0
* Add support for UMI and cell barcode handling, by adding `tag` and `dedup`
* Add `refine --min-rq` to support RQ filtering for unfiltered
@@ -22,7 +37,7 @@ nav_order: 99
* 3.2.1
* Fix a gff index 1-off bug in `collapse`
* We have removed implicit dependencies from the bioconda recipe. Please
install `pbccs`, `lima`, and `pbcoretools` as needed.
install `pbccs`, `lima`, and `pbcoretools` as needed

* 3.2.0
* **`polish` dropped support for RS II datasets!**
@@ -31,7 +46,7 @@ nav_order: 99
* Add `refine --min-polya-length`
* Add `cluster --singletons` to output unclustered FLNCs; potential sample
prep artifacts!
* Fix minimap2 bugs. Outputs might change slightly.
* Fix minimap2 bugs. Outputs might change slightly

* 3.1.2
* Reduce `polish` memory footprint
@@ -44,4 +59,4 @@ nav_order: 99
* 3.1.0
* We outsourced the poly(A) tail removal and concatemer detection into a new
tool called `refine`. Your custom `primers.fasta` is used in this step to
detect concatemers.
detect concatemers
18 changes: 12 additions & 6 deletions docs/general-faq.md
@@ -8,18 +8,24 @@ nav_order: 5
## BAM tags explained
The following BAM tags are used:

- `ib` Barcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the ZMW counts, delimited by comma. Example: `0,1,20;0,3,5`
- `ic` Sum of number of passes from all ZMWs used to create consensus
- `im` ZMW names associated with this isoform
- `is` Number of ZMWs associated with this isoform
- `ib` Barcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the read counts, delimited by comma. Example: `0,1,20;0,3,5`
- `ic` Number of reads used to generate consensus. If less than `is`, this means that reads were down-sampled when consensus-calling
- `im` Read names associated with this isoform
- `is` Number of reads associated with this isoform
- `it` List of barcodes/UMIs clipped during `tag`
- `iz` Maximum number of subreads used for polishing
- `rq` Predicted accuracy for polished isoform
- `XA` Order of `tag` names
- `XC` barcode sequence `tag`
- `XC` Cell/group barcode sequence `tag`
- `CB` Cell/group barcode sequence `tag`. This is an alias for XC, but its presence indicates that the barcode has been corrected
- `CR` Raw cell/group barcode sequence `tag`
- `XG` PacBio's `GGG` UMI suffix `tag`
- `XM` UMI sequence `tag`
- `XO` overhang sequence `tag`
- `XO` Overhang sequence `tag`
- `nb` Edit distance between corrected cell/group barcode and raw cell/group barcode
- `gp` Pass/fail for cell/group barcode correction using a truth-set. 1 for pass, 0 for fail
- `nc` Number of known cell/group barcodes in the truth-set sharing the shortest distance from the raw barcode. If this number is > 1, this indicates ambiguity in remapping
- `oc` Original known cell/group barcodes in the truth-set sharing the shortest distance from the raw barcode
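
To make the `ib` layout concrete, here is a minimal Python sketch that splits a barcode summary string into its triplets. The function name `parse_ib` is hypothetical and is not part of the isoseq tooling; the sketch only assumes the semicolon and comma delimiters described in the list above.

```python
from typing import List, Tuple


def parse_ib(value: str) -> List[Tuple[int, int, int]]:
    """Split an `ib` value into (barcode_index_1, barcode_index_2, count) triplets.

    Hypothetical helper: assumes triplets are separated by ';' and
    fields within a triplet by ','.
    """
    triplets = []
    for chunk in value.split(";"):
        if not chunk:
            continue  # tolerate an empty chunk, e.g. from a trailing ';'
        first, second, count = (int(field) for field in chunk.split(","))
        triplets.append((first, second, count))
    return triplets


if __name__ == "__main__":
    # The example value given in the tag description.
    for first, second, count in parse_ib("0,1,20;0,3,5"):
        print(first, second, count)
```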

Quality values are capped at `93`.

2 changes: 1 addition & 1 deletion docs/index.md
@@ -26,7 +26,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
for information on Installation, Support, License, Copyright, and Disclaimer.

## Latest Version
Version **3.4.0**: [Full changelog here](/changelog)
Version **3.7.0**: [Full changelog here](/changelog)

## What's new!
New documentation is up, a 1:1 port from the original GitHub docs with minor
31 changes: 31 additions & 0 deletions docs/isoseq-tags.md
@@ -0,0 +1,31 @@
---
layout: default
title: BAM Tags
nav_order: 8
---

#### Iso-Seq Tags

| Tag | Type | Short Name | Relevant Executable | Value |
| --- | ---- | ---------- | ----- | ----- |
|CR| string | Cell Raw | `correct` | Raw (uncorrected) barcode. |
|CB| string | Cell Barcode | `correct` | Corrected cell/group barcode. |
|UR| string | UMI Raw | None currently | Molecular/UMI barcode. |
|UB| string | UMI Barcode | None currently | Corrected molecular/UMI barcode. |
|XM| string | UMI Barcode | `tag` | Corrected molecular/UMI barcode. |
|XC| string | Cell Barcode | `tag`, `correct` | Original Cell barcode. |
|XA| string | tag name order| `tag`, `correct` | Order of tag names. |
|nc| int | Number of Candidates | `correct` | Number of candidate barcodes. |
|oc| string | Other Choices | `correct` | String representation of other potential barcodes. |
|gp| int | Group Passes | `correct` | Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing. |
|nb| int | Barcode Distance | `correct` | Edit distance from the barcode for the read to the barcode to which it was reassigned. This is 0 if the barcode matches exactly, -1 if the barcode could not be rescued, and the edit distance otherwise. |
|ic| int | input-consensus | `dedup`, `groupdedup` | Number of reads used to generate consensus. If less than `is`, this means that reads were down-sampled when consensus-calling. |
|is| int | input-sequences | `dedup`, `groupdedup` | Number of reads associated with isoform. |
|XO| string | X Overhang | `tag` | Overhang sequence tag. |
|XG| string | X GGG | `tag` | PacBio's GGG UMI suffix tag |
|rq | float | read quality | | Predicted accuracy for polished isoform |
|iz | int | maximum subreads used | | maximum number of subreads used for polishing |
|it | string | trimmed | `tag` | List of barcodes/UMIs clipped during tag |
|im | string | names | `dedup`, `groupdedup` | List of names of input reads used in generating consensus |

<img src="../doc/img/isoseq.png"/>
65 changes: 61 additions & 4 deletions docs/umi/cli-workflow.md
@@ -135,10 +135,54 @@ If you used more than one SMRT cells, merge all of your `<movie>.fltnc.bam` file

$ ls movie1.fltnc.bam movie2.fltnc.bam movieN.fltnc.bam > fltnc.fofn

## Step 5 - Deduplication

## Step 5 - Cell Barcode Correction
This step identifies 10x cell barcode errors and corrects them. The tool uses the 10x cell barcode whitelist to reassign erroneous barcodes based on edit distance.


**Method**

First, the *correct* tool builds a Locality-Sensitive Hashing (LSH) index over the 10x whitelist barcode subsequences.
In the second step, *correct* uses the LSH index to map raw input barcodes to their nearest barcodes in the truth-set.

For each input HiFi read containing a 10x cell barcode:
- If the barcode is in the whitelist, it is unchanged.
- If the barcode is not found in the whitelist, the index is queried for the closest match in the whitelist.
- Edit distance is calculated between all retrieved whitelist cell barcodes and the input barcode.
- The barcode with the lowest edit distance and lowest Hamming distance is output.
- By default, if the edit distance between the cell barcode and whitelist barcode is > 2, the read is marked as failing.
- If no candidates were found, the barcode is unchanged, and the read is marked as failing.
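
The per-read decision steps above can be sketched in Python as follows. This is a simplified illustration, not the *correct* implementation: it replaces the LSH index with a brute-force scan over the whole whitelist, and the names `edit_distance`, `hamming_distance`, and `correct_barcode` are hypothetical. Returning the nearest barcode together with a failing flag when the distance exceeds the threshold is one reading of the rules above.

```python
from typing import Iterable, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def hamming_distance(a: str, b: str) -> int:
    """Mismatching positions; any length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def correct_barcode(raw: str, whitelist: Iterable[str],
                    max_distance: int = 2) -> Tuple[str, Optional[int], bool]:
    """Return (barcode, edit_distance, passes) for one raw barcode.

    Assumption: brute-force search stands in for the LSH index query.
    """
    candidates = set(whitelist)
    if raw in candidates:
        return raw, 0, True          # already valid, left unchanged
    if not candidates:
        return raw, None, False      # no candidates: unchanged, failing
    # Prefer lowest edit distance, then lowest Hamming distance;
    # the final sort key only makes ties deterministic.
    best = min(candidates,
               key=lambda c: (edit_distance(raw, c), hamming_distance(raw, c), c))
    distance = edit_distance(raw, best)
    return best, distance, distance <= max_distance
```

A brute-force scan is quadratic in practice and would be far too slow for a full 10x whitelist; it is used here only because it makes the selection rule easy to read.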

**Input** The input file for *correct* is one FLTNC file:
- `<movie>.fltnc.bam`

**Output** The following output files of *correct* contain reads with corrected cell barcodes:
- `<prefix>.bam`
- `<prefix>.bam.pbi`

Example invocation:
$ isoseq correct --barcodes barcode_set.txt flnc.bam flnc.corrected.bam


## Step 6 - Deduplication
This step performs PCR deduplication via clustering by UMI and cell barcodes (if available).
After deduplication, *dedup* generates one consensus sequence per founder molecule,
using a QV guided consensus approach.

We provide two methods: *dedup* and *groupdedup*.

They provide nearly identical functionality. The key difference is that *groupdedup* only deduplicates
reads sharing a cell barcode. It therefore requires both barcode correction with the *correct* tool and sorting by cell barcode (tag "CB").
(Sorting a BAM by cell barcode can be done efficiently with `samtools sort -t CB`.)

Barcode correction is required because sequencing errors introduce erroneous barcodes.
*dedup* tolerates barcode errors through pairwise barcode alignment, whereas *groupdedup* assumes that barcodes are correct.
Running *correct* first lets the faster *groupdedup* step reasonably make this assumption, because
mismatches have already been resolved against the whitelist using the index.

Using *groupdedup* can provide over 200x speed-ups and substantially reduce RAM requirements.


After deduplication, *dedup* and *groupdedup* generate one consensus sequence per founder molecule,
using a QV guided consensus.

**Method**

@@ -148,11 +192,16 @@ Perform all vs all comparison and cluster two reads if:
* pairwise concordance is at least 97%
* alignment starts/ends within 5 bp of the other read
* no more than 5 bps are deleted or inserted in a window of 20 bp (like in isoseq cluster)
* *groupdedup* only: these reads have the same cell barcode
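
The criteria visible in this hunk can be expressed as a small predicate. The Python sketch below is illustrative only: the `PairSummary` type and its fields are hypothetical, it assumes the alignment statistics have already been computed elsewhere, and it covers only the conditions listed here, not any that sit above this hunk.

```python
from dataclasses import dataclass


@dataclass
class PairSummary:
    """Hypothetical summary of one pairwise alignment between two reads."""
    concordance: float        # fraction of concordant aligned bases, 0.0 to 1.0
    start_offset: int         # bp between the two alignment starts
    end_offset: int           # bp between the two alignment ends
    max_indel_in_window: int  # most inserted/deleted bp seen in any 20 bp window
    barcode_a: str            # cell barcode of the first read
    barcode_b: str            # cell barcode of the second read


def may_cluster(pair: PairSummary, group_mode: bool = False) -> bool:
    """Apply the listed thresholds; group_mode mimics the groupdedup-only rule."""
    if pair.concordance < 0.97:
        return False  # pairwise concordance must be at least 97%
    if abs(pair.start_offset) > 5 or abs(pair.end_offset) > 5:
        return False  # starts and ends must lie within 5 bp of each other
    if pair.max_indel_in_window > 5:
        return False  # no more than 5 bp inserted or deleted per 20 bp window
    if group_mode and pair.barcode_a != pair.barcode_b:
        return False  # groupdedup only: cell barcodes must be identical
    return True
```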

**Input**
The input file for *dedup* is one FLTNC file:
- `<movie>.fltnc.bam` or `fltnc.fofn`

The input file for *groupdedup* is one FLTNC file, sorted by 10x cell barcode tag:
- `<movie>.tagsort.bam`


**Output**
The following output files of *dedup* contain polished isoforms:
- `<prefix>.bam`
@@ -161,6 +210,14 @@ The following output files of *dedup* contain polished isoforms:
- `<prefix>.bam.pbi`
- `<prefix>.transcriptset.xml`

Example invocation:
The following output files of *groupdedup* contain polished isoforms:
- `<prefix>.bam`
- `<prefix>.bam.pbi`

Example invocation (*dedup*):

$ isoseq dedup fltnc.fofn dedup.bam --verbose

Example invocation (*groupdedup*):

$ isoseq groupdedup fltnc.tagsort.bam dedup.bam
29 changes: 29 additions & 0 deletions docs/umi/isoseq-bcstats.md
@@ -0,0 +1,29 @@
---
layout: default
parent: Single cell
title: Barcode Statistics
nav_order: 7
---

***

`isoseq3 bcstats` emits statistics for each barcode:

1. Barcode sequence
2. Number of reads matching the barcode
3. Frequency Rank (within barcodes)
4. Number of unique molecular barcodes matching this barcode
5. Whether the barcode is Group/Cell barcode or a Molecular Barcode/UMI

If `--json` is unset, JSON summary information is written to stderr ("/dev/stderr").
Similarly, if `-o` is unset, output TSV information is written to stdout ("/dev/stdout").

```bash
# Example:
isoseq3 bcstats --json sample.bcstats.json -o sample.bcstats.tsv sample.bam
```

By default, the program only emits stats on group barcodes.
Adding `--umi` will cause stats for the full molecular barcodes to be emitted as well.
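
To illustrate what the per-barcode fields mean, the following Python sketch computes read counts, frequency ranks, and unique UMI counts from a list of (cell barcode, UMI) pairs. It is not the `bcstats` implementation and does not reproduce its TSV or JSON layout; the function name `barcode_stats` and the tie-breaking rule for equal counts are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def barcode_stats(reads: Iterable[Tuple[str, str]]) -> List[Tuple[str, int, int, int]]:
    """Return (barcode, read_count, frequency_rank, unique_umis) per cell barcode.

    Rank 1 is the most frequent barcode. Ties are broken alphabetically,
    which is an assumption of this sketch.
    """
    read_counts: Counter = Counter()
    umis_per_barcode = defaultdict(set)
    for barcode, umi in reads:
        read_counts[barcode] += 1
        umis_per_barcode[barcode].add(umi)

    ranked = sorted(read_counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (barcode, count, rank, len(umis_per_barcode[barcode]))
        for rank, (barcode, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    example = [("AAA", "u1"), ("AAA", "u1"), ("AAA", "u2"), ("CCC", "u3")]
    for row in barcode_stats(example):
        print("\t".join(str(field) for field in row))
```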

<img src="../../doc/img/isoseq.png"/>
62 changes: 62 additions & 0 deletions docs/umi/isoseq-correct.md
@@ -0,0 +1,62 @@
---
layout: default
parent: Single cell
title: Barcode Correction Documentation via correct
nav_order: 6
---

## Barcode Correction Documentation

### Why Barcode Correction?

Single-cell, spatially-resolved, and other barcoded sequencing applications
rely on the accuracy of the cell or group barcode, which is typically chosen from a set of
known candidates, often referred to as a "whitelist".

This contrasts with the uniformly randomly-generated molecular barcodes (a.k.a. UMIs, "Unique molecular identifiers").

This tool uses the set of known candidates to correct sequencing errors in cell barcode identification. There are two primary benefits:

1. Increased yield
2. Improved accuracy in downstream deduplication.

By correcting errors in cell barcodes, the total number of usable reads is increased (typically ~5%).

Once cell barcodes are corrected, the downstream `groupdedup` tool can perform deduplication much more efficiently
than standard deduplication. This is because only reads sharing a cell barcode are compared, which dramatically reduces the search space compared to exhaustive pairwise comparisons.
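
The size of that reduction can be estimated with simple counting. The Python sketch below compares the number of read pairs in an exhaustive all-vs-all comparison with the number of pairs when comparisons are restricted to reads sharing a barcode. It is a back-of-the-envelope model with hypothetical function names, and it says nothing about how either tool prunes comparisons internally.

```python
from collections import Counter
from typing import Iterable


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n reads."""
    return n * (n - 1) // 2


def grouped_comparisons(barcodes: Iterable[str]) -> int:
    """Pairs to compare when only reads with identical barcodes are paired."""
    return sum(pairwise_comparisons(size) for size in Counter(barcodes).values())


if __name__ == "__main__":
    # Toy input: 1000 reads spread evenly over 100 barcodes.
    barcodes = [f"BC{i % 100:03d}" for i in range(1000)]
    print(pairwise_comparisons(len(barcodes)), grouped_comparisons(barcodes))
```

For this toy input, the exhaustive approach considers 499500 pairs while the grouped approach considers 4500. Real data is not evenly distributed across barcodes, so the actual ratio will differ.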

### What does Barcode Correction do?

The tool takes a list of true barcodes and builds a locality-sensitive hashing (LSH) index over that set to facilitate fast nearest-neighbor queries.

It then remaps each read's cell barcode to its nearest neighbor within the truth set.

### When would a user call this tool?

Run this tool on barcode-tagged BAM files before deduplication with `isoseq3 groupdedup`.
Correcting barcodes first makes it possible to use `groupdedup`, which provides substantial runtime improvements compared to `isoseq3 dedup`.

## Usage

### (with barcode-set in barcodes.txt)
```
isoseq3 correct --barcodes barcodes.txt input.bam output.bam
```

#### Tags
This requires the existence of XC and XU barcode tags.
The program will fail if either is missing.

We also add or update the following tags:

| Tag | Type | Short Name | Value |
| --- | ---- | ---------- | ----- |
|CR| string | Cell Raw | Raw (uncorrected) barcode. |
|CB| string | Cell Barcode | Corrected cell/group barcode. |
|XC| string | Cell Barcode | Original Cell barcode. |
|nc| int | Number of Candidates | Number of candidate barcodes. |
|oc| string | Other Choices | String representation of other potential barcodes. |
|gp| int | Group Passes | Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing. |
|nb| int | Number of Barcode Mismatches | Edit distance from the barcode for the read to the barcode to which it was reassigned. This is -1 if the barcode could not be corrected, and the edit distance otherwise. (This means 0 for an exact match.) |
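
As a reading aid for the `nb` and `gp` values in the table above, here is a short Python sketch that turns them into a human-readable status. It assumes the tags of one read have already been loaded into a plain dictionary; the function name `describe_correction` is hypothetical, and no BAM-reading library is involved.

```python
from typing import Mapping


def describe_correction(tags: Mapping[str, int]) -> str:
    """Summarise the `nb` (edit distance) and `gp` (pass flag) tags of one read."""
    distance = tags["nb"]
    if distance == -1:
        outcome = "barcode could not be corrected"
    elif distance == 0:
        outcome = "barcode matched the truth-set exactly"
    else:
        outcome = f"barcode reassigned at edit distance {distance}"
    verdict = "pass" if tags["gp"] == 1 else "fail"
    return f"{outcome} ({verdict})"


if __name__ == "__main__":
    print(describe_correction({"nb": 1, "gp": 1}))
```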

<img src="../../doc/img/isoseq.png"/>
