Commit

Version 4.0.0
armintoepfer committed Jun 7, 2023
1 parent 8a8f915 commit 3a42cd4
Showing 17 changed files with 108 additions and 104 deletions.
28 changes: 15 additions & 13 deletions README.md
@@ -1,21 +1,25 @@
<h1 align="center"><img width="300px" src="doc/img/isoseq.png"/></h1>
<h1 align="center">IsoSeq v3</h1>
<h1 align="center">IsoSeq</h1>
<p align="center">Scalable De Novo Isoform Discovery</p>

***

*IsoSeq v3* contains the newest tools to identify transcripts in
PacBio single-molecule sequencing data.
Starting in SMRT Link v6.0.0, those tools power the
*IsoSeq GUI-based analysis* application.
A composable workflow of existing tools and algorithms, combined with
a new clustering technique, allows to process the ever-increasing yield of PacBio
machines with similar performance to *IsoSeq* versions 1 and 2.
Starting with version 3.4, support for UMI and cell barcode based deduplication
has been added.
*IsoSeq* contains the newest tools to identify transcripts in PacBio
single-molecule sequencing data. Starting in SMRT Link v6.0.0, those tools power
the *IsoSeq GUI-based analysis* application. A composable workflow of existing
tools and algorithms, combined with new clustering techniques, makes it possible to
process the ever-increasing yield of PacBio machines. Starting with version 3.4, support
for UMI and cell barcode based deduplication has been added. Version 4.0 adds a
new `cluster2` tool that enables clustering of hundreds of millions of HiFi
reads.

## Announcement
The binary has been renamed from `isoseq3` to `isoseq` to enable major version
changes. Bioconda will still generate an `isoseq3` softlink. The old bioconda
`isoseq3` package will automatically install the latest `isoseq` package.
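
For scripts that have to run against both pre-4.0 and 4.0 installations, the rename described above can be handled with a small lookup. This is a minimal sketch, not part of the IsoSeq distribution; the function name `find_isoseq` is hypothetical, and it assumes only that one of the two binary names is on `PATH`.

```shell
# Hypothetical helper (not shipped with IsoSeq): print the first IsoSeq
# binary name found on PATH. Prefers the new `isoseq` name and falls back
# to the legacy `isoseq3` name or softlink.
find_isoseq() {
    local candidate
    for candidate in isoseq isoseq3; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "error: neither isoseq nor isoseq3 found on PATH" >&2
    return 1
}
```

A pipeline script could then store the result once, for example `ISOSEQ=$(find_isoseq) || exit 1`, and call `"$ISOSEQ" collapse <mapped.bam> <collapse.gff>` afterwards.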

## Availability
Latest version can be installed via bioconda package `isoseq3`.
Latest version can be installed via bioconda package `isoseq`.

Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
for information on Installation, Support, License, Copyright, and Disclaimer.
@@ -24,8 +28,6 @@ for information on Installation, Support, License, Copyright, and Disclaimer.

* Visit [isoseq.how](https://isoseq.how) for the latest documentation



## DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
13 changes: 8 additions & 5 deletions docs/changelog.md
@@ -5,22 +5,25 @@ nav_order: 99
---

# Version changelog
* **4.0.0**
* Rename `isoseq3` to `isoseq`
* Add new tool `cluster2`

* **3.8.2**
* 3.8.2
* Update `groupdedup` to output consistent molecular IDs across runs
* Bug fix updating `rc` and `gp` tags to passing for subset of `correct` reads

* 3.8.1
* Real-cell `--method` and `--percentile` options added to `correct`

* 3.8.0
* `collapse` allows isoforms with 5p degradation to collapse by default
* `--do-not-collapse-extra-5exons` added to `collapse`
* `collapse` max 5p and 3p distances can be set in CLI using
* `collapse` max 5p and 3p distances can be set in CLI using
`--max-5p-diff` and `--max-3p-diff`
* Real-cell annotation in `correct` using `rc` tag
* Real-cell filtering in `groupdedup`, `dedup`, and `collapse`

* 3.7.0
* Adding `bcstats`, `correct`, and `groupdedup` to CLI
* `bcstats` emits frequency statistics for 10x barcodes
10 changes: 5 additions & 5 deletions docs/classification/isoseq-collapse.md
@@ -7,7 +7,7 @@ nav_order: 2

# IsoSeq Collapse

After transcript sequences are mapped to a reference genome, `isoseq3 collapse` can be used to collapse redundant transcripts (based on exonic structures) into unique isoforms. Output consists of unique isoforms in GFF format and secondary files containing information about the number of reads supporting each unique isoform.
After transcript sequences are mapped to a reference genome, `isoseq collapse` can be used to collapse redundant transcripts (based on exonic structures) into unique isoforms. Output consists of unique isoforms in GFF format and secondary files containing information about the number of reads supporting each unique isoform.

### Collapse Examples

@@ -24,10 +24,10 @@ pbmm2 align --preset ISOSEQ --sort <input.bam> <ref.fa> <mapped.bam>
Collapse mapped reads into unique isoforms using _isoseq collapse_.

```
isoseq3 collapse <mapped.bam> <collapse.gff>
isoseq collapse <mapped.bam> <collapse.gff>
```

Note: `isoseq3 collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk IsoSeq.
Note: `collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk IsoSeq.

### Output

@@ -56,7 +56,7 @@ Note: `isoseq3 collapse` by default will collapse isoforms containing 5p degrada
# Collapse FAQ
As of *isoseq3 v3.8.0* `isoseq3 collapse` has algorithmic updates.
As of *isoseq3 v3.8.0* `collapse` has algorithmic updates.
These updates include performance improvements and updates to isoform collapse logic.
## What is new in *v3.8.0* and later?
@@ -93,5 +93,5 @@ New *v3.8.0* `collapse` maximum junction difference parameters:
The legacy `collapse` logic can be recreated using the following parameters:
```
isoseq3 collapse --do-not-collapse-extra-5exons --max-5p-diff 5 --max-3p-diff 5 <mapped.bam> <collapsed.gff>
isoseq collapse --do-not-collapse-extra-5exons --max-5p-diff 5 --max-3p-diff 5 <mapped.bam> <collapsed.gff>
```
6 changes: 3 additions & 3 deletions docs/classification/workflow.md
@@ -21,12 +21,12 @@ Collapse redundant transcripts into unique isoforms based on exonic structures u

Single-cell IsoSeq:
```
isoseq3 collapse <mapped.bam> <collapsed.gff>
isoseq collapse <mapped.bam> <collapsed.gff>
```

Bulk IsoSeq:
```
isoseq3 collapse --do-not-collapse-extra-5exons <mapped.bam> <collapsed.gff>
isoseq collapse --do-not-collapse-extra-5exons <mapped.bam> <collapsed.gff>
```
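
The only difference between the two invocations above is the `--do-not-collapse-extra-5exons` flag. A wrapper script could make that choice explicit, as in the sketch below. The function name `collapse_cmd` and its mode labels are hypothetical; it only prints the command line (a dry run) and does not call `isoseq` itself.

```shell
# Hypothetical dry-run helper: print the collapse command for a library type.
# Usage: collapse_cmd <bulk|single-cell> <mapped.bam> <collapsed.gff>
collapse_cmd() {
    local mode=$1 mapped=$2 out=$3
    case "$mode" in
        single-cell)
            # Default behaviour since 3.8.0: 5p-degraded isoforms are collapsed.
            echo "isoseq collapse $mapped $out"
            ;;
        bulk)
            # Bulk data: keep isoforms with extra 5' exons separate.
            echo "isoseq collapse --do-not-collapse-extra-5exons $mapped $out"
            ;;
        *)
            echo "error: unknown mode '$mode'" >&2
            return 1
            ;;
    esac
}
```

Printing the command instead of running it makes it easy to review or log the exact invocation before passing it to a job scheduler.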

### Sort input transcript GFF
@@ -106,7 +106,7 @@ Output files that are compatible with the downstream [Seurat](https://satijalab.
pigeon make-seurat --dedup <dedup.fasta> --group <collapse.group.txt> -d <output_dir> <classification.filtered_lite_classification.txt>
```

The `dedup.fasta` file is obtained after running `isoseq3 groupdedup` or `isoseq3 dedup`. The `collapse.group.txt` file is obtained after running `isoseq3 collapse`.
The `dedup.fasta` file is obtained after running `isoseq groupdedup` or `isoseq dedup`. The `collapse.group.txt` file is obtained after running `isoseq collapse`.

The output will consist of:
```
18 changes: 5 additions & 13 deletions docs/clustering/examples.md
@@ -9,18 +9,10 @@ nav_order: 4

### Single sample
This is an example of an end-to-end cmd-line-only workflow to get from
subreads to transcripts. It's a 1% subsampled Alzheimer dataset.
You can either download the subreads and call HiFi on your own or skip this step
and download the HiFi reads generated by CCS v4.2:
HiFi reads to transcripts. It's a 1% subsampled Alzheimer dataset.
You can download the HiFi reads generated by CCS v4.2:

$ wget https://downloads.pacbcloud.com/public/dataset/IsoSeq_sandbox/2020_Alzheimer8M_subset/alz.1perc.subreads.bam

$ ccs --version
ccs 4.0.0

$ ccs alz.1perc.subreads.bam alz.1perc.ccs.bam --min-rq 0.9

# Or download the pre-computed HiFi reads
# Download the pre-computed HiFi reads
$ wget https://downloads.pacbcloud.com/public/dataset/IsoSeq_sandbox/2020_Alzheimer8M_subset/alz.1perc.ccs.bam

$ cat primers.fasta
@@ -86,6 +78,6 @@ and download the HiFi reads generated by CCS v4.2:
$ ls fl.bc1001_5p--bc1001_3p.bam fl.bc1002_5p--bc1002_3p.bam > all.fofn

# Remove poly(A) tails and concatemer
$ isoseq3 refine all.fofn NEB_barcode16.fasta flnc.bam --require-polya --log-level DEBUG
$ isoseq refine all.fofn NEB_barcode16.fasta flnc.bam --require-polya

$ isoseq3 cluster flnc.bam clustered.bam --use-qvs --verbose
$ isoseq cluster flnc.bam clustered.bam --use-qvs --verbose
22 changes: 11 additions & 11 deletions docs/getting-started.md
@@ -13,14 +13,14 @@ nav_order: 3
| Command | Description | Output format |
| --- | --- | --- |
| *lima* | Remove cDNA primers | `fl.bam` |
| *isoseq3 refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
| *isoseq3 cluster* | *De novo* isoform-level clustering | `unpolished.bam` |
| *isoseq refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
| *isoseq cluster* | *De novo* isoform-level clustering | `unpolished.bam` |
| *pbmm2* | Align to the genome | `mapped.bam` |
| *isoseq3 collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
| *isoseq collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
| *pigeon classify* | Classify transcripts against annotation | GFF and TXT files |
| *pigeon filter* | Filter transcripts for potential artifacts | GFF and TXT files |

Begin with the [bulk workflow](https://isoseq.how/clustering/) which ends at `isoseq3 cluster`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
Begin with the [bulk workflow](https://isoseq.how/clustering/) which ends at `isoseq cluster`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.



@@ -29,15 +29,15 @@ Begin with the [bulk workflow](https://isoseq.how/clustering/) which ends at `is
| Command | Description | Output format |
| --- | --- | --- |
| *lima* | Remove cDNA primers | `fl.bam` |
| *isoseq3 tag* | Extract UMI and cell barcodes | `flt.bam` |
| *isoseq3 refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
| *isoseq3 correct* | Correct cell barcodes and tag reads that are real cells | `corrected.bam` |
| *isoseq3 bcstats* | Summarize barcode statistics for real/non-real cells | `bcstats_report.tsv` |
| *isoseq3 groupdedup* | Deduplicate reads | `dedup.bam` |
| *isoseq tag* | Extract UMI and cell barcodes | `flt.bam` |
| *isoseq refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
| *isoseq correct* | Correct cell barcodes and tag reads that are real cells | `corrected.bam` |
| *isoseq bcstats* | Summarize barcode statistics for real/non-real cells | `bcstats_report.tsv` |
| *isoseq groupdedup* | Deduplicate reads | `dedup.bam` |
| *pbmm2* | Align to the genome | `mapped.bam` |
| *isoseq3 collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
| *isoseq collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
| *pigeon classify* | Classify transcripts against annotation | GFF and TXT files |
| *pigeon filter* | Filter transcripts for potential artifacts | GFF and TXT files |
| *pigeon make-seurat* | Make gene- and isoform-level matrices | MTX and TSV files |

Begin with the [single cell-specific workflow](https://isoseq.how/umi/) which ends at `isoseq3 groupdedup`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
Begin with the [single cell-specific workflow](https://isoseq.how/umi/) which ends at `isoseq groupdedup`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
Binary file modified docs/img/isoseq-clustering-workflow.pdf
Binary file not shown.
23 changes: 15 additions & 8 deletions docs/index.md
@@ -12,23 +12,30 @@ permalink: /

***

*Iso-Seq* contains the newest tools to identify transcripts in PacBio
*IsoSeq* contains the newest tools to identify transcripts in PacBio
single-molecule sequencing data. Starting in SMRT Link v6.0.0, those tools power
the *Iso-Seq GUI-based analysis* application. A composable workflow of existing
tools and algorithms, combined with a new clustering technique, allows to
process the ever-increasing yield of PacBio. Starting with version 3.4, support
for UMI and cell barcode based deduplication has been added.
tools and algorithms, combined with new clustering techniques, makes it possible to
process the ever-increasing yield of PacBio machines. Starting with version 3.4, support
for UMI and cell barcode based deduplication has been added. Version 4.0 adds a
new `cluster2` tool that enables clustering of hundreds of millions of HiFi
reads.

## Availability
Latest version can be installed via bioconda package `isoseq3`.
Latest version can be installed via bioconda package `isoseq`.

Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
for information on Installation, Support, License, Copyright, and Disclaimer.

## Latest Version
Version **3.8.2**: [Full changelog here](/changelog)
Version **4.0.0**: [Full changelog here](/changelog)

## What's new!
New documentation is up, a 1:1 port from the original GitHub docs with minor
enhancements.
Version 4.0 adds a new `cluster2` tool that enables clustering of hundreds of
millions of HiFi reads. This is an early access version. Additional
documentation will follow soon.

The binary has been renamed from `isoseq3` to `isoseq` to enable major version
changes. Bioconda will still generate an `isoseq3` softlink. The old bioconda
`isoseq3` package will automatically install the latest `isoseq` package.

18 changes: 9 additions & 9 deletions docs/umi/cell-calling.md
@@ -18,7 +18,7 @@ Typically there is a steep decline in the number UMIS per cell barcode that indi

## Cell Calling Methods

There are two methods for determining real cells in *isoseq3*.
There are two methods for determining real cells in *isoseq*.

### Knee Finding Method (default)

@@ -30,24 +30,24 @@ The percentile method approximates real cells based on a percentile cutoff of UM

### Which tools use cell calling?

Both [*isoseq3 correct*](https://isoseq.how/umi/isoseq-correct.html) and [*isoseq3 bcstats*](https://isoseq.how/umi/isoseq-bcstats.html) use cell calling.
Both [*isoseq correct*](https://isoseq.how/umi/isoseq-correct.html) and [*isoseq bcstats*](https://isoseq.how/umi/isoseq-bcstats.html) use cell calling.
In addition to cell barcode correction, *isoseq correct* labels the bam records from real cells with the `rc` tag.
After correction, *isoseq3 bcstats* can be used to generate a tsv file that can be used to plot the barcode rank plot.
After correction, *isoseq bcstats* can be used to generate a tsv file that can be used to plot the barcode rank plot.

The knee finding cell calling method is the default for both *correct* and *bcstats*. To change the cell calling method, the `--method` option should be added. The cutoff percentile can be changed from the default value of 99 to another value using the `--percentile` option.

To use the percentile method at the default cutoff (99):

```
isoseq3 correct --method percentile ...
isoseq3 bcstats --method percentile ...
isoseq correct --method percentile ...
isoseq bcstats --method percentile ...
```

To lower the percentile cutoff to 97:

```
isoseq3 correct --method percentile --percentile 97 ...
isoseq3 bcstats --method percentile --percentile 97 ...
isoseq correct --method percentile --percentile 97 ...
isoseq bcstats --method percentile --percentile 97 ...
```
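
When comparing several cutoffs, it can help to generate one `bcstats` report per percentile. The following is a sketch under assumptions: the function name `bcstats_sweep` and the output naming scheme `bcstats.p<N>.tsv` are hypothetical, and the function only prints the commands rather than running them. The flags themselves (`--method`, `--percentile`, `-o`) are the ones used elsewhere on this page.

```shell
# Hypothetical dry-run helper: print one bcstats command per percentile cutoff.
# Usage: bcstats_sweep <corrected.bam> <percentile>...
bcstats_sweep() {
    local bam=$1 pct
    shift
    for pct in "$@"; do
        # Each cutoff writes to its own TSV so the reports can be plotted side by side.
        echo "isoseq bcstats --method percentile --percentile ${pct} -o bcstats.p${pct}.tsv ${bam}"
    done
}
```

For example, `bcstats_sweep corrected.bam 95 97 99` prints three commands, one per cutoff, which can be inspected and then executed.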


@@ -72,14 +72,14 @@ Additional information about interpreting barcode rank plots can be found in thi
### Determining the correct percentile cutoff

There is a python script available that can be used to determine the correct percentile cutoff to use.
The barcode rank plot can be generated with various percentile cutoffs from the [*isoseq3 bcstats*](https://isoseq.how/umi/isoseq-bcstats.html) tsv output.
The barcode rank plot can be generated with various percentile cutoffs from the [*isoseq bcstats*](https://isoseq.how/umi/isoseq-bcstats.html) tsv output.
The pink line shows the real cells labeled from *bcstats* in the tsv file.

If not already run, *bcstats* can be run as follows:

```
# Run bcstats on the corrected bam
$ isoseq3 bcstats --json bcstats_report.json -o bcstats_report.tsv <corrected.bam>
$ isoseq bcstats --json bcstats_report.json -o bcstats_report.tsv <corrected.bam>
```

Download plotting script and install dependencies:
10 changes: 5 additions & 5 deletions docs/umi/cli-workflow.md
@@ -74,7 +74,7 @@ The following output files of *tag* contain full-length tagged:

Insert your own design or pick a preset:

    $ isoseq3 tag <movie>.fl.5p--3p.bam <movie>.flt.bam --design XXX
    $ isoseq tag <movie>.fl.5p--3p.bam <movie>.flt.bam --design XXX

Refer to the [UMI and BC design page](https://isoseq.how/umi/umi-barcode-design.html) for how to specify `--design`.

@@ -99,7 +99,7 @@ The following output files of *refine* contain full-length non-concatemer (FLNC)

Actual command to refine:

$ isoseq3 refine <movie>.fl.5p--3p.bam primers.fasta <movie>.fltnc.bam --require-polya
$ isoseq refine <movie>.fl.5p--3p.bam primers.fasta <movie>.fltnc.bam --require-polya

If your sample has poly(A) tails, use `--require-polya`.

@@ -129,7 +129,7 @@ For details on barcode correction, visit the [barcode correction](https://isoseq
- `<prefix>.bam.pbi`

Example:
$ isoseq3 correct --barcodes barcode_set.txt fltnc.bam fltnc.corrected.bam
$ isoseq correct --barcodes barcode_set.txt fltnc.bam fltnc.corrected.bam

Common single-cell whitelist (e.g. 10x whitelist for 3' kit) can be found in the [MAS-Seq dataset](https://downloads.pacbcloud.com/public/dataset/MAS-Seq/).

@@ -178,12 +178,12 @@ The following output files of *groupdedup* contain polished isoforms:

Example(*dedup*):

$ isoseq3 dedup fltnc.corrected.bam dedup.bam
$ isoseq dedup fltnc.corrected.bam dedup.bam

Example(*groupdedup*):

$ samtools sort -t CB fltnc.corrected.bam -o fltnc.corrected.sorted.bam
$ isoseq3 groupdedup fltnc.corrected.sorted.bam dedup.bam
$ isoseq groupdedup fltnc.corrected.sorted.bam dedup.bam



10 changes: 5 additions & 5 deletions docs/umi/dedup-faq.md
@@ -7,9 +7,9 @@ nav_order: 5

## Dedup FAQ

*NOTE:* `isoseq3 groupdedup` is now the recommended deduplication tool that replaces the older, slower `isoseq3 dedup`. However some documentation figures might still refer to the old `isoseq3 dedup` tool as reference.
*NOTE:* `isoseq groupdedup` is now the recommended deduplication tool that replaces the older, slower `isoseq dedup`. However some documentation figures might still refer to the old `isoseq dedup` tool as reference.

This FAQ explains how `isoseq3 groupdedup` identifies two reads to be from the same founder molecule.
This FAQ explains how `isoseq groupdedup` identifies two reads to be from the same founder molecule.


### Adjusting maximum mismatches and shifts
@@ -51,15 +51,15 @@ While rare, it is possible to have different transcript molecules share the same

### groupdedup only: cell barcode and real cells

If using `isoseq3 groupdedup` (which is recommended over `isoseq3 dedup`), it can use the corrected cell barcodes from the `isoseq3 correct` step for grouping reads.
If using `isoseq groupdedup` (which is recommended over `isoseq dedup`), it can use the corrected cell barcodes from the `isoseq correct` step for grouping reads.

However, the BAM file must first be sorted by `CB` tag:
```
samtools sort -t CB corrected.bam -o corrected.sorted.bam
isoseq3 groupdedup corrected.sorted.bam dedup.bam
isoseq groupdedup corrected.sorted.bam dedup.bam
```
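
Because `groupdedup` expects the `CB`-sorted file, it is easy to pass the unsorted BAM by mistake. The sketch below derives the sorted file name from the input so that both steps stay consistent. The function name `groupdedup_plan` and the `.sorted.bam` naming convention are assumptions for illustration; the function prints the two commands rather than executing them.

```shell
# Hypothetical dry-run helper: print the sort and groupdedup commands so that
# groupdedup always receives the CB-sorted BAM.
# Usage: groupdedup_plan <corrected.bam> <dedup.bam>
groupdedup_plan() {
    local corrected=$1 dedup=$2
    # Derive the sorted file name by replacing the trailing ".bam".
    local sorted="${corrected%.bam}.sorted.bam"
    echo "samtools sort -t CB $corrected -o $sorted"
    echo "isoseq groupdedup $sorted $dedup"
}
```

Running `groupdedup_plan corrected.bam dedup.bam` prints the `samtools sort` command followed by the `isoseq groupdedup` command that consumes its output.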

Additionally, `isoseq3 groupdedup` can use the `rc` tag from the `isoseq3 correct` step and apply to only real cells. This can be turned off with the option below (advanced, not recommended by default):
Additionally, `isoseq groupdedup` can use the `rc` tag from the `isoseq correct` step and apply to only real cells. This can be turned off with the option below (advanced, not recommended by default):
```
--keep-non-real-cells Do not skip reads with non-real cells.
```