Version 4.0.0

PacificBiosciences · Jun 7, 2023 · 3a42cd4
1 parent 8a8f915
commit 3a42cd4

Showing 17 changed files with 108 additions and 104 deletions.
diff --git a/README.md b/README.md
@@ -1,21 +1,25 @@
 <h1 align="center"><img width="300px" src="doc/img/isoseq.png"/></h1>
-<h1 align="center">IsoSeq v3</h1>
+<h1 align="center">IsoSeq</h1>
 <p align="center">Scalable De Novo Isoform Discovery</p>
 
 ***
 
-*IsoSeq v3* contains the newest tools to identify transcripts in
-PacBio single-molecule sequencing data.
-Starting in SMRT Link v6.0.0, those tools power the
-*IsoSeq GUI-based analysis* application.
-A composable workflow of existing tools and algorithms, combined with
-a new clustering technique, allows to process the ever-increasing yield of PacBio
-machines with similar performance to *IsoSeq* versions 1 and 2.
-Starting with version 3.4, support for UMI and cell barcode based deduplication
-has been added.
+*IsoSeq* contains the newest tools to identify transcripts in PacBio
+single-molecule sequencing data. Starting in SMRT Link v6.0.0, those tools power
+the *IsoSeq GUI-based analysis* application. A composable workflow of existing
+tools and algorithms, combined with new clustering techniques, allows to process
+the ever-increasing yield of PacBio machines. Starting with version 3.4, support
+for UMI and cell barcode based deduplication has been added. Version 4.0 adds a
+new `cluster2` tool that enables clustering of hundreds of millions of HiFi
+reads.
+
+## Announcement
+The binary has been renamed from `isoseq3` to `isoseq` to enable major version
+changes. Bioconda will still generate a `isoseq3` softlink. The old bioconda
+`isoseq3` package will automatically install the latest `isoseq` package.
 
 ## Availability
-Latest version can be installed via bioconda package `isoseq3`.
+Latest version can be installed via bioconda package `isoseq`.
 
 Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
 for information on Installation, Support, License, Copyright, and Disclaimer.
@@ -24,8 +28,6 @@ for information on Installation, Support, License, Copyright, and Disclaimer.
 
  * Visit [isoseq.how](https://isoseq.how) for the latest documentation
 
-
-
 ## DISCLAIMER
 
 THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -5,22 +5,25 @@ nav_order: 99
 ---
 
 # Version changelog
+ * **4.0.0**
+   * Rename `isoseq3` to `isoseq`
+   * Add new tool `cluster2`
 
- * **3.8.2**
+ * 3.8.2
    * Update `groupdedup` to output consistent molecular IDs across runs
    * Bug fix updating `rc` and `gp` tags to passing for subset of `correct` reads
-  
+
  * 3.8.1
    * Real-cell `--method` and `--percentile` options added to `correct`
-  
+
  * 3.8.0
    * `collapse` allows isoforms with 5p degradation to collapse by default
    * `--do-not-collapse-extra-5exons` added to `collapse`
-   * `collapse` max 5p and 3p distances can be set in CLI using 
+   * `collapse` max 5p and 3p distances can be set in CLI using
      `--max-5p-diff` and `--max-3p-diff`
    * Real-cell annotation in `correct` using `rc` tag
    * Real-cell filtering in `groupdedup`, `dedup`, and `collapse`
-  
+
  * 3.7.0
    * Adding `bcstats`, `correct`, and `groupdedup` to CLI
    * `bcstats` emits frequency statistics for 10x barcodes

diff --git a/docs/classification/isoseq-collapse.md b/docs/classification/isoseq-collapse.md
@@ -7,7 +7,7 @@ nav_order: 2
 
 # IsoSeq Collapse
 
-After transcript sequences are mapped to a reference genome, `isoseq3 collapse` can be used to collapse redundant transcripts (based on exonic structures) into unique isoforms. Output consists of unique isoforms in GFF format and secondary files containing information about the number of reads supporting each unique isoform.
+After transcript sequences are mapped to a reference genome, `isoseq collapse` can be used to collapse redundant transcripts (based on exonic structures) into unique isoforms. Output consists of unique isoforms in GFF format and secondary files containing information about the number of reads supporting each unique isoform.
 
 ### Collapse Examples
 
@@ -24,10 +24,10 @@ pbmm2 align --preset ISOSEQ --sort <input.bam> <ref.fa> <mapped.bam>
 Collapse mapped reads into unique isoforms using _isoseq collapse_.
 
 ```
-isoseq3 collapse <mapped.bam> <collapse.gff>
+isoseq collapse <mapped.bam> <collapse.gff>
 ```
 
-Note: `isoseq3 collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk IsoSeq.
+Note: `collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk IsoSeq.
 
 ### Ouptut
 
@@ -56,7 +56,7 @@ Note: `isoseq3 collapse` by default will collapse isoforms containing 5p degrada
 
 # Collapse FAQ
 
-As of *isoseq3 v3.8.0* `isoseq3 collapse` has algorithmic updates. 
+As of *isoseq3 v3.8.0* `collapse` has algorithmic updates. 
 These updates include performance improvements and updates to isoform collapse logic. 
 
 ## What is new in *v3.8.0* and later?
@@ -93,5 +93,5 @@ New *v3.8.0* `collapse` maximum junction difference parameters:
 The legacy `collapse` logic can be recreated using the following parameters:
 
 ```
-isoseq3 collapse --do-not-collapse-extra-5exons --max-5p-diff 5 --max-3p-diff 5 <mapped.bam> <collapsed.gff>
+isoseq collapse --do-not-collapse-extra-5exons --max-5p-diff 5 --max-3p-diff 5 <mapped.bam> <collapsed.gff>
 ```
diff --git a/docs/classification/workflow.md b/docs/classification/workflow.md
@@ -21,12 +21,12 @@ Collapse redundant transcripts into unique isoforms based on exonic structures u
 
 Single-cell IsoSeq:
 ```
-isoseq3 collapse <mapped.bam> <collapsed.gff>
+isoseq collapse <mapped.bam> <collapsed.gff>
 ```
 
 Bulk IsoSeq:
 ```
-isoseq3 collapse --do-not-collapse-extra-5exons <mapped.bam> <collapsed.gff>
+isoseq collapse --do-not-collapse-extra-5exons <mapped.bam> <collapsed.gff>
 ```
 
 ### Sort input transcript GFF
@@ -106,7 +106,7 @@ Output files that are compatible with the downstream [Seurat](https://satijalab.
 pigeon make-seurat --dedup <dedup.fasta> --group <collapse.group.txt> -d <output_dir> <classification.filtered_lite_classification.txt>
 ```
 
-The `dedup.fasta` file is obtained after running `isoseq3 groupdedup` or `isoseq3 dedup`. The `collapse.group.txt` file is obtained after running `isoseq3 collapse`. 
+The `dedup.fasta` file is obtained after running `isoseq groupdedup` or `isoseq dedup`. The `collapse.group.txt` file is obtained after running `isoseq collapse`. 
 
 The output will consist of:
 ```

diff --git a/docs/clustering/examples.md b/docs/clustering/examples.md
@@ -9,18 +9,10 @@ nav_order: 4
 
 ### Single sample
 This is an example of an end-to-end cmd-line-only workflow to get from
-subreads to transcripts. It's a 1% subsampled Alzheimer dataset.
-You can either download the subreads and call HiFi on your own or skip this step
-and download the HiFi reads generated by CCS v4.2:
+HiFi reads to transcripts. It's a 1% subsampled Alzheimer dataset.
+You can download the HiFi reads generated by CCS v4.2:
 
-    $ wget https://downloads.pacbcloud.com/public/dataset/IsoSeq_sandbox/2020_Alzheimer8M_subset/alz.1perc.subreads.bam
-
-    $ ccs --version
-    ccs 4.0.0
-
-    $ ccs alz.1perc.subreads.bam alz.1perc.ccs.bam --min-rq 0.9
-
-    # Or download the pre-computed HiFi reads
+    # Download the pre-computed HiFi reads
     $ wget https://downloads.pacbcloud.com/public/dataset/IsoSeq_sandbox/2020_Alzheimer8M_subset/alz.1perc.ccs.bam
 
     $ cat primers.fasta
@@ -86,6 +78,6 @@ and download the HiFi reads generated by CCS v4.2:
     $ ls fl.bc1001_5p--bc1001_3p.bam fl.bc1002_5p--bc1002_3p.bam > all.fofn
 
     # Remove poly(A) tails and concatemer
-    $ isoseq3 refine all.fofn NEB_barcode16.fasta flnc.bam --require-polya --log-level DEBUG
+    $ isoseq refine all.fofn NEB_barcode16.fasta flnc.bam --require-polya
 
-    $ isoseq3 cluster flnc.bam clustered.bam --use-qvs --verbose
+    $ isoseq cluster flnc.bam clustered.bam --use-qvs --verbose
diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -13,14 +13,14 @@ nav_order: 3
 | Command | Description | Output format |
 | --- | --- | --- |
 | *lima* | Remove cDNA primers | `fl.bam` |
-| *isoseq3 refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
-| *isoseq3 cluster* | *De novo* isoform-level clustering | `unpolished.bam` |
+| *isoseq refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
+| *isoseq cluster* | *De novo* isoform-level clustering | `unpolished.bam` |
 | *pbmm2* | Align to the genome | `mapped.bam` |
-| *isoseq3 collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
+| *isoseq collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
 | *pigeon classify* | Classify transcripts against annotation | GFF and TXT files |
 | *pigeon filter* | Filter transcripts for potential artifacts | GFF and TXT files |
 
-Begin with the [bulk workflow](https://isoseq.how/clustering/) which ends at `isoseq3 cluster`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
+Begin with the [bulk workflow](https://isoseq.how/clustering/) which ends at `isoseq cluster`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
 
 
 
@@ -29,15 +29,15 @@ Begin with the [bulk workflow](https://isoseq.how/clustering/) which ends at `is
 | Command | Description | Output format |
 | --- | --- | --- |
 | *lima* | Remove cDNA primers | `fl.bam` |
-| *isoseq3 tag* | Extract UMI and cell barcodes | `flt.bam` |
-| *isoseq3 refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
-| *isoseq3 correct* | Correct cell barcodes and tag reads that are real cells | `corrected.bam` |
-| *isoseq3 bcstats* | Summarize barcode statistics for real/non-real cells | `bcstats_report.tsv` | 
-| *isoseq3 groupdedup* | Deduplicate reads | `dedup.bam` |
+| *isoseq tag* | Extract UMI and cell barcodes | `flt.bam` |
+| *isoseq refine* | Remove polyA tail and artificial concatemers | `flnc.bam` |
+| *isoseq correct* | Correct cell barcodes and tag reads that are real cells | `corrected.bam` |
+| *isoseq bcstats* | Summarize barcode statistics for real/non-real cells | `bcstats_report.tsv` | 
+| *isoseq groupdedup* | Deduplicate reads | `dedup.bam` |
 | *pbmm2* | Align to the genome | `mapped.bam` |
-| *isoseq3 collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
+| *isoseq collapse* | Collapse redundant transcripts based on exonic structures | `collapsed.gff` |
 | *pigeon classify* | Classify transcripts against annotation | GFF and TXT files |
 | *pigeon filter* | Filter transcripts for potential artifacts | GFF and TXT files |
 | *pigeon make-seurat* | Make gene- and isoform-level matrices | MTX and TSV files |
 
-Begin with the [single cell-specific worfklow](https://isoseq.how/umi/) which ends at `isoseq3 groupdedup`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
+Begin with the [single cell-specific worfklow](https://isoseq.how/umi/) which ends at `isoseq groupdedup`, then continue to [pigeon workflow](https://isoseq.how/classification/) for transcript mapping, collapse, and classification.
diff --git a/docs/img/isoseq-clustering-workflow.pdf b/docs/img/isoseq-clustering-workflow.pdf
diff --git a/docs/index.md b/docs/index.md
@@ -12,23 +12,30 @@ permalink: /
 
 ***
 
-*Iso-Seq* contains the newest tools to identify transcripts in PacBio
+*IsoSeq* contains the newest tools to identify transcripts in PacBio
 single-molecule sequencing data. Starting in SMRT Link v6.0.0, those tools power
 the *Iso-Seq GUI-based analysis* application. A composable workflow of existing
-tools and algorithms, combined with a new clustering technique, allows to
-process the ever-increasing yield of PacBio. Starting with version 3.4, support
-for UMI and cell barcode based deduplication has been added.
+tools and algorithms, combined with new clustering techniques, allows to process
+the ever-increasing yield of PacBio machines. Starting with version 3.4, support
+for UMI and cell barcode based deduplication has been added. Version 4.0 adds a
+new `cluster2` tool that enables clustering of hundreds of millions of HiFi
+reads.
 
 ## Availability
-Latest version can be installed via bioconda package `isoseq3`.
+Latest version can be installed via bioconda package `isoseq`.
 
 Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
 for information on Installation, Support, License, Copyright, and Disclaimer.
 
 ## Latest Version
-Version **3.8.2**: [Full changelog here](/changelog)
+Version **4.0.0**: [Full changelog here](/changelog)
 
 ## What's new!
-New documentation is up, a 1:1 port from the original GitHub docs with minor
-enhancements.
+Version 4.0 adds a new `cluster2` tool that enables clustering of hundreds of
+millions of HiFi reads. This is an early access version. Additional
+documentation will follow soon.
+
+The binary has been renamed from `isoseq3` to `isoseq` to enable major version
+changes. Bioconda will still generate a `isoseq3` softlink. The old bioconda
+`isoseq3` package will automatically install the latest `isoseq` package.
 
diff --git a/docs/umi/cell-calling.md b/docs/umi/cell-calling.md
@@ -18,7 +18,7 @@ Typically there is a steep decline in the number UMIS per cell barcode that indi
 
 ## Cell Calling Methods
 
-There are two methods for determining real cells in *isoseq3*.
+There are two methods for determining real cells in *isoseq*.
 
 ### Knee Finding Method (default)
 
@@ -30,24 +30,24 @@ The percentile method approximates real cells based on a percentile cutoff of UM
 
 ### Which tools use cell calling?
 
-Both [*isoseq3 correct*](https://isoseq.how/umi/isoseq-bcstats.html) and [*isoseq3 bcstats*](https://isoseq.how/umi/isoseq-correct.html) use cell calling. 
+Both [*isoseq correct*](https://isoseq.how/umi/isoseq-bcstats.html) and [*isoseq bcstats*](https://isoseq.how/umi/isoseq-correct.html) use cell calling. 
 In addition to cell barcode correction, *isoseq correct* labels the bam records from real cells with the `rc` tag. 
-After correction, *isoseq3 bcstats* can be used to generate a tsv file that can be used to plot the barcode rank plot. 
+After correction, *isoseq bcstats* can be used to generate a tsv file that can be used to plot the barcode rank plot. 
 
 The knee finding cell calling method is the default for both *correct* and *bcstats*. To change the cell calling method, the `--method` option should be added. The cutoff percentile can be changed from the default value of 99 to another value using the `--percentile` option. 
 
 To use the percentile method at the default cutoff (99):
 
 ```
-isoseq3 correct --method percentile ...
-isoseq3 bcstats --method percentile ...
+isoseq correct --method percentile ...
+isoseq bcstats --method percentile ...
 ```
 
 To lower the percentile cutoff to 97:
 
 ```
-isoseq3 correct --method percentile --percentile 97 ...
-isoseq3 bcstats --method percentile --percentile 97 ...
+isoseq correct --method percentile --percentile 97 ...
+isoseq bcstats --method percentile --percentile 97 ...
 ```
 
 
@@ -72,14 +72,14 @@ Additional information about interpreting barcode rank plots can be found in thi
 ### Determinining the correct percentile cutoff
 
 There is a python script available that can be used to determine the correct percentile cutoff to use. 
-The barcode rank plot can be generated with various percentile cutoffs from the [*isoseq3 bcstats*](https://isoseq.how/umi/isoseq-bcstats.html) tsv output. 
+The barcode rank plot can be generated with various percentile cutoffs from the [*isoseq bcstats*](https://isoseq.how/umi/isoseq-bcstats.html) tsv output. 
 The pink line shows the real cells labeled from *bcstats* in the tsv file. 
 
 If not already run, *bcstasts* can be run as follows:
 
 ```
 # Run bcstats on the corrected bam
-$ isoseq3 bcstats --json bcstats_report.json -o bcstats_report.tsv <corrected.bam>
+$ isoseq bcstats --json bcstats_report.json -o bcstats_report.tsv <corrected.bam>
 ```
 
 Download plotting script and install dependencies:

diff --git a/docs/umi/cli-workflow.md b/docs/umi/cli-workflow.md
@@ -74,7 +74,7 @@ The following output files of *tag* contain full-length tagged:
 
 Insert your own design or pick a preset:
 
-    $ isoseq3 tag <mvie>.fl.5p--3p.bam <movie>.flt.bam --design XXX
+    $ isoseq tag <mvie>.fl.5p--3p.bam <movie>.flt.bam --design XXX
 
 Refer to the [UMI and BC design page](https://isoseq.how/umi/umi-barcode-design.html) for how to specify `--design`.
 
@@ -99,7 +99,7 @@ The following output files of *refine* contain full-length non-concatemer (FLNC)
 
 Actual command to refine:
 
-    $ isoseq3 refine <movie>.fl.5p--3p.bam primers.fasta <movie>.fltnc.bam --require-polya
+    $ isoseq refine <movie>.fl.5p--3p.bam primers.fasta <movie>.fltnc.bam --require-polya
 
 If your sample has poly(A) tails, use `--require-polya`. 
 
@@ -129,7 +129,7 @@ For details on barcode correction, visit the [barcode correction](https://isoseq
  - `<prefix>.bam.pbi`
 
 Example:
-    $ isoseq3 correct --barcodes barcode_set.txt fltnc.bam fltnc.corrected.bam
+    $ isoseq correct --barcodes barcode_set.txt fltnc.bam fltnc.corrected.bam
 
 Common single-cell whitelist (e.g. 10x whitelist for 3' kit) can be found in the [MAS-Seq dataset](https://downloads.pacbcloud.com/public/dataset/MAS-Seq/).
 
@@ -178,12 +178,12 @@ The following output files of *groupdedup* contain polished isoforms:
 
 Example(*dedup*):
 
-    $ isoseq3 dedup fltnc.corrected.bam dedup.bam 
+    $ isoseq dedup fltnc.corrected.bam dedup.bam 
 
 Example(*groupdedup*):
 
     $ samtools sort -t CB fltnc.corrected.bam -o fltnc.corrected.sorted.bam
-    $ isoseq3 groupdedup fltnc.corrected.sorted.bam dedup.bam
+    $ isoseq groupdedup fltnc.corrected.sorted.bam dedup.bam
 
 
 

diff --git a/docs/umi/dedup-faq.md b/docs/umi/dedup-faq.md
@@ -7,9 +7,9 @@ nav_order: 5
 
 ## Dedup FAQ
 
-*NOTE:* `isoseq3 groupdedup` is now the recommended deduplication tool that replaces the older, slower `isoseq3 dedup`. However some documentation figures might still refer to the old `isoseq3 dedup` tool as reference.
+*NOTE:* `isoseq groupdedup` is now the recommended deduplication tool that replaces the older, slower `isoseq dedup`. However some documentation figures might still refer to the old `isoseq dedup` tool as reference.
 
-This FAQ explains how `isoseq3 groupdedup` identifies two reads to be from the same founder molecule.
+This FAQ explains how `isoseq groupdedup` identifies two reads to be from the same founder molecule.
 
 
 ### Adjusting maximum mismatches and shifts
@@ -51,15 +51,15 @@ While rare, it is possible to have different transcript molecules share the same
 
 ### groupdedup only: cell barcode and real cells
 
-If using `isoseq3 groupdedup` (which is recommended over `isoseq3 dedup`), it can use the corrected cell barcodes from the `isoseq3 correct` step for grouping reads.
+If using `isoseq groupdedup` (which is recommended over `isoseq dedup`), it can use the corrected cell barcodes from the `isoseq correct` step for grouping reads.
 
 However, the BAM file must first be sorted by `CB` tag:
 ```
 samtools sort –t CB corrected.bam –o corrected.sorted.bam
-isoseq3 groupdedup corrected.bam dedup.bam 
+isoseq groupdedup corrected.bam dedup.bam 
 ```
 
-Additionally, `isoseq3 groupdedup` can use the `rc` tag from the `isoseq3 correct` step and apply to only real cells. This can be turned off with the option below (advanced, not recommended by default):
+Additionally, `isoseq groupdedup` can use the `rc` tag from the `isoseq correct` step and apply to only real cells. This can be turned off with the option below (advanced, not recommended by default):
 ```
   --keep-non-real-cells           Do not skip reads with non-real cells.
   ```
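
Since this commit renames the binary from `isoseq3` to `isoseq`, while bioconda keeps an `isoseq3` softlink, pipeline scripts that must run against both older and newer installs could resolve the binary name at runtime. The following is a minimal sketch under that assumption; the function name `pick_isoseq` is hypothetical and is not part of IsoSeq or bioconda.

```shell
# Hypothetical helper (not shipped with IsoSeq): print the first IsoSeq
# binary name found on PATH, preferring the renamed `isoseq` over the
# legacy `isoseq3` name.
pick_isoseq() {
    for candidate in isoseq isoseq3; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "neither isoseq nor isoseq3 found on PATH" >&2
    return 1
}
```

A script might then capture the result once, for example `ISOSEQ_BIN="$(pick_isoseq)" || exit 1`, and call `"$ISOSEQ_BIN" collapse <mapped.bam> <collapsed.gff>` with the same arguments shown in the docs above.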