meta-bam

This bash script produces a "meta-bam" file by sampling reads from multiple RNA-seq libraries. This is done in 2 steps :

randomly sample a given number of reads from a group of fastq(gz) files.
align those reads to a reference genome using the STAR aligner.

Figure 1 shows 4 meta-bam files generated from large sets of RNA-seq samples.

Figure 1. IGV view of 727 Lung adenocarcinoma (LUAD) RNA-seq libraries from 2 studies (TCGA-LUAD and Seo et al. LUAD) combined into 4 meta-bam files. Top 2 lanes: samples from normal tissues, bottom 2 lanes: samples from tumor tissues.

Dependencies

This script requires :

reformat.sh from BBMap, for random sampling of reads
STAR Version 2.7.0f was tested, but all versions should work
samtools Both dependencies can be downloaded as so, or directly installed on a conda environment, using :

conda install -c bioconda star
conda install -c bioconda bbmap
conda install -c bioconda samtools

Preparation

STAR needs an indexed genome. To create an index, follow the steps in the STAR manual (http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf)

How to Run it ?

Once all dependencies are installed (or your conda environment activated), run the script as follows:

./meta-bam.sh dataType fastQList NbTotalReads referenceGenome

Command line parameters :

dataType = -SE or -PE, for single-end of paired-end data
fastQList = A text file containing paths to every fastq.gz to sample from. Can be single-end or paired-end data. Paired files should be consecutive in the list.
NbTotalReads = An integer: millions of reads to keep in the final BAM file (input 5 means 5,000,000). With option -PE: NbTotalReads are sampled in each R1 and R2 file, retaining read pairs.
referenceGenome = Path to the directory containing the genome index for STAR alignment

Example :

./meta-bam.sh -PE listFastQgz.txt 50 data/indexStar

Final folder hierarchy

Results can be found in the meta-bam-out folder :

├── meta-bam
│   ├── meta-bam.sh
│   ├── README.md
│   ├── listFastQgz.txt
│   ├── data
│   │   ├── indexStar
│   │   │   ├── SA
│   │   │   ├── SAindex
│   │   │   ├── {several other index files}
│   ├── meta-bam-out
│   │   ├── sample_R1.fq
│   │   ├── sample_R2.fq
│   │   ├── STARoutputLog.out
│   │   ├── STARoutputLog.progress.out
│   │   ├── STARoutputSJ.out.tab
│   │   ├── STARoutputLog.final.out
│   │   ├── STARoutputAligner.out.sam
│   │   ├── Aligned.bam
│   │   ├── Aligned.bam.bai

A typical fastQList file is provided in exampleFastqFile.txt. Note that the path to the actual fastq.gz files can be either relative or absolute. For paired-end data, each pair must be consecutive in the file.

Processing time

Runtime depends on the size of your input dataset (both individual size file and number of files) and sampling depth.

The random sampling needs to open and read each file entirely, so this step is usually the limiting one.

Example runtimes:

50M reads from 108 paired-end libraries of mean size 2Gb/file (gzipped) : Sampling : 5 hours, Alignment : 15 minutes
50M reads from 369 paired-end libraries of mean size 2Gb/file (gzipped) : Sampling : 15 hours, Alignment : 17 minutes

Sampling time for one file ranges from 4 minutes (2Gb fastq.gz) to 6 minutes (10Gb fastq.gz), with 1 million reads sampled.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
README.md		README.md
exampleListFastq.txt		exampleListFastq.txt
igv-metabam.png		igv-metabam.png
meta-bam.sh		meta-bam.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

meta-bam

Dependencies

Preparation

How to Run it ?

Final folder hierarchy

Processing time

About

Releases

Packages

Contributors 2

Languages

Transipedia/meta-bam

Folders and files

Latest commit

History

Repository files navigation

meta-bam

Dependencies

Preparation

How to Run it ?

Final folder hierarchy

Processing time

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages