MP_preSeq_FastQC

Description

Quality control for NGS reads using FastQC

Install and configure

conda is used to configure the pipeline environment.

# Name the conda environment here
NID="pipeline_QC"
# Create a new environment
conda create -n $NID
# Activate the environment to install software
source activate $NID
# FastQC is distributed through the bioconda channel; add "-c bioconda" if that channel is not configured
conda install fastqc
# Deactivate the environment once installation has finished
source deactivate $NID

Input files

Single-end (SE) or paired-end (PE) NGS reads in FASTQ format, with or without the .gz suffix. For example, with NA12878 as the sample ID:

SE reads: NA12878.fq, NA12878.fq.gz; NA12878.fastq, NA12878.fastq.gz
PE reads: NA12878_1.fq, NA12878_2.fq; NA12878_1.fq.gz, NA12878_2.fq.gz; NA12878_1.fastq, NA12878_2.fastq; NA12878_1.fastq.gz, NA12878_2.fastq.gz

Usage synopsis

Main usage

# Activate the pipeline environment before running the pipeline
source activate $NID
bash fastqc.sh $in $out $cpus
source deactivate $NID

Parameters explanation

$in:
- input folder containing FASTQ files, or
- input FASTQ files
$out: output folder
$cpus: number of threads to use
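
The contents of fastqc.sh are not shown in this README, so the following is only a minimal sketch of how a wrapper with these three parameters could be written. The function name `run_fastqc` is hypothetical; `-t` (threads) and `-o` (output directory) are FastQC's own options. The sketch assumes that only the four suffixes listed under "Input files" should be picked up from a folder, and it creates the output folder first as a precaution.

```shell
# Hypothetical sketch of a FastQC wrapper; not the actual fastqc.sh.
# Usage: run_fastqc <in> <out> [cpus]
run_fastqc() {
    local in="$1" out="$2" cpus="${3:-1}"

    # Create the output folder up front so FastQC has somewhere to write
    mkdir -p "$out" || return 1

    if [ -d "$in" ]; then
        # Folder input: pass every FASTQ file directly inside it to FastQC
        find "$in" -maxdepth 1 -type f \
            \( -name '*.fq' -o -name '*.fq.gz' \
               -o -name '*.fastq' -o -name '*.fastq.gz' \) \
            -exec fastqc -t "$cpus" -o "$out" {} +
    elif [ -f "$in" ]; then
        # Single-file input
        fastqc -t "$cpus" -o "$out" "$in"
    else
        echo "Input not found: $in" >&2
        return 1
    fi
}
```

With this sketch, `run_fastqc reads/ qc_out 4` would process every FASTQ file in `reads/`, and `run_fastqc NA12878.fq.gz qc_out 1` a single file.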

Output files

FastQC writes two files for each input FASTQ file. For example, with NA12878 as the sample ID:

NA12878_fastqc.html: the main report, for viewing in a browser
NA12878_fastqc.zip: an archive of the report data, suitable for parsing by scripts
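
To locate the reports from a script, the output name can be derived from the input name. The sketch below assumes that FastQC drops a trailing .gz and then a trailing .fastq or .fq before appending _fastqc, which matches the examples in this README. The helper name `fastqc_prefix` is hypothetical. Under this assumption each PE mate gets its own pair of files (for example NA12878_1_fastqc.html and NA12878_2_fastqc.html).

```shell
# Hypothetical helper: print the expected FastQC output prefix for one input file.
fastqc_prefix() {
    local name
    name=$(basename "$1")   # drop any leading directories
    name=${name%.gz}        # drop the compression suffix, if present
    name=${name%.fastq}     # drop .fastq, if present
    name=${name%.fq}        # drop .fq, if present
    printf '%s_fastqc\n' "$name"
}

# Example: the HTML report expected in folder "$out" for input "$in"
# report="$out/$(fastqc_prefix "$in").html"
```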

Key notes

QC

Every part of the .html report should be checked, especially:

Basic Statistics: Encoding, Total sequences, Sequence length
Per base sequence quality
Per sequence quality scores
Sequence Length Distribution
Adapter Content

Tricks

FastQC uses one thread per file, so setting the thread count higher than the number of input files brings no further speed-up.
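
Following this note, a script can cap the requested thread count at the number of input files. This is a small sketch only; the helper name `effective_threads` is hypothetical and not part of fastqc.sh.

```shell
# Hypothetical helper: print min(requested threads, number of input files).
# Usage: effective_threads <cpus> <file>...
effective_threads() {
    local cpus="$1"
    shift                   # the remaining arguments are the input files
    local nfiles="$#"
    if [ "$nfiles" -lt "$cpus" ]; then
        echo "$nfiles"
    else
        echo "$cpus"
    fi
}

# Example: requesting 8 threads for two PE files yields 2
# threads=$(effective_threads 8 NA12878_1.fq.gz NA12878_2.fq.gz)
```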

FAQ

Others

Check the Encoding field in Basic Statistics, especially for older NGS data, whose quality scores may use a different Phred offset from current data.
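
The encoding can also be read without opening the HTML report. The sketch below assumes the tab-separated layout of fastqc_data.txt, in which the Basic Statistics module contains a line starting with "Encoding". The helper name `fastqc_encoding` is hypothetical.

```shell
# Hypothetical helper: print the Encoding value from a fastqc_data.txt file.
fastqc_encoding() {
    # Fields are tab-separated; stop at the first "Encoding" line
    awk -F '\t' '$1 == "Encoding" { print $2; exit }' "$1"
}

# Example, reading straight from the zip archive (assumes unzip is installed):
# unzip -p NA12878_fastqc.zip NA12878_fastqc/fastqc_data.txt > fastqc_data.txt
# fastqc_encoding fastqc_data.txt
```

A value such as "Sanger / Illumina 1.9" indicates the now-standard Phred+33 offset; older Illumina labels would be a cue to check whether downstream tools need a quality-offset option.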

Usage example

All input and output files are stored in test_data/pipeline_QC.

Inputs

NA12878_chr1_2Mb.fastq.gz: gzip-compressed FASTQ file of SE reads

Run pipeline

source activate $NID
# Make sure you are in the right working directory
bash fastqc.sh test_data/pipeline_QC/NA12878_chr1_2Mb.fastq.gz test_data/pipeline_QC/output 1
source deactivate $NID

Outputs

NA12878_chr1_2Mb_fastqc.html: main output in HTML format
NA12878_chr1_2Mb_fastqc.zip: a zipped folder containing all the report data
- summary.txt is a brief summary
- fastqc_data.txt is the result in text format
- fastqc_report.html is identical to the main HTML output
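
For batch runs, summary.txt offers a quick way to list the modules that need attention. The sketch below assumes its usual layout of three tab-separated columns (status, module name, file name) with PASS, WARN, or FAIL as the status. The helper name `flagged_modules` is hypothetical.

```shell
# Hypothetical helper: list every module in a summary.txt whose status is not PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Example, reading straight from the zip archive (assumes unzip is installed):
# unzip -p NA12878_chr1_2Mb_fastqc.zip NA12878_chr1_2Mb_fastqc/summary.txt > summary.txt
# flagged_modules summary.txt
```

A WARN or FAIL here is a prompt to look at the corresponding plot in the HTML report, not a verdict on the data by itself.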

Reference

FastQC: A quality control tool for high throughput sequence data.
Using FastQC to check the quality of high throughput sequence (YouTube)
FastQC documentation
FastQC (GitHub)
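Using FastQC to check the quality of raw sequences (in Chinese)
20160410 Sequencing analysis: quality control with FastQC (in Chinese)
Checking the quality of raw next-generation sequencing data with FastQC (in Chinese)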

Author

Yi Xianfu (yixfbio AT gmail DOT com)

License

GPL v3 or later
