Quality control for NGS reads using FastQC
conda is used to configure the pipeline environment.
# The conda environment can be named here
NID="pipeline_QC"
# Create a new environment
conda create -n $NID
# Activate the environment to install software
source activate $NID
# FastQC is distributed through the bioconda channel
conda install -c bioconda fastqc
# Deactivate it after the installation finishes
source deactivate $NID
NGS reads, either single-end (SE) or paired-end (PE), in FASTQ format with or without the .gz suffix, for example (using NA12878 as the sample ID):
- SE reads:
  NA12878.fq, NA12878.fq.gz; NA12878.fastq, NA12878.fastq.gz
- PE reads:
  NA12878_1.fq, NA12878_2.fq; NA12878_1.fq.gz, NA12878_2.fq.gz; NA12878_1.fastq, NA12878_2.fastq; NA12878_1.fastq.gz, NA12878_2.fastq.gz
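The naming convention above can be checked with a small suffix test. This is only a sketch: `is_fastq` is a hypothetical helper introduced here for illustration and is not part of fastqc.sh.

```shell
# is_fastq NAME: succeed if NAME ends in one of the suffixes listed above.
# (Hypothetical helper, not part of fastqc.sh.)
is_fastq() {
    case "$1" in
        *.fq|*.fq.gz|*.fastq|*.fastq.gz) return 0 ;;
        *) return 1 ;;
    esac
}

for f in NA12878.fq NA12878_1.fastq.gz NA12878.bam; do
    if is_fastq "$f"; then
        echo "$f: accepted"
    else
        echo "$f: skipped"
    fi
done
```

The first two names are reported as accepted and the .bam file as skipped.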
# Before working, the pipeline environment should be activated
source activate $NID
bash fastqc.sh $in $out $cpus
source deactivate $NID
- $in: either an input folder containing FASTQ files, or input FASTQ files
- $out: output folder
- $cpus: number of threads to use
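As a rough illustration of how the three arguments could map onto the FastQC command line, here is a minimal sketch of a wrapper function. It is an assumption about what a script like fastqc.sh might do, not the actual script: `run_fastqc` and the `FASTQC_BIN` override are hypothetical names, and only a single input path is handled. The `--outdir` and `--threads` options are FastQC's own.

```shell
# run_fastqc IN OUT CPUS
# Hypothetical helper; the real fastqc.sh may work differently.
run_fastqc() {
    in_path=$1
    out_dir=$2
    cpus=$3
    set --    # reuse "$@" as the list of input files
    if [ -d "$in_path" ]; then
        # Folder input: collect every file with a FASTQ suffix.
        for f in "$in_path"/*.fq "$in_path"/*.fq.gz \
                 "$in_path"/*.fastq "$in_path"/*.fastq.gz; do
            if [ -e "$f" ]; then set -- "$@" "$f"; fi
        done
    else
        # File input: pass the path through unchanged.
        set -- "$in_path"
    fi
    # FASTQC_BIN is an assumed override so the call can be dry-run with echo.
    "${FASTQC_BIN:-fastqc}" --outdir "$out_dir" --threads "$cpus" "$@"
}

# Dry run: print the arguments instead of launching FastQC.
FASTQC_BIN=echo
run_fastqc NA12878.fq.gz qc_out 2
# prints: --outdir qc_out --threads 2 NA12878.fq.gz
```

FastQC expects the output folder to exist already, so a real wrapper would typically run `mkdir -p` on it before calling `fastqc`.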
Two files are output for each sample (using NA12878 as the sample ID):
- NA12878_fastqc.html: the main report, for viewing
- NA12878_fastqc.zip: the zipped results, which can be parsed by scripts
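The output names follow from the input name: FastQC drops the .gz and .fq/.fastq suffixes and appends _fastqc. A minimal sketch of that rule follows, where `fastqc_outputs` is a hypothetical helper name used only for illustration. Since FastQC processes each file on its own, the two mates of a PE sample are expected to produce separate reports.

```shell
# fastqc_outputs FILE: print the report names expected for FILE.
# (Hypothetical helper, not part of fastqc.sh.)
fastqc_outputs() {
    base=$(basename "$1")
    base=${base%.gz}       # drop the compression suffix, if any
    base=${base%.fastq}    # drop .fastq ...
    base=${base%.fq}       # ... or .fq
    echo "${base}_fastqc.html"
    echo "${base}_fastqc.zip"
}

fastqc_outputs test_data/NA12878_1.fq.gz
# prints:
# NA12878_1_fastqc.html
# NA12878_1_fastqc.zip
```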
Every section of the .html file should be checked, especially:
- Basic Statistics: Encoding, Total sequences, Sequence length
- Per base sequence quality
- Per sequence quality scores
- Sequence Length Distribution
- Adapter Content
- FastQC assigns one thread to each file, so setting the thread count higher than the number of input files has no further effect.
- Encoding in Basic Statistics should be checked, especially for old NGS data, since older Illumina pipelines encoded qualities as Phred+64 rather than Phred+33.
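Because FastQC uses at most one thread per file, the useful thread count is the smaller of the requested threads and the number of input files. A minimal sketch, with `effective_threads` as a hypothetical helper name:

```shell
# effective_threads CPUS NFILES: print the number of threads FastQC
# can actually keep busy (one per file at most).
# (Hypothetical helper, not part of fastqc.sh.)
effective_threads() {
    if [ "$1" -gt "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

effective_threads 8 2   # prints 2: only two files to process
effective_threads 1 3   # prints 1: files are processed one after another
```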
All input and output files are stored in test_data/pipeline_QC.
- NA12878_chr1_2Mb.fastq.gz: gzip-compressed FASTQ file of SE reads
source activate $NID
# make sure you are in the right working directory
bash fastqc.sh test_data/pipeline_QC/NA12878_chr1_2Mb.fastq.gz test_data/pipeline_QC/output 1
source deactivate $NID
- NA12878_chr1_2Mb_fastqc.html: main output in HTML format
- NA12878_chr1_2Mb_fastqc.zip: a zipped folder containing all the data you will need
  - summary.txt is a brief summary
  - fastqc_data.txt is the result in text format
  - fastqc_report.html is the same as the main HTML output
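The two text files in the zip are tab-separated and easy to parse. The sketch below assumes the usual layout: each line of summary.txt holds a status (PASS, WARN or FAIL), the module name and the file name, and fastqc_data.txt contains an `Encoding` line. `flagged_modules` and `report_encoding` are hypothetical helper names, and the sample lines written to the temporary file are made up for illustration.

```shell
# flagged_modules [SUMMARY]: list modules whose status is not PASS.
# Reads standard input when no file is given.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "${1:--}"
}

# report_encoding [DATA]: print the Encoding value from fastqc_data.txt.
report_encoding() {
    awk -F '\t' '$1 == "Encoding" { print $2 }' "${1:--}"
}

# Illustrative input only (not real results).
tmp=$(mktemp)
printf 'PASS\tBasic Statistics\tNA12878_chr1_2Mb.fastq.gz\n' > "$tmp"
printf 'WARN\tAdapter Content\tNA12878_chr1_2Mb.fastq.gz\n' >> "$tmp"
flagged_modules "$tmp"   # prints "WARN: Adapter Content"
rm -f "$tmp"
```

With a real report, the files can be read straight from the zip without unpacking it, for example `unzip -p NA12878_chr1_2Mb_fastqc.zip NA12878_chr1_2Mb_fastqc/summary.txt | flagged_modules`.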
- FastQC: A quality control tool for high throughput sequence data.
- Using FastQC to check the quality of high throughput sequence (YouTube)
- FastQC documentation
- FastQC (GitHub)
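- Using FastQC to check the quality of raw sequences (in Chinese)
- 20160410 Sequencing analysis: quality control with FastQC (in Chinese)
- Checking the quality of raw next-generation sequencing data with FastQC (in Chinese)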
Yi Xianfu (yixfbio AT gmail DOT com)
GPL v3 or later