Skip to content

Latest commit

 

History

History
233 lines (126 loc) · 6.7 KB

practical2.md

File metadata and controls

233 lines (126 loc) · 6.7 KB

Practical 2

In this practical you will get to know more tools for variant calling. Furthermore, VCF files will be annotated, filtered, and displayed. Please check out the Useful information section.

Varscan variant calling

(*) Convert to mpileup

samtools mpileup -B -f ${GEN_REF} deduprg.bam > <pileup.file>

(*) Call SNPs

java -jar <varscan.jar> mpileup2snp <pileup.file> --output-vcf 1 > varscan_snp.vcf

(*) Call Indels

java -jar <varscan.jar> mpileup2indel <pileup.file> --output-vcf 1 > varscan_indel.vcf

(*) Investigate result

#Perform the same procedures as done for samtools.
#How many SNPs and indels were called?

GATK variant calling

(*) Known indel sites are here specified as variables - either copy the whole path or use variables as well

KNOWN_INDELS_1="1000G_phase1.indels.hg19.vcf"
KNOWN_INDELS_2="Mills_and_1000G_gold_standard.indels.hg19.vcf"

(*) Realignment target creator

java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fasta -nt 8 -L target.bed \
-I deduprg.bam -known ${KNOWN_INDELS_1} -known ${KNOWN_INDELS_2} -o target_intervals.list

(*) Perform realignment

java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T IndelRealigner -R hg19.fasta -I deduprg.bam \
-targetIntervals target_intervals.list -known ${KNOWN_INDELS_1} -known ${KNOWN_INDELS_2} -o dedup_rg_real.bam

(*) Base quality recalibration

java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fasta -I dedup_rg_real.bam \
-knownSites ${KNOWN_INDELS_1} -knownSites ${KNOWN_INDELS_2} -o recal_data_table.txt -L target.bed 
--maximum_cycle_value 800

(*) Second pass of recalibration

 java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fasta -I dedup_rg_real.bam \
 -knownSites ${KNOWN_INDELS_1} -knownSites ${KNOWN_INDELS_2} -o post_recal_data_table.txt \
 -L target.bed --maximum_cycle_value 800 -BQSR recal_data_table.txt 

(*) Generate before after plots (requires R and ggplot2)

Fix missing R packages
R
Inside R call
install.packages(c('reshape','gplots','gsalib'))

java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T AnalyzeCovariates -R hg19.fasta -L target.bed 
-before recal_data_table.txt -after post_recal_data_table.txt -plots recalibration_plots.pdf

(*) Print recalibrated reads

java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T PrintReads -R hg19.fasta -L target.bed -I dedup_rg_real.bam 
-BQSR recal_data_table.txt -o dedup_rg_real_recal.bam

(*) Now do variant calling

java -Xmx8g -jar /bcga2016/GATK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fasta -nct 8 -L target.bed 
-I dedup_rg_real_recal.bam --genotyping_mode DISCOVERY -o gatk.vcf

(*) Questions

  • Check out the before and after plots.
  • How many variants were called?

Merge VCFs and VCF stats

(*) VCFlib - merge

vcfcombine freebayes.vcf gatk.vcf samtools.vcf > vcf_lib_merged.vcf

(*) VCFlib - stats - shown here for one VCF file - repeat for all 3

vcfstats freeb_call.vcf > freeb_call_vcf_lib_stats.txt

(*) VCFtools

export PERL5LIB=<full-path>/vcftools_0.1.12a/perl
export PATH=<full-path>/tabix/tabix-0.2.6:$PATH

## Index (tabix) and pack files
cp gatk.vcf gatk_tab.vcf
bgzip gatk_tab.vcf
tabix -p vcf gatk_tab.vcf.gz

## repeat for the other two VCF files

vcf-merge freebayes_tab.vcf.gz gatk_tab.vcf.gz samtools_tab.vcf.gz > vcf_tools_merged.vcf
vcf-stats freebayes_tab.vcf.gz > freebayes_tab_stats.txt


All commands at once
export PERL5LIB=/bcga2016/vcftools-0.1.14/share/perl5
export PATH=/bcga2016/tabix-0.2.6:$PATH

cp freebayes.vcf freebayes_tab.vcf
bgzip freebayes_tab.vcf 
tabix -p vcf freebayes_tab.vcf.gz 
cp samtools.vcf samtools_tab.vcf
bgzip samtools_tab.vcf 
tabix -p vcf samtools_tab.vcf.gz 
module add vcftools-0.1.14 
vcf-merge samtools_tab.vcf.gz freebayes_tab.vcf.gz > merged.vcf

Filter variants

(*) Using vcfutils

 <full-path>/bcftools/bin/vcfutils.pl varFilter -Q 20 -d 5 -D 200 samtools.vcf > samtools_filtered.vcf

(*) Questions

  • What other parameters can you specify for filtering variants?
  • How many variants were filtered?

Display files in IGV

(Download and open) IGV
Load the BAM file and the VCF files into IGV
Look at the mapping on Chr 11
Check out the results of the different variant calling programs.

Annovar

(*) First convert vcf into Annovar format

/bcga2016/annovar/convert2annovar.pl -format vcf4 -includeinfo freebayes.vcf > freebayes.avinput
or use
/bcga2016/annovar/convert2annovar.pl -format vcf4old -includeinfo merged.vcf > merged.avinput

(*) Annotate with Gene information

/bcga2016/annovar/annotate_variation.pl -geneanno -buildver hg19 freebayes.avinput /bcga2016/annovar/humandb/

(*) Download additional databases /bcga2016/annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb26_all /bcga2016/annovar/humandb/ ... ...

(*) Annotate with Region information - ljb23

 <annovar-path>/annotate_variation.pl -regionanno -dbtype ljb23_all -buildver hg19 
 freebayes.avinput /home/stephan/bin/annovar/annovar/humandb/

(*) Annotate with Region information - snp138

 <annovar-path>/annotate_variation.pl -regionanno -dbtype snp138 -buildver hg19 
 freebayes.avinput /home/stephan/bin/annovar/annovar/humandb/

(*) Try converting the output files back to VCF (check if the appropriate columns are selected in cut)

 cat <annovar.file> | cut -f 9-18

(*) Notes

Use table_annovar.pl which supports VCF input and outputs annotations in VCF file
To create a nice human readable annovar output read the following:
http://annovar.openbioinformatics.org/en/latest/user-guide/startup/#table_annovarpl

(*) Questions

  • Look at the annotated VCF files.
  • What databases does "ljb23" include?

SeattleSeq Annotation

(*) Access
http://snp.gs.washington.edu/SeattleSeqAnnotation138/

(*) Annotate VCF file

Upload VCF file
Specify VCF as return type
submit
You should receive an annotated VCF file to the specified email address

Useful information

(*) Determine number of cores

cat /proc/cpuinfo  

(*) Make file executable

chmod +x <file.name>

(*) Extract information from a file

grep -i "info" <file.name>

(*) Extract information from a file excluding the pattern and display the first 100 lines

grep -v "^#" <file.name> | head -n 100