-
Notifications
You must be signed in to change notification settings - Fork 14
Genotyping Structural Variants
toil-vg can be used to estimate the genotypes of a given set of variants from short read alignment data from a single sample. The process is as follows:
- Create and index graph from VCF using
toil-vg construct
- Map reads to graph using
toil-vg map
- Compute genotypes for each variant in the input VCF using
toil-vg call
This example shows how to genotype chromosome 22 from HG002 using the GIAB VCF. It can be run locally on most systems. It takes roughly 40 minutes overall on my 8-core I7 desktop. It is straightforward to specify additional chromosomes on the command lines, though this is best done on the cloud. Some of the Toil options required for doing so are explained in the toil-vg README. Example scripts for whole-genome SV genotyping on AWS can be found here.
The sample we are genotyping, HG002, is in the original VCF. This is not a requirement, as any sample can be genotyped against any VCF. Using HG002 allows us to easily compute the accuracy (of vg versus the GIAB calls) in the final step, however.
In general it is best to represent insertions and deletions explicitly in the input VCF if possible. vg supports symbolic alleles (enabled with --handle_svs
in toil-vg construct
), but it often doesn't work in practice as the VCF spec is so loose, and the interface is not presently enabled in toil-vg call
. This is not an issue for GIAB as it does not use symbolic alleles.
We begin by constructing and indexing the GIAB SV chr22 graph. GCSA indexing requires several gigs of scratch space, so the --workDir .
option below my need to be adjusted accordingly:
rm -rf ./jobStore ; toil-vg construct ./jobStore . --vcf ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz --fasta ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz --regions 22 --out_name GIAB --xg_index --gcsa_index --realTimeLogging --pangenome --flat_alts --handle_svs --alt_path_gam_index --validate --normalize --merge_graphs --gcsa_index_cores $(nproc) --workDir .
Next, we extract 30X chromosome 22 reads from the GIAB BAM.
samtools view ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.hs37d5.300x.bam 22 -s 1.10 -h -O BAM > HG002_chr22_30X.bam
Then we map these reads to the graph to produce a GAM.
rm -rf jobStore ; toil-vg map ./jobStore HG002 ./GIAB.xg ./GIAB.gcsa . --bam_input_reads ./HG002_chr22_30X.bam --interleaved --alignment_cores $(nproc) --single_reads_chunk --realTimeLogging
We can then genotype the VCF. Note that the VCF we genotype (specified below with --genotype_vcf
) must be the same VCF, or a subset thereof, that was used to construct the graph. Running the command below without specifying the --genotype_vcf
or --alt_path_gam
options will call small variants from the reads in addition to SVs in the graph.
rm -rf jobStore ; toil-vg call ./jobStore ./GIAB.xg HG002 . --chroms 22 --gams HG002_default.gam --genotype_vcf ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz --alt_path_gam ./GIAB_alts.gam --realTimeLogging --call_chunk_cores $(nproc)
The output genotypes will be in HG002.vcf.gz. They can be compared back to the original GIAB genotypes as follows
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
bcftools view HG002_SVs_Tier1_v0.6.vcf.gz -t 22 -O z > giab_chr22.vcf.gz
tabix -f -p vcf giab_chr22.vcf.gz
rm -rf jobStore ; toil-vg vcfeval ./jobStore . --vcfeval_baseline ./giab_chr22.vcf.gz --call_vcf ./HG002.vcf.gz --genotype_eval --sveval --vcfeval_sample HG002 --realTimeLogging
The evaluation results will be in sv_accuracy.tsv
and sv_evaluation.tar.gz
cat sv_accuracy.tsv
type TP TP.baseline FP FN precision recall F1
Total 536 549 387 300 0.5807 0.6466 0.6119
INS 278 283 214 161 0.565 0.6374 0.599
DEL 258 266 173 139 0.5986 0.6568 0.6264