Overall Plan for MVP Whole Genome Structural Variant and PheWAS Analysis on Polaris and Frontier

We will have access to approximately 10,000 whole-genome sequences from the MVP project. In the past, the VA was able to perform mapping and SNP calling using the Trellis platform. However, SV calling at this scale has not been attempted before. Analyzing whole-genome sequences is computationally intensive and time-consuming, so we would like to use ALCF's Polaris machine to run the NGS analysis workflows. We will also have the opportunity to perform a population analysis, concluding with a PheWAS analysis of the whole-genome results.

Below is a draft of the steps needed to achieve the goals of this project. Please modify as appropriate.

Get familiar with Polaris

Get user accounts
Login
```
ssh [email protected]
```

Submit interactive and script jobs

qsub -A covid-ct -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug
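
Once script jobs are submitted in bulk, the same request can be assembled programmatically. The following is a minimal sketch that uses only the `qsub` options shown above; the helper name `build_qsub_command` and the script name `run_parabricks.sh` are hypothetical.

```python
import shlex


def build_qsub_command(account, queue, walltime, filesystems,
                       nodes=1, script=None):
    """Return a qsub argument list; interactive when no script is given."""
    cmd = ["qsub", "-A", account]
    if script is None:
        cmd.append("-I")  # interactive session, as in the command above
    cmd += [
        "-l", f"select={nodes}",
        "-l", f"walltime={walltime}",
        "-l", f"filesystems={filesystems}",
        "-q", queue,
    ]
    if script is not None:
        cmd.append(script)  # batch mode: the job script is the last argument
    return cmd


interactive = build_qsub_command("covid-ct", "debug", "1:00:00", "home:eagle")
batch = build_qsub_command("covid-ct", "debug", "1:00:00", "home:eagle",
                           script="run_parabricks.sh")  # hypothetical script

print(shlex.join(interactive))
print(shlex.join(batch))
```

On a login node, either list could be passed to `subprocess.run` unchanged. Keeping the command as a list rather than a single string avoids shell-quoting problems.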

Download data [progress]:
- 1KG Whole genome datasets
- Reference hg38
Gather a set of NGS and PheWAS/GWAS tools to test on Polaris. Each tool will most likely run into its own issues, which will have to be handled case by case.

Alignment
- NVIDIA Clara-Parabricks
SNP Callers
- DeepVariant - included with Parabricks
- GATK HaplotypeCaller - included with Parabricks
SV Callers
- SVision - This tool is designed for long-read sequencing data (e.g., PacBio, Oxford Nanopore), for which the authors already provide trained models. We can test it with the BAM file we generate from Parabricks.
- AstraZeneca tools
- GATK-sv
- DeepSV
- MANTA - consider modifications made here
- DeepSVFilter - Tool used to filter SV calls from other SV callers (e.g., Delly, Lumpy, Manta)
- Cue
- Delly
- svtools
- Absynthe
- Breakdancer
- BreakSeq
- CNVNator
- Lumpy
- Manta
- Smoove
- Tardis
- Whaning
- Graphtyper
- PAV
- ConsensusSV
- Parliament2
Annotations
- Annotate with VEP
Population analysis
- SAIGE
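
This plan has SAIGE, listed above under population analysis, running on PGEN files converted from the VCFs. The following is a minimal sketch of that conversion. It assumes PLINK 2 (`plink2`) as the converter, which the plan does not specify; the file names and the helper `build_pgen_command` are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def build_pgen_command(vcf_path, out_dir):
    """Build a plink2 call that writes <out_dir>/<name>.pgen/.pvar/.psam."""
    vcf = PurePosixPath(vcf_path)
    name = vcf.name
    # Strip the VCF extension to obtain the output prefix.
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    prefix = PurePosixPath(out_dir) / name
    return ["plink2", "--vcf", str(vcf), "--make-pgen", "--out", str(prefix)]


# Hypothetical per-chromosome VCF; real paths depend on the pipeline layout.
cmd = build_pgen_command("calls/chr22.vcf.gz", "pgen")
print(shlex.join(cmd))
```

Building one command per VCF makes it straightforward to hand the conversions to whichever submission engine is chosen.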

Test each of the tools from the previous set
- NVIDIA Clara-Parabricks - Successfully ran workflow on low-coverage and 30X whole-genome FASTQ sequences [Progress]
- SVision - Build tool and test on 30X whole-genome BAM file [Progress].
Evaluate outputs - Determine which set of tools is best for our analysis, while staying flexible enough to shift focus if new tools come up.
Create/test submission engine (e.g., Parsl, Balsam)
- Look here
- Parsl on Polaris
Create workflow for submitting the genomes
Generate statistics on runtime to determine how large our allocation on Polaris should be
Start process to move MVP whole-genome data
Start processing MVP whole-genome through workflow pipeline
Convert VCFs to PGENs for use with SAIGE
Share VCFs as results become available
Perform post-processing analysis on VCFs (e.g., QC, annotations)
Write paper on how VCFs were generated, what was found (computational and science?)
Set up SAIGE on Polaris
Run SAIGE for MVP WG PGEN data
- Use link to create phenotypes
Perform QC on SAIGE analysis
Post-process analysis - Will need Anurag and Jenny for this
- What is different compared to gwPhewas analysis?
- Novel SNPs, SVs (quantities)
Share data
Write paper on findings
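
The runtime-statistics step in the plan feeds directly into the allocation request. As a rough sketch of that arithmetic, the mean per-genome wall time from pilot runs can be scaled to the full cohort with a margin for reruns. The timings below are placeholders, not measurements from Polaris; the function name and the 20% safety margin are assumptions.

```python
import math
from statistics import mean


def estimate_node_hours(per_genome_hours, n_genomes,
                        nodes_per_job=1, safety_margin=0.2):
    """Scale the mean measured per-genome wall time to the whole cohort."""
    if not per_genome_hours:
        raise ValueError("need at least one pilot timing")
    mean_hours = mean(per_genome_hours)
    base = mean_hours * nodes_per_job * n_genomes
    # Round up so the request never falls short of the estimate.
    return math.ceil(base * (1 + safety_margin))


pilot_hours = [1.5, 2.0, 2.5]  # placeholder wall times (hours) from test runs
total = estimate_node_hours(pilot_hours, n_genomes=10_000)
print(f"Estimated request: {total} node-hours")
```

Each stage of the pipeline (alignment, SNP calling, each SV caller) would need its own timings, with the per-stage estimates summed for the final request.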

Summary of SV callers

| Caller | Types of SVs | AI-based? | Actively developed? | Programming environment | GPU acceleration? |
|---|---|---|---|---|---|
| BreakDancer^* | Deletions, insertions, inversions, intra-chromosomal and inter-chromosomal translocations | N | N | C++ | N |
| BreakSeq | Insertions, deletions, translocations, inversions, duplications | N | N | Python | N |
| ClipCrop | Insertions, deletions, translocations, inversions, duplications | N | N | Node.js | N |
| CREST | Insertions, deletions, translocations, inversions, duplications | N | N | Perl | N |
| DELLY | Deletions, inversions, duplications, inter-chromosomal translocations | N | Y | C++ | N |
| GRIDSS | Insertions, deletions, translocations, inversions, duplications | N | N | Java/R | N |
| Gustaf | Deletions, inversions, duplications, translocations | N | N | C++ | N |
| LUMPY | Deletions, duplications, inversions, translocations | N | Y (June 2022) | C++ | N |
| Manta | Insertions, deletions, translocations, inversions, duplications | N | N | C++ | N |
| Meerkat | Insertions, deletions, translocations, inversions, duplications | N | N | Perl | N |
| Pindel | Insertions, deletions, translocations, inversions, duplications | N | N | C++ | N |
| TARDIS | Tandem and interspersed segmental duplications | N | N | C | N |
| TIGRA | Insertions and deletions | N | N | C++ | N |
| Ulysses | Insertions, deletions, translocations, inversions, duplications | N | N | Python/R | N |
| SvABA | Insertions, deletions, somatic rearrangements | N | N | C++/R | N |
| Socrates | | N | N | Java | N |
| SVSeq2 | | N | N | - | N |
| Cue | Deletions, tandem duplications, inversions, deletion-flanked inversions, inverted duplications larger than 5 kbp | Y | Y | Python | Y |
| Strvctvre | Deletions and duplications | Y | N | Python | - |
| Dysgu | | Y | Y | Python | - |
| CNNgeno | Deletions | Y | N | Python | Y |
| DeepSV | Deletions | Y | N | Python | Y |
| sv-channels | Deletions | Y | Y | Python/R | Y |

^* BreakDancer has two modes, BreakDancerMax and BreakDancerMini. While the former is for large SVs, the latter is designed for calling small indels (of 10-100 base pairs) using normally mapped read pairs.
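
Many of the callers above report the same SV types, so calls from one caller can be cross-checked against another, in the spirit of the consensus tools listed earlier (ConsensusSV, Parliament2). The following is a minimal sketch of one common matching rule: 50% reciprocal overlap between intervals of the same type on the same chromosome. It is not the algorithm of any specific tool; the `SV` tuple, the function names, and the coordinates are illustrative.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    svtype: str


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def supported_calls(calls, other_calls, min_overlap=0.5):
    """Keep calls matched by at least one call from the other caller."""
    return [c for c in calls
            if any(reciprocal_overlap(c, o) >= min_overlap
                   for o in other_calls)]


# Illustrative deletions from two hypothetical callers.
caller_a = [SV("chr1", 1000, 2000, "DEL"), SV("chr1", 5000, 5600, "DEL")]
caller_b = [SV("chr1", 1100, 2100, "DEL"), SV("chr1", 5500, 9000, "DEL")]

print(supported_calls(caller_a, caller_b))
```

The pairwise scan is quadratic in the number of calls. For real callsets, sorting by position or using an interval index would be the natural next step.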

References

This is what others are doing:

Take a look at what the Broad folks are doing here. They are calling whole genomes using the Broad workflow and SV calling is being done in a consensus manner.
Genomics England's very first initiative – sequencing 100,000
