OpenCRAVAT has a custom tab-separated input file format that can be used in place of VCF. Each row in a CRAVAT input file describes a genomic variant using the following columns, in order: Chromosome, Position, Strand, Reference-Base, Alternate-Base, and, optionally, Sample and Tag. The table below describes each field:
Column | Description | Example |
---|---|---|
Chromosome | The chromosome, prefixed with 'chr'. | 'chr22', 'chrX' |
Position | The 1-based position of the first affected nucleotide. | 112501307, 1804372 |
Strand | The strand the variant is on. Either '+' or '-'. | '+' / '-' |
Reference-Base | The affected nucleotide(s), or a '-' for an insertion. | 'G', 'AG', 'TTCC', '-' |
Alternate-Base | The alternate nucleotide(s), or '-' for a deletion. | 'A', 'TTC', '-' |
Sample | The sample identifier. | 's1', 's25' |
Tag | Optional: Arbitrary identifiers or category tags associated with the variant, delimited by semicolons. | 'var001', 'TR93;cancer' |
The following is a basic example of a CRAVAT input file:
chr2 112501307 + C A s1 var001
chr14 104770363 + T A s1 var002
chrX 71127984 + A G s2 var003
chr14 91974629 + T G s3 var004
chr12 57094662 + G T s4 var005
...
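As a rough illustration of the column layout described above, the sketch below splits one input row into typed fields. The names `InputVariant` and `parse_input_line` are hypothetical and not part of OpenCRAVAT, and the sketch assumes that Sample and Tag, when present, are the sixth and seventh tab-separated columns.

```python
from typing import NamedTuple, Optional


class InputVariant(NamedTuple):
    """Hypothetical container for one CRAVAT input row."""
    chrom: str
    pos: int
    strand: str
    ref_base: str
    alt_base: str
    sample: Optional[str]
    tags: list


def parse_input_line(line: str) -> InputVariant:
    """Split one tab-separated CRAVAT input row into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chrom, pos, strand, ref_base, alt_base = fields[:5]
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # Assumption: Sample and Tag are optional trailing columns.
    sample = fields[5] if len(fields) > 5 and fields[5] else None
    # Tags are delimited by semicolons within a single column.
    tags = fields[6].split(";") if len(fields) > 6 and fields[6] else []
    return InputVariant(chrom, int(pos), strand, ref_base, alt_base, sample, tags)


variant = parse_input_line("chr2\t112501307\t+\tC\tA\ts1\tvar001")
print(variant.chrom, variant.pos, variant.sample, variant.tags)
```

A row that stops after Alternate-Base is accepted as well; in that case `sample` is `None` and `tags` is an empty list.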
OpenCRAVAT uses a variety of text-based file formats to pass data internally between modules. Most of these internal files are temporary and are deleted at the end of a successful run. They can be preserved by passing the `--temp-files` flag to `oc run`.
In general, OpenCRAVAT files are tab-separated tabular text files with self-defined columns, similar to a VCF. They start with a series of comment lines describing the columns in the tabular section, followed by a header row for the table and then the table itself. A basic example can be seen here:
#column=0,Column0,col0,string
#column=1,Col 1,column_1,int
#column=2,Col-2,c2,float
#Column0 Col 1 Col-2
row1 1 1.0
row2 2 2.0
row3 3 3.0
Each column definition line holds four comma-separated values:
- Index: which column in the table this column definition refers to.
- Title: a display-only title for the column. Used as a header when presenting data to the user. Can be changed at any point without affecting OpenCRAVAT.
- Name: the internal name of the column, used to refer to it in code. Should only be changed carefully.
- Type: the type of data in this column. Data will be cast to this type when read from the file.
The header row for the table typically uses the column titles. It is not needed for OpenCRAVAT to function and is included for readability.
The data rows hold tab-separated values. Blank columns should be represented by an empty string.
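The layout described above (column definitions, header row, data rows) can be read with a few lines of Python. This is a minimal sketch, not OpenCRAVAT's own reader: `read_oc_table` and `CASTS` are hypothetical names, only the three type names shown in the example are handled, titles are assumed not to contain commas, and blank values are returned as `None` as an illustrative choice.

```python
import io

# Casts for the three type names used in the example; any other type
# name falls back to str in this sketch.
CASTS = {"string": str, "int": int, "float": float}


def read_oc_table(handle):
    """Return (columns, rows) from an OpenCRAVAT-style tabular text stream."""
    columns = {}  # index -> (title, name, type)
    rows = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("#column="):
            index, title, name, type_name = line[len("#column="):].split(",", 3)
            columns[int(index)] = (title, name, type_name)
        elif line.startswith("#") or not line:
            continue  # header row, other comment line, or empty line
        else:
            row = {}
            for i, value in enumerate(line.split("\t")):
                _title, name, type_name = columns[i]
                # An empty string marks a blank column; keep it as None.
                row[name] = CASTS.get(type_name, str)(value) if value != "" else None
            rows.append(row)
    return columns, rows


sample = io.StringIO(
    "#column=0,Column0,col0,string\n"
    "#column=1,Col 1,column_1,int\n"
    "#column=2,Col-2,c2,float\n"
    "#Column0\tCol 1\tCol-2\n"
    "row1\t1\t1.0\n"
    "row2\t\t2.0\n"
)
columns, rows = read_oc_table(sample)
print(rows)
```

Rows are keyed by the internal column name rather than the title, since the title is display-only and may change.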
crv files (.crv) are basic OpenCRAVAT files that describe variants based on their genomic position and effect. They are produced by OpenCRAVAT converters.
#column=0,UID,uid,int
#column=1,Chrom,chrom,string
#column=2,Position,pos,int
#column=3,Ref Base,ref_base,string
#column=4,Alt Base,alt_base,string
#UID Chrom Position Ref Base Alt Base
1 chr19 10156403 G C
2 chr7 140834746 A T
Name | Description | Type | Example(s) |
---|---|---|---|
uid | Unique id of variant. | int | 13 |
chrom | Chromosome | string | chr1, chr17, chrX |
pos | Genomic position of first affected base (1-based) | int | 1234 |
ref_base | Reference base(s) | string | A, AT, - |
alt_base | Alternate base(s) | string | G, GC, - |
Deletions are written with a ref of the bases to be deleted and an alt of '-'.
1 chr1 1234 A -
Insertions are written with a ref of '-' and an alt of the bases to be inserted.
1 chr1 1234 - A
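The '-' convention above makes it straightforward to tell insertions and deletions apart from the ref and alt columns alone. The helper below is a small sketch; `variant_kind` is a hypothetical name, and the label "substitution" for every other case is a simplification chosen for illustration.

```python
def variant_kind(ref_base: str, alt_base: str) -> str:
    """Classify a .crv row by its ref/alt bases, where '-' means no bases."""
    if ref_base == "-" and alt_base == "-":
        raise ValueError("ref and alt cannot both be '-'")
    if ref_base == "-":
        return "insertion"  # nothing in the reference, bases added
    if alt_base == "-":
        return "deletion"  # reference bases removed
    return "substitution"  # assumption: everything else is treated alike


print(variant_kind("A", "-"))  # deletion
print(variant_kind("-", "A"))  # insertion
print(variant_kind("G", "C"))  # substitution
```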
crx files (.crx) are an extended version of .crv files. They describe variants based on their effect on the genome, and also on genes, transcripts, and proteins. They are produced by OpenCRAVAT mappers.
#column=0,UID,uid,int
#column=1,Chrom,chrom,string
#column=2,Position,pos,int
#column=3,Ref Base,ref_base,string
#column=4,Alt Base,alt_base,string
#column=5,Hugo,hugo,string
#column=6,Transcript,transcript,string
#column=7,All Mappings,all_mappings,string
#UID Chrom Position Ref Base Alt Base Hugo Transcript All Mappings
1 chr19 10156403 G C DNMT1 ENST00000340748.8 {"DNMT1":[["P26358","P447A","MIS","ENST00000340748.8","C1339G"]]}
2 chr7 140834746 A T BRAF ENST00000288602.10 {"BRAF":[["P15056","S123T","MIS","ENST00000288602.10","T367A"]]}
The All Mappings column contains a JSON object describing the genes, transcripts, and proteins that a variant affects. It has the following schema:
{
  "gene": [
    [
      "protein 1",
      "amino acid change 1",
      "sequence ontology 1",
      "transcript 1",
      "rna change 1"
    ],
    [
      "protein 2",
      "amino acid change 2",
      "sequence ontology 2",
      "transcript 2",
      "rna change 2"
    ]
  ]
}
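To show how the schema above can be consumed, the following sketch flattens one All Mappings cell into one record per gene and transcript. `Mapping` and `parse_all_mappings` are hypothetical names; the field order is taken from the schema, and each inner list is assumed to have exactly five entries.

```python
import json
from typing import NamedTuple


class Mapping(NamedTuple):
    """Hypothetical record for one entry of the all_mappings object."""
    gene: str
    protein: str
    aa_change: str
    so: str
    transcript: str
    rna_change: str


def parse_all_mappings(text: str) -> list:
    """Flatten the all_mappings JSON object into one Mapping per entry."""
    mappings = []
    for gene, entries in json.loads(text).items():
        # Each entry is [protein, amino acid change, sequence ontology,
        # transcript, rna change], in that order.
        for protein, aa_change, so, transcript, rna_change in entries:
            mappings.append(Mapping(gene, protein, aa_change, so, transcript, rna_change))
    return mappings


cell = '{"DNMT1":[["P26358","P447A","MIS","ENST00000340748.8","C1339G"]]}'
for m in parse_all_mappings(cell):
    print(m.gene, m.transcript, m.so, m.aa_change)
# prints: DNMT1 ENST00000340748.8 MIS P447A
```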
Sequence ontologies are encoded with three letter abbreviations.
Abbv | Sequence Ontology |
---|---|
2KD | 2 Kb downstream from gene |
2KU | 2 Kb upstream from gene |
UT3 | In the 3' UTR |
UT5 | In the 5' UTR |
INT | In an intron |
UNK | Unknown sequence ontology |
SYN | Synonymous |
MIS | Missense |
CSS | Complex substitution |
IDV | Inframe deletion |
IIV | Inframe insertion |
STL | Stoploss |
SPL | Splice site affected |
STG | Stopgain |
FD2 | 2 base frameshift deletion |
FD1 | 1 base frameshift deletion |
FI2 | 2 base frameshift insertion |
FI1 | 1 base frameshift insertion |
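When reading .crx files in a script, the abbreviations in the table above can be expanded with a plain lookup. This is an illustrative sketch: `SEQUENCE_ONTOLOGY` and `describe_so` are hypothetical names, and falling back to the UNK description for unrecognized codes is an assumption made here, not documented OpenCRAVAT behavior.

```python
# Abbreviation -> description, copied from the table above.
SEQUENCE_ONTOLOGY = {
    "2KD": "2 Kb downstream from gene",
    "2KU": "2 Kb upstream from gene",
    "UT3": "In the 3' UTR",
    "UT5": "In the 5' UTR",
    "INT": "In an intron",
    "UNK": "Unknown sequence ontology",
    "SYN": "Synonymous",
    "MIS": "Missense",
    "CSS": "Complex substitution",
    "IDV": "Inframe deletion",
    "IIV": "Inframe insertion",
    "STL": "Stoploss",
    "SPL": "Splice site affected",
    "STG": "Stopgain",
    "FD2": "2 base frameshift deletion",
    "FD1": "1 base frameshift deletion",
    "FI2": "2 base frameshift insertion",
    "FI1": "1 base frameshift insertion",
}


def describe_so(abbreviation: str) -> str:
    """Expand a three-letter code; unrecognized codes map to the UNK text."""
    return SEQUENCE_ONTOLOGY.get(abbreviation, SEQUENCE_ONTOLOGY["UNK"])


print(describe_so("MIS"))  # Missense
```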
Every OpenCRAVAT annotator will produce an output file with the suffix `[annotatorName].var` for a variant-level annotator, and `[annotatorName].gen` for a gene-level annotator.
As an example, running the `vest` and `go` annotators on `input.vcf`:
oc run input.vcf -a vest go --temp-files
This will produce `input.vcf.vest.var` and `input.vcf.go.gen`.
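The naming rule can be written down as a one-line helper, which may be handy when a script needs to locate the preserved temporary files. `annotator_output_path` is a hypothetical name, and the caller is assumed to already know whether each annotator is variant-level or gene-level.

```python
def annotator_output_path(input_path: str, annotator: str, level: str) -> str:
    """Build the expected annotator output file name for an input file."""
    # 'variant' and 'gene' are labels chosen for this sketch.
    suffix = {"variant": "var", "gene": "gen"}[level]
    return f"{input_path}.{annotator}.{suffix}"


print(annotator_output_path("input.vcf", "vest", "variant"))  # input.vcf.vest.var
print(annotator_output_path("input.vcf", "go", "gene"))  # input.vcf.go.gen
```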
The annotator output files contain a header that defines the annotator's internal name, display name, and column definitions. Following the header are rows of tab-separated data values.
An example snippet from `input.vcf.vest.var` is as follows:
#name=vest
#displayname=VEST
#column=0,UID,uid,int
#column=1,VEST score transcript,transcript,string
#column=2,VEST score,score,float
#column=3,VEST p-value,pval,float
#column=4,VEST score (missense),score_mis,float
#column=5,VEST score (frameshift),score_fsv,float
#column=6,VEST score (inframe indel),score_inv,float
#column=7,VEST score (stop gain),score_stg,float
#column=8,VEST score (stop loss),score_stl,float
#column=9,VEST score (splice site),score_spl,float
#column=10,All transcripts,all_results,string
#column=11,HUGO,hugo,string
#no_aggregate=hugo
#UID VEST score transcript VEST score VEST p-value VEST score (missense) ...
1 ENST00000233336.6 0.773 0.0417 0.773 ...
2 ENST00000554848.5 0.707 0.06973 0.707 ...
3 ENST00000374080.7 0.143 0.65145 0.143 ...
4 ENST00000267622.8 0.541 0.16344 0.541 ...
5 ENST00000342556.6 0.321 0.31889 0.321 ...
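Besides column definitions, the header in this snippet carries `#key=value` lines such as `#name` and `#displayname`. The sketch below collects those into a dictionary; `read_annotator_metadata` is a hypothetical helper, and it assumes that every header line starts with `#` and that the table's header row contains no `=` character.

```python
def read_annotator_metadata(lines):
    """Collect '#key=value' header lines, skipping column definitions."""
    metadata = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            break  # first data row reached, so the header is over
        key, sep, value = line[1:].partition("=")
        # The table header row has no '=', so sep is empty and it is skipped.
        if sep and key != "column":
            metadata[key] = value
    return metadata


header = [
    "#name=vest\n",
    "#displayname=VEST\n",
    "#column=0,UID,uid,int\n",
    "#no_aggregate=hugo\n",
    "#UID\tVEST score transcript\n",
    "1\tENST00000233336.6\n",
]
print(read_annotator_metadata(header))
```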