Skip to content

Latest commit

 

History

History
328 lines (273 loc) · 11.3 KB

File-Formats.rst

File metadata and controls

328 lines (273 loc) · 11.3 KB

File Formats

CRAVAT Input Format

OpenCRAVAT has a custom tab separated input file format, that can be used in place of vcf. Each row in a CRAVAT input file describes a genomic variant by the following sequential columns: Chromosome, Position, Strand, Reference-Base, Alternate-Base, and, optionally, Sample. The table below describes each field:

Columns

Column Description Example
Chromos ome The chromosome, prefixed with 'chr'. 'chr22 ', ``'chrX' ``
Positio n The 1-based position of the first affected nucleotide. 11250130 7, 1804372
Strand The strand the variant is on. Either '+' or '-'. '+'/ '-'
Referen ce-Base The affected nucleotide(s ), or a '-' for an insertion. 'G', 'AG' , 'TTCC' ``,\ ``' -'
Alterna te-Base The alternate nucleotide(s ), or '-' for a deletion. 'A', 'TTC'` `, ``'-'
Sample The sample identifier. 's1' , ``'s25'` `
Tag Optional: Arbitrary identifiers or category tags associated with the variant - delimited by semi-colon. 'var00 1', ``'TR93; cancer'` `

Example

The following is a basic example of a CRAVAT input file:

chr2    112501307   +   C   A   s1    var001
chr14   104770363   +   T   A   s1    var002
chrX    71127984    +   A   G   s2    var003
chr14   91974629    +   T   G   s3    var004
chr12   57094662    +   G   T   s4    var005
...

Internal Files

OpenCRAVAT uses a variety of text based file formats to pass data internally between modules. Most of these internal files are temporary, and are deleted at the end of a successful run. They can be preserved by passing the --temp-files flag to oc run .

In general, OpenCRAVAT files are tab separated tabular text files with self defined columns, similar to a vcf. They start with a series of comment lines describing the columns in the tabular section, then a header row for the table, then the table itself. A basic example can be seen here:

#column=0,Column0,col0,string
#column=1,Col 1,column_1,int
#column=2,Col-2,c2,float
#Column0    Col 1   Col-2
row1    1   1.0
row2    2   2.0
row3    3   3.0

column definition

The column definition lines define four commas separated values:

  1. Index: which column in the table this column definition refers to.
  2. Title: a display only title for the column. Used as a header when presenting data to the user. Can be changed at any point without affecting cravat.
  3. Name: the internal name of the column, used to refer to it in code. Should only be changed carefully.
  4. Type: The type of data in this column. Data will be cast to this type when read from the file.

header row

This is a header row for the table, typically using the column titles. It is not needed for OpenCRAVAT to function, and is included for readability.

table

Tab separated values. Blank columns should be represented by an empty string.

.crv Files

crv files (.crv) are basic OpenCRAVAT files that describe variants based on their genomic position and effect. They are produced by OpenCRAVAT converters.

crv example

#column=0,UID,uid,int
#column=1,Chrom,chrom,string
#column=2,Position,pos,int
#column=3,Ref Base,ref_base,string
#column=4,Alt Base,alt_base,string
#UID    Chrom   Position    Ref Base    Alt Base
1   chr19   10156403    G   C
2   chr7    140834746   A   T

crv columns

name Description Type Example(s)
uid Unique id of variant. int 13
chrom Chromosome string chr1, chr17, chrX
pos Genomic position of first affected base (1-based) int 1234
ref_base Reference base(s) string A, AT, -
alt_base Alternate base(s) string G, GC, -

Deletions are written with an ref of the bases to be deleted, and an alt of '-'.

1  chr1    1234    A   -

Insertions are written with an ref of '-' and an alt of the bases to be inserted.

1  chr1    1234    -   A

.crx Files

crx files (.crx) are an extended version of .crv files. They describe variants based on their affect on the genome, but also on genes, transcripts, and proteins. They are produced by OpenCRAVAT mappers.

crx example

#column=0,UID,uid,int
#column=1,Chrom,chrom,string
#column=2,Position,pos,int
#column=3,Ref Base,ref_base,string
#column=4,Alt Base,alt_base,string
#column=5,Hugo,hugo,string
#column=6,Transcript,transcript,string
#column=7,All Mappings,all_mappings,string

#UID    Chrom   Position    Ref Base    Alt Base    Hugo    Transcript  All Mappings
1   chr19   10156403    G   C   DNMT1   ENST00000340748.8   {"DNMT1":[["P26358","P447A","MIS","ENST00000340748.8","C1339G"]]}
2   chr7    140834746   A   T   BRAF    ENST00000288602.10  {"BRAF":[["P15056","S123T","MIS","ENST00000288602.10","T367A"]]}

All Mappings

The all mappings column contains a json object describing the genes, transcripts, and proteins that a variant affected. It has the following schema,

{
  "gene": [
    [
      "protein 1",
      "amino acid change 1",
      "sequence ontology 1",
      "transcript 1",
      "rna change 1"
    ],
    [
      "protein 2",
      "amino acid change 2",
      "sequence ontology 2",
      "transcript 2",
      "rna change 2"
    ]
  ]
}

Sequence ontologies are encoded with three letter abbreviations.

Abbv Sequence Ontology
2KD 2 Kb downstream from gene
2KU 2 Kb upstream from gene
UT3 In the 3' UTR
UT5 In the 5' UTR
INT In an intron
UNK Unknown sequence ontology
SYN Synonomous
MIS Missense
CSS Complex substitution
IDV Inframe deletion
IIV Inframe insertion
STL Stoploss
SPL Splice site affected
STG Stopgain
FD2 2 base frameshift deletion
FD1 1 base frameshift deletion
FI2 2 base frameshift insertion
FI1 1 base frameshift insertion

.var/.gen Files

Every OpenCRAVAT annotator will produce an output file with the suffix [annotatorName].var for a variant level annotator, and [annotatorName].gen for gene level.

As an example, running the vest and go annotators on input.vcf:

oc run input.vcf -a vest go --temp-files

Will produce input.vcf.vest.var and input.vcf.go.gen.

The annotator ouput files will contain a header that defines the annotator's internal name display name, and column definitions. Following the header will be rows of tab separated data values.

An example snippet from input.vcf.vest.var is as follows:

#name=vest
#displayname=VEST
#column=0,UID,uid,int
#column=1,VEST score transcript,transcript,string
#column=2,VEST score,score,float
#column=3,VEST p-value,pval,float
#column=4,VEST score (missense),score_mis,float
#column=5,VEST score (frameshift),score_fsv,float
#column=6,VEST score (inframe indel),score_inv,float
#column=7,VEST score (stop gain),score_stg,float
#column=8,VEST score (stop loss),score_stl,float
#column=9,VEST score (splice site),score_spl,float
#column=10,All transcripts,all_results,string
#column=11,HUGO,hugo,string
#no_aggregate=hugo
#UID    VEST score transcript   VEST score  VEST p-value    VEST score (missense) ...
1   ENST00000233336.6   0.773   0.0417  0.773 ...
2   ENST00000554848.5   0.707   0.06973 0.707 ...
3   ENST00000374080.7   0.143   0.65145 0.143 ...
4   ENST00000267622.8   0.541   0.16344 0.541 ...
5   ENST00000342556.6   0.321   0.31889 0.321 ...