Skip to content

zhengxwen/gds2bgen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

gds2bgen: Format Conversion from BGEN to GDS

GPLv3 GNU General Public License, GPLv3

Description

This package provides functions for format conversion from bgen files to SeqArray GDS files.

Version

v0.9.3

Package Maintainer

Dr. Xiuwen Zheng ([email protected])

Installation

Requires R (≥ v3.5.0), gdsfmt (≥ v1.20.0), SeqArray (≥ v1.24.0)

  • Installation from Github:
library("devtools")
install_github("zhengxwen/gds2bgen")

The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

Or manually intall the package

git clone https://github.com/zhengxwen/gds2bgen
cd gds2bgen/src
unzip bgen_v1.1.8.zip
cd bgen_v1.1.8
python2 ./waf configure
python2 ./waf
cp build/libbgen.a ..
cp build/3rd_party/zstd-1.1.0/libzstd.a ..
rm -rf build
sleep 1; touch ../libbgen.a
cd ../../..
R CMD INSTALL gds2bgen

Copyright Notice

This package includes the sources of the bgen library (https://enkre.net/cgi-bin/code/bgen/dir?ci=trunk), Boost (the C++ libraries, https://www.boost.org) and Zstandard (https://zstd.net).

Citations for GDS

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.

Examples

library(gds2bgen)

seqBGEN_Info()  # bgen library version
## "bgen_lib_v1.1.8"

bgen_fn <- system.file("extdata", "example.8bits.bgen", package="gds2bgen")
# or bgen_fn <- "your_bgen_file.bgen"
seqBGEN_Info(bgen_fn)

## File: gds2bgen/extdata/example.8bits.bgen
## # of samples: 500
## # of variants: 199
## Compression method: zlib
## Layout version: v1.2
## Unphased: TRUE
## # of bits: 8
## Ploidy: 2
## sample id: sample_001, sample_002, sample_003, sample_004, ...


# example.8bits.bgen ==> example.gds, using 4 cores
seqBGEN2GDS(bgen_fn, "example.gds",
    storage.option="LZMA_RA",  # compression option, e.g., ZIP_RA for zlib or LZ4_RA for LZ4
    float.type="packed8",      # 8-bit packed real numbers
    geno=FALSE,     # 2-bit integer genotypes, stored in 'genotype/data'
    dosage=TRUE,    # numeric alternative allele dosages, stored in 'annotation/format/DS'
    prob=FALSE,     # numeric genotype probabilities, stored in 'annotation/format/GP'
    parallel=4      # the number of cores
)


# show file structure
library(SeqArray)
(f <- seqOpen("example.gds"))
seqClose(f)

## File: example.gds (137.7K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 500 LZMA_ra(7.02%), 393B } *
## |--+ variant.id   { Int32 199 LZMA_ra(33.9%), 277B } *
## |--+ position   { Int32 199 LZMA_ra(60.6%), 489B } *
## |--+ chromosome   { Str8 199 LZMA_ra(15.7%), 101B } *
## |--+ allele   { Str8 199 LZMA_ra(11.8%), 101B } *
## |--+ genotype   [  ] *
## |--+ phase   [  ]
## |--+ annotation   [  ]
## |  |--+ id   { Str8 199 LZMA_ra(18.6%), 321B } *
## |  |--+ qual   { Float32 199 LZMA_ra(11.8%), 101B } *
## |  |--+ filter   { Int32 199 LZMA_ra(11.3%), 97B } *
## |  |--+ info   [  ]
## |  \--+ format   [  ]
## |     |--+ DS   [  ] *
## |     |  \--+ data   { PackedReal8U 500x199 LZMA_ra(55.6%), 54.0K } *
## \--+ sample.annotation   [  ]

Also See

seqVCF2GDS() in the SeqArray package, conversion from VCF files to GDS files.

seqBED2GDS() in the SeqArray package, conversion from PLINK BED files to GDS files.