Analysis-ready VCF at Biobank scale using Zarr

Eric Czech ¹,², Will Tyler ³, Tom White ⁴, Ben Jeffery ⁵, Timothy R Millar ⁶,⁷, Benjamin Elsworth ⁸, Jérémy Guez ⁹,¹⁰, Jonny Hancox ¹¹, Konrad J Karczewski ⁹,¹⁰,¹², Alistair Miles ¹³, Sam Tallman ¹⁴, Per Unneberg ¹⁵, Rafal Wojdyla ¹, Shadi Zabad ¹⁶, Jeff Hammerbacher ¹,², Jerome Kelleher ⁵

¹Open Athena AI Foundation, 1245 Broadway, 16th Floor, New York, NY 10001, USA.
²Related Sciences, 1312 17th St PMB 76870, Denver, CO 80202, USA.
³Independent researcher.
⁴Tom White Consulting Ltd.
⁵Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, OX3 7LF, UK.
⁶The New Zealand Institute for Plant & Food Research Ltd, 74 Gerald Street, Lincoln 7608, New Zealand.
⁷Department of Biochemistry, School of Biomedical Sciences, University of Otago, 710 Cumberland Street, Dunedin North, Dunedin 9016, New Zealand.
⁸Our Future Health, 2 New Bailey, 6 Stanley Street, Manchester M3 5GS, UK.
⁹Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
¹⁰Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA.
¹¹NVIDIA Ltd, 100 Brook Drive, Green Park, Reading RG2 6UJ, UK.
¹²Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
¹³Genomic Surveillance Unit, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.
¹⁴Genomics England, One Canada Square, London, E14 5AB, UK.
¹⁵Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Husargatan 3, SE-752 37 Uppsala, Sweden.
¹⁶School of Computer Science, McGill University, Montreal, QC, H3A 2A7, Canada.

Abstract

BACKGROUND

Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
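The access-pattern problem described above can be illustrated with a toy example. The sketch below is not taken from the paper: the three-variant VCF body, the sample names S1 to S3, and both helper functions are hypothetical. It only shows that reading one sample's genotypes from a row-wise layout touches every field of every record, whereas a field-wise layout lets the same values be read directly.

```python
# Toy VCF body (hypothetical data): nine fixed columns, then one genotype
# column per sample.
VCF_TEXT = """\
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1
chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/1\t0/1\t0/0
chr1\t300\t.\tG\tA\t70\tPASS\t.\tGT\t1/1\t0/0\t0/1
"""


def genotypes_row_wise(vcf_text, sample):
    """Row-wise access: every record is read and split to reach one column."""
    lines = vcf_text.strip().split("\n")
    header = lines[0].split("\t")
    col = header.index(sample)
    fields_touched = 0
    calls = []
    for line in lines[1:]:
        fields = line.split("\t")
        fields_touched += len(fields)  # all fields are parsed, one is used
        calls.append(fields[col])
    return calls, fields_touched


def to_columns(vcf_text):
    """Transpose the records into one list per field (a field-wise layout)."""
    lines = vcf_text.strip().split("\n")
    header = lines[0].lstrip("#").split("\t")
    rows = [line.split("\t") for line in lines[1:]]
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}


row_calls, touched = genotypes_row_wise(VCF_TEXT, "S2")
columns = to_columns(VCF_TEXT)
col_calls = columns["S2"]  # one lookup, no other field is read
```

In this toy case the row-wise reader parses all twelve fields of each record to return three genotype calls. The ratio of parsed to used fields grows with the number of samples and fields, which is the inefficiency the abstract refers to.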

RESULTS

Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of 3 large human datasets (Genomics England: $n$=78,195; Our Future Health: $n$=651,050; All of Us: $n$=245,394) along with whole genome datasets for Norway Spruce ($n$=1,063) and SARS-CoV-2 ($n$=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
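To make the idea of chunked multidimensional storage more concrete, the following sketch works out which chunks of a two-dimensional genotype array a query would need to read. It is plain Python and does not use the Zarr library or the VCF Zarr specification; the array dimensions, the chunk shape, and the function names `chunk_ids` and `chunks_needed` are assumptions chosen only for illustration.

```python
def chunk_ids(start, stop, chunk_len):
    """Indices of the chunks along one dimension that overlap [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)


def chunks_needed(variant_range, sample_range, chunk_shape):
    """(variant_chunk, sample_chunk) pairs overlapping a rectangular query."""
    v_ids = chunk_ids(variant_range[0], variant_range[1], chunk_shape[0])
    s_ids = chunk_ids(sample_range[0], sample_range[1], chunk_shape[1])
    return [(i, j) for i in v_ids for j in s_ids]


# Hypothetical genotype array: 100,000 variants x 10,000 samples,
# stored in chunks of 10,000 variants x 1,000 samples.
n_variants, n_samples = 100_000, 10_000
chunk_shape = (10_000, 1_000)
total_chunks = (n_variants // chunk_shape[0]) * (n_samples // chunk_shape[1])

# All variants for the first 500 samples: one column of chunks.
sample_subset = chunks_needed((0, n_variants), (0, 500), chunk_shape)

# Variants 25,000 to 34,999 for all samples: two rows of chunks.
region = chunks_needed((25_000, 35_000), (0, n_samples), chunk_shape)
```

Under these assumed dimensions, either query reads only a fraction of the array's chunks, and each chunk can be fetched and decoded independently. That independence is what makes a chunked layout a natural fit for parallel processing.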

CONCLUSIONS

Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
