Analysis-ready VCF at Biobank scale using Zarr

  • 1 Open Athena AI Foundation, 1245 Broadway, 16th Floor, New York, NY 10001, USA.
  • 2 Related Sciences, 1312 17th St PMB 76870, Denver, CO 80202, USA.
  • 3 Independent researcher.
  • 4 Tom White Consulting Ltd.
  • 5 Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, OX3 7LF, UK.
  • 6 The New Zealand Institute for Plant & Food Research Ltd, 74 Gerald Street, Lincoln 7608, New Zealand.
  • 7 Department of Biochemistry, School of Biomedical Sciences, University of Otago, 710 Cumberland Street, Dunedin North, Dunedin 9016, New Zealand.
  • 8 Our Future Health, 2 New Bailey, 6 Stanley Street, Manchester M3 5GS, UK.
  • 9 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  • 10 Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA.
  • 11 NVIDIA Ltd, 100 Brook Drive, Green Park, Reading RG2 6UJ, UK.
  • 12 Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  • 13 Genomic Surveillance Unit, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.
  • 14 Genomics England, One Canada Square, London, E14 5AB, UK.
  • 15 Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Husargatan 3, SE-752 37 Uppsala, Sweden.
  • 16 School of Computer Science, McGill University, Montreal, QC, H3A 2A7, Canada.


Abstract

BACKGROUND

Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable at this scale, and a more scalable approach is needed.

RESULTS

Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: $n$=78,195; Our Future Health: $n$=651,050; All of Us: $n$=245,394) along with whole genome datasets for Norway Spruce ($n$=1,063) and SARS-CoV-2 ($n$=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
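
The general idea of storing each field as its own chunked, independently compressed array can be illustrated with a minimal sketch. It uses only the Python standard library and is not the VCF Zarr specification or the Zarr API: the field names `genotype` and `quality`, the chunk length, and the helper functions are all hypothetical, chosen purely for illustration. The point it shows is that a per-variant calculation on one field decodes only that field's chunks, and that each chunk can be processed independently.

```python
import struct
import zlib

# Toy data (made up for illustration): 6 variants x 4 samples.
# Each genotype value is an alternate-allele count (0, 1, or 2).
GENOTYPES = [
    [0, 1, 2, 0],
    [0, 0, 0, 1],
    [2, 2, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 2, 0, 2],
]
QUALITY = [30.0, 12.5, 99.0, 50.0, 41.0, 8.0]

VARIANT_CHUNK = 2  # hypothetical chunk length along the variants dimension


def encode_field(rows, chunk_len, fmt):
    """Split one field into fixed-size chunks and compress each one separately."""
    chunks = []
    for start in range(0, len(rows), chunk_len):
        block = rows[start:start + chunk_len]
        flat = [value for row in block for value in row]
        packed = struct.pack(f"<{len(flat)}{fmt}", *flat)
        chunks.append(zlib.compress(packed))
    return chunks


def decode_chunk(chunk, n_cols, fmt):
    """Decompress a single chunk back into a list of rows."""
    raw = zlib.decompress(chunk)
    count = len(raw) // struct.calcsize(fmt)
    flat = struct.unpack(f"<{count}{fmt}", raw)
    return [list(flat[i:i + n_cols]) for i in range(0, count, n_cols)]


# Each field lives in its own list of compressed chunks (a stand-in for
# separate arrays in a store); fields are never interleaved per variant.
store = {
    "genotype": encode_field(GENOTYPES, VARIANT_CHUNK, "b"),
    "quality": encode_field([[q] for q in QUALITY], VARIANT_CHUNK, "f"),
}


def alt_allele_counts(store, n_samples):
    """Per-variant alternate-allele counts, touching only the genotype chunks."""
    counts = []
    for chunk in store["genotype"]:  # chunks are independent units of work
        for row in decode_chunk(chunk, n_samples, "b"):
            counts.append(sum(row))
    return counts


print(alt_allele_counts(store, n_samples=4))
```

In this sketch the `quality` chunks are never decompressed when computing allele counts, whereas a row-wise layout would require reading every field of every record. Because each chunk is compressed on its own, the loop over chunks could in principle be distributed across workers.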

CONCLUSIONS

Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
