Methods for constructing and evaluating consensus genomic interval sets

Affiliations
  • 1. Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
  • 2. Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.
  • 3. Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.
  • 4. Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
  • 5. School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.
  • 6. Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
  • 7. Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.

Abstract

The volume of genomic region data continues to grow. Integrating diverse genomic region sets requires consensus regions, which make regions comparable across experiments but necessarily lose precision in region definitions. Methods are needed to assess this loss of precision and to build optimal consensus region sets. Here, we introduce the concept of flexible intervals and propose three novel methods for building consensus region sets, or universes: a coverage cutoff method, a likelihood method, and a Hidden Markov Model. We then propose three novel measures for evaluating how well a proposed universe fits a collection of region sets: a base-level overlap score, a region boundary distance score, and a likelihood score. We apply our methods and evaluation approaches to several collections of region sets and show how they can be used to evaluate the fit of universes and to build optimal universes. We describe scenarios where the common approach of merging regions to create a consensus leads to undesirable outcomes, and we provide principled alternatives that preserve interoperability of interval data while minimizing loss of resolution.
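
To make the coverage cutoff idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes half-open `(start, end)` intervals on a single chromosome, and the function names `merge_intervals`, `coverage_cutoff_universe`, and `fraction_covered` are hypothetical. The last function is a simple illustrative base-level measure (the share of a region set's bases that fall inside the universe) and is not necessarily the base-level overlap score defined in the paper.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_cutoff_universe(region_sets, cutoff):
    """Return intervals covered by at least `cutoff` region sets.

    Assumption: each region set contributes at most 1 to the coverage
    of a base, so every set is merged before counting.
    """
    events = defaultdict(int)
    for regions in region_sets:
        for start, end in merge_intervals(regions):
            events[start] += 1  # coverage rises at a region start
            events[end] -= 1    # and falls at a region end
    universe, depth, open_start = [], 0, None
    for pos in sorted(events):
        depth += events[pos]
        if depth >= cutoff and open_start is None:
            open_start = pos
        elif depth < cutoff and open_start is not None:
            universe.append((open_start, pos))
            open_start = None
    return universe


def fraction_covered(regions, universe):
    """Fraction of bases in `regions` that lie inside `universe`.

    Assumes `universe` is non-overlapping, as returned above.
    """
    regions = merge_intervals(regions)
    total = sum(end - start for start, end in regions)
    shared = sum(
        max(0, min(end, u_end) - max(start, u_start))
        for start, end in regions
        for u_start, u_end in universe
    )
    return shared / total if total else 0.0


# Toy collection of three region sets (made-up coordinates).
region_sets = [
    [(100, 200), (500, 600)],
    [(150, 250)],
    [(180, 300), (550, 650)],
]
union_universe = coverage_cutoff_universe(region_sets, cutoff=1)
strict_universe = coverage_cutoff_universe(region_sets, cutoff=2)
print(union_universe)
print(strict_universe)
print(fraction_covered(region_sets[0], strict_universe))
```

With a cutoff of 1 the sketch reduces to the common merge-everything approach, which is the case the abstract describes as leading to undesirable outcomes; raising the cutoff yields narrower regions at the cost of covering fewer bases of each input set.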
