Analyzing large scale genomic data on the cloud with Sparkhit.

Liren Huang¹,²,³, Jan Krüger¹,², Alexander Sczyrba¹,²,³

¹Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany.

Bioinformatics (Oxford, England)

December 19, 2017

Summary

This summary is machine-generated.

Sparkhit is a distributed bioinformatics framework built on Apache Spark for large-scale genomic data analysis. In the reported benchmarks it ran 92-157 times faster than MetaSpark for metagenomic fragment recruitment and 18-32 times faster than Crossbow for data pre-processing.

Area of Science:

Genomics
Bioinformatics
Computational Biology

Background:

Large-scale genomic analytics face scalability challenges as the volume of next-generation sequencing data grows.
Existing distributed computational platforms for bioinformatics workloads exhibit inefficiencies and heavy run-time overheads, especially during data pre-processing.

Purpose of the Study:

To develop a novel distributed bioinformatics framework to address the limitations of existing tools for large-scale genomic data analysis.
To improve the efficiency and speed of bioinformatics workloads on massive datasets.

Main Methods:

Developed Sparkhit, a distributed bioinformatics framework utilizing the Apache Spark platform.
Implemented Sparkhit using the Spark-extended MapReduce model, integrating a variety of analytical methods.
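
The map and reduce steps behind this kind of workload can be illustrated with a toy fragment-recruitment example. This is a minimal sketch in plain Python and not Sparkhit's implementation: the k-mer matching rule, the value of `K`, the function names, and the sample sequences are all assumptions made for illustration. In Spark, the built-in `map` and `functools.reduce` used here would roughly correspond to the RDD operations `map` and `reduceByKey`, with the data partitioned across worker nodes.

```python
from collections import Counter
from functools import reduce

K = 4  # hypothetical k-mer length, chosen only for this toy example

# Hypothetical reference sequences and sequencing reads (not real data).
REFERENCE_SEQS = {"refA": "ACGTACGTAA", "refB": "TTGGCCAATT"}
READS = ["ACGTAC", "GGCCAA", "CGTACG", "GGGGGG"]


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit(read, references):
    """Map step: emit (reference_name, 1) for the reference that shares
    the most k-mers with the read, or (None, 1) if none is shared."""
    read_kmers = kmers(read)
    best_name, best_shared = None, 0
    for name, ref_kmers in references.items():
        shared = len(read_kmers & ref_kmers)
        if shared > best_shared:
            best_name, best_shared = name, shared
    return (best_name, 1)


def merge_counts(acc, pair):
    """Reduce step: fold one emitted pair into the per-reference tally."""
    name, count = pair
    acc[name] += count
    return acc


def count_recruited(reads, reference_seqs):
    """Run the map step over every read, then reduce to counts per reference."""
    references = {name: kmers(seq) for name, seq in reference_seqs.items()}
    pairs = map(lambda read: recruit(read, references), reads)
    return reduce(merge_counts, pairs, Counter())


if __name__ == "__main__":
    for name, count in count_recruited(READS, REFERENCE_SEQS).items():
        print(name, count)
```

Because each read is processed independently in the map step, the work can be split across machines without coordination; only the small per-reference tallies need to be combined in the reduce step.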

Main Results:

Sparkhit ran 92-157 times faster than MetaSpark for metagenomic fragment recruitment and 18-32 times faster than Crossbow for data pre-processing.
Analyzed 100 terabytes of genomic data across four projects in the cloud within 21 hours, including system setup and data transfer.
Processed the entire Human Microbiome Project dataset in 2 hours, demonstrating that large public datasets can be associated efficiently with reference data.
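
To put the 100-terabyte figure in perspective, the average end-to-end data rate can be estimated with simple arithmetic. The sketch below assumes decimal terabytes (10^12 bytes), which the source does not specify, and the function name is hypothetical. Because the 21 hours include system setup and data transfer, the result is an average over the whole run rather than a measured processing rate.

```python
TOTAL_BYTES = 100 * 10**12   # 100 TB, assuming decimal terabytes
WALL_CLOCK_HOURS = 21        # includes system setup and data transfer


def average_rate_gb_per_s(total_bytes, hours):
    """Average end-to-end rate in gigabytes (10^9 bytes) per second."""
    seconds = hours * 3600
    return total_bytes / seconds / 10**9


if __name__ == "__main__":
    rate = average_rate_gb_per_s(TOTAL_BYTES, WALL_CLOCK_HOURS)
    print(f"{rate:.2f} GB/s")  # about 1.32 GB/s under the stated assumptions
```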

Conclusions:

Sparkhit provides an efficient and scalable solution for large-scale genomic data analysis.
The framework offers substantial performance gains, enabling faster processing of massive sequencing datasets.
Sparkhit facilitates the integration and analysis of large public genomic datasets.