Analyzing large scale genomic data on the cloud with Sparkhit.

Liren Huang1,2,3, Jan Krüger1,2, Alexander Sczyrba1,2,3

  • 1Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany.

Bioinformatics (Oxford, England) | December 19, 2017
Summary
This summary is machine-generated.

Sparkhit is a distributed bioinformatics framework built on Apache Spark for large-scale genomic data analysis. In the reported benchmarks it ran 92-157 times faster than MetaSpark for metagenomic fragment recruitment and 18-32 times faster than Crossbow for data pre-processing.

Area of Science:

  • Genomics
  • Bioinformatics
  • Computational Biology

Background:

  • Large-scale genomic analytics face scalability challenges with increasing next-generation sequencing data.
  • Existing distributed computational platforms for bioinformatics workloads exhibit inefficiencies and heavy run-time overheads, especially during data pre-processing.

Purpose of the Study:

  • To develop a novel distributed bioinformatics framework to address the limitations of existing tools for large-scale genomic data analysis.
  • To improve the efficiency and speed of bioinformatics workloads on massive datasets.

Main Methods:

  • Developed Sparkhit, a distributed bioinformatics framework utilizing the Apache Spark platform.
  • Implemented Sparkhit using the Spark extended MapReduce model, integrating various analytical methods.
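To illustrate the MapReduce pattern that the methods above refer to, here is a minimal single-process Python sketch. It is not Sparkhit's code or API: the function names (`map_read`, `reduce_counts`, `recruit`), the toy reference sequences, and the exact-substring matching rule are all assumptions made for illustration. On Spark, the map and reduce steps would be distributed across worker nodes rather than run in one process.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy reference sequences (not taken from the paper).
REFERENCES = {
    "refA": "ACGTACGTGGCA",
    "refB": "TTGACCAGTTAA",
}

def map_read(read):
    """Map step: emit (reference_id, 1) for each reference containing the read.

    Exact substring matching stands in for a real alignment algorithm.
    """
    return [(ref_id, 1) for ref_id, seq in REFERENCES.items() if read in seq]

def reduce_counts(acc, pair):
    """Reduce step: add one emitted count to the running total for its reference."""
    ref_id, count = pair
    acc[ref_id] += count
    return acc

def recruit(reads):
    """Apply the map step to every read, then fold the pairs into per-reference totals."""
    pairs = [pair for read in reads for pair in map_read(read)]
    return dict(reduce(reduce_counts, pairs, Counter()))

if __name__ == "__main__":
    reads = ["ACGT", "GGCA", "CAGT", "NNNN"]
    print(recruit(reads))  # {'refA': 2, 'refB': 1}
```

The map step is independent for each read, which is what makes this kind of workload straightforward to parallelize. Only the reduce step needs to combine results across reads.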

Main Results:

  • Sparkhit demonstrated significant speedups, running 92-157 times faster than MetaSpark for metagenomic fragment recruitment and 18-32 times faster than Crossbow for data pre-processing.
  • Successfully analyzed 100 terabytes of genomic data across four projects in the cloud within 21 hours, including system setup and data transfer.
  • Processed the entire Human Microbiome Project dataset in just 2 hours, showcasing efficient association of large public datasets with reference data.
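As a rough back-of-the-envelope reading of the 100-terabyte result, the sketch below converts the reported figures into an average end-to-end rate. It assumes decimal units (1 TB = 1000 GB) and treats the full 21 hours as processing time. Because the reported 21 hours also covers system setup and data transfer, the result is best read as a lower bound on the analysis throughput. The function name is hypothetical.

```python
def average_throughput(total_terabytes, hours):
    """Return (TB per hour, GB per second), assuming 1 TB = 1000 GB."""
    tb_per_hour = total_terabytes / hours
    gb_per_second = total_terabytes * 1000 / (hours * 3600)
    return tb_per_hour, gb_per_second

if __name__ == "__main__":
    tb_h, gb_s = average_throughput(100, 21)
    print(f"{tb_h:.2f} TB/h, {gb_s:.2f} GB/s")  # 4.76 TB/h, 1.32 GB/s
```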

Conclusions:

  • Sparkhit provides an efficient and scalable solution for large-scale genomic data analysis.
  • The framework offers substantial performance gains, enabling faster processing of massive sequencing datasets.
  • Sparkhit facilitates the integration and analysis of large public genomic datasets.