MaxGeomHash: An Algorithm for Variable-Size Random Sampling of Distinct Elements.

Mahmudur Rahman Hera¹,², David Koslicki³, Conrado Martinez⁴

¹Center for Advanced Biotechnology & Medicine, Rutgers University, NJ, USA.

bioRxiv: the preprint server for biology

November 26, 2025

Summary

This summary is machine-generated.

A new sketching algorithm, MaxGeomHash, balances computational efficiency and accuracy when analyzing large sequencing datasets. It produces sub-linear sketches that, on genomic datasets, yield more accurate similarity trees than MinHash and are more efficient to work with than FracMinHash.

Keywords:

FracMinHash, MinHash, random sampling, dimensionality reduction, k-mers, similarity estimation, sketching

Area of Science:

Bioinformatics and Computational Biology
Genomics and Sequence Analysis
Algorithm Design and Analysis

Background:

The exponential growth of sequencing data necessitates scalable computational methods.
K-mer sketching is a key technique for summarizing large sequence datasets.
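Existing methods have limitations in sketch size and scalability: a MinHash sketch has a fixed size regardless of how large the dataset is, while a FracMinHash sketch grows linearly with the number of distinct k-mers.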
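
As a rough illustration of the two existing sketch types named above, the following Python sketch builds a bottom-k MinHash sketch and a FracMinHash sketch from the distinct k-mers of a sequence. It is a minimal sketch under stated assumptions, not the implementation used in the study: the hash function (BLAKE2b truncated to 64 bits), the helper names (`kmers`, `h64`, `minhash_sketch`, `fracminhash_sketch`, `jaccard_from_frac`), and the parameters `num_hashes` and `scale` are all hypothetical choices made for this example.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space (assumption for this example)


def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def h64(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(kmer_set, num_hashes):
    """Bottom-k MinHash: keep the num_hashes smallest hash values.

    The sketch never holds more than num_hashes values, however large
    the input set is.
    """
    return sorted(h64(x) for x in kmer_set)[:num_hashes]


def fracminhash_sketch(kmer_set, scale):
    """FracMinHash: keep every hash value below MAX_HASH / scale.

    Roughly a 1/scale fraction of the distinct k-mers is retained, so
    the sketch grows linearly with the input set.
    """
    threshold = MAX_HASH // scale
    return {v for v in map(h64, kmer_set) if v < threshold}


def jaccard_from_frac(sketch_a, sketch_b):
    """Estimate Jaccard similarity from two FracMinHash sketches."""
    union = sketch_a | sketch_b
    return len(sketch_a & sketch_b) / len(union) if union else 0.0


if __name__ == "__main__":
    a = kmers("ACGTACGTTGCAACGTAGCTAGCTAGGATCC", 5)
    b = kmers("ACGTACGTTGCAACGTAGCTAGCTAGGTTCC", 5)
    print(len(minhash_sketch(a, 8)), len(fracminhash_sketch(a, 2)))
    print(jaccard_from_frac(fracminhash_sketch(a, 2), fracminhash_sketch(b, 2)))
```

Both sketches depend only on the set of hash values, not on the order in which k-mers are read, which is the sense in which such sketches are permutation-invariant.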

Purpose of the Study:

To introduce MaxGeomHash, a novel sketching algorithm for efficient large-scale sequence analysis.
To develop a permutation-invariant and parallelizable sketching algorithm producing sub-linear sketches.
To provide a method that balances sketch size, storage, processing efficiency, and accuracy.

Main Methods:

Developed the MaxGeomHash algorithm, producing sketches of size O(b log(n/b)) for parameter b.
Introduced a variant, α-MaxGeomHash, generating sketches of size Θ(n^α).
Studied algorithm properties, analyzed sample sizes, and empirically verified theoretical results.
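
To make the stated sketch sizes concrete, the short Python example below tabulates the growth terms b log(n/b) and n^α next to a fixed-size MinHash sketch and a linear-size FracMinHash sketch. It is only an arithmetic illustration: it assumes n is the number of distinct elements, takes the logarithm in base 2, and ignores the constants hidden by the O and Θ notation. The function names and the parameter values (`b=64`, `alpha=0.5`, `k=1000`, `scale=1000`) are hypothetical and do not come from the study.

```python
import math


def maxgeomhash_size(n, b):
    """Growth term b * log2(n / b); log base and constant are assumptions."""
    return b * math.log2(n / b)


def alpha_maxgeomhash_size(n, alpha):
    """Growth term n ** alpha, ignoring the constant hidden by Theta."""
    return n ** alpha


def minhash_size(n, k):
    """A bottom-k MinHash sketch holds at most k values."""
    return min(n, k)


def fracminhash_size(n, scale):
    """A FracMinHash sketch holds about n / scale values in expectation."""
    return n / scale


if __name__ == "__main__":
    print("n, MinHash, b*log2(n/b), n^alpha, FracMinHash")
    for exponent in range(4, 10):
        n = 10 ** exponent
        print(
            n,
            minhash_size(n, 1000),
            round(maxgeomhash_size(n, 64)),
            round(alpha_maxgeomhash_size(n, 0.5)),
            round(fracminhash_size(n, 1000)),
        )
```

Under these assumptions the b log(n/b) and n^α terms keep growing with n, unlike the fixed MinHash size, but more slowly than the linear FracMinHash size, which is what "sub-linear" and "intermediate-sized" refer to.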

Main Results:

MaxGeomHash generates sub-linear, permutation-invariant, and parallelizable sketches.
Empirical validation confirmed theoretical sample size predictions and similarity estimation quality.
On genomic datasets, MaxGeomHash sketches enabled more accurate similarity tree construction than MinHash and more efficient construction than FracMinHash.

Conclusions:

MaxGeomHash offers an effective intermediate-sized sketching approach, balancing efficiency and accuracy.
The algorithm provides a valuable new tool for large-scale genomic data analysis and comparison.
The implementation is publicly available, facilitating further research and application.