MaxGeomHash: An Algorithm for Variable-Size Random Sampling of Distinct Elements.

Mahmudur Rahman Hera (1,2), David Koslicki (3), Conrado Martinez (4)

  • (1) Center for Advanced Biotechnology & Medicine, Rutgers University, NJ, USA.

bioRxiv: the preprint server for biology | November 26, 2025
Summary
This summary is machine-generated.

MaxGeomHash is a new sketching algorithm that balances computational efficiency and accuracy when analyzing large sequencing datasets. It produces sub-linear sketches; on genomic datasets these gave more accurate similarity trees than MinHash and were more efficient than FracMinHash.

Keywords:
FracMinHash, MinHash, random sampling, dimensionality reduction, k-mers, similarity estimation, sketching

Area of Science:

  • Bioinformatics and Computational Biology
  • Genomics and Sequence Analysis
  • Algorithm Design and Analysis

Background:

  • The exponential growth of sequencing data necessitates scalable computational methods.
  • K-mer sketching is a key technique for summarizing large sequence datasets.
  • Existing methods trade sketch size against scalability: MinHash sketches have a fixed size regardless of the input, while FracMinHash sketches grow linearly with the number of distinct k-mers.
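
To make the two baseline sketches concrete, here is a minimal Python sketch of k-mer extraction, bottom-k MinHash, and FracMinHash. It is an illustration under stated assumptions, not the authors' implementation: the function names, the BLAKE2b hash, k = 21, the sketch size of 100, and the scale of 10 are all hypothetical choices made for this example.

```python
import hashlib
import random

MAX_HASH = 2**64  # hash values lie in [0, 2**64)

def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """Deterministic 64-bit hash; BLAKE2b is an illustrative choice."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_sketch(items, size):
    """Bottom-k MinHash: keep the `size` smallest hash values."""
    return set(sorted(hash64(x) for x in items)[:size])

def fracminhash_sketch(items, scale):
    """FracMinHash: keep every hash value below MAX_HASH / scale."""
    threshold = MAX_HASH // scale
    return {v for v in map(hash64, items) if v < threshold}

# Compare sketch sizes on random DNA of two lengths (assumed toy data).
rng = random.Random(0)
for length in (1_000, 10_000):
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    items = kmers(seq, 21)
    print(len(items),
          len(minhash_sketch(items, 100)),
          len(fracminhash_sketch(items, 10)))
```

The MinHash sketch stays at 100 entries as the input grows, whereas the FracMinHash sketch retains roughly one tenth of the distinct k-mers, so it grows in proportion to the input.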

Purpose of the Study:

  • To introduce MaxGeomHash, a novel sketching algorithm for efficient large-scale sequence analysis.
  • To develop a permutation-invariant and parallelizable sketching algorithm producing sub-linear sketches.
  • To provide a method that balances sketch size, storage, processing efficiency, and accuracy.

Main Methods:

  • Developed the MaxGeomHash algorithm, producing sketches of size O(b log(n/b)) for parameter b.
  • Introduced a variant, α-MaxGeomHash, generating sketches of size Θ(n^α).
  • Studied algorithm properties, analyzed sample sizes, and empirically verified theoretical results.
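
The stated size bounds can be compared numerically with the baseline sketches. The snippet below is a rough illustration only: asymptotic notation hides constant factors and the base of the logarithm, so the numbers show growth shapes rather than the sketch sizes the algorithm actually produces. The parameter values (b = 64, α = 0.5, a MinHash size of 1000, a FracMinHash scale of 1000) and the function names are assumptions made for this example.

```python
import math

def maxgeomhash_growth(n, b):
    """Growth term b * ln(n / b) from the stated O(b log(n/b)) bound.
    The constant factor and log base are assumptions."""
    return b * math.log(n / b)

def alpha_growth(n, alpha):
    """Growth term n**alpha from the stated Theta(n^alpha) bound."""
    return n ** alpha

def minhash_size(n, k):
    """MinHash keeps at most k values, regardless of n."""
    return min(n, k)

def fracminhash_size(n, scale):
    """FracMinHash keeps about n / scale values in expectation."""
    return n / scale

for n in (10**4, 10**6, 10**8):
    print(f"n={n:>11,}  MinHash={minhash_size(n, 1000):>6,}  "
          f"b*ln(n/b)={maxgeomhash_growth(n, 64):>8,.0f}  "
          f"n^0.5={alpha_growth(n, 0.5):>8,.0f}  "
          f"FracMinHash={fracminhash_size(n, 1000):>10,.0f}")
```

Under these assumed parameters, the b log(n/b) and n^α terms keep growing with n, unlike the fixed MinHash size, but more slowly than the linear FracMinHash size, which is the sense in which the sketches are intermediate-sized.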

Main Results:

  • MaxGeomHash generates sub-linear, permutation-invariant, and parallelizable sketches.
  • Empirical validation confirmed theoretical sample size predictions and similarity estimation quality.
  • On genomic datasets, MaxGeomHash sketches enabled more accurate similarity tree construction than MinHash and more efficient construction than FracMinHash.
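
As an illustration of what "permutation-invariant" and "parallelizable" mean for a hash-based sketch, the snippet below checks both properties on a plain bottom-k MinHash sketch. It does not implement MaxGeomHash; bottom-k is swapped in because its behavior is simple to verify, and the names, hash function, and sizes are assumptions made for this example.

```python
import hashlib
import random

def hash64(item):
    """Deterministic 64-bit hash; BLAKE2b is an illustrative choice."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def bottom_k(items, k):
    """Keep the k smallest distinct hash values of the input."""
    return sorted({hash64(x) for x in items})[:k]

def merge(sketch_a, sketch_b, k):
    """Combine two bottom-k sketches into the sketch of the union."""
    return sorted(set(sketch_a) | set(sketch_b))[:k]

items = [f"element-{i}" for i in range(10_000)]
shuffled = items[:]
random.Random(0).shuffle(shuffled)

whole = bottom_k(items, 100)

# Permutation invariance: the input order does not change the sketch.
from_shuffled = bottom_k(shuffled, 100)

# Parallelizability: sketch two chunks independently, then merge.
merged = merge(bottom_k(items[:4000], 100), bottom_k(items[4000:], 100), 100)

print(whole == from_shuffled, whole == merged)
```

Both comparisons hold because the sketch depends only on the set of hash values, not on arrival order, and the smallest values of a union are always among the smallest values of its parts.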

Conclusions:

  • MaxGeomHash offers an effective intermediate-sized sketching approach, balancing efficiency and accuracy.
  • The algorithm provides a valuable new tool for large-scale genomic data analysis and comparison.
  • The implementation is publicly available, facilitating further research and application.