



Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to k-Mer Annotation.

Yoshihiro Shibuya¹, Djamal Belazzougui², Gregory Kucherov³,⁴

¹LIGM, Modèles et Algorithmes Group, Université Gustave Eiffel, Marne-la-Vallée, France.

Journal of Computational Biology : a Journal of Computational Molecular Cell Biology

January 20, 2022

Summary

This summary is machine-generated.

Set-Min sketch is a memory-efficient way to store k-mer counts without explicitly listing the k-mers. It achieves a very low error rate at the cost of a moderate memory increase and is provably more accurate than Count-Min and Max-Min sketches.

Keywords:

k-mer counting; k-mer spectrum; max-min sketch; power-law distribution; set-min sketch; sketching


Area of Science:

Bioinformatics
Data Structures
Computational Biology

Background:

K-mer counts are crucial features in bioinformatics pipelines.
Existing k-mer counting methods optimize either time or memory and produce large k-mer count tables.
When the set of k-mers is already known, the k-mers themselves need not be stored explicitly; only the counters have to be kept.
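
To make the notion of an explicit k-mer count table concrete, here is a minimal Python sketch that counts k-mers in a toy sequence. The helper name `count_kmers` and the example sequence are hypothetical and not taken from the paper; production k-mer counters are far more optimized than this.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Return an explicit k-mer count table (hypothetical helper, for illustration)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of length k over the sequence and count each substring.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy input, not real genomic data.
table = count_kmers("ACGTACGT", 3)
for kmer, count in sorted(table.items()):
    print(kmer, count)
```

Every key in `table` is a k-mer string stored alongside its counter; that explicit key storage is what the approaches discussed here try to avoid.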

Purpose of the Study:

Introduce Set-Min sketch, a novel technique for representing associative maps.
Apply Set-Min sketch to the problem of representing k-mer count tables.
Compare Set-Min sketch's accuracy and memory efficiency against Count-Min and Max-Min sketches.

Main Methods:

Developed Set-Min sketch, inspired by Count-Min sketch.
Defined Max-Min sketch as an improved variant of Count-Min for static datasets.
Evaluated Set-Min sketch's performance on k-mer count tables, particularly for genomic datasets.
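
For orientation, the following is a minimal sketch of a standard Count-Min sketch, the structure that Set-Min is said to be inspired by. It also includes a static variant that stores the maximum count per cell instead of the sum; this is only an assumption about what "Max-Min" means, based on its name, and is not the paper's definition. The class and function names, the table dimensions, the use of salted BLAKE2b as the row hash, and the toy counts are all illustrative choices. Set-Min itself is not reproduced here.

```python
import hashlib


class CountMinSketch:
    """Textbook Count-Min sketch: rows x cols counters, one hash per row."""

    def __init__(self, rows: int = 4, cols: int = 64):
        self.rows = rows
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]

    def _cell(self, row: int, key: str) -> int:
        # Salting BLAKE2b with the row index gives one hash function per row
        # (an illustrative choice, not the paper's hashing scheme).
        digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % self.cols

    def add(self, key: str, count: int = 1) -> None:
        # Colliding keys add up, so cells can only overestimate.
        for r in range(self.rows):
            self.table[r][self._cell(r, key)] += count

    def query(self, key: str) -> int:
        # The minimum over rows is the least-contaminated estimate.
        return min(self.table[r][self._cell(r, key)] for r in range(self.rows))


def build_max_min(counts: dict, rows: int = 4, cols: int = 64) -> CountMinSketch:
    """ASSUMED static variant: each cell keeps the max count hashed to it."""
    sketch = CountMinSketch(rows, cols)
    for key, c in counts.items():
        for r in range(rows):
            j = sketch._cell(r, key)
            sketch.table[r][j] = max(sketch.table[r][j], c)
    return sketch


# Toy counts (hypothetical), deliberately squeezed into few columns to force collisions.
counts = {"ACG": 5, "CGT": 3, "GTA": 1, "TAC": 1, "TTT": 40}
cm = CountMinSketch(rows=4, cols=8)
for kmer, c in counts.items():
    cm.add(kmer, c)
mm = build_max_min(counts, rows=4, cols=8)

for kmer, c in counts.items():
    print(kmer, "true:", c, "max-variant:", mm.query(kmer), "count-min:", cm.query(kmer))
```

Under these assumptions neither structure ever underestimates a stored count, and the max-based variant is never worse than Count-Min, because a cell's maximum cannot exceed its sum when all counts are non-negative. The max-based variant needs all counts up front, which is consistent with the "static datasets" restriction.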

Main Results:

Set-Min sketch demonstrates provably higher accuracy than Count-Min and Max-Min sketches.
The technique achieves a very low error rate, in terms of both the probability and the size of errors, with only a moderate memory increase.
Set-Min sketches require up to an order of magnitude less space than Minimal Perfect Hash Function (MPHF)-based solutions for large k and assembled genomes.

Conclusions:

Set-Min sketch is a highly accurate and memory-efficient method for representing k-mer count tables.
Its space efficiency is particularly advantageous for large genomic datasets due to the power-law distribution of k-mer counts.
Set-Min sketch offers a superior alternative to existing methods like MPHFs and Count-Min sketches for specific bioinformatics applications.
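
The distribution of k-mer counts referred to in the conclusions is commonly inspected through the k-mer spectrum, that is, the number of distinct k-mers observed for each count value. The short Python sketch below computes it from a count table. The function name `kmer_spectrum` and the toy table are hypothetical; the toy data is far too small to exhibit a power law and only shows the computation.

```python
from collections import Counter


def kmer_spectrum(count_table: dict) -> Counter:
    """Map each count value to the number of distinct k-mers having that count."""
    return Counter(count_table.values())


# Toy count table (hypothetical), e.g. 3-mers of a short sequence.
toy_table = {"ACG": 2, "CGT": 2, "GTA": 1, "TAC": 1}
spectrum = kmer_spectrum(toy_table)
for count_value in sorted(spectrum):
    print(count_value, spectrum[count_value])
```

On real genomic data one would expect, per the summary above, most k-mers to concentrate on a few small count values with a long tail of rare large counts.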