Estimating dataset size requirements for classifying DNA microarray data.

Sayan Mukherjee¹, Pablo Tamayo, Simon Rogers

¹Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA. sayan@genome.wi.mit.edu

Journal of Computational Biology: A Journal of Computational Molecular Cell Biology

June 14, 2003

Summary

This summary is machine-generated.

This study introduces a statistical method using learning curves to determine the necessary dataset size for accurate microarray data classification. It helps estimate future data needs and assess the impact of additional data on classifier performance and significance.

Area of Science:

Bioinformatics
Statistical Learning
Computational Biology

Background:

Accurate classification of microarray data is crucial for understanding complex biological systems.
Determining optimal dataset size is essential for robust and reliable machine learning models in genomics.
Existing methods may not adequately address the statistical significance of classification performance with varying data sizes.

Purpose of the Study:

To develop a statistical methodology for estimating dataset size requirements in microarray data classification.
To evaluate the impact of increasing dataset size on classifier accuracy and statistical significance.
To provide a framework for planning future data collection and analysis in genomic studies.

Main Methods:

Constructing empirical learning curves by fitting inverse power-law models to existing classification results.
Implementing a permutation test procedure to assess the statistical significance of classifier performance at different dataset sizes.
Applying the methodology to several molecular classification problems of varying complexity.
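
The learning-curve step above can be illustrated with a minimal sketch. It assumes the common three-parameter inverse power-law form error(n) = a * n^(-alpha) + b, where b is the asymptotic error; the summary does not state the exact model form or fitting procedure, so the grid search over b and the function names (`fit_inverse_power_law`, `predict_error`, `samples_needed`) are illustrative assumptions rather than the authors' implementation.

```python
import math

def fit_inverse_power_law(sizes, errors, b_grid_steps=200):
    """Fit error(n) ~ a * n**(-alpha) + b (assumed model form).

    For each candidate asymptote b on a grid below min(errors), fit
    log(error - b) = log(a) - alpha * log(n) by ordinary least squares,
    then keep the candidate with the smallest squared error in the
    original (untransformed) error scale. Requires all errors > 0.
    """
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b_max = min(errors)
    best = None
    for i in range(b_grid_steps):
        b = b_max * i / b_grid_steps  # always strictly below min(errors)
        ys = [math.log(e - b) for e in errors]
        mean_y = sum(ys) / len(ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(mean_y - slope * mean_x)
        alpha = -slope
        sse = sum((a * n ** (-alpha) + b - e) ** 2
                  for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b

def predict_error(n, a, alpha, b):
    """Extrapolated error rate at dataset size n."""
    return a * n ** (-alpha) + b

def samples_needed(target_error, a, alpha, b):
    """Smallest n whose predicted error reaches target_error, or None
    if the target is at or below the fitted asymptote (or alpha <= 0)."""
    if target_error <= b or alpha <= 0:
        return None
    return math.ceil((a / (target_error - b)) ** (1.0 / alpha))

# Synthetic error rates generated from a = 1.0, alpha = 0.5, b = 0.1
# (made-up numbers for illustration, not data from the study).
sizes = [10, 20, 40, 80, 160]
errors = [1.0 * n ** -0.5 + 0.1 for n in sizes]
a, alpha, b = fit_inverse_power_law(sizes, errors)
```

In practice the observed error rates would come from classifiers trained on subsamples of the existing dataset, and `samples_needed` would then give a rough estimate of how many samples a future experiment requires for a target error rate.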

Main Results:

The developed methodology effectively estimates dataset size requirements for microarray classification.
Empirical learning curves accurately predict the gain in accuracy and significance with additional data.
The permutation test provides a robust measure of statistical significance for classifiers across different dataset sizes.
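
A permutation test of classifier significance can be sketched as follows. This is a minimal illustration under stated assumptions: the nearest-centroid classifier, the leave-one-out error estimate, and the function names are stand-ins chosen for brevity, since the summary does not specify which classifier or error estimate the study used.

```python
import random

def nearest_centroid_loo_error(X, y):
    """Leave-one-out error rate of a nearest-centroid classifier
    (a stand-in classifier, not necessarily the one used in the study)."""
    mistakes = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if rows:
                centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        predicted = min(
            centroids,
            key=lambda c: sum((u - v) ** 2 for u, v in zip(X[i], centroids[c])),
        )
        mistakes += predicted != y[i]
    return mistakes / len(X)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Estimate how often randomly permuted labels give an error rate
    at least as low as the one observed with the true labels."""
    rng = random.Random(seed)
    observed = nearest_centroid_loo_error(X, y)
    as_good = 0
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)
        if nearest_centroid_loo_error(X, shuffled) <= observed:
            as_good += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (as_good + 1) / (n_perm + 1)

# Toy one-feature data with two well-separated classes
# (made-up values for illustration only).
X = [[0.1 * i] for i in range(8)] + [[5.0 + 0.1 * i] for i in range(8)]
y = [0] * 8 + [1] * 8
observed_error, p_value = permutation_p_value(X, y)
```

Running the same procedure on subsamples of increasing size would indicate the dataset size at which classifier performance becomes distinguishable from chance.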

Conclusions:

The proposed statistical approach offers a reliable method for determining optimal dataset sizes in microarray studies.
This methodology aids in efficient resource allocation for data collection and enhances the interpretability of classification results.
The findings have broad applicability in bioinformatics and computational biology for designing effective genomic experiments.