Estimating dataset size requirements for classifying DNA microarray data.

Sayan Mukherjee¹, Pablo Tamayo, Simon Rogers

  • ¹ Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA. sayan@genome.wi.mit.edu

Journal of Computational Biology: a Journal of Computational Molecular Cell Biology | June 14, 2003
Summary
This summary is machine-generated.

This study introduces a statistical method using learning curves to determine the necessary dataset size for accurate microarray data classification. It helps estimate future data needs and assess the impact of additional data on classifier performance and significance.

Area of Science:

  • Bioinformatics
  • Statistical Learning
  • Computational Biology

Background:

  • Accurate classification of microarray data is crucial for understanding complex biological systems.
  • Determining optimal dataset size is essential for robust and reliable machine learning models in genomics.
  • Existing methods may not adequately address the statistical significance of classification performance with varying data sizes.

Purpose of the Study:

  • To develop a statistical methodology for estimating dataset size requirements in microarray data classification.
  • To evaluate the impact of increasing dataset size on classifier accuracy and statistical significance.
  • To provide a framework for planning future data collection and analysis in genomic studies.

Main Methods:

  • Fitting inverse power-law learning curves to classification results obtained from existing data.
  • Running a permutation test to assess the statistical significance of classifier performance at different dataset sizes.
  • Applying the methodology to molecular classification problems of varying complexity.
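
The learning-curve step can be illustrated with a minimal sketch. It assumes the inverse power-law takes the form e(n) = a·n^(-alpha) + b, where e(n) is the error rate at training-set size n and b is the asymptotic error; the summary only says "inverse power-law models", so this exact form, the grid-search fitting procedure, the function names, and the synthetic data are all illustrative assumptions rather than the authors' implementation.

```python
def fit_inverse_power_law(sizes, errors):
    """Fit e(n) = a * n**(-alpha) + b by a coarse grid search (assumed form).

    For each candidate (alpha, b) the scale a has a closed-form
    least-squares solution, so only two parameters are searched.
    """
    best = None
    max_b = min(errors)  # the asymptote should lie below every observed error
    for i in range(1, 201):
        alpha = i / 100.0  # candidate decay rates 0.01 .. 2.00
        x = [n ** -alpha for n in sizes]
        sxx = sum(v * v for v in x)
        for k in range(200):
            b = k / 200.0  # candidate asymptotes 0.000 .. 0.995
            if k > 0 and b >= max_b:
                break
            a = sum(v * (e - b) for v, e in zip(x, errors)) / sxx
            sse = sum((a * v + b - e) ** 2 for v, e in zip(x, errors))
            if best is None or sse < best[0]:
                best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b


def predict_error(n, a, alpha, b):
    """Extrapolate the fitted curve to a dataset of size n."""
    return a * n ** -alpha + b


def samples_needed(target_error, a, alpha, b):
    """Invert the curve: dataset size at which the target error is reached."""
    if target_error <= b:
        raise ValueError("target is at or below the fitted asymptotic error")
    return ((target_error - b) / a) ** (-1.0 / alpha)


# Synthetic, noise-free error rates (not data from the study),
# generated from a = 1.0, alpha = 0.5, b = 0.1.
sizes = [10, 20, 40, 80, 160]
errors = [1.0 * n ** -0.5 + 0.1 for n in sizes]

a, alpha, b = fit_inverse_power_law(sizes, errors)
print(a, alpha, b)
print(predict_error(400, a, alpha, b))
print(samples_needed(0.15, a, alpha, b))
```

With real, noisy error estimates a proper nonlinear least-squares routine would likely be preferable to a grid search; the grid is used here only to keep the sketch free of dependencies.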
Main Results:

  • The developed methodology effectively estimates dataset size requirements for microarray classification.
  • Empirical learning curves accurately predict the gain in accuracy and significance with additional data.
  • The permutation test provides a robust measure of statistical significance for classifiers across different dataset sizes.
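
A label-permutation test of the kind referred to above might look like the following sketch. The nearest-centroid classifier, the leave-one-out error estimate, the function names, and the toy data are stand-ins chosen for brevity; the summary does not say which classifier or error estimate the study used. To examine several dataset sizes, the same test could be repeated on subsamples of each size.

```python
import random


def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (stand-in)."""
    mistakes = 0
    for i in range(len(X)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if rows:
                centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        # Predict the label whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda c: sum((u - v) ** 2 for u, v in zip(X[i], centroids[c])),
        )
        if pred != y[i]:
            mistakes += 1
    return mistakes / len(X)


def permutation_p_value(X, y, n_perm=200, seed=0):
    """Estimate how often shuffled labels do at least as well as the real ones."""
    rng = random.Random(seed)
    observed = loo_error(X, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any real link between samples and labels
        if loo_error(X, labels) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (not from the study): two well-separated groups of 8 samples.
X = [[0.1 * i, -0.1 * i] for i in range(8)] + \
    [[5 + 0.1 * i, 5 - 0.1 * i] for i in range(8)]
y = [0] * 8 + [1] * 8

observed_error, p_value = permutation_p_value(X, y)
print(observed_error, p_value)
```

A small p-value indicates that the observed error rate is unlikely to arise when the labels carry no information about the samples.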

Conclusions:

  • The proposed statistical approach offers a reliable method for determining optimal dataset sizes in microarray studies.
  • This methodology aids in efficient resource allocation for data collection and enhances the interpretability of classification results.
  • The findings have broad applicability in bioinformatics and computational biology for designing effective genomic experiments.