Predicting disease risks from highly imbalanced data using random forest.

Mohammed Khalilia¹, Sounak Chakraborty, Mihail Popescu

¹Department of Computer Science, University of Missouri, Columbia, Missouri, USA.

BMC Medical Informatics and Decision Making

August 2, 2011

Summary

This summary is machine-generated.

We developed a novel method using Healthcare Cost and Utilization Project (HCUP) data to predict disease risk. This approach effectively addresses data imbalance, outperforming other models in chronic disease prediction.

Area of Science:

Health Informatics
Machine Learning in Healthcare
Predictive Analytics

Background:

Healthcare data, such as the Healthcare Cost and Utilization Project (HCUP) dataset, contains valuable information for predicting individual disease risk.
Accurate disease risk prediction can enhance healthcare management, personalized communication, and clinical decision support systems.

Purpose of the Study:

To develop and evaluate a machine learning methodology for predicting disease risk using historical medical diagnosis data.
To assess the performance of various classification algorithms in identifying individuals at risk for chronic diseases.

Main Methods:

Utilized the National Inpatient Sample (NIS) data from HCUP for training predictive models.
Employed an ensemble learning approach with repeated random sub-sampling to manage highly imbalanced HCUP data.
Compared Random Forest (RF) classifiers against Support Vector Machine (SVM), bagging, and boosting for predicting eight chronic diseases.
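
The sub-sampling step can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes one common variant of repeated random sub-sampling, in which every training subset keeps all minority-class samples and adds an equally sized random draw from the majority class, and it averages the scores of the per-subset models. The names `balanced_subsamples`, `train_ensemble`, `predict_score`, and `fit_centroid` are hypothetical, and `fit_centroid` is a deliberately trivial stand-in for the random forest base learner used in the study.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]         # (features, label); label 1 = minority class
Scorer = Callable[[Sequence[float]], float]  # maps features to a score in [0, 1]


def balanced_subsamples(data: List[Sample], n_rounds: int, seed: int = 0) -> List[List[Sample]]:
    """Build n_rounds balanced training sets (assumed variant, see lead-in)."""
    rng = random.Random(seed)
    minority = [s for s in data if s[1] == 1]
    majority = [s for s in data if s[1] == 0]
    if not minority or len(majority) < len(minority):
        raise ValueError("expected a non-empty minority class smaller than the majority class")
    # Each subset: all minority samples plus an equal-sized random majority draw.
    return [minority + rng.sample(majority, len(minority)) for _ in range(n_rounds)]


def fit_centroid(train: List[Sample]) -> Scorer:
    """Stand-in base learner (NOT a random forest): nearest-centroid scoring."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, y in train if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(1), centroid(0)

    def score(x: Sequence[float]) -> float:
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        total = d_pos + d_neg
        # Closer to the positive centroid gives a score nearer 1.
        return 0.5 if total == 0 else d_neg / total

    return score


def train_ensemble(data: List[Sample], fit: Callable[[List[Sample]], Scorer],
                   n_rounds: int = 10, seed: int = 0) -> List[Scorer]:
    """Fit one model per balanced subset."""
    return [fit(subset) for subset in balanced_subsamples(data, n_rounds, seed)]


def predict_score(models: List[Scorer], x: Sequence[float]) -> float:
    """Average the per-model scores to get the ensemble score."""
    return sum(m(x) for m in models) / len(models)
```

In practice the `fit` argument would wrap a real classifier, such as a random forest, in place of `fit_centroid`; the sub-sampling and averaging logic stays the same.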

Main Results:

The Random Forest (RF) ensemble learning method demonstrated superior performance compared to SVM, bagging, and boosting.
RF achieved a higher Area Under the Receiver Operating Characteristic curve (AUC) than SVM, bagging, and boosting when predicting the eight disease categories.
RF offers the advantage of calculating variable importance, providing insights into predictive factors.
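
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. Because it compares positives against negatives pairwise, it does not depend on the class proportions, which makes it a natural choice for imbalanced data. The sketch below computes it directly from that definition; the function name `auc` is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

For example, `auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])` ranks three of the four positive-negative pairs correctly and returns 0.75.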

Conclusions:

Combining repeated random sub-sampling with RF effectively addresses class imbalance issues in healthcare datasets.
The proposed method achieved promising results, predicting eight disease categories with an average AUC of 88.79% using national HCUP data.