
Updated: Jan 21, 2026

Supplementing claims data analysis using self-reported data to develop a probabilistic phenotype model for current

Jenna M Reps¹, Peter R Rijnbeek², Patrick B Ryan¹

¹Janssen Research and Development, Titusville, NJ, USA.

Journal of Biomedical Informatics

August 7, 2019

Summary

This summary is machine-generated.

A new model, CROSS, accurately predicts current smoking status using US claims data. This tool helps impute missing smoking information in epidemiological studies, improving research accuracy.

Keywords:

Claims data; Imputation; Patient-level prediction; Probabilistic phenotype; Risk; Smoking

Area of Science:

Health Informatics
Epidemiology
Data Science

Background:

Smoking status is often missing in US health insurance claims data.
Accurate smoking data is crucial for epidemiological studies and confounder adjustment.
The IBM MarketScan Commercial database offers a potential source for smoking status imputation.

Purpose of the Study:

To develop a generalizable smoking status phenotype model using US claims data.
To investigate the utility of a subset of patients with self-reported smoking status for model training.
To create a model that calculates the probability of being a current smoker.

Main Methods:

A subset of 1,966,174 patients with linked health risk assessments was used.
A regularized logistic regression model, Current Risk of Smoking Status (CROSS), was trained.
CROSS utilized 53,027 covariates from the prior 365 days, including demographics, conditions, drugs, and procedures.
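The methods above can be illustrated with a minimal sketch of regularized logistic regression on sparse binary covariates. This is not the CROSS implementation: the summary does not state which penalty or software was used, so an L2 penalty fitted by plain gradient descent is assumed here for simplicity, and the two toy covariates, the simulated patients, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_l2(rows, labels, n_features, lam=0.01, lr=0.5, epochs=300):
    """Fit an intercept and weights by batch gradient descent with an L2 penalty.

    rows: one list per patient holding the indices of covariates that were
          present in the lookback window (sparse binary representation).
    """
    w = [0.0] * n_features
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # penalty term; intercept is unpenalized
        grad_b = 0.0
        for active, y in zip(rows, labels):
            p = sigmoid(b + sum(w[j] for j in active))
            err = (p - y) / n
            grad_b += err
            for j in active:
                grad_w[j] += err
        b -= lr * grad_b
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return b, w


def predict_proba(b, w, active):
    """Predicted probability of being a current smoker."""
    return sigmoid(b + sum(w[j] for j in active))


# Simulated toy cohort (assumption: covariate 0 is a smoking-related code
# that is more common in smokers, covariate 1 is uninformative).
rng = random.Random(0)
rows, labels = [], []
for _ in range(400):
    smoker = 1 if rng.random() < 0.3 else 0
    active = []
    if rng.random() < (0.6 if smoker else 0.1):
        active.append(0)
    if rng.random() < 0.5:
        active.append(1)
    rows.append(active)
    labels.append(smoker)

intercept, weights = fit_logistic_l2(rows, labels, n_features=2)
p_with = predict_proba(intercept, weights, [0])
p_without = predict_proba(intercept, weights, [])
print(f"P(smoker | code present) = {p_with:.2f}")
print(f"P(smoker | code absent)  = {p_without:.2f}")
```

With tens of thousands of covariates, as in the study, a dedicated sparse solver would be used in place of this loop; the sketch only shows how the penalty enters the gradient and how a probability is read off the fitted model.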

Main Results:

The CROSS model achieved an internal AUC of 0.76 and was well-calibrated.
External validation across three US claims databases yielded AUCs between 0.82 and 0.87.
The model demonstrated transportability across different claims data sources.
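For readers unfamiliar with the two metrics reported above, the following sketch shows one common way to compute them. It assumes the rank-based (Mann-Whitney) definition of AUC and uses "calibration in the large" (mean predicted versus observed prevalence) as a simple calibration check; the summary does not say how calibration was assessed in the study, and the scores and labels below are made-up illustration values.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(scores, labels):
    """Return (mean predicted probability, observed prevalence)."""
    return sum(scores) / len(scores), sum(labels) / len(labels)


# Hypothetical predicted probabilities and self-reported smoking labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

model_auc = auc(scores, labels)
mean_pred, observed = calibration_in_the_large(scores, labels)
print(f"AUC = {model_auc:.3f}")
print(f"mean predicted = {mean_pred:.3f}, observed prevalence = {observed:.3f}")
```

A well-calibrated model has a mean predicted probability close to the observed prevalence, and discrimination (AUC) is assessed separately because a model can rank patients well while being poorly calibrated.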

Conclusions:

The CROSS model effectively predicts current smoking status from prior year claims data.
CROSS can be implemented with OMOP common data model-mapped US insurance claims.
This model is valuable for imputing smoking status in epidemiological research where smoking is a known confounder.
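One way predicted probabilities could be turned into imputed smoking status is sketched below. The summary does not describe an imputation procedure, so this is an assumption: each patient's status is drawn from a Bernoulli distribution with the predicted probability, repeated several times to produce multiple imputed datasets. The function name and the example probabilities are hypothetical.

```python
import random


def impute_smoking(probabilities, n_draws=5, seed=0):
    """Draw n_draws imputed datasets of 0/1 smoking status.

    Each status is a Bernoulli draw with the patient's predicted probability,
    so uncertainty in the prediction carries into the imputed values.
    """
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p else 0 for p in probabilities]
        for _ in range(n_draws)
    ]


# Hypothetical predicted probabilities for three patients.
draws = impute_smoking([0.0, 1.0, 0.4], n_draws=3)
for i, d in enumerate(draws, start=1):
    print(f"imputed dataset {i}: {d}")
```

Drawing repeatedly, instead of thresholding once at 0.5, lets a downstream analysis be run on each imputed dataset and the results pooled.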