Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by

Xiang-Wei Zhu, Yan-Jun Xin, Hui-Lin Ge¹

  • ¹ Hainan Provincial Key Laboratory of Quality and Safety for Tropical Fruits and Vegetables, Analysis and Testing Center, Chinese Academy of Tropical Agricultural Sciences, Haikou, 571101 Hainan, China.

Journal of Chemical Information and Modeling | March 10, 2015
Summary
This summary is machine-generated.

Variable selection is key for predictive modeling. Random Forests (RF) effectively identify important predictors, outperforming the Least Absolute Shrinkage and Selection Operator (LASSO) by avoiding the exclusion of correlated variables, thus enhancing model performance.

Area of Science:

  • Quantitative Structure-Activity Relationship (QSAR) modeling
  • Cheminformatics
  • Machine Learning

Background:

  • Variable selection is critical for enhancing predictive accuracy and reducing noise in QSAR models.
  • Choosing the appropriate variables is more challenging than developing the predictive models themselves.
  • The effectiveness of existing methods for descriptor selection requires careful evaluation.

Purpose of the Study:

  • To explore the applicability of two distinct variable selection methods: Random Forests (RF) and Least Absolute Shrinkage and Selection Operator (LASSO).
  • To compare the performance of RF and LASSO in selecting optimal variables for QSAR modeling across diverse datasets.
  • To propose and utilize novel metrics for a comprehensive evaluation of variable selection strategies.

Main Methods:

  • Applied recursive RF to iteratively remove less important descriptors.
  • Employed LASSO with 10-fold inner cross-validation to determine the optimal penalty parameter (λ).
  • Evaluated the selected variable sets using the highest pairwise correlation rate, the average Pearson correlation coefficient, and the Tanimoto coefficient.
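
The recursive elimination in the first bullet can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `recursive_elimination` and `toy_score_and_rank`, the `drop_fraction` parameter, and the toy scoring rule are all hypothetical. In a real workflow the scoring callable would fit a random forest on the current descriptors and return a cross-validated score together with per-descriptor importances; a deterministic stand-in is used here so the example stays self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple

ScoreFn = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def recursive_elimination(
    descriptors: Sequence[str],
    score_and_rank: ScoreFn,
    drop_fraction: float = 0.5,  # assumed value; the summary does not state one
    min_descriptors: int = 1,
) -> Tuple[List[str], float]:
    """Repeatedly drop the least important descriptors and
    return the best-scoring subset seen along the way."""
    current = list(descriptors)
    best_subset, best_score = list(current), float("-inf")
    while True:
        score, importance = score_and_rank(current)
        if score > best_score:
            best_subset, best_score = list(current), score
        if len(current) <= min_descriptors:
            break
        # Drop a fraction of the descriptors, but always at least one.
        n_drop = max(1, int(len(current) * drop_fraction))
        n_keep = max(min_descriptors, len(current) - n_drop)
        ranked = sorted(current, key=lambda d: importance[d], reverse=True)
        current = ranked[:n_keep]
    return best_subset, best_score


# Hypothetical stand-in for "fit a random forest, return score and importances".
INFORMATIVE = {"logP", "MW"}


def toy_score_and_rank(subset: List[str]) -> Tuple[float, Dict[str, float]]:
    n_signal = sum(d in INFORMATIVE for d in subset)
    n_noise = len(subset) - n_signal
    score = n_signal - 0.01 * n_noise  # noise descriptors slightly hurt the score
    importance = {d: (1.0 if d in INFORMATIVE else 0.1) for d in subset}
    return score, importance


best, best_score = recursive_elimination(
    ["logP", "n1", "MW", "n2", "n3", "n4"], toy_score_and_rank
)
print(best, best_score)
```

Tracking the best subset across iterations, rather than returning the final one, lets the loop run past the optimum without losing it.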

Main Results:

  • Variable selection significantly reduced noisy descriptors (up to 96% with RF) and improved predictive performance.
  • RF selected important predictors without restricting pairwise correlations, unlike LASSO, which tends to exclude highly correlated variables.
  • LASSO's tendency to exclude correlated variables can lead to the omission of important predictors, potentially undermining model performance.
  • Optimal variable sets from RF and LASSO showed low similarity (Tanimoto coefficients < 0.20 in 7/8 datasets).
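
The Tanimoto comparison in the last bullet, and a Pearson-based correlation summary of a selected variable set, can be computed in a few lines. This is a sketch under stated assumptions: the descriptor names and values are invented for illustration, and averaging the absolute pairwise Pearson coefficients is only one plausible reading of the average-correlation metric, since the summary does not give exact definitions.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Iterable, List


def tanimoto(a: Iterable[str], b: Iterable[str]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of selected variables."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def mean_abs_pairwise_correlation(columns: Dict[str, List[float]]) -> float:
    """Average |r| over all pairs of descriptor columns (assumed definition)."""
    pairs = list(combinations(columns.values(), 2))
    if not pairs:
        return 0.0
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)


# Hypothetical selections: one shared descriptor out of six distinct ones.
rf_selected = {"d1", "d2", "d3", "d4"}
lasso_selected = {"d4", "d5", "d6"}
print(round(tanimoto(rf_selected, lasso_selected), 3))  # 0.167

# Hypothetical descriptor values for a selected set.
selected_columns = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],
    "d3": [4.0, 3.0, 2.0, 1.0],
}
print(mean_abs_pairwise_correlation(selected_columns))
```

A Tanimoto value of 1/6 (about 0.17), as in this toy case, would fall below the 0.20 threshold quoted in the results.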

Conclusions:

  • The strategy for variable selection, rather than the learning algorithm itself, primarily drives differences in predictive performance.
  • Effective variable selection is more critical than the choice of learning algorithm for successful QSAR modeling.
  • The study advocates developing a standardized procedure, based on the proposed metrics, to identify truly important variables for model interpretation and for applications in drug discovery and environmental toxicity assessment.