Related Concept Videos

Goodness-of-Fit Test

The goodness-of-fit test is a type of hypothesis test which determines whether the data "fits" a particular distribution. For example, one may suspect that some anonymous data may fit a binomial distribution. A chi-square test (meaning the distribution for the hypothesis test is chi-square) can be used to determine if there is a fit. The null and alternative hypotheses may be written in sentences or stated as equations or inequalities. The test statistic for a goodness-of-fit test is given as...

Expected Frequencies in Goodness-of-Fit Tests

A goodness-of-fit test is conducted to determine whether the observed frequency values are statistically similar to the frequencies expected for the dataset. Suppose the expected frequencies for a dataset are equal such as when predicting the frequency of any number appearing when casting a die. In that case, the expected frequency is the ratio of the total number of observations (n) to the number of categories (k).
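The equal-frequency case above can be sketched in a few lines of Python. This is a minimal illustration, not part of the source: the die counts are invented, the function names are hypothetical, and the statistic used is the standard Pearson chi-square form, sum of (O - E)^2 / E, which is assumed to be the one meant here.

```python
def expected_uniform_frequency(n_observations, n_categories):
    """Expected count per category when all categories are equally likely (n / k)."""
    return n_observations / n_categories


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for the six faces of a die rolled 60 times.
observed = [8, 12, 9, 11, 10, 10]
n = sum(observed)   # total number of observations
k = len(observed)   # number of categories

expected = [expected_uniform_frequency(n, k)] * k
statistic = chi_square_statistic(observed, expected)

print(f"expected count per face: {expected[0]}")
print(f"chi-square statistic: {statistic:.3f}")
```

The statistic would then be compared with a chi-square distribution with k - 1 degrees of freedom; a small value, as here, gives no evidence against a fair die.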

Regression Toward the Mean

Regression toward the mean ("RTM") is a phenomenon in which extremely high or low values (for example, an individual's blood pressure at a particular moment) appear closer to a group's average upon remeasurement. Although this statistical peculiarity is the result of random error and chance, it has been problematic across various medical, scientific, financial, and psychological applications. In particular, RTM, if not taken into account, can interfere when...

Improving Translational Accuracy

Base complementarity between the three bases of the mRNA codon and the tRNA anticodon is not a fail-safe mechanism. Inaccuracies can range from a single mismatch to no correct base pairing at all. The free energy difference between the correct and nearly correct base pairs can be as small as 3 kcal/mol. With complementarity being the only proofreading step, the estimated error frequency would be one wrong amino acid in every 100 amino acids incorporated. However, error frequencies observed in...

Multiple Regression

Multiple regression assesses a linear relationship between one response or dependent variable and two or more independent variables. It has many practical applications.
Farmers can use multiple regression to determine the crop yield based on more than one factor, such as water availability, fertilizer, soil properties, etc. Here, the crop yield is the response or dependent variable as it depends on the other independent variables. The analysis requires the construction of a scatter plot...

Prediction Intervals

The interval estimate of any variable is known as the prediction interval. It helps decide if a point estimate is dependable.
However, the point estimate is most likely not the exact value of the population parameter, but close to it. After calculating point estimates, we construct interval estimates, called confidence intervals or prediction intervals. This prediction interval comprises a range of values unlike the point estimate and is a better predictor of the observed sample value, y.

Be aware of overfitting by hyperparameter optimization!

Igor V Tetko¹,², Ruud van Deursen³, Guillaume Godin⁴

¹Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum Für Gesundheit Und Umwelt (GmbH), 86764, Neuherberg, Germany. igor.tetko@helmholtz-munich.de.

Journal of Cheminformatics

December 9, 2024

Summary

This summary is machine-generated.

Hyperparameter optimization in machine learning may cause overfitting. Pre-set hyperparameters gave similar results while greatly reducing computational time, and the Transformer CNN method further improved model accuracy.

Area of Science:

Computational Chemistry
Machine Learning
Drug Discovery

Background:

Hyperparameter optimization is common in machine learning for tasks like solubility prediction.
Previous studies utilized graph-based methods on diverse solubility datasets.
Concerns exist regarding potential overfitting during extensive hyperparameter tuning.

Purpose of the Study:

To investigate the impact of hyperparameter optimization on model performance in solubility prediction.
To compare the efficiency and accuracy of pre-set hyperparameters versus optimized ones.
To evaluate a novel Natural Language Processing-based representation learning method, Transformer CNN.

Main Methods:

Analysis of seven thermodynamic and kinetic solubility datasets.
Comparison of state-of-the-art graph-based methods with hyperparameter optimization and pre-set hyperparameters.
Implementation and evaluation of Transformer CNN, a Natural Language Processing approach using SMILES strings.

Main Results:

Hyperparameter optimization did not consistently improve model performance and could lead to overfitting.
Models with pre-set hyperparameters achieved results comparable to those of optimized models while reducing computational cost roughly 10,000-fold.
Transformer CNN outperformed graph-based methods in 26 out of 28 comparisons, demonstrating superior accuracy and efficiency.
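One way optimization can overfit is selection bias: when many configurations are scored on a small validation set and the best score is reported, that score is optimistic even if no configuration is truly better. The toy simulation below illustrates only this effect. It is not the study's protocol; the function names, sample sizes, and the assumption that all configurations share the same true error are hypothetical.

```python
import random


def mean_squared_error(rng, n_samples, noise_sd):
    """MSE estimated from n_samples simulated residuals drawn from N(0, noise_sd^2)."""
    return sum(rng.gauss(0.0, noise_sd) ** 2 for _ in range(n_samples)) / n_samples


def simulate_selection(n_configs=200, n_valid=50, n_test=5000, noise_sd=1.0, seed=0):
    """Pick the configuration with the lowest validation MSE, then re-score it on fresh data.

    Every configuration has the same true MSE (noise_sd ** 2), so any apparent
    gain from selection is due to noise in the small validation set.
    """
    rng = random.Random(seed)
    valid_scores = [mean_squared_error(rng, n_valid, noise_sd) for _ in range(n_configs)]
    best_valid = min(valid_scores)  # the score a naive search would report
    test_score = mean_squared_error(rng, n_test, noise_sd)  # same model, fresh data
    return best_valid, test_score


best_valid, test_score = simulate_selection()
print(f"best validation MSE: {best_valid:.3f}")
print(f"fresh test MSE:      {test_score:.3f}")
```

The selected validation error falls well below the true error of 1.0, while the error on fresh data returns to about 1.0. Scoring the chosen configuration on data that played no part in the search is what exposes the gap.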

Conclusions:

Extensive hyperparameter optimization can harm model generalization through overfitting.
Utilizing pre-set hyperparameters is a computationally efficient strategy yielding comparable predictive performance.
Transformer CNN represents a significant advancement in solubility prediction accuracy and speed.