Counting clusters using R-NN curves.

Rajarshi Guha¹, Debojyoti Dutta, David J Wild

¹School of Informatics, Indiana University, Bloomington, Indiana 47406, USA. rguha@indiana.edu

Journal of Chemical Information and Modeling

July 3, 2007

Summary

This summary is machine-generated.

This study uses the R-NN curve algorithm to choose the number of clusters (k) a priori for k-means clustering in cheminformatics. Its estimates of k generally agree with the average silhouette width, a cluster quality measure.

Area of Science:

Cheminformatics
Computational Chemistry
Data Mining

Background:

Nonhierarchical clustering methods such as k-means require the number of clusters (k) to be specified in advance.
Traditionally, k is found by repeating the clustering over a range of k values and keeping the value that gives the best result.
Choosing k a priori avoids these repeated runs and makes clustering more efficient.
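
The traditional scan over k described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes Euclidean distance, uses a plain Lloyd's k-means with deterministic farthest-first seeding, and scores each k by the average silhouette width (the quality measure named later in this summary). The function names `kmeans`, `average_silhouette_width`, and `scan_k` are hypothetical, and the three-blob data set is synthetic.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for every row of X (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; farthest-first seeding is an assumption
    made here to keep the sketch deterministic."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        # Keep the old center if a cluster happens to lose all its points.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return nearest_center(X, centers)

def average_silhouette_width(X, labels):
    """Mean of (b - a) / max(a, b) over all points; singletons score 0."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue
        a = dist[i, same].sum() / (same.sum() - 1)   # mean intra-cluster distance
        b = min(dist[i, labels == c].mean()          # nearest other cluster
                for c in clusters if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths.mean()

def scan_k(X, k_values):
    """Cluster once per candidate k and return the best-scoring k."""
    scores = {k: average_silhouette_width(X, kmeans(X, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Synthetic example: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(30, 2))
    for c in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
])
best_k, scores = scan_k(data, range(2, 7))
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

Every candidate k costs a full clustering run plus a pairwise-distance computation, which is the overhead that an a priori estimate of k is meant to avoid.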

Purpose of the Study:

To introduce and evaluate the R-NN curve algorithm for a priori selection of k in clustering.
To assess the algorithm's ability to estimate the natural number of clusters.
To compare the R-NN curve algorithm's results with established cluster quality measures.

Main Methods:

Utilized the R-NN curve algorithm, based on nearest-neighbor analysis, to characterize compound spatial distributions.
Generated and analyzed R-NN curves to estimate the natural number of clusters.
Performed k-means clustering using the predicted k and compared results with average silhouette width.
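
As a rough illustration of the curve-generation step, the sketch below counts, for every point, how many other points lie within a growing radius R. It assumes Euclidean distance and radii expressed as equal fractions of the largest pairwise distance; both choices, and the name `rnn_curves`, are assumptions made here rather than details taken from the study. How the curves are then analyzed to estimate k is not described in this summary, so the sketch stops at generating them.

```python
import numpy as np

def rnn_curves(X, n_radii=100):
    """Return (radii, counts), where counts[i, j] is the number of other
    points within radii[j] of point i. One row is one R-NN curve."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix (fine for small data sets).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_max = dist.max()
    # Radii from d_max / n_radii up to d_max (assumed parameterization).
    radii = np.linspace(d_max / n_radii, d_max, n_radii)
    # Compare every distance against every radius; subtract 1 to drop the
    # point itself, which is always at distance 0.
    counts = (dist[:, :, None] <= radii).sum(axis=1) - 1
    return radii, counts

# Synthetic example: two well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(10.0, 0.0), scale=0.5, size=(30, 2)),
])
radii, counts = rnn_curves(points)
print(counts.shape)  # (60, 100)
```

Each curve is nondecreasing and ends at n - 1 neighbors once R reaches the largest pairwise distance. A point inside a dense region gains neighbors at small R, while an isolated point gains them only at larger R.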

Main Results:

The R-NN curve algorithm determined the natural number of clusters for the datasets tested.
Its estimates generally agreed with the k indicated by the average silhouette width.
The algorithm was effective on both simulated and real chemical data.

Conclusions:

The R-NN curve algorithm provides a reliable method for a priori determination of k in clustering.
This approach avoids repeated clustering over a range of k values, making clustering in cheminformatics simpler and more efficient.
It complements existing cluster quality metrics such as the average silhouette width.