SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.

Xinhao Li¹, Denis Fourches¹

¹Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695, United States.

Journal of Chemical Information and Modeling

March 15, 2021

Summary

This summary is machine-generated.

SMILES Pair Encoding (SPE) enhances deep learning models by tokenizing molecules using frequent substrings, improving molecular generation and QSAR prediction performance over atom-level methods.

Area of Science:

Cheminformatics
Computational Chemistry
Machine Learning in Chemistry

Background:

Deep learning models that take Simplified Molecular Input Line Entry System (SMILES) strings as input are gaining traction in cheminformatics.
Current atom-level tokenization methods for SMILES may limit model performance.
A need exists for improved tokenization strategies to enhance chemical data representation.
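
To make "atom-level tokenization" concrete, here is a minimal sketch in Python. The regular expression is a simplified illustration, not the pattern used by the authors, and the names `ATOM_TOKEN` and `atom_tokenize` are hypothetical. It assumes that bracket atoms, `Cl`, `Br`, and two-digit ring closures are single tokens and that every other character is its own token.

```python
import re

# Simplified atom-level pattern (an assumption for illustration only).
ATOM_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [C@H] or [NH4+]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # two-digit ring-closure labels such as %12
    r"|[A-Za-z]"             # single-letter atoms, aliphatic or aromatic
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#\-+\\/().:~@*$]"   # bonds, branches, and other symbols
)


def atom_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = ATOM_TOKEN.findall(smiles)
    # Refuse input that the pattern cannot fully cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize SMILES: {smiles!r}")
    return tokens


print(atom_tokenize("C[C@H](N)Cl"))
# ['C', '[C@H]', '(', 'N', ')', 'Cl']
```

Each atom, bond, and ring label becomes a separate token, so even a small molecule yields a long sequence of very short tokens.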

Purpose of the Study:

To introduce and evaluate SMILES Pair Encoding (SPE), a novel data-driven tokenization algorithm for SMILES.
To compare the performance of SPE against traditional atom-level tokenization in deep learning tasks.
To provide an open-source implementation of the SPE algorithm.

Main Methods:

Developed SPE, a tokenization algorithm that learns a vocabulary of high-frequency SMILES substrings from large chemical datasets.
Applied SPE to tokenize SMILES strings for training deep learning models.
Evaluated SPE-based models on molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks, comparing against atom-level tokenization.
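
The vocabulary-learning step can be sketched as a byte-pair-encoding-style merge loop. This is a minimal illustration under stated assumptions, not the SmilesPE implementation: the function names, the `min_freq` stopping rule, and the tie-breaking order are hypothetical, and the toy corpus uses only SMILES in which every character is already one atom-level token.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) pair with the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(corpus, num_merges, min_freq=2):
    """Learn up to num_merges merge rules from a corpus of token lists."""
    corpus = [list(tokens) for tokens in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties broken by the pair itself (an assumption).
        (left, right), freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_freq:
            break
        merges.append((left, right))
        corpus = [merge_pair(tokens, left, right) for tokens in corpus]
    return merges


def apply_merges(tokens, merges):
    """Tokenize by replaying the learned merges in order."""
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


# Toy corpus: each character is one atom-level token in these strings.
corpus = [list(s) for s in ["CC(=O)O", "CC(=O)N", "CCO"]]
merges = learn_merges(corpus, num_merges=10)
print(merges[0])                           # ('C', 'C')
print(apply_merges(list("CCO"), merges))
```

Each merge adds one substring to the vocabulary, so the number of merges (or the frequency threshold) controls how coarse the resulting tokens are. Joining the output tokens always reproduces the original SMILES string.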

Main Results:

Generative models trained with SPE tokens produced molecules with greater novelty and diversity, and a closer match to the training-set distribution, than models trained with atom-level tokens.
SPE-based QSAR prediction models consistently matched or outperformed atom-level and k-mer tokenization across 24 benchmark datasets.
The developed open-source Python package, SmilesPE, facilitates the implementation of this algorithm.
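
For comparison, the k-mer baseline mentioned above can be sketched as a sliding window over atom-level tokens. This is an illustrative assumption about how such a tokenizer works; the function name, the default `k`, and the `stride` parameter are hypothetical and are not taken from the paper's summary.

```python
def kmer_tokenize(tokens, k=4, stride=1):
    """Join every window of k consecutive atom-level tokens into one token."""
    if not tokens:
        return []
    if len(tokens) < k:
        # Too short for a full window: keep the whole sequence as one token.
        return ["".join(tokens)]
    return ["".join(tokens[i:i + k]) for i in range(0, len(tokens) - k + 1, stride)]


# Each character is one atom-level token in this simple string.
print(kmer_tokenize(list("CCOCC"), k=3))
# ['CCO', 'COC', 'OCC']
```

Unlike a learned vocabulary, the windows here overlap and have a fixed length, so the token set is determined by `k` rather than by substring frequency in the data.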

Conclusions:

SMILES Pair Encoding (SPE) is a promising tokenization method for enhancing SMILES-based deep learning models in cheminformatics.
SPE matched or improved on atom-level tokenization in both molecular generation and QSAR prediction.
The availability of the SmilesPE package promotes wider adoption and further research in this area.