



SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.

Xinhao Li¹, Denis Fourches¹

  • ¹Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695, United States.

Journal of Chemical Information and Modeling | March 15, 2021
PubMed
Summary
This summary is machine-generated.

SMILES Pair Encoding (SPE) enhances deep learning models by tokenizing molecules using frequent substrings, improving molecular generation and QSAR prediction performance over atom-level methods.


Area of Science:

  • Cheminformatics
  • Computational Chemistry
  • Machine Learning in Chemistry

Background:

  • Deep learning models that operate on Simplified Molecular Input Line Entry System (SMILES) strings are gaining traction in cheminformatics.
  • Current atom-level tokenization methods for SMILES may limit model performance.
  • A need exists for improved tokenization strategies to enhance chemical data representation.
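
For context, atom-level tokenization splits a SMILES string into one token per atom, bond, ring-closure digit, or bracket. The sketch below is a minimal illustration of that baseline, not the tokenizer used in the study: the regular expression and the function name `atomwise_tokens` are assumptions chosen for this example, and the pattern covers only common SMILES symbols.

```python
import re

# Assumed, simplified pattern: bracket atoms first, then two-letter
# halogens, then single-letter atoms, ring closures, and bond/branch symbols.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/.:~@*$])"
)


def atomwise_tokens(smiles):
    """Split a SMILES string into atom-level tokens (illustrative only)."""
    tokens = ATOM_PATTERN.findall(smiles)
    # Refuse input containing characters the simplified pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    return tokens
```

For example, `atomwise_tokens("CC(=O)Cl")` yields seven tokens, with `Cl` kept as a single token rather than split into `C` and `l`. Every token is a single chemical symbol, so sequences are long and each token carries little structural context.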

Purpose of the Study:

  • To introduce and evaluate SMILES Pair Encoding (SPE), a novel data-driven tokenization algorithm for SMILES.
  • To compare the performance of SPE against traditional atom-level tokenization in deep learning tasks.
  • To provide an open-source implementation of the SPE algorithm.

Main Methods:

  • Developed SPE, a tokenization algorithm that learns a vocabulary of high-frequency SMILES substrings from large chemical datasets.
  • Applied SPE to tokenize SMILES strings for training deep learning models.
  • Evaluated SPE-based models on molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks, comparing against atom-level tokenization.
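
One way to learn a vocabulary of high-frequency substrings is a byte-pair-encoding-style loop: count adjacent token pairs across the corpus, merge the most frequent pair into a new token, and repeat. The sketch below assumes that approach for illustration. It is not the SmilesPE implementation, and the names `learn_merges`, `merge_pair`, `apply_merges`, and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) pair in tokens with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_count=2):
    """Learn an ordered list of pair merges from a corpus of token lists.

    corpus: list of token lists, e.g. atom-level tokens for each molecule.
    Stops early once the most frequent pair occurs fewer than min_count times.
    """
    corpus = [list(tokens) for tokens in corpus]  # work on a copy
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        merges.append((left, right))
        corpus = [merge_pair(tokens, left, right) for tokens in corpus]
    return merges


def apply_merges(tokens, merges):
    """Tokenize a new molecule by replaying the learned merges in order."""
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens
```

With the toy corpus `[list("CCO"), list("CCN"), list("CCC")]`, the only pair frequent enough to merge is `("C", "C")`, so `apply_merges(list("CCCO"), merges)` gives `["CC", "C", "O"]`. A real vocabulary would be learned from a large chemical dataset, as the methods describe.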

Main Results:

  • SPE-based generative models produced molecules with greater novelty and diversity, and with a closer resemblance to the training-set distribution, than models using atom-level tokenization.
  • SPE-based QSAR prediction models consistently matched or outperformed atom-level and k-mer tokenization across 24 benchmark datasets.
  • The open-source Python package SmilesPE provides an implementation of the algorithm.

Conclusions:

  • SMILES Pair Encoding (SPE) is a promising tokenization method for enhancing SMILES-based deep learning models in cheminformatics.
  • In the reported experiments, SPE matched or improved on atom-level tokenization in both molecular generation and QSAR prediction.
  • The availability of the SmilesPE package promotes wider adoption and further research in this area.