Updated: Apr 30, 2026


CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.

Shuai Zhao, Ruijie Quan, Linchao Zhu

    IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society | March 3, 2025
    PubMed
    Summary
    This summary is machine-generated.

    CLIP4STR adapts the CLIP vision-language model (VLM) to scene text recognition (STR). It combines a visual branch with a cross-modal branch that refines the visual prediction, and it reports state-of-the-art accuracy on 13 STR benchmarks.


    Area of Science:

    • Computer Vision
    • Natural Language Processing
    • Artificial Intelligence

    Background:

    • Vision-language models (VLMs) are foundational for many tasks, yet scene text recognition (STR) predominantly uses single-modality pre-trained models.
    • CLIP, a widely used VLM, can robustly identify both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images.
    • Adapting VLMs for STR offers potential for improved performance beyond traditional visual-only approaches.

    Purpose of the Study:

    • To introduce CLIP4STR, a novel scene text recognition method utilizing the CLIP vision-language model.
    • To develop a method that effectively integrates visual and textual semantics for enhanced text recognition.
    • To establish a strong baseline for future STR research leveraging VLMs.

    Main Methods:

    • CLIP4STR is built on CLIP's image and text encoders and has two encoder-decoder branches: a visual branch and a cross-modal branch.
    • The visual branch produces an initial text prediction from image features alone.
    • The cross-modal branch refines that prediction by resolving discrepancies between the visual features and the text semantics; at inference, a dual predict-and-refine decoding scheme combines both branches.
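
The two-branch flow above can be illustrated with a toy sketch. This is not the authors' implementation: the glyph table, the `AMBIGUOUS` set, and the functions `visual_branch`, `cross_modal_branch`, and `predict_and_refine` are hypothetical stand-ins, and a hand-written context rule replaces the learned CLIP encoders and decoders. It only shows the control flow of predicting from visual evidence first and then refining with text context.

```python
from typing import List, Tuple

# Hypothetical glyph codes standing in for visual features (assumption).
# Code 4 is a round shape that the visual branch reads as the digit "0".
GLYPHS = {0: "c", 1: "o", 2: "f", 3: "e", 4: "0", 5: "2"}
AMBIGUOUS = {4}  # glyph codes whose visual evidence alone is unreliable


def visual_branch(feature: List[int]) -> str:
    """Initial prediction: decode every position from visual evidence only."""
    return "".join(GLYPHS[code] for code in feature)


def cross_modal_branch(feature: List[int], initial: str) -> str:
    """Refinement: revisit ambiguous glyphs using the surrounding text."""
    chars = list(initial)
    for i, code in enumerate(feature):
        # Characters immediately left and right of position i, if present.
        neighbours = chars[max(0, i - 1):i] + chars[i + 1:i + 2]
        # Toy rule (assumption): a "0" whose neighbours are all letters is
        # more plausibly the letter "o".
        if code in AMBIGUOUS and neighbours and all(n.isalpha() for n in neighbours):
            chars[i] = "o"
    return "".join(chars)


def predict_and_refine(feature: List[int]) -> Tuple[str, str]:
    """Return (initial visual prediction, cross-modal refinement)."""
    initial = visual_branch(feature)
    refined = cross_modal_branch(feature, initial)
    return initial, refined


if __name__ == "__main__":
    print(predict_and_refine([0, 4, 2, 2, 3, 3]))  # ('c0ffee', 'coffee')
    print(predict_and_refine([5, 4, 5, 4]))        # ('2020', '2020')
```

In the first call the refinement step overrides the visual reading because the text context is alphabetic; in the second it leaves the digits untouched. In the real model both steps are learned rather than rule-based.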

    Main Results:

    • CLIP4STR achieved state-of-the-art performance on 13 scene text recognition benchmarks.
    • Scaling the model size, the pre-training data, and the training data further improved CLIP4STR's accuracy.
    • Empirical studies provide insights into CLIP's adaptation for scene text recognition tasks.

    Conclusions:

    • CLIP4STR demonstrates the efficacy of adapting VLMs, specifically CLIP, for scene text recognition.
    • The proposed dual-branch architecture and decoding scheme effectively leverage multi-modal information.
    • CLIP4STR serves as a robust and simple baseline for advancing VLM-based scene text recognition research.