Updated: Apr 30, 2026

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.

Shuai Zhao, Ruijie Quan, Linchao Zhu

IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society

March 3, 2025

Summary

This summary is machine-generated.

CLIP4STR adapts the CLIP vision-language model (VLM) to scene text recognition (STR). It pairs a visual branch with a cross-modal branch that refines the visual prediction, and it reports state-of-the-art results on 13 STR benchmarks.

Area of Science:

Computer Vision
Natural Language Processing
Artificial Intelligence

Background:

Vision-language models (VLMs) are foundational for many tasks, yet scene text recognition (STR) methods still rely mainly on backbones pre-trained on a single modality.
VLMs like CLIP can robustly identify both regular and irregular text in images.
Adapting VLMs for STR could therefore improve on visual-only approaches.

Purpose of the Study:

To introduce CLIP4STR, a novel scene text recognition method utilizing the CLIP vision-language model.
To develop a method that effectively integrates visual and textual semantics for enhanced text recognition.
To establish a strong baseline for future STR research leveraging VLMs.

Main Methods:

CLIP4STR uses two encoder-decoder branches: a visual branch and a cross-modal branch.
The visual branch provides an initial text prediction based on image features.
The cross-modal branch refines that prediction by reconciling the visual features with text semantics, using a dual predict-and-refine decoding scheme.
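
The two-stage flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the real branches are built on CLIP encoders with learned decoders, whereas here both branches are toy stand-ins based on string matching, and every name (`TOY_VOCAB`, `visual_branch`, `cross_modal_branch`, `predict_and_refine`) is hypothetical. The sketch shows only the control flow, in which an initial prediction from the visual branch is passed to a second branch that either refines it or leaves it unchanged.

```python
from difflib import get_close_matches

# Hypothetical toy vocabulary standing in for learned text semantics.
TOY_VOCAB = ["coffee", "street", "open", "exit"]


def visual_branch(glyphs):
    """Toy visual branch: read an initial prediction straight off the glyphs."""
    return "".join(glyphs)


def cross_modal_branch(glyphs, initial_prediction):
    """Toy cross-modal branch: reconcile visual evidence with text semantics.

    A vocabulary word is accepted only if it is close to the initial
    prediction and has as many characters as there are glyphs; otherwise
    the initial prediction is kept unchanged.
    """
    candidates = get_close_matches(initial_prediction, TOY_VOCAB, n=1, cutoff=0.6)
    if candidates and len(candidates[0]) == len(glyphs):
        return candidates[0]
    return initial_prediction


def predict_and_refine(glyphs):
    """Run the visual branch first, then let the cross-modal branch refine."""
    initial = visual_branch(glyphs)
    refined = cross_modal_branch(glyphs, initial)
    return initial, refined


if __name__ == "__main__":
    # A misread glyph ("0" instead of "o") is corrected in the refine step.
    print(predict_and_refine(list("c0ffee")))
```

Returning both the initial and the refined prediction makes it easy to see what the second stage changed. In this toy version the refine step falls back to the visual prediction whenever no vocabulary word is a plausible match.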

Main Results:

CLIP4STR achieved state-of-the-art performance across 13 scene text recognition benchmarks.
Scaling the model size, pre-training data, and training data significantly boosted CLIP4STR's effectiveness.
Empirical studies provide insights into CLIP's adaptation for scene text recognition tasks.

Conclusions:

CLIP4STR demonstrates the efficacy of adapting VLMs, specifically CLIP, for scene text recognition.
The proposed dual-branch architecture and decoding scheme effectively leverage multi-modal information.
CLIP4STR serves as a robust and simple baseline for advancing VLM-based scene text recognition research.