



Deep Visual-Semantic Alignments for Generating Image Descriptions.

Andrej Karpathy, Li Fei-Fei

    IEEE Transactions on Pattern Analysis and Machine Intelligence | August 12, 2016

    Summary
    This summary is machine-generated.

    This study introduces a deep learning model that generates natural language descriptions of images and their regions. The model aligns visual and language data in a shared multimodal embedding, which improves both region-level description generation and image-text retrieval performance.


    Area of Science:

    • Computer Vision
    • Natural Language Processing
    • Artificial Intelligence

    Background:

    • Generating descriptive text for images is a challenging task in AI.
    • Understanding the relationship between visual content and language is crucial for AI development.

    Purpose of the Study:

    • To develop a novel model for generating natural language descriptions of images and their specific regions.
    • To improve the accuracy and relevance of AI-generated image captions.

    Main Methods:

    • Utilized Convolutional Neural Networks (CNNs) for image region analysis and bidirectional Recurrent Neural Networks (RNNs) for sentence processing.
    • Developed a multimodal embedding and structured objective to align visual and language data.
    • Implemented a Multimodal Recurrent Neural Network architecture for description generation.
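
The alignment step above can be illustrated with a small sketch. It assumes that image regions and sentence words have already been embedded as vectors in the same space, scores an image-sentence pair by letting each word pick its best-matching region, and applies a max-margin ranking loss so that matching pairs score higher than mismatched ones. The function names, the margin value, and the toy vectors are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def image_sentence_score(regions: List[Vector], words: List[Vector]) -> float:
    """Each word is matched to its best-scoring region; the matches are summed."""
    return sum(max(dot(region, word) for region in regions) for word in words)


def ranking_loss(images: List[List[Vector]],
                 sentences: List[List[Vector]],
                 margin: float = 1.0) -> float:
    """Max-margin ranking loss over a batch where images[k] matches sentences[k].

    The margin of 1.0 is an assumed default for illustration.
    """
    n = len(images)
    scores = [[image_sentence_score(images[k], sentences[l]) for l in range(n)]
              for k in range(n)]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            # Image k should prefer its own sentence over sentence l.
            loss += max(0.0, scores[k][l] - scores[k][k] + margin)
            # Sentence k should prefer its own image over image l.
            loss += max(0.0, scores[l][k] - scores[k][k] + margin)
    return loss


# Toy batch: two images (two region vectors each) and two one-word sentences.
images = [[[1, 0], [0, 1]], [[-1, 0], [0, -1]]]
sentences = [[[1, 0]], [[-1, 0]]]
print(ranking_loss(images, sentences))
```

In this toy batch each matching pair already beats its mismatched pair by the full margin, so the loss is zero. Raising the margin makes the same pairs incur a penalty.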

    Main Results:

    • Achieved state-of-the-art results in image-text retrieval tasks on benchmark datasets (Flickr8K, Flickr30K, MSCOCO).
    • Generated descriptions outperformed retrieval baselines for both full images and region-level annotations.
    • Conducted a large-scale analysis of the language model on the Visual Genome dataset, highlighting differences between image-level and region-level caption statistics.
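
The summary does not say which metric was used for the retrieval results. Recall@K is a common choice for image-text retrieval and is assumed here purely for illustration: given a score matrix in which the correct candidate for query i sits at index i, it reports the fraction of queries whose correct candidate appears among the top K. The function name and the toy scores are hypothetical.

```python
from typing import List


def recall_at_k(scores: List[List[float]], k: int) -> float:
    """Fraction of queries whose correct candidate ranks in the top k.

    scores[i][j] is the score of candidate j for query i; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy score matrix for three queries and three candidates.
toy_scores = [
    [0.9, 0.1, 0.0],  # correct candidate ranked first
    [0.8, 0.5, 0.1],  # correct candidate ranked second
    [0.2, 0.3, 0.1],  # correct candidate ranked third
]
for k in (1, 2, 3):
    print(k, recall_at_k(toy_scores, k))
```

The same function covers both retrieval directions: pass image-to-sentence scores for sentence retrieval, or the transposed matrix for image retrieval.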

    Conclusions:

    • The proposed alignment model effectively bridges the gap between visual and language modalities.
    • The Multimodal RNN architecture successfully generates novel and accurate descriptions for image regions.
    • The findings contribute to advancements in image captioning and multimodal AI research.