Egocentric Perception of Walking Environments Using an Interactive Vision-Language System.

Haining Tan, Alex Mihailidis, Brokoslaw Laschowski

IEEE ... International Conference on Rehabilitation Robotics : [Proceedings]

July 11, 2025

Summary

This summary is machine-generated.

This study introduces a multimodal vision-language system for egocentric perception, enhancing scene understanding for robotics. The system generates personalized image captions with audio feedback, improving human-AI interaction in real-world navigation.

Area of Science:

Computer Vision
Artificial Intelligence
Robotics

Background:

Large language models (LLMs) offer contextual scene understanding beyond what computer vision alone provides.
Embodied intelligence and robotics benefit from enhanced perception systems.

Purpose of the Study:

Develop a multimodal vision-language system for egocentric visual perception.
Enable personalized image captioning and audio feedback for real-world navigation.

Main Methods:

Trained transformer-based vision-language models using causal language modeling.
Utilized a custom dataset of 43,055 image-text pairs for few-shot image captioning.
Developed a speech synthesis model and user interface for audio feedback and personalized captions via user prompts.
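
For readers unfamiliar with the term, causal language modeling usually means training the model to predict each caption token from the tokens that precede it. When the model is also conditioned on an image, a common way to write the objective is shown below. The notation is assumed for illustration (I for the input image, w_t for the t-th caption token, T for the caption length, theta for the model parameters); the summary does not state the exact loss the authors used.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(w_t \mid w_{<t},\, I\right)
```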

Main Results:

Generated detailed image captions (10 words on average), reaching a ROUGE-L score of 43.9% and a word error rate of 28.1%.
Achieved end-to-end processing time of 2.2 seconds.
Demonstrated effective personalization of captions through user prompts.
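
The two metrics reported above can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercasing, and a single reference per caption. The function names are hypothetical, and the summary does not say which tooling or settings the authors used, so this shows only the standard definitions: ROUGE-L as an F1 score over the longest common subsequence, and word error rate as word-level edit distance divided by reference length.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up captions for illustration only; not data from the study.
    reference = "a staircase going down ahead"
    candidate = "a staircase ahead"
    print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.3f}")
    print(f"WER: {word_error_rate(reference, candidate):.3f}")
```

Higher ROUGE-L means more overlap with the reference caption, while lower word error rate means fewer word-level edits are needed to match it.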

Conclusions:

The multimodal system provides accurate, detailed scene descriptions.
Personalized captions optimize human-AI interaction for environmental understanding and navigation.
This work advances embodied AI by integrating human cognition into generative models.