



Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation.

Jisu Hwang¹, Incheol Kim¹

  • ¹Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea.

Sensors (Basel, Switzerland), February 5, 2021
Summary
This summary is machine-generated.

This study introduces JMEBS, a novel deep neural network for vision-and-language navigation (VLN). It enhances navigation success rates and path optimization using joint multimodal embedding and backtracking search.

Keywords:
backtracking-enabled greedy local search; deep neural network; multimodal embedding; natural language instruction; panoramic image; pretrained model; three-dimensional simulated indoor environment; vision-and-language navigation task


Area of Science:

  • Artificial Intelligence
  • Computer Vision
  • Natural Language Processing

Background:

  • Multimodal intelligent tasks integrating vision and language are gaining prominence.
  • Vision-and-language navigation (VLN) requires the agent to align and ground image and text data so that it can perceive the status of the task in real time.

Purpose of the Study:

  • To propose a novel deep neural network model, JMEBS, for enhanced performance in vision-and-language navigation tasks.
  • To improve task success rates and optimize navigation paths through advanced embedding and search algorithms.

Main Methods:

  • Developed a transformer-based joint multimodal embedding module for JMEBS, utilizing both multimodal and temporal contexts.
  • Implemented backtracking-enabled greedy local search (BGLS) with a novel global scoring method for action selection and trajectory evaluation.
  • Evaluated the model using the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.
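
The search idea in the methods above can be illustrated with a small sketch. It is not the authors' BGLS implementation: the graph, the per-node `MATCH` values, and the `global_score` function are hypothetical stand-ins for a learned score that compares a partial trajectory with the instruction. The sketch only shows the general pattern of extending the current path greedily while remaining free to jump back to an earlier, higher-scoring partial trajectory.

```python
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, List[str]]
Path = List[str]


def backtracking_greedy_search(
    graph: Graph,
    start: str,
    goal: str,
    score: Callable[[Path], float],
    max_steps: int = 50,
) -> Tuple[Path, int]:
    """Greedy search that may backtrack to a stored partial trajectory.

    `score` is an assumed global scoring function over whole partial paths.
    Returns the final path hypothesis and the number of backtracks taken.
    """
    path: Path = [start]
    visited = {start}
    frontier: List[Path] = []  # partial trajectories not yet followed
    backtracks = 0

    for _ in range(max_steps):
        node = path[-1]
        if node == goal:
            break
        # Local step: propose one-step extensions of the current path.
        for nxt in graph.get(node, []):
            if nxt not in visited:
                frontier.append(path + [nxt])
        # Drop stored candidates whose end node has been visited since.
        frontier = [p for p in frontier if p[-1] not in visited]
        if not frontier:
            break
        # Global step: pick the best-scoring partial trajectory overall.
        best = max(frontier, key=score)
        frontier.remove(best)
        if best[:-1] != path:
            backtracks += 1  # the winner does not extend the current path
        path = best
        visited.add(best[-1])

    return path, backtracks


# Hypothetical toy environment and instruction-match values.
TOY_GRAPH: Graph = {
    "hall": ["kitchen", "stairs"],
    "kitchen": ["pantry"],
    "pantry": [],
    "stairs": ["bedroom"],
    "bedroom": [],
}
MATCH = {"hall": 0.5, "kitchen": 0.9, "pantry": 0.1, "stairs": 0.6, "bedroom": 1.0}


def global_score(path: Path) -> float:
    """Assumed score: mean instruction match over the nodes of the path."""
    return sum(MATCH[n] for n in path) / len(path)


if __name__ == "__main__":
    print(backtracking_greedy_search(TOY_GRAPH, "hall", "bedroom", global_score))
```

In this toy run the search first follows the kitchen branch because it looks best locally, then abandons it once the pantry extension scores lower than the stored stairs alternative. A purely greedy search without the stored frontier would have stopped at the pantry dead end.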

Main Results:

  • The JMEBS model demonstrated improved task success rates and optimized navigation paths compared to existing models.
  • The novel global scoring method effectively improved performance by comparing partial trajectories with natural language instructions.
  • Experimental results validated the model's effectiveness across various operations on benchmark datasets.
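
This summary does not state which metrics were used to measure success rates and path quality. As an assumption, the sketch below uses two measures commonly reported for R2R-style navigation: success rate (SR) and success weighted by path length (SPL). The episode tuples are made-up illustrative values, not results from the study.

```python
from typing import List, Tuple

# One episode: (reached the goal?, shortest path length, length of path taken)
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for ok, _, _ in episodes if ok) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by path length.

    Each successful episode contributes shortest / max(taken, shortest),
    so longer-than-necessary paths are penalized; failures contribute 0.
    """
    total = 0.0
    for ok, shortest, taken in episodes:
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


if __name__ == "__main__":
    # Illustrative values only.
    demo = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
    print(success_rate(demo), spl(demo))
```

SPL rewards an agent only when it both succeeds and stays close to the shortest path, so a search strategy that backtracks must keep the extra distance small to improve it.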

Conclusions:

  • The proposed JMEBS model offers a significant advancement in vision-and-language navigation.
  • The integration of joint multimodal embedding and backtracking search effectively addresses key challenges in VLN.
  • JMEBS provides a robust framework for future research in embodied AI and multimodal understanding.