Language-Aware Vision Transformer for Referring Segmentation.

Zhao Yang, Jiaqi Wang, Xubing Ye

    IEEE Transactions on Pattern Analysis and Machine Intelligence
    |September 25, 2024
    Summary
    This summary is machine-generated.

    This study introduces the Language-Aware Vision Transformer (LAVT) for referring segmentation. LAVT localizes objects more accurately by fusing linguistic and visual features early, in the intermediate layers of the vision Transformer encoder, and it improves segmentation accuracy for both images and videos.

    Area of Science:

    • Computer Vision
    • Natural Language Processing
    • Artificial Intelligence

    Background:

    • Referring segmentation is a vision-language task that requires precisely localizing and segmenting an object from a textual description.
    • Existing methods often rely on late fusion within cross-modal decoders, which can limit alignment accuracy.
    • Transformers have shown success in vision-language tasks, but their application in referring segmentation can be improved.

    Purpose of the Study:

    • To propose a novel framework, Language-Aware Vision Transformer (LAVT), for improved referring segmentation.
    • To enhance cross-modal alignment by fusing linguistic and visual features early in the vision Transformer encoder.
    • To develop a unified framework capable of handling both image and video referring segmentation tasks.

    Main Methods:

    • Implemented early fusion of linguistic and visual features within the intermediate layers of a vision Transformer encoder.
    • Introduced a dense attention mechanism for capturing pixel-specific linguistic cues.
    • Developed a 3D version of the dense attention mechanism with multi-scale convolutional operators for video segmentation, leveraging spatio-temporal dependencies.
    • Proposed a unified LAVT framework for both image and video referring segmentation.
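
The dense attention idea in the methods above can be illustrated with a minimal sketch in which every pixel attends over the words of the expression and receives its own language summary. This is not the authors' implementation: the function name `pixel_word_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the scaled dot-product scoring, and the residual fusion at the end are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixel_word_attention(visual, words, w_q, w_k, w_v):
    """Hypothetical dense attention: each pixel attends over all words.

    visual: (H, W, C) feature map from an intermediate encoder stage
    words:  (T, D) word embeddings of the referring expression
    w_q:    (C, A) projects pixels to queries (assumed, not from the paper)
    w_k:    (D, A) projects words to keys (assumed)
    w_v:    (D, C) projects words to values in the visual space (assumed)
    """
    h, w, c = visual.shape
    pixels = visual.reshape(h * w, c)      # one row per pixel
    queries = pixels @ w_q                 # (H*W, A)
    keys = words @ w_k                     # (T, A)
    values = words @ w_v                   # (T, C)

    # One attention distribution over the T words for every pixel.
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    weights = softmax(scores, axis=-1)     # (H*W, T)

    # Pixel-specific linguistic cue: a per-pixel weighted sum of word values.
    lang_per_pixel = weights @ values      # (H*W, C)

    # Assumption: fuse by residual addition so later encoder layers
    # operate on language-aware visual features.
    fused = pixels + lang_per_pixel
    return fused.reshape(h, w, c), weights.reshape(h, w, -1)


# Usage with random inputs: a 4x5 feature map and a 3-word expression.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 5, 8))
words = rng.normal(size=(3, 6))
w_q = rng.normal(size=(8, 16))
w_k = rng.normal(size=(6, 16))
w_v = rng.normal(size=(6, 8))
fused, weights = pixel_word_attention(visual, words, w_q, w_k, w_v)
print(fused.shape, weights.shape)  # (4, 5, 8) (4, 5, 3)
```

Applying a step like this inside the encoder, rather than after it, is what distinguishes early fusion from the late-fusion decoders mentioned in the background. The same pattern could in principle be extended to video by adding a frame axis to the feature map, though the 3D variant with multi-scale convolutional operators described above is not reproduced here.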

    Main Results:

    • Achieved significantly better cross-modal alignment than previous methods.
    • Demonstrated state-of-the-art performance on seven benchmark datasets for referring image and video segmentation.
    • The proposed LAVT framework provides accurate segmentation with a lightweight mask predictor.

    Conclusions:

    • Early fusion of multi-modal features in vision Transformer encoders is an effective strategy for referring segmentation.
    • The LAVT framework offers a unified and efficient approach for both image and video referring segmentation.
    • The proposed dense attention mechanism successfully extracts pixel-specific linguistic cues, improving segmentation accuracy.