Visual-Text Reference Pretraining Model for Image Captioning.

Pengfei Li¹, Min Zhang¹, Peijie Lin¹

  • ¹Hangzhou Dianzi University, Baiyang Road #2, Hangzhou, China.

Computational Intelligence and Neuroscience
January 31, 2022
Summary
This summary is machine-generated.

The Visual-Text Reference Pretraining Model (VTR-PTM) improves image captioning by using both visual and textual references during pretraining. The authors report gains on most evaluation metrics on the MS COCO and Visual Genome benchmark datasets.

Area of Science:

  • Artificial Intelligence
  • Computer Vision
  • Natural Language Processing

Background:

  • Image captioning models traditionally rely on either visual or textual data.
  • Integrating both visual and textual information effectively remains a challenge.
  • Pretraining models offer a promising avenue for improving performance on downstream tasks.

Purpose of the Study:

  • To introduce a novel pretraining model, VTR-PTM (Visual-Text Reference Pretraining Model), for image captioning.
  • To leverage both visual and textual references within a unified pretraining framework.
  • To enhance the accuracy and relevance of automatically generated image captions.

Main Methods:

  • Designed a dual-stream input mode incorporating both image and text references.
  • Utilized two distinct masking strategies (bidirectional and sequence-to-sequence) for pretraining.
  • Built upon existing pretraining models such as BERT and UniLM.
  • Fine-tuned the VTR-PTM on target image captioning datasets.
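
The two masking strategies listed above can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the summary does not say how VTR-PTM builds its masks, so the code assumes the convention common to UniLM-style models, where reference tokens (image and text references concatenated into one segment) attend to each other freely and caption tokens attend to the references plus earlier caption tokens. The function name `build_attention_mask` and its parameters are hypothetical.

```python
def build_attention_mask(n_ref, n_cap, mode):
    """Return an n x n matrix of 0/1 flags, where n = n_ref + n_cap.

    mask[i][j] == 1 means position i may attend to position j.
    n_ref: number of reference tokens (assumed: image and text
           references concatenated into a single leading segment).
    n_cap: number of caption tokens that follow the references.
    mode:  "bidirectional" or "seq2seq".
    """
    n = n_ref + n_cap
    if mode == "bidirectional":
        # Every position sees every other position.
        return [[1] * n for _ in range(n)]
    if mode == "seq2seq":
        mask = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j < n_ref:
                    # All positions may attend to the reference segment.
                    mask[i][j] = 1
                elif i >= n_ref and j <= i:
                    # Caption tokens see themselves and earlier caption tokens.
                    mask[i][j] = 1
        return mask
    raise ValueError("mode must be 'bidirectional' or 'seq2seq'")


if __name__ == "__main__":
    # Two reference tokens followed by three caption tokens.
    for row in build_attention_mask(2, 3, "seq2seq"):
        print(row)
```

Under these assumptions, the sequence-to-sequence mask keeps reference tokens from seeing the caption, which suits caption generation, while the bidirectional mask lets every token see the full input.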

Main Results:

  • Achieved significant improvements across most evaluation metrics on benchmark datasets.
  • Demonstrated the effectiveness of the visual-text reference approach in pretraining.
  • Outperformed existing methods on MS COCO and Visual Genome datasets.

Conclusions:

  • The authors present VTR-PTM as the first pretraining model to use visual-text references for image captioning.
  • The proposed dual-stream input and masking strategies are crucial for the model's success.
  • The model shows strong potential for advancing the field of image captioning.