Fine-Grained Image Captioning by Ranking Diffusion Transformer.

Jun Wan, Min Gan, Lefei Zhang

IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society

December 15, 2025

Summary

This summary is machine-generated.

This study introduces a new Ranking Diffusion Transformer (RDT) for image captioning, improving descriptive and discriminative captions by better using visual cues and aligning vision with language. The RDT model achieves state-of-the-art results on benchmark datasets.

Area of Science:

Computer Vision
Artificial Intelligence
Natural Language Processing

Background:

CLIP visual feature-based image captioning models have advanced rapidly.
Existing models struggle to generate descriptive and discriminative captions because they make insufficient use of fine-grained visual cues and because vision-language alignment is complex to model.

Purpose of the Study:

To address limitations in current image captioning models.
To propose a novel approach for fine-grained image captioning that enhances descriptive and discriminative capabilities.

Main Methods:

Introduced the Ranking Diffusion Transformer (RDT) model.
Integrated a Ranking Visual Encoder (RVE) with a novel ranking attention mechanism to mine diverse visual information from CLIP features.
Incorporated a Ranking Loss (RL) that uses caption quality ranking as a global semantic supervisory signal to enhance the diffusion process and vision-language alignment.
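
The summary does not give the formula for the Ranking Loss, so the following is only a minimal sketch of one common way to turn a caption quality ranking into a training signal: a pairwise hinge loss. The function name `pairwise_ranking_loss`, the `scores` and `quality` inputs, and the `margin` value are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def pairwise_ranking_loss(scores, quality, margin=0.1):
    """Mean hinge loss over caption pairs whose quality differs.

    scores  : model scores for each candidate caption (hypothetical input)
    quality : reference quality of each caption, e.g. from a caption
              metric (hypothetical input)
    margin  : score gap required between a better and a worse caption
              (assumed value, not from the paper)
    """
    if len(scores) != len(quality):
        raise ValueError("scores and quality must have the same length")

    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if quality[i] == quality[j]:
            continue  # tied captions carry no ranking information
        # Order the pair so that `better` is the higher-quality caption.
        better, worse = (i, j) if quality[i] > quality[j] else (j, i)
        # Penalize the pair unless the better caption outscores the
        # worse one by at least `margin`.
        losses.append(max(0.0, margin - (scores[better] - scores[worse])))

    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions, the loss is zero when the model's scores already order the captions as the quality ranking does, with a gap of at least `margin` in every pair, and it grows as pairs are scored in the wrong order. How the actual RL is defined and combined with the diffusion objective is not stated in this summary.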

Main Results:

The RVE effectively mines diverse and discriminative visual information.
The RL strengthens vision-language semantic alignment by leveraging caption quality ranking.
Through the combined effect of the RVE, the RL, and controlled noise diffusion, the RDT model learns more discriminative visual features that are precisely aligned with language features.
Experimental results show the RDT surpasses existing state-of-the-art image captioning models on benchmark datasets.

Conclusions:

The proposed Ranking Diffusion Transformer (RDT) effectively addresses limitations in current image captioning models.
The RDT model demonstrates superior performance in generating descriptive and discriminative captions by enhancing fine-grained visual cue utilization and vision-language alignment.
The RDT represents a significant advancement in the field of fine-grained image captioning.