Fine-Grained Image Captioning by Ranking Diffusion Transformer.

Jun Wan, Min Gan, Lefei Zhang

    IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society
    |December 15, 2025
    PubMed
    Summary
    This summary is machine-generated.

    This study introduces a new Ranking Diffusion Transformer (RDT) for image captioning, improving descriptive and discriminative captions by better using visual cues and aligning vision with language. The RDT model achieves state-of-the-art results on benchmark datasets.

    Area of Science:

    • Computer Vision
    • Artificial Intelligence
    • Natural Language Processing

    Background:

    • CLIP visual feature-based image captioning models have advanced rapidly.
    • Existing models struggle to generate descriptive and discriminative captions because they underuse fine-grained visual cues and because vision-language alignment is complex to model.

    Purpose of the Study:

    • To address limitations in current image captioning models.
    • To propose a novel approach for fine-grained image captioning that enhances descriptive and discriminative capabilities.

    Main Methods:

    • Introduced the Ranking Diffusion Transformer (RDT) model.
    • Integrated a Ranking Visual Encoder (RVE) with a novel ranking attention mechanism to mine diverse visual information from CLIP features.
    • Incorporated a Ranking Loss (RL) that uses caption quality ranking as a global semantic supervisory signal to enhance the diffusion process and vision-language alignment.
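
The summary does not give the formula for the Ranking Loss, so the sketch below is not the authors' method. It shows a generic pairwise margin (hinge) ranking loss, one common way to turn a quality ranking into a training signal. The function name, the `alignment_scores` and `quality_scores` inputs, and the `margin` value are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence


def pairwise_ranking_loss(
    alignment_scores: Sequence[float],
    quality_scores: Sequence[float],
    margin: float = 0.1,
) -> float:
    """Generic pairwise margin ranking loss, averaged over ordered pairs.

    alignment_scores: hypothetical model scores, e.g. image-caption similarity.
    quality_scores: hypothetical caption quality values used only for ordering.
    margin: assumed gap the better caption should win by.
    """
    if len(alignment_scores) != len(quality_scores):
        raise ValueError("score lists must have the same length")

    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(quality_scores)), 2):
        if quality_scores[i] == quality_scores[j]:
            continue  # ties impose no ordering constraint
        # Order the pair so that `hi` indexes the higher-quality caption.
        hi, lo = (i, j) if quality_scores[i] > quality_scores[j] else (j, i)
        # Penalize pairs where the better caption does not win by `margin`.
        gap = alignment_scores[hi] - alignment_scores[lo]
        total += max(0.0, margin - gap)
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


if __name__ == "__main__":
    # Model scores that agree with the quality order incur no penalty.
    print(pairwise_ranking_loss([0.9, 0.5, 0.1], [3, 2, 1]))
    # Model scores that invert the quality order are penalized.
    print(pairwise_ranking_loss([0.1, 0.5, 0.9], [3, 2, 1]))
```

In this sketch the loss is zero when every higher-quality caption outscores every lower-quality one by at least the margin, and it grows as the model's ordering departs from the quality ranking. How the RDT combines such a signal with its diffusion objective is not stated in this summary.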

    Main Results:

    • The RVE effectively mines diverse and discriminative visual information.
    • The RL strengthens vision-language semantic alignment by leveraging caption quality ranking.
    • By combining the RVE and the RL with controlled noise diffusion, the RDT model learns more discriminative visual features that are precisely aligned with language features.
    • Experimental results show the RDT surpasses existing state-of-the-art image captioning models on benchmark datasets.

    Conclusions:

    • The proposed Ranking Diffusion Transformer (RDT) effectively addresses limitations in current image captioning models.
    • The RDT model demonstrates superior performance in generating descriptive and discriminative captions by enhancing fine-grained visual cue utilization and vision-language alignment.
    • The RDT represents a significant advancement in the field of fine-grained image captioning.