Updated: Jan 14, 2026
RAR: Retrieving and Ranking Augmented MLLMs for Visual Recognition.

Ziyu Liu, Zeyi Sun, Yuhang Zang

    IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society
    |January 12, 2026
    Summary
    This summary is machine-generated.

    This study introduces RAR, a method that combines CLIP with Multimodal Large Language Models (MLLMs) to improve fine-grained visual recognition. RAR improves few-shot and zero-shot recognition on datasets with large, fine-grained category vocabularies.

    Area of Science:

    • Computer Vision
    • Artificial Intelligence
    • Machine Learning

    Background:

    • Contrastive Language-Image Pre-training (CLIP) excels at broad associations but struggles with fine-grained distinctions.
    • Multimodal Large Language Models (MLLMs) handle fine-grained classification but degrade with more categories and limited context.
    • Existing methods face challenges in few-shot/zero-shot recognition for large, detailed visual vocabularies.

    Purpose of the Study:

    • To develop a method that synergizes CLIP and MLLMs for enhanced few-shot/zero-shot recognition.
    • To address limitations in fine-grained recognition and MLLM performance with increased category numbers.
    • To improve accuracy on datasets with extensive and fine-grained visual categories.

    Main Methods:

    • Introduced RAR (Retrieving And Ranking), an augmented method for MLLMs.
    • Established a multi-modal retriever using CLIP to create an explicit memory for categories.
    • Implemented a retrieval and ranking process where MLLMs predict based on retrieved memory and context.
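
The retrieve-then-rank flow listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 3-dimensional vectors stand in for CLIP embeddings, `toy_mllm` is a hypothetical stand-in for a real MLLM call, and the function names, prompt wording, and shortlist size `k` are all invented for the example.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_memory(category_names, category_embeddings):
    """Explicit memory: one unit-norm embedding per category name."""
    return list(category_names), normalize(category_embeddings)

def retrieve_top_k(query_embedding, memory, k=3):
    """Return the k category names most similar to the query, best first."""
    names, embs = memory
    scores = embs @ normalize(query_embedding)
    order = np.argsort(-scores)[:k]
    return [names[i] for i in order]

def rank_with_mllm(image, candidates, mllm):
    """Ask an MLLM-like callable (hypothetical interface) to reorder the shortlist."""
    prompt = ("Rank these candidate categories for the image, best first: "
              + ", ".join(candidates))
    ranked = [c for c in mllm(image, prompt, candidates) if c in candidates]
    # Fall back to retrieval order for any candidate the model left out.
    return ranked + [c for c in candidates if c not in ranked]

# Toy data: assumed placeholders for real CLIP text/image embeddings.
memory = build_memory(
    ["sparrow", "finch", "warbler", "truck"],
    [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]],
)
query = [0.9, 0.2, 0.1]  # stands in for the embedding of the input image

def toy_mllm(image, prompt, candidates):
    """Hypothetical ranker that always prefers 'finch'; a real MLLM would read the image."""
    return sorted(candidates, key=lambda c: c != "finch")

shortlist = retrieve_top_k(query, memory, k=3)
ranked = rank_with_mllm(image=None, candidates=shortlist, mllm=toy_mllm)
print(shortlist)
print(ranked)
```

The point of the two-stage design is that the MLLM only sees a short list of `k` candidates rather than the full category vocabulary, so the prompt length stays fixed however many categories the memory holds.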

    Main Results:

    • RAR significantly boosts accuracy in vision-language recognition tasks.
    • Demonstrated substantial performance improvements on 5 fine-grained visual recognition benchmarks.
    • Achieved notable gains on 11 few-shot image recognition datasets and 2 object detection datasets under zero-shot settings.

    Conclusions:

    • RAR effectively combines the strengths of CLIP and MLLMs for superior fine-grained recognition.
    • The method overcomes context window limitations and category number constraints in MLLMs.
    • RAR offers a robust solution for few-shot/zero-shot recognition in complex visual datasets.