RAR: Retrieving and Ranking Augmented MLLMs for Visual Recognition.

Ziyu Liu, Zeyi Sun, Yuhang Zang

IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society

January 12, 2026

Summary

This summary is machine-generated.

This study introduces RAR, a novel method combining CLIP and Multimodal Large Language Models (MLLMs) to improve fine-grained visual recognition. RAR enhances few-shot and zero-shot capabilities for extensive, detailed datasets.

Area of Science:

Computer Vision
Artificial Intelligence
Machine Learning

Background:

Contrastive Language-Image Pre-training (CLIP) excels at broad associations but struggles with fine-grained distinctions.
Multimodal Large Language Models (MLLMs) handle fine-grained classification but degrade with more categories and limited context.
Existing methods face challenges in few-shot/zero-shot recognition for large, detailed visual vocabularies.

Purpose of the Study:

To develop a method that synergizes CLIP and MLLMs for enhanced few-shot/zero-shot recognition.
To address limitations in fine-grained recognition and MLLM performance with increased category numbers.
To improve accuracy on datasets with extensive and fine-grained visual categories.

Main Methods:

Introduced RAR (Retrieving And Ranking), a retrieval-augmented method for MLLMs.

Established a multi-modal retriever using CLIP to create an explicit memory for categories.

Implemented a retrieval and ranking process where MLLMs predict based on retrieved memory and context.
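The retrieve-then-rank flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors stand in for CLIP embeddings, the memory is a plain dictionary, and `stub_ranker` is a hypothetical placeholder for the MLLM call. All function and variable names here are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query: Vector, memory: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k category names whose stored embeddings are closest to the query."""
    scored = [(name, cosine(query, emb)) for name, emb in memory.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_ranking_prompt(candidates: List[str]) -> str:
    """Format the retrieved candidates as a short prompt for the ranking model."""
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(candidates))
    return "Rank these candidate categories for the image, most likely first:\n" + options


def retrieve_and_rank(
    query: Vector,
    memory: Dict[str, Vector],
    k: int,
    rank_fn: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Retrieve k candidates from memory, then let rank_fn reorder them."""
    candidates = [name for name, _ in retrieve_top_k(query, memory, k)]
    ranked = rank_fn(build_ranking_prompt(candidates), candidates)
    # Keep only names that were actually retrieved, in the ranker's order.
    return [name for name in ranked if name in candidates]


def stub_ranker(prompt: str, candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for an MLLM: simply reverses the candidate order."""
    return list(reversed(candidates))


# Toy memory: made-up embeddings, not real CLIP features.
memory = {
    "sparrow": [0.9, 0.1, 0.0],
    "finch": [0.8, 0.3, 0.1],
    "sedan": [0.0, 0.2, 0.9],
    "truck": [0.1, 0.1, 0.95],
}
query = [0.85, 0.25, 0.05]

top2 = [name for name, _ in retrieve_top_k(query, memory, 2)]
final = retrieve_and_rank(query, memory, 2, stub_ranker)
```

The point of the split is that the ranking model only ever sees the k retrieved candidates rather than the full category list, which is how the summary describes the method working around category-count and context-window limits.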

Main Results:

RAR significantly boosts accuracy in vision-language recognition tasks.
Demonstrated substantial performance improvements on 5 fine-grained visual recognition benchmarks.
Achieved notable gains on 11 few-shot image recognition datasets and 2 object detection datasets under zero-shot settings.

Conclusions:

RAR effectively combines the strengths of CLIP and MLLMs for superior fine-grained recognition.
The method overcomes context window limitations and category number constraints in MLLMs.
RAR offers a robust solution for few-shot/zero-shot recognition in complex visual datasets.