RAP-MLLM

Abstract

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval-Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition.

Retrieval-Augmented Personalization

Regions of interest detected by an open-world detector are used to retrieve concepts from the database. The images and accompanying information of the retrieved concepts are then integrated into the input for the MLLM.

RAP works in three main steps: Remember, Retrieve and Generate. (a) Remember: RAP includes a dedicated database that remembers each concept by storing its image and basic information, e.g., name, avatar and other attributes. (b) Retrieve: When a user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are incorporated into the MLLM's input for personalized, knowledge-augmented generation. RAP requires only one image per concept, together with its basic information, for personalization. Users can adjust the model's outputs in real time by modifying their personal databases, with no retraining required.
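The three steps can be illustrated with a small, self-contained sketch. This is not the RAP implementation: the `Concept` and `ConceptDatabase` names, the three-dimensional toy embeddings, the cosine-similarity threshold and the prompt layout are all assumptions made for illustration, standing in for a real multimodal retriever and a real MLLM input format.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    name: str
    info: str
    embedding: List[float]  # hypothetical stand-in for an image feature


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConceptDatabase:
    """Toy key-value store: concept name -> Concept (assumed layout)."""

    def __init__(self) -> None:
        self._store: Dict[str, Concept] = {}

    def remember(self, concept: Concept) -> None:
        # Remember: insert or overwrite the entry for this concept.
        self._store[concept.name] = concept

    def retrieve(self, query: List[float], top_k: int = 1,
                 min_score: float = 0.5) -> List[Concept]:
        # Retrieve: rank stored concepts by similarity to the query feature
        # and drop anything below the (assumed) threshold.
        scored = [(cosine(query, c.embedding), c) for c in self._store.values()]
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:top_k]]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    # Generate: prepend retrieved concept information to the user query.
    # A real system would pass this (plus images) to the MLLM.
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [question])


db = ConceptDatabase()
db.remember(Concept("toy1", "A plush brown bear.", [1.0, 0.0, 0.0]))
db.remember(Concept("my_dog", "A white corgi named Bo.", [0.0, 1.0, 0.0]))

region_feature = [0.9, 0.1, 0.0]  # feature of a detected region (made up)
hits = db.retrieve(region_feature)
prompt = build_prompt("Describe the image.", hits)
print(prompt)
```

In this toy setup the region feature is close to `toy1` and far from `my_dog`, so only the `toy1` entry is added to the prompt.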

Personalized Training Dataset

We first crop the target concept from the image based on the dataset annotations and then query Gemini to generate its personalized description. We also apply data augmentation to diversify these cropped images. We then combine them with the original image to derive a series of instructions and answers from Gemini. When noise concepts are included in the additional information, the answer remains unchanged, which trains the MLLMs to filter out irrelevant concepts.
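The noise-concept idea can be sketched as follows. This is a hypothetical illustration rather than the authors' pipeline: the `make_samples` function, the dictionary layout and the sample strings are assumptions, and the question and answer that Gemini would produce are passed in as plain strings instead of being generated.

```python
import random
from typing import Dict, List


def make_samples(target: str, question: str, answer: str,
                 distractors: List[str], num_noise: int = 1,
                 seed: int = 0) -> List[Dict[str, object]]:
    """Build a clean sample and a noisy variant with the same answer."""
    rng = random.Random(seed)

    # Clean sample: only the target concept is given as extra information.
    clean = {"concepts": [target], "question": question, "answer": answer}

    # Noisy sample: mix in unrelated concepts but keep the answer unchanged,
    # so the model is rewarded for ignoring the irrelevant entries.
    noise = rng.sample(distractors, k=min(num_noise, len(distractors)))
    mixed = [target] + noise
    rng.shuffle(mixed)
    noisy = {"concepts": mixed, "question": question, "answer": answer}

    return [clean, noisy]


samples = make_samples(
    target="<toy1>: A plush brown bear.",
    question="What is <toy1> sitting on?",
    answer="<toy1> is sitting on a wooden shelf.",
    distractors=["<my_dog>: A white corgi.", "<mug>: A blue ceramic mug."],
)
for sample in samples:
    print(sample)
```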

Examples of Personalized Image Captioning

Example images of the target concepts are shown on the left, and the generated captions are shown on the right.

Our RAP-MLLMs produce clear and accurate captions based on the database content, which also ensures the reliability of the outputs.

Examples of Personalized Conversation

Examples of Personalized Concept Recognition

Real-time Concept Editing

Our models support real-time editing of concepts by modifying the database. Based on the information recorded in the database, our RAP-LLaVA can provide reliable and accurate answers.
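A minimal sketch of why editing needs no retraining, assuming the database is a plain dictionary and the concept information is injected into the prompt at query time. The `personalized_prompt` helper, the `my_cat` entry and the prompt layout are hypothetical.

```python
from typing import Dict


def personalized_prompt(database: Dict[str, str], concept_name: str,
                        question: str) -> str:
    # The concept information is read from the database on every call,
    # so any edit takes effect on the very next query.
    info = database[concept_name]
    return f"<{concept_name}>: {info}\n{question}"


database = {"my_cat": "A grey cat named Momo who wears a red collar."}
before = personalized_prompt(database, "my_cat", "What is the cat wearing?")

# Edit the stored information; model weights are untouched.
database["my_cat"] = "A grey cat named Momo who wears a blue collar."
after = personalized_prompt(database, "my_cat", "What is the cat wearing?")

print(before)
print(after)
```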

Real-time Concept Updating

The first caption is generated when toy2 is not yet stored in the database. Once the new concept is added, RAP-LLaVA can recognize both toy1 and toy2.
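The before-and-after behaviour can be mimicked with a toy lookup. Here each detected region is represented by a made-up signature string and matched by exact key, which is only a stand-in for the similarity-based retrieval a real system would use; `recognize` and the signature names are assumptions.

```python
from typing import Dict, List


def recognize(region_signatures: List[str],
              database: Dict[str, str]) -> List[str]:
    """Return the names of detected regions that have a database entry."""
    return [database[sig] for sig in region_signatures if sig in database]


# Two regions are detected in the image, but only toy1 is stored so far.
regions = ["sig_toy1", "sig_toy2"]
database = {"sig_toy1": "toy1"}
first = recognize(regions, database)

# Adding toy2 to the database makes it recognizable on the next query.
database["sig_toy2"] = "toy2"
second = recognize(regions, database)

print(first, second)
```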

BibTeX

@InProceedings{Hao_2025_CVPR,
        author    = {Hao, Haoran and Han, Jiaming and Li, Changsheng and Li, Yu-Feng and Yue, Xiangyu},
        title     = {RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models},
        booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
        month     = {June},
        year      = {2025},
        pages     = {14538-14548}
    }

RAP:
Retrieval-Augmented Personalization for Multimodal Large Language Models

CVPR 2025

Once some user-specific concepts are introduced to RAP-LLaVA, it can remember them and achieve excellent performance in a variety of personalized multimodal generation tasks.