Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models.
To fulfill these requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image content. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D/3D generation. It has strong potential to serve as a versatile tool for image-related tasks. All the code, models, and training data will be publicly available.
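As a concrete illustration of the alpha-channel input, below is a minimal sketch of how a user-specified box or mask could be turned into the auxiliary alpha map. The binary 0/1 convention and the resized resolution follow the description above; any extra normalization applied by the released preprocessing is omitted here.

```python
import torch
import torch.nn.functional as F

def alpha_from_box(h, w, box):
    """Alpha map for a rectangular box given as (x0, y0, x1, y1) in pixels of the resized image."""
    x0, y0, x1, y1 = box
    alpha = torch.zeros(1, 1, h, w)          # 0 = background
    alpha[..., y0:y1, x0:x1] = 1.0           # 1 = region of interest
    return alpha

def alpha_from_mask(mask, h, w):
    """Alpha map from a binary (H, W) mask, e.g. produced by SAM, resized to the CLIP input size."""
    alpha = mask.float()[None, None]                          # (1, 1, H, W)
    return F.interpolate(alpha, size=(h, w), mode="nearest")

# Example: a 224x224 input with a box around the object of interest.
alpha = alpha_from_box(224, 224, (40, 60, 200, 180))
print(alpha.shape)    # torch.Size([1, 1, 224, 224])
```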
Alpha-CLIP enhances CLIP across a wide array of downstream tasks in a plug-and-play manner, spanning domains from perception to 2D and 3D generation. We show the downstream tasks of Alpha-CLIP and their advantages over the original CLIP in the following figure and table.
Alpha-CLIP boosts the region-based recognition capability of the original CLIP. We set the alpha channel to one within the area of interest and zero elsewhere to obtain precise object categories in complex images. The illustration is as follows.
We test the recognition ability of both the original CLIP and Alpha-CLIP with different alpha-map granularities (whole image, rectangular box, and mask) on ImageNet. The results are as follows.
| Model | Alpha Map | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| CLIP | - | 73.48 | 91.60 |
| Alpha-CLIP | whole image | 73.37 | 91.75 |
| Alpha-CLIP | rectangular box | 75.62 | 93.34 |
| Alpha-CLIP | mask | 77.41 | 94.45 |
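The following sketch illustrates the alpha-conditioned zero-shot classification evaluated above. The `load_alpha_clip` loader and the `(rgb, alpha)` signature of `encode_image` are assumptions standing in for the released entry points; only `clip.tokenize` comes from the standard OpenAI CLIP package.

```python
import torch
import clip                      # OpenAI CLIP, used here for text tokenization
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_alpha_clip("ViT-B/16", device=device)      # hypothetical loader

rgb = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)   # (1, 3, 224, 224)
alpha = torch.zeros(1, 1, 224, 224, device=device)
alpha[..., 60:180, 40:200] = 1.0                                    # box around the region of interest

classnames = ["golden retriever", "tabby cat", "sofa"]
texts = clip.tokenize([f"a photo of a {c}" for c in classnames]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(rgb, alpha)     # assumed alpha-aware forward
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.T        # (1, num_classes)

print(classnames[logits.argmax(dim=-1).item()])
```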
We replace the CLIP used in BLIP-2 and LLaVA-1.5 with our Alpha-CLIP to make the MLLM focus directly on user-defined regions in vision-language tasks such as region-level captioning and VQA. All cases shown here are obtained simply by replacing the original CLIP of BLIP-2 or LLaVA-1.5 with a plug-in Alpha-CLIP, without further tuning.
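A minimal sketch of this plug-and-play swap is shown below, assuming the Alpha-CLIP visual encoder exposes a `(pixel_values, alpha)` forward that returns tokens with the same shape as the original CLIP tower; the wrapper class is illustrative and not the released integration code.

```python
import torch
import torch.nn as nn

class AlphaVisionTowerWrapper(nn.Module):
    """Drop-in replacement for the CLIP vision tower used by BLIP-2 / LLaVA-1.5."""

    def __init__(self, alpha_clip_visual: nn.Module):
        super().__init__()
        self.visual = alpha_clip_visual          # Alpha-CLIP image encoder (kept frozen)

    def forward(self, pixel_values: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # Returns patch tokens with the same shape the original CLIP tower produced,
        # so the pretrained projector and language model need no further tuning.
        return self.visual(pixel_values, alpha)

# mllm.vision_tower = AlphaVisionTowerWrapper(alpha_clip_model.visual)
# At inference time, the alpha map is built from the user-specified region and
# passed alongside the image; when no region is given, alpha is set to all ones.
```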
Besides qualitative results, we also provide quantitative region captioning results of Alpha-CLIP with LLaVA-1.5 on Visual Genome and RefCOCOg.
| Model | RefCOCOg METEOR | RefCOCOg CIDEr | Visual Genome METEOR | Visual Genome CIDEr |
|---|---|---|---|---|
| GRIT | 15.2 | 71.6 | 17.1 | 142 |
| Kosmos-2 | 14.1 | 62.3 | - | - |
| GPT4RoI | - | - | 17.4 | 145.2 |
| GLaMM | 16.2 | 105.0 | 18.6 | 157.8 |
| Alpha-CLIP+LLaVA | 16.7 | 109.2 | 18.9 | 160.3 |
Alpha-CLIP can be used in most image variation models that use the CLIP image encoder. For example, BLIP-Diffusion bridges CLIP and Stable Diffusion with a Q-Former to generate and edit 2D images controlled by text. By introducing Alpha-CLIP, we can add an additional visual prompt that allows the model to focus on specified regions for 2D generation, enabling subject-driven generation in complex images. In each group of three rows below, the first row shows images generated by the original BLIP-Diffusion, and the other rows show the outcomes of Alpha-CLIP with the highlighted regions marked in red.
A beautifully carved square fountain with an ornate statue standing in the center.
A silver candlestick with several burning candles.
A unique officer hat, black with gold trim, adds a sense of authority.
Tokyo city; trending on artstation.
The pipeline of our data generation method and model architecture. (a) Our method generates millions of RGBA region-text pairs from grounding and classification datasets. (b) Alpha-CLIP modifies the CLIP image encoder to take an additional alpha channel along with RGB. Using the generated data, we then train Alpha-CLIP with this additional alpha-channel input.
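A minimal sketch of the architectural change in (b): the ViT patch embedding gains a parallel convolution for the alpha channel whose output is added to the RGB patch embedding. Zero-initializing the alpha branch lets the modified encoder start out behaving exactly like the original CLIP; treat the precise initialization and layer placement as assumptions to be checked against the paper and released code.

```python
import torch
import torch.nn as nn

class AlphaPatchEmbed(nn.Module):
    def __init__(self, rgb_conv: nn.Conv2d):
        super().__init__()
        self.rgb_conv = rgb_conv                       # pretrained CLIP patch conv (3 -> dim)
        self.alpha_conv = nn.Conv2d(
            in_channels=1,
            out_channels=rgb_conv.out_channels,
            kernel_size=rgb_conv.kernel_size,
            stride=rgb_conv.stride,
            bias=False,
        )
        nn.init.zeros_(self.alpha_conv.weight)         # start as plain CLIP (alpha ignored)

    def forward(self, rgb: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        return self.rgb_conv(rgb) + self.alpha_conv(alpha)

# Example with ViT-B/16 shapes: 224x224 input, 16x16 patches, width 768.
rgb_conv = nn.Conv2d(3, 768, kernel_size=16, stride=16, bias=False)
embed = AlphaPatchEmbed(rgb_conv)
x = embed(torch.randn(1, 3, 224, 224), torch.ones(1, 1, 224, 224))
print(x.shape)   # torch.Size([1, 768, 14, 14]) -> flattened into 196 patch tokens
```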
We examine the attention map of the [CLS] token in the last transformer block of the vision encoder. The first row in each group of four is from the original CLIP, and the other three rows are from Alpha-CLIP with user-defined focus regions marked in red. This visualization verifies that Alpha-CLIP pays more attention to the specified region and, more importantly, does so without damaging the 2D spatial information preserved in the original CLIP features.
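For reference, a small sketch of how such a [CLS] attention map can be extracted, assuming the per-head attention weights of the last vision transformer block have already been captured (e.g., with a forward hook); shapes follow ViT-B/16.

```python
import torch
import torch.nn.functional as F

def cls_attention_map(attn: torch.Tensor, grid: int = 14, image_size: int = 224) -> torch.Tensor:
    """attn: (1, num_heads, N, N) attention weights of the last block, token 0 being [CLS]."""
    cls_to_patches = attn[0, :, 0, 1:].mean(dim=0)            # average heads, drop the [CLS] column
    amap = cls_to_patches.reshape(1, 1, grid, grid)           # back to the patch grid
    amap = F.interpolate(amap, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-6)   # normalize for display
    return amap[0, 0]                                         # (image_size, image_size)

# Demo with random weights in ViT-B/16 shapes (12 heads, 1 + 14*14 tokens).
demo = torch.softmax(torch.randn(1, 12, 197, 197), dim=-1)
print(cls_attention_map(demo).shape)                          # torch.Size([224, 224])
```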
@misc{sun2023alphaclip,
title={Alpha-CLIP: A CLIP Model Focusing on Wherever You Want},
author={Zeyi Sun and Ye Fang and Tong Wu and Pan Zhang and Yuhang Zang and Shu Kong and Yuanjun Xiong and Dahua Lin and Jiaqi Wang},
year={2023},
eprint={2312.03818},
archivePrefix={arXiv},
primaryClass={cs.CV}
}