Our proposed cooperative mechanism integrates pre-trained foundational models such as CLIP, SAM, and GDINO with a Mask-RCNN model to identify and semantically label both known and novel objects. These foundational models interact through three components, namely Initialization, Unknown Object Labelling, and Refinement, to refine and categorize objects. We establish state-of-the-art (SOTA) results in novel object detection on LVIS and on the open-vocabulary detection benchmark on COCO.
In this work, we address the challenging and emerging problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP on novel objects and 42.08 mAP on known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 AP50 on novel classes. Our code is available at https://github.com/rohit901/cooperative-foundational-models.
1. We establish state-of-the-art (SOTA) results in novel object detection on LVIS and on the open-vocabulary detection benchmark on COCO.
2. We propose a simple, modular, and training-free approach that can detect (i.e., localize and classify) known as well as novel objects in a given input image.
3. Our approach easily transforms any existing closed-set detector into an open-set detector by leveraging the complementary strengths of foundational models like CLIP and SAM.
4. The modular nature of our approach allows us to easily swap out any specific component and to combine it with existing SOTA open-set detectors for additional performance improvements.
In this work, we show how to convert an existing closed-set detector, i.e., a pre-trained Mask-RCNN, into an open-set detector by utilizing the complementary strengths of pre-trained foundational models such as CLIP and SAM via our cooperative mechanism. The proposed mechanism combines CLIP's understanding of unseen classes with Mask-RCNN's ability to localize background objects in order to discover novel object classes. The resulting bounding boxes are then refined by exploiting SAM's instance mask-to-box property. Further, when our cooperative mechanism is combined with open-set detectors like GDINO, we observe additional performance gains.
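The two key steps above, CLIP-based labelling of background proposals and mask-to-box refinement, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: real CLIP image/text features and SAM-predicted masks are replaced by stand-in arrays, and the function names are hypothetical.

```python
import numpy as np

def clip_label_unknowns(region_embs, text_embs, class_names, tau=0.07):
    """CLIP-style zero-shot labelling: match each region's visual embedding
    to class-name text embeddings via cosine similarity, then turn the
    similarities into confidences with a temperature softmax.
    (Stand-in arrays here, not real CLIP features.)"""
    v = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = v @ t.T                               # (num_regions, num_classes)
    probs = np.exp(sims / tau)
    probs /= probs.sum(axis=1, keepdims=True)    # temperature softmax
    idx = sims.argmax(axis=1)
    return [class_names[i] for i in idx], probs[np.arange(len(idx)), idx]

def mask_to_box(mask):
    """Mask-to-box refinement: replace a rough proposal box with the tight
    box of an instance mask (in our setting, the mask would come from
    prompting SAM with the rough box)."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

For example, a boolean mask covering rows 2-4 and columns 3-7 yields `mask_to_box(mask) == (3, 2, 8, 5)` in (x1, y1, x2, y2) convention with exclusive upper bounds.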
Method | Mask-RCNN | GDINO | VLM | Novel AP | Known AP | All AP |
---|---|---|---|---|---|---|
K-Means | - | - | - | 0.20 | 17.77 | 1.55 |
Weng et al. | - | - | - | 0.27 | 17.85 | 1.62 |
ORCA | - | - | - | 0.49 | 20.57 | 2.03 |
UNO | - | - | - | 0.61 | 21.09 | 2.18 |
RNCDL | V1 | - | - | 5.42 | 25.00 | 6.92 |
GDINO | - | ✓ | - | 13.47 | 37.13 | 15.30 |
Ours | V2 | ✓ | SigLIP | 17.42 | 42.08 | 19.33 |
Method | Backbone | Use Extra Training Set | Novel AP50 |
---|---|---|---|
OVR-CNN | RN50 | ✓ | 22.8 |
ViLD | ViT-B/32 | ✓ | 27.6 |
Detic | RN50 | ✓ | 27.8 |
OV-DETR | ViT-B/32 | ✓ | 29.4 |
BARON | RN50 | ✓ | 34.0 |
Rasheed et al. | RN50 | ✓ | 36.6 |
CORA | RN50x4 | ✓ | 41.7 |
BARON | RN50 | ✓ | 42.7 |
CORA+ | RN50x4 | ✓ | 43.1 |
Ours* | RN101 + SwinT | ✓ | 50.3 |
Figure: Top-5 bounding box predictions by RNCDL, GDINO, plain Mask-RCNN + CLIP, and Ours (GDINO + Mask-RCNN + SAM + SigLIP).
@misc{bharadwaj2023enhancing,
title={Enhancing Novel Object Detection via Cooperative Foundational Models},
author={Rohit Bharadwaj and Muzammal Naseer and Salman Khan and Fahad Shahbaz Khan},
year={2023},
eprint={2311.12068},
archivePrefix={arXiv},
primaryClass={cs.CV}
}