Our proposed cooperative mechanism integrates pre-trained foundational models such as CLIP, SAM, and GDINO with a Mask-RCNN model to identify and semantically label both known and novel objects. These foundational models interact through three components, namely Initialization, Unknown Object Labelling, and Refinement, to refine and categorize objects. We establish state-of-the-art (SOTA) results in novel object detection on LVIS and on the open-vocabulary detection benchmark on COCO.
In this work, we address the challenging and emerging problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP on novel objects and 42.08 mAP on known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 AP50 on novel classes. Our code is available at https://github.com/rohit901/cooperative-foundational-models.
1. We establish state-of-the-art (SOTA) results in novel object detection on LVIS and on the open-vocabulary detection benchmark on COCO.
2. We propose a simple, modular, and training-free approach that can detect (i.e., localize and classify) known as well as novel objects in a given input image.
3. Our approach easily transforms any existing closed-set detector into an open-set detector by leveraging the complementary strengths of foundational models like CLIP and SAM.
4. The modular nature of our approach allows us to easily swap out any specific component and to combine it with existing SOTA open-set detectors for additional performance improvements.
In this work, we show how to convert an existing closed-set detector, i.e., a pre-trained Mask-RCNN, into an open-set detector by utilizing the complementary strengths of pre-trained foundational models such as CLIP and SAM via our cooperative mechanism. The proposed mechanism combines CLIP's understanding of unseen classes with Mask-RCNN's ability to localize background objects in order to discover novel object classes. The resulting bounding boxes are then refined by exploiting SAM's instance mask-to-box property. Further, when our cooperative mechanism is combined with open-set detectors like GDINO, we observe additional performance gains.
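The two key steps above, CLIP-based labelling of background proposals and mask-to-box refinement, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: real CLIP image/text features and SAM-predicted masks are replaced by stand-in arrays, and the function names are hypothetical.

```python
import numpy as np

def clip_label_unknowns(region_embs, text_embs, class_names, tau=0.07):
    """CLIP-style zero-shot labelling: match each region's visual embedding
    to class-name text embeddings via cosine similarity, then turn the
    similarities into confidences with a temperature softmax.
    (Stand-in arrays here, not real CLIP features.)"""
    v = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = v @ t.T                               # (num_regions, num_classes)
    probs = np.exp(sims / tau)
    probs /= probs.sum(axis=1, keepdims=True)    # temperature softmax
    idx = sims.argmax(axis=1)
    return [class_names[i] for i in idx], probs[np.arange(len(idx)), idx]

def mask_to_box(mask):
    """Mask-to-box refinement: replace a rough proposal box with the tight
    box of an instance mask (in our setting, the mask would come from
    prompting SAM with the rough box)."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

For example, a boolean mask covering rows 2-4 and columns 3-7 yields `mask_to_box(mask) == (3, 2, 8, 5)` in (x1, y1, x2, y2) convention with exclusive upper bounds.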
Method | Mask-RCNN | GDINO | VLM | Novel AP | Known AP | All AP |
---|---|---|---|---|---|---|
K-Means | - | - | - | 0.20 | 17.77 | 1.55 |
Weng et al. | - | - | - | 0.27 | 17.85 | 1.62 |
ORCA | - | - | - | 0.49 | 20.57 | 2.03 |
UNO | - | - | - | 0.61 | 21.09 | 2.18 |
RNCDL | V1 | - | - | 5.42 | 25.00 | 6.92 |
GDINO | - | ✓ | - | 13.47 | 37.13 | 15.30 |
Ours | V2 | ✓ | SigLIP | 17.42 | 42.08 | 19.33 |
Method | Backbone | Use Extra Training Set | Novel AP50 |
---|---|---|---|
OVR-CNN | RN50 | ✓ | 22.8 |
ViLD | ViT-B/32 | ✓ | 27.6 |
Detic | RN50 | ✓ | 27.8 |
OV-DETR | ViT-B/32 | ✓ | 29.4 |
BARON | RN50 | ✓ | 34.0 |
Rasheed et al. | RN50 | ✓ | 36.6 |
CORA | RN50x4 | ✓ | 41.7 |
BARON | RN50 | ✓ | 42.7 |
CORA+ | RN50x4 | ✓ | 43.1 |
Ours* | RN101 + SwinT | ✓ | 50.3 |
Figure: Top-5 bounding box predictions by RNCDL, GDINO, plain Mask-RCNN + CLIP, and Ours (GDINO + Mask-RCNN + SAM + SigLIP).
@misc{bharadwaj2023enhancing,
title={Enhancing Novel Object Detection via Cooperative Foundational Models},
author={Rohit Bharadwaj and Muzammal Naseer and Salman Khan and Fahad Shahbaz Khan},
year={2023},
eprint={2311.12068},
archivePrefix={arXiv},
primaryClass={cs.CV}
}