Enhancing Novel Object Detection via Cooperative Foundational Models

1Mohamed Bin Zayed University of Artificial Intelligence
  2Linköping University   3Australian National University

Cooperative Foundational Models

Cooperative Foundational Models Architecture.

Our proposed cooperative mechanism integrates pre-trained foundational models such as CLIP, SAM, and GDINO with a Mask-RCNN model to identify and semantically label both known and novel objects. These foundational models interact through several components, including Initialization, Unknown Object Labelling, and Refinement, to refine and categorize objects. We establish state-of-the-art (SOTA) results in novel object detection on LVIS and on the open-vocabulary detection benchmark on COCO.

Abstract

In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP in novel object detection and 42.08 mAP for known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 AP50 for novel classes. Our code is available at https://github.com/rohit901/cooperative-foundational-models.

Key Features

1. We establish state-of-the-art (SOTA) results in novel object detection on LVIS and on the open-vocabulary detection benchmark on COCO.

2. We propose a simple, modular, and training-free approach which can detect (i.e. localize and classify) known as well as novel objects in the given input image.

3. Our approach easily transforms any existing closed-set detector into an open-set detector by leveraging the complementary strengths of foundational models like CLIP and SAM.

4. The modular nature of our approach allows us to easily swap out any specific component, and further combine it with existing SOTA open-set detectors to achieve additional performance improvements.
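
The modularity described in point 4 can be illustrated with a minimal sketch. This is not the authors' code: the `CooperativePipeline` class, the `Detection` record, the stage names `propose`, `label`, and `refine`, and all toy functions are hypothetical, chosen only to show how one stage could be swapped without touching the others. In a real system the stages would wrap models such as Mask-RCNN or GDINO, CLIP or SigLIP, and SAM.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    label: str
    score: float


@dataclass
class CooperativePipeline:
    """Hypothetical wrapper: three swappable stages, each a plain callable."""
    propose: Callable[[Any], List[Box]]             # stand-in for a detector
    label: Callable[[Any, Box], Tuple[str, float]]  # stand-in for a vision-language model
    refine: Callable[[Any, Box], Box]               # stand-in for a mask-based box refiner

    def __call__(self, image: Any) -> List[Detection]:
        detections = []
        for box in self.propose(image):
            name, score = self.label(image, box)
            detections.append(Detection(self.refine(image, box), name, score))
        return detections


# Toy stand-ins so the sketch runs without any model weights.
def toy_proposer(image: Any) -> List[Box]:
    return [(10.0, 10.0, 50.0, 50.0)]


def toy_labeller(image: Any, box: Box) -> Tuple[str, float]:
    return ("unknown-object", 0.5)


def identity_refiner(image: Any, box: Box) -> Box:
    return box


def shrink_refiner(image: Any, box: Box) -> Box:
    x1, y1, x2, y2 = box
    return (x1 + 2.0, y1 + 2.0, x2 - 2.0, y2 - 2.0)


pipeline = CooperativePipeline(toy_proposer, toy_labeller, identity_refiner)
baseline = pipeline(image=None)

# Swapping one stage leaves the other two untouched.
pipeline.refine = shrink_refiner
refined = pipeline(image=None)
```

Keeping each stage behind a plain callable signature is one possible design. It means a stronger proposer or a different vision-language model can be dropped in as long as it accepts and returns the same types.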

Approach Overview

In this work, we show how to convert an existing closed-set detector, i.e., a pre-trained Mask-RCNN, into an open-set detector by utilizing the complementary strengths of pre-trained foundational models such as CLIP and SAM via our cooperative mechanism. The mechanism combines CLIP's understanding of unseen classes with Mask-RCNN's ability to localize background objects in order to find novel object classes. The bounding boxes are then refined by exploiting SAM's instance mask-to-box properties. Further, when our cooperative mechanism is combined with open-set detectors like GDINO, we observe additional performance gains.
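
Two of the ideas above can be sketched in a few lines of plain Python. The first is labelling a region by comparing its embedding with text embeddings of class names using cosine similarity, which is how CLIP-style models perform zero-shot classification. The second is deriving a tight box from an instance mask. This is an illustrative sketch under stated assumptions, not the released implementation: the function names `zero_shot_label` and `mask_to_box` are hypothetical, the embeddings are toy vectors rather than real CLIP outputs, and the mask is a nested list rather than a SAM prediction.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_label(
    region_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the class name whose text embedding is most similar to the region."""
    scored = ((name, cosine(region_emb, emb)) for name, emb in text_embs.items())
    return max(scored, key=lambda pair: pair[1])


def mask_to_box(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tight box (x1, y1, x2, y2) around nonzero pixels; x2 and y2 are exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # empty mask: there is no box to refine
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


# Toy inputs (assumptions for illustration only).
toy_text_embs = {"cat": [0.9, 0.1], "dog": [0.0, 1.0]}
label, similarity = zero_shot_label([1.0, 0.0], toy_text_embs)

toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
box = mask_to_box(toy_mask)
```

Here the toy region embedding is closest to the "cat" text embedding, and the mask yields the box `(1, 1, 3, 3)`. A box derived from a mask in this way hugs the object's visible pixels, which is the intuition behind refining a coarse detector box with a segmentation mask.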

Novel Object Detection

Comparison of object detection performance using mAP on the lvis_val dataset.
Method | Mask-RCNN | GDINO | VLM | Novel AP | Known AP | All AP
K-Means | - | - | - | 0.20 | 17.77 | 1.55
Weng et al. | - | - | - | 0.27 | 17.85 | 1.62
ORCA | - | - | - | 0.49 | 20.57 | 2.03
UNO | - | - | - | 0.61 | 21.09 | 2.18
RNCDL | V1 | - | - | 5.42 | 25.00 | 6.92
GDINO | - | ✔ | - | 13.47 | 37.13 | 15.30
Ours | V2 | ✔ | SigLIP | 17.42 | 42.08 | 19.33

Open Vocabulary Detection

Results on the COCO OVD benchmark. *Our approach with GDINO, SigLIP, and Mask-RCNN trained on the COCO OVD split.
Method | Backbone | Use Extra Training Set | Novel AP50
OVR-CNN | RN50 | ✔ | 22.8
ViLD | ViT-B/32 | ✘ | 27.6
Detic | RN50 | ✔ | 27.8
OV-DETR | ViT-B/32 | ✘ | 29.4
BARON | RN50 | ✘ | 34.0
Rasheed et al. | RN50 | ✔ | 36.6
CORA | RN50x4 | ✘ | 41.7
BARON | RN50 | ✔ | 42.7
CORA+ | RN50x4 | ✔ | 43.1
Ours* | RN101 + SwinT | ✘ | 50.3

Qualitative Visualization

Top-5 bounding box predictions by RNCDL, GDINO, plain Mask-RCNN + CLIP, and Ours (GDINO + Mask-RCNN + SAM + SigLIP).

BibTeX

@misc{bharadwaj2023enhancing,
    title={Enhancing Novel Object Detection via Cooperative Foundational Models}, 
    author={Rohit Bharadwaj and Muzammal Naseer and Salman Khan and Fahad Shahbaz Khan},
    year={2023},
    eprint={2311.12068},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}