DitHub

A Modular Framework for Incremental Open-Vocabulary Object Detection

Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara

Abstract

Open-Vocabulary object detectors inherently generalize to an unrestricted set of categories, enabling recognition through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW13 benchmark and ODinWO, a newly introduced benchmark designed to assess class reappearance.

Model Overview

Figure 1: Overview of the DitHub architecture, inspired by Version Control Systems.

Core Features & Innovations

Problem Statement

Existing approaches rely on monolithic adaptation, condensing all new information into a single set of weights. When a class reappears in a different context, such a design cannot be updated selectively without losing previously acquired knowledge.

Modular Adaptation

DitHub introduces a modular approach: expert modules are treated as separate branches that can be retrieved and fused dynamically, enabling a more efficient and flexible incremental learning process.
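
To make this concrete, the sketch below shows a branch-style library of adaptation modules in PyTorch. The `ExpertModule` and `ModuleLibrary` names, the additive weight deltas, and the uniform averaging in `merge` are simplifying assumptions made for illustration; they are not DitHub's exact fetch-and-merge rule.

```python
# Illustrative sketch of a branch-style module library (assumptions: additive
# weight deltas per expert and uniform averaging as the fusion rule).
from dataclasses import dataclass, field

import torch


@dataclass
class ExpertModule:
    """An efficient adaptation module kept as a named 'branch'."""
    name: str                        # e.g. the class or task it specializes on
    deltas: dict[str, torch.Tensor]  # parameter name -> additive weight update


@dataclass
class ModuleLibrary:
    branches: dict[str, ExpertModule] = field(default_factory=dict)

    def commit(self, module: ExpertModule) -> None:
        """Store (or overwrite) an expert branch in the library."""
        self.branches[module.name] = module

    def fetch(self, names: list[str]) -> list[ExpertModule]:
        """Retrieve the branches relevant to the classes in the current prompt."""
        return [self.branches[n] for n in names if n in self.branches]

    def merge(self, names: list[str]) -> dict[str, torch.Tensor]:
        """Fuse the fetched branches into a single set of deltas (uniform average)."""
        fetched = self.fetch(names)
        merged: dict[str, torch.Tensor] = {}
        for module in fetched:
            for key, delta in module.deltas.items():
                merged[key] = merged.get(key, torch.zeros_like(delta)) + delta / len(fetched)
        return merged


def apply_deltas(model: torch.nn.Module, deltas: dict[str, torch.Tensor]) -> None:
    """Add the merged deltas on top of the frozen backbone weights."""
    state = model.state_dict()
    for key, delta in deltas.items():
        state[key] = state[key] + delta
    model.load_state_dict(state)
```

In this reading, inference would call `apply_deltas(model, library.merge(prompt_classes))` before detection, so the composed weights reflect only the classes requested in the current prompt.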

Specialized Modules

Our experiments demonstrate that class-specialized modules significantly improve performance, with an average gain of +1.23 mAP. The modular approach allows targeted updates without altering existing knowledge.
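
As a sketch of what such a targeted update could look like, the snippet below unfreezes only the parameters belonging to the branch of a reappearing class, leaving the backbone and every other branch untouched. The `experts.<class>.` parameter naming is hypothetical, introduced purely for illustration.

```python
import torch


def branch_parameters(model: torch.nn.Module, class_name: str) -> list[torch.nn.Parameter]:
    """Enable gradients only for the expert branch of `class_name` (hypothetical naming)."""
    selected = []
    for name, param in model.named_parameters():
        in_branch = f"experts.{class_name}." in name  # assumed naming scheme, for illustration
        param.requires_grad_(in_branch)
        if in_branch:
            selected.append(param)
    return selected


# Usage sketch: only the reappearing class's module receives gradient updates.
# optimizer = torch.optim.AdamW(branch_parameters(model, "raccoon"), lr=1e-4)
```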

Training-free Unlearning

DitHub enables selective knowledge removal without requiring retraining. By subtracting a specific module, we can efficiently erase the detection capability of a class.
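
A minimal sketch of the idea, assuming (as in the library sketch above) that each class module contributes an additive weight delta: forgetting a class then reduces to subtracting its delta from the adapted weights. The function name and the plain subtraction rule are illustrative, not DitHub's exact procedure.

```python
import torch


def forget_class(model: torch.nn.Module, class_deltas: dict[str, torch.Tensor]) -> None:
    """Erase one class's contribution by subtracting its module's deltas (no retraining)."""
    state = model.state_dict()
    for key, delta in class_deltas.items():
        state[key] = state[key] - delta
    model.load_state_dict(state)


# Usage sketch: remove the branch from the library and negate its effect on the weights.
# forget_class(model, library.branches.pop("raccoon").deltas)
```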

Figure 3: DitHub's modularity enables more precise incremental adaptations.

Experimental Results

Performance on ODinW13

We evaluate DitHub on the ODinW13 benchmark, which consists of 13 sub-datasets spanning from the traditional Pascal VOC to more challenging domains with significant distribution shifts. Our method outperforms its main competitor, ZiRa, by a substantial 4.21 mAP points, achieving the best overall results. Notably, on the zero-shot MS COCO evaluation (ZCOCO), our approach surpasses ZiRa by 0.75 mAP, setting a new state of the art in both incremental adaptation and zero-shot retention.

Table 1: Comparison of mAP values across the 13 tasks of ODinW13. Avg is the average over the 13 tasks, while zero-shot performance on MS COCO is reported in the ZCOCO column.
Shots  Method  ZCOCO  Avg    Ae     Aq     Co     Eg     Mu     Pa     Pv     Pi     Po     Ra     Sh     Th     Ve
0      G-Dino  47.41  46.80  19.11  20.82  64.75  59.98  25.34  56.27  54.80  65.94  22.13  62.02  32.85  70.38  57.07
Full   ZiRa    46.26  57.98  31.76  47.35  71.77  64.74  46.53  62.66  66.39  71.00  48.48  63.03  41.44  76.13  62.44
Full   DitHub  47.01  62.19  34.62  50.65  70.46  68.56  49.28  65.57  69.58  71.10  56.65  70.88  52.82  79.30  68.18

Performance on ODinWO

We introduce ODinWO (Overlapped), a variant of ODinW13 specifically designed to evaluate performance on classes that reoccur across different domains. On this benchmark, DitHub outperforms ZiRa by a substantial 4.75 mAP on average and by 2.08 mAP on ZCOCO. We attribute this success to our class-oriented modular design, which enables selective updates to recurring concepts without overwriting knowledge associated with other classes.

Table 2: mAP values on ODinWO. Avg is the average over the six tasks, while zero-shot performance on MS COCO is reported in the ZCOCO column.
Method  ZCOCO  Avg    Ae     Hw     Pv     Sd     Th     Ve
G-Dino  47.41  53.15  45.12  67.54  58.11  25.84  70.40  51.87
ZiRa    44.43  57.63  39.92  68.00  64.90  46.26  77.26  49.47
DitHub  46.51  62.38  53.35  71.07  71.01  41.75  80.21  56.90