Open-vocabulary object detectors inherently generalize to an unrestricted set of categories, enabling recognition through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in object detection. Our method achieves state-of-the-art performance on the ODinW13 benchmark and on ODinWO, a newly introduced benchmark designed to assess class reappearance.
Figure 1: Overview of the DitHub architecture, inspired by Version Control Systems.
Existing approaches use monolithic adaptation, condensing all new information into a single set of weights. This becomes problematic when classes reappear in different contexts: the model cannot be updated selectively without risking knowledge loss.
DitHub instead introduces a modular approach: expert modules are treated as separate branches that can be retrieved and fused on demand, enabling efficient and flexible incremental learning (see the sketch below).
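To make the branch analogy concrete, here is a minimal Python sketch of such a module library, assuming each expert is stored as a LoRA-style weight delta keyed by class name. The names (`ModuleLibrary`, `commit`, `fetch`, `merge`) and the uniform-averaging merge are illustrative assumptions, not DitHub's actual API.

```python
from typing import Dict, List
import torch

class ModuleLibrary:
    """A branch-like store of per-class expert modules (weight deltas)."""

    def __init__(self) -> None:
        self.branches: Dict[str, torch.Tensor] = {}

    def commit(self, class_name: str, delta: torch.Tensor) -> None:
        # Store (or update) the expert module for one class,
        # without touching any other branch.
        self.branches[class_name] = delta

    def fetch(self, class_names: List[str]) -> List[torch.Tensor]:
        # Retrieve the modules matching the prompted classes.
        return [self.branches[c] for c in class_names if c in self.branches]

    def merge(self, class_names: List[str]) -> torch.Tensor:
        # Fuse the fetched branches into a single delta; uniform
        # averaging is one simple merging choice among many.
        deltas = self.fetch(class_names)
        return torch.stack(deltas).mean(dim=0)


library = ModuleLibrary()
library.commit("raccoon", torch.randn(16, 16) * 0.01)
library.commit("shellfish", torch.randn(16, 16) * 0.01)

base_weight = torch.randn(16, 16)
# At inference, apply the merged delta on top of the frozen base weight.
adapted_weight = base_weight + library.merge(["raccoon", "shellfish"])
```

The key property is that branches stay independent until they are fused at inference time, so updating one class never rewrites the weights of another.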
Our experiments demonstrate that class-specialized modules significantly improve performance, with an average gain of +1.23 mAP. The modular approach allows targeted updates without altering existing knowledge.
DitHub also enables selective knowledge removal without retraining: subtracting a specific class module from the adapted weights erases the model's ability to detect that class.
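A minimal sketch of this subtraction, assuming experts are additive weight deltas as above; the function name `forget_class` and the scaling factor `alpha` are hypothetical, not part of DitHub's published interface.

```python
import torch

def forget_class(adapted_weight: torch.Tensor,
                 class_delta: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    # Subtracting the class's expert delta undoes its contribution,
    # leaving all other modules (and the base model) untouched.
    return adapted_weight - alpha * class_delta
```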
Figure 3: DitHub's modularity enables more precise incremental adaptations.
We evaluate DitHub on the ODinW13 benchmark, which comprises 13 sub-datasets ranging from the traditional Pascal VOC to more challenging domains with significant distribution shifts. Our method achieves the best overall results, outperforming the main competitor, ZiRa, by 4.21 mAP on average. Notably, on the zero-shot MS COCO evaluation (ZCOCO), our approach surpasses ZiRa by 0.75 mAP, setting a new state of the art in both incremental learning and zero-shot retention.
Shots | Method | ZCOCO | Avg | Ae | Aq | Co | Eg | Mu | Pa | Pv | Pi | Po | Ra | Sh | Th | Ve |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | G-Dino | 47.41 | 46.80 | 19.11 | 20.82 | 64.75 | 59.98 | 25.34 | 56.27 | 54.80 | 65.94 | 22.13 | 62.02 | 32.85 | 70.38 | 57.07 |
Full | ZiRa | 46.26 | 57.98 | 31.76 | 47.35 | 71.77 | 64.74 | 46.53 | 62.66 | 66.39 | 71.00 | 48.48 | 63.03 | 41.44 | 76.13 | 62.44 |
Full | DitHub | 47.01 | 62.19 | 34.62 | 50.65 | 70.46 | 68.56 | 49.28 | 65.57 | 69.58 | 71.10 | 56.65 | 70.88 | 52.82 | 79.30 | 68.18 |
We introduce ODinWO (Overlapped), a variant of ODinW13 specifically designed to evaluate performance on classes that recur across different domains. On this benchmark, DitHub secures a substantial improvement of 4.75 mAP on average and a 2.08 mAP gain on ZCOCO over ZiRa. We attribute this success to our class-oriented modular design, which enables selective updates to recurring concepts without overwriting knowledge associated with other classes.
Method | ZCOCO | Avg | Ae | Hw | Pv | Sd | Th | Ve |
---|---|---|---|---|---|---|---|---|
G-Dino | 47.41 | 53.15 | 45.12 | 67.54 | 58.11 | 25.84 | 70.40 | 51.87 |
ZiRa | 44.43 | 57.63 | 39.92 | 68.00 | 64.90 | 46.26 | 77.26 | 49.47 |
DitHub | 46.51 | 62.38 | 53.35 | 71.07 | 71.01 | 41.75 | 80.21 | 56.90 |