Open-Vocabulary object detectors inherently generalize to an unrestricted set of categories, enabling recognition through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance.
Vision-language detectors can incrementally learn new categories without losing zero-shot abilities, but existing methods rely on monolithic adaptation, merging all knowledge into one model. This makes updating individual concepts difficult and can lead to knowledge interference and performance loss.
DitHub introduces a modular approach by maintaining a library of specialized detection modules. Instead of integrating new knowledge into a single model, this approach enables efficient retrieval, fusion, and adaptation, ensuring flexible and scalable incremental learning.
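The fusion step above can be pictured as parameter averaging over a set of fetched modules. The sketch below is illustrative only: it assumes each module is a dict of parameter arrays and uses a plain uniform average, which is a generic merging rule rather than DitHub's exact scheme.

```python
import numpy as np

def merge_modules(modules):
    """Fuse a list of adaptation modules (dicts of parameter arrays)
    by uniform parameter averaging. Illustrative sketch: DitHub's
    actual merge rule may weight or select modules differently."""
    keys = modules[0].keys()
    return {k: np.mean([m[k] for m in modules], axis=0) for k in keys}
```

Because each module stays a separate entry in the library, merged weights can be rebuilt on demand instead of being baked into a single monolithic checkpoint.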
Our experiments show that specialized modules improve performance significantly, with a state-of-the-art gain of +4.21 mAP in the Incremental Vision-Language Object Detection paradigm.
DitHub allows selective knowledge removal without retraining, enabling the efficient elimination of specific classes by removing associated modules.
DitHub creates modular class-specific detectors, handling both rare and common objects while offering a memory-efficient structure for scalable long-term use. By using low-rank adaptation (LoRA), we fine-tune specialized modules for each task, ensuring minimal interference. This modular design supports on-the-fly adaptation to new classes, making DitHub a versatile solution for evolving detection needs.
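As a refresher on the LoRA mechanism mentioned above: the pretrained weight `W` stays frozen, and only a low-rank update `BA` is trained per module. The dimensions and initialization below follow the standard LoRA recipe (with `B` zero-initialized so adaptation starts from the pretrained behavior); they are a minimal sketch, not DitHub's actual layer code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2                    # rank r << min(d_out, d_in)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero init: no shift before training

def forward(x):
    # LoRA forward: y = Wx + B(Ax); only A and B are updated per module
    return W @ x + B @ (A @ x)
```

Storing only `A` and `B` per class keeps each module small, which is what makes maintaining a whole library of them memory-efficient.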
Expand the library of specialized modules dynamically as new tasks arise, adapting to both common and rare object classes.
Employ a subset of shared learnable parameters across different adaptation modules to enhance efficiency without compromising performance.
Enable selective activation of specialized modules, allowing direct deployment for fine-grained adaptation to individual classes.
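The three capabilities above, together with the selective-removal property, amount to a small key-value library of per-class modules. The class and method names below are purely illustrative, not DitHub's actual API.

```python
class ModuleLibrary:
    """Minimal sketch of a per-class adaptation-module library
    (hypothetical interface, for illustration only)."""

    def __init__(self):
        self._modules = {}  # class name -> adaptation module (parameter dict)

    def add(self, cls, module):
        # expand the library dynamically as new tasks/classes arrive
        self._modules[cls] = module

    def fetch(self, classes):
        # selectively activate only the modules for the requested classes
        return [self._modules[c] for c in classes if c in self._modules]

    def remove(self, cls):
        # selective knowledge removal: drop a class without any retraining
        self._modules.pop(cls, None)
```

Deleting a class then reduces to removing its entry, leaving every other module untouched.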
We evaluate DitHub on the ODinW-13 benchmark, which consists of 13 sub-datasets spanning from traditional Pascal VOC to more challenging domains with significant distribution shifts. Our method outperforms the main competitor, ZiRa, by a substantial 4.21 mAP points on Avg. Notably, on the zero-shot MS COCO evaluation (ZCOCO), our approach outperforms ZiRa by 0.75 mAP, setting a new state-of-the-art in both incremental and zero-shot retention capabilities.
| Shots | Method | ZCOCO | Avg | Ae | Aq | Co | Eg | Mu | Pa | Pv | Pi | Po | Ra | Sh | Th | Ve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G-Dino | 47.41 | 46.80 | 19.11 | 20.82 | 64.75 | 59.98 | 25.34 | 56.27 | 54.80 | 65.94 | 22.13 | 62.02 | 32.85 | 70.38 | 57.07 |
| Full | ZiRa | 46.26 | 57.98 | 31.76 | 47.35 | 71.77 | 64.74 | 46.53 | 62.66 | 66.39 | 71.00 | 48.48 | 63.03 | 41.44 | 76.13 | 62.44 |
| Full | DitHub | 47.01 | 62.19 | 34.62 | 50.65 | 70.46 | 68.56 | 49.28 | 65.57 | 69.58 | 71.10 | 56.65 | 70.88 | 52.82 | 79.30 | 68.18 |
We introduce ODinW-O (Overlapped), a variant of ODinW-35 specifically designed to evaluate performance on classes that reoccur across different domains. In this benchmark, DitHub secures a substantial improvement of 4.75 mAP on Avg and a 2.08 mAP gain on ZCOCO over ZiRa. We attribute this success to our class-oriented modular design, which enables selective updates to recurring concepts without overwriting knowledge associated with other classes.
| Shots | Method | ZCOCO | Avg | Ae | Hw | Pv | Sd | Th | Ve |
|---|---|---|---|---|---|---|---|---|---|
| 0 | G-Dino | 47.41 | 53.15 | 45.12 | 67.54 | 58.11 | 25.84 | 70.40 | 51.87 |
| Full | ZiRa | 44.43 | 57.63 | 39.92 | 68.00 | 64.90 | 46.26 | 77.26 | 49.47 |
| Full | DitHub | 46.51 | 62.38 | 53.35 | 71.07 | 71.01 | 41.75 | 80.21 | 56.90 |
If you find DitHub useful for your research, please consider citing our paper:
```bibtex
@article{cappellino2025dithub,
  title={DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection},
  author={Cappellino, Chiara and Mancusi, Gianluca and Mosconi, Matteo and Porrello, Angelo and Calderara, Simone and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2503.09271},
  year={2025}
}
```