Open-Vocabulary object detectors inherently generalize to an unrestricted set of categories, enabling recognition through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance.
Vision-language detectors can incrementally learn new categories without losing zero-shot abilities, but existing methods rely on monolithic adaptation, merging all knowledge into one model. This makes updating individual concepts difficult and can lead to knowledge interference and performance loss.
DitHub introduces a modular approach by maintaining a library of specialized detection modules. Instead of integrating new knowledge into a single model, this approach enables efficient retrieval, fusion, and adaptation, ensuring flexible and scalable incremental learning.
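The fusion step above can be pictured as parameter averaging over a set of fetched modules. The sketch below is illustrative only: it assumes each module is a dict of parameter arrays and uses a plain uniform average, which is a generic merging rule rather than DitHub's exact scheme.

```python
import numpy as np

def merge_modules(modules):
    """Fuse a list of adaptation modules (dicts of parameter arrays)
    by uniform parameter averaging. Illustrative sketch: DitHub's
    actual merge rule may weight or select modules differently."""
    keys = modules[0].keys()
    return {k: np.mean([m[k] for m in modules], axis=0) for k in keys}
```

Because each module stays a separate entry in the library, merged weights can be rebuilt on demand instead of being baked into a single monolithic checkpoint.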
Our experiments show that specialized modules improve performance significantly, with a state-of-the-art gain of +4.21 mAP in the Incremental Vision-Language Object Detection paradigm.
DitHub allows selective knowledge removal without retraining, enabling the efficient elimination of specific classes by removing associated modules.
DitHub creates modular class-specific detectors, handling both rare and common objects while offering a memory-efficient structure for scalable long-term use. By using low-rank adaptation (LoRA), we fine-tune specialized modules for each task, ensuring minimal interference. This modular design supports on-the-fly adaptation to new classes, making DitHub a versatile solution for evolving detection needs.
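As a refresher on the LoRA mechanism mentioned above: the pretrained weight `W` stays frozen, and only a low-rank update `BA` is trained per module. The dimensions and initialization below follow the standard LoRA recipe (with `B` zero-initialized so adaptation starts from the pretrained behavior); they are a minimal sketch, not DitHub's actual layer code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2                    # rank r << min(d_out, d_in)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero init: no shift before training

def forward(x):
    # LoRA forward: y = Wx + B(Ax); only A and B are updated per module
    return W @ x + B @ (A @ x)
```

Storing only `A` and `B` per class keeps each module small, which is what makes maintaining a whole library of them memory-efficient.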
Expand the library of specialized modules dynamically as new tasks arise, adapting to both common and rare object classes.
Employ a subset of shared learnable parameters across different adaptation modules to enhance efficiency without compromising performance.
Enable selective activation of specialized modules, allowing direct deployment for fine-grained adaptation to individual classes.
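The three capabilities above, together with the selective-removal property, amount to a small key-value library of per-class modules. The class and method names below are purely illustrative, not DitHub's actual API.

```python
class ModuleLibrary:
    """Minimal sketch of a per-class adaptation-module library
    (hypothetical interface, for illustration only)."""

    def __init__(self):
        self._modules = {}  # class name -> adaptation module (parameter dict)

    def add(self, cls, module):
        # expand the library dynamically as new tasks/classes arrive
        self._modules[cls] = module

    def fetch(self, classes):
        # selectively activate only the modules for the requested classes
        return [self._modules[c] for c in classes if c in self._modules]

    def remove(self, cls):
        # selective knowledge removal: drop a class without any retraining
        self._modules.pop(cls, None)
```

Deleting a class then reduces to removing its entry, leaving every other module untouched.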
We evaluate DitHub on the ODinW-13 benchmark, which consists of 13 sub-datasets spanning from traditional Pascal VOC to more challenging domains with significant distribution shifts. Our method outperforms the main competitor, ZiRa, by a substantial 4.21 mAP points on Avg. Notably, on the zero-shot MS COCO evaluation (ZCOCO), our approach outperforms ZiRa by 0.75 mAP, setting a new state-of-the-art in both incremental and zero-shot retention capabilities.
| Shots | Method | ZCOCO | Avg | Ae | Aq | Co | Eg | Mu | Pa | Pv | Pi | Po | Ra | Sh | Th | Ve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G-Dino | 47.41 | 46.80 | 19.11 | 20.82 | 64.75 | 59.98 | 25.34 | 56.27 | 54.80 | 65.94 | 22.13 | 62.02 | 32.85 | 70.38 | 57.07 |
| Full | ZiRa | 46.26 | 57.98 | 31.76 | 47.35 | 71.77 | 64.74 | 46.53 | 62.66 | 66.39 | 71.00 | 48.48 | 63.03 | 41.44 | 76.13 | 62.44 |
| Full | DitHub | 47.01 | 62.19 | 34.62 | 50.65 | 70.46 | 68.56 | 49.28 | 65.57 | 69.58 | 71.10 | 56.65 | 70.88 | 52.82 | 79.30 | 68.18 |
We introduce ODinW-O (Overlapped), a variant of ODinW-35 specifically designed to evaluate performance on classes that reoccur across different domains. In this benchmark, DitHub secures a substantial improvement of 4.75 mAP on Avg and a 2.08 mAP gain on ZCOCO over ZiRa. We attribute this success to our class-oriented modular design, which enables selective updates to recurring concepts without overwriting knowledge associated with other classes.
| Shots | Method | ZCOCO | Avg | Ae | Hw | Pv | Sd | Th | Ve |
|---|---|---|---|---|---|---|---|---|---|
| 0 | G-Dino | 47.41 | 53.15 | 45.12 | 67.54 | 58.11 | 25.84 | 70.40 | 51.87 |
| Full | ZiRa | 44.43 | 57.63 | 39.92 | 68.00 | 64.90 | 46.26 | 77.26 | 49.47 |
| Full | DitHub | 46.51 | 62.38 | 53.35 | 71.07 | 71.01 | 41.75 | 80.21 | 56.90 |
If you find DitHub useful for your research, please consider citing our paper:
```bibtex
@article{cappellino2025dithub,
  title={DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection},
  author={Cappellino, Chiara and Mancusi, Gianluca and Mosconi, Matteo and Porrello, Angelo and Calderara, Simone and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2503.09271},
  year={2025}
}
```