PASTA: Is Multiple Object Tracking a Matter of Specialization?

Deep Modules Compositionality Meets Multiple Object Tracking

Gianluca Mancusi, Mattia Bernardi, Aniello Panariello, Angelo Porrello, Simone Calderara, Rita Cucchiara

Abstract

End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant challenges, including negative interference (where the model learns conflicting scene-specific parameters) and limited domain generalization, which often necessitates expensive fine-tuning to adapt the models to new domains. In response to these challenges, we introduce PArameter-efficient Scenario-specific Tracking Architecture (PASTA), a novel framework that combines Parameter-Efficient Fine-Tuning (PEFT) and Modular Deep Learning (MDL). Specifically, we define key scenario attributes (e.g., camera viewpoint, lighting condition) and train specialized PEFT modules for each attribute. These expert modules are then combined in parameter space, enabling systematic generalization to new domains without increasing inference time. Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22, demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart.

Model Overview

Figure 1: Overview of the PASTA architecture

Key Features

Problem Statement: Domain shifts

The limited availability of annotated data often leads end-to-end trackers to overfit to their training sets, which makes them vulnerable to domain shifts. Generalization suffers most when negative interference arises between scenarios with differing attributes.

Figure 2: Example of attributes among MOTSynth, MOT17, PersonPath22

Solution: Attribute-specific PEFT modules

We train parameter-efficient modules for each attribute, creating a specialized expert system. During inference, an operator selects the expert modules for each scenario, enabling better adaptation to specific tracking conditions.

Figure 3: Overview of our modular framework
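
The select-and-merge step described above can be illustrated with a minimal sketch. It assumes each expert module is a parameter delta over the shared base weights and that the selected deltas are merged by a uniform average; the `MODULES` table, the attribute values, and the `compose` helper are hypothetical names chosen for illustration, not the released implementation. Because the merge happens in parameter space, the resulting model has the same size and inference cost as the base one.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical library of expert modules: attribute -> value -> parameter delta.
# Real PEFT modules (e.g. low-rank adapters) would hold tensors per layer;
# flat lists of floats keep this sketch dependency-free.
MODULES: Dict[str, Dict[str, Vector]] = {
    "viewpoint": {"low": [0.10, 0.00, -0.20], "high": [-0.10, 0.30, 0.00]},
    "lighting": {"day": [0.00, 0.20, 0.10], "night": [0.40, -0.20, 0.00]},
}


def compose(base: Vector, scenario: Dict[str, str]) -> Vector:
    """Merge the experts selected for a scenario into the base weights."""
    selected = [MODULES[attr][value] for attr, value in scenario.items()]
    merged = list(base)  # copy, so the base weights stay untouched
    for delta in selected:
        for i, d in enumerate(delta):
            # Assumption: uniform average of the selected deltas.
            merged[i] += d / len(selected)
    return merged


base_weights = [1.0, 1.0, 1.0]
# The operator describes the scene with one value per attribute.
weights = compose(base_weights, {"viewpoint": "high", "lighting": "night"})
print([round(w, 2) for w in weights])
```

Here the operator's description of the scene (`high` viewpoint, `night` lighting) picks one module per attribute, and the tracker would then run with `weights` in place of the base parameters.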

Experimental Results

Our experiments on MOTSynth show that reducing negative interference enhances association metrics. Zero-shot evaluations (Tab. 1) on real-world datasets (MOT17, PersonPath22) illustrate the improved generalization achieved by composing expert modules.

Zero-shot Results

Table 1: Our zero-shot results on MOT17 and PersonPath22 datasets

We show that, within an in-domain scenario, composing only the modules selected through expert knowledge yields superior results.
Conversely, under domain shift, leveraging all modules while assigning lower weights to the unselected ones helps the model retain valuable knowledge rather than discarding it (Tab. 2).

Ablation Results

Table 2: Our ablation results on MOTSynth (in-domain) and MOT17 datasets
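
The two composition strategies compared in the ablation can be sketched as a single weighting rule. This is an illustration under stated assumptions: selected modules receive a raw weight of 1.0, unselected ones receive a smaller `unselected_scale`, and the weights are normalized to sum to one. The function names, the scale value of 0.25, and the example deltas are hypothetical and are not taken from the experiments.

```python
from typing import Dict, List, Set


def composition_weights(
    module_names: List[str], selected: Set[str], unselected_scale: float
) -> Dict[str, float]:
    """Return normalized mixing weights for every module.

    Selected modules get raw weight 1.0 and the rest get `unselected_scale`;
    a scale of 0.0 reduces to "selected modules only".
    """
    raw = {n: 1.0 if n in selected else unselected_scale for n in module_names}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("at least one module must have a non-zero weight")
    return {n: w / total for n, w in raw.items()}


def merge(
    base: List[float], deltas: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Add the weighted sum of module deltas to the base parameters."""
    merged = list(base)
    for name, delta in deltas.items():
        for i, d in enumerate(delta):
            merged[i] += weights[name] * d
    return merged


# Hypothetical modules for one attribute (values are illustrative only).
deltas = {"day": [0.2, 0.0], "night": [-0.4, 0.4], "indoor": [0.0, -0.2]}
names = list(deltas)

# In-domain: keep only the module chosen through expert knowledge.
in_domain = composition_weights(names, {"night"}, unselected_scale=0.0)
# Domain shift: keep every module, down-weighting the unselected ones.
shifted = composition_weights(names, {"night"}, unselected_scale=0.25)

print(in_domain)
print(shifted)
print(merge([1.0, 1.0], deltas, shifted))
```

Setting `unselected_scale` to zero recovers the in-domain strategy, while a small positive value keeps a reduced contribution from every expert, which corresponds to the domain-shift strategy described above.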

Acknowledgements

The research was supported by the Italian Ministry for University and Research through the PNRR project ECOSISTER ECS 00000033 CUP E93C22001100001 and by the EU Horizon project "ELIAS - European Lighthouse of AI for Sustainability" (No. 101120237).