Datasets#

Mammoth datasets define a complete Continual Learning benchmark. This means that each dataset defines all the necessary information to run a continual learning experiment, including:

Required properties

  • Name of the dataset: NAME attribute (str). This will be used to select the dataset from the command line with the --dataset argument.

  • Incremental setting (class-il, domain-il, or general-continual): SETTING attribute (str). See more in section Experimental settings.

  • Size of the input data: SIZE attribute (tuple[int]).

Required properties for class-il and domain-il settings

  • Number of tasks: N_TASKS attribute (int).

  • Number of classes per task: N_CLASSES_PER_TASK attribute (int|tuple[int]). This can be a single integer, or a tuple of integers (one for each task; only in the class-il setting).

Required methods for all settings

  • get_epochs static method (int): returns the number of epochs for each task. This method is optional only for datasets that follow the general-continual setting.

  • get_batch_size static method (int): returns the batch size for each task.

  • get_data_loaders static method ([DataLoader, DataLoader]): returns the train and test data loaders for each task. See more in Utils.

  • get_backbone static method (str): returns the name of the backbone model to be used for the experiment. Backbones are defined in the backbones folder and can be registered with the register_backbone decorator. See more in Backbones.

  • get_transform static method (callable): returns the data-augmentation transform to apply to the data during training.

  • get_loss static method (callable): returns the loss function to use during training.

  • get_normalization_transform static method (callable): returns the normalization transform to apply on torch tensors (no ToTensor() required).

  • get_denormalization_transform static method (callable): returns the transform to apply on the tensors to revert the normalization. You can use the DeNormalize function defined in datasets/transforms/denormalization.py.

  • get_scheduler static method (callable): returns the learning rate scheduler to use during training. By default, it also initializes the optimizer. This prevents errors due to the learning rate being continuously reduced task after task. This behavior can be changed by setting the argument reload_optim=False.
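The inverse of a channel-wise normalization is simply x * std + mean. A minimal pure-Python sketch of the idea follows; the actual DeNormalize in datasets/transforms/denormalization.py operates on torch tensors, but the arithmetic is the same:

```python
# Sketch of the math behind a denormalization transform: it inverts
# channel-wise normalization, i.e. x_denorm = x * std + mean.
def denormalize(channels, mean, std):
    """channels: one list of pixel values per channel."""
    return [[x * s + m for x in ch]
            for ch, m, s in zip(channels, mean, std)]
```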

Optional methods to implement

  • get_prompt_templates (callable): returns the prompt templates for the dataset. This method is expected by some methods (e.g., clip). By default, it returns the ImageNet prompt templates.

  • get_class_names (callable): returns the class names for the dataset. This method is not implemented by default, but is expected by some methods (e.g., clip). The method should populate the class_names attribute of the dataset to cache the result, and call the fix_class_names_order method to ensure that the class names are in the correct order.

See Continual Dataset for more details or SequentialCIFAR10 in Seq CIFAR-10 for an example.
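Putting the required attributes together, a minimal class-il dataset definition might look like the skeleton below. All names and values are hypothetical; a real dataset would inherit from ContinualDataset and also implement the remaining get_* methods listed above.

```python
# Hypothetical skeleton of a class-il dataset definition. In Mammoth the
# class would inherit from ContinualDataset and the getters would return
# real transforms and data loaders; placeholders keep the sketch
# self-contained.
class SequentialToyDataset:  # in practice: SequentialToyDataset(ContinualDataset)
    NAME = 'seq-toy'          # selected from the command line with --dataset seq-toy
    SETTING = 'class-il'      # see section "Experimental settings"
    N_CLASSES_PER_TASK = 2    # an int, or a tuple with one entry per task
    N_TASKS = 5               # 5 tasks x 2 classes = 10 classes in total
    SIZE = (32, 32)           # spatial size of the input images

    @staticmethod
    def get_epochs():
        return 50             # epochs per task

    @staticmethod
    def get_batch_size():
        return 32
```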

Note

Datasets are downloaded by default in the data folder. You can change this default location by modifying the base_path function in conf.

Dataset configurations#

To allow for a more flexible configuration of the datasets, Mammoth supports the use of configuration files that can be used to set the values of the dataset attributes. This greatly simplifies the creation of new datasets, as it separates the definition of a dataset (i.e., its data) from its configuration (number of tasks, transforms, etc.).

The configuration files are stored in datasets/configs/<dataset name>/<configuration name>.yaml and can be selected from the command line using the --dataset_config argument.

The configuration file may contain:

  • SETTING: the incremental setting of the dataset. This can be one of ‘class-il’, ‘domain-il’, ‘general-continual’, or ‘cssl’.

  • N_CLASSES_PER_TASK: the number of classes per task. This can be a single integer or a list of integers (one for each task).

  • N_TASKS: the number of tasks.

  • SIZE: the size of the input data.

  • N_CLASSES: the total number of classes in the dataset.

  • AVAIL_SCHEDS: the available learning rate schedulers for the dataset.

  • TRANSFORM: the data augmentation transform to apply to the data during training.

  • TEST_TRANSFORM: the transform to apply to the data during testing.

  • MEAN, STD: the mean and standard deviation of the dataset, used for normalization.

  • any field specified by the set_default_from_args decorator in the dataset class (see more in section Default arguments and command line). This includes the backbone, batch_size, n_epochs, etc.

  • args: a special field that allows setting the default values of the command line arguments.

The configuration file sets the default values for the dataset attributes and all values defined by the set_default_from_args decorator. The priority is as follows: command line arguments > default values set by the model > configuration file.
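For illustration, a hypothetical configuration file stored at datasets/configs/seq-cifar10/example.yaml might look like the following (all field values are illustrative, not an official Mammoth configuration):

```yaml
# datasets/configs/seq-cifar10/example.yaml (hypothetical)
SETTING: class-il
N_CLASSES_PER_TASK: 2
N_TASKS: 5
SIZE: [32, 32]
MEAN: [0.4914, 0.4822, 0.4465]
STD: [0.2470, 0.2435, 0.2615]

# fields handled by set_default_from_args in the dataset class
batch_size: 32
n_epochs: 50
```

Such a configuration would then be selected with --dataset_config example.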

Experimental settings#

Experimental settings follow and extend the notation of Three Scenarios for Continual Learning, and are defined in the SETTING attribute of each dataset. The following settings are available:

  • class-il: the total number of classes increases at each task, following the N_CLASSES_PER_TASK attribute.

    On task-il and class-il

    Using this setting, metrics will be computed for both class-il and task-il. Metrics for task-il will be computed by masking the correct task for each sample during inference. This allows computing metrics for both settings without having to run the experiment twice.

  • domain-il: the total number of classes is fixed, but the distribution of the input data changes at each task.

  • general-continual: the distribution of the classes changes gradually over time, without a notion of task boundaries. In this setting, the N_TASKS and N_CLASSES_PER_TASK attributes are ignored, as there is only a single long task that changes over time.

  • cssl: this setting is the same as class-il, but with some of the labels missing due to limited supervision. This setting is used to simulate the case where a percentage of the labels is not available for training. For example, if --label_perc_by_task or --label_perc_by_class is set to 0.5, only 50% of the labels will be available for training. The remaining 50% will be masked with a label of -1 and ignored during training if the currently used method does not support partial labels (check out the COMPATIBILITY attribute in Models).
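The task-il masking mentioned in the note above boils down to restricting predictions to the classes of the sample's task. A self-contained sketch of the idea (not Mammoth's actual evaluation code) on plain Python lists:

```python
def task_il_predict(logits, task_id, n_classes_per_task):
    """Mask class-il logits down to the classes of the given task,
    then predict among them (sketch of the idea, not Mammoth's code)."""
    start = task_id * n_classes_per_task
    end = start + n_classes_per_task
    # logits outside the task's class range are pushed to -inf
    masked = [x if start <= i < end else float('-inf')
              for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)
```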

Experiments on the joint setting

Mammoth datasets support the joint setting, which is a special case of the class-il setting where all the classes are available at each task. This is useful to compare the performance of a method on what is usually considered the upper bound for the class-il setting. To run an experiment on the joint setting, simply set --joint to 1. This will automatically set the N_CLASSES_PER_TASK attribute to the total number of classes in the dataset and the N_TASKS attribute to 1.

Note that the joint setting is available only for the class-il (and task-il) setting. If you want to run an experiment on the joint setting for a dataset that follows the domain-il setting, you can use the Joint model (with --model=joint).

Evaluate on Future Tasks#

By default, the evaluation is done up to the current task. However, some models also support evaluation on future tasks (e.g., CGIL). In this case, you can set --eval_future to 1 to evaluate the model on future tasks.

Important

In order to be able to evaluate on future tasks, the method must extend the FutureModel class. Notably, this class includes the future_forward method, which performs inference on all classes, and the change_transform method, which allows changing the transform applied to the data during inference.

Default arguments and command line#

Besides get_epochs and get_batch_size, datasets can define default arguments that are used to set the default values for the command line arguments. This is done with the set_default_from_args decorator, which takes the name of the command line argument as input. For example, the following code sets the default value for the --label_perc_by_task argument:

@set_default_from_args('--label_perc_by_task')
def get_label_perc(self):
    return 0.5

Steps to create a new dataset#

The following steps are required to create a dataset following the legacy naming convention. A new and more flexible way to define datasets is available with the register_dataset decorator. See more in Registration of backbones and datasets.

All datasets must inherit from the ContinualDataset class, which is defined in Continual Dataset. The only exception are datasets that follow the general-continual setting, which inherit from the GCLDataset class (defined in GCL Dataset). These classes provide some useful methods to create data loaders and store masked data loaders for continual learning experiments. See more in the next section.

  1. Create a new file in the datasets folder, e.g. my_dataset.py.

  2. Define a SINGLE new class that inherits from ContinualDataset or GCLDataset and implements all the required methods and attributes.

  3. Define the get_data_loaders method, which returns a list of train and test data loaders for each task (see more in section Utils).

Tip

For convenience, most datasets are initially created with all classes and then masked appropriately by the store_masked_loaders function. For example, in Seq CIFAR-10 the get_data_loaders function of the SequentialCIFAR10 dataset first initializes the MyCIFAR10 and TCIFAR10 datasets with train and test data for all classes, respectively, and then masks the data loaders to return only the data for the current task.

Important

The train data loader must return both augmented and non-augmented data. This is done to allow the storage of raw data for replay-based methods (for more information, check out Rethinking Experience Replay: a Bag of Tricks for Continual Learning). The return signature for the train data loader is (augmented_data, labels, non_augmented_data), while the test data loader should return (data, labels).
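This contract can be sketched with a minimal train set. The class below is a hypothetical, self-contained stand-in: augment is a placeholder for the real train transform, and a real dataset would subclass a torchvision dataset.

```python
class TrainSet:
    """Sketch of the train-loader item contract: each item yields the
    augmented sample, its label, and the raw (non-augmented) sample,
    so replay-based methods can store clean data in their buffer."""
    def __init__(self, data, targets, augment):
        self.data, self.targets, self.augment = data, targets, augment

    def __getitem__(self, index):
        raw, label = self.data[index], self.targets[index]
        # (augmented_data, labels, non_augmented_data)
        return self.augment(raw), label, raw
```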

  4. If all goes well, your dataset should be picked up by the get_dataset function and you should be able to run an experiment with it.

Utils#

  • get_data_loaders: This function should take care of downloading the dataset if necessary, make sure that it contains samples and labels for only the current task (you can use the store_masked_loaders function), and create the data loaders.

  • store_masked_loaders: This function is defined in Continual Dataset and takes care of masking the data loaders to return only the data for the current task. It is used by most datasets to create the data loaders for each task.

  • If the --permute_classes flag is set to 1, it also applies the appropriate permutation to the classes before splitting the data.

  • If the --label_perc_by_task/--label_perc_by_class argument is set to a value between 0 and 1, it also randomly masks a percentage of the labels for each task/class.
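The label-masking behaviour can be sketched as follows (an illustrative stand-in, not Mammoth's implementation): keep a random fraction of the labels and replace the rest with -1.

```python
import random

def mask_labels(labels, label_perc, seed=0):
    """Keep roughly `label_perc` of the labels and replace the rest
    with -1, mimicking the --label_perc_by_task behaviour (sketch)."""
    rng = random.Random(seed)
    n_keep = int(len(labels) * label_perc)
    keep = set(rng.sample(range(len(labels)), n_keep))
    return [y if i in keep else -1 for i, y in enumerate(labels)]
```

Samples whose label is -1 are then ignored by methods that do not support partial labels.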

Module attributes and functions#

Datasets can be included either by registering them using the register_dataset decorator or by following the old naming convention:

  • A single dataset is defined in a file named <dataset_name>.py in the datasets folder.

  • The dataset class must inherit from ContinualDataset.

datasets.get_all_datasets_legacy()[source]#

Returns the list of all the available datasets in the datasets folder that follow the old naming convention.

datasets.get_dataset(args)[source]#

Creates and returns a continual dataset among those that are available. If an error was detected while loading the available datasets, it raises the appropriate error message.

Parameters:

args (Namespace) – the arguments which contain the hyperparameters

Return type:

ContinualDataset

Exceptions:

AssertionError: if the dataset is not available

Exception: if an error is detected in the dataset

Returns:

the continual dataset instance

Return type:

ContinualDataset

datasets.get_dataset_class(args, return_args=False)[source]#

Return the class of the selected continual dataset among those that are available. If an error was detected while loading the available datasets, it raises the appropriate error message.

Parameters:
  • args (Namespace) – the arguments which contain the --dataset attribute

  • return_args (bool) – whether to return the parsable arguments of the dataset

Return type:

ContinualDataset

Exceptions:

AssertionError: if the dataset is not available

Exception: if an error is detected in the dataset

Returns:

the continual dataset class

Return type:

ContinualDataset

datasets.get_dataset_config_names(dataset)[source]#

Return the names of the available continual dataset configurations.

The configurations can be used to create a dataset with specific hyperparameters and can be specified using the --dataset_config attribute.

The configurations are stored in the datasets/configs/<dataset> folder.

datasets.get_dataset_names(names_only=False)[source]#

Return the names of the available continual datasets. If an error was detected while loading the available datasets, it raises the appropriate error message.

Parameters:

names_only (bool) – whether to return only the names of the available datasets

Exceptions:

AssertionError: if the dataset is not available

Exception: if an error is detected in the dataset

Returns:

the names of the available continual datasets

datasets.register_dataset(name)[source]#

Decorator to register a ContinualDataset. The decorator may be used on a class that inherits from ContinualDataset or on a function that returns a ContinualDataset instance. The registered dataset can be accessed using the get_dataset function and can include additional keyword arguments to be set during parsing.

The arguments can be inferred from the signature of the dataset’s class, where the default value of each parameter becomes the default value of the argument. If the default is set to Parameter.empty, the argument is required. If the default is set to None, the argument is optional. The type of the argument is inferred from the default value (default is str).

Parameters:

name (str) – the name of the dataset

Return type:

Callable