The CoDE model architecture is built on a ViT-Tiny backbone, which balances high accuracy with a low parameter count, making it well suited to production environments thanks to its high throughput of processed images per second. To enhance robustness against the post-processing operations often applied to images in real-world scenarios, various transformations, sampled in both type and intensity, are applied to the images during training.
Additionally, both local and global crops of the images are considered, capturing contextual features and fine details that are useful for identifying traces left by different generative models. Overall, during training, CoDE employs a combination of loss functions, leveraging global and multi-scale features to optimally position samples in the embedding space for deepfake detection.
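The two training-time ingredients above can be sketched as follows. This is a minimal illustration, not the released training code: the transformation pool (`jpeg_like_noise`, `gaussian_blur_like`), the intensity range, and the crop fraction are all hypothetical placeholders standing in for the paper's actual augmentations, with NumPy arrays as stand-ins for images.

```python
import random
import numpy as np

# Hypothetical pool of post-processing operations; as described above, both
# the type and the intensity of each transformation are sampled at training time.
def jpeg_like_noise(img, strength):
    # crude stand-in for compression artifacts: quantize pixel values
    levels = max(2, int(256 * (1.0 - strength)))
    step = max(1, 256 // levels)
    return (img // step) * step

def gaussian_blur_like(img, strength):
    # crude stand-in for blurring: mix each pixel with the global mean
    return ((1.0 - strength) * img + strength * img.mean()).astype(img.dtype)

TRANSFORMS = [jpeg_like_noise, gaussian_blur_like]

def augment(img, rng=random):
    op = rng.choice(TRANSFORMS)       # sample the transformation type
    strength = rng.uniform(0.0, 0.5)  # sample its intensity
    return op(img, strength)

def global_local_crops(img, n_local=2, local_frac=0.4, rng=random):
    """Return one global view plus n_local smaller random crops."""
    h, w = img.shape[:2]
    ch, cw = int(h * local_frac), int(w * local_frac)
    local_views = []
    for _ in range(n_local):
        y = rng.randrange(0, h - ch + 1)
        x = rng.randrange(0, w - cw + 1)
        local_views.append(img[y:y + ch, x:x + cw])
    return img, local_views

img = np.random.randint(0, 256, (64, 64), dtype=np.int64)
global_view, local_views = global_local_crops(augment(img))
```

In the real pipeline, the global view and local crops would each be encoded by the ViT-Tiny backbone, and the contrastive losses would then act on the resulting embeddings.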
Existing deepfake detection datasets are limited in both the diversity of their generators and the quantity of their images. Therefore, we create and release a new dataset that supports training deepfake detection methods from scratch. Our Diffusion-generated Deepfake Detection dataset (D3) contains nearly 2.3M records and 11.5M images. Each record consists of a prompt, a real image, and four fake images, each produced by a different generator. Prompts and corresponding real images are taken from LAION-400M, while fake images are generated from the same prompt using different text-to-image generators.
We employ four state-of-the-art open-source diffusion models, namely Stable Diffusion 1.4 (SD-1.4), Stable Diffusion 2.1 (SD-2.1), Stable Diffusion XL (SD-XL), and DeepFloyd IF (DF-IF). While the first three are variants of the Stable Diffusion approach, DeepFloyd IF is strongly inspired by Imagen and thus represents a different generation technique.
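A D3 record as described above could be modeled with the following sketch. The class name, field names, and file paths are hypothetical illustrations of the record layout (one prompt, one real image, four fakes keyed by generator), not the dataset's actual on-disk format.

```python
from dataclasses import dataclass

# The four generators used to build D3, as listed above.
GENERATORS = ("SD-1.4", "SD-2.1", "SD-XL", "DF-IF")

@dataclass
class D3Record:
    """One D3 record: a prompt, its LAION real image, and four generated fakes."""
    prompt: str
    real_image: str   # path (or URL) of the real image
    fake_images: dict # generator name -> path of the generated image

    def __post_init__(self):
        missing = set(GENERATORS) - set(self.fake_images)
        if missing:
            raise ValueError(f"missing generators: {sorted(missing)}")

    def pairs(self):
        """Yield (image, label) pairs: 0 = real, 1 = fake."""
        yield self.real_image, 0
        for gen in GENERATORS:
            yield self.fake_images[gen], 1

# Illustrative usage with made-up paths:
rec = D3Record(
    prompt="a photo of a cat",
    real_image="real/000001.jpg",
    fake_images={g: f"fake/{g}/000001.png" for g in GENERATORS},
)
```

Each record thus yields five labeled images (one real, four fake), which is consistent with the roughly 2.3M records expanding to about 11.5M images.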
To increase the variance of the dataset, images have been generated at several resolutions: 256×256, 512×512, 640×480, and 640×360. Moreover, to mimic the distribution of real images, we also employ a variety of encoding and compression methods (BMP, GIF, JPEG, TIFF, PNG). In particular, we closely follow the distribution of encoding methods of LAION itself, therefore favoring the presence of JPEG-encoded images.
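Sampling an output format along these lines could look like the sketch below. The encoding weights are illustrative placeholders (the actual pipeline follows LAION's empirical distribution, whose exact proportions are not given here); only the JPEG-heavy skew is taken from the text.

```python
import random

# Resolutions and encodings listed in the text above.
RESOLUTIONS = [(256, 256), (512, 512), (640, 480), (640, 360)]
ENCODINGS = ["JPEG", "PNG", "BMP", "GIF", "TIFF"]
# Hypothetical weights: JPEG-heavy to mimic LAION, exact values assumed.
ENCODING_WEIGHTS = [0.80, 0.10, 0.04, 0.03, 0.03]

def sample_output_format(rng=random):
    """Sample a (resolution, encoding) pair for one generated image."""
    resolution = rng.choice(RESOLUTIONS)
    encoding = rng.choices(ENCODINGS, weights=ENCODING_WEIGHTS, k=1)[0]
    return resolution, encoding
```

Varying both the resolution and the encoding in this way exposes the detector to the same compression artifacts it will encounter in real-world images.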
This work has been done under the Multimedia use case of the European network ELSA - European Lighthouse on Secure and Safe AI. The objective of the Multimedia use case is to develop effective solutions for detecting and mitigating the spread of deepfake images in multimedia content.
Join our thrilling competition on deepfake detection and put your skills to the test. As the rise of deepfake technology poses unprecedented challenges, we invite individuals and teams from all backgrounds to showcase their expertise in identifying and debunking manipulated media.
@inproceedings{baraldi2024contrastive,
title={{Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities}},
author={Baraldi, Lorenzo and Cocchi, Federico and Cornia, Marcella and Baraldi, Lorenzo and Nicolosi, Alessandro and Cucchiara, Rita},
booktitle={Proceedings of the European Conference on Computer Vision},
year={2024}
}