Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

1University of Modena and Reggio Emilia, 2University of Pisa, 3Leonardo S.p.A.
* Equal contribution

We introduce CoDE (Contrastive Deepfake Embeddings), a novel approach that leverages contrastive learning and global-local similarities to build an embedding space tailored to deepfake detection. CoDE achieves state-of-the-art accuracy and generalization across various image generators, including some that were not seen during training.

Additionally, we release the D3 dataset used for training, which consists of 9.2 million images generated with different diffusion models.

Abstract

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research has primarily addressed the detection of fake faces, the identification of generated natural images has only recently surfaced. This has prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. To support the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by four different generators. Experimental results demonstrate that CoDE achieves excellent generalization capabilities to unseen image generators.

Model Architecture

The CoDE model architecture is built on a ViT-Tiny, which balances high accuracy with a low number of parameters. This makes it particularly suitable for production environments due to its high throughput of processed images per second. To enhance robustness against post-processing operations often applied to images in real-world scenarios, various transformations are applied to images during training. These transformations are sampled in both type and intensity.
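As a minimal sketch of how training-time transformations can be sampled in both type and intensity, consider the following. The operation names and intensity ranges below are hypothetical illustrations, not CoDE's exact augmentation configuration.

```python
import random

# Hypothetical pool of post-processing operations, each with an intensity
# range; the actual set used by CoDE may differ.
AUGMENTATIONS = {
    "jpeg_compression": (30, 95),    # JPEG quality factor
    "gaussian_blur":    (0.1, 2.0),  # blur sigma
    "downscale":        (0.5, 1.0),  # resize factor
    "brightness":       (0.8, 1.2),  # intensity multiplier
}

def sample_augmentation(rng: random.Random):
    """Pick one transformation type, then an intensity within its range."""
    name = rng.choice(list(AUGMENTATIONS))
    lo, hi = AUGMENTATIONS[name]
    return name, rng.uniform(lo, hi)

rng = random.Random(0)
name, intensity = sample_augmentation(rng)
```

Sampling both the operation and its strength at random exposes the detector to the kinds of degradations images undergo in the wild (re-encoding, resizing, blurring), which is what makes the learned embeddings robust to post-processing.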

Additionally, both local and global crops of images are considered to capture contextual features and fine details, which are useful for identifying traces left by different generative models. Overall, during training CoDE employs a combination of loss functions, leveraging global and multi-scale features to optimally position the features in the embedding space for deepfake detection.
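A generic contrastive objective over global and local crop embeddings can be sketched as follows. This is a standard InfoNCE formulation used for illustration, not the paper's exact combination of loss terms.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is the row-matched embedding;
    all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log p(positive | anchor)

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 16))  # global-crop embeddings (toy data)
l = rng.normal(size=(8, 16))  # local-crop embeddings of the same images
total = info_nce(g, g) + info_nce(g, l)  # global + global-local terms
```

Combining a global term with a global-to-local term encourages embeddings of crops from the same image to cluster together, so that both contextual features and fine-grained generator traces shape the embedding space.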

A dataset for Large-scale Deepfake Detection (D3)

Existing deepfake detection datasets are limited in both generator diversity and image quantity. We therefore create and release a new dataset that can support training deepfake detection methods from scratch. Our Diffusion-generated Deepfake Detection dataset (D3) contains nearly 2.3M records and 11.5M images. Each record consists of a prompt, a real image, and four fake images, one produced by each of four generators. Prompts and corresponding real images are taken from LAION-400M, while fake images are generated from the same prompt using different text-to-image generators.
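The record structure described above can be sketched as a simple mapping; the field names and file paths here are hypothetical, while the generator labels follow the four models listed below.

```python
# Hypothetical sketch of a single D3 record: one prompt, one real image,
# and one fake image per generator. Paths and keys are illustrative only.
record = {
    "prompt": "a photo of a mountain lake at sunset",
    "real_image": "laion/000123.jpg",
    "generated": {
        "SD-1.4": "d3/sd14/000123.png",
        "SD-2.1": "d3/sd21/000123.png",
        "SD-XL":  "d3/sdxl/000123.png",
        "DF-IF":  "d3/dfif/000123.png",
    },
}
```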

We employ four state-of-the-art open-source diffusion models, namely Stable Diffusion 1.4 (SD-1.4), Stable Diffusion 2.1 (SD-2.1), Stable Diffusion XL (SD-XL), and DeepFloyd IF (DF-IF). While the first three generators are variants of the Stable Diffusion approach, DeepFloyd IF is strongly inspired by Imagen and thus represents a different generation technique.

With the aim of increasing the variance of the dataset, images have been generated at different resolutions and aspect ratios: 256×256, 512×512, 640×480, and 640×360. Moreover, to mimic the distribution of real images, we also employ a variety of encoding and compression methods (BMP, GIF, JPEG, TIFF, PNG). In particular, we closely follow the distribution of encoding methods of LAION itself, therefore favoring the presence of JPEG-encoded images.
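A per-image sampling of resolution and encoding, as described above, might be sketched as follows. The encoding weights are hypothetical stand-ins for the LAION-like, JPEG-heavy distribution; the resolutions are the four listed in the text.

```python
import random

# The four resolutions used in D3 (from the text).
RESOLUTIONS = [(256, 256), (512, 512), (640, 480), (640, 360)]
# Encodings from the text; weights are hypothetical, skewed toward JPEG
# to mimic the LAION encoding distribution.
ENCODINGS = ["JPEG", "PNG", "BMP", "GIF", "TIFF"]
ENCODING_WEIGHTS = [0.85, 0.08, 0.03, 0.02, 0.02]

def sample_format(rng: random.Random):
    """Draw a resolution uniformly and an encoding from a weighted list."""
    res = rng.choice(RESOLUTIONS)
    enc = rng.choices(ENCODINGS, weights=ENCODING_WEIGHTS, k=1)[0]
    return res, enc

rng = random.Random(0)
res, enc = sample_format(rng)
```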

Qualitative Results

ELSA Challenge and Benchmarks

This work has been done under the Multimedia use case of the European network ELSA - European Lighthouse on Secure and Safe AI. The objective of the Multimedia use case is to develop effective solutions for detecting and mitigating the spread of deep fake images in multimedia content.

Join our competition on deepfake detection and put your skills to the test. As the rise of deepfake technology poses unprecedented challenges, we invite individuals and teams from all backgrounds to showcase their expertise in identifying and debunking manipulated media.

Citation

@inproceedings{baraldi2024contrastive,
  title={{Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities}},
  author={Baraldi, Lorenzo and Cocchi, Federico and Cornia, Marcella and Baraldi, Lorenzo and Nicolosi, Alessandro and Cucchiara, Rita},
  booktitle={Proceedings of the European Conference on Computer Vision},
  year={2024}
}