FFAA: Face Forgery Analysis Assistant

Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant

Tsinghua University · The Chinese University of Hong Kong

Abstract

The rapid advancement of deepfake technologies has sparked widespread public concern, particularly as face forgery poses a serious threat to public information security. However, unknown and diverse forgery techniques, varied facial features, and complex environmental factors pose significant challenges for face forgery analysis. Existing datasets lack descriptions of these aspects, making it difficult for models to distinguish between real and forged faces using only visual information amid various confounding factors. In addition, existing methods do not yield user-friendly, explainable results, complicating the understanding of the model's decision-making process. To address these challenges, we make the following contributions:

  1. Novel Task. We introduce the Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and establish the OW-FFA-Bench for evaluation. This task is essential for understanding the model's decision-making process and for advancing real-world face forgery analysis.
  2. FFA-VQA Dataset. We use GPT-4o-assisted data generation to create the FFA-VQA dataset, featuring diverse real and forged face images with essential descriptions and forgery reasoning.
  3. FFAA. We propose the Face Forgery Analysis Assistant (FFAA), consisting of a fine-tuned MLLM and a Multi-answer Intelligent Decision System (MIDS). By integrating hypothetical prompts with MIDS, the impact of fuzzy classification boundaries is effectively mitigated, enhancing robustness.
  4. Performance. FFAA not only provides user-friendly explainable results but also achieves state-of-the-art generalization performance on the OW-FFA-Bench with an ACC of 86.5% and an AUC of 94.4%, and demonstrates excellent robustness.

Face Forgery Analysis VQA Data

We utilize FF++, Celeb-DF-v2, DFFD (FFHQ, FaceApp, StyleGAN) and GanDiffFace as our source datasets, since they encompass diverse forgery techniques and enriched facial characteristics. From these, we construct the Multi-attack (MA) dataset, comprising 95K images with diverse facial features and various types of forgeries. We then employ GPT-4o to generate face forgery analysis processes. Finally, experts scrutinize these analyses, and the qualified responses are organized into a VQA format, yielding 20K high-quality face forgery analysis VQA samples.

| Data file name | File Size | Sample Size |
| --- | --- | --- |
| ffa_vqa_20k.json | 32.4 MB | 20K |
| mistral_ffa_vqa_90k.json | 98.2 MB | 30K |
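For reference, a single entry in these JSON files might look like the sketch below, assuming a LLaVA-style "conversations" layout; the field names, example path and question wording are illustrative assumptions, not the released schema:

```python
import json

# Hypothetical layout of one FFA-VQA entry (field names are assumptions).
def to_vqa_record(image_path: str, analysis: str, verdict: str) -> dict:
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nIs the face in this image real or forged? Explain your reasoning."},
            {"from": "gpt",
             "value": f"{analysis} Final judgment: {verdict}."},
        ],
    }

if __name__ == "__main__":
    record = to_vqa_record(
        "MA/example_0001.jpg",  # hypothetical path into the MA dataset
        "The skin texture is unnaturally smooth and the lighting on the "
        "left cheek is inconsistent with the background.",
        "Fake",
    )
    print(json.dumps(record, indent=2))
```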

FFAA: Face Forgery Analysis Assistant

FFAA consists of two modules: a fine-tuned Multimodal Large Language Model (MLLM) and a Multi-answer Intelligent Decision System (MIDS). MIDS is designed to select the answer that best matches the image's authenticity, mitigating the impact of fuzzy classification boundaries between real and forged faces and thereby enhancing the model's robustness. We adopt a two-stage training procedure:

  • Stage 1: Fine-tune MLLM with Hypothetical Prompts. We add hypothetical prompts to the FFA-VQA dataset that presume the face is either real or fake prior to analysis. Fine-tuning the MLLM on this dataset enables it to generate answers under varying hypotheses.
  • Stage 2: Train MIDS with Historical Answers. We use the fine-tuned MLLM to generate answers on unused samples from the MA dataset and train MIDS on these historical answers (a minimal sketch of the resulting inference flow follows this list).
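The sketch below illustrates how the two trained modules might interact at inference time, assuming three hypothetical prompts (neutral, presumed-real, presumed-fake); the prompt wording and both functions are illustrative stand-ins, not the released API:

```python
# Illustrative stand-ins: `mllm_answer` plays the fine-tuned MLLM and
# `mids_score` plays MIDS; neither reflects the released implementation.
HYPOTHETICAL_PROMPTS = [
    "Analyze this face and judge whether it is real or forged.",        # neutral
    "Assume this face is real, then analyze it and give a verdict.",    # real hypothesis
    "Assume this face is forged, then analyze it and give a verdict.",  # fake hypothesis
]

def mllm_answer(image, prompt: str) -> str:
    """Placeholder for the fine-tuned MLLM: returns analysis text plus a verdict."""
    return f"[analysis under prompt {prompt!r}] Final judgment: Fake."

def mids_score(image, answer: str) -> float:
    """Placeholder for MIDS: scores how well an answer matches the image."""
    return 0.0  # a real system would fuse image and answer features here

def ffaa_predict(image) -> str:
    # The stage-1 model produces one answer per hypothesis; MIDS (stage 2)
    # then selects the answer that best matches the image's authenticity.
    answers = [mllm_answer(image, p) for p in HYPOTHETICAL_PROMPTS]
    return max(answers, key=lambda a: mids_score(image, a))
```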

Performance

OW-FFA-Bench: Open-World Face Forgery Analysis Benchmark

We create the generalization test sets of the OW-FFA-Bench by randomly collecting face images from the public datasets Deeperforensics (DPF), DeepFakeDetection (DFD), DFDC, PGGAN, Celeb-A, WhichFaceIsReal (WFIR) and MultiFFDI (MFFDI). The images are divided into six generalization test sets by source, each containing 1K images and differing in distribution. FFAA achieves state-of-the-art generalization performance on the OW-FFA-Bench with an ACC of 86.5% and an AUC of 94.4%, and demonstrates excellent robustness with an sACC of 10.0%.
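To make the reported numbers concrete, here is a small aggregation sketch, assuming ACC and AUC are averaged over the six 1K-image test sets and that sACC denotes the standard deviation of per-set accuracy (our reading of the robustness metric; lower means more stable across distributions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aggregate_benchmark(per_set_labels, per_set_scores, threshold=0.5):
    """Mean ACC/AUC over the test sets, plus sACC = std of per-set accuracy."""
    accs, aucs = [], []
    for labels, scores in zip(per_set_labels, per_set_scores):
        labels = np.asarray(labels)   # 1 = fake, 0 = real (assumed convention)
        scores = np.asarray(scores)   # model's forgery probabilities
        preds = (scores >= threshold).astype(int)
        accs.append((preds == labels).mean())
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(accs)), float(np.mean(aucs)), float(np.std(accs))
```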

Comparison with Advanced MLLMs

We compare FFAA with advanced MLLMs on face forgery analysis, including LLaVA-v1.6-34b and GPT-4o. Judgment results are indicated in parentheses as ('Real', 'Fake', 'Refuse to judge'), with green marking correct judgments and red incorrect ones. These advanced MLLMs exhibit strong image understanding capabilities but often struggle to determine face authenticity, sometimes avoiding a definitive conclusion. In contrast, FFAA not only understands the image but also conducts a reasoned analysis of its authenticity from multiple perspectives.

Enhanced Interpretability across Diverse Forgery Techniques

We present eight face images from different sources, along with heatmap visualizations of MIDS's Fvl and Fvg features and several key pieces of information from FFAA's answers. Fvl tends to focus on specific facial features, the surrounding environment, and more apparent local forgery marks or authenticity clues, whereas Fvg emphasizes more abstract attributes (e.g., 'smoothness', 'lighting', 'shadows', 'integration'). In some cases, such as examples (a), (b) and (c), Fvl and Fvg rely on the same features as clues for assessing the authenticity of the face.
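For readers who want to reproduce this style of figure, a minimal overlay sketch follows; `attn_map` stands in for an Fvl or Fvg attention map already upsampled to the image resolution (how FFAA extracts these maps is not shown here):

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_heatmap(image: np.ndarray, attn_map: np.ndarray, alpha: float = 0.5):
    """Blend a normalized attention map over a face image, CAM-style."""
    span = attn_map.max() - attn_map.min()
    attn = (attn_map - attn_map.min()) / (span + 1e-8)  # scale to [0, 1]
    plt.imshow(image)
    plt.imshow(attn, cmap="jet", alpha=alpha)  # warmer colors = stronger attention
    plt.axis("off")
    plt.show()
```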

Examples of Open-World Explainable Face Forgery Analysis

BibTeX

@article{huang2024ffaa,
  title={FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant},
  author={Huang, Zhengchao and Xia, Bin and Lin, Zicheng and Mou, Zhun and Yang, Wenming},
  journal={arXiv preprint arXiv:2408.10072},
  year={2024}
}

Acknowledgement

This website is modified from LLaVA and LLMGA, both of which are outstanding works! We thank the LLaVA team for giving us access to their models and open-source projects.

Usage and License Notices: The data, code and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.