A New Dataset and Benchmark for Grounding Multimodal Misinformation

1School of Computer Science, Wuhan University
2Peking University
3School of Computing, National University of Singapore
Technical Report
GroundLie360 Dataset Overview
Overview. Our multimodal benchmark contains 2,000+ fact-checked videos with fake type and grounding annotations. The six fake types are: (1) False Title and (2) False Speech: the video title or the spoken content, respectively, contains demonstrably false claims; (3) Temporal Edit: videos altered to distort event chronologies or fabricate deceptive narratives; (4) CGI: digitally manipulated or generated synthetic media; (5) Contradictory Content: text-video semantic mismatches; and (6) Unsupported Content: headlines lacking evidentiary support in the video content. The dataset offers a unified benchmark for fake content classification and localization.

Abstract

The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single- and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection.

Dataset Pipeline

Construction pipeline of GroundLie360. It consists of three stages: (1) Video Collection, harvesting videos from snopes.com; (2) Curation, classifying videos into five categories: target videos (directly related to claims), original videos (pre-manipulation), debunking videos (refutations), evidence videos (supporting materials), and others; (3) Annotation, labeling target videos with six fake types (False Title/Speech, Temporal Edit, CGI, and so on) and grounding information.
Dataset Comparison
Comparison of multimodal fake news datasets. Label Levels (L1: binary veracity classification, L2: fake types, L3: fake content grounding), M.A. (Manual Annotation), Annotation (Title: textual headline, Speech: transcript of verbal content, V.T.: video temporal localization, V.S.: video spatial localization), Range (RW-Gen: Real-World General Misinformation, DF: Deepfake, Tam: Tampering), Rat. (Explanation Rationale), and #Plat. (number of platforms from which the video data was collected).

Method

Method Overview
Method Overview. (a) Overall pipeline; (b) Analysis Module, including text, video temporal/spatial, and cross-modality analyses; (c) Multimodal Classification and Grounding with a binary classifier, a multilabel classifier, and four localizers (the false speech and contradictory content localizers share their designs with the false title and unsupported content localizers, respectively).

Results

Experimental Results
Binary classification results. Trad. and LLM denote traditional video fake detection methods and LLM-based methods, respectively.
Subtype classification and text grounding performance across fake types.
Performance on video grounding tasks. Temporal grounding is evaluated with Precision (Prec.), Recall (Rec.), and F1; spatial-temporal grounding with m_tIoU, m_vIoU, and vIoU@ thresholds. T.E. denotes temporal edit.
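As a rough illustration of the temporal grounding metrics named in the caption, the sketch below computes precision, recall, and F1 between a predicted and an annotated binary vector. It assumes both vectors are defined over the same candidate timestamps and that scores are computed per video; the function name `prf1` and the example vectors are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Sequence, Tuple


def prf1(pred: Sequence[int], gold: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for two binary vectors of equal length."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must cover the same candidates")
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative vectors only (not dataset values).
precision, recall, f1 = prf1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

How scores are aggregated across videos (for example, micro versus macro averaging) is not stated here, so the sketch leaves that choice open.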

Supplementary Materials

Overview

This supplementary material covers dataset construction details and method details.

Dataset Construction Details

Video Curation Process

Video collection is followed by video curation, the first processing step. In addition to the target news videos, the auxiliary data in the dataset is manually categorized into three types: debunking videos, original videos, and evidence videos. Debunking videos are typically produced by authoritative sources or fact-checking organizations to explicitly refute misinformation or false claims presented in the target news videos. Original videos are unedited or primary-source footage that may have been manipulated or misrepresented in the news content; they provide a baseline for comparison and authenticity verification. Evidence videos offer additional contextual or corroborative information that indirectly supports the assessment of the target video's authenticity and credibility. Notably, only target videos are sent to the three-level annotation pipeline.

auxiliary data
Examples of auxiliary videos

Three-Level Annotation Process

The figure below presents the detailed pipeline of the three-level annotation scheme. First, each filtered target video is automatically assigned a veracity label based on the rating provided by the fact-checking website Snopes. For samples labeled as fake, annotators separately assess the presence of each of the six predefined fake types, which yields a second-level multilabel classification. For the fine-grained third level, each identified fake label is grounded to specific locations within the corresponding modality. The grounding process varies with the modality and the type of fake content: (1) For textual content (e.g., video headlines, speech), grounding means highlighting the specific phrase or entire sentence that contains false or misleading information. (2) For video content, grounding is temporal for the temporal edit type, marking the timestamps of deceptive editing operations. For CGI-based falsification, spatial grounding is also applied to indicate the manipulated area within the video frame. (3) For cross-modal inconsistencies, grounding involves locating contradictions or unsupported claims in the text relative to the video. Notably, candidate timestamps for the temporal edit type are discrete, comprising the start timestamp, the end timestamp, and the scene transition timestamps; they are uniformly mapped to a binary vector indicating the locations of deceptive edit operations.
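The mapping from discrete candidate timestamps to a binary vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' annotation tooling: the function names, the ordering of candidates (start, scene transitions, end), and the matching tolerance `tol` are all hypothetical.

```python
from typing import List, Sequence


def candidate_timestamps(duration: float,
                         scene_transitions: Sequence[float]) -> List[float]:
    """Candidates in temporal order: video start, scene transitions, video end."""
    inner = sorted(t for t in scene_transitions if 0.0 < t < duration)
    return [0.0] + inner + [duration]


def to_binary_vector(candidates: Sequence[float],
                     edited: Sequence[float],
                     tol: float = 0.5) -> List[int]:
    """Set entry i to 1 if an annotated edit lies within `tol` seconds of candidate i."""
    return [int(any(abs(c - e) <= tol for e in edited)) for c in candidates]


# Illustrative values only: a 30-second video with three scene transitions,
# annotated as deceptively edited at its start and near the 12.4 s transition.
candidates = candidate_timestamps(30.0, [12.4, 4.0, 21.7])
vector = to_binary_vector(candidates, edited=[0.0, 12.5])
```

Because the candidates are discrete, a prediction and an annotation for the same video share one vector length, which makes element-wise comparison straightforward.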

3-level Annotation
An example pipeline of the three-level annotation scheme.

Definitions of the Six Fake Types

To ensure consistency in classification and grounding across different annotators, we provide detailed definitions for each sub-label.

Definition of six types
Definitions of sub-labels used in the proposed dataset.
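Since a fake video may carry several of the six sub-labels at once, the second-level annotation can be represented as a multi-hot vector. The sketch below is one possible encoding, given only for illustration; the enum name, the member order, and the helper `to_multi_hot` are assumptions rather than the dataset's actual schema.

```python
from enum import Enum
from typing import Iterable, List


class FakeType(Enum):
    """The six fake types; the index order is an arbitrary choice."""
    FALSE_TITLE = 0
    FALSE_SPEECH = 1
    TEMPORAL_EDIT = 2
    CGI = 3
    CONTRADICTORY_CONTENT = 4
    UNSUPPORTED_CONTENT = 5


def to_multi_hot(labels: Iterable[FakeType]) -> List[int]:
    """Encode a set of fake types as a length-6 multi-hot vector."""
    vec = [0] * len(FakeType)
    for label in labels:
        vec[label.value] = 1
    return vec


# Illustrative sample: a video with a false title and CGI content.
example = to_multi_hot([FakeType.FALSE_TITLE, FakeType.CGI])
```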

Method Details

Temporal Edit Localizer

For brevity, the main text presents the temporal edit localizer in a simplified form. In practice, in addition to scene transitions, the candidate grounding elements also include the start and end of the video, as shown in the figure below. Consistent with the annotation process, the output of the temporal edit localizer is a binary vector indicating the locations of deceptive edit operations.

temporal edit localizer
The actual implementation of the temporal edit localizer. The VLM first checks whether the start or end of the video has been deceptively edited. It then evaluates transitions between scenes to detect malicious edits.
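The control flow described in the caption can be sketched as below. This is a hedged outline rather than the FakeMark implementation: the VLM is abstracted as a hypothetical callable `judge(kind, timestamp)` that returns `True` when it considers the candidate deceptively edited, and the prompts, frame sampling, and scene detection are omitted entirely.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a VLM query: (candidate kind, timestamp) -> verdict.
Judge = Callable[[str, float], bool]


def localize_temporal_edits(duration: float,
                            scene_transitions: Sequence[float],
                            judge: Judge) -> List[int]:
    """Return a binary vector ordered as [start, transitions..., end]."""
    # Step 1: check the start and the end of the video.
    start_flag = int(judge("start", 0.0))
    end_flag = int(judge("end", duration))
    # Step 2: check each scene transition in temporal order.
    inner = sorted(t for t in scene_transitions if 0.0 < t < duration)
    transition_flags = [int(judge("transition", t)) for t in inner]
    return [start_flag] + transition_flags + [end_flag]


def stub_judge(kind: str, timestamp: float) -> bool:
    """Toy rule used only to exercise the control flow; not a real model."""
    return kind == "start" or (kind == "transition" and timestamp > 10.0)


prediction = localize_temporal_edits(30.0, [4.0, 12.4, 21.7], stub_judge)
```

Passing the model in as a callable keeps the sketch independent of any particular VLM API, and the returned vector has the same layout as the annotated binary vector.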