A New Dataset and Benchmark for Grounding Multimodal Misinformation

Dataset Pipeline

Construction pipeline of GroundLie360. It consists of three stages: (1) Video Collection, which harvests videos from snopes.com; (2) Curation, which classifies videos into five categories: target videos (directly related to claims), original videos (pre-manipulation), debunking videos (refutations), evidence videos (supporting materials), and others; (3) Annotation, which labels target videos with six fake types (False Title/Speech, Temporal Edits, CGI, and so on) and grounding information.

Comparison of Multimodal Fake News Datasets. Label Levels (L1: binary veracity classification, L2: fake types, L3: fake content grounding), M.A. (Manual Annotation), Annotation (Title: Textual headline, Speech: Transcript of verbal content, V.T.: Video Temporal localization, V.S.: Video Spatial localization), Range (RW-Gen: Real-World General Misinformation, DF: Deepfake, Tam: Tampering), Rat. (Explanation Rationale), and #Plat. (Number of platforms from which the video data was collected).

Supplementary Materials

Overview

This supplementary material is organized as follows:

Dataset Construction Details
1. Video Curation Process
2. Three-Level Annotation Process
3. Definition of Six Fake Types
Method Details
1. Temporal Edit Localizer

Dataset Construction Details

Video Curation Process

Video curation is the crucial first step after data collection. In addition to the target news videos, the auxiliary data in the dataset is manually categorized into three types: debunking videos, original videos, and evidence videos. Debunking videos are typically produced by authoritative sources or fact-checking organizations to explicitly refute misinformation or false claims presented in the target news videos. Original videos are unedited or primary-source footage that may have been manipulated or misrepresented in the news content; they provide a baseline for comparison and authenticity verification. Evidence videos offer additional contextual or corroborative information that indirectly supports the assessment of the target video's authenticity and credibility. Notably, only target videos are sent to the three-level annotation pipeline.

Examples of auxiliary videos

Three-Level Annotation Process

The figure below presents the detailed pipeline of the three-level annotation scheme. First, each filtered target video is automatically assigned a veracity label based on the rating provided by the fact-checking website Snopes. For samples labeled as fake, annotators separately assess the presence of each of the six predefined fake types, which amounts to a second-level multilabel classification. For the fine-grained third level, each identified fake label is further grounded to specific locations within the corresponding modality. The grounding process varies with the modality and the type of fake content: (1) For textual content (e.g., video headlines, speech), grounding means highlighting the specific phrase or entire sentence that contains false or misleading information. (2) For video content, grounding is temporal for the temporal edit type, marking the timestamps of deceptive editing operations; for CGI-based falsification, spatial grounding is also applied to indicate the manipulated area within the video frame. (3) For cross-modal inconsistencies, grounding involves locating contradictions or unsupported claims in the text relative to the video. Notably, candidate timestamps for the temporal edit type are discrete, comprising the start timestamp, the end timestamp, and the scene transition timestamps; these candidates are mapped to a binary vector indicating the locations of deceptive edit operations.

An example pipeline of the three-level annotation scheme.
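
The mapping from discrete candidate timestamps to a binary vector can be illustrated with a minimal sketch. The layout used here (video start first, then scene transitions in temporal order, then video end) and the function name `encode_temporal_edits` are assumptions made for illustration; the text above does not specify the ordering of the vector entries.

```python
from typing import List, Sequence


def encode_temporal_edits(
    num_transitions: int,
    start_edited: bool,
    end_edited: bool,
    edited_transitions: Sequence[int],
) -> List[int]:
    """Encode temporal-edit annotations as a binary vector.

    Assumed layout (hypothetical, not taken from the paper):
    [start, transition_0, ..., transition_{n-1}, end]
    """
    if num_transitions < 0:
        raise ValueError("num_transitions must be non-negative")

    # One entry per candidate: start, every scene transition, end.
    vector = [0] * (num_transitions + 2)
    vector[0] = int(start_edited)
    vector[-1] = int(end_edited)

    for idx in edited_transitions:
        if not 0 <= idx < num_transitions:
            raise IndexError(f"transition index {idx} out of range")
        vector[1 + idx] = 1  # shift by one to skip the start entry
    return vector


# A video with three scene transitions whose start and second
# transition were annotated as deceptively edited.
label = encode_temporal_edits(3, True, False, [1])
print(label)
```

Under this assumed layout, a video with n scene transitions always yields a vector of length n + 2, so annotations and model outputs for the same video are directly comparable entry by entry.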

Definition of Six Fake Types

To ensure consistency in classification and grounding across different annotators, we provide detailed definitions for each sub-label.

Definitions of sub-labels used in the proposed dataset.

Method Details

Temporal Edit Localizer

For the sake of brevity, the temporal edit localizer is presented in a simplified form in the main text. In practice, in addition to scene transitions, the candidate grounding elements also include the start and end of the video, as shown in the figure below. Consistent with the annotation process, the output of the temporal edit localizer is also a binary vector indicating the locations of deceptive edit operations.

The actual implementation of the temporal edit localizer. The VLM first checks whether the start or end of the video has been deceptively edited. It then evaluates transitions between scenes to detect malicious edits.
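
The control flow described above can be sketched as follows. This is only an illustration under stated assumptions: the VLM is replaced by a hypothetical callable `judge(kind, index)` that returns True when a candidate is flagged as deceptively edited, and the output layout (start, transitions in order, end) is assumed rather than taken from the paper. Prompting, frame sampling, and scene detection are out of scope here.

```python
from typing import Callable, List

# Hypothetical stand-in for the VLM: kind is "start", "end", or
# "transition"; index identifies the scene (or the transition after it).
Judge = Callable[[str, int], bool]


def localize_temporal_edits(num_scenes: int, judge: Judge) -> List[int]:
    """Return a binary vector over [start, transitions..., end]."""
    if num_scenes < 1:
        raise ValueError("a video must contain at least one scene")

    # Step 1: check the start and the end of the video.
    start_flag = int(judge("start", 0))
    end_flag = int(judge("end", num_scenes - 1))

    # Step 2: check each transition between consecutive scenes.
    transition_flags = [
        int(judge("transition", i)) for i in range(num_scenes - 1)
    ]

    # Assemble in the assumed layout so it matches the annotation vector.
    return [start_flag] + transition_flags + [end_flag]


# Usage with a stub judge that flags only the second scene transition.
flagged = {("transition", 1)}
prediction = localize_temporal_edits(4, lambda kind, i: (kind, i) in flagged)
print(prediction)
```

Because the prediction shares its layout with the annotation vector, it can be compared against the ground truth with standard per-entry metrics.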