This supplementary material is organized as follows:
Video curation is the crucial first step after data collection. In addition to the target news videos, the dataset contains auxiliary data that is manually categorized into three types: debunking videos, original videos, and evidence videos. Debunking videos are typically produced by authoritative sources or fact-checking organizations and explicitly refute the misinformation or false claims presented in the target news videos. Original videos are unedited or primary-source footage that may have been manipulated or misrepresented in the news content; they provide a baseline for comparison and authenticity verification. Evidence videos offer additional contextual or corroborative information that indirectly supports the assessment of the target video's authenticity and credibility. Notably, only target videos are sent to the three-level annotation pipeline.
The figure presents the detailed pipeline of the three-level annotation scheme. First, each filtered target video is automatically assigned a veracity label based on the rating provided by the fact-checking website Snopes. For samples labeled as fake, annotators separately assess the presence of each of the six predefined fake types, which amounts to a second-level multi-label classification. For the fine-grained third level, each identified fake type is further grounded to specific locations within the corresponding modality. The grounding process depends on the modality and the type of fake content: (1) For textual content (e.g., video headlines, speech), grounding means highlighting the specific phrase or entire sentence that contains false or misleading information. (2) For video content, grounding is typically temporal for the temporal edit type, marking the timestamps of deceptive editing operations; for CGI-based falsification, spatial grounding is also applied to indicate the manipulated area within the video frame. (3) For cross-modal inconsistencies, grounding involves locating contradictions or unsupported claims in the text relative to the video. Notably, the candidate timestamps for the temporal edit type are discrete: they comprise the start timestamp, the end timestamp, and the scene transition timestamps of the video. The annotated locations are uniformly mapped to a binary vector over these candidates that indicates where deceptive edit operations occur.
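The mapping from discrete candidate timestamps to a binary vector can be illustrated with a minimal sketch. The function names `build_candidates` and `encode_edits`, the matching tolerance, and all timestamp values below are hypothetical and serve only as an illustration; they are not taken from the annotation tool itself.

```python
from typing import List, Sequence


def build_candidates(duration: float, scene_transitions: Sequence[float]) -> List[float]:
    """Candidate timestamps: video start, each scene transition, video end."""
    # Deduplicate and keep only transitions strictly inside the video.
    inner = sorted(t for t in set(scene_transitions) if 0.0 < t < duration)
    return [0.0] + inner + [duration]


def encode_edits(candidates: Sequence[float],
                 edit_timestamps: Sequence[float],
                 tolerance: float = 0.5) -> List[int]:
    """Return one bit per candidate: 1 if an annotated edit falls on it.

    The tolerance (in seconds) is an assumption that absorbs small
    rounding differences between annotated and detected timestamps.
    """
    return [
        int(any(abs(c - e) <= tolerance for e in edit_timestamps))
        for c in candidates
    ]


# Toy example with made-up values: a 30-second video with three scene transitions,
# where the annotator marked deceptive edits at 11.8 s and at the end of the video.
candidates = build_candidates(duration=30.0, scene_transitions=[4.2, 11.8, 19.5])
label = encode_edits(candidates, edit_timestamps=[11.8, 30.0])
print(candidates)  # [0.0, 4.2, 11.8, 19.5, 30.0]
print(label)       # [0, 0, 1, 0, 1]
```

Because the vector is indexed by candidate position rather than by absolute time, videos of different lengths and with different numbers of scene transitions share the same label format.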
To ensure consistency in classification and grounding across different annotators, we provide detailed definitions for each sub-label.
For the sake of brevity, the temporal edit localizer is presented in a simplified form in the main text. In practice, in addition to scene transitions, the candidate grounding elements also include the start and end of the video, as shown in the figure. Consistent with the annotation process, the output of the temporal edit localizer is also a binary vector indicating the locations of deceptive edit operations.
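As a rough sketch of how such an output could be read back as timestamps, the snippet below assumes that the localizer produces one score per candidate and that the binary vector is obtained by thresholding. The text only states that the output is a binary vector, so the function `decode_edit_locations`, the threshold of 0.5, and the score values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def decode_edit_locations(candidates: Sequence[float],
                          scores: Sequence[float],
                          threshold: float = 0.5) -> Tuple[List[int], List[float]]:
    """Threshold per-candidate scores and return (binary vector, flagged timestamps)."""
    if len(candidates) != len(scores):
        raise ValueError("expected exactly one score per candidate timestamp")
    binary = [int(s >= threshold) for s in scores]
    flagged = [c for c, b in zip(candidates, binary) if b]
    return binary, flagged


# Candidates: start of video, three scene transitions, end of video (made-up values).
candidates = [0.0, 4.2, 11.8, 19.5, 30.0]
# Hypothetical per-candidate scores, not real model output.
scores = [0.91, 0.12, 0.67, 0.08, 0.30]

binary, flagged = decode_edit_locations(candidates, scores)
print(binary)   # [1, 0, 1, 0, 0]
print(flagged)  # [0.0, 11.8]
```

Including the start and end of the video among the candidates lets the same vector express edits at the video boundaries, for example a clip whose opening has been cut, which scene transitions alone could not represent.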