AIMEDIA 2025, The First International Conference on AI-based Media Innovation
Authors:
Matthias Harter
Keywords: Wikipedia; Data poisoning; LLM; AI training data; Trustworthiness in AI; Crowdsourced content; Disinformation
Abstract:
Analyzing the revision histories of over 15,000 articles in the legal domain of the German-language Wikipedia from 2004 to 2025, this study examines the persistent infiltration of entries by contributors later permanently banned for vandalism, extremist propaganda, promotional editing, or uncooperative conduct. We quantify a non-trivial proportion of edits originating from compromised accounts, demonstrating how such editorial contamination degrades Wikipedia’s reliability as a training corpus for Large Language Models (LLMs) in legal and media-content generation contexts, where factual precision is critical. Our investigation further reveals that Retrieval-Augmented Generation (RAG) architectures, which ground outputs in external data, risk propagating inaccuracies if their source repositories are compromised. These findings have direct implications for trust and disinformation in AI media, for ethical considerations in AI-generated content, and for the evaluation of LLM-based tools, by highlighting vulnerabilities in open-source knowledge pipelines. Ultimately, our findings challenge assumptions about swarm intelligence and demonstrate the urgent need for robust safeguards to ensure reliable AI-driven media production workflows.
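The paper's exact pipeline is not reproduced on this page. As a rough illustration of the kind of audit the abstract describes, the sketch below checks a single article's revision history against currently indefinitely blocked accounts via the public MediaWiki API. The article title "Grundgesetz" and the equation of "permanently banned" with an infinite block expiry are illustrative assumptions, not the authors' methodology.

```python
# Minimal sketch: flag revisions of a German-Wikipedia article whose authors
# are now indefinitely blocked. Illustrative only; not the paper's pipeline.
import requests

API = "https://de.wikipedia.org/w/api.php"

def revisions(title, session):
    """Yield the username of every revision of `title` (skips hidden users)."""
    params = {"action": "query", "format": "json", "prop": "revisions",
              "titles": title, "rvprop": "user", "rvlimit": "max"}
    while True:
        data = session.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        for rev in page.get("revisions", []):
            yield rev.get("user", "")
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation token

def indef_blocked(user, session):
    """True if `user` currently carries a block that never expires."""
    data = session.get(API, params={"action": "query", "format": "json",
                                    "list": "users", "ususers": user,
                                    "usprop": "blockinfo"}).json()
    return data["query"]["users"][0].get("blockexpiry") == "infinite"

if __name__ == "__main__":
    s = requests.Session()
    s.headers["User-Agent"] = "revision-audit-sketch/0.1 (research demo)"
    title = "Grundgesetz"  # hypothetical example of a legal-domain article
    revs = [u for u in revisions(title, s) if u]
    blocked = {u for u in set(revs) if indef_blocked(u, s)}
    tainted = sum(1 for u in revs if u in blocked)
    print(f"{title}: {tainted}/{len(revs)} revisions by indef-blocked accounts")
```

Note that this snapshot approach only sees blocks that are still active today; a full study would also have to account for IP edits and for accounts banned and later renamed or rehabilitated.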
Pages: 41 to 47
Copyright: Copyright (c) IARIA, 2025
Publication date: July 6, 2025
Published in: AIMEDIA 2025 conference proceedings
ISBN: 978-1-68558-330-9
Location: Venice, Italy
Dates: from July 6, 2025 to July 10, 2025