GPTMB 2024, The First International Conference on Generative Pre-trained Transformer Models and Beyond


An Empirical Taxonomy for Rating Trustability of LLMs

Authors:
Matthias Harter

Keywords: AI; trustability; truthfulness; trustworthiness; myths; misconceptions; urban legends; prejudice; mixture of experts; question answering; Q&A; benchmarks

Abstract:
This paper proposes a new classification scheme for evaluating the trustworthiness and usefulness of Large Language Models (LLMs) in fact-checking and combating misinformation. Using a dataset of 1,000 questions about common myths and misconceptions from the German newspaper DIE ZEIT, the author compares LLM responses to expert-verified answers. A point-based weighting system is applied, considering factors such as an LLM’s ability to identify uncertainty and to avoid confabulation. In tests of several well-known LLMs, the results suggest that some models, such as GPT-4 and Claude 3, achieve “superhuman” or “expert”-level performance in debunking myths. However, manual comparison of LLM reasoning with expert explanations is needed to fully validate these findings. The paper also examines LLM confidence scores and concludes that they do not necessarily improve answer quality or overall trustworthiness ratings. This taxonomy offers a novel approach to assessing LLM reliability in real-world applications.
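The point-based weighting described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, weight values, and scoring categories below are assumptions for exposition, not the paper's actual rubric.

```python
# Hypothetical sketch of a point-based weighting scheme for rating one LLM
# answer against an expert-verified answer. The weights here (+2 correct,
# +1 for flagging uncertainty, -3 for confabulation) are illustrative
# assumptions, not the values used in the paper.

def rate_answer(correct: bool, flags_uncertainty: bool, confabulates: bool) -> int:
    """Return a per-question score for a single LLM answer."""
    score = 2 if correct else -2       # agreement with the expert answer
    if flags_uncertainty:
        score += 1                     # bonus for identifying uncertainty
    if confabulates:
        score -= 3                     # heavy penalty for confabulation
    return score

def trustability(scores: list[int], max_per_question: int = 3) -> float:
    """Aggregate per-question scores into a 0..1 trustability rating."""
    if not scores:
        return 0.0
    best = max_per_question * len(scores)
    return max(0.0, sum(scores) / best)
```

For example, a correct answer that also flags its own uncertainty would score 3 points under these assumed weights, while a confidently confabulated wrong answer would score -5, so models that acknowledge uncertainty are rewarded over those that confabulate.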

Pages: 1 to 14

Copyright: Copyright (c) IARIA, 2024

Publication date: June 30, 2024

Published in: conference proceedings

ISBN: 978-1-68558-182-4

Location: Porto, Portugal

Dates: from June 30, 2024 to July 4, 2024