SEMAPRO 2020, The Fourteenth International Conference on Advances in Semantic Processing


Properties of Semantic Coherence Measures - Case of Topic Models

Authors:
Pirkko Pietiläinen

Keywords: Measuring Topic Coherence; LDA; Wikipedia; WordNet; Palmetto.

Abstract:
Measures of semantic relatedness and coherence are used in several Artificial Intelligence (AI) applications. Topic modeling is one of the fields where these measures play a role. In evaluating topic models, it is important to understand the properties of the measure or measures used. In this paper, it is first shown how 16 proposed coherence measures behave in finding the highest coherence in Latent Dirichlet Allocation (LDA) processing. Using an exceptionally large corpus collected from Wikipedia, the correlations of the measures and the number of topics in LDA were then determined. From the average behavior of the measures, it is possible to infer the range where the maximum values of coherence probably occur. It is also possible to approximate the corpus size that gives statistically significant results in these respects. Comparisons to human ratings are also included. The data and the R code for the calculations are made public. This paper explains many of the features affecting the use of coherence measures, including the roles of corpus/sample size, the number of topics, and the existence of local maxima of the measures. Differences between the measures and their correlations are also described.

Pages: 36 to 44

Copyright: Copyright (c) IARIA, 2020

Publication date: October 25, 2020

Published in: conference

ISSN: 2308-4510

ISBN: 978-1-61208-813-6

Location: Nice, France

Dates: from October 25, 2020 to October 29, 2020