SEMAPRO 2020, The Fourteenth International Conference on Advances in Semantic Processing
Properties of Semantic Coherence Measures - Case of Topic Models
Authors:
Pirkko Pietiläinen
Keywords: Measuring Topic Coherence; LDA; Wikipedia; WordNet; Palmetto.
Abstract:
Measures of semantic relatedness and coherence are used in several Artificial Intelligence (AI) applications. Topic modeling is one of the fields where these measures play a role. When evaluating topic models, it is important to understand the properties of the measure or measures used. This paper first shows how 16 proposed coherence measures behave in finding the highest coherence in Latent Dirichlet Allocation (LDA) processing. Using an exceptionally large corpus collected from Wikipedia, the correlations of the measures and the number of topics in LDA were then determined. From the average behavior of the measures, it is possible to infer the range in which the maximum coherence values probably occur. It is also possible to approximate the corpus size needed to obtain statistically significant results in these respects. Comparisons to human ratings are also included. The data and the R code for the calculations are made public. This paper explains many of the factors affecting the use of coherence measures, including the roles of corpus or sample size, the number of topics, and the existence of local maxima of the measures. Differences between the measures and their correlations are also described.
Pages: 36 to 44
Copyright: Copyright (c) IARIA, 2020
Publication date: October 25, 2020
Published in: conference
ISSN: 2308-4510
ISBN: 978-1-61208-813-6
Location: Nice, France
Dates: from October 25, 2020 to October 29, 2020