ICDT 2011, The Sixth International Conference on Digital Telecommunications
Authors:
Ondrej Kazik
Jan Lansky
Keywords: compression; part-of-speech tagging; neural networks
Abstract:
Compression of texts written in natural language can exploit information about their linguistic structure. We show that separating the part-of-speech tags of a sentence (so-called sentence types) from the text and coding these sentence types separately can improve the resulting compression ratio. For this purpose, the tagging method NNTagger, based on neural networks, is designed. This article focuses on the specification and formalization of a compression model for texts written in Czech. A language with such complicated morphology carries a great amount of implicit grammatical information in each sentence and is thus well suited to this approach. We propose methods for constructing initial dictionaries and test their influence on the resulting compression ratio.
Pages: 64 to 73
Copyright: Copyright (c) IARIA, 2011
Publication date: April 17, 2011
Published in: conference
ISSN: 2308-3964
ISBN: 978-1-61208-127-4
Location: Budapest, Hungary
Dates: from April 17, 2011 to April 22, 2011