Linguistic Text Compression

Kazik, Ondrej; Lansky, Jan

ICDT 2011, The Sixth International Conference on Digital Telecommunications

Authors:
Ondrej Kazik
Jan Lansky

Keywords: compression; part-of-speech tagging; neural networks

Abstract:
Compression of texts written in natural language can exploit information about their linguistic structure. We show that coding the part-of-speech tags of a sentence (the so-called sentence types) separately from the text can improve the resulting compression ratio. For this purpose, we design NNTagger, a tagging method based on neural networks. This article focuses on the specification and formalization of a compression model for texts written in Czech. A language with such complex morphology carries a great amount of implicit grammatical information in each sentence and is thus well suited to this approach. We propose methods for constructing initial dictionaries and test their influence on the resulting compression ratio.

Pages: 64 to 73

Copyright: (c) IARIA, 2011

Publication date: April 17, 2011

Published in: conference

ISSN: 2308-3964

ISBN: 978-1-61208-127-4

Location: Budapest, Hungary

Dates: from April 17, 2011 to April 22, 2011