DATA ANALYTICS 2013, The Second International Conference on Data Analytics
Exploiting Wiktionary for Lightweight Part-of-Speech Tagging for Machine Learning Tasks
Authors:
Mario Zechner
Stefan Klampfl
Roman Kern
Keywords: Machine learning; feature engineering; natural language processing; part-of-speech tagging; big data.
Abstract:
Part-of-speech (PoS) tagging is a crucial step in many natural language machine learning tasks. Current state-of-the-art PoS taggers exhibit excellent tagging quality, but they also contribute heavily to the total runtime of text preprocessing and feature generation, which makes feature engineering a time-consuming task. We propose a lightweight, dictionary- and heuristics-based PoS tagger that exploits Wiktionary as its information source. We demonstrate that applying it to natural language machine learning tasks considerably decreases the feature generation runtime without degrading the overall performance on these tasks. We compare the lightweight tagger to a state-of-the-art maximum-entropy-based PoS tagger in clustering and classification tasks and evaluate its performance on the Brown Corpus. Finally, we explore future research scenarios in which our tagger and Wiktionary lookup enable efficient processing of big data due to the significant decrease in runtime.
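The abstract does not spell out the lookup or the heuristics, so the following is only a minimal sketch of what a dictionary-plus-heuristics tagger can look like in general. The tiny hand-written lexicon stands in for entries that would be extracted from Wiktionary, and the tag names, suffix rules, and function names (`tag_token`, `tag_sentence`) are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical toy lexicon; a real one would be built from a Wiktionary dump.
TOY_LEXICON = {
    "the": "DET",
    "a": "DET",
    "dog": "NOUN",
    "barks": "VERB",
    "quick": "ADJ",
}

# Hypothetical suffix heuristics for out-of-lexicon words, checked in order.
# These rules are illustrative assumptions, not taken from the paper.
SUFFIX_RULES = [
    ("ly", "ADV"),
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ous", "ADJ"),
    ("ness", "NOUN"),
]


def tag_token(token):
    """Tag one token: dictionary lookup first, then heuristics, then a default."""
    word = token.lower()
    if word in TOY_LEXICON:
        return TOY_LEXICON[word]
    if word.isdigit():
        return "NUM"
    for suffix, tag in SUFFIX_RULES:
        # Require a stem of at least two characters before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return tag
    return "NOUN"  # assumed default for unknown words


def tag_sentence(sentence):
    """Split on whitespace and tag each token independently (no context model)."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]
```

Because each token is resolved by a hash lookup and a handful of string checks, with no sequence model over the sentence, a tagger of this shape trades some accuracy for speed, which is the kind of trade-off the abstract describes.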
Pages: 11 to 17
Copyright: Copyright (c) IARIA, 2013
Publication date: September 29, 2013
Published in: conference
ISSN: 2308-4464
ISBN: 978-1-61208-295-0
Location: Porto, Portugal
Dates: from September 29, 2013 to October 3, 2013