Exploiting Wiktionary for Lightweight Part-of-Speech Tagging for Machine Learning Tasks

Zechner, Mario; Klampfl, Stefan; Kern, Roman

DATA ANALYTICS 2013, The Second International Conference on Data Analytics

Keywords: Machine learning; feature engineering; natural language processing; part-of-speech tagging; big data.

Abstract:
Part-of-speech (PoS) tagging is a crucial part of many natural language machine learning tasks. Current state-of-the-art PoS taggers exhibit excellent qualitative performance but also contribute heavily to the total runtime of text preprocessing and feature generation, which makes feature engineering a time-consuming task. We propose a lightweight dictionary- and heuristics-based PoS tagger that exploits Wiktionary as its information source. We demonstrate that applying it to natural language machine learning tasks considerably decreases feature generation runtime without degrading overall performance on these tasks. We compare the lightweight tagger to a state-of-the-art maximum-entropy-based PoS tagger in clustering and classification tasks and evaluate its performance on the Brown Corpus. Finally, we explore future research scenarios in which our tagger and Wiktionary lookup enable efficient processing of big data due to the significant decrease in runtime.

Pages: 11 to 17

Copyright: Copyright (c) IARIA, 2013

Publication date: September 29, 2013

Published in: conference

ISSN: 2308-4464

ISBN: 978-1-61208-295-0

Location: Porto, Portugal

Dates: from September 29, 2013 to October 3, 2013