bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)


This article describes the system that participated in the part-of-speech tagging subtask of the “EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media”. The system combines a small assortment of trending techniques, which implement matured methods from NLP and ML, to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, it uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.

Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task