KoKo: An L1 Learner Corpus for German

Abstract

We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. An evaluation of transcription and annotation quality shows an accuracy of approximately 80% for orthographic error annotations, as well as high accuracy for transcription (> 99%), automatic tokenisation (> 99%), sentence splitting (> 96%) and POS tagging (> 94%). The KoKo corpus will be published at the end of 2014 and will be the first accessible linguistically annotated German L1 learner corpus. It will be a valuable resource for research and teaching on German as an L1, in particular with regard to writing skills.

Publication
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14)