Web Corpus Creation and Cleaning


It has proven very difficult to obtain large quantities of ‘traditional’ text that is not overly restricted by authors' or publishers' terms of use or by other intellectual property rights, that is versatile and controllable enough in type, and that is therefore suitable for a range of scientific or commercial use cases. The growth of the World Wide Web as an information resource offers an alternative to large corpora of news feeds, newspaper texts, books, and other electronic versions of classic printed matter. The idea arose to gather data from the Web because it is an unprecedented and virtually inexhaustible source of authentic natural language data, and it offers the NLP community an opportunity to train statistical models on much larger amounts of data than was previously possible. However, once content has been crawled from the Web, the subsequent steps (language identification, tokenising, lemmatising, part-of-speech tagging, indexing, etc.) suffer from ‘large and messy’ training corpora [...], and interesting [...] regularities may easily be lost among the countless duplicates, index and directory pages, Web spam, open or disguised advertising, and boilerplate. Thorough pre-processing and cleaning of Web corpora is therefore crucial for obtaining reliable frequency data. I will talk about Web corpora, their creation, and the cleaning they require.
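To make the cleaning step concrete, here is a minimal sketch in Python of two of the problems named above: boilerplate and duplicates. It is an illustration under stated assumptions, not the method presented in the talk. The function names, the five-word line threshold, and the use of exact hashing after whitespace and case normalisation are all hypothetical choices; real pipelines typically use more elaborate boilerplate detection and near-duplicate detection.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_short_lines(text: str, min_words: int = 5) -> str:
    """Crude boilerplate heuristic (an assumption): keep only lines with enough words.

    Navigation bars, link lists, and button labels tend to be very short lines.
    """
    kept = [line.strip() for line in text.splitlines() if len(line.split()) >= min_words]
    return "\n".join(kept)


def clean_corpus(documents: list[str], min_words: int = 5) -> list[str]:
    """Strip short lines from each document, then drop empty and duplicate documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        body = drop_short_lines(doc, min_words)
        if not body:
            continue  # nothing but boilerplate, e.g. an index or directory page
        digest = hashlib.sha1(normalize(body).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalisation
        seen.add(digest)
        cleaned.append(body)
    return cleaned
```

Calling `clean_corpus` on a list of page texts returns the surviving documents in their original order. Because duplicates are detected only after boilerplate lines are removed, two pages that share the same body text but differ in their navigation lines are treated as copies of each other.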

Student Research Workshop: Computer Applications in Linguistics (CSRW2012)
English Corpus Linguistics Group at the Institute of Linguistics and Literary Studies, Technische Universität Darmstadt, Darmstadt, DE