Web Corpus Creation and Cleaning


It has proven very difficult to obtain large quantities of ‘traditional’ text that is not overly restricted by authorship or publishing companies and their terms of use, or other forms of intellectual property rights, is versatile – and controllable – enough in type, and hence, suitable for various scientific or commercial use-cases. [1,2,3] The growth of the World Wide Web as an information resource has been providing an alternative to large corpora of news feeds, newspaper texts, books, and other electronic versions of classic printed matters: The idea arose to gather data from the Web for it is an unprecedented and virtually inexhaustible source of authentic natural language data and offers the NLP community an opportunity to train statistical models on much larger amounts of data than was previously possible. [4,5,6] However, we observe that after crawling content from the Web the subsequent steps, namely, language identification, tokenising, lemmatising, part-of-speech tagging, indexing, etc. suffer from ’large and messy’ training corpora [… ] and interesting [… ] regularities may easily be lost among the countless duplicates, index and directory pages, Web spam, open or disguised advertising, and boilerplate [7]. The consequence is that thorough pre-processing and cleaning of Web corpora is crucial in order to obtain reliable frequency data. I will talk about Web corpora, their creation, and the necessary cleaning. [1] Adam Kilgarriff. Googleology is bad science. Comput. Linguist., 33(1):147–151, 2007 [2] Süddeutsche Zeitung Archiv – Allgemeine Geschäftsbedingungen. [3] The British National Corpus (BNC) user licence. Online Version. [4] Gregory Grefenstette and Julien Nioche. Estimation of english and non-english language use on the WWW. In In Recherche d’Information Assistée par Ordinateur (RIAO), pages 237–246, 2000 [5] Pernilla Danielsson and Martijn Wagenmakers, editors. Proceedings of Corpus Linguistics 2005, volume 1 of The Corpus Linguistics Conference Series, 2005. ISSN 1747-9398 [6] Stefan Evert. A lightweight and efficient tool for cleaning web pages. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). [7] Daniel Bauer, Judith Degen, Xiaoye Deng, Priska Herger, Jan Gasthaus, Eugenie Giesbrecht, Lina Jansen, Christin Kalina, Thorben Krüger, Robert Märtin, Martin Schmidt, Simon Scholler, Johannes Steger, Egon Stemle, and Stefan Evert. FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück. In Building and Exploring Web Corpora (WAC3 - 2007) – Proceedings of the 3rd web as corpus workshop, incorporating CLEANEVAL.

Student Research Workshop:Computer Applications in Linguistics (CSRW2012)
English Corpus Linguistics Group at the Institute of Linguistics and Literary Studies, Technische Universität Darmstadt, Darmstadt, DE