The future of BootCaT: A Creative Commons License filter


“Copyright issues remain a gray area in compiling and distributing Web corpora”[1]. Two practical stances are common: “If a Web corpus is infringing copyright, then it is merely doing on a small scale what search engines such as Google are doing on a colossal scale”[2], and “If you want your webpage to be removed from our corpora, please contact us”[3]. However, given the increasing heat that Google & Co. are facing on this matter, the former might be of limited use, and the latter still entails some legal risk. Also, “Even if the concrete legal threats are probably minor, they may have negative impact on fund-raising”[4]. So minimizing the legal risks (or at least adding the possibility to do so), or rather actively facing and eliminating them, is paramount to the WaCky initiative.

Theoretical aspects of creating a ‘free’ corpus are covered in [5]; one result is that the Creative Commons (CC) licenses are the most promising legal model to use as a filter for web pages. Examples of ‘free’ (CC) corpora already exist, cf. [6,7].

On a technical level, the change from Google/Yahoo! to Bing as the search API for BootCaT complicated things: Google and Yahoo! both allow filtering search results according to the (perceived) CC license of a page (for Yahoo! this filter was part of BootCaT and was used in [7]); unfortunately, Bing does not support this option. Hence, the “Best Practices for Marking Content with CC Licenses”[8] should be used as clues to filter the downloaded content. Given the nature of the BootCaT pipeline, in which the downloaded pages are stripped early on (e.g. metadata from HTML pages, CC info in boilerplate, etc.), post-processing of the pages is not promising. Instead, the filter option could be integrated alongside the other “various filters”, e.g. ‘bad word thresholds’, because at that stage the whole page, with metadata and boilerplate, is available (for the first and the last time).

References:
[1] Corpus Analysis of the World Wide Web, by William H. Fletcher
[2] Introduction to the Special Issue on the Web as Corpus, in Computational Linguistics, Vol. 29, No. 3 (1 September 2003), pp. 333-347, by Adam Kilgarriff, Gregory Grefenstette
[3]
[4] Using Web data for linguistic purposes, in Corpus linguistics and the Web (2007), pp. 7-24, by Anke Lüdeling, Stefan Evert, Marco Baroni; edited by Marianne Hundt, Nadjia Nesselhauf, Caroline Biewer
[5] The creation of free linguistic corpora from the web, in Proceedings of the Fifth Web as Corpus Workshop (WAC5) (2009), pp. 9-16, by Marco Brunello
[6] The English CC corpus, by The Centre for Translation Studies, University of Leeds
[7] The Paisà (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) corpus, by University of Bologna (Lead Partner): Sergio Scalise with colleague Claudia Borghetti; CNR Pisa: Vito Pirrelli with colleagues Alessandro Lenci and Felice Dell’Orletta; European Academy of Bozen/Bolzano: Andrea Abel with colleagues Chris Culy, Henrik Dittmann, and Verena Lyding; University of Trento: Marco Baroni with colleagues Marco Brunello, Sara Castagnoli, and Egon Stemle
[8]
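
As a rough illustration of the filter proposed in the abstract, the following Python sketch inspects a raw downloaded page, before any stripping, for one machine-readable clue: a link or anchor element carrying rel="license" whose target is a creativecommons.org license URL, as generated by the CC license chooser. This is not BootCaT code. The names CCLicenseFinder, find_cc_licenses and is_cc_licensed are hypothetical, and a real filter would probably need further clues, for example license notices that appear only as visible text.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Path pattern of CC license deeds, e.g. /licenses/by-sa/4.0/ or /publicdomain/zero/1.0/
CC_PATH = re.compile(r"^/(?:licenses|publicdomain)/([a-z\-]+)(?:/|$)")


class CCLicenseFinder(HTMLParser):
    """Collects CC license codes from <a>/<link> elements marked rel="license".

    Hypothetical helper for illustration; not part of BootCaT.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; values may be None
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "license" not in rel_tokens:
            return
        parsed = urlparse(attr.get("href") or "")
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host != "creativecommons.org":
            return
        match = CC_PATH.match(parsed.path.lower())
        if match:
            self.licenses.append(match.group(1))


def find_cc_licenses(html):
    """Return the CC license codes (e.g. 'by-sa') declared in the page, in document order."""
    finder = CCLicenseFinder()
    finder.feed(html)
    finder.close()
    return finder.licenses


def is_cc_licensed(html, allowed=None):
    """True if the page declares a CC license; optionally only those in `allowed`."""
    found = find_cc_licenses(html)
    if allowed is None:
        return bool(found)
    return any(code in allowed for code in found)


if __name__ == "__main__":
    sample = (
        '<html><head><link rel="license" '
        'href="https://creativecommons.org/licenses/by-sa/4.0/"/></head>'
        '<body><p>Some text.</p></body></html>'
    )
    print(find_cc_licenses(sample), is_cc_licensed(sample, allowed={"by", "by-sa"}))
```

Because the check only needs the raw HTML, it would have to run at the stage where the complete page, including metadata and boilerplate, is still available. The `allowed` parameter is one assumed way to restrict a corpus to particular license types.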

BootCaTters of the world unite! (BOTWU), A workshop (and a survey) on the BootCaT toolkit
Department of Interpreting and Translation, University of Bologna, Forlì, IT