The future of BootCaT: A Creative Commons License filter


“Copyright issues remain a gray area in compiling and distributing Web corpora” (Fletcher online). Two practical stances are common: “If a Web corpus is infringing copyright, then it is merely doing on a small scale what search engines such as Google are doing on a colossal scale” (Kilgarriff and Grefenstette 2003), and “If you want your webpage to be removed from our corpora, please contact us” (WaCky Project online). The former may be of limited use given the increasing pressure that Google and others face on this matter, and the latter still entails some legal risk. Moreover, “Even if the concrete legal threats are probably minor, they may have negative impact on fund-raising” (Lüdeling, Evert and Baroni 2007). Offering a way to minimize the legal risks, or better, to actively confront and eliminate them, is therefore paramount to the WaCky initiative. The theoretical aspects of creating a ‘free’ corpus are covered in Brunello (2009); one result is that the Creative Commons (CC) licenses are the most promising legal model to use as a filter for web pages. Examples of ‘free’ (CC) corpora already exist, cf. the English CC corpus by the Centre for Translation Studies, University of Leeds, and the Paisà corpus by the University of Bologna (lead partner).
On a technical level, the change from Google/Yahoo! to Bing as the search API for BootCaT complicated things: Google and Yahoo! both allow search results to be filtered by the (perceived) CC license of a page (for Yahoo! this filter was part of BootCaT and was used for the Paisà corpus); unfortunately, Bing does not support this option. The cues described in the “Best Practices for Marking Content with CC Licenses” (Creative Commons online) should therefore be used to filter the downloaded content. Given the nature of the BootCaT pipeline, in which downloaded pages are stripped early on (e.g. metadata from HTML pages, CC information in boilerplate), post-processing of the pages is not promising.
The filter option could be integrated alongside the other “various filters”, e.g. ‘bad word thresholds’, because at that stage the whole page, with metadata and boilerplate, is available (for the first and the last time).
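As a rough illustration of such a filter, the following Python sketch checks the raw HTML of a downloaded page for one common cue, namely a link carrying rel="license" that points to creativecommons.org, as emitted by the Creative Commons license chooser. The names CCLicenseFinder, is_cc_url and find_cc_licenses are hypothetical and not part of BootCaT; other cues (e.g. embedded metadata or plain-text license notices) are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical helper names; this is a sketch, not BootCaT code.
CC_HOSTS = ("creativecommons.org", "www.creativecommons.org")


def is_cc_url(href):
    """Return True if href points to a CC license or public-domain deed."""
    try:
        parsed = urlparse(href)
    except ValueError:
        return False
    host = (parsed.hostname or "").lower()
    return host in CC_HOSTS and parsed.path.startswith(
        ("/licenses/", "/publicdomain/")
    )


class CCLicenseFinder(HTMLParser):
    """Collect CC license URLs from <a> and <link> tags with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. "license nofollow"
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href") or ""
        if "license" in rel_values and is_cc_url(href):
            self.licenses.append(href)


def find_cc_licenses(raw_html):
    """Return all CC license URLs marked with rel="license" in raw_html."""
    finder = CCLicenseFinder()
    finder.feed(raw_html)
    finder.close()
    return finder.licenses
```

A page would then pass the filter only if find_cc_licenses(raw_html) returns a non-empty list. The check has to run on the unstripped page, since the license link typically sits in the head or in boilerplate such as the footer.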

BootCaTters of the world unite! (BOTWU), A workshop (and a survey) on the BootCaT toolkit
Department of Interpreting and Translation, University of Bologna, Forlì, IT