Leading Edge

  1. Stemle, Egon W., Lionel Nicolas, and Verena Lyding. 2015. “South Tyrolian Neologisms Project.” Short talk. COST ENel WG3 meeting "Automatic Knowledge Acquisition for Lexicography". Herstmonceux Castle, Sussex, UK.
    @misc{Stemle2015b,
      address = {Herstmonceux Castle, Sussex, UK},
      author = {Stemle, Egon~W. and Nicolas, Lionel and Lyding, Verena},
      booktitle = {COST ENel WG3 meeting "Automatic Knowledge Acquisition for Lexicography"},
      month = aug,
      title = {{South Tyrolian Neologisms Project}},
      type = {short talk},
      year = {2015}
    }
    

  2. Nicolas, Lionel, Egon Stemle, Aivars Glaznieks, and Andrea Abel. 2015. “A Generic Data Workflow for Building Annotated Text Corpora.” In Studies in Learner Corpus Linguistics: Research and Applications for Foreign Language Teaching and Assessment, edited by Erik Castello, Katherine Ackerley, and Francesca Coccetta, 190:337–51. Linguistic Insights. Bern, Switzerland: Peter Lang. doi:10.3726/978-3-0351-0736-4.
    We present an abstract and generic workflow, and detail how it has been implemented to build and annotate learner corpora. This workflow has been developed through an interdisciplinary collaboration between linguists, who annotate and use corpora, and computational linguists and computer scientists, who are responsible for providing technical support and adaptation or implementation of software components.
    @incollection{NicolasStemleGlaznieksAbel2014,
      address = {Bern, Switzerland},
      author = {Nicolas, Lionel and Stemle, Egon and Glaznieks, Aivars and Abel, Andrea},
      booktitle = {Studies in Learner Corpus Linguistics: Research and Applications for Foreign Language Teaching and Assessment},
      chapter = {18},
      doi = {10.3726/978-3-0351-0736-4},
      editor = {Castello, Erik and Ackerley, Katherine and Coccetta, Francesca},
      isbn = {978-3-0351-0736-4},
      month = jun,
      pages = {337--351},
      publisher = {Peter Lang},
      series = {Linguistic Insights},
      title = {{A Generic Data Workflow for Building Annotated Text Corpora}},
      volume = {190},
      year = {2015}
    }
    

  3. Stemle, Egon, and Alexander Onysko. 2015. “Automated L1 identification in English learner essays and its implications for language transfer.” Edited by Hagen Peukert. Transfer Effects in Multilingual Language Development, Hamburg Studies on Linguistic Diversity, 4 (April). John Benjamins: 297–321.
    This article focuses on automatic text classification which aims at identifying the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. In order to explore this issue further, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), which involves a sample of 12,100 essays written by participants in the TOEFL® test from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). The article will show our results in automatic L1 detection in the TOEFL11 corpus. These results are discussed in light of relevant transfer features which turned out to be particularly informative for automatic detection of L1 German and L1 Italian.
    @article{stemle-onysko:2014,
      author = {Stemle, Egon and Onysko, Alexander},
      doi = {10.1075/hsld.4.13ste},
      editor = {Peukert, Hagen},
      journal = {Transfer Effects in Multilingual Language Development},
      month = apr,
      pages = {297--321},
      publisher = {John Benjamins},
      series = {Hamburg Studies on Linguistic Diversity},
      title = {{Automated L1 identification in English learner essays and its implications for language transfer}},
      url = {https://benjamins.com/catalog/hsld.4.13ste},
      volume = {4},
      year = {2015}
    }
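The classification setup the abstract describes can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in: toy essays, invented "L1" labels, a handful of function-word frequencies as features, and a simple nearest-centroid rule instead of the classifiers actually used in the study.

```python
# Toy native-language identification: function-word frequencies as
# features, nearest-centroid classification. Illustrative only.
from collections import Counter

FUNCTION_WORDS = ["the", "a", "of", "to", "in", "is", "that", "it"]

def features(text):
    """Relative frequency of each function word in the essay."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(FUNCTION_WORDS))]

def classify(text, centroids):
    """Assign the L1 label whose training centroid is closest (squared L2)."""
    vec = features(text)
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=dist)

# Hypothetical training essays grouped by invented L1 label.
train = {
    "L1-A": ["the cat sat in the house of the king", "the dog is in the garden"],
    "L1-B": ["a cat sat on mat", "a dog runs fast to a park"],
}
centroids = {lbl: centroid([features(t) for t in texts]) for lbl, texts in train.items()}
print(classify("the bird is in the tree", centroids))  # -> L1-A
```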
    

  4. Glaznieks, Aivars, and Egon Stemle. 2014. “Challenges of building a CMC corpus for analyzing writer’s style by age: The DiDi project.” Edited by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer, and Henk van den Heuvel. Journal for Language Technology and Computational Linguistics (JLCL) 29 (2). JLCL: 31–57.
    Special Issue: Building and annotating corpora of computer-mediated discourse. Issues and Challenges at the Interface of Corpus and Computational Linguistics
    @article{GlaznieksStemle2014,
      author = {Glaznieks, Aivars and Stemle, Egon},
      editor = {Bei{\ss}wenger, Michael and Oostdijk, Nelleke and Storrer, Angelika and van den Heuvel, Henk},
      issn = {2190-6858},
      journal = {Journal for Language Technology and Computational Linguistics (JLCL)},
      month = dec,
      number = {2},
      pages = {31--57},
      publisher = {JLCL},
      title = {{Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project}},
      url = {http://www.jlcl.org/2014\_Heft2/2GlaznieksStemle.pdf},
      volume = {29},
      year = {2014}
    }
    

  5. Glaznieks, Aivars, Andrea Abel, Verena Lyding, Lionel Nicolas, and Egon Stemle. 2014. “Establishing a Standardised Procedure for Building Learner Corpora.” Edited by Tarja Nikula, Sauli Takala, and Sabine Ylönen. Apples - Journal of Applied Language Studies 8 (3). Centre for Applied Language Studies, University of Jyväskylä: 5–20.
    Decisions at the outset of preparing a learner corpus are of crucial importance for how the corpus can be built and how it can be analysed later on. This paper presents a generic workflow to build learner corpora while taking into account the needs of the users. The workflow results from an extensive collaboration between linguists that annotate and use the corpus and computer linguists that are responsible for providing technical support. The paper addresses the linguists’ research needs as well as the availability and usability of language technology tools necessary to meet them. We demonstrate and illustrate the relevance of the workflow using results and examples from our L1 learner corpus of German ("KoKo").
    @article{GlaznieksNicolasStemleAbelLyding2014,
      author = {Glaznieks, Aivars and Abel, Andrea and Lyding, Verena and Nicolas, Lionel and Stemle, Egon},
      editor = {Nikula, Tarja and Takala, Sauli and Yl\"{o}nen, Sabine},
      issn = {1457-9863},
      journal = {Apples - Journal of Applied Language Studies},
      keywords = {German as a first language,L1 learner corpus,corpus building workflow},
      month = dec,
      number = {3},
      pages = {5--20},
      publisher = {Centre for Applied Language Studies, University of Jyv\"{a}skyl\"{a}},
      title = {{Establishing a Standardised Procedure for Building Learner Corpora}},
      url = {http://apples.jyu.fi/ArticleFile/download/535},
      volume = {8},
      year = {2014}
    }
    

  6. Généreux, Michel, Egon W. Stemle, Lionel Nicolas, and Verena Lyding. 2014. “Correcting OCR errors for German in Fraktur font.” In Proceedings of the First Italian Conference on Computational Linguistics (CLiC-it 2014), edited by Roberto Basili, Alessandro Lenci, and Bernardo Magnini. Pisa, Italy.
    In this paper, we present ongoing experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where the approach currently stands with real data.
    @inproceedings{GenereuxStemleNicolasLyding2014,
      address = {Pisa, Italy},
      author = {G\'{e}n\'{e}reux, Michel and Stemle, Egon W. and Nicolas, Lionel and Lyding, Verena},
      booktitle = {Proceedings of the First Italian Conference on Computational Linguistics (CLiC-it 2014)},
      editor = {Basili, Roberto and Lenci, Alessandro and Magnini, Bernardo},
      month = dec,
      title = {{Correcting OCR errors for German in Fraktur font}},
      url = {http://clic.humnet.unipi.it/proceedings/vol1/CLICIT2014136.pdf},
      year = {2014}
    }
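The correction idea can be sketched as follows. The confusion table, its probabilities, and the lexicon below are invented for illustration (long-s misread as 'f', 'u'/'n' swaps are typical Fraktur OCR errors), and unlike the paper's model this toy version ignores sentence context:

```python
# Toy OCR post-correction: generate single-substitution repair
# candidates from a weighted confusion table, keep only candidates
# found in a lexicon, pick the most probable one. Illustrative only.
LEXICON = {"sich", "und", "haus", "nacht"}
# (misread_char, intended_char, probability) -- hypothetical values
CONFUSIONS = [("f", "s", 0.4), ("n", "u", 0.2), ("u", "n", 0.2)]

def candidates(word):
    """Yield (candidate, score) for every single-character repair."""
    for i, ch in enumerate(word):
        for wrong, right, p in CONFUSIONS:
            if ch == wrong:
                yield word[:i] + right + word[i + 1:], p

def correct(word):
    """Return the best in-lexicon repair, or the word unchanged."""
    if word in LEXICON:
        return word
    best = max(
        ((c, p) for c, p in candidates(word) if c in LEXICON),
        key=lambda cp: cp[1],
        default=(word, 0.0),
    )
    return best[0]

print(correct("fich"))  # long-s misread: -> sich
print(correct("hans"))  # 'n' misread for 'u': -> haus
```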
    

  7. Frey, Jennifer-Carmen, Egon W. Stemle, and Aivars Glaznieks. 2014. “Collecting language data of non-public social media profiles.” In Workshop Proceedings of the 12th Edition of the KONVENS Conference, edited by Gertrud Faaß and Josef Ruppenhofer, 11–15. Hildesheim, Germany: Universitätsverlag Hildesheim.
    In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public computer-mediated communication, we will present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding meta data. Finally, we will discuss positive and negative implications for this method.
    @inproceedings{FreyStemleGlaznieks2014,
      address = {Hildesheim, Germany},
      author = {Frey, Jennifer-Carmen and Stemle, Egon W. and Glaznieks, Aivars},
      booktitle = {Workshop Proceedings of the 12th Edition of the KONVENS Conference},
      editor = {Faa{\ss}, Gertrud and Ruppenhofer, Josef},
      month = oct,
      pages = {11--15},
      publisher = {Universit{\"{a}}tsverlag Hildesheim},
      title = {{Collecting language data of non-public social media profiles}},
      url = {http://www.uni-hildesheim.de/konvens2014/data/konvens2014-workshop-proceedings.pdf},
      year = {2014}
    }
    

  8. Lyding, Verena, Lionel Nicolas, and Egon Stemle. 2014. “’interHist’ - an interactive visual interface for corpus exploration.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, 635–41. Reykjavik, Iceland: European Language Resources Association (ELRA).
    In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISÀ corpus of Italian web texts, interHist aims at facilitating the exploration of large result sets to linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports the user-steered navigation by means of interactive filtering. It allows users to dynamically switch between an overview on the data and a detailed view on results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist, by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.
    @inproceedings{LYDING14.517,
      address = {Reykjavik, Iceland},
      author = {Lyding, Verena and Nicolas, Lionel and Stemle, Egon},
      booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
      editor = {Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Loftsson, Hrafn and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios},
      isbn = {978-2-9517408-8-4},
      keywords = {corpus linguistics,language analysis,visualization},
      month = may,
      pages = {635--641},
      publisher = {European Language Resources Association (ELRA)},
      title = {{'interHist' - an interactive visual interface for corpus exploration}},
      url = {http://www.lrec-conf.org/proceedings/lrec2014/pdf/517\_Paper.pdf},
      year = {2014}
    }
    

  9. Abel, Andrea, Aivars Glaznieks, Lionel Nicolas, and Egon Stemle. 2014. “KoKo: An L1 Learner Corpus for German.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, 2414–21. Reykjavik, Iceland: European Language Resources Association (ELRA).
    We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the quality of the performed transcriptions and annotations shows an accuracy of orthographic error annotations of approximately 80% as well as high accuracy of transcriptions (> 99%), automatic tokenisation (> 99%), sentence splitting (> 96%) and POS-tagging (> 94%). The KoKo corpus will be published at the end of 2014 and be the first accessible linguistically annotated German L1 learner corpus. It will represent a valuable source for research and teaching on German as an L1, in particular with regard to writing skills.
    @inproceedings{ABEL14.934,
      address = {Reykjavik, Iceland},
      author = {Abel, Andrea and Glaznieks, Aivars and Nicolas, Lionel and Stemle, Egon},
      booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
      editor = {Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Loftsson, Hrafn and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios},
      isbn = {978-2-9517408-8-4},
      keywords = {German Language,Learner Corpora},
      month = may,
      pages = {2414--2421},
      publisher = {European Language Resources Association (ELRA)},
      title = {{KoKo: An L1 Learner Corpus for German}},
      url = {http://www.lrec-conf.org/proceedings/lrec2014/pdf/934\_Paper.pdf},
      year = {2014}
    }
    

  10. Stemle*, Egon W., and Alexander Onysko*. 2014. “Automated L1 identification in English learner essays and its implications for language transfer.” Talk. ’Work in Progress Series.’ Bozen/Bolzano, Italy: Kompetenzzentrum Sprachen, Freie Universität Bozen.
    This talk gives an overview of our study: it is based on a corpus of TOEFL English test essays written by learners from 11 different first language backgrounds. In our research we use machine learning techniques to automatically classify the learner texts according to the L1 of their authors. Furthermore, we take a closer look at some of the most informative features for the classifier regarding L1 German and L1 Italian speakers. Some of these features show a possible origin in processes of L1 transfer.
    @misc{StemleOnysko2014b,
      address = {Bozen/Bolzano, Italy},
      author = {Stemle*, Egon~W. and Onysko*, Alexander},
      booktitle = {'Work in Progress Series'},
      institution = {Kompetenzzentrum Sprachen, Freie Universit{\"{a}}t Bozen},
      month = apr,
      title = {{Automated L1 identification in English learner essays and its implications for language transfer}},
      type = {talk},
      year = {2014}
    }
    

  11. Lyding, Verena, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. “The PAISÀ Corpus of Italian Web Texts.” In Proceedings of the 9th Web as Corpus Workshop (WaC-9), 36–43. Gothenburg, Sweden: Association for Computational Linguistics.
    PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.
    @inproceedings{paisa2014,
      address = {Gothenburg, Sweden},
      author = {Lyding, Verena and Stemle, Egon and Borghetti, Claudia and Brunello, Marco and Castagnoli, Sara and Dell'Orletta, Felice and Dittmann, Henrik and Lenci, Alessandro and Pirrelli, Vito},
      booktitle = {Proceedings of the 9th Web as Corpus Workshop (WaC-9)},
      month = apr,
      pages = {36--43},
      publisher = {Association for Computational Linguistics},
      title = {{The PAIS\`{A} Corpus of Italian Web Texts}},
      url = {http://aclweb.org/anthology/W14-0406},
      year = {2014}
    }
    

  12. Stemle, Egon W., and Alexander Onysko. 2013. “Language as a Detective Story.” Magazine. Academia.
    Article in Academia (science magazine by EURAC and unibz), Bolzano, Italy
    @misc{Stemle2013c,
      author = {Stemle, Egon W. and Onysko, Alexander},
      booktitle = {Academia},
      month = dec,
      pages = {24--25},
      title = {{Language as a Detective Story}},
      volume = {64},
      type = {magazine},
      year = {2013}
    }
    

  13. Abel, Andrea, Aivars Glaznieks, and Egon W. Stemle. 2013. “Automatische Annotation von Schülertexten - Herausforderungen und Lösungsvorschläge am Beispiel des Projekts KoKo.” Talk. Workshop from the "Arbeitsgruppe: Korpusbasierte Linguistik" at the 40. Österreichische Linguistiktagung. Salzburg, Austria: Universität Salzburg.
    The talk presents the iterative workflow for creating a lemmatised, POS-tagged learner corpus annotated for selected linguistic features, and discusses difficulties and peculiarities of building corpora from L1 learner texts. Learner texts frequently contain spellings and constructions that do not conform to the standard language. Since corpus-linguistic processing tools usually expect newspaper texts or similar input, learner texts can cause difficulties in automatic processing, which can considerably reduce the otherwise very high reliability of the tools (e.g. of a POS tagger, Giesbrecht & Evert 2009). A challenge in preparing learner texts for corpus-linguistic use therefore lies in accommodating their characteristics in the workflow so that, despite the deviations from the standard, they can be processed with a reliability similar to that of standard-language texts. In the "KoKo" project, around 1,300 pupils' texts (811,330 tokens) from upper secondary schools in Thuringia, North Tyrol and South Tyrol were prepared for a German L1 learner corpus. The deviations mentioned above were handled as follows: already during digitisation of the handwritten data, the transcripts were enriched with additional annotations capturing orthographic errors, ad hoc clippings, emoticons and the like. The corpus was subsequently lemmatised and tagged. In a separate processing step, the POS tagger was used to identify text features that had not been processed automatically; these were then either annotated manually or used to retrain the tagger.
The iterative corpus-building process set in motion this way makes it possible to successively improve the quality of the lemma and POS annotations of the L1 learner corpus. This iterative approach can also be retained for the possible annotation of further levels (cf. Voormann & Gut 2008).
    @misc{Abel2013,
      address = {Salzburg, Austria},
      author = {Abel, Andrea and Glaznieks, Aivars and Stemle, Egon~W.},
      booktitle = {Workshop from the "Arbeitsgruppe: Korpusbasierte Linguistik" at the 40. {\"{O}}sterreichische Linguistiktagung},
      institution = {Universit{\"{a}}t Salzburg},
      month = nov,
      title = {{Automatische Annotation von Sch{\"{u}}lertexten - Herausforderungen und L{\"{o}}sungsvorschl{\"{a}}ge am Beispiel des Projekts KoKo}},
      type = {talk},
      url = {https://www.researchgate.net/publication/259344914{\_}Automatische{\_}Annotation{\_}von{\_}Schlertexten{\_}--{\_}Herausforderungen{\_}und{\_}Lsungsvorschlge{\_}am{\_}Beispiel{\_}des{\_}Projekts{\_}KoKo?ev=prf{\_}pub},
      year = {2013}
    }
    

  14. Lyding, Verena, Claudia Borghetti, Henrik Dittmann, Lionel Nicolas, and Egon Stemle. 2013. “Open Corpus Interface for Italian Language Learning.” In Proceedings of the International Conference ICT for Language Learning, 6th edition. Florence, Italy: libreriauniversitaria.it.
    In this article, we present the multi-faceted interface to the open PAISÀ corpus of Italian. Created within the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) [1], the corpus is designed to be freely available for non-commercial processing, usage and distribution by the public. Hence, this automatically annotated corpus (for lemma, part-of-speech and dependency information) is exclusively composed of documents licensed under Creative Commons (CC) licenses [2]. The dedicated corpus interface is designed to provide flexible, powerful, and easy-to-use modes of corpus access, with the objective of supporting language learning, language practising and linguistic analyses. We present the interface’s functionalities in detail and discuss the underlying design decisions. We introduce the four principal components of the interface, describe supported display formats and present two specific features added to increase the interface’s relevance for language learning. The main search components are (1) a basic search that adopts a "Google-style" search box, (2) an advanced search that provides elaborated graphical search options, and (3) a search that makes use of the powerful CQP query language of the Open Corpus Workbench [3]. In addition, (4) a filter interface for retrieving full-text corpus documents based on keyword searches is available. It likewise provides the means for building temporary sub-corpora for specific topics. Users can choose among different display formats for the search results. Besides the established KWIC (KeyWord In Context) and full sentence views, graphical representations of the dependency relation information as well as keyword distributions are available.
These dynamic displays are based on a visualisation for dependency graphs [4] and one for word clouds [5], which build on the latest developments in information visualisation for language data. Two special features for novice learners are integrated into each search component. The first feature is a function for restricting search results to sentences of limited complexity. Search results are automatically filtered based on formal text characteristics such as sentence length, vocabulary, etc. The second is the supply of pre-defined search queries for linguistic constructions such as sentences in passive voice, questions, etc. Finally, we show how the PAISÀ interface can be employed in different language teaching tasks. In particular, we present a complete unit of work aimed at learners of Italian (CEFR level A2/B1) and centred on students’ direct use of the interface and its functionalities. By doing so, we give concrete examples of targeted searches and interactions with the provided language material, as well as an exemplification of how the use of the corpus can be integrated with communicative language activities in the classroom.
    @inproceedings{Lyding2013a,
      address = {Florence, Italy},
      author = {Lyding, Verena and Borghetti, Claudia and Dittmann, Henrik and Nicolas, Lionel and Stemle, Egon},
      booktitle = {Proceedings of the International Conference ICT for Language Learning, 6th edition},
      isbn = {978-88-6292-423-8},
      keywords = {Corpus Linguistics,Linguistic Visualization,Visualization},
      month = nov,
      publisher = {libreriauniversitaria.it},
      title = {{Open Corpus Interface for Italian Language Learning}},
      url = {http://conference.pixel-online.net/ICT4LL2013/common/download/Paper\_pdf/270-ITL56-FP-Lyding-ICT2013.pdf},
      year = {2013}
    }
    

  15. Glaznieks, Aivars, and Egon W. Stemle. 2013. “Herausforderungen bei der automatischen Verarbeitung von dialektalen IBK-Daten.” Talk. Workshop on "Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation" at the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013). Darmstadt, Germany: TU Darmstadt.
    The automatic processing of CMC data poses major challenges for established language-technology methods. Frequent deviations from standard spelling (e.g. features of the language of immediacy, rapid-typing phenomena) and genre-specific elements (e.g. emoticons, inflectives, elements specific to individual communication services) often lead to unsatisfactory results with existing processing tools, so the tools require adaptation or revision, perhaps ultimately even redevelopment. The advancing technological penetration of everyday life, ever easier access to communication media, the coming of age of "digital natives" and, finally, the grown awareness of the scientific relevance of the communication forms practised there and of the data they produce make these problems all the more relevant for current corpus-linguistic research. Phenomena of conceptually oral language pose a particular challenge. In a variety-rich language such as German, they can take countless forms, with sociolectal, regiolectal and dialectal elements playing a decisive role. In regions of the German-speaking area where diglossia between dialect and standard language prevails, as in Switzerland or South Tyrol, the dialect, as the linguistic variety of immediacy, is often fully written out in CMC, i.e. entire conversations take place in dialect.
To what extent processing tools geared to a written standard variety can be used for such texts, and which practicable approach most promisingly yields a sufficiently large and balanced coverage of the language data, is unclear. In the initial phase of a project that is building a corpus from CMC language data of South Tyrolean users, we tried to settle open questions of this kind. A test corpus of authentic CMC texts written in South Tyrolean dialect was processed with conventional tools (tokenisation, sentence-boundary detection, part-of-speech tagging, lemmatisation). The effects of various adaptations (e.g. extending the lexicon, adding "target words", etc.) on processing performance were evaluated. The talk presents the individual adaptations and the respective evaluation results.
    @misc{Glaznieks2013b,
      address = {Darmstadt, Germany},
      author = {Glaznieks, Aivars and Stemle, Egon~W.},
      booktitle = {Workshop on "Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation" at the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013)},
      institution = {TU Darmstadt},
      month = sep,
      title = {{Herausforderungen bei der automatischen Verarbeitung von dialektalen IBK-Daten}},
      type = {talk},
      url = {https://www.researchgate.net/publication/259344920{\_}Herausforderungen{\_}bei{\_}der{\_}automatischen{\_}Verarbeitung{\_}von{\_}dialektalen{\_}IBK-Daten?ev=prf{\_}pub},
      year = {2013}
    }
    

  16. Nicolas, Lionel, Egon W. Stemle, Klara Kranebitter, and Verena Lyding. 2013. “High-Accuracy Phrase Translation Acquisition Through Battle-Royale Selection.” In Proceedings of Recent Advances in Natural Language Processing, RANLP 2013, edited by Galia Angelova, Kalina Bontcheva, and Ruslan Mitkov, 516–24. Hissar, Bulgaria: RANLP 2013 Organising Committee / ACL.
    In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple translations without size criteria, i.e. phrases can have several translations, can be of any size, and their size is not considered when selecting their translations. Even though the process is in an early development stage and has much room for improvements, evaluation shows that it yields phrase translations of high precision that are relevant to machine translation but also to a wider set of applications including memory-based translation or multi-word acquisition.
    @inproceedings{Nicolas2013a,
      address = {Hissar, Bulgaria},
      author = {Nicolas, Lionel and Stemle, Egon W. and Kranebitter, Klara and Lyding, Verena},
      booktitle = {Proceedings of Recent Advances in Natural Language Processing, RANLP 2013},
      editor = {Angelova, Galia and Bontcheva, Kalina and Mitkov, Ruslan},
      keywords = {Bilingual lexicon,Parallel Corpora,Phrase Translation,Unsupervised Learning},
      month = sep,
      pages = {516--524},
      publisher = {RANLP 2013 Organising Committee / ACL},
      title = {{High-Accuracy Phrase Translation Acquisition Through Battle-Royale Selection}},
      url = {http://aclweb.org/anthology/R/R13/R13-1068.pdf},
      year = {2013}
    }
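A drastically simplified sketch of competition-based greedy selection: unigrams only, invented data, and a plain co-occurrence score. The paper's battle-royale strategy is more elaborate and, unlike this toy, handles phrases of any size and several translations per phrase.

```python
# Toy greedy translation acquisition from sentence-aligned data:
# candidate pairs compete on co-occurrence counts; each round's winner
# removes its source and target word from the pool. Illustrative only.
from collections import Counter
from itertools import product

aligned = [  # hypothetical sentence-aligned parallel corpus
    ("the red house", "das rote haus"),
    ("the red car", "das rote auto"),
    ("a red house", "ein rotes haus"),
]

def greedy_pairs(pairs):
    cooc = Counter()
    for src, tgt in pairs:
        for s, t in product(src.split(), tgt.split()):
            cooc[(s, t)] += 1
    result, used_src, used_tgt = [], set(), set()
    # Highest co-occurrence wins each round; its words leave the pool.
    for (s, t), n in cooc.most_common():
        if s not in used_src and t not in used_tgt:
            result.append((s, t, n))
            used_src.add(s)
            used_tgt.add(t)
    return result

for s, t, n in greedy_pairs(aligned):
    print(s, "->", t, n)
```

Real systems replace the raw count with an association score and re-score the remaining candidates after every selection, which is where the "battle-royale" character comes from.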
    

  17. Evert, Stefan, Egon Stemle, and Paul Rayson, eds. 2013. Proceedings of the 8th Web as Corpus Workshop (WAC-8). Proceedings. WAC-8 Organising Committee.
    Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is. Since the first Web as Corpus Workshop organised at the Corpus Linguistics 2005 Conference, a highly successful series of yearly Web as Corpus workshops provides a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora. After a stronger focus on application-oriented natural language processing and Web technology in recent years, with workshops taking place at NAACL-HLT 2010 and 2011 and at WWW 2012, the 8th Web as Corpus Workshop returns to its roots in the corpus linguistics community. Accordingly, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation of Web text.
The workshop brings together presentations on all aspects of building, using and evaluating Web corpora, with a particular focus on the following topics: applications of Web corpora and other Web-derived data sets for language research; automatic linguistic annotation of Web data such as tokenisation, part-of-speech tagging, lemmatisation and semantic tagging (the accuracy of currently available off-the-shelf tools is still unsatisfactory for many types of Web data); critical exploration of the characteristics of Web data from a linguistic perspective and its applicability to language research; presentation of Web corpus collection projects or software tools required for some part of this process (crawling, filtering, de-duplication, language identification, indexing, ...)
    @book{WAC8,
      editor = {Evert, Stefan and Stemle, Egon and Rayson, Paul},
      month = jul,
      publisher = {WAC-8 Organising Committee},
      title = {{Proceedings of the 8th Web as Corpus Workshop (WAC-8)}},
      type = {proceedings},
      url = {http://sigwac.org.uk/raw-attachment/wiki/WAC8/wac8-proceedings.pdf},
      year = {2013}
    }
    

  18. Stemle, Egon W., and Verena Lyding. 2013. “The future of BootCaT: A Creative Commons License filter.” Talk. BootCaTters of the world unite! (BOTWU), A workshop (and a survey) on the BootCaT toolkit. Forlì, Italy: Department of Interpreting and Translation, University of Bologna.
    "Copyright issues remain a gray area in compiling and distributing Web corpora"[1]; and even though "If a Web corpus is infringing copyright, then it is merely doing on a small scale what search engines such as Google are doing on a colossal scale"[2], and "If you want your webpage to be removed from our corpora, please contact us"[3], are practical stances, the former, given the increased heat Google & Co. are facing on this matter, might be of limited use, and the latter still entails some legal risk. Also, "Even if the concrete legal threats are probably minor, they may have negative impact on fund-raising"[4]. So, (adding the possibility for) minimizing the legal risks, or rather, actively facing and eliminating them, is paramount to the WaCky initiative. Theoretical aspects of creating 'a free' corpus are covered in [5]; one result is that the Creative Commons (CC) licenses are the most promising legal model to use as a filter for web pages. Also, examples of 'free' (CC) corpora already exist, cf. [6,7]. On a technical level, the change from Google/Yahoo! to Bing as a search API for BootCaT complicated things: Google and Yahoo! both allow for filtering search results according to a - perceived - CC license of a page (for Yahoo! this filter was part of BootCaT and was used in [7]); unfortunately, Bing does not support this option. Then, the "Best Practices for Marking Content with CC Licenses"[8] should be used as clues to filter downloaded content - and given the nature of the BootCaT pipeline, i.e. the downloaded pages are stripped early on (e.g. meta data from HTML pages; CC info in boilerplate, etc.), post-processing of the pages is not promising. The filter option could be integrated alongside the other "various filters", e.g. 'bad word thresholds', in retrieve_and_clean_pages_from_url_list.pl, because there the whole page, with meta data and boilerplate, is available (for the first and the last time).
    References: [1] Corpus Analysis of the World Wide Web by William H. Fletcher [2] Introduction to the Special Issue on the Web as Corpus, Computational Linguistics, Vol. 29, No. 3 (1 September 2003), pp. 333-347, by Adam Kilgarriff, Gregory Grefenstette [3] http://wacky.sslmit.unibo.it/doku.php?id=corpora [4] Using Web data for linguistic purposes, in Corpus Linguistics and the Web (2007), pp. 7-24, by Anke Lüdeling, Stefan Evert, Marco Baroni, edited by Marianne Hundt, Nadja Nesselhauf, Carolin Biewer [5] The creation of free linguistic corpora from the web, in Proceedings of the Fifth Web as Corpus Workshop (WAC5) (2009), pp. 9-16, by Marco Brunello [6] The English CC corpus by The Centre for Translation Studies, University of Leeds; http://corpus.leeds.ac.uk/internet.html [7] The Paisà (Piattaforma per l'Apprendimento dell'Italiano Su corpora Annotati) corpus by University of Bologna (Lead Partner) - Sergio Scalise with colleague Claudia Borghetti; CNR Pisa - Vito Pirrelli with colleagues Alessandro Lenci and Felice Dell'Orletta; European Academy of Bozen/Bolzano - Andrea Abel with colleagues Chris Culy, Henrik Dittmann, and Verena Lyding; University of Trento - Marco Baroni with colleagues Marco Brunello, Sara Castagnoli, and Egon Stemle; http://www.corpusitaliano.it [8] http://wiki.creativecommons.org/Marking/Creators
    @misc{Stemle2013a,
      address = {Forl{\`{i}}, Italy},
      author = {Stemle, Egon W. and Lyding, Verena},
      booktitle = {BootCaTters of the world unite! (BOTWU), A workshop (and a survey) on the BootCaT toolkit},
      institution = {Department of Interpreting and Translation, University of Bologna},
      month = jun,
      title = {{The future of BootCaT: A Creative Commons License filter}},
      type = {talk},
      url = {https://www.researchgate.net/publication/259344928{\_}The{\_}future{\_}of{\_}BootCaT{\_}A{\_}Creative{\_}Commons{\_}License{\_}Filter?ev=prf{\_}pub},
      year = {2013}
    }
    

  19. Kranebitter, Klara, and Egon W. Stemle. 2013. “Constructing concept relation maps to support building concept systems in comparative legal terminology.” In Terminologie & Ontologie: Théories et Applications. Actes de la septième conférence TOTh 2013, edited by Christophe Roche, Rute Costa, Loïc Depecker, and Philippe Thoiron, 97–116. Chambéry, France: Institut Porphyre, Savoir et Connaissance.
    Graphical tools to organise and represent knowledge are useful in terminology work to facilitate building concept systems. Creating and maintaining hierarchically structured concept relation maps while manually gathering data for terminological databases helps to gain and maintain an overview of concept relations, supports terminology work in groups, and helps new team members catch up on the subject field. This article describes our approach to supporting the building of concept systems in comparative legal terminology using the concept mapping software CmapTools (IHMC): we build hierarchically structured concept relation maps, where linking lines with arrowheads between concepts of the same legal system represent generic-specific relations, and combined concept relation maps, where dashed lines without arrowheads connect similar concepts in different legal systems.
    @inproceedings{KranebitterStemle2013,
      address = {Chamb\'{e}ry, France},
      author = {Kranebitter, Klara and Stemle, Egon W.},
      booktitle = {Terminologie \& Ontologie: Th\'{e}ories et Applications. Actes de la septi\`{e}me conf\'{e}rence TOTh 2013},
      editor = {Roche, Christophe and Costa, Rute and Depecker, Lo\"{\i}c and Thoiron, Philippe},
      month = jun,
      pages = {97--116},
      publisher = {Institut Porphyre, Savoir et Connaissance},
      title = {{Constructing concept relation maps to support building concept systems in comparative legal terminology}},
      year = {2013}
    }
    

  20. Stemle, Egon W., and Aivars Glaznieks. 2013. “(Technical Aspects of) Harvesting Data from Social Network Sites.” Talk. International workshop "Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives". Dortmund, Germany: Department of German Language and Literature, Faculty of Culture Studies, TU Dortmund University.
    @misc{Stemle2013,
      address = {Dortmund, Germany},
      author = {Stemle, Egon~W. and Glaznieks, Aivars},
      booktitle = {International workshop "Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives"},
      institution = {Department of German Language and Literature, Faculty of Culture Studies, TU Dortmund University},
      month = feb,
      title = {{(Technical Aspects of) Harvesting Data from Social Network Sites}},
      type = {talk},
      url = {https://www.researchgate.net/publication/259344708{\_}(Technical{\_}Aspects{\_}of){\_}Harvesting{\_}Data{\_}from{\_}Social{\_}Network{\_}Sites?ev=prf{\_}pub},
      year = {2013}
    }
    

  21. Nicolas, Lionel, Egon W. Stemle, and Klara Kranebitter. 2012. “Towards high-accuracy bilingual phrase acquisition from parallel corpora.” In 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing, edited by Jeremy Jancsary, 471–79. Vienna, Austria: ÖGAI.
    We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate translations. This process manages to acquire multiple translations combining phrases of equal or different sizes. The preliminary evaluation performed confirms both its potential and its interest.
    @inproceedings{NicolasStemleKranebitter2012,
      address = {Vienna, Austria},
      author = {Nicolas, Lionel and Stemle, Egon W. and Kranebitter, Klara},
      booktitle = {11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing},
      editor = {Jancsary, Jeremy},
      keywords = {Bilingual lexicon,Parallel Corpora,Phrase Translation,Unsupervised Learning},
      month = sep,
      pages = {471--479},
      publisher = {\"{O}GAI},
      title = {{Towards high-accuracy bilingual phrase acquisition from parallel corpora}},
      url = {http://www.oegai.at/konvens2012/proceedings/68\_nicolas12w/},
      year = {2012}
    }
    

  22. Stemle, Egon W. 2012. “Web Corpus Creation and Cleaning.” Plenary talk. Student Research Workshop: Computer Applications in Linguistics (CSRW2012). Darmstadt, Germany: English Corpus Linguistics Group at the Institute of Linguistics and Literary Studies, Technische Universität Darmstadt.
    It has proven very difficult to obtain large quantities of ‘traditional’ text that is not overly restricted by authorship or publishing companies and their terms of use, or other forms of intellectual property rights, and that is versatile – and controllable – enough in type, and hence suitable for various scientific or commercial use cases. [1,2,3] The growth of the World Wide Web as an information resource has been providing an alternative to large corpora of news feeds, newspaper texts, books, and other electronic versions of classic printed matter: the idea arose to gather data from the Web, for it is an unprecedented and virtually inexhaustible source of authentic natural language data and offers the NLP community an opportunity to train statistical models on much larger amounts of data than was previously possible. [4,5,6] However, we observe that after crawling content from the Web, the subsequent steps, namely language identification, tokenising, lemmatising, part-of-speech tagging, indexing, etc., suffer from ’large and messy’ training corpora [...] and interesting [...] regularities may easily be lost among the countless duplicates, index and directory pages, Web spam, open or disguised advertising, and boilerplate [7]. The consequence is that thorough pre-processing and cleaning of Web corpora is crucial in order to obtain reliable frequency data. I will talk about Web corpora, their creation, and the necessary cleaning.
    References: [1] Adam Kilgarriff. Googleology is bad science. Comput. Linguist., 33(1):147–151, 2007 [2] Süddeutsche Zeitung Archiv – Allgemeine Geschäftsbedingungen. [3] The British National Corpus (BNC) user licence. Online Version. [4] Gregory Grefenstette and Julien Nioche. Estimation of English and non-English language use on the WWW. In Recherche d’Information Assistée par Ordinateur (RIAO), pages 237–246, 2000 [5] Pernilla Danielsson and Martijn Wagenmakers, editors. Proceedings of Corpus Linguistics 2005, volume 1 of The Corpus Linguistics Conference Series, 2005. ISSN 1747-9398 [6] Stefan Evert. A lightweight and efficient tool for cleaning web pages. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). [7] Daniel Bauer, Judith Degen, Xiaoye Deng, Priska Herger, Jan Gasthaus, Eugenie Giesbrecht, Lina Jansen, Christin Kalina, Thorben Krüger, Robert Märtin, Martin Schmidt, Simon Scholler, Johannes Steger, Egon Stemle, and Stefan Evert. FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück. In Building and Exploring Web Corpora (WAC3 - 2007) – Proceedings of the 3rd web as corpus workshop, incorporating CLEANEVAL.
    @misc{Stemle2012a,
      address = {Darmstadt, Germany},
      author = {Stemle, Egon~W.},
      booktitle = {Student Research Workshop: Computer Applications in Linguistics (CSRW2012)},
      institution = {English Corpus Linguistics Group at the Institute of Linguistics and Literary Studies, Technische Universit{\"{a}}t Darmstadt},
      month = jul,
      title = {{Web Corpus Creation and Cleaning}},
      type = {plenary talk},
      url = {https://www.researchgate.net/publication/259345019{\_}Web{\_}Corpus{\_}Creation{\_}and{\_}Cleaning?ev=prf{\_}pub},
      year = {2012}
    }
    

  23. Bonin, Francesca, Fabio Cavulli, Aronne Noriller, Massimo Poesio, and Egon W. Stemle. 2012. “Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities.” In Proceedings of the Sixth Linguistic Annotation Workshop, 134–38. LAW VI ’12. Jeju, Republic of Korea: Association for Computational Linguistics.
    Developing content extraction methods for Humanities domains raises a number of challenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humanities scholars is essential to address these challenges. We discuss an annotation schema for Archaeological texts developed in collaboration with domain experts. Its development required a number of iterations to make sure all the most important entity types were included, as well as addressing challenges including a domain-specific handling of temporal expressions, and the existence of many systematic types of ambiguity.
    @inproceedings{Bonin:2012:AAT:2392747.2392768,
      address = {Jeju, Republic of Korea},
      author = {Bonin, Francesca and Cavulli, Fabio and Noriller, Aronne and Poesio, Massimo and Stemle, Egon W.},
      booktitle = {Proceedings of the Sixth Linguistic Annotation Workshop},
      month = jul,
      pages = {134--138},
      publisher = {Association for Computational Linguistics},
      series = {LAW VI '12},
      title = {{Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities}},
      url = {http://dl.acm.org/citation.cfm?id=2392747.2392768},
      year = {2012}
    }
    

  24. Stemle, Egon W., Verena Lyding, and Lionel Nicolas. 2012. “On visual Approaches towards Corpus Exploration.” Short talk. 3rd workshop of the academic network on "Internet Lexicography". Bozen/Bolzano, Italy: EURAC research.
    @misc{Stemle2012,
      address = {Bozen/Bolzano, Italy},
      author = {Stemle, Egon~W. and Lyding, Verena and Nicolas, Lionel},
      booktitle = {3rd workshop of the academic network on "Internet Lexicography"},
      institution = {EURAC research},
      month = may,
      title = {{On visual Approaches towards Corpus Exploration}},
      type = {short talk},
      url = {https://www.researchgate.net/publication/259344950{\_}On{\_}visual{\_}Approaches{\_}towards{\_}Corpus{\_}Exploration?ev=prf{\_}pub},
      year = {2012}
    }
    

  25. Poesio, Massimo, Eduard Barbu, Egon Stemle, and Christian Girardi. 2011. “Portale Ricerca Umanistica.” Live demo and poster. LiveMemories Final Event - Internet, Memoria e Futuro and The Semantic Way. Povo di Trento, Italy.
    @misc{Poesio2011,
      address = {Povo di Trento, Italy},
      author = {Poesio, Massimo and Barbu, Eduard and Stemle, Egon and Girardi, Christian},
      booktitle = {LiveMemories Final Event - Internet, Memoria e Futuro and The Semantic Way},
      month = nov,
      title = {{Portale Ricerca Umanistica}},
      type = {live demo and poster},
      year = {2011}
    }
    

  26. Ekbal, Asif, Francesca Bonin, Sriparna Saha, Egon Stemle, Eduard Barbu, Fabio Cavulli, Christian Girardi, and Massimo Poesio. 2011. “Rapid Adaptation of NE Resolvers for Humanities Domains using Active Annotation.” Journal for Language Technology and Computational Linguistics (JLCL) 26 (2): 39–51.
    @article{EkbalEtAl:2011,
      author = {Ekbal, Asif and Bonin, Francesca and Saha, Sriparna and Stemle, Egon and Barbu, Eduard and Cavulli, Fabio and Girardi, Christian and Poesio, Massimo},
      journal = {Journal for Language Technology and Computational Linguistics (JLCL)},
      month = nov,
      number = {2},
      pages = {39--51},
      title = {{Rapid Adaptation of NE Resolvers for Humanities Domains using Active Annotation}},
      url = {http://www.jlcl.org/2011\_Heft2/9.pdf},
      volume = {26},
      year = {2011}
    }
    

  27. Poesio, Massimo, Eduard Barbu, Francesca Bonin, Fabio Cavulli, Asif Ekbal, Egon Stemle, and Christian Girardi. 2011. “The Humanities Research Portal: Human Language Technology Meets Humanities Publication Archives.” In Proceedings of Supporting Digital Humanities (SDH2011): Answering the unaskable, edited by Bente Maegaard. Copenhagen, Denmark.
    @inproceedings{PoesioSDH2011,
      address = {Copenhagen, Denmark},
      author = {Poesio, Massimo and Barbu, Eduard and Bonin, Francesca and Cavulli, Fabio and Ekbal, Asif and Stemle, Egon and Girardi, Christian},
      booktitle = {Proceedings of Supporting Digital Humanities (SDH2011): Answering the unaskable},
      editor = {Maegaard, Bente},
      month = nov,
      title = {{The Humanities Research Portal: Human Language Technology Meets Humanities Publication Archives}},
      year = {2011}
    }
    

  28. Murphy, Brian, and Egon W. Stemle. 2011. “PaddyWaC: A Minimally-Supervised Web-Corpus of Hiberno-English.” In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, 22–29. Edinburgh, Scotland, UK: Association for Computational Linguistics.
    Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text, and distinguishing it from related languages in the same region, can be difficult. For example, less dominant variants of English (e.g. New Zealander, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism, or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms, and several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.
    @inproceedings{murphy-stemle:2011:DIALECTS,
      address = {Edinburgh, Scotland, UK},
      author = {Murphy, Brian and Stemle, Egon W.},
      booktitle = {Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties},
      month = jul,
      pages = {22--29},
      publisher = {Association for Computational Linguistics},
      title = {{PaddyWaC: A Minimally-Supervised Web-Corpus of Hiberno-English}},
      url = {http://www.aclweb.org/anthology/W11-2603},
      year = {2011}
    }
    

  29. Poesio, Massimo, Eduard Barbu, Egon W. Stemle, and Christian Girardi. 2011. “Structure-Preserving Pipelines for Digital Libraries.” In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011), 54–62. Portland, OR, USA: Association for Computational Linguistics.
    Most existing HLT pipelines assume the input is pure text or, at most, HTML, and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library and other types of applications, and show that it is relatively straightforward to extend existing pipelines to achieve ones in which the structure of a document is preserved.
    @inproceedings{poesio-EtAl:2011:LaTeCH-2011,
      address = {Portland, OR, USA},
      author = {Poesio, Massimo and Barbu, Eduard and Stemle, Egon W. and Girardi, Christian},
      booktitle = {Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011)},
      month = jun,
      pages = {54--62},
      publisher = {Association for Computational Linguistics},
      title = {{Structure-Preserving Pipelines for Digital Libraries}},
      url = {http://www.aclweb.org/anthology/W11-1508},
      year = {2011}
    }
    

  30. Rodríguez, Kepa Joseba, Francesca Delogu, Yannick Versley, Egon W. Stemle, and Massimo Poesio. 2010. “Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus.” In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10), edited by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias. Valletta, Malta: European Language Resources Association (ELRA).
    @inproceedings{RodriguezDeloguVersleyStemlePoesio2010,
      address = {Valletta, Malta},
      author = {Rodr\'{i}guez, Kepa Joseba and Delogu, Francesca and Versley, Yannick and Stemle, Egon W. and Poesio, Massimo},
      booktitle = {Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10)},
      editor = {Calzolari, Nicoletta and Choukri, Khalid and Maegaard, Bente and Mariani, Joseph and Odijk, Jan and Piperidis, Stelios and Rosner, Mike and Tapias, Daniel},
      isbn = {2-9517408-6-7},
      month = may,
      publisher = {European Language Resources Association (ELRA)},
      title = {{Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus}},
      url = {http://www.lrec-conf.org/proceedings/lrec2010/pdf/431\_Paper.pdf},
      year = {2010}
    }
    

  31. The KrdWrd Team (krdwrd.org). 2010. “Add-on Manual.” Manual. The KrdWrd Project.
    @misc{krdwrd.org/manual,
      author = {{The KrdWrd Team (krdwrd.org)}},
      institution = {The KrdWrd Project},
      title = {{Add-on Manual}},
      type = {manual},
      url = {http://krdwrd.github.io/manual/},
      year = {2010}
    }
    

  32. Steger, Johannes, and Egon Stemle. 2009. “KrdWrd: Architecture for Unified Processing of Web Content.” In Proceedings of the Fifth Web as Corpus Workshop (WAC5), edited by Iñaki Alegria, Igor Leturia, and Serge Sharoff, 63–70. Donostia-San Sebastian, Basque Country, Spain: Elhuyar Fundazioa.
    Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extraction, providing unified data access in every processing step. Stable data storage and collaboration control scripts for group annotations of massive corpora are provided via a Web interface coupled with an HTTP proxy. A modular interface allows for linguistic and visual data feature extractor plugins. The implementation is suitable for many tasks in the Web as corpus domain and beyond.
    @inproceedings{StegerStemle2009,
      address = {Donostia-San Sebastian, Basque Country, Spain},
      author = {Steger, Johannes and Stemle, Egon},
      booktitle = {Proceedings of the Fifth Web as Corpus Workshop (WAC5)},
      editor = {Alegria, I\~{n}aki and Leturia, Igor and Sharoff, Serge},
      month = sep,
      pages = {63--70},
      publisher = {Elhuyar Fundazioa},
      title = {{KrdWrd: Architecture for Unified Processing of Web Content}},
      url = {https://www.sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5\_proceedings.pdf},
      year = {2009}
    }
    

  33. Stemle, Egon W. 2009. “Hybrid Sweeping: Streamlined Perceptual Structured-Text Refinement.” Master's thesis, University of Osnabrück. Unpublished.
    This thesis discusses the KrdWrd Project. The Project goals are to provide tools and infrastructure for acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operate on the visual rendering of pages, and provide visual tools for inspection of results.
    @unpublished{Stemle2009,
      author = {Stemle, Egon W.},
      institution = {University of Osnabr\"{u}ck},
      month = apr,
      publisher = {unpublished},
      title = {{Hybrid Sweeping: Streamlined Perceptual Structured-Text Refinement}},
      type = {mastersthesis},
      year = {2009}
    }
    

  34. Bauer, Daniel, Judith Degen, Xiaoye Deng, Priska Herger, Jan Gasthaus, Eugenie Giesbrecht, Lina Jansen, et al. 2007. “FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück.” In Proceedings of the Third Web as Corpus Workshop (WAC3), edited by Cédrick Fairon, Hubert Naets, Adam Kilgarriff, and Gilles-Maurice de Schryver. Louvain-la-Neuve: Presses universitaires de Louvain.
    @inproceedings{FIASCO2007,
      address = {Louvain-la-Neuve},
      author = {Bauer, Daniel and Degen, Judith and Deng, Xiaoye and Herger, Priska and Gasthaus, Jan and Giesbrecht, Eugenie and Jansen, Lina and Kalina, Christin and Kr\"{u}ger, Thorben and M\"{a}rtin, Robert and Schmidt, Martin and Scholler, Simon and Steger, Johannes and Stemle, Egon and Evert, Stefan},
      booktitle = {Proceedings of the Third Web as Corpus Workshop (WAC3)},
      editor = {Fairon, C\'{e}drick and Naets, Hubert and Kilgarriff, Adam and de Schryver, Gilles-Maurice},
      month = sep,
      publisher = {Presses universitaires de Louvain},
      title = {{FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabr\"{u}ck}},
      url = {http://purl.org/stefan.evert/PUB/BauerEtc2007\_FIASCO.pdf},
      year = {2007}
    }
    

  35. Blohm, Sebastian, Philipp Cimiano, and Egon Stemle. 2007. “Harvesting Relations from the Web - Quantifiying the Impact of Filtering Functions.” In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), 1316–23. Association for the Advancement of Artificial Intelligence.
    @inproceedings{BlohmCimianoStemle2007,
      author = {Blohm, Sebastian and Cimiano, Philipp and Stemle, Egon},
      booktitle = {Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07)},
      isbn = {978-1-57735-323-2},
      month = jul,
      pages = {1316--1323},
      publisher = {Association for the Advancement of Artificial Intelligence},
      title = {{Harvesting Relations from the Web - Quantifiying the Impact of Filtering Functions}},
      url = {http://www.aaai.org/Papers/AAAI/2007/AAAI07-208.pdf},
      year = {2007}
    }
    

  36. Bleichner, Martin, Eugenie Giesbrecht, Helmar Gust, Eva-Maria Leicht, Petra Ludewig, Sabine Möller, Wiebke Müller, et al. 2005. ASADO: The Analysis and Structuring of Aviation Documents - Final Report. Institute of Cognitive Science at the University of Osnabrück and Institute of Applied Linguistics at the University of Hildesheim.
    Final report of the one-year cooperation between the Universities of Osnabrück and Hildesheim and the aircraft manufacturer AIRBUS to research methodologies and technologies for analyzing and structuring the huge amount of documentation produced during aircraft construction. The work was done in a study project carried out in close cooperation with seven students of cognitive science, advised by two lecturers of the Institute of Cognitive Science of the University of Osnabrück, and with one student of international information management, advised by one professor of the Institute of Applied Linguistics of the University of Hildesheim.
    @techreport{ASADO2005,
      author = {Bleichner, Martin and Giesbrecht, Eugenie and Gust, Helmar and Leicht, Eva-Maria and Ludewig, Petra and M\"{o}ller, Sabine and M\"{u}ller, Wiebke and Schmidt, Martin and Stefaner, Moritz and Stemle, Egon and Wilke, Katja},
      institution = {Institute of Cognitive Science at the University of Osnabr\"{u}ck and Institute of Applied Linguistics at the University of Hildesheim},
      month = nov,
      title = {{ASADO: The Analysis and Structuring of Aviation Documents - Final Report}},
      year = {2005}
    }
    

  37. Melzer, Christine B., and Egon W. Stemle. 2003. The Complete Dictionary. Artistsproof. Vol. A–Z. Christine B. Melzer.
    @book{MelzerStemle2003,
      author = {Melzer, Christine B. and Stemle, Egon W.},
      publisher = {Christine B. Melzer},
      title = {{The Complete Dictionary}},
      type = {artistsproof},
      url = {http://www.tinemelzer.eu/works/the-complete-dictionary/},
      volume = {A--Z},
      year = {2003}
    }