Upcoming (or still Waiting)

  1. Stem­le, Egon W., Adri­ane Boyd, Maarten Janssen, Ther­ese Lind­ström Tiedemann, Nives Mikelić Preradović, Alex­andr Rosen, Dan Rosén, and Elena Volod­ina. 2019. “Work­ing together towards an ideal infra­struc­ture for lan­guage learner cor­por­a.” Post-­con­fer­ence volume. Accepted.
    In this article we give an overview of first-hand experiences and starting points for best practices from projects in seven European countries dedicated to learner corpus research (LCR) and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, and the careful preparation, easy retrieval, and reusability of corpora and tools have likewise become more important. But with a lack of agreed solutions for many aspects of LCR, interoperability between learner corpora or exchanging data from different learner corpus projects is still challenging. We will illustrate how concepts like metadata, anonymization, error taxonomies and linguistic annotations, as well as tools, toolchains or data formats can individually pose challenges and how they might be solved.
    @unpublished{stemle-EtAl:2019:lcr-postconf,
      author = {Stemle, Egon W. and Boyd, Adriane and Janssen, Maarten and {Lindstr{\"{o}}m Tiedemann}, Therese and {Mikeli{\'{c}} Preradovi{\'{c}}}, Nives and Rosen, Alexandr and Ros{\'{e}}n, Dan and Volodina, Elena},
      note = {Accepted},
      title = {{Working together towards an ideal infrastructure for language learner corpora}},
      type = {Post-conference volume},
      year = {2019}
    }
    

  2. Wigham, Ciara R., and Egon W. Stem­le. 2019. “Build­ing com­puter­-­me­di­ated com­mu­nic­a­tion cor­pora for socio-­lin­guistic ana­lys­is.” Book. Edited by Ciara Wigham and Egon W. Stem­le. Cahiers du Labor­atoire de Recher­che sur le Lan­gage. In Press.
    Com­mu­nic­a­tion between humans via net­worked devices has become an every­day part of people’s lives across dif­fer­ent gen­er­a­tions, cul­tures, geo­graph­ical areas, and social classes. Shaped by the spe­cific social and tech­nical con­text in which it is pro­duced, syn­chron­ous and asyn­chron­ous com­puter­-­me­di­ated com­mu­nic­a­tion (CMC) has become increas­ingly par­ti­cip­at­ory, inter­act­ive, and mul­timod­al. User inter­ac­tions and user­-­gen­er­ated social media con­tent offer a wide range of research oppor­tun­it­ies for a grow­ing mul­tidiscip­lin­ary research com­munity to exam­ine themes that often relate to - but are not lim­ited to - the inter­ac­tion between lan­guage, CMC, and soci­ety. The ambi­tion of this still-grow­ing research com­munity is for the research into CMC to be based on the avail­ab­il­ity of large, struc­tured data sets, as is the case for many sci­entific com­munit­ies. These data sets (cor­pora) are often built col­lab­or­at­ively from the work of dif­fer­ent research teams and dis­sem­in­ated across the research com­munity so that they may form the basis for new ana­lyses and com­par­at­ive or counter-ana­lyses. With this in mind, in the mid-2000s, a grow­ing num­ber of pro­jects star­ted to col­lect and struc­ture CMC cor­pora and dif­fuse these empir­ical resources that cover a broad range of CMC genres and lan­guages to both the wider sci­entific com­munity and busi­ness enter­prises that develop approaches and tools for web min­ing, opin­ion and trend detec­tion, semantic con­tent ana­lys­is, or machine trans­la­tion. 
Since 2013, the CMC and Social Media Corpora conference series has brought together researchers, principally in the fields of social sciences and digital humanities, with interests ranging from the collection, development of methodology, annotation, processing to the analysis of CMC corpora, and representatives of language resource infrastructure initiatives and business enterprises. Held annually, the conference series has helped animate discussions around the linguistic, technical, and ethical challenges involved in building and analysing CMC corpora and diffuse best practices across Europe and beyond concerning approaches, resources, tools, and methodologies when working with large collections of CMC data. The ambition has been to encourage researchers to diffuse the CMC corpora created by local project teams and facilitate dialogue that will help the community to work towards standards in building and using CMC corpora so as to encourage interoperability between the resources created within different research teams. Recent editions of the conference have also seen the inclusion of workshops organised as practical introductions to coding schemes for CMC data. The results of previous conferences have been published in the form of a special issue of the Journal of Language Technology and Computational Linguistics (Beißwenger et al., 2014) and as monographs: Corpus de communication médiée par les réseaux : Construction, structuration, analyse (Wigham & Ledegen, 2017) and Investigating Computer-Mediated Communication: Corpus-Based Approaches to Language in the Digital World (Fišer & Beißwenger, 2017). The call for papers for this edited volume was also open to authors who did not present papers at the conference.
It includes seven con­tri­bu­tions, all sub­ject to double peer review, writ­ten by 14 authors from eight dif­fer­ent insti­tu­tions in seven coun­tries. One paper is an ori­ginal paper and the other six are exten­ded papers from the 2017 edi­tion of the CMC and Social Media Cor­pora Con­fer­ence held in Bolzano, Ita­ly. Online pro­ceed­ings of all papers presen­ted at the con­fer­ence were pub­lished as an open-ac­cess resource (Stemle & Wigham, 2017). The con­tri­bu­tions to this edited volume include two meth­od­o­lo­gical papers that focus on build­ing and annot­at­ing CMC cor­pora and five con­tri­bu­tions that offer a soci­o­lin­guistic ana­lysis of dif­fer­ent CMC cor­pora. In the lat­ter con­tri­bu­tions, dis­tinct CMC genres are represen­ted, includ­ing the social media plat­form Twit­ter, the social net­work Face­book, online news­pa­pers and wikis (Wiki­pe­dia talk pages and Wiki­pe­dia art­icles). The volume is divided into three them­atic sec­tions: CMC Cor­pus repur­pos­ing, Lan­guage rep­res­ent­a­tion in CMC Cor­pora, and CMC Lan­guage use.
    @unpublished{cmc2017-postconfbook,
      author = {Wigham, Ciara R. and Stemle, Egon W.},
      editor = {Wigham, Ciara and Stemle, Egon W.},
      note = {In Press},
      series = {Cahiers du Laboratoire de Recherche sur le Langage},
      title = {{Building computer-mediated communication corpora for socio-linguistic analysis}},
      type = {Book},
      year = {2019}
    }
    

  3. Frey, Jen­nifer­-­Car­men, Egon W. Stem­le, and A. Seza Doğruöz. 2019. “Com­par­ison of Auto­matic vs. Manual Lan­guage Iden­ti­fic­a­tion in Mul­ti­lin­gual Social Media Texts.” Post-­con­fer­ence volume (cm­c-­cor­por­a2017). Edited by Ciara R. Wigham and Egon W. Stem­le. In Press.
    Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, there is a need for language identification. This study compares the performance of human annotators with automatic ways of language identification on a multilingual (mainly German-Italian-English) social media data set collected in Italy (i.e. South Tyrol). Our results indicate that humans and NLP systems follow their individual techniques to make a decision about multilingual text messages. This results in low agreement when different annotators or NLP systems execute the same task. In general, annotators agree with each other more than NLP systems do. However, there is also variation in human agreement depending on whether guidelines for the annotation task were established beforehand.
    @unpublished{frey-EtAl:2019:cmc2017-postconfbook,
      author = {Frey, Jennifer-Carmen and Stemle, Egon W. and Do{\u{g}}ru{\"{o}}z, A. Seza},
      editor = {Wigham, Ciara R. and Stemle, Egon W.},
      note = {In Press},
      title = {{Comparison of Automatic vs. Manual Language Identification in Multilingual Social Media Texts}},
      type = {Post-conference volume (cmc-corpora2017)},
      year = {2019}
    }
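
The agreement comparison described in the abstract of entry 3 can be illustrated with a small sketch. The abstract does not say which agreement measure was used, so plain observed agreement and Cohen's kappa are assumptions here, and the function names and language labels are hypothetical.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Share of messages for which two labellers chose the same language."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labellers (Cohen's kappa)."""
    p_observed = observed_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both labellers assigned labels independently.
    p_expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both labellers used one and the same label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels for four messages: one human annotator, one NLP system.
human = ["de", "it", "en", "de"]
system = ["de", "it", "de", "de"]
print(observed_agreement(human, system))
print(cohen_kappa(human, system))
```

The same two functions can be applied to any pair of labellers (human vs. human, system vs. system), which is the kind of pairwise comparison the abstract refers to.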
    

Bleeding Edge

  1. Abel, Andrea, and Egon W. Stem­le. 2018. “On the Detec­tion of Neo­lo­gism Can­did­ates as Basis for Lan­guage Obser­va­tion and Lex­ico­graphic Endeav­ours: The STyr­Lo­gism Pro­ject.” Paper. In Pro­ceed­ings of the XVIII EURALEX Inter­na­tional Con­gress: Lex­ico­graphy in Global Contexts, edited by Jaka Čibej, Vojko Gor­janc, Iztok Kosem, and Simon Krek, 535–44. Ljubljana, SI: Ljubljana Uni­ver­sity Press, Fac­ulty of Arts. doi:10.4312/9789610600961.
    The goal of the pro­ject STyr­Lo­gisms is to semi-auto­mat­ic­ally extract neo­lo­gism (new lex­emes) can­did­ates for the Ger­man stand­ard vari­ety used in South Tyr­ol. We use a list of manu­ally vet­ted URLs from news, magazines and blog web­sites of South Tyrol and reg­u­larly crawl their data, clean and pro­cess it and com­pare this new data to ref­er­ence cor­pora and addi­tional regional word lists and the formerly crawled data sets. Our ref­er­ence cor­pora are DECOW14 with around 60m types, and the South Tyr­olean Web Cor­pus with around 2.4m types; the addi­tional word lists con­sist of named entit­ies, ter­min­o­lo­gical terms from the region, and spe­cific terms of the Ger­man stand­ard vari­ety used in South Tyrol (al­to­gether around 53k unique types). Here, we will report on the employed meth­od, a first round of can­did­ate extrac­tion with an approach for a clas­si­fic­a­tion schema for the selec­ted can­did­ates, and some remarks on a second extrac­tion round.
    @inproceedings{abel-stemle:2018:euralex,
      address = {Ljubljana, SI},
      author = {Abel, Andrea and Stemle, Egon W.},
      booktitle = {Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts},
      doi = {10.4312/9789610600961},
      editor = {{\v{C}}ibej, Jaka and Gorjanc, Vojko and Kosem, Iztok and Krek, Simon},
      isbn = {978-961-06-0097-8},
      month = aug,
      pages = {535--544},
      publisher = {Ljubljana University Press, Faculty of Arts},
      title = {{On the Detection of Neologism Candidates as Basis for Language Observation and Lexicographic Endeavours: The STyrLogism Project}},
      type = {Paper},
      year = {2018}
    }
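
The comparison step described in the abstract of entry 1 (new crawl data checked against reference corpora, word lists, and earlier crawls) can be sketched as a set difference. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function name, the `min_freq` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def neologism_candidates(crawled_tokens, reference_types, word_lists,
                         previous_crawl_types, min_freq=1):
    """Return (type, frequency) pairs not attested in any known resource.

    crawled_tokens: tokens from the newest crawl (already cleaned).
    reference_types, word_lists, previous_crawl_types: sets of known types.
    min_freq: hypothetical frequency threshold for keeping a candidate.
    """
    known = set(reference_types) | set(word_lists) | set(previous_crawl_types)
    # Count alphabetic tokens only; numbers and punctuation are not candidates.
    counts = Counter(tok for tok in crawled_tokens if tok.isalpha())
    candidates = [(tok, n) for tok, n in counts.items()
                  if n >= min_freq and tok not in known]
    # Most frequent first, ties broken alphabetically.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Toy data; real reference resources would hold millions of types.
crawl = ["Haus", "Selfiestange", "Selfiestange", "Haus",
         "Törggelen", "2018", "Brexitdeal"]
print(neologism_candidates(crawl, {"Haus"}, {"Törggelen"}, set()))
```

Candidates produced this way would still need manual vetting and classification, as the abstract describes for the first extraction round.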
    

  2. Abel*, Andrea, and Egon W. Stem­le*. 2018. “On the Detec­tion of Neo­lo­gism Can­did­ates as Basis for Lan­guage Obser­va­tion and Lex­ico­graphic Endeav­ours: the STyr­Lo­gisms Pro­ject.” Talk. XVIII EURALEX Inter­na­tional Congress. Ljubljana, SI.
    The goal of the pro­ject STyr­Lo­gisms is to semi-auto­mat­ic­ally extract neo­lo­gism (new lex­emes) can­did­ates for the Ger­man stand­ard vari­ety used in South Tyr­ol. We use a list of manu­ally vet­ted URLs from news, magazines and blog web­sites of South Tyrol and reg­u­larly crawl their data, clean and pro­cess it and com­pare this new data to ref­er­ence cor­pora and addi­tional regional word lists and the formerly crawled data sets. Our ref­er­ence cor­pora are DECOW14 with around 60m types, and the South Tyr­olean Web Cor­pus with around 2.4m types; the addi­tional word lists con­sist of named entit­ies, ter­min­o­lo­gical terms from the region, and spe­cific terms of the Ger­man stand­ard vari­ety used in South Tyrol (al­to­gether around 53k unique types). Here, we will report on the employed meth­od, a first round of can­did­ate extrac­tion with an approach for a clas­si­fic­a­tion schema for the selec­ted can­did­ates, and some remarks on a second extrac­tion round.
    @misc{abel-stemle:2018:euralex-talk,
      address = {Ljubljana, SI},
      author = {Abel*, Andrea and Stemle*, Egon W.},
      booktitle = {XVIII EURALEX International Congress},
      month = jul,
      title = {{On the Detection of Neologism Candidates as Basis for Language Observation and Lexicographic Endeavours: the STyrLogisms Project}},
      type = {talk},
      url = {http://videolectures.net/euralex2018_abel_stemle_endeavors/},
      year = {2018}
    }
    

  3. Stem­le*, Egon W., and Alex­an­der Onysko*. 2018. “Us­ing Lan­guage Learner Data for Meta­phor Detec­tion.” Talk. ’Work in Pro­gress Series.’ Bozen/Bolzano, Italy.
    This talk gives an overview of our contribution to the NAACL 2018 Workshop on Figurative Language Processing.
    @misc{stemle-onysko:2018:wip-talk,
      address = {Bozen/Bolzano, Italy},
      author = {Stemle*, Egon~W. and Onysko*, Alexander},
      booktitle = {'Work in Progress Series'},
      month = jun,
      title = {{Using Language Learner Data for Metaphor Detection}},
      type = {talk},
      year = {2018}
    }
    

  4. Stem­le, Egon, and Alex­an­der Onysko. 2018. “Us­ing Lan­guage Learner Data for Meta­phor Detec­tion.” In Pro­ceed­ings of the Work­shop on Fig­ur­at­ive Lan­guage Processing, 133–38. Strouds­burg, PA, USA: Asso­ci­ation for Com­pu­ta­tional Lin­guist­ics. doi:10.18653/v1/W18-0918.
    This article describes the system that participated in the shared task (ST) on metaphor detection on the Vrije University Amsterdam Metaphor Corpus (VUA). The ST was part of the workshop on processing figurative language at the 16th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL2018). The system combines a small assortment of trending techniques, which implement matured methods from NLP and ML; in particular, the system uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in an LSTM BiRNN architecture. The system is available under the APLv2 open-source license.
    @inproceedings{stemle-onysko:2018:naacl-flpst,
      address = {Stroudsburg, PA, USA},
      author = {Stemle, Egon and Onysko, Alexander},
      booktitle = {Proceedings of the Workshop on Figurative Language Processing},
      doi = {10.18653/v1/W18-0918},
      month = jun,
      pages = {133--138},
      publisher = {Association for Computational Linguistics},
      title = {{Using Language Learner Data for Metaphor Detection}},
      url = {http://aclweb.org/anthology/W18-0918},
      year = {2018}
    }
    

  5. Melzer*, Tine, Frans Oost­er­hof*, Dorothea Franck*, and Egon Stem­le*. 2018. “Bilder kip­pen! Aspekt­se­hen in künst­lerischer Praxis und Lehre.” Talk. Bern, Switzer­land: HKB, FSP Intermedialität.
    In artistic practice and teaching, and in the discourse on the meaning and interpretation of "images", aspect seeing plays a fundamental but often underestimated role. The term "aspect seeing" (Aspektsehen) is taken from Ludwig Wittgenstein's philosophy of language and exposes the (conceptual and contextual) construction of an image from the perspective in which it is understood. Diverse views on the concept of the image, on ambiguity of meaning, subjectivity, perspective, and the saying-showing dichotomy complement one another here. In close collaboration with the artist Frans Oosterhof, the linguist Dorothea Franck, and the cognitive scientist Egon Stemle, transdisciplinary methods are being developed that make aspect seeing usable in artistic and discursive practice. Initial foundations were linked to artistic practice and teaching in the weekly Y-Experimental "Bilder kippen!" at the HKB. The project benefits from the work of the Dutch artist collective Instituut Houtappel, whose archive is exclusively available for this research.
    @misc{melzer-EtAl:2018:hkb-talk,
      address = {Bern, Switzerland},
      author = {Melzer*, Tine and Oosterhof*, Frans and Franck*, Dorothea and Stemle*, Egon},
      institution = {HKB, FSP Intermedialit{\"{a}}t},
      month = may,
      title = {{Bilder kippen! Aspektsehen in k{\"{u}}nstlerischer Praxis und Lehre}},
      type = {talk},
      url = {https://www.hkb.bfh.ch/de/forschung/veranstaltungen/forschungs-mittwoch/},
      year = {2018}
    }
    

  6. Egarter Vigl, Lukas, and Egon Stem­le. 2018. “Was darf Forschung mit Social Media Daten?” Magazine. Aca­demi­a-In­ter­view Titelthema.
    Inter­view in Aca­demia (science magazine by EURAC and unibz), Bolzano, Italy
    @misc{egartervigl-stemle:2018:academia,
      author = {{Egarter Vigl}, Lukas and Stemle, Egon},
      booktitle = {Academia-Interview Titelthema},
      month = may,
      pages = {10--11},
      title = {{Was darf Forschung mit Social Media Daten?}},
      type = {magazine},
      url = {http://www.academia.bz.it/articles/was-darf-forschung-mit-social-media-daten},
      volume = {78},
      year = {2018}
    }
    

  7. Stem­le, Egon W. 2017. “Learner Cor­pus Infra­struc­ture (LCI) @ Eurac Research.” Talk. SWE-CLARIN Work­shop on Inter­op­er­ab­il­ity of Second Lan­guage Resources and Tools. Gothen­burg, Sweden: Uni­ver­sity of Gothenburg.
    Learner corpora form a fundamental basis for a substantial part of the research activities of the Institute for Applied Linguistics. The project aims to enhance the research potential of the Institute by creating an increasingly efficient infrastructure for the collection, processing, and maintenance of learner corpora.
    @misc{stemle:2017:l2rt-talk,
      address = {Gothenburg, Sweden},
      author = {Stemle, Egon~W.},
      booktitle = {SWE-CLARIN Workshop on Interoperability of Second Language Resources and Tools},
      institution = {University of Gothenburg},
      month = dec,
      title = {{Learner Corpus Infrastructure (LCI) @ Eurac Research}},
      type = {talk},
      url = {https://sweclarin.se/swe/workshop-interoperability-l2-resources-and-tools},
      year = {2017}
    }
    

  8. Beißwenger, Michael, Ciara R. Wigham, Car­ole Etien­ne, Darja Fišer, Hol­ger Grumt Suárez, Laura Herzberg, Erhard Hin­richs, et al. 2017. “Con­nect­ing Resources: Which Issues have to be Solved to Integ­rate CMC Cor­pora from Het­ero­gen­eous Sources and for Dif­fer­ent Lan­guages?” In Pro­ceed­ings of the 5th Con­fer­ence on CMC and Social Media Cor­pora for the Humanities, edited by Egon W. Stemle and Ciara R. Wigham. Bolzano, Ita­ly. doi:10.5281/zenodo.1041877.
    The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.
    @inproceedings{beisswenger-EtAl:2017:cmc-corpora,
      address = {Bolzano, Italy},
      author = {Bei{\ss}wenger, Michael and Wigham, Ciara R. and Etienne, Carole and Fi{\v{s}}er, Darja and {Grumt Su{\'{a}}rez}, Holger and Herzberg, Laura and Hinrichs, Erhard and Horsmann, Tobias and Karlova-Bourbonus, Natali and Lemnitzer, Lothar and Longhi, Julien and L{\"{u}}ngen, Harald and Ho-Dac, Lydia-Mai and Parisse, Christophe and Poudat, C{\'{e}}line and Schmidt, Thomas and Stemle, Egon and Storrer, Angelika and Zesch, Torsten},
      booktitle = {Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities},
      doi = {10.5281/zenodo.1041877},
      editor = {Stemle, Egon W. and Wigham, Ciara R.},
      month = oct,
      title = {{Connecting Resources: Which Issues have to be Solved to Integrate CMC Corpora from Heterogeneous Sources and for Different Languages?}},
      year = {2017},
      url = {https://zenodo.org/record/1041877}
    }
    

  9. Stem­le, Egon W., and Ciara R. Wigham, eds. 2017. Pro­ceed­ings of the 5th Con­fer­ence on CMC and Social Media Cor­pora for the Humanities. Pro­ceed­ings. Bolzano, Ita­ly. doi:10.5281/zenodo.1040875.
    This volume presents the proceedings of the 5th edition of the annual conference series on CMC and Social Media Corpora for the Humanities (cmc-corpora2017). This conference series is dedicated to the collection, annotation, processing, and exploitation of corpora of computer-mediated communication (CMC) and social media for research in the humanities. The annual event brings together language-centered research on CMC and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. The 5th Conference on CMC and Social Media Corpora for the Humanities was held at Eurac Research on October 4th and 5th in Bolzano, Italy. This volume contains extended abstracts of the invited talks, papers, and extended abstracts of posters presented at the event. The conference attracted 26 valid submissions. Each submission was reviewed by at least two members of the scientific committee. This committee decided to accept 16 papers and 8 posters, of which 14 papers and 3 posters were presented at the conference. The programme also includes three invited talks: two keynote talks by Aivars Glaznieks (Eurac Research, Italy) and A. Seza Doğruöz (Independent researcher) and an invited talk on the Common Language Resources and Technology Infrastructure (CLARIN) given by Darja Fišer, the CLARIN ERIC Director of User Involvement.
    @book{cmc-corpora:2017,
      address = {Bolzano, Italy},
      doi = {10.5281/zenodo.1040875},
      editor = {Stemle, Egon W. and Wigham, Ciara R.},
      month = oct,
      title = {{Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities}},
      type = {proceedings},
      url = {https://zenodo.org/record/1040875},
      year = {2017}
    }
    

  10. Stem­le, Egon W. 2017. “DiDi Cor­pus.” Talk. Integ­rat­ing a new type of lan­guage resource into the Digital Human­it­ies land­scape: French-­Ger­man col­loquium on stand­ards for cor­pora of com­puter­-­me­di­ated communication. Duis­burg, Ger­many: Uni­ver­sity of Duisburg-Essen.
    @misc{stemle:2017:dhcmc-talk,
      address = {Duisburg, Germany},
      author = {Stemle, Egon~W.},
      booktitle = {Integrating a new type of language resource into the Digital Humanities landscape: French-German colloquium on standards for corpora of computer-mediated communication},
      institution = {University of Duisburg-Essen},
      month = jun,
      title = {{DiDi Corpus}},
      type = {talk},
      url = {https://sites.google.com/view/dhcmc2017/},
      year = {2017}
    }
    

  11. Beißwenger, Michael, Thi­erry Chanier, Tomaž Erjavec, Darja Fišer, Axel Her­old, Nikola Lub­ešić, Har­ald Lün­gen, et al. 2017. “Clos­ing a Gap in the Lan­guage Resources Land­scape: Ground­work and Best Prac­tices from Pro­jects on Com­puter­-­me­di­ated Com­mu­nic­a­tion in four European Coun­tries.” In Selec­ted Papers from the CLARIN Annual Con­fer­ence 2016, Aix-en-­Provence, 26–28 Octo­ber 2016, CLARIN Com­mon Lan­guage Resources and Tech­no­logy Infrastructure, 1–18. Linköping Uni­ver­sity Elec­tronic Press, Linköpings universitet.
    The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped.
    @inproceedings{beisswenger-EtAl:2016:clarin-long,
      author = {Bei{\ss}wenger, Michael and Chanier, Thierry and Erjavec, Toma{\v{z}} and Fi{\v{s}}er, Darja and Herold, Axel and Lube{\v{s}}i{\'{c}}, Nikola and L{\"{u}}ngen, Harald and Poudat, C{\'{e}}line and Stemle, Egon and Storrer, Angelika and Wigham, Ciara},
      booktitle = {Selected Papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26--28 October 2016, CLARIN Common Language Resources and Technology Infrastructure},
      issn = {1650-3740},
      keywords = {CMC corpora,TEI,community building,computer-mediated communication,corpus annotation,language resources,social media corpora},
      month = may,
      pages = {1--18},
      publisher = {Link{\"{o}}ping University Electronic Press, Link{\"{o}}pings universitet},
      title = {{Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries}},
      url = {http://www.ep.liu.se/ecp/article.asp?issue=136&article=001},
      year = {2017}
    }
    

Cutting Edge

  1. Abel, Andrea, Aivars Glaznieks, Lionel Nic­olas, and Egon Stem­le. 2016. “An exten­ded ver­sion of the KoKo Ger­man L1 Learner cor­pus.” In Pro­ceed­ings of Third Italian Con­fer­ence on Com­pu­ta­tional Lin­guist­ics (CLiC-it 2016) & Fifth Eval­u­ation Cam­paign of Nat­ural Lan­guage Pro­cessing and Speech Tools for Itali­an. Final Work­shop (EVAL­ITA 2016), edited by Pier­paolo Basile, Anna Corazza, Franco Cutugno, Simon­etta Mon­te­mag­ni, Malv­ina Nis­sim, Vivi­ana Pat­ti, Gio­vanni Sem­er­aro, and Rachele Sprugnoli. Napoli, Italy.
    This paper describes an exten­ded ver­sion of the KoKo cor­pus (ver­sion KoKo4, Dec 2015), a cor­pus of writ­ten Ger­man L1 learner texts from three dif­fer­ent Ger­man-speaking regions in three dif­fer­ent coun­tries. The KoKo cor­pus is richly annot­ated with learner lan­guage fea­tures on dif­fer­ent lin­guistic levels such as errors or other lin­guistic char­ac­ter­ist­ics that are not defi­cit-ori­ented, and is enriched with a wide range of metadata. This paper com­ple­ments a pre­vi­ous pub­lic­a­tion (Abel et al., 2014a) and reports on new tex­tual metadata and lex­ical annota­tions and on the meth­ods adop­ted for their manual annota­tion and lin­guistic ana­lyses. It also briefly intro­duces some lin­guistic find­ings that have been derived from the cor­pus.
    @inproceedings{abel-EtAl:2016:koko,
      address = {Napoli, Italy},
      author = {Abel, Andrea and Glaznieks, Aivars and Nicolas, Lionel and Stemle, Egon},
      booktitle = {Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) {\&} Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)},
      editor = {Basile, Pierpaolo and Corazza, Anna and Cutugno, Franco and Montemagni, Simonetta and Nissim, Malvina and Patti, Viviana and Semeraro, Giovanni and Sprugnoli, Rachele},
      month = dec,
      title = {{An extended version of the KoKo German L1 Learner corpus}},
      year = {2016}
    }
    

  2. Frey, Jen­nifer­-­Car­men, Aivars Glaznieks, and Egon W. Stem­le. 2016. “The DiDi Cor­pus of South Tyr­olean CMC Data: A mul­ti­lin­gual cor­pus of Face­book texts.” In Pro­ceed­ings of Third Italian Con­fer­ence on Com­pu­ta­tional Lin­guist­ics (CLiC-it 2016) & Fifth Eval­u­ation Cam­paign of Nat­ural Lan­guage Pro­cessing and Speech Tools for Itali­an. Final Work­shop (EVAL­ITA 2016), edited by Pier­paolo Basile, Anna Corazza, Franco Cutugno, Simon­etta Mon­te­mag­ni, Malv­ina Nis­sim, Vivi­ana Pat­ti, Gio­vanni Sem­er­aro, and Rachele Sprugnoli. Napoli, Italy.
    The DiDi cor­pus of South Tyr­olean data of com­puter­-­me­di­ated com­mu­nic­a­tion (CMC) is a mul­ti­lin­gual soci­o­lin­guistic lan­guage cor­pus. It con­sists of around 600,000 tokens col­lec­ted from 136 pro­files of Face­book users resid­ing in South Tyr­ol, Ita­ly. In con­form­ity with the mul­ti­lin­gual situ­ation of the ter­rit­ory, the main lan­guages of the cor­pus are Ger­man and Italian (fol­lowed by Eng­lish). The data has been manu­ally anonymised and provides manu­ally cor­rec­ted part-of-speech tags for the Italian lan­guage texts and manu­ally nor­m­al­ised data for Ger­man texts. Moreover, it is annot­ated with user­-­provided socio-­demo­graphic data (among oth­ers L1, gender, age, edu­ca­tion, and inter­net com­mu­nic­a­tion habits) from a ques­tion­naire, and lin­guistic annota­tions regard­ing CMC phe­nom­ena, lan­guages and vari­et­ies. The anonymised cor­pus is freely avail­able for research purposes.
    @inproceedings{frey-glaznieks-stemle:2016:didi,
      address = {Napoli, Italy},
      author = {Frey, Jennifer-Carmen and Glaznieks, Aivars and Stemle, Egon W.},
      booktitle = {Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) {\&} Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)},
      editor = {Basile, Pierpaolo and Corazza, Anna and Cutugno, Franco and Montemagni, Simonetta and Nissim, Malvina and Patti, Viviana and Semeraro, Giovanni and Sprugnoli, Rachele},
      month = dec,
      title = {{The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts}},
      year = {2016}
    }
    

  3. Stem­le, Egon W. 2016. “bot.zen @ EVAL­ITA 2016 - A min­im­ally-deep learn­ing PoS-tag­ger (trained for Italian Tweet­s).” In Pro­ceed­ings of Third Italian Con­fer­ence on Com­pu­ta­tional Lin­guist­ics (CLiC-it 2016) & Fifth Eval­u­ation Cam­paign of Nat­ural Lan­guage Pro­cessing and Speech Tools for Itali­an. Final Work­shop (EVAL­ITA 2016), edited by Pier­paolo Basile, Anna Corazza, Franco Cutugno, Simon­etta Mon­te­mag­ni, Malv­ina Nis­sim, Vivi­ana Pat­ti, Gio­vanni Sem­er­aro, and Rachele Sprugnoli. Napoli, Italy.
    This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language EVALITA 2016. The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small assortment of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.
    @inproceedings{stemle:2016:evalita,
      address = {Napoli, Italy},
      author = {Stemle, Egon W.},
      booktitle = {Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) {\&} Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)},
      editor = {Basile, Pierpaolo and Corazza, Anna and Cutugno, Franco and Montemagni, Simonetta and Nissim, Malvina and Patti, Viviana and Semeraro, Giovanni and Sprugnoli, Rachele},
      month = dec,
      title = {{bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)}},
      year = {2016}
    }
    

  4. Lyding*, Verena, Lionel Nicolas*, and Egon W. Stemle*. 2016. “Cross-institutional cooperation initiatives in the Digital Humanities - challenges and infrastructures.” Talk. International Workshop: "Initiativen und Innovationen in den Digital Humanities". Merano/Meran, Italy: Akademie Deutsch-Italienischer Studien, Meran, IT, in Kooperation mit dem Forschungszentrum Digital Humanities, Universität Innsbruck (Brenner-Archiv), AT.
    @misc{lyding-nicolas-stemle:2016:iidh-talk,
      address = {Merano/Meran, Italy},
      author = {Lyding*, Verena and Nicolas*, Lionel and Stemle*, Egon~W.},
      booktitle = {International Workshop: "Initiativen und Innovationen in den Digital Humanities"},
      institution = {Akademie Deutsch-Italienischer Studien, Meran, IT, in Kooperation mit dem Forschungszentrum Digital Humanities, Universit{\"{a}}t Innsbruck (Brenner-Archiv), AT},
      month = nov,
      title = {{Cross-institutional cooperation initiatives in the Digital Humanities - challenges and infrastructures}},
      type = {talk},
      year = {2016}
    }
    

  5. Beißwenger, Michael, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Ljubešić, et al. 2016. “Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects.” In Proceedings of the CLARIN Annual Conference 2016.
    The paper presents best practices and results from projects in four CLARIN member countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.
    @inproceedings{beisswenger-EtAl:2016:clarin,
      author = {Bei{\ss}wenger, Michael and Chanier, Thierry and Chiari, Isabella and Erjavec, Toma{\v{z}} and Fi{\v{s}}er, Darja and Herold, Axel and Ljube{\v{s}}i{\'{c}}, Nikola and L{\"{u}}ngen, Harald and Poudat, C{\'{e}}line and Stemle, Egon and Storrer, Angelika and Wigham, Ciara},
      booktitle = {Proceedings of the CLARIN Annual Conference 2016},
      keywords = {CMC corpora,TEI,community building,computer-mediated communication,corpus annotation,language resources,social media corpora},
      month = oct,
      title = {{Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects}},
      year = {2016}
    }
    

  6. Stemle, Egon W. 2016. Scientific Report of Short Term Scientific Mission COST-STSM-IS1305-34353. Ljubljana, SI: EURAC (Bolzano, IT) and Centre for Language Resources and Technologies (Ljubljana, SI).
    ENeL’s WG3 concerns innovative e-dictionaries with a focus on the development of digitally born dictionaries. The training school 2016 in Ljubljana (SI), May 17-20, introduced participants, among other things, to collecting, analysing, and automatically extracting data from web corpora. Albeit related, the task of processing data from corpora of computer-mediated communication and social media interactions (henceforth referred to as CMC) was deliberately excluded from the training school’s programme. But we know that "new vocabulary is characteristic for CMC discourse, e.g. ‘funzen’ (an abbreviated variant of the German verb ‘funktionieren’, en.: ‘to function’) or ‘gruscheln’ (verb denoting a function of a German social network platform, most likely a blending of ‘grüßen’, en.: ‘to greet’ and ‘kuscheln’, en.: ‘to cuddle’)", and CMC data is therefore relevant to WG3; the goal of this STSM is to apply the methods and tools from the training school to CMC data.
    @techreport{stemle:2016:enel-stsm,
      address = {Ljubljana, SI},
      author = {Stemle, Egon W.},
      institution = {EURAC (Bolzano, IT) and Centre for Language Resources and Technologies (Ljubljana, SI)},
      month = sep,
      title = {{Scientific Report of Short Term Scientific Mission COST-STSM-IS1305-34353}},
      url = {http://www.elexicography.eu/wp-content/uploads/2017/02/ScientificReportSTSM-IS1305-34353-EgonStemle.pdf},
      year = {2016}
    }
    

  7. Cook, Paul, Stefan Evert, Roland Schäfer, and Egon Stemle, eds. 2016. Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task. Proceedings. Association for Computational Linguistics.
    The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics). WAC-X also featured the final workshop of the EmpiriST 2015 shared task "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media" (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion "Corpora, open science, and copyright reforms" (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).
    @book{WAC-X:2016,
      title = {{Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task}},
      editor = {Cook, Paul and Evert, Stefan and Sch{\"{a}}fer, Roland and Stemle, Egon},
      month = aug,
      publisher = {Association for Computational Linguistics},
      type = {proceedings},
      url = {http://anthology.aclweb.org/W/W16/W16-26},
      year = {2016}
    }
    

  8. Stemle, Egon W. 2016. “bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data).” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 115–19. Association for Computational Linguistics.
    This article describes the system that participated in the Part-of-speech tagging subtask of the "EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media". The system combines a small assortment of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.
    @inproceedings{stemle:2016:WAC-X,
      title = {{bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)}},
      author = {Stemle, Egon W.},
      booktitle = {Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task},
      month = aug,
      pages = {115--119},
      publisher = {Association for Computational Linguistics},
      url = {http://anthology.aclweb.org/W/W16/W16-2614},
      year = {2016}
    }
    

  9. Stemle, Egon W. 2015. “The DiDi Project: Collecting, Annotating, and Analysing South Tyrolean Data of Computer-mediated Communication.” Invited talk. First international research days on Social Media and CMC Corpora for the eHumanities (ird-cmc-rennes). Rennes, France: Rennes 2 University.
    Following a sociolinguistic user-based perspective on language data, the project DiDi investigated the linguistic strategies employed by South Tyrolean users on Facebook. South Tyrol is a multilingual region (Italian, German, and Ladin are official languages) where the South Tyrolean dialect of German is frequently used in different communicative contexts. Thus, regional and social codes are often also used in written communication and in computer-mediated communication. With a research focus on users with L1 German living in South Tyrol, the main research question was whether people of different ages use language in a similar way or in an age-specific manner. The project lasted 2 years (June 2013 - May 2015). We created a corpus of Facebook communication that can be linked to other user-based data such as age, web experience and communication habits. We gathered socio-demographic information through an online questionnaire and collected the language data of the entire range of social interactions, i.e. publicly accessible data as well as non-public conversations (status updates and comments, private messages, and chat conversations) written and published just for friends or a limited audience. The data acquisition comprised about 150 users interacting with the app, offering access to their language data and answering the questionnaire. In this talk, I will present the project, its data acquisition app and text annotation processes (automatic, semi-automatic, and manual), discuss their strengths and limitations, and present results from our data analyses.
    @misc{stemle:2015:ird,
      address = {Rennes, France},
      author = {Stemle, Egon~W.},
      booktitle = {First international research days on Social Media and CMC Corpora for the eHumanities (ird-cmc-rennes)},
      institution = {Rennes 2 University},
      month = oct,
      title = {{The DiDi Project: Collecting, Annotating, and Analysing South Tyrolean Data of Computer-mediated Communication}},
      type = {invited talk},
      url = {http://ird-cmc-rennes.sciencesconf.org/},
      year = {2015}
    }
    

  10. Frey, Jennifer-Carmen, Aivars Glaznieks, and Egon W. Stemle. 2015. “The DiDi Corpus of South Tyrolean CMC Data.” Article. In Proceedings of the 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media at GSCL2015 (NLP4CMC2015). Essen: German Society for Computational Linguistics & Language Technology.
    This paper presents the DiDi Corpus, a corpus of South Tyrolean Data of Computer-mediated Communication (CMC). The corpus comprises around 650,000 tokens from Facebook wall posts, comments on wall posts and private messages, as well as socio-demographic data of participants. All data was automatically annotated with language information (de, it, en and others), and manually normalised and anonymised. Furthermore, semi-automatic token level annotations include part-of-speech and CMC phenomena (e.g. emoticons, emojis, and iteration of graphemes and punctuation). The anonymised corpus without the private messages is freely available for researchers; the complete and anonymised corpus is available after signing a non-disclosure agreement.
    @inproceedings{FreyGlaznieksStemle2015,
      address = {Essen},
      author = {Frey, Jennifer-Carmen and Glaznieks, Aivars and Stemle, Egon W.},
      booktitle = {Proceedings of the 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media at GSCL2015 (NLP4CMC2015)},
      month = sep,
      publisher = {German Society for Computational Linguistics \& Language Technology},
      title = {{The DiDi Corpus of South Tyrolean CMC Data}},
      url = {https://sites.google.com/site/nlp4cmc2015/NLP4CMC-2015.pdf},
      type = {article},
      year = {2015}
    }