Publications

2020

Alexander König, Egon W. Stemle, André Moreira, Willem Elbers

July 2020 Selected papers from the CLARIN Annual Conference 2019

Technical Solutions for Reproducible Research

In recent years, the reproducibility of scientific research has increasingly come into focus, both by external stakeholders (e.g. funders) and by the research communities themselves. Corpus linguistics, with its methods for creating, processing and analysing corpora, is an integral part of many other disciplines that work with language data and therefore plays a special role. Moreover, language corpora are often living objects that are regularly improved and revised. At the same time, tools for the automatic processing of human language are also being developed further, which can lead to different results with the same processing steps and the same data. This article argues that modern software technologies, such as version control and containerisation, can mitigate the following problems: software packaging, installation and execution and, equally important, the tracking of corpus modifications throughout its life-cycle. All in all, this leads to transparency of changes to raw data and software tools and thereby enhanced reproducibility.

2020

Chee Wee (Ben) Leong, Beata Beigman Klebanov, Chris Hamill, Egon Stemle, Rutuja Ubale, Xianyang Chen

July 2020 Proceedings of the Second Workshop on Figurative Language Processing (FigLang2020)

A Report on the 2020 VUA and TOEFL Metaphor Detection Shared Task

In this paper, we report on the shared task on metaphor identification on the VU Amsterdam Metaphor Corpus and on a subset of the TOEFL Native Language Identification Corpus. The shared task was conducted as a part of the ACL 2020 Workshop on Processing Figurative Language.

2020

Egon W. Stemle, Alexander Onysko

July 2020 Proceedings of the Second Workshop on Figurative Language Processing (FigLang2020)

Testing the role of metadata in metaphor identification

This paper describes the adaptation and application of a neural network system for the automatic detection of metaphors. The LSTM BiRNN system participated in the shared task of metaphor identification that was part of the Second Workshop on Figurative Language Processing (FigLang2020) held at the Annual Conference of the Association for Computational Linguistics (ACL2020). The particular focus of our approach is on the potential influence that the metadata given in the ETS Corpus of Non-Native Written English might have on the automatic detection of metaphors in this dataset. The article first discusses the annotated ETS learner data, highlighting some of its peculiarities and inherent biases of metaphor use. A series of evaluations follows in order to test whether specific metadata influence the system performance in the task of automatic metaphor identification. The system is available under the APLv2 open-source license.

2020

Adrien Barbaresi, Felix Bildhauer, Roland Schäfer, Egon Stemle

May 2020 European Language Resources Association

Proceedings of the 12th Web as Corpus Workshop

For almost fifteen years, the ACL SIGWAC, and most notably the Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and use of web-derived corpora as well as computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW). In corpus linguistics and theoretical linguistics, the World Wide Web has become increasingly popular as a source of linguistic evidence, especially in the face of data sparseness or the lack of variation in traditional corpora of written language. In lexicography, web data have become a major and well-established resource with dedicated research data and specialised tools. In other areas of theoretical linguistics, the adoption rate of web corpora has been slower but steady. Furthermore, some completely new areas of linguistic research dealing exclusively with web (or similar) data have emerged, such as the construction and utilisation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – “text type”, as well as topic area. In computational linguistics, web corpora have become an established source of data for the creation of language models, word embeddings, and all types of machine learning. The 12th Web as Corpus workshop (WAC-XII) looks at the past, present, and future of web corpora given the fact that large web corpora are nowadays provided mostly by a few major initiatives and companies, and the diversity of the early years appears to have faded slightly. Also, we acknowledge the fact that alternative sources of data (such as data from Twitter and similar platforms) have emerged, some of them only available to large companies and their affiliates, such as linguistic data from social media and other forms of the deep web. At the same time, gathering interesting and relevant web data (web crawling) is becoming an ever more intricate task as the nature of the data offered on the web changes (for example the death of forums in favour of more closed platforms).

2020

Jennifer-Carmen Frey, Alexander König, Egon Stemle, Achille Falaise, Darja Fišer, Harald Lüngen

May 2020 CMC Corpora through the prism of digital humanities

The FAIR Index of CMC Corpora

In this article, we examine the current situation of data dissemination and provision for CMC corpora. In doing so, we aim to give a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data management, we evaluate the 20 European CMC corpora listed in the CLARIN CMC Resource family, identify successful strategies among the existing corpora and establish best practices for future projects. We give an overview of existing approaches to data referencing, dissemination and provision in European CMC corpora, and discuss the methods, formats and strategies used. Furthermore, we discuss the need for community standards and offer recommendations for best practices when creating a new CMC corpus.

2019

Willem Elbers, Egon W. Stemle, André Moreira, Alexander König, Luca Cattani, Martin Palma

November 2019

The CLARIN ERIC deployment infrastructure and its applicability to reproducible research

This paper describes the needs and technological preconditions of the CLARIN ERIC infrastructure. It introduces how containerization using Docker can help to meet these requirements and fleshes out the build and deployment workflow that CLARIN ERIC employs to ensure that all the goals of its infrastructure are met in an efficient and sustainable way. In a second step, it also shows how these same workflows can help researchers, especially in the fields of computational and corpus linguistics, to make their research more easily reproducible by creating a virtual environment that provides the specific versions of data, programs and algorithms used for certain research questions and ensures that the exact same versions can still be used at a later stage to reproduce the results.

2019

Egon W. Stemle, Adriane Boyd, Maarten Janssen, Therese Lindström Tiedemann, Nives Mikelić Preradović, Alexandr Rosen, Dan Rosén, Elena Volodina

October 2019 Widening the Scope of Learner Corpus Research. Selected Papers from the Fourth Learner Corpus Research Conference 2017

Working together towards an ideal infrastructure for language learner corpora

In this article we give an overview of first-hand experiences and starting points for best practices from projects in seven European countries dedicated to learner corpus research (LCR) and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, and the careful preparation, easy retrieval, and reusability of corpora and tools have likewise become more important. But with a lack of agreed solutions for many aspects of LCR, interoperability between learner corpora or exchanging data from different learner corpus projects is still challenging. We will illustrate how concepts like metadata, anonymization, error taxonomies and linguistic annotations, as well as tools, toolchains or data formats can individually pose challenges and how they might be solved.

2019

Egon W. Stemle, Andrea Abel, Verena Lyding

October 2019 Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference

Language varieties meet One-Click Dictionary

The goal of the STyrLogism Project is to semi-automatically extract neologism candidates (new lexemes) for the German standard variety used in South Tyrol, and generally create the basis for long-term monitoring of its development. We use automatic lexico-semantic analytics for the lexicographic processing, but instead of continuing to develop our independent neologism detection application, we have recently become part of a thriving community of users and developers within the EU infrastructure project ELEXIS, which aims to harmonise efforts that relate to producing and making dictionary resources available, and to develop tools with consistent standards and increased interoperability. Consequently, we moved the development of our neologism application into Lexonomy, one of ELEXIS' promoted open-source projects. In the following, we report on the current state of this ongoing development by describing how we integrate our work with the Sketch Engine and Lexonomy tools, pointing out the challenges involved, and discussing how our work on language varieties can be evaluated.

2019

Jennifer-Carmen Frey, Alexander König, Egon W. Stemle

September 2019 Proceedings of the 7th Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora2019)

How FAIR are CMC Corpora?

In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research is managed and preserved in a way that research results are reproducible. In order to account for this, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. This article investigates 24 European CMC corpora with regard to their compliance with the FAIR principles and discusses to what extent the deposit of research data in repositories of data preservation initiatives such as CLARIN, Zenodo or Metashare can assist in the provision of FAIR corpora.

2019

Alexander König, Egon W. Stemle

September 2019 Proceedings of CLARIN Annual Conference 2019

Technical Solutions for Reproducible Research

In recent years, the reproducibility of scientific research has more and more come into focus, both from external stakeholders (e.g. funders) and from within research communities themselves. Corpus linguistics and its methods, which are an integral component of many other disciplines working with language data, play a special role here – language corpora are often living objects: they are constantly being improved and revised, and at the same time, the tools for the automatic processing of human language are also regularly updated, both of which can lead to different results for the same processing steps. This article argues that modern software technologies such as version control and containerization can address both issues, namely make reproducible the process of software packaging, installation, and execution and, more importantly, the tracking of corpora throughout their life cycle, thereby making the changes to the raw data reproducible for many subsequent analyses.

2019

Ciara R. Wigham, Egon W. Stemle

June 2019 Presses Universitaires Blaise Pascal

Building Computer-Mediated Communication Corpora for sociolinguistic Analysis

Communication between humans via networked devices has become an everyday part of people’s lives across generations, cultures, geographical areas, and social classes. Shaped by the specific social and technical context in which it is produced, synchronous and asynchronous computer-mediated communication (CMC) has become increasingly participatory, interactive, and multimodal. User interactions and user-generated social media content offer a wide range of research opportunities for a growing multidisciplinary research community. This edited volume combines methodological papers that focus on building and annotating CMC corpora and papers that offer a sociolinguistic analysis of different CMC corpora. The languages represented in the corpora include Arabic, French, German, Italian, English and Slovenian. In fact, the increasingly multilingual nature of CMC data is a recurring theme throughout the volume, as are the references to the importance of and compliance with standards for CMC corpora development in order to facilitate the re-examination of corpora for reproducibility, and for other areas and objectives of investigation. All but one paper are extended papers from the 2017 edition of the CMC and Social Media Corpora Conference held in Bolzano, Italy, where the community met to discuss themes that related to the interaction between language, CMC, and society.

2019

Jennifer-Carmen Frey, Egon W. Stemle, A. Seza Doğruöz

June 2019 Building Computer-Mediated Communication Corpora for sociolinguistic Analysis

Comparison of Automatic vs. Manual Language Identification in Multilingual Social Media Texts

Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, there is a need for language identification. This study compares the performance of human annotators with automatic ways of language identification on a multilingual (mainly German-Italian-English) social media data set collected in Italy (i.e. South Tyrol). Our results indicate that humans and NLP systems follow their individual techniques to make a decision about multilingual text messages. This results in low agreement when different annotators or NLP systems execute the same task. In general, annotators agree with each other more than NLP systems. However, there is also variation in human agreement depending on whether guidelines for the annotation task were established beforehand.

2018

Andrea Abel, Egon W. Stemle

August 2018 Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts

On the Detection of Neologism Candidates as Basis for Language Observation and Lexicographic Endeavours: The STyrLogism Project

The goal of the STyrLogism project is to semi-automatically extract neologism (new lexeme) candidates for the German standard variety used in South Tyrol. We use a list of manually vetted URLs from news, magazine and blog websites of South Tyrol, regularly crawl their data, clean and process it, and compare this new data to reference corpora, additional regional word lists and the formerly crawled data sets. Our reference corpora are DECOW14 with around 60m types, and the South Tyrolean Web Corpus with around 2.4m types; the additional word lists consist of named entities, terminological terms from the region, and specific terms of the German standard variety used in South Tyrol (altogether around 53k unique types). Here, we will report on the employed method, a first round of candidate extraction with an approach for a classification schema for the selected candidates, and some remarks on a second extraction round.

2018

Egon Stemle, Alexander Onysko

June 2018 Proceedings of the Workshop on Figurative Language Processing

Using Language Learner Data for Metaphor Detection

This article describes the system that participated in the shared task (ST) on metaphor detection on the Vrije University Amsterdam Metaphor Corpus (VUA). The ST was part of the workshop on processing figurative language at the 16th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL2018). The system combines a small assortment of trending techniques, which implement matured methods from NLP and ML; in particular, the system uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in an LSTM BiRNN architecture. The system is available under the APLv2 open-source license.

2018

Barbara Baumgartner, Martin Angler

May 2018 Academia-Interview Titelthema

Was darf Forschung mit Social Media Daten?

Interview in Academia (science magazine by EURAC and unibz), Bolzano, Italy

2017

Michael Beißwenger, Ciara R. Wigham, Carole Etienne, Darja Fišer, Holger Grumt Suárez, Laura Herzberg, Erhard Hinrichs, Tobias Horsmann, Natali Karlova-Bourbonus, Lothar Lemnitzer, Julien Longhi, Harald Lüngen, Lydia-Mai Ho-Dac, Christophe Parisse, Céline Poudat, Thomas Schmidt, Egon Stemle, Angelika Storrer, Torsten Zesch

October 2017 Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities

Connecting Resources: Which Issues have to be Solved to Integrate CMC Corpora from Heterogeneous Sources and for Different Languages?

The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.

2017

Egon W. Stemle, Ciara R. Wigham

October 2017

Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities

This volume presents the proceedings of the 5th edition of the annual conference series on CMC and Social Media Corpora for the Humanities (cmc-corpora2017). This conference series is dedicated to the collection, annotation, processing, and exploitation of corpora of computer-mediated communication (CMC) and social media for research in the humanities. The annual event brings together language-centered research on CMC and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. The 5th Conference on CMC and Social Media Corpora for the Humanities was held at Eurac Research on October 4th and 5th in Bolzano, Italy. This volume contains extended abstracts of the invited talks, papers, and extended abstracts of posters presented at the event. The conference attracted 26 valid submissions. Each submission was reviewed by at least two members of the scientific committee. This committee decided to accept 16 papers and 8 posters, of which 14 papers and 3 posters were presented at the conference. The programme also includes three invited talks: two keynote talks by Aivars Glaznieks (Eurac Research, Italy) and A. Seza Doğruöz (Independent researcher) and an invited talk on the Common Language Resources and Technology Infrastructure (CLARIN) given by Darja Fišer, the CLARIN ERIC Director of User Involvement.

2017

Michael Beißwenger, Thierry Chanier, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Lubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer, Ciara Wigham

May 2017 Selected Papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure

Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries

The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped.

2016

Andrea Abel, Aivars Glaznieks, Lionel Nicolas, Egon Stemle

December 2016 Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)

An extended version of the KoKo German L1 Learner corpus

This paper describes an extended version of the KoKo corpus (version KoKo4, Dec 2015), a corpus of written German L1 learner texts from three different German-speaking regions in three different countries. The KoKo corpus is richly annotated with learner language features on different linguistic levels such as errors or other linguistic characteristics that are not deficit-oriented, and is enriched with a wide range of metadata. This paper complements a previous publication (Abel et al., 2014a) and reports on new textual metadata and lexical annotations and on the methods adopted for their manual annotation and linguistic analyses. It also briefly introduces some linguistic findings that have been derived from the corpus.

2016

December 2016 Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)

bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language EVALITA 2016. The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small assortment of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.

2016

Jennifer-Carmen Frey, Aivars Glaznieks, Egon W. Stemle

December 2016 Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)

The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts

The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data has been manually anonymised and provides manually corrected part-of-speech tags for the Italian language texts and manually normalised data for German texts. Moreover, it is annotated with user-provided socio-demographic data (among others L1, gender, age, education, and internet communication habits) from a questionnaire, and linguistic annotations regarding CMC phenomena, languages and varieties. The anonymised corpus is freely available for research purposes.

2016

Michael Beißwenger, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Lubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer, Ciara Wigham

October 2016 Proceedings of the CLARIN Annual Conference 2016

Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects

2016

September 2016

Scientific Report of Short Term Scientific Mission COST-STSM-IS1305-34353

ENeL’s WG3 concerns innovative e-dictionaries with a focus on the development of digitally born dictionaries. The training school 2016 in Ljubljana (SI), May 17-20, introduced participants, among others, to collecting, analysing, and automatically extracting data from web corpora. Albeit related, the task of processing data from corpora of computer-mediated communication and social media interactions (henceforth referred to as CMC) has been deliberately excluded from the training school’s programme. But we know that “new vocabulary is characteristic for CMC discourse, e.g. ‘funzen’ (an abbreviated variant of the German verb ‘funktionieren’, en.: ‘to function’) or ‘gruscheln’ (verb denoting a function of a German social network platform, most likely a blending of ‘grüßen’, en.: ‘to greet’ and ‘kuscheln’, en.: ‘to cuddle’)” and therefore relevant to WG3; the goal of this STSM is to apply the methods and tools from the training school to CMC data.

2016

August 2016 Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task

bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)

This article describes the system that participated in the Part-of-speech tagging subtask of the “EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media”. The system combines a small assortment of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.

2016

Paul Cook, Stefan Evert, Roland Schäfer, Egon Stemle

August 2016 Association for Computational Linguistics

Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics). WAC-X also featured the final workshop of the EmpiriST 2015 shared task “Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media” (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion “Corpora, open science, and copyright reforms” (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).

2015

Jennifer-Carmen Frey, Aivars Glaznieks, Egon W. Stemle

September 2015 Proceedings of the 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media at GSCL2015 (NLP4CMC2015)

The DiDi Corpus of South Tyrolean CMC Data

This paper presents the DiDi Corpus, a corpus of South Tyrolean Data of Computer-mediated Communication (CMC). The corpus comprises around 650,000 tokens from Facebook wall posts, comments on wall posts and private messages, as well as socio-demographic data of participants. All data was automatically annotated with language information (de, it, en and others), and manually normalised and anonymised. Furthermore, semi-automatic token level annotations include part-of-speech and CMC phenomena (e.g. emoticons, emojis, and iteration of graphemes and punctuation). The anonymised corpus without the private messages is freely available for researchers; the complete and anonymised corpus is available after signing a non-disclosure agreement.

2015

Lionel Nicolas, Egon Stemle, Aivars Glaznieks, Andrea Abel

June 2015 Studies in Learner Corpus Linguistics: Research and Applications for Foreign Language Teaching and Assessment

A Generic Data Workflow for Building Annotated Text Corpora

We present an abstract and generic workflow, and detail how it has been implemented to build and annotate learner corpora. This workflow has been developed through an interdisciplinary collaboration between linguists, who annotate and use corpora, and computational linguists and computer scientists, who are responsible for providing technical support and adaptation or implementation of software components.

2015

Egon Stemle, Alexander Onysko

April 2015 Transfer Effects in Multilingual Language Development

Automated L1 identification in English learner essays and its implications for language transfer

This article focuses on automatic text classification which aims at identifying the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. In order to explore this issue further, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), which involves a sample of 12,100 essays written by participants in the TOEFL® test from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). The article will show our results in automatic L1 detection in the TOEFL11 corpus. These results are discussed in light of relevant transfer features which turned out to be particularly informative for automatic detection of L1 German and L1 Italian.

2014

Aivars Glaznieks, Egon Stemle

December 2014 Journal for Language Technology and Computational Linguistics (JLCL)

Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project

Special Issue: Building and annotating corpora of computer-mediated discourse. Issues and Challenges at the Interface of Corpus and Computational Linguistics

2014

Michel Généreux, Egon W. Stemle, Lionel Nicolas, Verena Lyding

December 2014 Proceedings of the First Italian Conference on Computational Linguistics (CLiC-it 2014)

Correcting OCR errors for German in Fraktur font

In this paper, we present ongoing experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where the approach currently stands with real data.

2014

Aivars Glaznieks, Andrea Abel, Verena Lyding, Lionel Nicolas, Egon Stemle

December 2014 Apples - Journal of Applied Language Studies

Establishing a Standardised Procedure for Building Learner Corpora

Decisions at the outset of preparing a learner corpus are of crucial importance for how the corpus can be built and how it can be analysed later on. This paper presents a generic workflow to build learner corpora while taking into account the needs of the users. The workflow results from an extensive collaboration between linguists who annotate and use the corpus and computational linguists who are responsible for providing technical support. The paper addresses the linguists' research needs as well as the availability and usability of language technology tools necessary to meet them. We demonstrate and illustrate the relevance of the workflow using results and examples from our L1 learner corpus of German (“KoKo”).

2014

Jennifer-Carmen Frey, Egon W. Stemle, Aivars Glaznieks

October 2014 Workshop Proceedings of the 12th Edition of the KONVENS Conference

Collecting language data of non-public social media profiles

In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public computer-mediated communication, we will present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding metadata. Finally, we will discuss positive and negative implications of this method.

2014

Verena Lyding, Lionel Nicolas, Egon Stemle

May 2014 Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

'interHist' - an interactive visual interface for corpus exploration

In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISÀ corpus of Italian web texts, interHist aims at facilitating the exploration of large result sets to linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports user-steered navigation by means of interactive filtering. It allows users to switch dynamically between an overview of the data and a detailed view of results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.

2014

Andrea Abel, Aivars Glaznieks, Lionel Nicolas, Egon Stemle

May 2014 Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

KoKo: An L1 Learner Corpus for German

We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the quality of the performed transcriptions and annotations shows an accuracy of orthographic error annotations of approximately 80% as well as high accuracy of transcriptions (> 99%), automatic tokenisation (> 99%), sentence splitting (> 96%) and POS-tagging (> 94%). The KoKo corpus will be published at the end of 2014 and be the first accessible linguistically annotated German L1 learner corpus. It will represent a valuable source for research and teaching on German as an L1, in particular with regard to writing skills.

2014

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, Vito Pirrelli

April 2014 Proceedings of the 9th Web as Corpus Workshop (WaC-9)

The PAISÀ Corpus of Italian Web Texts

PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.

2013

Egon W. Stemle, Alexander Onysko

December 2013 Academia

Language as a Detective Story

Article in Academia (science magazine by EURAC and unibz), Bolzano, Italy

2013

Verena Lyding, Claudia Borghetti, Henrik Dittmann, Lionel Nicolas, Egon Stemle

November 2013 Proceedings of the International Conference ICT for Language Learning, 6th edition

Open Corpus Interface for Italian Language Learning

In this article, we present the multi-faceted interface to the open PAISÀ corpus of Italian. Created within the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) [1], the corpus is designed to be freely available for non-commercial processing, usage and distribution by the public. Hence, this automatically annotated corpus (for lemma, part-of-speech and dependency information) is exclusively composed of documents licensed under Creative Commons (CC) licenses [2]. The dedicated corpus interface is designed to provide flexible, powerful, and easy-to-use modes of corpus access, with the objective of supporting language learning, language practice and linguistic analyses. We present in detail the interface’s functionalities and discuss the underlying design decisions. We introduce the four principal components of the interface, describe supported display formats and present two specific features added to increase the interface’s relevance for language learning. The main search components are (1) a basic search that adopts a “Google-style” search box, (2) an advanced search that provides elaborated graphical search options, and (3) a search that makes use of the powerful CQP query language of the Open Corpus Workbench [3]. In addition, (4) a filter interface for retrieving full-text corpus documents based on keyword searches is available. It likewise provides the means for building temporary sub-corpora for specific topics. Users can choose among different display formats for the search results. Besides the established KWIC (KeyWord In Context) and full sentence views, graphical representations of the dependency relation information as well as keyword distributions are available. These dynamic displays are based on a visualisation for dependency graphs [4] and one for Word Clouds [5], which build on the latest developments in information visualisation for language data.
Two special features for novice learners are integrated into each search component. The first feature is a function for restricting search results to sentences of limited complexity. Search results are automatically filtered based on formal text characteristics such as sentence length, vocabulary, etc. The second is the supply of pre-defined search queries for linguistic constructions such as sentences in passive voice, questions, etc. Finally, we show how the PAISÀ interface can be employed in different language teaching tasks. In particular, we present a complete unit of work aimed at learners of Italian (CEFR level A2/B1) and centered on students’ direct use of the interface and its functionalities. In doing so, we give concrete examples of targeted searches and interactions with the provided language material, and illustrate how the use of the corpus can be integrated with communicative language activities in the classroom.

2013

Lionel Nicolas, Egon W. Stemle, Klara Kranebitter, Verena Lyding

September 2013 Proceedings of Recent Advances in Natural Language Processing, RANLP 2013

High-Accuracy Phrase Translation Acquisition Through Battle-Royale Selection

In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple translations without size criteria, i.e. phrases can have several translations, can be of any size, and their size is not considered when selecting their translations. Even though the process is at an early development stage and has much room for improvement, evaluation shows that it yields phrase translations of high precision that are relevant not only to machine translation but also to a wider set of applications including memory-based translation or multi-word acquisition.

2013

Stefan Evert, Egon Stemle, Paul Rayson

July 2013 WAC-8 Organising Committee

Proceedings of the 8th Web as Corpus Workshop (WAC-8)

Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is. Since the first Web as Corpus Workshop organised at the Corpus Linguistics 2005 Conference, a highly successful series of yearly Web as Corpus workshops provides a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora. After a stronger focus on application-oriented natural language processing and Web technology in recent years, with workshops taking place at NAACL-HLT 2010, 2011 and WWW2012, the 8th Web as Corpus Workshop returns to its roots in the corpus linguistics community. Accordingly, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation of Web text. The workshop brings together presentations on all aspects of building, using and evaluating Web corpora, with a particular focus on the following topics: applications of Web corpora and other Web-derived data sets for language research; automatic linguistic annotation of Web data such as tokenisation, part-of-speech tagging, lemmatisation and semantic tagging (the accuracy of currently available off-the-shelf tools is still unsatisfactory for many types of Web data); critical exploration of the characteristics of Web data from a linguistic perspective and its applicability to language research; presentation of Web corpus collection projects or software tools required for some part of this process (crawling, filtering, de-duplication, language identification, indexing, …).

2013

Klara Kranebitter, Egon W. Stemle

June 2013 TOTh 2013 Proceedings - Terminology & Ontology: Theories and applications

Constructing concept relation maps to support building concept systems in comparative legal terminology

Graphical tools to organise and represent knowledge are useful in terminology work to facilitate building concept systems. Creating and maintaining hierarchically structured concept relation maps while manually gathering data for terminological databases helps to gain and maintain an overview of concept relations, supports terminology work in groups, and helps new team members catch up on the subject field. This article describes our approach to support the building of concept systems in comparative legal terminology using the concept mapping software CmapTools (IHMC): we build hierarchically structured concept relation maps where linking lines with arrowheads between concepts of the same legal system represent generic-specific relations, and combined concept relation maps where dashed lines without arrowheads connect similar concepts in different legal systems.

2012

Lionel Nicolas, Egon W. Stemle, Klara Kranebitter

September 2012 Proceedings of the 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing

Towards high-accuracy bilingual phrase acquisition from parallel corpora

We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate translations. This process manages to acquire multiple translations combining phrases of equal or different sizes. The preliminary evaluation performed confirms both its potential and its interest.

2012

Francesca Bonin, Fabio Cavulli, Aronne Noriller, Massimo Poesio, Egon W. Stemle

July 2012 Proceedings of the Sixth Linguistic Annotation Workshop (LAW 2012)

Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities

Developing content extraction methods for Humanities domains raises a number of challenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humanities scholars is essential to address these challenges. We discuss an annotation schema for Archaeological texts developed in collaboration with domain experts. Its development required a number of iterations to make sure all the most important entity types were included, as well as addressing challenges including a domain-specific handling of temporal expressions, and the existence of many systematic types of ambiguity.

2011

Asif Ekbal, Francesca Bonin, Sriparna Saha, Egon Stemle, Eduard Barbu, Fabio Cavulli, Christian Girardi, Massimo Poesio

November 2011 JLCL

Rapid Adaptation of NE Resolvers for Humanities Domains using Active Annotation

2011

Massimo Poesio, Eduard Barbu, Francesca Bonin, Fabio Cavulli, Asif Ekbal, Egon Stemle, Christian Girardi

November 2011 Proceedings of Supporting Digital Humanities (SDH2011): Answering the unaskable

The Humanities Research Portal: Human Language Technology Meets Humanities Publication Archives

2011

Brian Murphy, Egon W. Stemle

July 2011 Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

PaddyWaC: A Minimally-Supervised Web-Corpus of Hiberno-English

Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text and distinguishing it from related languages in the same region can be difficult. For example, less dominant variants of English (e.g. New Zealander, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism, or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms, and several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.

2011

Massimo Poesio, Eduard Barbu, Egon W. Stemle, Christian Girardi

June 2011 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011)

Structure-Preserving Pipelines for Digital Libraries

Most existing HLT pipelines assume the input is pure text or, at most, HTML and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library and other types of applications, and show that it is relatively straightforward to extend existing pipelines to achieve ones in which the structure of a document is preserved.

2010

Kepa Joseba Rodríguez, Francesca Delogu, Yannick Versley, Egon W. Stemle, Massimo Poesio

May 2010 Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10)

Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus

2009

Johannes Steger, Egon Stemle

September 2009 Proceedings of the Fifth Web as Corpus Workshop (WAC5)

KrdWrd: Architecture for Unified Processing of Web Content

Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extraction, providing unified data access in every processing step. Stable data storage and collaboration control scripts for group annotations of massive corpora are provided via a Web interface coupled with an HTTP proxy. A modular interface allows for linguistic and visual data feature extractor plugins. The implementation is suitable for many tasks in the Web as corpus domain and beyond.

2009

April 2009

Hybrid Sweeping: Streamlined Perceptual Structured-Text Refinement

This thesis discusses the KrdWrd Project. The Project goals are to provide tools and infrastructure for acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operate on the visual rendering of pages, and provide visual tools for inspection of results.

2007

Daniel Bauer, Judith Degen, Xiaoye Deng, Priska Herger, Jan Gasthaus, Eugenie Giesbrecht, Lina Jansen, Christin Kalina, Thorben Krüger, Robert Märtin, Martin Schmidt, Simon Scholler, Johannes Steger, Egon Stemle, Stefan Evert

September 2007 Proceedings of the Third Web as Corpus Workshop (WAC3)

FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück

2007

Sebastian Blohm, Philipp Cimiano, Egon Stemle

July 2007 Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07)

Harvesting Relations from the Web - Quantifying the Impact of Filtering Functions

2005

Martin Bleichner, Eugenie Giesbrecht, Helmar Gust, Eva-Maria Leicht, Petra Ludewig, Sabine Möller, Wiebke Müller, Martin Schmidt, Moritz Stefaner, Egon Stemle, Katja Wilke

November 2005

ASADO: The Analysis and Structuring of Aviation Documents - Final Report

Final Report of the one-year cooperation between the Universities of Osnabrück and Hildesheim, and the aircraft manufacturer AIRBUS to research methodologies and technologies to analyze and structure the huge amount of documentation produced during aircraft construction. The work was done in a study project carried out in close cooperation with seven students of cognitive science advised by two lecturers of the Institute of Cognitive Science of the University of Osnabrück and with one student of international information management advised by one professor of the Institute of Applied Linguistics of the University of Hildesheim.