Publications

2019

In press

In this article we give an overview of first-hand experiences and starting points for best practices from projects in seven European countries dedicated to learner corpus research (LCR) and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, and their careful preparation, easy retrieval, and reusability have likewise gained importance. However, with agreed solutions lacking for many aspects of LCR, interoperability between learner corpora and the exchange of data between different learner corpus projects remain challenging. We illustrate how concepts like metadata, anonymization, error taxonomies and linguistic annotations, as well as tools, toolchains and data formats, can individually pose challenges and how these challenges might be solved.

2019

Communication between humans via networked devices has become an everyday part of people’s lives across generations, cultures, geographical areas, and social classes. Shaped by the specific social and technical context in which it is produced, synchronous and asynchronous computer-mediated communication (CMC) has become increasingly participatory, interactive, and multimodal. User interactions and user-generated social media content offer a wide range of research opportunities for a growing multidisciplinary research community. This edited volume combines methodological papers that focus on building and annotating CMC corpora with papers that offer a sociolinguistic analysis of different CMC corpora. The languages represented in the corpora include Arabic, French, German, Italian, English and Slovenian. In fact, the increasingly multilingual nature of CMC data is a recurring theme throughout the volume, as are references to the importance of, and compliance with, standards for CMC corpora development in order to facilitate the re-examination of corpora for reproducibility and for other areas and objectives of investigation. All but one paper are extended papers from the 2017 edition of the CMC and Social Media Corpora Conference held in Bolzano, Italy, where the community met to discuss themes related to the interaction between language, CMC, and society.

2019

Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, language identification is needed. This study compares the performance of human annotators with that of automatic language identification methods on a multilingual (mainly German-Italian-English) social media data set collected in Italy (i.e. South Tyrol). Our results indicate that humans and NLP systems follow their own individual techniques when deciding on the language(s) of multilingual text messages. This results in low agreement when different annotators or NLP systems perform the same task. In general, annotators agree with each other more than NLP systems do. However, human agreement also varies depending on whether guidelines were established for the annotation task beforehand.
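To give a concrete sense of how such agreement can be measured, the following sketch compares per-message language labels from two sources (e.g. a human annotator and an NLP system) using raw agreement and Cohen's kappa via scikit-learn; the labels are invented for illustration and are not taken from the study's data set.

```python
# Sketch: quantify agreement between two sources of per-message language labels,
# e.g. a human annotator and an NLP system. The labels are invented for illustration.
from sklearn.metrics import cohen_kappa_score

annotator = ["de", "it", "de+it", "en", "de", "it"]
system    = ["de", "it", "de",    "en", "it", "it"]

raw_agreement = sum(a == b for a, b in zip(annotator, system)) / len(annotator)
kappa = cohen_kappa_score(annotator, system)

print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```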

2018

The goal of the project STyrLogisms is to semi-automatically extract neologism candidates (new lexemes) for the German standard variety used in South Tyrol. We use a list of manually vetted URLs from news, magazine and blog websites of South Tyrol, regularly crawl their data, clean and process it, and compare the new data to reference corpora, additional regional word lists, and the previously crawled data sets. Our reference corpora are DECOW14 with around 60m types and the South Tyrolean Web Corpus with around 2.4m types; the additional word lists consist of named entities, terminological terms from the region, and specific terms of the German standard variety used in South Tyrol (altogether around 53k unique types). Here, we report on the employed method, a first round of candidate extraction together with a proposed classification schema for the selected candidates, and some remarks on a second extraction round.
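The core comparison step can be pictured as a set difference over type lists: types that occur in the newly crawled material but in none of the reference resources are kept as candidates. A minimal sketch follows; the file names and the simple tokenisation are assumptions for illustration, not the project's actual pipeline.

```python
# Sketch: neologism candidates = types in the latest crawl that are absent from
# the reference type lists (reference corpora, regional word lists, earlier crawls).
# File names and the regex tokenisation are simplifying assumptions.
import re

def types_from_file(path):
    with open(path, encoding="utf-8") as f:
        return {tok.lower() for tok in re.findall(r"\w+", f.read())}

known = set()
for path in ["decow14_types.txt", "stwc_types.txt",
             "regional_wordlists.txt", "earlier_crawls.txt"]:
    known |= types_from_file(path)

candidates = sorted(types_from_file("latest_crawl.txt") - known)
print(f"{len(candidates)} neologism candidates, e.g.: {candidates[:10]}")
```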

2018

This article describes the system that participated in the shared task (ST) on metaphor detection on the Vrije Universiteit Amsterdam Metaphor Corpus (VUA). The ST was part of the workshop on processing figurative language at the 16th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018). The system combines a small assortment of trending techniques that implement matured methods from NLP and ML; in particular, it uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in an LSTM BiRNN architecture. The system is available under the APLv2 open-source license.
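The following is a minimal sketch of the kind of architecture described, i.e. word embeddings feeding a bidirectional LSTM with a per-token metaphor/literal decision, written with Keras; vocabulary size, sequence length and layer sizes are placeholders, not the settings of the submitted system.

```python
# Sketch of a token-level metaphor classifier: word embeddings -> BiLSTM ->
# per-token sigmoid. All sizes are placeholders, not the submitted system's settings.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

VOCAB_SIZE, EMB_DIM, MAX_LEN = 50_000, 300, 50

inputs = Input(shape=(MAX_LEN,), dtype="int32")
x = Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)       # word embeddings
x = Bidirectional(LSTM(128, return_sequences=True))(x)           # BiRNN over the sentence
outputs = TimeDistributed(Dense(1, activation="sigmoid"))(x)     # metaphor vs. literal per token
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```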

2018

Interview in Academia (science magazine by EURAC and unibz), Bolzano, Italy

2017

The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.

2017

This volume presents the proceedings of the 5th edition of the annual conference series on CMC and Social Media Corpora for the Humanities (cmc-corpora2017). This conference series is dedicated to the collection, annotation, processing, and exploitation of corpora of computer-mediated communication (CMC) and social media for research in the humanities. The annual event brings together language-centered research on CMC and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. The 5th Conference on CMC and Social Media Corpora for the Humanities was held at Eurac Research in Bolzano, Italy, on October 4th and 5th, 2017. This volume contains extended abstracts of the invited talks, papers, and extended abstracts of posters presented at the event. The conference attracted 26 valid submissions. Each submission was reviewed by at least two members of the scientific committee. This committee decided to accept 16 papers and 8 posters, of which 14 papers and 3 posters were presented at the conference. The programme also included three invited talks: two keynote talks by Aivars Glaznieks (Eurac Research, Italy) and A. Seza Doğruöz (independent researcher) and an invited talk on the Common Language Resources and Technology Infrastructure (CLARIN) given by Darja Fišer, the CLARIN ERIC Director of User Involvement.

2017

The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped.

2016

This paper describes an extended version of the KoKo corpus (version KoKo4, Dec 2015), a corpus of written German L1 learner texts from three different German-speaking regions in three different countries. The KoKo corpus is richly annotated with learner language features on different linguistic levels, covering errors as well as other linguistic characteristics that are not deficit-oriented, and it is enriched with a wide range of metadata. This paper complements a previous publication (Abel et al., 2014a) and reports on new textual metadata and lexical annotations and on the methods adopted for their manual annotation and linguistic analyses. It also briefly introduces some linguistic findings that have been derived from the corpus.

2016

The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data has been manually anonymised; it provides manually corrected part-of-speech tags for the Italian texts and manually normalised data for the German texts. Moreover, it is annotated with user-provided socio-demographic data (among others, L1, gender, age, education, and internet communication habits) from a questionnaire, and with linguistic annotations regarding CMC phenomena, languages and varieties. The anonymised corpus is freely available for research purposes.

2016

This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language, EVALITA 2016. The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small assortment of trending techniques, which implement matured methods from NLP and ML, to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (the Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (the Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.
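As an illustration of the character-level representations mentioned above, the sketch below encodes each word's beginning and ending as fixed-length sequences of character indices that could be fed into a tagger alongside word embeddings; the prefix/suffix length, alphabet and padding scheme are simplifying assumptions rather than the system's actual configuration.

```python
# Sketch: represent a word by fixed-length character-index sequences for its
# beginning and its ending. Alphabet, length and padding are simplifying assumptions.
CHARS = "abcdefghijklmnopqrstuvwxyzàèéìòù0123456789@#_"
CHAR2ID = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding
UNK = len(CHAR2ID) + 1                             # id for unknown characters

def beginning_and_ending(word, length=5):
    ids = [CHAR2ID.get(c, UNK) for c in word.lower()]
    prefix = (ids + [0] * length)[:length]   # pad/truncate the word beginning
    suffix = ([0] * length + ids)[-length:]  # pad/truncate the word ending
    return prefix, suffix

print(beginning_and_ending("ciaooo"))  # ([3, 9, 1, 15, 15], [9, 1, 15, 15, 15])
```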

2016

ENeL’s WG3 concerns innovative e-dictionaries with a focus on the development of digitally born dictionaries. The training school 2016 in Ljubljana (SI), May 17-20, introduced participants, among other things, to collecting, analysing, and automatically extracting data from web corpora. Albeit related, the task of processing data from corpora of computer-mediated communication and social media interactions (henceforth referred to as CMC) has been deliberately excluded from the training school’s programme. But we know that “new vocabulary is characteristic for CMC discourse, e.g. ‘funzen’ (an abbreviated variant of the German verb ‘funktionieren’, en.: ‘to function’) or ‘gruscheln’ (verb denoting a function of a German social network platform, most likely a blending of ‘grüßen’, en.: ‘to greet’ and ‘kuscheln’, en.: ‘to cuddle’)” and it is therefore relevant to WG3; the goal of this STSM is to apply the methods and tools from the training school to CMC data.

2016

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics). WAC-X also featured the final workshop of the EmpiriST 2015 shared task “Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media” (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion “Corpora, open science, and copyright reforms” (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).

2016

This article describes the system that participated in the Part-of-speech tagging subtask of the “EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media”. The system combines a small assortment of trending techniques, which implement matured methods from NLP and ML, to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.
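As one way of obtaining the word-embedding input mentioned above, the sketch below trains word2vec embeddings on unlabelled text with gensim (version 4 API); the input file, which is assumed to contain one tokenised sentence per line, and all hyperparameters are placeholders.

```python
# Sketch: train word embeddings on unlabelled text (e.g. an extracted Wikipedia
# dump with one tokenised sentence per line) using gensim's word2vec (gensim 4 API).
# The file name and the hyperparameters are placeholders.
from gensim.models import Word2Vec

class Sentences:
    """Stream whitespace-tokenised sentences, one per line, from disk."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

model = Word2Vec(Sentences("dewiki_sentences.txt"),
                 vector_size=300, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("dewiki_vectors.txt")
```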

2015

This paper presents the DiDi Corpus, a corpus of South Tyrolean Data of Computer-mediated Communication (CMC). The corpus comprises around 650,000 tokens from Facebook wall posts, comments on wall posts and private messages, as well as socio-demographic data of participants. All data was automatically annotated with language information (de, it, en and others), and manually normalised and anonymised. Furthermore, semi-automatic token level annotations include part-of-speech and CMC phenomena (e.g. emoticons, emojis, and iteration of graphemes and punctuation). The anonymised corpus without the private messages is freely available for researchers; the complete and anonymised corpus is available after signing a non-disclosure agreement.
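For illustration, the sketch below shows the kind of automatic per-message language annotation described, using the off-the-shelf langid.py library restricted to the languages of interest; it is an example of the task, not necessarily the tool that was used for the corpus.

```python
# Sketch: per-message language identification with the off-the-shelf langid.py
# library, restricted to the corpus's main languages. Example messages are invented.
import langid

langid.set_languages(["de", "it", "en"])

for message in ["Heute kein Training?", "Va bene, ci vediamo domani!", "see you later"]:
    lang, score = langid.classify(message)
    print(lang, score, message)
```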

2015

We present an abstract and generic workflow, and detail how it has been implemented to build and annotate learner corpora. This workflow has been developed through an interdisciplinary collaboration between linguists, who annotate and use corpora, and computational linguists and computer scientists, who are responsible for providing technical support and adaptation or implementation of software components.

2015

This article focuses on automatic text classification aimed at identifying the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. In order to explore this issue further, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), which comprises a sample of 12,100 essays written by participants in the TOEFL® test from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). The article presents our results in automatic L1 detection on the TOEFL11 corpus. These results are discussed in light of relevant transfer features which turned out to be particularly informative for the automatic detection of L1 German and L1 Italian.
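As a rough illustration of the classification setup, the sketch below trains an essay-level L1 classifier from word n-gram features with scikit-learn; the toy essays and labels stand in for the TOEFL11 data, and this is not the system used in the study.

```python
# Sketch: essay-level L1 classification with TF-IDF word n-grams and a linear SVM.
# The toy essays and labels stand in for TOEFL11; not the system used in the study.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

essays = ["I am agree with this statement because ...",        # hypothetical learner text
          "In my country we make a lot of sport and ...",      # hypothetical learner text
          "I have the possibility to study abroad since ..."]  # hypothetical learner text
l1_labels = ["ITA", "GER", "GER"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), lowercase=True), LinearSVC())
clf.fit(essays, l1_labels)
print(clf.predict(["We discussed about the new topic ..."]))
```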

2014

Special Issue: Building and annotating corpora of computer-mediated discourse. Issues and Challenges at the Interface of Corpus and Computational Linguistics

2014

In this paper, we present ongoing experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where the approach currently stands with real data.
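The general idea can be sketched as lexicon-based candidate generation ranked by a crude noisy-channel-style score, with string similarity standing in for the error model and corpus frequency for the language model; the paper's probabilistic edit-operation model is more elaborate, and the lexicon and frequencies below are invented.

```python
# Sketch: correct an OCR token by picking the lexicon word with the best combination
# of string similarity (stand-in error model) and corpus frequency (language model).
# The lexicon, frequencies and weighting are invented for illustration.
import math
from difflib import SequenceMatcher, get_close_matches

LEXICON_FREQ = {"Haus": 900, "Hans": 300, "Maus": 200, "aus": 5000}

def correct(token, n=5, cutoff=0.6):
    candidates = get_close_matches(token, LEXICON_FREQ, n=n, cutoff=cutoff)
    if not candidates:
        return token  # no plausible correction found
    def score(cand):
        sim = SequenceMatcher(None, token, cand).ratio()
        return sim + 0.1 * math.log(LEXICON_FREQ[cand])
    return max(candidates, key=score)

print(correct("Hnus"))  # picks a similar, frequent lexicon word (here: "Haus")
```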

2014

Decisions at the outset of preparing a learner corpus are of crucial importance for how the corpus can be built and how it can be analysed later on. This paper presents a generic workflow to build learner corpora while taking into account the needs of the users. The workflow results from an extensive collaboration between linguists who annotate and use the corpus and computational linguists who are responsible for providing technical support. The paper addresses the linguists’ research needs as well as the availability and usability of the language technology tools necessary to meet them. We demonstrate and illustrate the relevance of the workflow using results and examples from our L1 learner corpus of German (“KoKo”).

2014

In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges of building corpora of private, non-public computer-mediated communication, we present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding metadata. Finally, we discuss the positive and negative implications of this method.

2014

In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISÀ corpus of Italian web texts, interHist aims at facilitating the exploration of large result sets from linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports user-steered navigation by means of interactive filtering. It allows users to dynamically switch between an overview of the data and a detailed view of results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.

2014

We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the quality of the performed transcriptions and annotations shows an accuracy of orthographic error annotations of approximately 80%, as well as high accuracy of transcriptions (> 99%), automatic tokenisation (> 99%), sentence splitting (> 96%) and POS-tagging (> 94%). The KoKo corpus will be published at the end of 2014 and will be the first accessible linguistically annotated German L1 learner corpus. It will represent a valuable source for research and teaching on German as an L1, in particular with regard to writing skills.

2014

PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.

2013

Article in Academia (science magazine by EURAC and unibz), Bolzano, Italy

2013

In this article, we present the multi-faceted interface to the open PAISÀ corpus of Italian. Created within the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) [1], the corpus is designed to be freely available for non-commercial processing, usage and distribution by the public. Hence, this automatically annotated corpus (for lemma, part-of-speech and dependency information) is exclusively composed of documents licensed under Creative Commons (CC) licenses [2]. The dedicated corpus interface is designed to provide flexible, powerful, and easy-to-use modes of corpus access, with the objective to support language learning, language practicing and linguistic analyses. We present in detail the interface’s functionalities and discuss the underlying design decisions. We introduce the four principal components of the interface, describe supported display formats and present two specific features added to increase the interface’s relevance for language learning. The main search components are (1) a basic search that adopts a “Google-style” search box, (2) an advanced search that provides elaborated graphical search options, and (3) a search that makes use of the powerful CQP query language of the Open Corpus Workbench [3]. In addition, (4) a filter interface for retrieving full-text corpus documents based on keyword searches is available. It likewise provides the means for building temporary sub-corpora for specific topics. Users can choose among different display formats for the search results. Besides the established KWIC (KeyWord In Context) and full sentence views, graphical representations of the dependency relation information as well as keyword distributions are available. These dynamic displays are based on a visualisation for dependency graphs [4] and one for Word Clouds [5], which build on latest developments in information visualisation for language data. Two special features for novice learners are integrated into each search component. The first feature is a function for restricting search results to sentences of limited complexity. Search results are automatically filtered based on formal text characteristics such as sentence length, vocabulary, etc. The second is the supply of pre-defined search queries for linguistic constructions such as sentences in passive voice, questions, etc. Finally, we show how the PAISÀ interface can be employed in different language teaching tasks. In particular, we present a complete unit of work aimed at learners of Italian (CEFR level A2/B1) and centered on students’ direct use of the interface and its functionalities. By doing so, we give concrete examples of targeted searches and interactions with the provided language material, as well as an exemplification of how the use of the corpus can be integrated with communicative language activities in the classroom.

2013

In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple translations without size criteria, i.e. phrases can have several translations, can be of any size, and their size is not considered when selecting their translations. Even though the process is at an early development stage and has much room for improvement, evaluation shows that it yields phrase translations of high precision that are relevant not only to machine translation but also to a wider set of applications including memory-based translation and multi-word acquisition.
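The core intuition can be sketched as counting how often source and target phrases co-occur in aligned sentence pairs and then greedily keeping the strongest pairings; the actual selection and discarding strategies of the paper are considerably more refined, and the toy data and plain co-occurrence score below are simplifying assumptions.

```python
# Sketch: count phrase-pair co-occurrence over sentence-aligned data, then greedily
# pick, for each source phrase, the most frequent target phrase. Toy data; the
# paper's selection strategies are more refined than this plain frequency score.
from collections import Counter
from itertools import product

def phrases(tokens, max_n=3):
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

aligned = [("das haus ist alt", "the house is old"),
           ("das haus ist neu", "the house is new")]

cooc = Counter()
for src, tgt in aligned:
    cooc.update(product(phrases(src.split()), phrases(tgt.split())))

translations = {}
for (src_phrase, tgt_phrase), freq in cooc.most_common():
    translations.setdefault(src_phrase, (tgt_phrase, freq))  # greedy: strongest pair wins

print(translations["das haus"])
```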

2013

Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is. Since the first Web as Corpus Workshop organised at the Corpus Linguistics 2005 Conference, a highly successful series of yearly Web as Corpus workshops provides a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora. After a stronger focus on application-oriented natural language processing and Web technology in recent years, with workshops taking place at NAACL-HLT 2010 and 2011 and at WWW 2012, the 8th Web as Corpus Workshop returns to its roots in the corpus linguistics community. Accordingly, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation of Web text. The workshop brings together presentations on all aspects of building, using and evaluating Web corpora, with a particular focus on the following topics: applications of Web corpora and other Web-derived data sets for language research; automatic linguistic annotation of Web data such as tokenisation, part-of-speech tagging, lemmatisation and semantic tagging (the accuracy of currently available off-the-shelf tools is still unsatisfactory for many types of Web data); critical exploration of the characteristics of Web data from a linguistic perspective and its applicability to language research; and presentation of Web corpus collection projects or software tools required for some part of this process (crawling, filtering, de-duplication, language identification, indexing, …).

2013

Graphical tools to organise and represent knowledge are useful in terminology work to facilitate building concept systems. Creating and maintaining hierarchically structured concept relation maps while manually gathering data for terminological databases helps to gain and maintain an overview of concept relations, supports terminology work in groups, and helps new team members catch up on the subject field. This article describes our approach to support the building of concept systems in comparative legal terminology using the concept mapping software CmapTools (IHMC): we build hierarchically structured concept relation maps, where linking lines with arrowheads between concepts of the same legal system represent generic-specific relations, and combined concept relation maps, where dashed lines without arrowheads connect similar concepts in different legal systems.

2012

We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate translations. This process manages to acquire multiple translations combining phrases of equal or different sizes. The preliminary evaluation performed confirms both its potential and its interest.

2012

Developing content extraction methods for Humanities domains raises a number of challenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humanities scholars is essential to address these challenges. We discuss an annotation schema for Archaeological texts developed in collaboration with domain experts. Its development required a number of iterations to make sure all the most important entity types were included, as well as addressing challenges including a domain-specific handling of temporal expressions, and the existence of many systematic types of ambiguity.

2011

Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text, and distinguishing it from related languages in the same region, can be difficult. For example, less dominant variants of English (e.g. New Zealander, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms and of several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.
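To illustrate the bootstrapping step, the sketch below keeps a web document as a Hiberno-English candidate if it contains a minimum number of Ireland-specific seed terms; the seed list and threshold are placeholders, not the ones used to build the corpus.

```python
# Sketch: seed-based filtering. Keep documents containing at least MIN_HITS distinct
# Ireland-specific seed terms. Seed list and threshold are placeholders.
import re

SEEDS = {"garda", "taoiseach", "craic", "dáil", "gaeltacht", "fianna fáil"}
MIN_HITS = 2

def is_candidate(text, seeds=SEEDS, min_hits=MIN_HITS):
    lowered = text.lower()
    hits = sum(1 for s in seeds
               if re.search(r"\b" + re.escape(s) + r"\b", lowered))
    return hits >= min_hits

doc = "The Taoiseach addressed the Dáil after meeting Garda representatives."
print(is_candidate(doc))  # True: three seed terms occur
```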

2011

Most existing HLT pipelines assume the input is pure text or, at most, HTML, and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library and other types of applications, and we show that it is relatively straightforward to extend existing pipelines so that the structure of a document is preserved.

2010

2009

Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extraction, providing unified data access in every processing step. Stable data storage and collaboration control scripts for group annotations of massive corpora are provided via a Web interface coupled with an HTTP proxy. A modular interface allows for linguistic and visual data feature extractor plugins. The implementation is suitable for many tasks in the Web as Corpus domain and beyond.

2009

Unpublished

This thesis discusses the KrdWrd Project. The project’s goals are to provide tools and infrastructure for the acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operates on the visual rendering of pages, and provides visual tools for the inspection of results.

2007

2005

Final Report of the one-year cooperation between the Universities of Osnabrück and Hildesheim and the aircraft manufacturer AIRBUS to research methodologies and technologies for analysing and structuring the huge amount of documentation produced during aircraft construction. The work was done in a study project carried out in close cooperation with seven students of cognitive science advised by two lecturers of the Institute of Cognitive Science of the University of Osnabrück, and with one student of international information management advised by one professor of the Institute of Applied Linguistics of the University of Hildesheim.