In recent years, the reproducibility of scientific research has increasingly come into focus, both by ex ternal stakeholders (e.g. funders) and by the research communities themselves. Corpus linguistics, with its methods for creating, processing and analysing corpora, is an integral part of many other disciplines that work with language data and therefore plays a special role. Moreover, language corpora are often living objects that are regularly improved and revised. At the same time, tools for the automatic processing of human language are also being developed further, which can lead to different results with the same processing steps and the same data. This article argues that modern software technologies, such as version control and containerisation, can mitigate the following problems: Software packaging, installation and execution and, equally important, the tracking of corpus modifications throughout its life-cycle. All in all, this leads to transparency of changes to raw data and software tools and thereby enhanced reproducibility.
In this paper, we report on the shared task on metaphor identification on VU Amsterdam Metaphor Corpus and on a subset of the TOEFL Native Language Identification Corpus. The shared task was conducted as apart of the ACL 2020 Workshop on Processing Figurative Language.
This paper describes the adaptation and application of a neural network system for the automatic detection of metaphors. The LSTM BiRNN system participated in the shared task of metaphor identification that was part of the Second Workshop of Figurative Language Processing (FigLang2020) held at the Annual Conference of the Association for Computational Linguistics (ACL2020). The particular focus of our approach is on the potential influence that the metadata given in the ETS Corpus of Non-Native Written English might have on the automatic detection of metaphors in this dataset. The article first discusses the annotated ETS learner data, highlighting some of its peculiarities and inherent biases of metaphor use. A series of evaluations follow in order to test whether specific metadata influence the system performance in the task of automatic metaphor identification. The system is available under the APLv2 open-source license.
For almost fifteen years, the ACL SIGWAC, and most notably the Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and use of webderived corpora as well as computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW). In corpus linguistics and theoretical linguistics, the World Wide Web has become increasingly popular as a source of linguistic evidence, especially in the face of data sparseness or the lack of variation in traditional corpora of written language. In lexicography, web data have become a major and wellestablished resource with dedicated research data and specialised tools. In other areas of theoretical linguistics, the adoption rate of web corpora has been slower but steady. Furthermore, some completely new areas of linguistic research dealing exclusively with web (or similar) data have emerged, such as the construction and utilisation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – “text type”, as well as topic area. In computational linguistics, web corpora have become an established source of data for the creation of language models, word embeddings, and all types of machine learning. The 12th Web as Corpus workshop (WAC-XII) looks at the past, present, and future of web corpora given the fact that large web corpora are nowadays provided mostly by a few major initiatives and companies, and the diversity of the early years appears to have faded slightly. Also, we acknowledge the fact that alternative sources of data (such as data from Twitter and similar platforms) have emerged, some of them only available to large companies and their affiliates, such as linguistic data from social media and other forms of the deep web. At the same time, gathering interesting and relevant web data (web crawling) is becoming an ever more intricate task as the nature of the data offered on the web changes (for example the death of forums in favour of more closed platforms).
In this article, we examine the current situation of data dissemination and provision for CMC corpora. By that we aim to give a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data management, we evaluate the 20 European CMC corpora listed in the CLARIN CMC Resource family, individuate successful strategies among the existing corpora and establish best practices for future projects. We give an overview of existing approaches to data referencing, dissemination and provision in European CMC corpora, and discuss the methods, formats and strategies used. Furthermore, we discuss the need for community standards and offer recommendations for best practices when creating a new CMC corpus.
This paper is describing the needs and technological preconditions of the CLARIN ERIC infrastructure. It introduces how containerization using Docker can help to meet these requirements and fleshes out the build and deployment workflow that CLARIN ERIC is employing to ensure that all the goals of their infrastructure are met in an efficient and sustainable way. In a second step, it is also shown how these same workflows can help researchers, especially in the fields of computational and corpus linguistics, to provide for more easily reproducible research by creating a virtual environment that can provide specific versions of data, programs and algorithms used for certain research questions and make sure that the exact same versions can still be used at a later stage to reproduce the results.
In this article we give an overview of first-hand experiences and starting points for best practices from projects in seven European countries dedicated to learner corpus research and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, and the careful preparation and easy retrieval, and reusability of corpora and tools has likewise become more important. But with a lack of agreed solutions for many aspects of LCR, interoperability between learner corpora or exchanging data from different learner corpus projects is still challenging. We will illustrate how concepts like metadata, anonymization, error taxonomies and linguistic annotations, as well as tools, toolchains or data formats can individually pose challenges and how they might be solved.
The goal of the STyrLogism Project is to semi-automatically extract neologism candidates (new lexemes) for the German standard variety used in South Tyrol, and generally create the basis for long-term monitoring of its development. We use automatic lexico-semantic analytics for the lexicographic processing, but instead of continuing to develop our independent neologism detection application, we have recently become part of a thriving community of users and developers within the EU infrastructure project ELEXIS, which aims to harmonise efforts that relate to producing and making dictionary resources available, and to develop tools with consistent standards and increased interoperability. Consequently, we moved the development of our neologism application into Lexonomy, one of ELEXIS' promoted open-source projects. In the following, we report on the current state of this ongoing development by describing how we integrate our work with the Sketch Engine and Lexonomy tools, pointing out the challenges involved, and discussing how our work on language varieties can be evaluated.
In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research is managed and preserved in a way that research results are reproducible. In order to account for this the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. This article investigates 24 European CMC corpora with regard to their compliance with the FAIR principles and discusses to what extent the deposit of research data in repositories of data preservation initiatives such as CLARIN, Zenodo or Metashare can assist in the provision of FAIR corpora.
In recent years, the reproducibility of scientific research has more and more come into focus, both from external stakeholders (e.g. funders) and from within research communities themselves. Corpus linguistics and its methods, which are an integral component of many other disciplines working with language data, play a special role here – language corpora are often living objects: they are constantly being improved and revised, and at the same time, the tools for the automatic processing of human language are also regularly updated, both of which can lead to different results for the same processing steps. This article argues that modern software technologies such as version control and containerization can address both issues, namely make reproducible the process of software packaging, installation, and execution and, more importantly, the tracking of corpora throughout their life cycle, thereby making the changes to the raw data reproducible for many subsequent analyses.
Communication between humans via networked devices has become an everyday part of people’s lives across generations, cultures, geographical areas, and social classes. Shaped by the specific social and technical context in which it is produced, synchronous and asynchronous computer-mediated communication (CMC) has become increasingly participatory, interactive, and multimodal. User interactions and user-generated social media content offer a wide range of research opportunities for a growing multidisciplinary research community. This edited volume combines methodological papers that focus on building and annotating CMC corpora and papers that offer a sociolinguistic analysis of different CMC corpora. The diversity of languages represented in the corpora include Arabic, French, German, Italian, English and Slovenian. In fact, the increasingly multilingual nature of CMC data is a recurring theme throughout the volume, as are the references to the importance of and compliance with standards for CMC corpora development in order to facilitate (the?) re-examination of corpora for reproducibility, and for other areas and objectives of investigation. All but one paper are extended papers from the 2017 edition of the CMC and Social Media Corpora Conference held in Bolzano, Italy where the community met to discuss themes that related to the interaction between language, CMC, and society.
Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, there is a need for language identification. This study compares the performance of human annotators with automatic ways of language identification on a multilingual (mainly German-Italian-English) social media data set collected in Italy (i.e. South Tyrol). Our results indicate that humans and NLP systems follow their individual techniques to make a decision about multilingual text messages. This results in low agreement when different annotators or NLP systems execute the same task. In general, annotators agree with each other more than NLP systems. However, there is also variation in human agreement depending on the prior establishment of guidelines for the annotation task or not.
The goal of the project STyrLogisms is to semi-automatically extract neologism (new lexemes) candidates for the German standard variety used in South Tyrol. We use a list of manually vetted URLs from news, magazines and blog websites of South Tyrol and regularly crawl their data, clean and process it and compare this new data to reference corpora and additional regional word lists and the formerly crawled data sets. Our reference corpora are DECOW14 with around 60m types, and the South Tyrolean Web Corpus with around 2.4m types; the additional word lists consist of named entities, terminological terms from the region, and specific terms of the German standard variety used in South Tyrol (altogether around 53k unique types). Here, we will report on the employed method, a first round of candidate extraction with an approach for a classification schema for the selected candidates, and some remarks on a second extraction round.
This article describes the system that participated in the shared task (ST) on metaphor detection on the Vrije University Amsterdam Metaphor Corpus (VUA). The ST was part of the workshop on processing figurative language at the 16th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL2018). The system combines a small assertion of trending techniques, which implement matured methods from NLP and ML; in particular, the system uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in a LSTM BiRNN architecture. The system is available under the APLv2 open-source license.
The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.
This volume presents the proceedings of the 5th edition of the annual conference series on CMC and Social Media Corpora for the Humanities (cmc-corpora2017). This conference series is dedicated to the collection, annotation, processing, and exploitation of corpora of computer-mediated communication (CMC) and social media for research in the humanities. The annual event brings together language-centered research on CMC and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. The 5th Conference on CMC and Social Media Corpora for the Humanities was held at Eurac Research on October, 4th and 5th, in Bolzano, Italy. This volume contains extended abstracts of the invited talks, papers, and extended abstracts of posters presented at the event. The conference attracted 26 valid submissions. Each submission was reviewed by at least two members of the scientific committee. This committee decided to accept 16 papers and 8 posters of which 14 papers and 3 posters were presented at the conference. The programme also includes three invited talks: two keynote talks by Aivars Glaznieks (Eurac Research, Italy) and A. Seza Doğruöz (Independent researcher) and an invited talk on the Common Language Resources and Technology Infrastructure (CLARIN) given by Darja Fišer, the CLARIN ERIC Director of User Involvement.
The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped like.
This paper describes an extended version of the KoKo corpus (version KoKo4, Dec 2015), a corpus of written German L1 learner texts from three different German-speaking regions in three different countries. The KoKo corpus is richly annotated with learner language features on different linguistic levels such as errors or other linguistic characteristics that are not deficit-oriented, and is enriched with a wide range of metadata. This paper complements a previous publication (Abel et al., 2014a) and reports on new textual metadata and lexical annotations and on the methods adopted for their manual annotation and linguistic analyses. It also briefly introduces some linguistic findings that have been derived from the corpus.
This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language EVALITA 2016. The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small assertion of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in a LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabbelled data (Italian C4Corpus and PAISA') were used for training. The system is available under the APLv2 open-source license.
The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data has been manually anonymised and provides manually corrected part-of-speech tags for the Italian language texts and manually normalised data for German texts. Moreover, it is annotated with user-provided socio-demographic data (among others L1, gender, age, education, and internet communication habits) from a questionnaire, and linguistic annotations regarding CMC phenomena, languages and varieties. The anonymised corpus is freely available for research purposes.
ENeL’s WG3 concerns innovative e-dictionaries with a focus on the development of digitally born dictionaries. The training school 2016 in Ljubljana (SI), May 17-20, introduced participants, among others, to collecting, analysing, and automatically extracting data from web corpora. Albeit related, the task of processing data from corpora of computer-mediated communication and social media interactions (henceforth referred to as CMC) has been deliberately excluded from the training school’s programme. But we know that “new vocabulary is characteristic for CMC discourse, e.g. ‘funzen’ (an abbreviated variant of the German verb ‘funktionieren’, en.: ‘to function’) or ‘gruscheln’ (verb denoting a function of a German social network platform, most likely a blending of ‘grüßen’, en.: ‘to greet’ and ‘kuscheln’, en.: ‘to cuddle’)” and therefore relevant to WG3; the goal of this STSM is to apply the methods and tools from the training school to CMC data.
This article describes the system that participated in the Part-of-speech tagging subtask of the “EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media”. The system combines a small assertion of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in a LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.
The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics). WAC-X also featured the final workshop of the EmpiriST 2015 shared task “Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media” (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion “Corpora, open science, and copyright reforms” (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).
This paper presents the DiDi Corpus, a corpus of South Tyrolean Data of Computer-mediated Communication (CMC). The corpus comprises around 650,000 tokens from Facebook wall posts, comments on wall posts and private messages, as well as socio-demographic data of participants. All data was automatically annotated with language information (de, it, en and others), and manually normalised and anonymised. Furthermore, semi-automatic token level annotations include part-of-speech and CMC phenomena (e.g. emoticons, emojis, and iteration of graphemes and punctuation). The anonymised corpus without the private messages is freely available for researchers; the complete and anonymised corpus is available after signing a non- disclosure agreement.
We present an abstract and generic workflow, and detail how it has been implemented to build and annotate learner corpora. This workflow has been developed through an interdisciplinary collaboration between linguists, who annotate and use corpora, and computational linguists and computer scientists, who are responsible for providing technical support and adaptation or implementation of software components.
This article focuses on automatic text classification which aims at identifying the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. In order to explore this issue further, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), which involves a sample of 12,100 essays written by participants in the TOEFL® test from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). The article will show our results in automatic L1 detection in the TOEFL11 corpus. These results are discussed in light of relevant transfer features which turned out to be particularly informative for automatic detection of L1 German and L1 Italian.
In this paper, we present ongoing experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where the approach currently stands with real data.
Decisions at the outset of preparing a learner corpus are of crucial importance for how the corpus can be built and how it can be analysed later on. This paper presents a generic workflow to build learner corpora while taking into account the needs of the users. The workflow results from an extensive collaboration between linguists that annotate and use the corpus and computer linguists that are responsible for providing technical support. The paper addresses the linguists' research needs as well as the availability and usability of language technology tools necessary to meet them. We demonstrate and illustrate the relevance of the workflow using results and examples from our L1 learner corpus of German (“KoKo”).
In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public computer-mediated communication, we will present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding meta data. Finally, we will discuss positive and negative implications for this method.
In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISÀ corpus of Italian web texts, interHist aims at facilitating the exploration of large results sets to linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports the user-steered navigation by means of interactive filtering. It allows to dynamically switch between an overview on the data and a detailed view on results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist, by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.
We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the quality of the performed transcriptions and annotations shows an accuracy of orthographic error annotations of approximately 80% as well as high accuracy of transcriptions ($>$ 99%), automatic tokenisation ($>$ 99%), sentence splitting ($>$ 96%) and POS-tagging ($>$ 94%). The KoKo corpus will be published at the end of 2014 and be the first accessible linguistically annotated German L1 learner corpus. It will represent a valuable source for research and teaching on German as L1 language, in particular with regards to writing skills.
In this article, we present the multi-faceted interface to the open PAISÀ corpus of Italian. Created within the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) , the corpus is designed to be freely available for non-commercial processing, usage and distribution by the public. Hence, this automatically annotated corpus (for lemma, part-of-speech and dependency information) is exclusively composed of documents licensed under Creative Commons (CC) licenses .The dedicated corpus interface is designed to provide flexible, powerful, and easy-to-use modes of corpus access, with the objective to support language learning, language practicing and linguistic analyses. We present in detail the interface’s functionalities and discuss the underlying design decisions. We introduce the four principal components of the interface, describe supported display formats and present two specific features added to increase the interface’s relevance for language learning. The main search components are (1) a basic search that adopts a “Google-style” search box, (2) an advanced search that provides elaborated graphical search options, and (3) a search that makes use of the powerful CQP query language of the Open Corpus Workbench . In addition, (4) a filter interface for retrieving full-text corpus documents based on keyword searches is available. It is likewise providing the means for building temporary sub-corpora for specific topics. Users can choose among different display formats for the search results. Besides the established KWIC (KeyWord In Context) and full sentence views, graphical representations of the dependency relation information as well as keyword distributions are available. These dynamic displays are based on a visualisation for dependency graphs  and one for Word Clouds , which build on latest developments in information visualisation for language data. Two special features for novice learners are integrated into each search component. The first feature is a function for restricting search results to sentences of limited complexity. Search results are automatically filtered based on formal text characteristics such as sentence length, vocabulary, etc. The second is the supply of pre-defined search queries for linguistic constructions such as sentences in passive voice, questions, etc. Finally, we show how the PAISÀ interface can be employed in different language teaching tasks. In particular, we present a complete unit of work aimed at learners of Italian (CEFR level A2/B1) and centered on students’ direct use of the interface and its functionalities. By doing so, we are giving concrete examples for targeted searches and interactions with the provided language material, as well as an exemplification of how the use of the corpus can be integrated with communicative language activities in the classroom.
In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple translations without size criteria, i.e. phrases can have several translations, can be of any size, and their size is not considered when selecting their translations. Even though the process is in an early development stage and has much room for improvements, evaluation shows that it yields phrase translations of high precision that are relevant to machine translation but also to a wider set of applications including memory-based translation or multi-word acquisition.
Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corpus is. Since the first Web as Corpus Workshop organised at the Corpus Linguistics 2005 Conference, a highly successful series of yearly Web as Corpus workshops provides a venue for interested researchers to meet, share ideas and discuss the problems and possibilities of compiling and using Web corpora. After a stronger focus on application-oriented natural language processing andWeb technology in recent years with workshops taking place at NAACL-HLT 2010, 2011 andWWW2012 the 8thWeb as Corpus Workshop returns to its roots in the corpus linguistics community. Accordingly, the leading theme of this workshop is the application of Web data in language research, including linguistic evaluation of Web-derived corpora as well as strategies and tools for high-quality automatic annotation ofWeb text. The workshop brings together presentations on all aspects of building, using and evaluating Web corpora, with a particular focus on the following topics: applications of Web corpora and other Web-derived data sets for language research automatic linguistic annotation of Web data such as tokenisation, part-of-speech tagging, lemma- tisation and semantic tagging (the accuracy of currently available off-the-shelf tools is still unsatisfactory for many types of Web data) critical exploration of the characteristics of Web data from a linguistic perspective and its applica- bility to language research presentation of Web corpus collection projects or software tools required for some part of this process (crawling, filtering, de-duplication, language identification, indexing, …)
Graphical tools to organise and represent knowledge are useful in terminology work to facilitate building concept systems. Creating and maintaining hierarchically structured concept relation maps while manually gathering data for terminological databases helps to gain and maintain an overview of concept relations, supports terminology work in groups, and helps new team members catching up on the subject field. This article describes our approach to support the building of concept systems in comparative legal terminology using the concept mapping software CmapTools (IHMC): we build hierarchically structured concept relation maps where linking lines with arrowheads between concepts of the same legal system represent generic-specific relations, and combined concept relation maps where dashed lines without arrowheads connect similar concepts in different legal systems.
We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate translations. This process manages to acquire multiple translations combining phrases of equal or different sizes. The preliminary evaluation performed confirms both its potential and its interest.
Developing content extraction methods for Humanities domains raises a number of chal- lenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humani- ties scholars is essential to address these chal- lenges. We discuss an annotation schema for Archaeological texts developed in collabora- tion with domain experts. Its development re- quired a number of iterations to make sure all the most important entity types were included, as well as addressing challenges including a domain-specific handling of temporal expres- sions, and the existence of many systematic types of ambiguity.
Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text, and distinguishing it from related languages in the same region can be difficult. For example less dominant variants of English (e.g. New Zealander, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism, or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms, and several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.
Most existing HLT pipelines assume the input is pure text or, at most, HTML and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library and other types of applications, and show that it is relatively straightforward to extend existing pipelines to achieve ones in which the structure of a document is preserved.
Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extraction, providing unified data access in every processing step. Stable data storage and collaboration control scripts for group annotations of massive corpora are provided via a Web interface coupled with a HTTP proxy. A modular interface allows for linguistic and visual data feature extractor plugins. The implementation is suitable for many tasks in theWeb as corpus domain and beyond.
This thesis discusses the KrdWrd Project. The Project goals are to provide tools and infrastructure for acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operate on the visual rendering of pages, and provide visual tools for inspection of results.
Final Report of the one year cooperation between the Universities of Osnabrück and Hildesheim, and the aircraft manufacturer AIRBUS to research methodologies and technologies to analyze and structure the huge amount of documentation produced during aircraft construction. The work was done in a study project carried out in close cooperation with seven students of cognitive science advised by two lectures of the Institute of Cognitive Science of the University of Osnabrück and with one student of international information management advised by one professor of the Institute of Applied Linguistics of the University of Hildesheim.