Background: Previous research has shown a positive information gain when using word embeddings from learner corpus data for metaphor classification in a neural network (Stemle & Onysko, 2018). Aim: Explore the potential influence of the data structure in the annotated part of the ETS Corpus of Non-Native Written English, with a particular focus on proficiency ratings, essay prompts, and the L1 of the learner. System: fastText word embeddings from different corpora in a bi-directional recurrent neural network with long short-term memory (LSTM BiRNN); a flat sequence-to-sequence neural network with one hidden layer using TensorFlow+Keras (Abadi et al., 2015) in Python.
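The token-level setup can be sketched in Keras as follows; the hyperparameters below (vocabulary size, embedding dimension, sequence length, hidden units) are illustrative placeholders, not the values used in the actual system, which initializes the embedding layer with pre-trained fastText vectors:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative placeholders; the real system uses pre-trained fastText
# embeddings and its own hyperparameters.
VOCAB_SIZE, EMB_DIM, MAX_LEN, HIDDEN = 10_000, 300, 50, 64

# Token-level tagger: each token is classified as metaphorical or literal.
model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM),  # would be fastText-initialized
    layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One probability per token position
probs = model.predict(np.ones((2, MAX_LEN)), verbose=0)
print(probs.shape)  # (2, 50, 1)
```

The `Bidirectional` wrapper with `return_sequences=True` is what makes this a sequence-to-sequence tagger: it emits one label per input token rather than one label per sentence.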
In recent years, the reproducibility of scientific research has become increasingly important, both for external stakeholders and for the research communities themselves. All of them demand that empirical data collected and used for scientific research be managed and preserved in a way that makes research results reproducible. To this end, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. Natural language processing and its methods play a special role here, as they are an integral part of many other disciplines working with language data: language corpora are often living objects, constantly being improved and revised, while the processing tools are also regularly updated, which can lead to different results for the same processing steps. In this presentation I will first investigate CMC corpora, which resemble language learner corpora in some core aspects, with regard to their compliance with the FAIR principles, and discuss to what extent depositing research data in repositories of data preservation initiatives such as CLARIN, Zenodo, or META-SHARE can assist in the provision of FAIR corpora. Second, I will show some modern software technologies and how they make software packaging, installation, and execution, and, more importantly, the tracking of corpora throughout their life cycle, reproducible. This in turn makes changes to raw data reproducible for many subsequent analyses.
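One family of such technologies is containerization. The following hypothetical Dockerfile sketches the idea: by pinning the base system and every tool version, the same processing steps can be re-run later with identical results (the script name and dependency versions here are illustrative assumptions, not those of an actual pipeline):

```dockerfile
# Pin the base system so the pipeline is re-runnable years later.
FROM python:3.9-slim

# Hypothetical, version-pinned processing dependency
RUN pip install --no-cache-dir spacy==3.0.6

# The corpus-processing script travels with the environment.
COPY process_corpus.py /app/
WORKDIR /app
ENTRYPOINT ["python", "process_corpus.py"]
```

Archiving the image (or the Dockerfile plus pinned versions) alongside the corpus makes each stage of the corpus life cycle repeatable.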
In this talk, I will report on our previous work (Abel and Stemle 2018) and will relate it to our current work that is conducted as part of our institution’s observer status in the European Lexicographic Infrastructure (ELEXIS) project (Krek et al. 2018). ELEXIS features the One-Click Dictionary tool chain to automatically generate, for example, headword lists, word (and other lexical unit) senses, definitions, and corpus-based examples. The tool chain consists of the corpus query system Sketch Engine (Kilgarriff et al. 2014) and the dictionary writing system Lexonomy (Měchura 2017); together they are supposed to support lexicographers along the entire pipeline of producing a dictionary, from corpus to screen, where dictionaries are pre-generated automatically from a corpus (using the Sketch Engine) and then post-edited (using Lexonomy).
The goal of the project STyrLogisms is to semi-automatically extract neologism candidates (new lexemes) for the German standard variety used in South Tyrol. We use a list of manually vetted URLs from news, magazine, and blog websites of South Tyrol, regularly crawl their data, clean and process it, and compare the new data to reference corpora, additional regional word lists, and the previously crawled data sets. Our reference corpora are DECOW14, with around 60m types, and the South Tyrolean Web Corpus, with around 2.4m types; the additional word lists consist of named entities, terminological terms from the region, and specific terms of the German standard variety used in South Tyrol (altogether around 53k unique types). Here, we will report on the employed method, a first round of candidate extraction together with an approach to a classification schema for the selected candidates, and some remarks on a second extraction round.
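The comparison step can be sketched as a simple set-based filter. The function below is a hypothetical simplification of that step (the real pipeline also involves cleaning, normalization, and frequency thresholds), and the toy tokens are illustrative only:

```python
from collections import Counter

def neologism_candidates(crawled_tokens, reference_vocab, extra_lists, previous_crawls):
    """Keep tokens from a new crawl that are attested neither in the
    reference corpora, nor in the additional word lists, nor in earlier crawls."""
    known = {w.lower() for w in reference_vocab} | {w.lower() for w in previous_crawls}
    for lst in extra_lists:
        known |= {w.lower() for w in lst}
    # Count the surviving tokens so rare noise can later be filtered by frequency.
    return Counter(t for t in crawled_tokens
                   if t.isalpha() and t.lower() not in known)

# Toy example: only "Törggelen" is unknown to every resource.
cands = neologism_candidates(
    ["Törggelen", "und", "das", "Watten", "und"],
    reference_vocab={"und", "das"},
    extra_lists=[{"Watten"}],
    previous_crawls=set(),
)
print(cands)  # Counter({'Törggelen': 1})
```

Everything that survives the filter is only a candidate; the subsequent manual classification step decides whether it is a genuine neologism.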
This talk gives an overview of our contribution to the NAACL 2018 Workshop on Figurative Language Processing.
In artistic practice and teaching, and in the discourse on the meaning and interpretation of “images”, aspect seeing plays a fundamental but often underestimated role. The concept of “aspect seeing” (Aspektsehen) is carried over from Ludwig Wittgenstein’s philosophy of language and lays bare the (conceptual and contextual) construction of an image from the perspective of its understanding. Diverse views on the concept of the image, on ambiguity of meaning, subjectivity, perspective, and the saying-showing dichotomy complement one another here. In close collaboration with the artist Frans Oosterhof, the linguist Dorothea Franck, and the cognitive scientist Egon Stemle, transdisciplinary methods are developed that make aspect seeing usable in artistic and discursive practice. First foundations were laid in the weekly Y-Experimental “Bilder kippen!” at the HKB, where they were coupled to artistic practice and teaching. The project benefits from the work of the Dutch artist collective Instituut Houtappel, whose archive is available exclusively for this research.
Learner corpora form a fundamental basis for a considerable part of the research activities of the Institute for Applied Linguistics. The project aims at enhancing the research potential of the Institute by creating an increasingly efficient infrastructure for the collection, processing, and maintenance of learner corpora.