Rexha Andi, Kröll Mark, Kern Roman

Multilingual Open Information Extraction using Parallel Corpora: The German Language Case

ACM Symposium on Applied Computing, Hisham M. Haddad, Roger L. Wainwright, ACM, 2018

In the past decade, the research community has been continuously improving the extraction quality of Open Information Extraction systems. This was done mainly for the English language; other languages such as German or Spanish followed, using shallow or deep parsing information to derive language-specific patterns. More recent efforts focused on language-agnostic approaches in an attempt to become less dependent on available tools and resources in that language. In line with these efforts, we present a language-agnostic approach which exploits manually aligned corpora as well as the solid performance of English OpenIE tools.

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Authorship Identification of Documents with High Content Similarity

Scientometrics, Wolfgang Glänzel, Springer Link, 2018

Our work is inspired by the task of associating segments of text with their real authors. We focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior. Unlike the majority of the work done in this field (e.g., authorship attribution, plagiarism detection), which uses content features, we focus only on the stylometric, i.e., content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features, while in the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (i) help to improve algorithms such as automatic authorship attribution and plagiarism detection, (ii) assist forensic experts or linguists in creating profiles of writers, (iii) support intelligence applications in analyzing aggressive and threatening messages, and (iv) help editors enforce conformity with, for instance, a journal-specific writing style.
