During the last decades, the amount of information available to researchers has increased severalfold, making searches more difficult and creating the need for Information Retrieval (IR) systems. In this master thesis, a tool has been developed to create a dataset with metadata of scientific articles. The tool parses articles from PubMed, extracts metadata from them and stores the metadata in a relational database. Once all articles have been parsed, the tool generates three XML files from that metadata: Articles.xml, ExtendedArticles.xml and Citations.xml. The first file contains the title, authors and publication date of the parsed articles and of the articles referenced by them. The second contains the abstract, keywords, body and reference list of the parsed articles. Finally, Citations.xml contains the citations found within the articles together with their context. The tool has been used to parse 45,000 articles; afterwards, the database contained 644,906 articles with their title, authors and publication date. The articles of the dataset form a digraph in which the articles are the nodes and the references are the arcs. The in-degree of the network follows a power-law distribution: a small set of articles is referenced very often, while most articles are rarely referenced. Two IR systems have been developed to search the dataset: the Title Based IR and the Citation Based IR. The first compares the user's query to the titles of the articles, computes the Jaccard index as a similarity measure and ranks the articles according to their similarity. The second compares the query to the paragraphs in which the citations were found. The analysis of both IRs showed that the Citation Based IR needed a longer execution time; nevertheless, its recommendations were considerably better, which showed that parsing the citations was worthwhile.
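The Jaccard-based title ranking described above can be sketched as follows (a minimal illustration with hypothetical function names; the thesis' actual implementation and tokenization are not shown):

```python
def jaccard(a, b):
    """Jaccard index between two token sequences: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_by_title(query, titles):
    """Rank article titles by Jaccard similarity to the query,
    highest similarity first."""
    q = query.lower().split()
    scored = [(jaccard(q, t.lower().split()), t) for t in titles]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

titles = ["Information Retrieval Systems", "Graph Databases"]
print(rank_by_title("information retrieval", titles))
# "Information Retrieval Systems" ranks first (similarity 2/3)
```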

Open Information Extraction (OIE) targets the domain- and relation-independent discovery of relations in text, scalable to the Web. Although German is a major European language, no research has been conducted on German OIE yet. In this paper we fill this gap and present GerIE, the first German OIE system. As OIE has received increasing attention lately and various potent approaches have already been proposed, we surveyed to what extent these methods can be applied to the German language and which additional principles could be valuable in a new system. The most promising approach, hand-crafted rules working on dependency-parsed sentences, was implemented in GerIE. We also created two German OIE evaluation datasets, which showed that GerIE achieves at least 0.88 precision and recall on correctly parsed sentences, while errors made by the dependency parser can reduce precision to 0.54 and recall to 0.48.
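A hand-crafted extraction rule over a dependency parse, of the kind the abstract describes, can be sketched like this (toy data structures and a single subject-verb-object rule for illustration; these are not GerIE's actual rules):

```python
# One illustrative rule: extract (subject, relation, object) triples
# from a dependency-parsed sentence. `tokens` is a list of dicts with
# 'text', 'dep' (simplified Universal Dependencies label) and 'head'
# (index of the head token).
def extract_svo(tokens):
    triples = []
    for i, tok in enumerate(tokens):
        if tok["dep"] == "ROOT":  # the main verb of the sentence
            subjects = [t["text"] for t in tokens
                        if t["dep"] == "nsubj" and t["head"] == i]
            objects = [t["text"] for t in tokens
                       if t["dep"] == "obj" and t["head"] == i]
            for s in subjects:
                for o in objects:
                    triples.append((s, tok["text"], o))
    return triples

# "Siemens entwickelt Software" (Siemens develops software)
parsed = [
    {"text": "Siemens",    "dep": "nsubj", "head": 1},
    {"text": "entwickelt", "dep": "ROOT",  "head": 1},
    {"text": "Software",   "dep": "obj",   "head": 1},
]
print(extract_svo(parsed))  # → [('Siemens', 'entwickelt', 'Software')]
```

As the reported numbers suggest, a rule like this is only as good as the parse it receives: a wrong dependency label or head index silently suppresses or distorts the extracted triple.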

Networked data and structures are attracting growing interest and are pushing established methods of data management into the background. Graph databases offer a new approach to the challenges posed by the management of large and highly connected data sets. In this master thesis, the performance of graph databases is evaluated against an established relational database. Performance is determined through benchmark tests on the processing of highly connected data, taking into account an implemented fine-grained permission concept. The theoretical part first covers the fundamentals of databases and graph theory. These provide the basis for assessing the feature set and functionality of the graph databases selected for evaluation. The permission concepts described give an overview of different access models as well as of the implementation of access control in the graph databases. Based on the information gained, a Java framework is implemented that makes it possible to test both the graph databases and the relational database under the implemented fine-grained permission concept. By executing suitable test runs, the performance of read and write operations can be determined. Benchmark tests for write access are carried out on data sets of different sizes; a set of defined queries over the different data sizes allows the read performance to be determined. It turned out that the relational database scales better than the graph databases when writing data: creating nodes and edges in a graph database is more expensive than creating a new table row in a relational database.
The evaluation of the queries under the implemented access concept showed that graph databases scale significantly better than the relational database on large and highly connected data sets. The more highly connected the data, the more clearly the JOIN problem of the relational database becomes apparent.
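The JOIN problem mentioned above can be illustrated in a few lines (plain Python as a conceptual model, not an actual database benchmark): in the relational model every additional hop through the data is another join over the whole edge table, while a graph store follows materialized adjacency lists directly.

```python
# Edge table, as a relational database would store references.
edges = [(1, 2), (2, 3), (3, 4), (1, 5)]

def hops_relational(start, n):
    """Each hop re-scans the edge table -- one JOIN per hop."""
    frontier = {start}
    for _ in range(n):
        frontier = {dst for (src, dst) in edges if src in frontier}
    return frontier

# Graph model: adjacency lists are materialized once,
# and each hop is a direct lookup.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, set()).add(dst)

def hops_graph(start, n):
    frontier = {start}
    for _ in range(n):
        frontier = {d for s in frontier for d in adjacency.get(s, ())}
    return frontier

print(hops_relational(1, 2))  # {3}
print(hops_graph(1, 2))       # {3}
```

Both functions return the same answer, but the relational variant touches the full edge table once per hop, which is exactly the cost that grows with the degree of connectedness.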

Information validation is the process of determining whether a certain piece of information is true or false. Existing research in this area focuses on specific domains but neglects cross-domain relations. This work attempts to fill this gap and examines how various domains deal with the validation of information, providing a big picture across multiple domains. To this end, we study how research areas, application domains and their definitions of related terms in the field of information validation relate to each other, and show that there is no uniform use of the key terms. In addition, we give an overview of existing fact finding approaches, with a focus on the data sets used for evaluation. We show that even baseline methods already achieve very good results, and that more sophisticated methods often improve the results only when they are tailored to specific data sets. Finally, we present the first step towards a new dynamic approach for information validation, which generates a data set for existing fact finding methods on the fly by utilizing web search engines and information extraction tools. We show that, with some limitations, it is possible to use existing fact finding methods to validate facts without a preexisting data set. We generate four different data sets with this approach and use them to compare seven existing fact finding methods to each other. We discover that the performance of the fact validation process depends strongly on the type of fact to be validated as well as on the quality of the information extraction tool used.
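A typical baseline fact finding method of the kind referred to above is simple majority voting over sources. A minimal sketch (illustrative only; not necessarily one of the seven methods compared in this work):

```python
from collections import Counter

def majority_vote(claims):
    """Baseline fact finder: for each object, pick the value asserted
    by the most sources. `claims` maps source -> {object: value}."""
    votes = {}
    for source, facts in claims.items():
        for obj, value in facts.items():
            votes.setdefault(obj, Counter())[value] += 1
    return {obj: counter.most_common(1)[0][0]
            for obj, counter in votes.items()}

# Hypothetical claims extracted from three web sources.
claims = {
    "site_a": {"capital_of_germany": "Berlin"},
    "site_b": {"capital_of_germany": "Berlin"},
    "site_c": {"capital_of_germany": "Bonn"},
}
print(majority_vote(claims))  # → {'capital_of_germany': 'Berlin'}
```

More sophisticated fact finders replace the uniform vote with source trustworthiness scores that are estimated iteratively, which is precisely where tailoring to a specific data set tends to creep in.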

This thesis aims to shed light on the problem of early classification of time series by deriving the trade-off between classification accuracy and time series length for a number of different time series types and classification algorithms. Previous research on early classification of time series focused on keeping the classification accuracy of reduced time series roughly at the level of the complete ones. Furthermore, that research does not employ cutting-edge approaches such as Deep Learning. This work fills that gap by computing trade-off curves of classification "earliness" vs. accuracy and by empirically comparing algorithm performance in that context, with a focus on the comparison of Deep Learning with classical approaches. Such early classification trade-off curves are calculated for univariate and multivariate time series and the following algorithms: 1-Nearest Neighbor search with both the Euclidean and the Frobenius distance, 1-Nearest Neighbor search with forecasts from ARIMA and linear models, and Deep Learning. The results obtained indicate that early classification is feasible in all types of time series considered. The derived trade-off curves all share the common trait of decreasing slowly at first and featuring sharp drops as time series lengths become exceedingly short. The results showed that Deep Learning models were able to maintain higher classification accuracies for larger time series length reductions than the other algorithms. However, their long run-times, coupled with the complexity of parameter configuration, imply that faster, albeit less accurate, baseline algorithms such as 1-Nearest Neighbor search may still be a sensible choice on a case-by-case basis. This thesis draws its motivation from areas such as predictive maintenance, where the early classification of multivariate time series data may boost the performance of early warning systems, for example in manufacturing processes.
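The 1-Nearest Neighbor trade-off curve described above can be sketched as follows: both training and test series are truncated to a given prefix length, 1-NN with Euclidean distance classifies the truncated test series, and the resulting accuracy per length forms the curve (toy data for illustration; not the thesis' actual experiments):

```python
import math

def one_nn_predict(train, labels, query):
    """1-NN with Euclidean distance on equal-length series."""
    dists = [math.dist(series, query) for series in train]
    return labels[dists.index(min(dists))]

def earliness_curve(train, train_y, test, test_y, lengths):
    """Accuracy of 1-NN when train and test series are truncated
    to each prefix length -- one point of the trade-off curve each."""
    curve = {}
    for length in lengths:
        truncated = [s[:length] for s in train]
        correct = sum(
            one_nn_predict(truncated, train_y, q[:length]) == y
            for q, y in zip(test, test_y)
        )
        curve[length] = correct / len(test)
    return curve

# Toy data: "up" vs. "down" trends.
train = [[0, 1, 2, 3], [3, 2, 1, 0]]
train_y = ["up", "down"]
test = [[0, 1, 2, 3.5], [3, 2, 1, -0.5]]
test_y = ["up", "down"]
print(earliness_curve(train, train_y, test, test_y, [1, 2, 4]))
```

On real data the interesting question is at which prefix length the accuracy starts its sharp drop, which is exactly the shape of curve this thesis reports.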