Over the last decades, the amount of information available to researchers has increased severalfold, making searches more difficult. Thus, Information Retrieval (IR) systems are needed. In this master's thesis, a tool has been developed to create a dataset of metadata from scientific articles. The tool parses PubMed articles, extracts metadata from them and saves it in a relational database. Once all the articles have been parsed, the tool generates three XML files from that metadata: Articles.xml, ExtendedArticles.xml and Citations.xml. The first file contains the title, authors and publication date of the parsed articles and of the articles they reference. The second contains the abstract, keywords, body and reference list of the parsed articles. Finally, Citations.xml contains the citations found within the articles together with their context. The tool has been used to parse 45,000 articles. After parsing, the database contains 644,906 articles with their title, authors and publication date. The articles of the dataset form a digraph in which the articles are the nodes and the references are the arcs. The in-degree of the network follows a power-law distribution: a small set of articles is referenced very often, while most articles are rarely referenced. Two IR systems have been developed to search the dataset: the Title Based IR and the Citation Based IR. The first compares the user's query to the titles of the articles, computes the Jaccard index as a similarity measure and ranks the articles accordingly. The second compares the query to the paragraphs in which the citations were found. The analysis of both IRs showed that the Citation Based IR needs more execution time; nevertheless, its recommendations were much better, which showed that parsing the citations was worthwhile.
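As an illustration of the title-based ranking step, the following minimal sketch computes the Jaccard index between a query and article titles and sorts the articles by score. It is not the thesis implementation; the tokenization, function names and sample titles are assumptions.

    # Minimal sketch of Jaccard-based title ranking (illustrative only; the
    # thesis' tokenization and data model may differ).

    def tokens(text):
        """Lower-case whitespace tokenization into a set of terms."""
        return set(text.lower().split())

    def jaccard(a, b):
        """Jaccard index: size of the intersection over size of the union."""
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    def rank_by_title(query, titles):
        """Return (title, score) pairs sorted by similarity to the query."""
        q = tokens(query)
        scored = [(title, jaccard(q, tokens(title))) for title in titles]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    if __name__ == "__main__":
        titles = [
            "Information retrieval for biomedical literature",
            "Citation networks follow a power law",
            "Deep learning for image classification",
        ]
        for title, score in rank_by_title("biomedical information retrieval", titles):
            print(f"{score:.2f}  {title}")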

Open Information Extraction (OIE) targets domain- and relation-independent discovery of relations in text, scalable to the Web. Although German is a major European language, no research has been conducted on German OIE yet. In this paper we fill this gap and present GerIE, the first German OIE system. As OIE has received increasing attention lately and various potent approaches have already been proposed, we surveyed to what extent these methods can be applied to the German language and which additional principles could be valuable in a new system. The most promising approach, hand-crafted rules operating on dependency-parsed sentences, was implemented in GerIE. We also created two German OIE evaluation datasets, which showed that GerIE achieves at least 0.88 precision and recall on correctly parsed sentences, while errors made by the underlying dependency parser can reduce precision to 0.54 and recall to 0.48.
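To give a flavour of rule-based extraction over dependency parses, the sketch below applies one simple subject-verb-object rule to a German sentence. It is not GerIE's rule set; it assumes spaCy with the "de_core_news_sm" model, whose TIGER-style dependency labels include "sb" (subject) and "oa" (accusative object), so the labels would need adjusting for other parsers.

    # Illustrative single hand-crafted rule over a dependency parse, in the
    # spirit of (but not identical to) GerIE's approach.
    import spacy

    nlp = spacy.load("de_core_news_sm")  # assumed German model

    def extract_svo(sentence):
        """Extract (subject, verb lemma, object) triples around finite verbs."""
        doc = nlp(sentence)
        triples = []
        for token in doc:
            if token.pos_ in ("VERB", "AUX"):
                subjects = [c for c in token.children if c.dep_ == "sb"]
                objects = [c for c in token.children if c.dep_ == "oa"]
                for subj in subjects:
                    for obj in objects:
                        triples.append((subj.text, token.lemma_, obj.text))
        return triples

    print(extract_svo("Der Student liest das Buch."))
    # Expected, if the parser analyses the sentence correctly:
    # [('Student', 'lesen', 'Buch')]

As the abstract notes, the quality of such rules stands or falls with the quality of the underlying dependency parse.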

Social media monitoring has become an important means for business analytics and trend detection, for comparing companies with each other, and for maintaining a healthy customer relationship. While sentiment analysis for English is researched extensively, not much work has been done on German data. In this work we (i) annotate ~700 posts from 15 corporate Facebook pages, (ii) evaluate existing approaches capable of processing German data against the annotated data set, and (iii) due to the insufficient results, train a two-step hierarchical classifier capable of predicting posts with an accuracy of 70%. The first, binary classifier decides whether a post is opinionated. If the outcome is not neutral, the second classifier predicts the polarity of the document. Furthermore, we apply the algorithm in two application scenarios in which German Facebook posts are analyzed, in particular those of the fashion retail chain Peek&Cloppenburg and the Austrian railway operators OeBB and Westbahn.
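The two-step setup (subjectivity first, polarity second) can be sketched as follows. This is a toy illustration with scikit-learn on a handful of made-up posts, not the thesis' features, models or training data.

    # Hedged sketch of a two-step hierarchical sentiment classifier:
    # stage 1 separates neutral from opinionated posts, stage 2 assigns polarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data (hypothetical posts).
    posts = [
        "Öffnungszeiten am Montag?",
        "Super Service, danke!",
        "Total enttäuscht vom Support.",
    ]
    subjectivity = ["neutral", "opinionated", "opinionated"]
    polarity = {
        "Super Service, danke!": "positive",
        "Total enttäuscht vom Support.": "negative",
    }

    step1 = make_pipeline(TfidfVectorizer(), LogisticRegression())
    step1.fit(posts, subjectivity)

    opinionated = [p for p in posts if p in polarity]
    step2 = make_pipeline(TfidfVectorizer(), LogisticRegression())
    step2.fit(opinionated, [polarity[p] for p in opinionated])

    def classify(post):
        """Stage 1 decides neutral vs. opinionated; stage 2 assigns polarity."""
        if step1.predict([post])[0] == "neutral":
            return "neutral"
        return step2.predict([post])[0]

    print(classify("Ich liebe die neue Kollektion!"))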

Networked data and structures are attracting growing interest and are pushing established methods of data storage into the background. Graph databases offer a new approach to the challenges posed by managing large and highly connected data sets. In this master's thesis, the performance of graph databases is evaluated against that of an established relational database. Performance is determined by benchmark tests on the processing of highly connected data, taking into account an implemented fine-grained authorization concept. The theoretical part first covers the fundamentals of databases and graph theory. These provide the basis for assessing the feature set and functionality of the graph databases selected for evaluation. The authorization concepts described give an overview of different access models as well as of the implementation of access control in the graph databases. Based on the information gained, a Java framework is implemented that makes it possible to test both the graph databases and the relational database under the implemented fine-grained authorization concept. Executing suitable test runs allows the performance of write and read operations to be determined. Write benchmarks are carried out for data sets of different sizes; individual predefined queries for the different data sizes allow the read performance to be measured. It turned out that the relational database scales better than the graph databases when writing data: creating nodes and edges in a graph database is more expensive than creating a new table entry in the relational database. The evaluation of the queries under the implemented access concept showed that graph databases scale considerably better than the relational database for large and highly connected data sets. The higher the degree of connectivity of the data, the more evident the JOIN problem of the relational database becomes.
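To make the JOIN problem concrete, the following sketch (illustrative only, not the Java benchmark framework of the thesis) contrasts a two-hop neighbourhood query expressed as SQL self-joins with a simple in-memory traversal over adjacency lists; in a graph database this traversal-style access is native, which is what the read benchmarks above reflect.

    # Illustrative contrast between JOIN-based and traversal-based access
    # to connected data, using a tiny hypothetical edge set.
    import sqlite3

    edges = [(1, 2), (2, 3), (2, 4), (4, 5)]

    # Relational style: every additional hop is another self-join on the edge table.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    two_hop_sql = """
        SELECT DISTINCT e2.dst
        FROM edge e1 JOIN edge e2 ON e1.dst = e2.src
        WHERE e1.src = ?
    """
    print("SQL two-hop:", [row[0] for row in con.execute(two_hop_sql, (1,))])

    # Graph style: follow adjacency lists hop by hop, no joins required.
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)

    def neighbours(start, hops):
        frontier = {start}
        for _ in range(hops):
            frontier = {n for node in frontier for n in adj.get(node, [])}
        return frontier

    print("Traversal two-hop:", sorted(neighbours(1, 2)))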

The rising distribution of compact devices with numerous sensors in the last decade has led to an increasing popularity of tracking fitness and health data and storing those data sets in apps and cloud environments for further evaluation. However, this massive collection of data is becoming more and more interesting for companies seeking to reduce costs and increase productivity. All of this may have problematic impacts on people's privacy in the future. Hence, the main research question of this bachelor's thesis is: "To what extent are people aware of the processing and protection of their personal health data concerning the utilisation of various health tracking solutions?" This thesis investigates the historical development of personal fitness and health tracking, gives an overview of current options for users, and presents potential problems and possible solutions regarding the use of health tracking technology. Furthermore, it outlines the societal impact and legal issues. The results of an online survey on the distribution and usage of health tracking solutions, as well as the participants' views on privacy with regard to sharing data with service and insurance providers, advertisers and employers, are presented. Given the participants' fierce opposition to the various data sharing scenarios, these results underline the necessity and importance of data protection.

A mobile application is developed that supports music students in learning an instrument reflectively. The user should be able to determine their practice success through self-observation in order to subsequently find practice strategies that optimize their practice routine. In the short term, the application provides the user with interfaces for the different action phases of a practice session (pre-actional, actional and post-actional). With the help of guiding questions, or questions formulated by the user, practice is organized, structured, self-reflected upon and evaluated. Ideally, the user can also follow their learning process on the basis of audio recordings. In the long term, all user input can be retrieved again; it is displayed in a journal-like form and can be evaluated for self-reflection or together with a teacher.
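A possible data model for such a journal of practice sessions could look like the hypothetical sketch below; all class and field names are assumptions for illustration and do not describe the app's actual implementation.

    # Hypothetical journal data model for the three practice phases.
    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    @dataclass
    class PhaseEntry:
        phase: str              # "pre-actional", "actional" or "post-actional"
        guiding_question: str   # predefined or user-formulated question
        answer: str = ""

    @dataclass
    class PracticeSession:
        started_at: datetime
        entries: list = field(default_factory=list)
        recording_path: Optional[str] = None   # optional audio recording

    journal = [
        PracticeSession(
            started_at=datetime(2024, 3, 1, 17, 0),
            entries=[PhaseEntry("pre-actional", "What is today's goal?",
                                "Clean shifts in bar 12.")],
            recording_path="session_2024-03-01.wav",
        )
    ]
    for session in journal:
        print(session.started_at, [e.phase for e in session.entries])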

Information validation is the process of determining whether a certain piece of information is true or false. Existing research in this area focuses on specific domains but neglects cross-domain relations. This work attempts to fill this gap and examines how various domains deal with the validation of information, providing a big picture across multiple domains. To this end, we study how research areas, application domains and their definitions of related terms in the field of information validation relate to each other, and show that there is no uniform use of the key terms. In addition, we give an overview of existing fact finding approaches, with a focus on the data sets used for evaluation. We show that even baseline methods already achieve very good results, and that more sophisticated methods often improve the results only when they are tailored to specific data sets. Finally, we present the first step towards a new dynamic approach for information validation, which generates a data set for existing fact finding methods on the fly by utilizing web search engines and information extraction tools. We show that, with some limitations, it is possible to use existing fact finding methods to validate facts without a preexisting data set. We generate four different data sets with this approach and use them to compare seven existing fact finding methods to each other. We find that the performance of the fact validation process strongly depends on the type of fact that has to be validated as well as on the quality of the information extraction tool used.
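As an example of the kind of baseline method referred to above, the sketch below implements simple majority voting over source claims. It is illustrative only; the seven methods compared in the thesis and the data sets generated from web search results are not reproduced here, and the sample claims are made up.

    # Minimal voting-style fact finding baseline: the value asserted by the
    # most sources wins for each object.
    from collections import Counter, defaultdict

    # (source, object, value) triples, e.g. gathered from web search results
    # processed by an information extraction tool.
    claims = [
        ("siteA", "Mount Everest height", "8848 m"),
        ("siteB", "Mount Everest height", "8848 m"),
        ("siteC", "Mount Everest height", "8000 m"),
    ]

    def majority_vote(claims):
        """Return the most frequently asserted value for every object."""
        votes = defaultdict(Counter)
        for source, obj, value in claims:
            votes[obj][value] += 1
        return {obj: counter.most_common(1)[0][0] for obj, counter in votes.items()}

    print(majority_vote(claims))   # {'Mount Everest height': '8848 m'}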

The buzzword big data is ubiquitous and has a great impact on our everyday lives and many businesses. Since the beginnings of the financial market, the aim has been to find explanatory factors that contribute to the development of stock prices, and big data offers a new opportunity to do so. Gathering a vast amount of data concerning the financial market and filtering and analysing it is, of course, tightly tied to predicting future stock prices. A lot of work with noticeable outcomes has already been done in this field of research. However, the question arises whether it is possible to build a tool that indexes a large number of companies and news items and uses a natural language processing component suitable for everyday applications. The sentiment analysis tool utilised in this implementation is sensium.io. To achieve this goal, two main modules were built. The first is responsible for constructing a filtered company index and for gathering detailed information about the companies, for example news, balance sheet figures and stock prices. The second is responsible for preprocessing and analysing the collected data. This includes filtering unwanted news, translating them, calculating the text polarity and predicting the price development based on these facts. Using these modules, the optimal period for buying and selling shares was found to be three days, i.e. buying shares on the day of the news publication and selling them three days later. According to this analysis, the expected return is 0.07 percent per day, which might not seem much but corresponds to an annualised performance of 30.18 percent. The idea can also be applied in the opposite direction, telling the user when to sell shares, which could help an investor find the ideal time to sell their company shares.
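The trading rule itself is simple enough to sketch: buy at the close of the news publication day and sell a fixed number of days later. The following is a toy backtest on invented prices and news days; sensium.io, the real data and the return figures reported above are not reproduced here.

    # Hedged sketch of the "buy on the news day, sell k days later" rule on
    # synthetic data (hypothetical prices and news days).

    prices = [10.0, 10.1, 10.3, 10.2, 10.4, 10.5, 10.6]   # daily closing prices
    news_days = [0, 3]                                     # days with positive-sentiment news
    HOLD_DAYS = 3                                          # holding period found optimal in the thesis

    def strategy_returns(prices, news_days, hold_days):
        """Simple return of buying at the news day's close and selling hold_days later."""
        returns = []
        for day in news_days:
            sell_day = day + hold_days
            if sell_day < len(prices):
                returns.append(prices[sell_day] / prices[day] - 1.0)
        return returns

    rets = strategy_returns(prices, news_days, HOLD_DAYS)
    print([f"{r:.2%}" for r in rets])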

The in-depth analysis of time series has been a central topic of research in recent years. Many of the existing methods for finding periodic patterns and features require the user to input the season length of the time series. A few algorithms for automated season length approximation exist today, yet many of them rely on simplifications such as data discretization. This thesis aims to develop an algorithm for season length detection that is more reliable than existing methods. The process developed in this thesis estimates a time series' season length by interpolating, filtering and detrending the data and then analyzing the distances between zeros in the corresponding autocorrelation function. This method was tested against the only comparable open source algorithm and outperformed it by passing 94 out of 125 tests, while the existing algorithm only passed 62. The results do not necessarily suggest a superiority of the new autocorrelation-based method itself, but rather of the new implementation. Further studies might assess and compare the value of the theoretical concept.
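The core idea can be sketched as follows: remove the trend, compute the autocorrelation function, and read the season length off the spacing of the ACF's zero crossings (a periodic signal's ACF crosses zero roughly twice per period). The interpolation and filtering steps mentioned above are omitted, and the exact estimator used in the thesis may differ.

    # Sketch of season length estimation from ACF zero crossings.
    import numpy as np

    def estimate_season_length(series):
        x = np.asarray(series, dtype=float)
        t = np.arange(len(x))
        x = x - np.polyval(np.polyfit(t, x, 1), t)        # remove linear trend
        acf = np.correlate(x, x, mode="full")[len(x) - 1:]
        acf /= acf[0]                                     # normalise so acf[0] == 1
        zeros = np.where(np.diff(np.sign(acf)) != 0)[0]   # indices of zero crossings
        if len(zeros) < 2:
            return None
        # A sinusoid-like ACF crosses zero twice per period, so the season
        # length is about twice the typical distance between crossings.
        return 2 * int(np.median(np.diff(zeros)))

    t = np.arange(200)
    series = np.sin(2 * np.pi * t / 25) + 0.01 * t        # season length 25 plus trend
    print(estimate_season_length(series))                 # expect roughly 25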

This thesis aims to shed light on the problem of early time series classification by deriving the trade-off between classification accuracy and time series length for a number of different time series types and classification algorithms. Previous research on early classification of time series focused on keeping the classification accuracy of reduced time series roughly at the level of the complete ones. Furthermore, that work does not employ cutting-edge approaches like Deep Learning. This work fills that research gap by computing trade-off curves of classification "earliness" vs. accuracy and by empirically comparing algorithm performance in that context, with a focus on the comparison of Deep Learning with classical approaches. Such early classification trade-off curves are calculated for univariate and multivariate time series and the following algorithms: 1-Nearest Neighbor search with both the Euclidean and Frobenius distance, 1-Nearest Neighbor search with forecasts from ARIMA and linear models, and Deep Learning. The results obtained indicate that early classification is feasible in all types of time series considered. The derived trade-off curves all share the common trait of decreasing slowly at first and featuring sharp drops as time series lengths become exceedingly short. The results also show that Deep Learning models were able to maintain higher classification accuracies for larger reductions of the time series length than the other algorithms. However, their long run-times, coupled with the complexity of parameter configuration, imply that faster, albeit less accurate, baseline algorithms like 1-Nearest Neighbor search may still be a sensible choice on a case-by-case basis. This thesis draws its motivation from areas like predictive maintenance, where the early classification of multivariate time series data may boost the performance of early warning systems, for example in manufacturing processes.
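The construction of such a trade-off curve can be illustrated with the 1-Nearest Neighbor baseline on truncated series, as in the sketch below. The data are synthetic and the ARIMA-, linear-model- and Deep-Learning-based variants from the thesis are not shown; class definitions and parameters are assumptions.

    # Illustrative earliness-vs-accuracy curve: 1-NN with Euclidean distance
    # on progressively shorter prefixes of synthetic time series.
    import numpy as np

    rng = np.random.default_rng(0)

    def make_series(label, length=100):
        """Two synthetic classes distinguished by frequency, plus noise."""
        t = np.arange(length)
        freq = 0.05 if label == 0 else 0.08
        return np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(length)

    train = [(make_series(c), c) for c in (0, 1) for _ in range(20)]
    test = [(make_series(c), c) for c in (0, 1) for _ in range(20)]

    def one_nn_accuracy(prefix_len):
        """Classify each truncated test series by its nearest truncated training series."""
        correct = 0
        for x, y in test:
            dists = [np.linalg.norm(x[:prefix_len] - xt[:prefix_len]) for xt, _ in train]
            if train[int(np.argmin(dists))][1] == y:
                correct += 1
        return correct / len(test)

    for frac in (1.0, 0.5, 0.25, 0.1):
        n = int(100 * frac)
        print(f"using {n:3d} of 100 points -> accuracy {one_nn_accuracy(n):.2f}")

Plotting accuracy against the prefix length yields the kind of trade-off curve discussed above: flat at first, then dropping sharply once the prefixes become very short.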