Efficient cost documentation plays a central role in project management and financial control in the construction industry. We have developed a Cost Documentation Tool specifically tailored to the challenges faced by this sector, based on Austrian standards. The construction industry is known for its complexity, involving many stakeholders, lengthy project timelines and intricate financial landscapes. Our objective was to create a solution that streamlines cost documentation practices, ensuring accuracy, transparency and efficiency throughout the entire project life cycle. To tackle this issue, we designed a tool that integrates seamlessly with existing tendering, awarding and billing systems. This integration allows for data synchronization, comprehensive cost documentation, centralized project management and customized reporting. The implementation of our Cost Documentation Tool has resulted in improvements in cost control, data accuracy and decision-making capabilities within the construction industry. With real-time access to data and customizable reports at their fingertips, project stakeholders can now make decisions based on data insights while effectively identifying cost trends and anomalies. By addressing the challenges of cost documentation with the tool's capabilities, we have opened up a new level of efficiency and transparency in construction cost management. Project professionals can now optimize costs effectively while ensuring compliance and fostering collaboration, regardless of project size or complexity. This progress has the potential to transform the construction sector, empowering it to handle future projects with confidence.

In this bachelor thesis, a possible correlation between the sentiment bias in language models and the interaction of a human user with a chatbot, based on different language models, was evaluated. Touvron et al. (2023) stated in their paper that bias in model generation is a possible result of training the language model with biased data. When interacting with the chatbot, the user is greeted with a picture of a certain object and is asked to tell the chatbot what they see. This study examined whether there is a correlation between the user's interaction with a chatbot and the occurrence of sentiment bias. It is based on prototype theory, which states that within a certain category, humans tend to use one specific noun more often than others: if you ask a person to name a fruit, the answer will probably be an apple rather than a pineapple. This paper aims to observe whether the language of the user shifts towards the language used by the language model when asked for a prompt regarding a certain prototype, namely the picture given to the user. During this bachelor's thesis, we found that there were some matches between the prototypes of our test subjects and the prototypes of the large language models. Additionally, we observed that Google Bard's answers contained a gender bias and were exceptionally long, filled with information that the users found unnecessary, while OpenAI's ChatGPT gave shorter, unbiased answers. With the help of this work, researchers can observe a correlation between the bias in human language and the bias of large language models. Furthermore, a relation between the prototypes humans use to describe a specific object or person and the prototype a large language model uses can be detected.

This thesis is about developing an algorithm to encode black and white patterns in superpositions of two- and four-qubit systems on a quantum computer provided by IBM Quantum Lab, using the open-source framework Qiskit (Qiskit, 2023). To accomplish this, the pattern is numbered from the top left to the bottom right. Then the black and white pixels are assigned the values 0 and 1 respectively, resulting in a binary string representing the original pattern. Afterwards, the algorithm calculates the quantum gates required to manipulate the relative phases of a superposition to store the information of the binary string. Furthermore, for the two- and four-qubit systems, a simple neuron was implemented, capable of being trained on a chosen pattern. After the training step is complete, the neuron can correctly identify a new, previously unseen pattern, which differs from the originally chosen pattern by at most a given number of pixels. For the two-qubit system, this given number is always zero, so only the exact trained pattern passes. The four-qubit system, on the other hand, can be set to allow a difference of up to 8 pixels, depending on how strict the neuron should be. The Nature article "An artificial neuron implemented on an actual quantum processor" by Francesco Tacchino, Chiara Macchiavello, Dario Gerace and Daniele Bajoni (nature.com, 2019a) is the main reference of this thesis. It uses the same principle of storing information and training the neuron, yet the way in which the required quantum gates are calculated is different and was developed independently for this thesis. As a result of this thesis, a functioning pattern recognition system with an effectively exponential advantage in memory storage, capable of recognizing one out of 32768 patterns, can be reported. The major downside, however, is a possibly unfeasible scalability beyond a small number of qubits and therefore different patterns. Thus, this must be taken as a proof of concept in need of further investigation rather than a real-world application.
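
The gate construction used in the thesis is its own contribution and is not reproduced here; the following is only a minimal Qiskit sketch, under the stated phase-encoding idea, of how a hypothetical 2x2 pattern can be written into the signs (relative phases) of a two-qubit equal superposition, flipping the phase of each basis state whose pixel value is 1.

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

# Hypothetical 2x2 pattern, numbered top left to bottom right:
# black pixel -> 0, white pixel -> 1, i.e. the binary string "0110".
pattern = [0, 1, 1, 0]

qc = QuantumCircuit(2)
qc.h(0)
qc.h(1)   # equal superposition over |00>, |01>, |10>, |11>

# Flip the sign of every basis state whose pixel is 1.
for index, bit in enumerate(pattern):
    if bit == 1:
        bits = format(index, "02b")                      # basis state label, MSB first
        # Conjugate CZ (which flips only |11>) with X gates on the qubits
        # whose target bit is 0, so exactly this basis state is flipped.
        zero_qubits = [q for q, b in enumerate(reversed(bits)) if b == "0"]
        for q in zero_qubits:
            qc.x(q)
        qc.cz(0, 1)
        for q in zero_qubits:
            qc.x(q)

print(Statevector.from_instruction(qc))   # amplitudes carry the pattern in their signs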

Software development requires knowledge of the involved syntax and semantics, which can be seen as a barrier for beginners but may also challenge experienced programmers. Notably, while the fine granularity of textual programming languages allows the expression of very specific aspects, it also comes at the cost of increased complexity. Visual programming languages introduce an interactive, more graphical approach towards a more abstract development experience, which may be considered faster, more beginner friendly and less error prone. However, implementations of this visual concept, by neglecting some details, also lose some degree of expressiveness. Therefore, this work proposes zeus as a general-purpose visual programming environment that combines text-based and visual programming techniques to benefit from abstraction methods without compromising on expressiveness. Accordingly, a prototype was implemented to evaluate the concept by developing applications consisting of a graphical user interface and program logic with it. The applications were then exported to a specific target platform to validate the resulting code. The experiments indicated that textual and visual software development methodologies can complement each other to construct general-purpose applications in a way that benefits beginners as well as experienced programmers.

With the rise of e-learning and modern campus management systems in academia, pursuing multiple degree programmes in parallel seems to have become increasingly accessible in recent years. From a student’s perspective, managing programmes offered at multiple independent academic institutions can be challenging, however. Especially the collision-free scheduling of courses can be a demanding task, since information needs to be acquired and compared manually, possibly even from various different campus management systems. In an effort to solve this problem in the context of a bachelor’s thesis, a software solution intended to assist students in interinstitutional course scheduling was created. A Web platform capable of combining course-related data from a variable number of academic institutions was developed, deployed and released to the general public. Furthermore, an academic institution was integrated into the platform, allowing for testing the platform with real-world data. A literature search for identifying concepts of data science as well as scheduling algorithms possibly applicable to the new platform was conducted. A list of scientific literature proposing tangible approaches related to course scheduling was compiled. Also, an algorithm suitable for maximising the number of non-colliding courses was identified. Based on the present thesis, more specific research as well as the development of tangible software implementations of recommender systems and scheduling algorithms as part of the new platform are now possible.
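
The abstract does not name the algorithm identified for maximising the number of non-colliding courses; a classic choice for this problem is the greedy earliest-finish-time strategy for interval scheduling, sketched below on made-up course slots.

```python
from datetime import datetime

# Hypothetical course slots: (course name, start, end).
courses = [
    ("Algorithms",     datetime(2024, 3, 4,  8, 15), datetime(2024, 3, 4, 10, 0)),
    ("Linear Algebra", datetime(2024, 3, 4,  9, 0),  datetime(2024, 3, 4, 11, 0)),
    ("Databases",      datetime(2024, 3, 4, 10, 15), datetime(2024, 3, 4, 12, 0)),
    ("Statistics",     datetime(2024, 3, 4, 11, 30), datetime(2024, 3, 4, 13, 0)),
]

def max_non_colliding(slots):
    """Greedy interval scheduling: repeatedly pick the course that ends first
    and discard everything that overlaps with it."""
    chosen, last_end = [], None
    for name, start, end in sorted(slots, key=lambda s: s[2]):
        if last_end is None or start >= last_end:
            chosen.append(name)
            last_end = end
    return chosen

print(max_non_colliding(courses))   # ['Algorithms', 'Databases']
```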

Wikipedia is well known for being a source of information for almost every topic. New events are usually added within minutes. The markup behind Wikipedia is quite complex, and therefore not every user is able to write an article on their own. The aim of WikGen is to automate the process of writing an article so that the user only has to touch the surface of the markup language. To solve this problem, natural language processing in combination with additional Python libraries was used. A graphical user interface was created. The interface allows users to set certain parameters as well as the topic of the article. Additionally, the user has to choose between two content generation libraries, with one of which very good results were achieved. To start the article generation, the 'Run' button needs to be pressed. The final article is then published and the page is reloaded so that the user can see the final article. An article was created in which every section contained meaningful information about its topic. Every user is now able to generate a Wikipedia article from scratch using only an infobox, without knowing anything further about the underlying markup.

Since long ago, when trading companies first formed, there has always been a need to be competitive, to gain any advantage possible, to get that deal and secure a resource. Today, huge companies have taken a foothold in an international market, competing not only with other large corporations but also with small local enterprises. This ubiquitous pressure to be better intensifies the need for any advantage an enterprise, big or small, can get in this cutthroat age of globalization. There are many technologies available for companies to gain an edge over the others, and the Dynamic Import Module (DIM) presented in this document is one such tool. In an age where trading and processing data is an everyday affair, many people are required to carry out mundane and mind-numbing tasks such as manually typing files into databases or creating spreadsheets to crunch some numbers. The DIM is a lightweight software module for Extracting, Transforming and Loading (ETL) data from files into SQL databases. There are many different ETL and ELT applications available for a myriad of programming languages and frameworks, but the DIM is a Grails 2.2-specific plugin and, for this particular framework, the first of its kind. As of now, the DIM is able to extract data from DSV - which includes CSV, TSV and whitespace-delimited files - PRN, simple XML, SpreadsheetML, simple JSON and JSON-like formats such as YAML. Further, it is possible to configure the DIM so that certain lines are skipped, data is verified to be of a certain data type, and data can even be specified as optional or required. As for transforming the data, the DIM provides a suite of methods ranging from simple string manipulations like appending and replacing substrings to numerical operations like calculating a sum or mean and carrying out basic arithmetic operations. Beyond that, the DIM is also able to extract selected data and insert it into other selected data sets that are to be loaded into the database. Another provided feature is creating one-to-one and one-to-many relations between data during the transformation process, which are then persisted into the database. Last but not least, all of the transformation methods can be applied universally to all data sets, with exclusions if need be, or only selectively to special data sets. In any case, the extracted and possibly transformed data can then be loaded into the database in any representation the user needs, as long as the equivalent domain classes exist within the application. The DIM aims to speed up and automate the tedious import process, freeing people to spend their time on meaningful work and giving their company a lead in other areas which require human creativity.
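
The DIM itself is a Grails 2.2 plugin working against that framework's domain classes, so the snippet below is not its API; it is only a language-agnostic illustration, written in Python against SQLite with invented file and column names, of the extract-transform-load flow the abstract describes (skip configured lines, type-check a column, derive a value, then persist).

```python
import csv
import sqlite3

con = sqlite3.connect("orders.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL, amount_gross REAL)")

with open("orders.csv", newline="") as f:
    reader = csv.DictReader(f, delimiter=";")          # extract: DSV with ';' delimiter
    for row in reader:
        if not row["customer"] or row["customer"].startswith("#"):
            continue                                   # skip configured comment/blank lines
        try:
            amount = float(row["amount"])              # verify: value must be numeric
        except ValueError:
            continue                                   # or report, depending on configuration
        customer = row["customer"].strip().upper()     # transform: simple string manipulation
        gross = round(amount * 1.2, 2)                 # transform: basic arithmetic (hypothetical VAT)
        con.execute("INSERT INTO orders VALUES (?, ?, ?)", (customer, amount, gross))

con.commit()
con.close()
```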

Historic documents contain valuable information about our past, providing insights into the cultures, societies, and events that have shaped our world. Digitization of large quantities of such documents is crucial not only for analyzing them but also for making them more accessible. However, extracting textual information from these sources is challenging due to factors such as poor image quality, non-standard layouts, and varying fonts. Using a deep learning convolutional neural network, this thesis aims to improve the accuracy of optical character recognition (OCR) in historical schematism-state manuals. This approach involves segmenting document pages into individual elements, and then applying OCR to each element individually rather than to the entire document. To further enhance accuracy, the OCR program Tesseract is fine-tuned on a custom font designed to look as similar as possible to the original font used in the schematism-state documents. In comparison to applying OCR to the entire document, the methods proposed in this thesis lead to a significant improvement in character extraction accuracy of 71.98%. These results help to better extract and analyse the wealth of information contained in historic documents. Apart from having a significantly reduced error rate when extracting texts from schematism-state documents, it is also possible to attach specific labels to them. In this way, texts can be categorized even before they are processed using natural language processing (NLP).
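
The segmentation network and the fine-tuned Tesseract model are specific to the thesis, but the element-wise OCR step can be sketched with the pytesseract bindings: crop each detected element out of the page image and recognise it separately instead of passing the whole page to Tesseract. The file name, bounding boxes and language code below are placeholders ("frk", Fraktur, stands in for the custom traineddata file).

```python
from PIL import Image
import pytesseract

page = Image.open("schematism_page_042.png")

# Hypothetical element boxes produced by the segmentation network:
# (left, top, right, bottom) in pixel coordinates.
elements = [(120, 200, 980, 260), (120, 280, 980, 340)]

texts = []
for box in elements:
    crop = page.crop(box)                                   # isolate one layout element
    texts.append(pytesseract.image_to_string(crop, lang="frk").strip())

print("\n".join(texts))
```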

The increasing prevalence of malicious links, such as phishing attempts and malware downloads, poses a significant threat to users on a daily basis. Determining the intent and safety level of such links has become a crucial challenge in cybersecurity. This thesis addressed this problem by implementing a recursive web crawler which is backed by a random forest machine learning classifier to accurately classify found URLs into various classes of intent. By analysing the dataset and attributes of different URLs, 30 lexical features were extracted and used to train the classification model. Furthermore, the work focused on optimizing the model with hyperparameter tuning and testing different train-test approaches to attain consistent accuracy. The classifier achieved an accuracy rate of 96.2% in classifying unknown URLs into their respective categories of intent. However, it is important to note that short URLs or long subdomains tend to result in a higher misclassification rate than others. The result of this thesis is an application which gives users the ability to check potentially malicious URLs in order to be safer from hidden threats.
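
The thesis' 30 lexical features are not listed in the abstract; the sketch below only shows the general shape of such a pipeline, with a handful of assumed lexical features, placeholder training URLs and scikit-learn's random forest.

```python
from urllib.parse import urlparse
from sklearn.ensemble import RandomForestClassifier

def lexical_features(url):
    """A few illustrative lexical features; the thesis uses 30 of them."""
    parsed = urlparse(url)
    return [
        len(url),                          # total URL length
        len(parsed.netloc),                # host length
        url.count("."),                    # number of dots
        url.count("-"),
        url.count("/"),
        int("@" in url),                   # '@' can hide the real host
        int(parsed.scheme == "https"),
        sum(c.isdigit() for c in url),
    ]

# urls and labels would come from the crawled dataset; shown here as placeholders.
urls = ["https://example.com/login", "http://paypa1-secure.top/verify@acct"]
labels = ["benign", "phishing"]

X = [lexical_features(u) for u in urls]
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.predict([lexical_features("http://free-gift.club/win")]))
```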

In the present work, data from COVID-19 patients are analyzed across the three waves. The collected data are important for the ongoing monitoring of the development of the virus and are still regularly accessed and used in hospitals. In order to further analyze the pandemic and its ongoing consequences and to support research, such data and their evaluations will not lose importance in the future. A dashboard was created with the help of SAP Lumira Designer and can be filtered in various ways. The work examines and compares, between the individual COVID-19 waves, patient occupancy, patients treated in intensive care, risk factors of the patients, days of occupancy, duration of care, deaths and the number of patients requiring ventilation. Significant differences could be determined in the study, particularly with regard to the data on vaccination status. The work shows how diverse COVID-19 patient data are.

With the rise of Industry 4.0, huge amounts of collected data must be analyzed. Currently, few applications provide easy-to-use causal discovery tools to visualize data sets. In this work, a causal discovery tool, using algorithms from the Causal Discovery Toolbox (CDT), was implemented in an existing web application for causal inference to counter this problem. The generated graphs can then be used for further causal inference processing. Five different skeleton recovery and ten different causal discovery algorithms of the CDT have been implemented. In addition, a new feature was added to the application to store the data sets and results. The results of the implemented algorithms differ from the corresponding ground truth. The generated graphs are sometimes unreliable, so the application does not replace the user's own thinking.
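
As a rough illustration of the kind of call such a web application wraps, the CDT Python package exposes its discovery algorithms as estimator-like classes; the snippet below is a sketch assuming its PC implementation (which relies on an installed R backend) and a hypothetical pandas data set with made-up column names.

```python
import pandas as pd
import networkx as nx
from cdt.causality.graph import PC   # one of CDT's causal discovery algorithms

# Hypothetical observational data set uploaded by the user,
# e.g. columns: temperature, pressure, output.
data = pd.read_csv("sensor_readings.csv")

pc = PC()                     # note: CDT's PC wraps the R pcalg package
graph = pc.predict(data)      # returns a directed networkx graph

print(nx.to_pandas_adjacency(graph))   # adjacency matrix to visualise or store
```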

Reproducibility in Machine Learning Research

In today's world, data protection is becoming more and more important, be it because of services that are always available or offline applications that accompany us in everyday life. All these services and applications collect data from us. In order not to give the protection of this data out of our hands, many regulations have already been issued, but these do not cover every area. To gain a more detailed insight into this area of data protection, this work first attempts to build up a basic understanding, which should prepare the reader for the following chapters. The various facets of the online environment are divided into different domains and categorized. These domains are explained, and individual ones are considered in more detail. The categorization is intended to be used to obtain and show privacy-relevant data from the domains. Specifically selected examples from the domains are then examined and evaluated with programs designed for data protection, and their functionality is roughly explained. Through this practical use, an attempt is made to show what opportunities these programs bring to the various domains and what challenges need to be overcome in order to use them.

So-called fake shops and other problematic online shops pose various dangers to consumers. To minimize these dangers, this work aims to improve the automatic detection of such shops.

The aim of the thesis is to find and compare the best methods to solve a 1v1 resource allocation game. In the game, two players try to gather as many resources as possible in order to construct more buildings than the other player. The game is called Lux AI, and everyone can participate in the challenge hosted on the public data science platform Kaggle. In the scope of this thesis, three different methods from different fields of AI were tested: a rule-based approach, a supervised machine learning approach and a reinforcement learning approach. The thesis first introduces the different methods tested. Later on, the experiments on how to achieve the best results within each method are examined. Finally, the obstacles and possible improvements for each approach are discussed. The results of the thesis show that a rule-based approach can be a well-performing baseline, but it is complex to introduce and balance rules for every situation. Therefore, the rule-based approach is recommended when a quick and stable agent is required rather than a high-performing agent. The supervised machine learning approach turned out to work best. The reinforcement learning approach hardly worked, but it was shown that with more resources it might outperform the supervised machine learning approach.

Researching political trends (political topics, political figures, etc.) is important for two reasons: firstly, to enable political leaders to make the right decisions for organisational transformations and policies; secondly, to understand the impact and influence of political leaders and political organisations as trend representatives, or as part of research for election campaigns. This work concentrates on building a software solution to assess political trends from online media in real time and provide insights in the form of historic quantitative data on distinct trends. The created software solution is based on data gathered in real time from online newspapers in Bosnia and Herzegovina and analyzed with a combination of existing solutions for gathering the data, processing it in a distributed manner, and tools for natural language processing. The software solution delivers results in real time in the form of time-series data for specific political trends by defining related keywords and measuring their presence in the media. The resulting time-series data and intermediate results are ready for further analysis and can be used to give more detailed insights.

Besides the transfer of money, the timely movement of goods is the most critical aspect within e-commerce processes. To make this as fast and simple as possible with the right warehouse aggregation and order picking, several technologies, such as artificial intelligence, lend themselves to the task. This thesis discusses how, and by means of which technology, these processes can be improved.

Calibration in recommender systems

Detecting sarcasm has proven to be beneficial in several subfields of natural language processing. It improves the results of sentiment analysis, is a valuable preprocessing step in information extraction, can be helpful in generating natural language components, and more. While much research has been done for the English language, only few systems have been proposed to detect sarcasm in German texts. To contribute to the research on detecting sarcasm in German texts, the characteristics of expressing sarcasm were examined, a working definition derived, and a system to automatically detect sarcasm on sentence level was implemented. The analyzed sentences originate from political speeches, which were taken from the Austrian National Council and compiled into a corpus. These sentences were manually annotated for sarcasm and used to generate a multitude of features. By using decision trees, the importances of the generated features were learned and, with respect to their importance, combinations of various feature sets were tested. The experiments carried out showed that sarcasm is detectable on sentence level and that sentence length and part-of-speech tags are among the most important features for detecting it. The proposed sarcasm detector provides a solid baseline and valuable insights for future work that focuses on automatically detecting sarcasm in German texts, and in particular, German political texts.
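
The concrete feature set and corpus are internal to the thesis, but the modelling step the abstract describes, training decision trees and ranking the generated features by importance, looks roughly like the following sketch with made-up per-sentence feature columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-sentence features extracted from annotated speeches.
df = pd.DataFrame({
    "sentence_length":  [34, 12, 51, 8, 27],
    "num_adjectives":   [3, 1, 5, 0, 2],
    "num_exclamations": [0, 1, 2, 0, 0],
    "verb_pos_ratio":   [0.21, 0.10, 0.18, 0.05, 0.25],
    "sarcastic":        [1, 0, 1, 0, 0],          # manual annotation
})

X, y = df.drop(columns="sarcastic"), df["sarcastic"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Rank features by the importance the tree learned, which can then guide
# the choice of feature-set combinations to test.
for name, importance in sorted(zip(X.columns, tree.feature_importances_),
                               key=lambda t: t[1], reverse=True):
    print(f"{name}: {importance:.2f}")
```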

The purpose of this work is to study and improve the usability of a single-page web application. A single-page web application is a recent approach to programming web pages in which part of the page content is updated based on user actions without reloading all of the content if not necessary. This new programming method helps to speed up web pages considerably, making them interactive and their content visible in a shorter time, and for this reason, more and more web pages are programmed in this way. In this work, the runtime performance and usability of the CourtCulture single-page web application are tested with different methods, and solutions are proposed to bring the application closer to user needs and make it more compliant with design and usability standards. A new version of the application is also proposed, with a new design and new functionalities that address the problems highlighted during the tests. The effectiveness of the improvements to the application is analyzed only theoretically.

Cross-platform analysis of user comments

New legislation in Austria permits flight operations without the presence of an airfield operations manager at the airfield. However, there is a legal obligation to record all take-offs and landings. The goal is to develop an electronic system that detects take-offs and landings and stores this information in a database. This system is intended to support airfield operators in fulfilling the legal recording obligation. An algorithm was designed and implemented that provides a good basis for automating this task. In tests with a total of 76 flight movements, the algorithm achieved a correct classification rate of 96% and 100%, respectively.

The focus of this work lies in causal science. Causal science engages in finding causes of all sorts of events occurring in the world. The cause is usually called treatment and the subject of its influence is called outcome. Causal science employs different methods for finding and numerically presenting the intensity of the treatment’s causal influence on the outcome. The process of finding that intensity of causal influence is called causal inference. This work presents the development of an application which was created for the purpose of performing causal inference via a visual interface. The application has coupled existing causality libraries with a user interface. It provided a simple, practical solution for causal inference in the form of a web application. The results have shown good accuracy, given different types of data files and inference options. It is now possible for a common user to easily perform causal inference via the web browser. It is also possible to extend and develop the application further to provide a better practical tool for causal science.

In this paper, we attempt to classify functional and non-functional requirements using Bidirectional Encoder Representations from Transformers (BERT). We discuss the concepts and the implementation of our classifiers, the achieved results, and how well our approach handles real-world noise such as spelling mistakes. We created two different requirement classifiers: a binary functional/non-functional (FR/NFR) classifier, and a classifier for the four most frequent classes of non-functional requirements, namely Operational (O), Performance (PE), Security (SE) and Usability (US). For this, we fine-tuned a pre-trained BERT language representation model for our specific task. Using this approach, our binary FR/NFR classifier achieved an average precision of 95.1 percent, an average recall of 92.6 percent, and an F1-score of 93.8 percent. The NFR classifier achieved an average precision of 90 percent and an average recall of 88.7 percent. Our approach and results enable automating the classification of software requirements in a straightforward and efficient way.
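
A condensed sketch of such a fine-tuning setup with the Hugging Face transformers library follows; the exact checkpoint, label set, hyperparameters and training loop used in the paper are not reproduced, and the example sentence is invented.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

labels = ["FR", "NFR"]                       # binary functional/non-functional task
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(labels))

# One illustrative requirement sentence; real fine-tuning feeds batches of
# labelled sentences into an optimizer loop or the transformers Trainer.
batch = tokenizer(["The system shall respond to queries within 2 seconds."],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**batch).logits

print(labels[int(logits.argmax(dim=-1))])    # untrained head: output is arbitrary
```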

Various research and development teams are working on the question of how machines can be taught human language. To teach machines to speak, they must be fed large amounts of data; these data have to be prepared for computers and enriched with metadata. Adding such annotations is called annotation. This bachelor thesis gives an overview of annotation and compares manual and semi-automatic approaches. The first part of the thesis deals with the theoretical background of annotation and various fields of application. Terms such as Natural Language Processing, Human Language Technologies and Distant Supervision are explained, and their connection to annotation is described. The background chapter also presents various tools and techniques for annotation and their fields of application. In the practical part of the thesis, a web tool for annotating documents is presented. The goal of the evaluation is to find out which of the tool's two methods is better suited for annotation. Two approaches are compared: a manual and a semi-automatic one. Method 2 displays search terms in their context; these terms can be added to the annotation or discarded with a mouse click. To answer the research question, seven people annotated two scientific documents, first with the manual method and then with the semi-automatic method. In addition, each person filled out a self-assessment questionnaire. The result of the evaluation showed that the manual method takes longer but delivers a more precise result, while the semi-automatic method is faster but less accurate. It was also observed that the majority of participants rated themselves considerably worse than their actual ability.

Text simplification as a field is steadily growing, but many of its problems are still far from being solved. The application described in this thesis aims to help with one of these problems: gathering enough data to train models for automatic text simplification. The application provides a framework for easier visualization, effortless editing and exporting of annotated data from articles written in German. The development process and important design decisions are documented in the thesis, together with functional requirements and use cases. The main focuses during development were performance, ease of use for novice users and correctness of the output data. To test these quality attributes and to evaluate whether the functional requirements are fulfilled, user testing was conducted. The results showed that all of the set requirements were fulfilled, and users highly praised the performance and usability of the system, with some minor remarks on how the system could be further improved.

When developing enterprise software applications, the goal is to extend the product's life span, which is often done by developing a state-of-the-art base to ensure future expandability, integration of third-party applications and maintainability. Many companies started developing their tailored software in the early 2000s, but over the last 20 years, requirements for software products in terms of SLAs, testing and fault tolerance, to name a few, increased drastically as more and more issues came to light. As a result, a new architectural style emerged: microservices. Microservices try to reduce technical challenges when developing applications and achieve a more structured way of developing software solutions at the organisational level. Whereas organisational structures can be changed relatively easily, the migration of an existing legacy system towards microservices requires a lot of human work. Algorithmic techniques exist, but they rely heavily on static code analysis and ignore crucial runtime behaviour of the system. This thesis tackles that problem by presenting an algorithmic way to extract microservice candidates in a refactoring scenario based entirely on runtime data of the system. For this, a large amount of runtime data was acquired and modelled as a graph. To represent the runtime dynamics of the system, a set of weight functions was defined. The extraction of the microservice candidates was realized by applying graph-based clustering algorithms to the graph representing the system. In addition, a web-based user interface was developed to provide architectural insights before and after the extraction process. To assess and test the correctness of the developed approach, the author entered a cooperation with the Raiffeisen Information Service, which tested and rated the output of the extraction process. Besides this, the correctness was verified via custom microservice-specific metrics. The results show that the described approach works very well given its structural simplicity and can be used to analyze the current state of the system and automate the extraction process arbitrarily well.
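
The thesis' own weight functions and clustering algorithms are not reproduced here; as a generic illustration of the final step, the sketch below builds a weighted call graph from hypothetical runtime traces with networkx and cuts it into microservice candidates using modularity-based community detection.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical runtime observations: (caller class, callee class, call count).
calls = [
    ("InvoiceService", "InvoiceRepository", 420),
    ("InvoiceService", "TaxCalculator", 310),
    ("CustomerService", "CustomerRepository", 500),
    ("CustomerService", "AddressValidator", 120),
    ("InvoiceService", "CustomerService", 15),   # weak coupling across domains
]

g = nx.Graph()
for caller, callee, count in calls:
    # Here the edge weight is simply the call count; a real weight function
    # could combine several runtime signals.
    g.add_edge(caller, callee, weight=count)

candidates = greedy_modularity_communities(g, weight="weight")
for i, community in enumerate(candidates, start=1):
    print(f"microservice candidate {i}: {sorted(community)}")
```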

A problem that has emerged during the last twenty-five years is the re-finding of emails. Many different groups of people have thousands of emails in their inboxes, which often causes frustration during the search for older emails. This fact is reason enough to think about new solutions for this issue. Is continually managing your emails with folders and labels the best answer? Or is it more efficient to use a memory-based approach? In this thesis, we planned and implemented a search tool for Mozilla Thunderbird to test whether it is reasonable to use human associative memory for re-finding. The first step was to investigate which things, besides the conventional text and name, people potentially remember about an email. The decision fell on a separation into three additional search features. They focus on the email partner's primary data, on side facts related to the date, and on the option to search for a second email which the user possibly associates with the wanted email. To check whether the tool is applicable, we evaluated it with several test persons by giving them tasks to complete in a test email environment. The results showed a positive attitude toward these new ways of searching. Especially the date-related features were rated very highly. These results motivate potential further research on the topic. Having discovered that dates tend to be remembered quite well, we can improve the tool in this direction before starting a large-scale evaluation with real email data.

This work elaborates on how a browser's bookmark functionality, a common tool to aid revisitation of web pages, can be improved with regard to performance and user experience. After identifying and investigating issues arising with state-of-the-art approaches, solutions to those issues were elaborated, and a browser extension for the Google Chrome browser was implemented based on the gathered insights. A special focus was put on developing novel functions that allow for incorporating temporal relations between bookmarks of a given bookmark collection, as well as a feature that supports searching for bookmarked web pages by colour. Ten participants completed an evaluation of the implemented browser extension in order to investigate its performance and usability. The study showed that users familiarise themselves quickly with the proposed novel functions and rated their ease of use and helpfulness positively. However, although the suggested functions were commented on positively by participants and showed advantages over traditional full-text search in special cases where some (temporal) context is required, full-text search extended by widespread functions like autocomplete suffices for most of the basic use cases.

Test case prioritization is a common approach to improve the rate of fault detection. In this scenario, we only had access to very limited data in terms of quantity and quality. The development of a usable method in such a limited environment was the focus of this thesis. For this purpose, we made use of log output and requirement information to create a cluster-based prioritization method. For evaluation, we applied the method to regression tests of a device currently in development. The results indicate no impactful improvement, based on the simple and limited metrics used. To show the importance of fault knowledge, we generated a simplified dataset and applied the same prioritization method. With the now existing awareness of faults, we were able to evaluate the method using a well-established fault-based metric. The results on the generated dataset indicate a great improvement in the rate of fault detection. Despite the restrictions of this limited environment, the implemented method is a solid foundation for future exploration.
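
The exact features derived from the log output and requirement information are part of the thesis itself; the sketch below only illustrates the general cluster-then-pick pattern on made-up data: vectorise each test case's log text, cluster the tests, and schedule them by cycling through the clusters so that dissimilar tests run early.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical test cases and a snippet of their last log output.
tests = {
    "test_boot":      "power on self test passed, firmware 1.2 loaded",
    "test_sensor":    "sensor calibration drift warning, retry succeeded",
    "test_comm":      "CAN bus handshake ok, 0 dropped frames",
    "test_sensor_hi": "sensor calibration at high temperature, drift warning",
}

names = list(tests)
vectors = TfidfVectorizer().fit_transform(tests.values())
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Round-robin over clusters: take one not-yet-scheduled test per cluster in turn.
order, remaining = [], {n: c for n, c in zip(names, clusters)}
while remaining:
    for c in sorted(set(remaining.values())):
        pick = next(n for n, cl in remaining.items() if cl == c)
        order.append(pick)
        del remaining[pick]
print(order)
```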

Anomaly detection on sequential time series data is a research topic of great relevance with a long-standing history of publications. In the context of time series data, anomalies are subsequences of data that differ from the general pattern. Frequently, these specific areas represent the most interesting regions in the data, as they often correspond to the influence of external factors. Problems which conventional anomaly detection frameworks face are the limitation to highly domain-specific applications and the requirement for pre-processing steps in order to function as intended. Through the use of the Recurrence Plot, the algorithm proposed in this thesis initially seeks to capture the pattern of recurrence found in sequential time series data. An ensuing step of vector quantization by a Growing Neural Gas ensures more efficient computation of collective anomalies. Furthermore, the usual preprocessing steps for noise removal are bypassed thanks to the topology preservation properties the Growing Neural Gas provides. Recurrence Plot construction is done according to a sliding window approach. The results indicate that both the noise removal by the Growing Neural Gas and the pattern preservation by the Recurrence Plot lead to highly accurate results, with the proposed anomaly detector finding all anomalies in a real-world data set of Austria's power consumption in the year 2017. Having demonstrated the applicability and potential of combining the Growing Neural Gas with the Recurrence Plot, it seems likely that these concepts could also be adapted to detect further anomalies, such as contextual ones.
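
The Growing Neural Gas quantisation is beyond a short snippet, but the first building block, a recurrence plot over a sliding window, is compact enough to sketch: a binary matrix marking which pairs of points in the window lie within a chosen distance of each other. The window values and threshold below are made up.

```python
import numpy as np

def recurrence_plot(window, eps):
    """R[i, j] = 1 if the i-th and j-th points of the window are closer than eps."""
    window = np.asarray(window, dtype=float)
    dist = np.abs(window[:, None] - window[None, :])   # pairwise distances (1-D signal)
    return (dist < eps).astype(int)

# Hypothetical hourly power-consumption window with one anomalous spike.
signal = [5.1, 5.0, 5.2, 5.1, 9.8, 5.0, 5.1]
print(recurrence_plot(signal, eps=0.3))
# The row and column belonging to the spike stay almost empty, i.e. that point
# "recurs" nowhere else in the window - the kind of structure a detector exploits.
```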

Wikipedia is the biggest online encyclopedia and it is continually growing. As its complexity increases, the task of assigning appropriate categories to articles becomes more difficult for authors. In this work we used machine learning to automatically classify Wikipedia articles from specific categories. The classification was done using a variety of text and metadata features, including the revision history of the articles. The backbone of our classification model was a BERT model that was modified to be combined with metadata. We conducted two binary classification experiments and in each experiment compared various feature combinations. In the first experiment we used articles from the categories "Emerging technologies" and "Biotechnology", where the best feature combination achieved an F1 score of 91.02%. For the second experiment, the "Biotechnology" articles were exchanged with random Wikipedia articles. Here the best feature combination achieved an F1 score of 97.81%. Our work demonstrates that language models in combination with metadata are a promising option for document classification.

Large events involve a very high management effort in order to guarantee the safety of all visitors. Not only private security staff are deployed, but often also the police, emergency medical services, or the fire brigade. For this reason, it is very important that all organisations involved can work together efficiently and without organisational problems. In emergencies, poor information exchange can quickly lead to a critical situation. To solve this problem, a management solution was developed with which the exchange of information can be optimized so that all parties involved have fast and easy access to all information. The system consists of a web application for the control centre and an Android application for all mobile units. Since this management system makes it possible for all organisations to use the same system, information can be sent directly to all responsible organisations without detours. Through the use of a dedicated Android application, all mobile response units also have the necessary information at hand, not just the control centre. Thus, through the optimized exchange of information between all organisations involved, critical situations can be resolved efficiently and without organisational problems. Although this project is only a prototype, it already demonstrates very well what is possible and how it can be deployed.

In order to provide accurate statistics and information on how much work is published by its institutes and researchers, Graz University of Technology uses a commercial research management system called PURE. The university would like to have all work published by its institutes and researchers registered in this system. However, registering older publications in this system is a daunting task because missing meta-information has to be entered manually. The project behind this thesis was to develop an application which makes importing meta-information provided by other research portals into this system easier. This problem had to be tackled by developing smart algorithms to infer missing meta-information, and a user interface which supports the definition of default values for information where no inference is possible. Those tasks involved working with public and private APIs, parsing and generating large XML files, and implementing an architecture which supports multiple different sources of meta-information on publications. The development of this application was successful, and the generation of XML for a bulk import of meta-information from another research portal called DBLP is now possible. The application is easily extensible with respect to the addition of other research portals and provides versatile settings to adjust the generation of the import XML more specifically. Users with administrative access to the university's PURE server can now select publications from supported research portals and generate large XML files for a bulk import of meta-information. Only a long-term field test of this application will show whether or not the problem has been completely solved by this work.

In automated warehouses, unwanted situations, which are called problems, occur frequently. In this bachelor's thesis, a system component was developed which collects information about these problems and offers solutions to overcome them. This component was integrated into an existing warehouse management system. From ten common problematic scenarios, 26 requirements defining functional and non-functional attributes of the desired system component were worked out. Process details such as the recognition of problems, the definition of problems and their solutions, and their handling by users are covered in this thesis. A chosen set of requirements was then implemented in a proof-of-concept solution. Additionally, the introduced scenarios were implemented in a demonstration warehouse. In the provided framework, the implemented scenarios can be observed and handled by users. Handling problems is more than 68 per cent faster using this framework. Even though adding new problems to handle is not simple and the required calculations are very time-consuming, this thesis offers a big first step from a user-guided system towards a system-guided user.

Data virtualization is an emerging technology for implementing data-driven business intelligence solutions. With new technologies come new challenges: the complex security and data models within business data applications require sophisticated methods for efficient, scalable and accurate information retrieval via full-text search. The challenge we faced was to find a solution for all required steps, from bringing data into the index of a search engine to retrieving the data afterwards, without enabling users to bypass the security policy of the company, thus preserving confidentiality. We researched state-of-the-art solutions for similar problems and elaborated different concepts for security enforcement. We also implemented a prototype as a proof of concept, provided suggestions for follow-up implementations, and gave guidelines on how the encountered problems may be solved. Finally, we discussed our proposed solution and examined the drawbacks and benefits arising from our chosen approach. We found that a late binding approach to access control within the index delivers a fully generic, zero-stale solution that, as we show in the evaluation, is sufficient for a small set of documents with high average visibility density. However, to facilitate scalability, our proposed solution incorporates both early binding for pre-filtering and late binding for post-filtering.

The Portable Document Format (PDF) plays an important role in industry, academia and personal life. The purpose of this file format is to exchange documents in a platform-independent manner. The PDF standard includes a standardized way to add annotations to a document, enabling users to highlight text, add notes and add images. However, those annotations are meant to be added manually in a PDF reader application, resulting in tedious manual work for large documents. The aim of this bachelor thesis was to create an application that enables users to annotate PDF documents in a semi-automatic way. First, users add annotations manually. Then, the application provides functionality to repeat the annotation automatically based on certain rules. For instance, annotations can be repeated on all, even or odd pages. Additionally, annotations can be repeated based on font and font size. The application was built using modern web technologies, such as HTML5 DOM elements, front-end web frameworks, REST APIs and Node.js. The system component responsible for automatic annotation repetition was implemented as a separate service, resulting in a small-scale microservice architecture. The evaluation showed that the application fulfills all use cases that were specified beforehand. However, it also showed that there are some major problems regarding usability and discoverability. Furthermore, performance tests showed that in some browsers, memory consumption can be an issue when handling large documents.

As monolithic applications are becoming rarer, a new problem arises: how these smaller applications communicate with each other. This becomes especially significant in the area of reporting, which usually requires data from multiple sources. We introduce Kafka as a distributed messaging system into our environment as a means of inter-service communication. Additionally, two ways of storing data are provided: MySQL for structured data and MongoDB for unstructured data. The system is then evaluated in several categories. It is tested in terms of resiliency and with performance tests involving a high number of messages and an increasing size of individual messages. The bottlenecks of this system are assessed to determine whether it is useful for reporting data to customers. The experiments indicate that this system circumvents many problems of a monolithic infrastructure. Nevertheless, it creates a performance bottleneck when storing data received from Kafka. Storing structured data turned out to be more problematic than storing unstructured data by an order of magnitude. Despite this, we have been using a distributed messaging setup in production for some years now and are also using it for reports with structured data. Storing unstructured data in this new setup has not made it to production yet; we are currently working on this.
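
The services themselves are not shown here; purely as an illustration of the messaging layer, the kafka-python client can publish a report event on one side and consume it on the other, independent of whether the payload later lands in MySQL or MongoDB. Broker address, topic name and payload fields are placeholders.

```python
import json
from kafka import KafkaProducer, KafkaConsumer   # pip install kafka-python

TOPIC = "report-events"                           # hypothetical topic name

# Producing side: some service emits a reporting event.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"customer_id": 42, "metric": "orders_per_day", "value": 17})
producer.flush()

# Consuming side: the reporting service reads the event and decides where to
# store it (structured rows -> MySQL, free-form documents -> MongoDB).
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)
```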

The advances in data science provide us with a vast array of tools to analyse and better understand our environment. Of special interest to us is the topic of sequential pattern mining, in which statistical patterns are found within sequences of discrete data. In this work, we review some of the major techniques currently offered by the pattern mining field. We also develop a proof-of-concept tool for frequent itemset mining on Tinkerforge sensor data, showing how the application of the FP-Growth algorithm to Tinkerforge sensor data can provide valuable observations and offer an inexpensive yet powerful setting for further knowledge discovery processes. Lastly, we discuss some possible future lines of development of the presented problem.
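
A minimal sketch of the FP-Growth step with the mlxtend implementation, on made-up Tinkerforge-style sensor events discretised into items, could look as follows; the thesis' own tool and data are not reproduced.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Hypothetical transactions: co-occurring discretised sensor readings per time slot.
transactions = [
    ["temp_high", "humidity_low", "motion"],
    ["temp_high", "humidity_low"],
    ["temp_low", "humidity_high", "motion"],
    ["temp_high", "humidity_low", "motion"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```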

Fake news and misinformation are widely discussed topics in our modern information society. A multitude of approaches have been taken to filter out false information, ranging from manual research to artificial intelligence. Most of these projects, however, focus on the English language. To fill this gap, we introduce Crowd Fact Finder, a fact-checking tool for German-language text, which uses Google search results alongside Open Information Extraction to distinguish fact from fake. We use a wisdom-of-the-crowd approach, deciding that what is popular opinion must be the truth. Crowd Fact Checker is based on the idea that true statements, used as a search engine query, will produce more results related to the query than untrue statements. Crowd Fact Checker was evaluated in different categories, achieving an accuracy of 0.633 overall, and 0.7 when categorizing news. The informative value of the wisdom of the crowd depends more strongly on the popularity of the discussed topic than on its validity.

Since the new regulations of 2016, nearly all businesses in Austria are required to manage their invoices digitally and hand out digitally signed receipts. Existing solutions are mostly aimed at bigger companies or lack usability and performance. In this paper, we describe a modern, platform-independent application to manage invoices, customers and room bookings. It was implemented using state-of-the-art techniques as a web application built on the Grails framework. Aimed at being deployed as software as a service, the application makes use of a hybrid multi-tenancy database concept which allows many customers on a single server without compromising data security. Due to its responsive design, the application can be used on devices of nearly all screen sizes with few compromises. The system is nearly production ready and is already used in a productive environment by one customer. By fully integrating the invoice component with the hotel component, our application achieves great performance when billing hotel rooms. As soon as the system is fully production ready, it will offer small and medium-sized enterprises a modern and affordable solution for digitally managing their invoices and room bookings in full compliance with the law.

The goal of this thesis was to test whether a Raspberry Pi cluster is suitable for big data analysis. The frameworks Hadoop and Spark were used. To clarify whether the Raspberry Pi cluster is a good choice for big data analysis, the same calculations were also run on a reference laptop. The test programs were written in Java for Hadoop and in Scala for Spark. The files were stored in Hadoop's distributed file system. The test programs tried to address strengths and weaknesses of the frameworks and ranged from simple data analysis to the random forest machine learning algorithm. Finally, the resource usage of the frameworks and the distributed file system was monitored. The Raspberry Pi cluster was faster with the test programs for Spark, provided they worked on the cluster, since many of Spark's features were not usable on the cluster. MapReduce worked fine on the cluster, but the reference laptop clearly outperformed the cluster for these test programs. The test programs for Spark were, except in one case, faster than the test programs for MapReduce.
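
The thesis' test programs were written in Java (Hadoop MapReduce) and Scala (Spark); as an illustration only, an equivalent simple Spark job can be expressed through PySpark, reading from HDFS as described in the abstract. The HDFS path and host name below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-cluster-wordcount").getOrCreate()

# Read a text file from HDFS (hypothetical path on the cluster's namenode).
lines = spark.read.text("hdfs://raspberrypi-master:9000/data/sample.txt")

counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print(word, count)

spark.stop()
```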

Authorship identification techniques are used to determine whether or not a document or text was written by a specific author. This includes discovering the rightful author of a previously unseen text from a finite list of authors, or verifying whether a text was written by a specific author. As digital media continues to become more important every day, these techniques also need to be applied to shorter texts like emails, newsgroup posts, social media entries, forum posts and other forms of text. Especially because of the anonymity of the Internet, this has become an important task. The existing Vote/Veto framework evaluated in this thesis is a system for authorship identification. The evaluation covers experiments to find reasonable settings for the framework as well as all tests needed to determine its accuracy and runtime. The same tests for accuracy and runtime were carried out with a number of built-in classifiers of the existing software Weka to compare the results. All results were written to tables and compared to each other. In terms of accuracy, Vote/Veto mostly delivered better results than Weka's built-in classifiers, even though the runtime was longer and more memory was necessary. Some settings provided good accuracy results with reasonable runtimes.

In recent years, the variety of car insurance models has increased, including GPS-supported contracts that observe the driving behaviour of the insured with the help of GPS trackers and transfer the data to the insurance company. By analyzing the data, the insurance companies try to create a profile of the policyholder and to adjust the insurance fee to the respective driving behaviour, such as speeding, braking, turning speeds and much more. However, this calculation assumes that people who spend more time in cars are automatically more vulnerable to accidents and small damages; it assumes a direct correlation between time spent in the car and the risk of an accident. What is forgotten here is that experience plays a very important role. The more time you spend driving, the more experience you gain with hazards or problem situations. The handling of the vehicle itself is best learned by experience, which reduces the chance of parking damage or similar. The aim of the thesis is to verify or disprove the current approach of insurance companies. To this end, several methods are used to combine as many perspectives on the topic as possible. In addition to a survey, data is collected automatically by means of web scraping and also manually by means of several random sampling tests. After evaluating the data quality, the results obtained are summarized and evaluated. In addition to statistical evaluations in PSPP, the focus is also on logical or obvious relationships. Finally, all aspects are merged, and the underlying assumption was mostly refuted, as the studies showed that people driving regularly also have the highest percentage of accidents. However, this group of drivers also shows the most stable and predictable values, while people driving irregularly show much bigger irregularities. Most survey participants opposed permanent monitoring of driving habits, across all types of test groups. During the data collection for the thesis, it had to be noted that web scraping of RSS feeds provides very little usable data.

In this thesis I present a novel object graph mapper for Neo4j written in modern, statically typed JavaScript. The aim of this library, namely neo4-js, is to reduce code size while preserving readability, maintainability and code quality when writing backend applications that communicate with a Neo4j database in JavaScript. Readability is a key factor for maintainable code; hence, neo4-js provides a declarative and natural way of defining a data schema. Better code quality is reached by supporting the developer with good error messages and by providing a well-tested library. Furthermore, neo4-js fully supports Flow type definitions, making it possible to find type errors without running the code itself, which results in better code quality. Neo4-js specifically targets backend JavaScript applications running on Node.js. With the basic neo4-js library it is possible to reduce the amount of code by up to a factor of twelve. Additionally, I discuss an effective way of test-driven development for database libraries written in JavaScript using a Docker container. Finally, we take a look at a new way of expressing a schema definition with a new schema definition language and its own compiler to reduce the code size even further.

People spend hours on social media and similar web platforms each day and express many of their feelings and desires in the texts they post online. Data analysts constantly look for clever ways to make use of this information. The aim of this thesis is to first detect business intent in the different types of information users post on the internet. In a second step, the identified business intent is grouped into two classes, buyers and sellers, which supports the idea of linking the two groups. Machine learning algorithms are used for classification. All data needed to train the classifiers is retrieved and preprocessed using a Python tool developed for this purpose. The data was taken from the web platforms Twitter and HolidayCheck. Results show that classification works accurately when focusing on a specific platform and domain: on Twitter 96 % of the test data is classified correctly, whereas on HolidayCheck the accuracy reaches 67 %. When considering cross-platform multiclass classification, the scores drop to 50 %. Although individual scores increase up to 95 % when performing binary classification, the findings suggest that the features need to be improved further in order to achieve acceptable accuracy for cross-platform multiclass classification. The challenge for future work is to fully link buyers and sellers automatically, which would create business opportunities without the parties needing to know about each other beforehand.
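As a rough illustration of the classification step, the sketch below trains a simple buyer/seller/none text classifier with scikit-learn. The example posts, labels and model choice are invented for illustration and are not the features or tooling developed in the thesis.

```python
# Hedged sketch of platform-specific intent classification using
# scikit-learn instead of the thesis' own tooling; posts and labels
# below are invented toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Looking to buy a used mountain bike in Graz",
    "Selling my 2015 road bike, barely used",
    "Great weather for a ride today!",
    "Anyone selling concert tickets for Saturday?",
    "I want to buy a cheap laptop for university",
    "Just posted some holiday photos",
]
labels = ["buyer", "seller", "none", "buyer", "buyer", "none"]

# TF-IDF features plus a linear classifier as a simple baseline.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(posts, labels)

print(model.predict(["who is selling an iphone?", "nice sunset tonight"]))
```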

While design patterns are proposed as a standard way to achieve good software design, little research has been done on the actual impact of using these strategies on code quality. Many books suggest that such methods increase flexibility and maintainability, yet they often lack any evidence. This bachelor thesis intends to empirically demonstrate that the use of design patterns actually improves code quality. To gather data about the code, two applications were implemented that are designed to meet the same requirements. While one application was developed following widespread guidelines and principles of object-oriented programming, the other was implemented without paying attention to software maintenance concerns. After meeting the basic requirements, a number of additional features were implemented in two phases: first, support for a new graphical user interface was added, then a different data tier. The results show that the initial effort of implementing the program version that follows object-oriented programming guidelines is noticeably higher in terms of code lines and necessary files. However, during the implementation of additional features, fewer files needed to be modified; in one phase transition considerably less code had to be written while the other phase showed no disadvantage, and furthermore the cyclomatic complexity of the code increased less rapidly.

Product development starts with the product requirements. Once these are defined, solutions are created for the individual components, which together then correspond to the overall product requirements. This process of developing and refining solution approaches is run through many iterations until a corresponding quality with respect to the product requirements is achieved. This entire "knowledge process" is to be transferred into a knowledge management system. We therefore show ways to make new Web 2.0 information technologies usable for knowledge management in the automotive industry. The work is based on a research project of the Virtual Vehicle Competence Center, which includes a software prototype, the "information cockpit". The "information cockpit" links both the product requirements and the development tasks with the project organization; thus a Product Data Management (PDM) system as well as a Requirements Management (RQM) system is mapped. The networking has succeeded in uniting the individual systems, which represents a novelty in this area. By networking product data, requirements data and project organization, the user is able to obtain a quick overview of different data in automotive development. As a result, management as well as design engineers are able to use existing knowledge quickly and to provide newly generated knowledge to others in an uncomplicated manner. At present only the visualization is implemented. The data to be used is made available as "link nodes" from the data system. The goal is to transfer the demonstrator to the "information cockpit" application. The ontology PROTARES (PROject TAsks RESources) is used as a basis here; it covers the entire data schema. A semantic RESTful (Representational State Transfer) web service was designed and implemented accordingly. The data storage layer is a triple-store database. The "information cockpit" can be used to query the system, which displays the information to the user graphically and structurally. Through the use of these technologies it was possible to create a modular overall system architecture. In the near future, data management can be tackled: not just visualizing the data, but also changing it. After that, user administration, access control, and so on can be considered.
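As an illustration of how such a service might expose the triple store, the sketch below sends a SPARQL query over HTTP and prints the bindings. The endpoint URL and the PROTARES class and property names in the query are assumptions made for this example and do not reflect the actual ontology vocabulary.

```python
# Hedged sketch of querying a triple store over HTTP, as the described
# RESTful service does. Endpoint URL, prefix and property names are
# invented for illustration only.
import requests

ENDPOINT = "http://localhost:3030/protares/sparql"   # hypothetical SPARQL endpoint

QUERY = """
PREFIX protares: <http://example.org/protares#>
SELECT ?task ?requirement WHERE {
  ?task a protares:Task ;
        protares:satisfies ?requirement .
} LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["task"]["value"], "->", binding["requirement"]["value"])
```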

Mobile apps are becoming more and more important for companies because apps are needed to sell or operate their products. To serve a wide range of customers, apps must be available for the most common platforms, at least Android and iOS. If Windows Phone is considered as well, a company would need to provide three identical apps, one for each platform. As each platform comes with its own tools for app development, the apps must be implemented separately, which means development costs may rise by a factor of three in the worst case. The Qt framework promises multi-platform capability: an app needs to be implemented just once but still runs on several platforms. This bachelor's thesis aims to demonstrate this by developing such a multi-platform app using the Qt framework. The app collects data from sensors connected to the mobile device and stores the retrieved data on the phone. For this proof of concept, the supported platforms are limited to the most common ones, Android and iOS. Using the app to record data from a real-life scenario demonstrates that it functions properly.

This thesis deals with the creation of regular expressions from a list of input strings that the resulting expression should match. Since regular expressions match a pattern, they can be used to speed up work involving large amounts of data, under the assumption that the user knows some examples of the pattern that should be matched. In the program discussed here, a regular expression is created iteratively, starting from a very rudimentary expression and refining it step by step, with an adjustable threshold to mitigate the effect of not having any negative matches as input. The result is the easy creation of a sufficiently well-performing regular expression, assuming a representative collection of input strings, while requiring no negative examples from the user.
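To make the idea concrete, the following sketch builds a crude regular expression from positive examples only by reducing each example to character classes and collapsing runs. It is a minimal illustration of generalizing from examples, not the iterative, threshold-based algorithm developed in the thesis.

```python
# Hedged sketch: each positive example is reduced to a sequence of
# character classes, runs are collapsed, and the per-example patterns
# are OR-ed together.
import re

def char_class(c: str) -> str:
    if c.isdigit():
        return r"\d"
    if c.isalpha():
        return r"[A-Za-z]"
    if c.isspace():
        return r"\s"
    return re.escape(c)

def generalize(example: str) -> str:
    pattern = []
    for cls in (char_class(c) for c in example):
        if pattern and pattern[-1][0] == cls:
            pattern[-1][1] += 1           # extend the current run
        else:
            pattern.append([cls, 1])      # start a new run
    return "".join(f"{cls}+" if n > 1 else cls for cls, n in pattern)

def build_regex(examples: list[str]) -> re.Pattern:
    alternatives = sorted({generalize(e) for e in examples})
    return re.compile("^(?:" + "|".join(alternatives) + ")$")

if __name__ == "__main__":
    rx = build_regex(["AB-1234", "XY-98", "QQ-007"])
    print(rx.pattern)                                 # e.g. letters, dash, digits
    print(bool(rx.match("ZZ-42")), bool(rx.match("42-ZZ")))
```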

Due to persistent issues concerning sensitive information when working with big data, we present a new approach for generating artificial data in the form of datasets. For this purpose, we use the term dataset to denote a UNIX directory structure consisting of various files and folders. Especially in computer science, there is a distinct need for data. Mostly, this data already exists but contains sensitive information, so such critical data must stay protected against third parties. This withholding of data leads to a lack of available data for open source developers as well as for researchers. We therefore developed a way to produce replicated datasets, given an origin dataset as input. Such replicated datasets represent the origin dataset as accurately as possible without leaking any sensitive information. We introduce the Dataset Anonymization and Replication Tool, short DART, a Python-based framework which allows the replication of datasets. Since we aim to encourage the data science community to participate in our work, we designed DART as a framework with a high degree of adaptability and extensibility. We started with the analysis of datasets and various file and MIME types to find suitable properties which characterize datasets. We defined a broad range of properties, or characteristics, ranging from the number of files to file-specific characteristics such as permissions. In the next step, we explored several mathematical and statistical approaches to replicate the selected characteristics and chose to model them using relative frequency distributions (unigrams) as well as discrete and continuous random variables. Finally, we produced replicated datasets and compared their characteristics against those of the corresponding origin dataset; this comparison is based exclusively on the selected characteristics. The achieved results depend strongly on the origin dataset as well as on the characteristics of interest: origin datasets with a simple structure are more likely to deliver usable results, whereas large and complex origin datasets may be harder to replicate sufficiently well. Nevertheless, the results suggest that tools like DART can be used to provide artificial data for persistent use cases.
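As a minimal illustration of the replication idea, the sketch below measures one characteristic of an origin dataset, the relative frequency distribution of file extensions, and samples an artificial file listing from it. The characteristic, naming scheme and sampling strategy are simplifications; DART itself models many more characteristics.

```python
# Hedged sketch of replicating a single dataset characteristic:
# measure the relative frequency distribution of file extensions in an
# origin directory, then sample a synthetic file listing from it.
import os
import random
from collections import Counter

def extension_distribution(root: str) -> dict[str, float]:
    """Relative frequencies of file extensions below 'root'."""
    counts = Counter(
        os.path.splitext(name)[1] or "<none>"
        for _, _, files in os.walk(root)
        for name in files
    )
    total = sum(counts.values())
    return {ext: n / total for ext, n in counts.items()}

def replicate_extensions(dist: dict[str, float], n_files: int) -> list[str]:
    """Sample synthetic file names following the measured distribution."""
    extensions = random.choices(list(dist), weights=list(dist.values()), k=n_files)
    return [f"file_{i:05d}{'' if ext == '<none>' else ext}"
            for i, ext in enumerate(extensions)]

if __name__ == "__main__":
    dist = extension_distribution(".")
    for name in replicate_extensions(dist, 10):
        print(name)
```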

This paper compares variable and feature selection with greedy and non-greedy algorithms. For the greedy solution, the ID3 algorithm [J. Quinlan, 1986] is used, which serves as a baseline. This algorithm is fast and provides good results for smaller datasets. However, if the dataset gets larger and the information we want to extract from it has to be more precise, several combinations of variables should be checked, and a non-greedy solution is one possible way to achieve that goal. This approach tries every possible combination in order to obtain optimal results, and these results may involve combinations of variables: one variable on its own may provide no information about the dataset, but in combination with another variable it does. That is one reason why it is useful to check every combination. Besides its very good precision, the non-greedy algorithm requires considerably more computational time, at least Ω(n!); the more attributes a dataset has, the higher the computational complexity. The results have shown, even for smaller datasets, that the non-greedy algorithm finds more precise results, especially with regard to combinations of several attributes or variables. Taken together, if the dataset needs to be analysed more precisely and the hardware allows it, the non-greedy version of the algorithm is a tool that provides precise results, especially from a combinational point of view.
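The following sketch illustrates the contrast on a toy XOR-style dataset: evaluated greedily, each attribute on its own yields zero information gain, while the exhaustive search over attribute combinations finds that the pair is fully informative. The dataset and the joint-split formulation are invented for illustration and are not the experimental setup of the paper.

```python
# Hedged sketch contrasting the greedy, single-attribute choice made by
# ID3-style splitting with an exhaustive (non-greedy) search over all
# attribute subsets.
from itertools import combinations
from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(rows, labels, attrs):
    """Gain of splitting on the joint values of the given attribute set."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(tuple(row[a] for a in attrs), []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = [  # two attributes that are only informative in combination (XOR)
    {"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1},
]
labels = [0, 1, 1, 0]

# Greedy: evaluate each attribute on its own, as ID3 does per split.
for attr in ("a", "b"):
    print("greedy gain", attr, information_gain(rows, labels, (attr,)))

# Non-greedy: evaluate every non-empty combination of attributes.
all_attrs = ("a", "b")
for size in range(1, len(all_attrs) + 1):
    for subset in combinations(all_attrs, size):
        print("subset gain", subset, information_gain(rows, labels, subset))
```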

Social media monitoring has become an important means for business analytics and trend detection, for comparing companies with each other, and for maintaining a healthy customer relationship. While sentiment analysis for English is very closely researched, not much work has been done on German data. In this work we (i) annotate ~700 posts from 15 corporate Facebook pages, (ii) evaluate existing approaches capable of processing German data against the annotated data set, and (iii), due to their insufficient results, train a two-step hierarchical classifier capable of predicting posts with an accuracy of 70 %. The first, binary classifier decides whether a post is opinionated; if the outcome is not neutral, the second classifier predicts the polarity of the document. Furthermore, we apply the algorithm in two application scenarios in which German Facebook posts are analyzed, in particular those of the fashion retail chain Peek&Cloppenburg and the Austrian railway operators OeBB and Westbahn.
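A minimal sketch of such a two-step hierarchical setup is shown below, using scikit-learn with a handful of invented German posts: the first model separates opinionated from neutral posts, the second assigns polarity only to the opinionated ones. It illustrates the architecture, not the features or training data used in this work.

```python
# Hedged sketch of a two-step hierarchical classifier; the tiny German
# examples and the choice of TF-IDF + logistic regression are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Der neue Fahrplan ist eine Katastrophe",
    "Tolle Aktion, vielen Dank!",
    "Die Filiale hat heute bis 18 Uhr geöffnet",
    "Super Service, gerne wieder",
    "Öffnungszeiten am Feiertag wie gewohnt",
    "Sehr enttäuschend, nie wieder",
]
subjective = [1, 1, 0, 1, 0, 1]                        # step 1: opinionated or not
polarity = ["neg", "pos", None, "pos", None, "neg"]    # step 2: only if opinionated

step1 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
step1.fit(posts, subjective)

opinionated = [p for p, s in zip(posts, subjective) if s == 1]
step2 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
step2.fit(opinionated, [pol for pol in polarity if pol is not None])

def classify(post: str) -> str:
    if step1.predict([post])[0] == 0:
        return "neutral"
    return step2.predict([post])[0]

print(classify("Danke für den tollen Support"))
print(classify("Die Öffnungszeiten findet man online"))
```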

The rising distribution of compact devices with numerous sensors in the last decade has led to an increasing popularity of tracking fitness and health data and storing those data sets in apps and cloud environments for further evaluation. However, this massive collection of data is becoming more and more interesting for companies seeking to reduce costs and increase productivity, which may have problematic impacts on people's privacy in the future. Hence, the main research question of this bachelor's thesis is: "To what extent are people aware of the processing and protection of their personal health data concerning the utilisation of various health tracking solutions?" This thesis investigates the historical development of personal fitness and health tracking, gives an overview of current options for users, and presents potential problems and possible solutions regarding the use of health tracking technology. Furthermore, it outlines the societal impact and legal issues. The results of an online survey concerning the distribution and usage of health tracking solutions, as well as the participants' views on privacy with respect to data sharing with service and insurance providers, advertisers and employers, are presented. Given these results and the participants' fierce opposition to the various data sharing scenarios, the necessity and importance of data protection are underlined.

A mobile application is being developed that supports music students in learning an instrument reflectively. Users should be able to determine their practice success through self-observation in order to subsequently find practice strategies that optimize their practice routine. In the short term, the application provides the user with interfaces for the different action phases of a practice session (pre-actional, actional and post-actional). With the help of guiding questions, or questions formulated by the user, practicing is organized, structured, self-reflected upon and evaluated. Ideally, the user can also follow their learning process on the basis of audio recordings. In the long term, all user input can be retrieved again; it is presented in journal form and can be evaluated for self-reflection or together with a teacher.

The buzzword big data is ubiquitous and has a large impact on our everyday life and on many businesses. Since the outset of the financial markets, the aim has been to find explanatory factors that contribute to the development of stock prices, and big data offers a new chance to do so. Gathering a vast amount of data concerning the financial market and filtering and analysing it is, of course, tightly tied to predicting future stock prices. A lot of work with noticeable outcomes has already been done in this field of research. However, the question was raised whether it is possible to build a tool that indexes a large number of companies and news items and uses a natural language processing component suitable for everyday applications. The sentiment analysis tool utilised in this implementation is sensium.io. To achieve this goal, two main modules were built. The first is responsible for constructing a filtered company index and for gathering detailed information about the companies, for example news, balance sheet figures and stock prices. The second is responsible for preprocessing the collected data and analysing it: filtering unwanted news, translating them, calculating the text polarity and predicting the price development based on these facts. Using these modules, the optimal period for buying and selling shares was found to be three days, meaning that shares are bought on the day of the news publication and sold three days later. According to this analysis, the expected return is 0.07 percent per day, which might not seem much, but it would result in an annualised performance of 30.18 percent. The idea can also be applied in the opposite direction, telling the user when to sell shares, which could help an investor find the ideal time to sell company shares.
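As a rough check of the compounding arithmetic, the sketch below annualises a 0.07 percent daily return under two common day-count assumptions; since the exact convention behind the 30.18 percent figure is not stated here, the output is illustrative only.

```python
# Hedged sketch of the compounding arithmetic; the day-count convention
# used in the thesis is an assumption, so the results are illustrative.
daily_return = 0.0007          # 0.07 percent per day, as reported

for days in (252, 365):        # trading days vs. calendar days
    annualised = (1 + daily_return) ** days - 1
    print(f"{days} days of compounding: {annualised:.2%}")
```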

The in-depth analysis of time series has been a central topic of research in recent years. Many of the existing methods for finding periodic patterns and features require the user to input the time series' season length. Today, there are a few algorithms for automated season length approximation, yet many of them rely on simplifications such as data discretization. This thesis aims to develop an algorithm for season length detection that is more reliable than existing methods. The process developed in this thesis estimates a time series' season length by interpolating, filtering and detrending the data and then analyzing the distances between zeros in the corresponding autocorrelation function. This method was tested against the only comparable open source algorithm and outperformed it by passing 94 out of 125 tests, while the existing algorithm only passed 62 tests. The results do not necessarily suggest a superiority of the new autocorrelation-based method as such, but rather of the new implementation. Further studies might assess and compare the value of the theoretical concept.
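The core of the procedure can be sketched as follows: detrend the series, compute its autocorrelation function and derive the season length from the spacing of the ACF's zero crossings (two consecutive zeros lie half a period apart for a periodic signal). The code below is a simplified illustration on a synthetic sine series; the interpolation and filtering steps of the actual algorithm are omitted.

```python
# Hedged sketch of season length estimation from autocorrelation zeros,
# on a synthetic series; not the thesis' full pipeline.
import numpy as np

def autocorrelation(x: np.ndarray) -> np.ndarray:
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf / acf[0]

def estimate_season_length(series: np.ndarray) -> float:
    t = np.arange(len(series))
    detrended = series - np.polyval(np.polyfit(t, series, 1), t)
    acf = autocorrelation(detrended)
    zero_crossings = np.where(np.diff(np.sign(acf)) != 0)[0]
    # For a periodic signal, consecutive ACF zeros are half a period apart.
    return 2 * float(np.median(np.diff(zero_crossings)))

if __name__ == "__main__":
    t = np.arange(400)
    series = np.sin(2 * np.pi * t / 50) + 0.05 * t + np.random.normal(0, 0.1, t.size)
    print(estimate_season_length(series))   # close to the true season length of 50
```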

A mobile application is developed which supports a doctor in the treatment of stroke patients in the phase when they are already at home. Specifically, it supports the analysis of the activity level and activity type of a stroke patient after they have been sent home. Quantitative measures for activity level and activity type are calculated using mobile phone sensors such as the accelerometer, gyroscope and compass. The activity levels distinguished are standing still, using a wheelchair, using a walking frame, walking unaided, or other. The gathered information can then be shown to a doctor, giving them a clear, quantifiable view of the development of the patient's activity level.
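As a simplified illustration of how such quantitative measures might be derived, the sketch below maps the variance of the accelerometer magnitude within a window to a coarse activity level. The thresholds and category mapping are invented and would have to be calibrated on real patient data; the thesis' actual feature set is not reproduced here.

```python
# Hedged sketch: classify a window of accelerometer samples into a coarse
# activity level via the variance of the acceleration magnitude.
import math

def magnitude(sample: tuple[float, float, float]) -> float:
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)

def activity_level(window: list[tuple[float, float, float]]) -> str:
    mags = [magnitude(s) for s in window]
    mean = sum(mags) / len(mags)
    variance = sum((m - mean) ** 2 for m in mags) / len(mags)
    if variance < 0.02:        # hardly any movement (invented threshold)
        return "standing still"
    if variance < 0.5:         # slow, smooth movement (invented threshold)
        return "wheelchair or walking frame"
    return "walking"

if __name__ == "__main__":
    still = [(0.0, 0.0, 9.81)] * 50
    moving = [(0.5 * (-1) ** i, 0.2, 9.81 + (-1) ** i) for i in range(50)]
    print(activity_level(still), "/", activity_level(moving))
```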

Conventional database solutions such as RDBMSs were designed at a time when today's growth of data was unimaginable. As this growth took place, especially in recent years, companies tried to adapt their database solutions to the new requirements. The fact is, however, that classical database systems such as RDBMSs are not suited for this kind of scaling. New technologies had to be created to deal with this problem more easily, and that is exactly the topic of this thesis. The new technologies designed for processing big data mostly belong to the broad category of NoSQL. This thesis discusses the challenges of handling large amounts of data and attempts to clarify a boundary that helps, for example, a company decide whether it needs a NoSQL technology for its applications or whether an RDBMS would suffice. The thesis also discusses the data models suited to the various NoSQL technologies. At the end of the thesis there is a practical part in which three candidates from different NoSQL categories are evaluated against each other.

A large number of software vendors have addressed enterprise search and presented different enterprise search solutions with a broad range of functionality. To be able to compare these enterprise search solutions quickly and efficiently, the system architecture of the search solutions was modeled and represented using Fundamental Modeling Concepts (FMC). This makes it possible to gain an overview of the individual solutions without having to wade through countless pieces of information in data sheets and whitepapers. The portfolio of enterprise search solutions to be compared ranges from market leaders such as Microsoft and Google, and the market-leading company for search technology in Germany, Austria and Switzerland, IntraFind, to visionaries such as Coevo, Sinequa and Dassault Systems. Based on the information gained from the comparison, Microsoft SharePoint 2013 was selected for the prototypical implementation in a system laboratory. The decisive reason was the cost/benefit question: Microsoft is one of the few vendors offering a free version for an entry-level or pilot solution. The enterprise search solution was installed on a virtual machine and, before the full roll-out at the Virtual Vehicle Research Center, tested by ten employees from two different working areas (information management and engineering) with regard to usefulness and quality of the search results. There are hardly any studies on how search solutions are used in the engineering domain, how engineers work with such search solutions and how satisfied they actually are with them. For this reason, a combination of a thinking-aloud test and an interview was used to evaluate the pilot search solution. The interviews were used to collect information about the participants, from which suitable search tasks were derived that each participant had to solve during the thinking-aloud test. Afterwards, the participants were asked about the quality of the search results, the search experience and the usefulness of the enterprise search. It turned out that the employees have difficulties defining suitable keywords for a search; the more they knew about the information they were looking for, the easier it was for them to define suitable keywords. The relevance of the search results was also assessed critically: the participants felt that searching for the desired information via the search interface takes more time than their current search approach. It also became clear that metadata is of great importance for the search, as it contains information that considerably facilitates finding the desired content. Since the participants have to cover their information needs even without the enterprise search, their current search strategy was also addressed as part of the evaluation. Based on the participants' statements, requirements for enterprise search solutions could be derived from the evaluation and further information could be collected. These requirements and this information provide important feedback for the IT department and are intended to support the roll-out of the pilot project.

Table recognition is an important task in information retrieval and document analysis. Most scientific work today is available in the form of PDF documents, and tables within those documents often contain valuable information. Various approaches for table recognition exist; common to all is the need for ground-truthed datasets to train algorithms or to evaluate the results. Herein a web application for annotating elements and regions in PDF documents, in particular tables, is presented. The collected data is intended to serve as a ground truth useful to machine learning algorithms for detecting table regions and table structure, as well as to determine the quality and relevance of various table detection approaches. The software system allows previous attempts at automatic table detection to be imported, examined and further refined and corrected, thus providing a framework for visualizing table recognition results. A survey is conducted, showing that the usage of the tool is convenient compared to three other ways of creating ground truth for tables in documents. The quality of the ground truth is assessed by comparison to other datasets and by human evaluation. The software system is available under the terms of the Apache 2.0 License.

This thesis deals with extending an existing web application (Headstart (Kraker et al., 2013)) with a visualization for time series data. Visualizations make it possible to present complex matters in more comprehensible forms and to interpret them more easily. A particularly interesting kind of information is time series data. This frequently occurring form of data lends itself to recognizing trends and patterns and to making statements about future developments. As a proof of concept, a visualization is developed that enables Headstart users to identify trends and developments in research fields. In order to accomplish the extension within the existing project, it first has to be enriched with a state management mechanism. Its implementation forms one part of this thesis, while the second part is devoted to the new visualization. The integration of the state management was tracked with the help of the metrics presented and led to a clear improvement of the project, so future extensions will require considerably less effort. The time series visualization based on small multiples stays true to the original interface and allows users to easily compare research fields.

This work presents the design and scientific background of a web-based information extraction system using the open-source GATE library (General Architecture for Text Engineering) [Ham12]. The application provides an extendable architecture for server-based high-level information extraction. Textual resources are annotated using a set of tools according to both their semantic and their grammatical representation. These annotations are then presented to the user via a web interface.

The goal of this thesis is to extract, evaluate and store information from tweets, thereby modelling an interface between the World Wide Web and the Semantic Web. Using the microblogging and social networking service Twitter, a dataset of tweets is generated. This dataset is examined for so-called facts as defined by us. The filtering is carried out with the help of regular expressions (regex). The facts found in this way are enriched with specific meta-information and stored in a database. This enables machines to search the data intelligently and to establish logical links between the data items. Because the application is written in the Java programming language, it is platform-independent. The thesis provides basic explanations of the Semantic Web, regex and Twitter, which are necessary for understanding the application. Furthermore, the concept, the methods used, the problems encountered and the results obtained are discussed.
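For illustration, the regex-based filtering can be sketched as follows; the example is written in Python for brevity, whereas the thesis implements it in Java, and the fact pattern and metadata fields shown are invented rather than the definitions used in the work.

```python
# Hedged sketch of regex-based fact extraction from tweets; the pattern
# and the notion of a "fact" below are invented examples.
import re
from datetime import datetime, timezone

# Hypothetical fact pattern: "<city> is the capital of <country>".
FACT_PATTERN = re.compile(
    r"(?P<city>[A-Z][a-z]+) is the capital of (?P<country>[A-Z][a-z]+)"
)

def extract_facts(tweets: list[str]) -> list[dict]:
    facts = []
    for tweet in tweets:
        for match in FACT_PATTERN.finditer(tweet):
            facts.append({
                "city": match.group("city"),
                "country": match.group("country"),
                "source": tweet,
                "extracted_at": datetime.now(timezone.utc).isoformat(),
            })
    return facts

if __name__ == "__main__":
    sample = ["TIL that Vienna is the capital of Austria", "Nice weather today"]
    for fact in extract_facts(sample):
        print(fact)
```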

A MediaWiki is a social web application that makes it easy for a group of people to collect information collaboratively, to create texts and to keep them up to date. The most important functions of a MediaWiki are creating and editing articles, linking articles to enable navigation between them, and grouping articles into categories ([Barrett, 2009]). Under certain circumstances it is necessary to assess the quality of an article in a MediaWiki or to increase an article's quality. According to [Wang & Strong, 1996], poor data quality has a considerable social and economic impact. In the course of this thesis, scientific literature dealing with the quality of articles and data was analyzed and summarized. Based on the results of this analysis, and in cooperation with a company, features were formulated that allow the quality of MediaWiki articles to be assessed and, furthermore, support the users of a MediaWiki in increasing the quality of articles. After the features had been identified, a prototype of a toolbar with these features was developed in Adobe Flex, which can be integrated into a MediaWiki as an extension.

In this thesis we examine the automatic generation of training data for training a machine learning algorithm. We use a rule-based approach to generate the training data, built with the GATE natural language processing framework. The machine learning algorithm uses a statistical model, in our case the maximum entropy model (MEM), to perform the information extraction task. We introduce an architecture and an application for the automatic generation of training data. In order to test our approach, we introduce and adapt evaluation metrics, and we elaborate on the implications that automatically generated data has for the structure of the results. We partition the error into different regions and examine its impact. We see that under certain circumstances the statistical model outperforms the rule-based extraction algorithm that was used to train it.
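A minimal sketch of the overall pipeline is shown below: a simple rule labels tokens automatically and a maximum entropy model (logistic regression over indicator features) is trained on those labels. The rule, features and example sentences are invented stand-ins for the GATE-based components and the actual feature set of the thesis.

```python
# Hedged sketch: rule-generated token labels feed a maximum entropy
# classifier; the rule and features are toy examples.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["Dr. Smith met Anna in Vienna", "The meeting with Brown was in Graz"]

def rule_label(token: str) -> str:
    # Rule-based "annotator": capitalised alphabetic tokens are labelled NAME.
    return "NAME" if re.fullmatch(r"[A-Z][a-z]+", token) else "OTHER"

def features(tokens: list[str], i: int) -> dict:
    return {
        "word": tokens[i].lower(),
        "is_capitalised": tokens[i][:1].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<start>",
    }

X, y = [], []
for sentence in sentences:
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        X.append(features(tokens, i))
        y.append(rule_label(token))

mem = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
mem.fit(X, y)

test_tokens = "Lisa travelled to Linz".split()
print(mem.predict([features(test_tokens, i) for i in range(len(test_tokens))]))
```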