The background of this work emphasizes the significance of data privacy in safeguarding individual rights amid the growing misuse of personal data, underscoring its role in preserving democratic principles and personal freedoms. This problem has been present for centuries, but with the evolution of technology its effects have increased significantly and have become frequent in many industries, including health care. Even though the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) regulate sensitive data protection, the healthcare industry deals with thousands of data breach incidents reported daily. Therefore, we decided to explore the repercussions of confidentiality breaches in healthcare and answer a pivotal question: Is automatic detection of cases where HIPAA anonymization is not sufficient for GDPR compliance in EHRs achievable? This research question is crucial for protecting sensitive information in medical tourism programs and the provision of clinical services across international borders, and to address it, we divided the practical work into three phases. First, our objective was clinical dataset acquisition, data preprocessing, annotation, and Named Entity Recognition (NER) to identify specific Protected Health Information (PHI) elements of interest within the scope of the work (PATIENT, PHONE, LOCATION, HOSPITAL, ID, DATE, DOCTOR, AGE, NORP, DISEASE, and CHEMICAL). Second, we developed a customized approach combining different anonymization techniques to anonymize the data according to HIPAA and GDPR and reduce the risk of re-identification. Ultimately, we investigated whether it is possible to construct a pipeline capable of detecting HIPAA- but not GDPR-compliant records under the assumption that we had previously identified and anonymized all sensitive data. As a result of the first phase, we fine-tuned one unified BERT model, namely emilyalsentzer/Bio_ClinicalBERT, capable of identifying 11 PHI entity types of interest (DISEASE, CHEMICAL, PATIENT, DOCTOR, LOCATION, HOSPITAL, PHONE, AGE, ID, DATE, and NORP). After comparing the total number of annotations generated by the model (6,387) with the total number of annotations we manually validated (6,618), the model achieved an overall accuracy of 96.5%. Moreover, we checked how many entities the model misclassified per PHI type and cautiously estimated our model’s general accuracy to be around 95%. With this assessment in mind, and assuming the correctness and reliability of the extracted data of interest, we developed a customized approach to anonymizing the PHI of interest, combining tokenization, encryption, and pseudonymization to meet HIPAA and GDPR requirements. Our evaluation of the categorical entity anonymization process has shown that our approach preserves data patterns effectively and meets strict privacy requirements while providing a robust solution for anonymizing 6,518 PHI elements and ensuring regulatory compliance and data integrity. In conclusion, we recognized the intricate nature of achieving simultaneous HIPAA and GDPR compliance in EHR anonymization: while identifying records that fall short of compliance in terms of extracted entities or anonymization techniques is possible, a comprehensive analysis of GDPR compliance remains a multifaceted endeavor and requires expert knowledge and effort.
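As an illustration only (not the thesis’s actual implementation), the sketch below shows two building blocks such an anonymization step could combine: consistent, keyed pseudonymization of categorical PHI and per-record date shifting that preserves intervals. The key and the record identifier are hypothetical placeholders.

```python
# Illustrative sketch: keyed pseudonymization and date shifting for detected PHI.
import hmac, hashlib
from datetime import datetime, timedelta

SECRET_KEY = b"replace-with-a-securely-stored-key"  # assumption: key kept outside the data store

def pseudonymize(entity_text: str, entity_type: str) -> str:
    """Map a PHI surface form to a stable pseudonym, e.g. 'John Doe' -> 'PATIENT_3fa4c2'."""
    digest = hmac.new(SECRET_KEY, f"{entity_type}|{entity_text}".encode(), hashlib.sha256)
    return f"{entity_type}_{digest.hexdigest()[:6]}"

def shift_date(date_str: str, record_id: str, fmt: str = "%Y-%m-%d") -> str:
    """Shift all dates of one record by the same secret offset to preserve intervals."""
    offset_days = int(hmac.new(SECRET_KEY, record_id.encode(), hashlib.sha256).hexdigest(), 16) % 365
    return (datetime.strptime(date_str, fmt) - timedelta(days=offset_days)).strftime(fmt)

print(pseudonymize("John Doe", "PATIENT"))           # same input always yields the same pseudonym
print(shift_date("2021-03-14", record_id="rec-42"))  # interval structure within a record is kept
```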

This master’s thesis investigates the nature of causality within AI language models operating in the field of natural language processing (NLP). The core research question centres around comprehending the ability of large language models to infer causal relationships. To delve deeper into this central question, a comprehensive literature review was conducted, tracing the roots of causal relationships, exploring the importance of the ability of large language models (LLMs) to distinguish them from mere correlations, up to the point of discerning the limitations of LLMs in inferring them. This resulted in the framing and exploration of four key research questions, for which four OpenAI models and two Google models were evaluated by prompting them with various tasks and scenarios, such as inferring causal relationships from texts containing missing information, analysing texts that express the same causal relationship using different phrasing or vocabulary, examining texts that may possess biases related to gender, race, or other demographic variables, and “what if” questions to rate their counterfactual abilities:
1. How well can the large language models infer causality from texts with missing information? Findings indicate that the best-performing models from OpenAI and Google can infer causality from such texts with a 64% success rate.
2. How does the performance of the large language models vary with different phrasing or vocabulary expressing the same causal relationship? The models can identify different phrasings or vocabularies expressing the same causal relationship with a 78% success rate.
3. Can the language models accurately identify and categorize potential biases in their inference of causality related to gender, race, or other demographic variables? The top-performing models can accurately identify and categorize biases at a rate of 50%.
4. What are the limitations of Large Language Models in capturing counterfactual reasoning? The top two models from each company achieve an efficiency of 85% in capturing counterfactual reasoning.
The insights gained in this study are intended to contribute to the creation of more accurate and efficient natural language processing applications, as well as to the ethical use of AI language models. This study aims at a greater comprehension of the nature of causality in AI language models and its relevance to natural language processing.

Semiconductor companies invest significant resources in R&D to bring competitive products to the market. As market demand changes, the ability to innovate quickly and cost-effectively is a key factor for competitive advantage. Therefore, the goal of this thesis is the creation of a content-based recommender system that suggests IP-Reuse at the beginning of a new project to reduce development costs. To address this topic, different methods and models such as multilayer perceptrons and sentence transformers were utilized to create word embeddings for the products, followed by the computation of similarity matrices and the generation of recommendations. Since multilayer perceptrons are not pre-trained, a proxy task had to be set up to enable the generation of embeddings. This proxy task was also used to facilitate the evaluation of the different solutions by feeding the word embeddings into support vector machines and computing scores such as silhouette or F1. The results were visualized and ranked in order to find the most appropriate solution in terms of complexity and variety of recommendations. Finally, a combined approach of the TabTransformer (an extension of a multilayer perceptron) and a sentence transformer model called MPNet was implemented to capture the full variety of available information, consisting of tabular data and PDF files. This approach was able to produce more meaningful recommendations than the simple models themselves and ranked first or second in most evaluation scores. In addition, the results were compared with a simpler baseline model to highlight the capabilities of language models combined with the TabTransformer. The web store that was built, which includes the recommender system, improved the visibility and discoverability of all candidates for IP-Reuse, reducing the time to find relevant information by approximately 20% and the time to market by approximately two percentage points.
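A minimal sketch of the embedding-and-similarity step, assuming product descriptions are available as plain text; the publicly available all-mpnet-base-v2 checkpoint and the product texts below are stand-ins, not the thesis’s data or model configuration.

```python
# Sketch: encode product descriptions, build a similarity matrix, rank IP-reuse candidates.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

products = {
    "chip_a": "32-bit microcontroller with CAN interface and integrated flash",
    "chip_b": "Automotive 32-bit MCU, CAN-FD, embedded flash memory",
    "chip_c": "RF transceiver for short-range wireless sensor networks",
}

model = SentenceTransformer("all-mpnet-base-v2")
names = list(products)
embeddings = model.encode([products[n] for n in names])    # one vector per product

similarity = cosine_similarity(embeddings)                 # pairwise similarity matrix
query = "chip_a"
ranked = sorted(zip(names, similarity[names.index(query)]), key=lambda x: -x[1])
print([n for n, _ in ranked if n != query])                # IP-reuse candidates for chip_a
```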

Understanding movement in urban regions is essential for city and traffic planning. Network operators passively collect mobility data, i.e., Cellular Signaling Data (CSD), which offers the benefits of daily availability and wide coverage, potentially allowing them to provide insights similar to those of expensive and time-consuming traditional surveys. The basic motivation of this thesis is to explore the potential of differentiating between public and private transport modes using CSD, given the limitations caused by its spatiotemporal uncertainty and the complex multimodal urban environment. Therefore, this work investigated to which extent supervised machine learning is suited for distinguishing public (bus, train, and tram) and private (bicycle, car, and motorcycle) transport modes from Cellular Signaling and GIS data. A segment-wise classification was performed based on engineered features. For segmentation, a rule-based and a density-based clustering algorithm were used. Furthermore, a comparison of tree-based methods (Random Forest and Extreme Gradient Boosting) and a neural network (Multi-Layer Perceptron) is provided. The evaluation was performed on data collected in the city of Graz by 89 users. Among all evaluated methods, Random Forest (RF) achieved the best performance with the Trajectory DBSCAN (T-DBSCAN) segmentation algorithm using four out of 79 extracted features. An overall classification performance of 66% balanced accuracy and 60% F1 macro score was obtained.
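A hedged sketch of the segment-wise classification step: the feature names and values below are plausible examples of engineered CSD/GIS features, not the features actually extracted in the thesis.

```python
# Sketch: Random Forest classification of trajectory segments into public/private modes.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# one row per trajectory segment produced by the segmentation step (e.g. T-DBSCAN)
segments = pd.DataFrame({
    "mean_speed_kmh":   [12.1, 45.3, 30.7, 8.9, 52.0, 27.5],
    "speed_std":        [3.2, 10.5, 6.1, 2.8, 12.3, 5.9],
    "dist_to_rail_m":   [850, 40, 20, 900, 35, 15],     # GIS-derived feature
    "cell_switch_rate": [0.2, 0.9, 0.7, 0.1, 1.1, 0.6],
    "mode":             ["private", "public", "public", "private", "public", "public"],
})

X, y = segments.drop(columns="mode"), segments["mode"]
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_.round(3))))  # which features drive the split
```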

This master’s thesis investigates the application of Reinforcement Learning (RL) in the context of financial market-making, an area where decisions require strategic balancing between risk and reward. As an integral part of financial markets, market makers facilitate the trade process by providing immediate liquidity to market participants, offering bid and ask prices at which they can place orders. This dynamic and multi-objective problem invites an exploration of RL’s potential to manage these complexities. We begin with an in-depth analysis of the financial market’s operation and the nuanced roles of market makers, while also delineating the dilemmas they face. Then, the focus shifts to RL, detailing its theoretical underpinnings, recent advancements, and its potential applications in the world of finance. Notably, we introduce a Reinforcement Learning based market-making agent and train it in various market scenarios, highlighting RL’s ability to learn and adapt to a highly stochastic environment. This thesis presents a detailed comparison between an RL-based market maker and an Automated Market Maker (AMM), employing a variety of criteria to evaluate their performance. These include factors such as market quality, price stability, and utility (profit) achieved. Our results indicate that the RL-based market maker holds promise in its ability to maintain market quality and price stability comparable to the AMM. We have shown that the RL-based market maker collected 114% more profit than the AMM, had a 6.97% narrower spread, and provided a more stable price by making 91% fewer quote adjustments. Additionally, it is 74.84% more inventory-efficient, requiring fewer assets to be held while performing similarly or better. This study’s findings contribute to the growing body of evidence that RL can play a pivotal role in enhancing the capabilities of market makers in financial markets. Building upon these findings, we propose directions for future research, including the introduction of multiple market makers into trading sessions, the incorporation of full limit order books for more realistic market simulations, and improvements in the intelligence of simulated traders. By pursuing these lines of inquiry, we expect to advance our understanding of automated trading and market-making strategies and their impact on financial markets.

Simplified language has gained an increasing amount of attention in recent years, since facilitating clearer and more effective communication simply allows a larger target audience to be reached, and recent changes in legislation only further accelerate this development. This thesis focuses on the use of simplified language in short German texts, encompassing sentences and short paragraphs. There are three primary aspects to this study: (a) sentence alignment, wherein several alignment models are trained and evaluated; (b) automatic text simplification (ATS), through the design, training, and evaluation of several neural models that aim to produce simplified German text while retaining the original meaning and coherence; and (c) human evaluation of the simplification quality, comparing several quality assessment measures and their correlation with the human evaluation results, with the results indicating a modest correlation between human assessment scores and automatic quality assessment measures. Based on these findings, this thesis proposes a novel metric that holds the potential for a higher correlation with human judgment. This work also underscores the importance of quality assessment in the context of the adoption of Large Language Models (LLMs) for text generation tasks. As LLMs become increasingly prevalent in various applications, this thesis advocates for the rigorous, ongoing development and refinement of quality assessment metrics, ensuring that generated texts meet high standards of clarity, readability, and utility.
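A minimal sketch of the metric-validation step described above: correlating automatic quality scores with human ratings. The scores below are made-up placeholders, and the automatic measure is only assumed to be something like SARI.

```python
# Sketch: correlation between human judgments and an automatic quality measure.
from scipy.stats import spearmanr, pearsonr

human_scores  = [4.5, 3.0, 2.0, 4.0, 1.5, 3.5]        # e.g. mean simplicity ratings per output
metric_scores = [0.71, 0.55, 0.40, 0.62, 0.35, 0.58]  # e.g. SARI or a similar automatic measure

rho, p_s = spearmanr(human_scores, metric_scores)
r, p_p = pearsonr(human_scores, metric_scores)
print(f"Spearman rho={rho:.2f} (p={p_s:.3f}), Pearson r={r:.2f} (p={p_p:.3f})")
```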

Military crisis situations in recent years, especially the COVID-19 pandemic, have shown the importance of being able to evacuate CBRN-contaminated or infected soldiers or civilians by air. The aim of this Master’s thesis is to develop a multi-sensor patient monitoring system inside an aircraft that supports the caregiver in monitoring multiple patients and provides the basis for avoiding direct contact with the patient to minimise the risk of contamination. In order to implement the system properly, the requirements were first defined with the help of experts and a first mock-up was developed, which formed the basis of the system. This system consists of a database server, an Android application as a data relay, and a live dashboard. The developed system was extensively evaluated using synthetic data from a specially developed data simulator and in two field tests. In addition to the development of the system, the minimum transmission rate required to recognise all relevant characteristics of the vital data was evaluated, as well as the influence of the sampling rate on the integrated change point detection algorithm, which resulted in a minimum interval of 30 seconds and 1 minute, respectively. In addition, the plausibility of the data synthetically generated by the simulator was evaluated in comparison with real data, which gave very good results for data sets that are not too complex. In the end, an operational patient monitoring system was developed that meets all the basic requirements and can be used by medical escorts without much prior knowledge to provide additional support in monitoring CBRN-contaminated patients during air transport.
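As a hedged illustration of how the sampling-interval question can be examined (the thesis’s actual change point detection algorithm is not reproduced here), the sketch below downsamples a synthetic vital-sign signal to different transmission intervals and runs an off-the-shelf detector from the ruptures library.

```python
# Sketch: effect of the transmission interval on change point detection of a vital sign.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# synthetic heart-rate-like signal sampled once per second, with a level shift after 10 minutes
hr = np.concatenate([rng.normal(75, 2, 600), rng.normal(95, 2, 600)])

for interval_s in (1, 30, 60):                     # candidate transmission intervals
    signal = hr[::interval_s]                      # downsample to the transmission rate
    bkps = rpt.Pelt(model="rbf").fit(signal).predict(pen=5)
    print(f"interval={interval_s:>3}s -> change points at samples {bkps[:-1]}")
```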

Handwriting recognition has received more and more attention in times of digitalization. It is not only possible to transcribe Latin cursive and other common scripts, but also older documents written in Kurrent or other less well-known letter forms. Since this recognition task needs a considerable amount of training data, which ideally is already transcribed, the first difficulty arises here: intensive manual preprocessing and domain knowledge are necessary. This is where the connection to the first research question, regarding an experimental character-based handwriting recognition, is made. The problem statement is to determine whether a model can be trained that is able to perform the recognition task on a real handwritten document. Handwriting fonts are used to generate test data, which form the basis for the model. It turned out that this idea already stumbled over the complex problem of properly separating a word into its individual characters. A better approach than using the word histogram for splitting would therefore be needed to make progress with the initial idea. The second question this thesis deals with is whether left- and right-handwritten documents can easily be distinguished. If there are truly unique characteristics, this knowledge can support the general handwriting recognition process by training different character models for left- and right-handed persons. The challenging part of this research task is that there are not many left-handwritten documents in the dataset used. For that reason, meaningful features need to be found so that a classifier can be trained on them rather than on the whole document image. These characteristics can be determined and calculated with domain knowledge. The drawback of this solution is still that there are not many samples of left-handed documents, and for that reason the feature calculation needs to be very accurate so that a classifier can be based on it. Since some inaccuracies occurred during the feature calculation process, this step leaves room for improvement and future work. Having more accurate measures would probably enable the algorithm to find a significant separation line between both target classes. The third research question was about distinguishing between smileys and words within a document and, furthermore, assigning the smiley moods to the written text. This kind of classification can support the digitalization process where handwritten reviews or ratings of any kind contain smileys. The solution to this problem was worked out by first manually labelling a substantial number of samples and then training one classifier to decide between word and smiley and another classifier to distinguish between the smiley expressions. With that approach, individual sentences can be assigned a smiley as a rating, or even the whole document can be evaluated in one measurement by its summed-up smiley moods. In general, it can be said that the research field of handwriting recognition has many interesting open tasks that can be accomplished. This includes the future work possibilities this thesis provides, where new ways of needing less labelled data can be explored, and the common character models can be rethought by also taking writing styles into consideration. Future work in this area can also be seen more broadly, in the sense of applying existing ideas and algorithms with a more open mind to solve further research questions.

Anomaly detection has many applications, such as predictive maintenance and intrusion detection systems. It is usually an unsupervised problem, as ground-truth labels are not available. Isolation Forest is an algorithm that is widely used for such problems. In interactive anomaly detection, a human expert immediately reviews every detected anomaly and provides a ground-truth label as feedback. This feedback is used to optimize the detection model to retrieve a higher number of true anomalies. However, most anomaly detection algorithms, including Isolation Forest, are not capable of incorporating this feedback. Several algorithms were built upon Isolation Forest to include available labels, but it is unclear which one performs best in an interactive setting. This thesis investigates the performance of existing interactive anomaly detection algorithms based on Isolation Forest. Additionally, a new algorithm called Interactive Isolation Forest (IIF) is proposed. After a literature review, three algorithms were selected for evaluation. IF-AAD and OMD are state-of-the-art algorithms for interactive anomaly detection. TiWS-iForest is a supervised extension of Isolation Forest that had not been evaluated for the interactive scenario before. After analyzing the properties of these algorithms, the new algorithm IIF was designed by extending TiWS-iForest. A variant of the algorithm utilizes data pruning. Experiments using real-world data sets were conducted to evaluate the performance of the algorithms. The comparative evaluation shows that TiWS-iForest outperformed the unsupervised baseline. The new algorithm, IIF, achieved better results than TiWS-iForest. The performance of IIF improved if data pruning was used. IF-AAD and OMD achieved the best overall performance. IIF with data pruning performed best or second-best on four of eight data sets. Unlike IF-AAD and OMD, IIF requires no additional hyperparameters.
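A minimal sketch of the interactive setting only: a plain Isolation Forest ranks candidates and a simulated expert labels the most anomalous unlabeled point each round. Vanilla Isolation Forest ignores the labels; IIF-style methods would refit or reweight the trees using the collected feedback at the marked point.

```python
# Sketch: the query-feedback loop around an (unsupervised) Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.uniform(-6, 6, (10, 2))])  # normals + anomalies
true_anomaly = np.array([False] * 500 + [True] * 10)

model = IsolationForest(random_state=0).fit(X)
scores = -model.score_samples(X)          # higher = more anomalous
labels = {}                               # index -> expert feedback

for _ in range(5):                        # five feedback rounds
    candidates = np.argsort(scores)[::-1]
    top = next(i for i in candidates if i not in labels)   # most anomalous unlabeled point
    labels[top] = bool(true_anomaly[top])                  # simulated expert answer
    # an interactive method (e.g. IIF) would update the model with `labels` before the next query

print(f"true anomalies found in 5 queries: {sum(labels.values())}")
```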

For many employees in the IT services sector, time recording is a frustrating part of the working day. The aim of this work is to facilitate this part of the working day with the help of software. The automatic recognition of task changes based on behavioural changes allows for a prediction of day segmentation, so that employees only have to fill in the content according to their task description. This partial automation can be achieved in part by statistical methods such as anomaly detection and change point detection based on user input such as keyboard or mouse input. The experiments carried out resulted in F1 scores of 50%. The F1 score provides a good balance between recall and precision, where false positives are as important as false negatives. A generic set of parameters was found that can be applied universally to different people without significantly affecting the results. The recognition of longer tasks on the basis of the number of opened windows is even more precise, with an F1 score of 68%. A major problem, however, remains the massive intrusion into employees’ privacy. Transparent development could solve this problem, and employees around the world could save aggravation and time every day by using this software.

Accuracy, Miscalibration, and Popularity Bias in Recommendation

Time series forecasting poses a challenging problem in machine learning, mainly due to the changing statistical properties of the data over time. For instance, a time series might experience significant shifts in its mean or sudden changes in variance, and such changes pose a considerable challenge for traditional forecasting models. One approach to address this issue is continual learning, which allows for learning from new data without forgetting what was learned from previous data. This work proposes a continual learning approach to time series forecasting based on variational continual learning (VCL). VCL handles non-stationarity by adapting to new data while retaining previous knowledge through Bayesian inference and avoids catastrophic forgetting in a fully automatic way. The proposed approach is evaluated on several synthetically generated data sets and one real-world data set and compared to the equivalent artificial neural network model updated sequentially. The experiments suggest that VCL has the potential to be an effective tool for time series forecasting in certain non-stationary environments.
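The sketch below illustrates only the core VCL mechanism, not the thesis’s full forecasting model: a mean-field Gaussian layer whose prior is reset to the previous chunk’s posterior, so the KL term penalizes forgetting. The data stream and hyperparameters are placeholders.

```python
# Sketch: variational continual learning on a stream of data chunks.
import torch
import torch.nn as nn

class VariationalLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.logvar = nn.Parameter(torch.full((d_out, d_in), -6.0))
        self.register_buffer("prior_mu", torch.zeros(d_out, d_in))      # prior starts at N(0, 1)
        self.register_buffer("prior_logvar", torch.zeros(d_out, d_in))

    def forward(self, x):
        w = self.mu + torch.exp(0.5 * self.logvar) * torch.randn_like(self.mu)  # reparameterization
        return x @ w.t()

    def kl(self):  # KL(q_t || q_{t-1}) for diagonal Gaussians
        var, pvar = self.logvar.exp(), self.prior_logvar.exp()
        return 0.5 * (self.prior_logvar - self.logvar
                      + (var + (self.mu - self.prior_mu) ** 2) / pvar - 1).sum()

    def consolidate(self):  # the current posterior becomes the prior for the next chunk
        self.prior_mu.copy_(self.mu.detach())
        self.prior_logvar.copy_(self.logvar.detach())

layer = VariationalLinear(8, 1)
for chunk in range(3):                                    # stream of data chunks over time
    x, y = torch.randn(64, 8), torch.randn(64, 1)         # placeholder non-stationary data
    opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
    for _ in range(100):
        loss = ((layer(x) - y) ** 2).mean() + layer.kl() / len(x)
        opt.zero_grad(); loss.backward(); opt.step()
    layer.consolidate()
```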

Understanding the mobility of the population comprises, among other things, the comprehension of the activities which occur every day (home, work, education, leisure, shopping, etc.), specifically when they happen and how long they last. Since obtaining this information through travel surveys is expensive, scientists have been examining how knowledge about the daily activities of the whole population can be gained from data provided by cellular networks. The purpose of this study was to explore to which extent it is possible to detect education activities, in particular university activities, based on cellular signaling data (CSD), and what the possible limitations are. A rule-based approach was applied to already preprocessed CSD (provided by A1 Telekom Austria) on top of the pre-existing home and school activity detection. The calibration of the results was conducted based on student reference numbers; therefore, no separate validation step was provided. Although 94% of the defined reference numbers were reached overall, the findings of this study suggest that more information about an average student’s behavior is required in order to derive more accurate results. Additionally, the present study provides recommendations which could be beneficial for future research on this topic.

Earnings calls are part of the quarterly reporting procedures of large and public companies. While financial reporting usually focuses on historical results as structured data, the verbal presentation of the management team on those calls might give additional information. The goal of this thesis was to study whether Natural Language Processing (NLP) techniques can be used to process transcripts of these earnings calls and whether they are suitable to quantify the management’s future guidance in those transcripts. Four conventional statistical models and six deep learning models were trained and benchmarked on a simple extraction task. A forecaster was developed to further test whether there is any predictive value in the outlooks provided by the management teams. Finally, the two segments of an earnings call, the management’s presentation and the interactive Questions and Answers (Q&A) segment, were compared by their suitability for these forecasts. While not all deep learning models delivered the expected results, the conventional statistical models and some of the simpler deep learning models performed well on the benchmark task. The forecaster did match, and in parts outperform, the results of a reference model that solely used historical data to do the same. Looking at the different segments of the call, the Q&A segment appeared to provide the most information for the subtask of forecasting the company’s revenues.

The military monitors the combat readiness and functionality of its vehicles, aircraft, and other machinery, yet tends to neglect the operational readiness of its human resources during field deployment or military training. Overlooking this can lead to a decline in a soldier’s physical and mental performance, which can negatively impact the outcome of the military activity. CBRN soldiers, who perform challenging work tasks while wearing encapsulating protective clothing, are most notably affected by this. We therefore propose a two-component system consisting of a strain classifier and a heat stress early warning classifier that utilizes supervised machine learning algorithms to assess the current health state of an individual exposed to heat stress. A great variety of physiological and thermoregulatory data was recorded by various biosensors during relevant studies, which simulated the exertion experienced by CBRN soldiers. The data was then processed, analyzed, and used for the development of the components. The first component of the system, the strain classifier, deploys a perceptual scoring scale as a performance status indicator. This ensures that the individual’s thermal tolerance limit is not disregarded. The classifier scored a CV accuracy of 48.55%. Since wearing comfort and the acceptability of the system are of the highest priority, we worked out the minimal sensor set. Additionally, we evaluated the importance of the body core temperature variable to diminish the financial strain caused by disposable sensors. Yet, further research is needed to make the classifier deployable for military scenarios. The second component of the system, the heat stress early warning classifier, detects a potentially hazardous change in body core temperature in the next 15 minutes with a test accuracy of 84.63%. We therefore conclude that the classifier issues valid early warnings to soldiers, unit leaders, and medical personnel and therefore impacts the outcome of a military activity positively.

I would like to express my deepest gratitude to Ass.Prof. Roman Kern for his invaluable supervision and feedback throughout the whole thesis. This endeavor would not have been possible without his expertise and guidance. Additionally, I am extremely grateful to Milan Živadinović, MSc. for the knowledge and support he provided. Without his data science expertise, this thesis could not have taken place. Furthermore, I want to give special thanks to Simon Erker, PhD. and Ass.Prof. Christian Hametner for their battery domain know-how and guidance. I am also thankful to the Institute of Interactive Systems and Data Science (ISDS) for enabling me to engage in this research activity. I would like to thank Graz University of Technology for the educational background that made it possible to tackle this Master’s thesis. I would like to extend my sincere thanks to AVL List GmbH for funding this research and for sharing the datasets. Furthermore, the computations would not have been possible without the compute cluster provided by AVL. Additionally, I am also thankful to my colleagues, who provided me with valuable insights and ideas. Lastly, I would be remiss in not mentioning my family and friends for believing in me and keeping my spirits and motivation high during all steps of this Master’s thesis.

Correctly identifying irregular heartbeats is a time-critical task that can prevent many Sudden Cardiac Deaths (SCD) worldwide. Classifying a patient’s heartbeat and deciding whether it is pathological or not is the primary goal of a large number of ongoing studies. Especially in the medical field, Boruta, a wrapper around Random Forest, is often used as a feature selection algorithm due to its fast and reliable performance. However, as seen in the current research, Boruta is either used on time series data, often electrocardiograms (ECG), in a batch learning setting or applied to data streams of other domains. The aim of this thesis is to test whether Boruta can be applied to data streams for correctly classifying healthy and pathological heartbeats. A window size that was small yet had adequate overall classification performance was identified, and the quality of the selected features was assessed. It was seen that distinctive features could be deemed relevant based on the type of heartbeat. Insights from both experiments were combined, and an online pipeline was implemented, including Boruta and Hoeffding trees. It was shown that Boruta can indeed be applied to data streams, leading to promising results. Overall, applying different methods for online feature selection, a relative MCC of 79.26% and 69.73%, respectively, compared to the offline approach could be achieved. However, further insights into the extraction of the minimal window need to be gained, and the extraction of additional ECG-specific features needs to be considered.
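As a hedged illustration of window-wise feature selection, the sketch below re-runs Boruta (from the boruta package) on successive windows of a synthetic feature matrix; the window size, features, and labels are placeholders, not the thesis’s ECG setup.

```python
# Sketch: Boruta feature selection repeated per data window to mimic a streaming setting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

rng = np.random.default_rng(0)
n, window = 3000, 1000
X = rng.normal(size=(n, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.5, n) > 0).astype(int)  # only features 0 and 3 matter

for start in range(0, n, window):
    Xw, yw = X[start:start + window], y[start:start + window]
    rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
    selector = BorutaPy(rf, n_estimators="auto", random_state=0, verbose=0)
    selector.fit(Xw, yw)                                   # expects numpy arrays
    print(f"window {start // window}: selected features {np.where(selector.support_)[0]}")
```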

With recent systems like ChatGPT being able to amaze people, creating new headlines every day, the interest of the public in NLP systems has risen. We are likely to see AI becoming more popular in aiding our daily work in new and innovative ways in the near future, and in different application domains, too. One such application domain is automated Question Generation to aid in learning and teaching. The basis of ChatGPT, the Transformer model, has been around since it was introduced by Vaswani et al. (2017), and with it a lot of research and different models that focus on the task of Question Generation. However, hardly any of this research focused on languages other than English. In this work we created a system for Question Generation in German, utilizing existing pre-trained transformers and comparing different models to find the best one. Basing our research on multilingual models and testing MBart and MT5, the aim was to identify the better-performing model in order to give a recommendation on which one to choose when creating a transformer-based Question Generation solution for the German language. To achieve this, we fine-tuned the models with settings as comparable as possible. We also investigated some variations. In the end we are able to recommend MBart for German QG. The insights of this work will be of aid for anyone who aims to find a state-of-the-art way of creating a Transformer-based solution for German Question Generation.

This thesis presents our experiments on investigating priming effects and their influence on the performance of authorship attribution methods. We translate the concept of priming from psychology, where individuals react differently if they are exposed to certain information beforehand, to a natural language processing context. We make the case that there are additional features (meta features), beyond traditional features like n-grams and word embeddings, that could improve such methods. We start by giving background information about priming, the platform we collected our data from (Reddit), and authorship attribution. After briefly describing the preliminary work that was done, we explain the feature extraction process in detail. Finally, we present our results on different model architectures and variants, where we obtain a consistent improvement in accuracy of around 2% by integrating stimulus features. According to our data, the influence of the meta features is diminishing, while the additional information of the base features in stimulus comments is responsible for the slight boost in the performance of our models.

In a rapidly developing digital world where an insignificant event can go viral, memes play a central role, the Karen meme being one of them. Aside from detecting Karen-like behaviour in our day-to-day (physical and digital) life, we do not yet have a means of detecting such behaviour through algorithmic models. Therefore, this thesis aims to define who Karen is and how to detect situations with Karen-like behaviour online. In order to solve this problem, a "Karen" data pipeline was created, encompassing several phases such as data collection (text and images), data processing (cleaning, sampling and labeling, balancing), and classification (NLP and image) models. The accuracy levels varied with the data source, with the accuracy percentages ranging from 70 to 80%. The performed error analysis showed the likely reasons for the inaccuracies, which were context and manner of description (for text) and merged pictures, low picture quality, and dark backgrounds, among others (for images). On the other hand, the evaluations fell into the moderate agreement category for the image collections and the fair agreement category for the text collection. From this it was concluded that the detection of Karen-related situations, depending on the source collection, can be considered a moderately to highly complex topic. The models used in the detection of Karen-related situations are valid for the "type" of Karen that is defined in this thesis. To be able to use them for a modified definition of Karen, the models must be retrained with different or newer samples. While the thesis detects Karen-related situations, it does not detect who the Karen in the situation is. This could be a possible future research topic.

Basic chatbots that rely on simple pattern matching have been around for decades. Recent developments in large language models, often based on the Transformer architecture, allow for more sophisticated chatbots. In this work we investigated how a chatbot based on GPT-3, a large language model that has been shown to be capable of producing human-like text, performs in an open-domain scenario. To allow a human to interact with the chatbot via voice, we used Speech-to-Text and Text-to-Speech components. We performed a variety of conversations with the chatbot, some based on datasets for dialogue and Q&A. After the conversations, we used different evaluation measures to investigate how suitable GPT-3 is in the context of a chatbot and where its limitations lie. The findings support that GPT-3 is a good choice both for Q&A and for conducting conversations. At the same time, we also identified some limitations of using GPT-3 in a chatbot. Specifically, we found limitations such as repeated answers, factually wrong answers, and biases. We also evaluated the components that allow for a speech interface and found limitations, especially in the Speech-to-Text component we used. The limitation of wrongly transcribed texts was partly offset by an interesting capability of GPT-3, which often interpreted the mistranscribed texts correctly. We showed that GPT-3 is a good choice for a chatbot and stated how future work can address the limitations we observed.

-

Causality in historic documents is an important source of information for historians. Manually finding relevant causal relations in the immense number of documents is a time-intensive process. To support historians in their work, we created a novel approach for causal relationship extraction and introduced a dataset of historical documents annotated for causal relations in German. Our proposed model for causality extraction was based on BERT. We extended traditional sequence labeling approaches to allow the model to detect multiple overlapping relations. The model created distinct context embeddings per causal relation, from which associated causal arguments, such as cause and effect, were detected. Additionally, we assigned a causal type and degree to each relation. Our model outperformed a pattern-based approach in all tasks. We evaluated various BERT models, pre-processing steps, and transfer learning approaches. German BERT models generally performed better than multilingual models, and pre-training on contemporary texts performed similarly well to pre-training on historical texts. Transfer learning on related tasks could improve the model overall. Pre-processing the text to correct historic spelling variations or including additional information about coreferences did not increase the performance. We also found evidence that BERT learns about causal relationships during self-supervised pre-training, indicating that causality is integral to encoding information in natural text. The promising results of our model demonstrate the potential to support historians in their work by recommending relevant passages containing causal relations or by creating knowledge bases from cause-and-effect relationships.
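A simplified sketch of the underlying formulation: plain BIO sequence labeling of cause and effect spans with a German BERT checkpoint. This is not the thesis’s multi-relation extension with per-relation context embeddings, and the classification head below is untrained, so it only illustrates the input/output interface.

```python
# Sketch: token classification of cause/effect spans with a German BERT model.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-CAUSE", "I-CAUSE", "B-EFFECT", "I-EFFECT"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(labels)  # head is randomly initialized here
)

sentence = "Wegen der Missernte stiegen die Getreidepreise stark an."
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                      # (1, seq_len, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), pred)))
```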

Anomaly detection refers to finding patterns in data structures that appear abnormal or deviate from a well-defined concept of expected normal behaviour within datasets. Anomaly detection is widely used in industrial applications because undetected anomalies can cause considerable losses. This thesis focuses on anomaly detection in the sense of identifying changes, differences, or anomalies in an automatic measurement system developed and used by NXP Semiconductors Austria GmbH & Co KG. The automated measurement system is responsible for conducting measurements regarding Near-Field Communication (NFC) devices, testing their performance and compliance with the ISO 14443 standard. Detecting anomalies in the measurement system is crucial for NXP because finding anomalies in the measurement data would imply that the new firmware of the product is not working as expected. The target of this thesis is to evaluate the chosen anomaly detection approach applied to the automation system and to estimate the most suitable number of test run data used as a baseline for the detection. Furthermore, we check whether the algorithm is satisfactory with respect to the detection accuracy of the anomaly detection system in different measurement setups, evaluate the algorithm against false negatives and false positives, and observe how accurate it is. We selected the machine learning algorithms DBSCAN and LOF in this work. The chosen machine learning algorithms are applied in one-class classification mode to solve the anomaly detection problem. We decided on a one-class classification approach because obtaining normal data that behaves as expected is more feasible than considering all possible anomalous data. The method presented in this work is evaluated on real measurement data collected and generated in NXP’s laboratory. Since various software tools run within the automated measurement system, different data structures and formats are generated, and the data collected by the software tools differ from each other. Hence, we first had to parse the data into a consistent data format, JSON. Additionally, using domain knowledge, with which we defined the expected behaviour of measurements and inferred anomalies from this definition, we generated artificially anomalous data by injecting anomalies into normal datasets. The presented method is also evaluated and tested against this dataset. This work revealed that a model developed and evaluated on a specific domain setup cannot be generalised and applied to a different domain while still obtaining the same satisfactory results. The evaluation of the number of baseline data used for a model indicated that the performance does not solely depend on the number of data instances used but on the information content introduced by new data instances.
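A minimal sketch of the one-class setup described above, using synthetic placeholder features instead of the NXP measurement data: the model is fitted on baseline (normal) runs only and then scores new runs, one of which has an injected anomaly.

```python
# Sketch: Local Outlier Factor in novelty (one-class) mode on baseline measurement features.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal_runs = rng.normal(loc=[1.0, 50.0], scale=[0.05, 1.0], size=(200, 2))   # baseline test runs
new_runs = np.vstack([rng.normal([1.0, 50.0], [0.05, 1.0], (5, 2)),
                      [[1.4, 70.0]]])                                          # one injected anomaly

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(normal_runs)
print(lof.predict(new_runs))          # +1 = consistent with the baseline, -1 = anomaly
print(lof.score_samples(new_runs))    # lower score = more anomalous
```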

Data-related issues are one of the main reasons why current industrial projects cannot be accomplished, because data collection processes are too complex, time-consuming, and often very expensive. At the same time, datasets of insufficient size are often responsible for poor performance in machine learning projects. Therefore, a balance between the amount of data that can be collected and the amount of training data needed to achieve a certain model performance needs to be found, which makes this a trending topic in today’s Artificial Intelligence research. In this thesis, the impact of reducing the amount of training data used for learning a binary classification Support Vector Machine model is studied. The results show that this reduction decreases the accuracy achieved by the model and increases its variance. Also, the corresponding generalization error increases when decreasing the size of the training sets. Multiple aspects of the data and the model itself need to be studied before defining the minimum size of a training dataset required to achieve certain results. Different conditions related to the original data, such as different datasets, dimensionalities, or statistical properties, are considered. Also, some modifications of the data, such as considering synthetic data oversampled by different Data Augmentation techniques, are applied. It is shown that such techniques improve the test accuracy of the model but do not prevent overfitting. Finally, multiple configurations of the model itself, such as hyperparameter tuning and regularization techniques, are covered. An in-depth comparison of all the results is performed, focusing on the performance of the model in terms of test and train accuracies, misclassification errors for each class, variance of the results over many runs, and generalization error.
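A sketch of the central experiment type: repeatedly training an SVM on shrinking subsets of the training data and tracking mean test accuracy and its spread. The data set, fractions, and hyperparameters are placeholders, not the thesis’s configuration.

```python
# Sketch: test accuracy and variance of an SVM as the training set shrinks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for frac in (1.0, 0.5, 0.2, 0.05):
    accs = []
    for seed in range(10):                               # repeat to estimate variance
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_train), int(frac * len(X_train)), replace=False)
        accs.append(SVC(C=1.0, kernel="rbf").fit(X_train[idx], y_train[idx]).score(X_test, y_test))
    print(f"{frac:>4.0%} of training data: acc={np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```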

As the spread of false information has become ever more problematic in recent years, research on automatic fact-checking methods has intensified. Typically, such approaches rely on an explicit knowledge base to verify claims. They use a pipeline that first retrieves relevant documents, then passages therein, and finally performs entailment, i.e., predicts whether the evidence supports the claim or not. The current state of the art mostly uses a variation of a standard Transformer with full self-attention for the entailment. However, its quadratic memory complexity limits the amount of evidence the model can process. In this thesis, we study the use of various different, more efficient Transformers as entailment models, allowing them to process more evidence. We compare these techniques and balance the advantages and disadvantages. The efficiency improvements allow us to completely remove the passage retrieval step, resulting in significant savings in computational cost for the complete pipeline while achieving 97-99% of the current state-of-the-art performance on the benchmark data set FEVER. Further, our experimental results show that the efficient Transformer Longformer outperforms a RoBERTa baseline for long evidence documents, as it can process more input within the same memory budget. Overall, we find using more evidence beneficial for predictive performance. Using efficient Transformers can reduce the computational costs of fact-checking pipelines and allow them to handle longer evidence documents.
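A hedged sketch of the long-document entailment step with Longformer. The checkpoint below carries an untrained classification head, and the claim/evidence pair is made up, so the snippet only illustrates the claim-plus-long-evidence input format and the three-way output.

```python
# Sketch: claim/evidence entailment interface with a Longformer classifier.
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=3)        # SUPPORTS / REFUTES / NOT ENOUGH INFO

claim = "The Eiffel Tower is located in Berlin."
evidence = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris. " * 50

enc = tokenizer(claim, evidence, truncation=True, max_length=4096, return_tensors="pt")
enc["global_attention_mask"] = torch.zeros_like(enc["input_ids"])
enc["global_attention_mask"][:, 0] = 1                   # global attention on the first token
with torch.no_grad():
    print(model(**enc).logits.softmax(-1))               # class probabilities (untrained head)
```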

One of the many achievements of Artificial Intelligence applications in recent years involves Natural Language Processing (NLP). The sheer amount of unstructured textual data produced on a daily basis is seemingly increasing with no end in sight. Unfortunately, not only well-intended texts are found online. Therefore, for many companies, organizations, and the like, a need arises to automatically evaluate submitted text to assist human moderators in finding potentially dangerous, toxic, and other similarly negative contributions, in order to either censor those or even block the author completely from future submissions. The goal of this thesis was not only to try to find such contributions but also to actively de-escalate agitated users on Reddit and thereby influence their overall behaviour such that censoring might not even become necessary. A case study was conducted on the r/Austria subreddit, which at that time had approximately 311,000 members. For the study, a chatbot was implemented which tried to de-escalate agitated users with previously prepared priming phrases, meant to motivate the users to edit their comment into a non-aggressive form or to delete it altogether, and implicitly to get the users to sanction aggression themselves in the future. The study was only partly successful: in most cases the intervention attempts sparked even more aggression, while in some other cases they were perceived as having a positive impact and even led to people actively changing or deleting their comments.

Current processes for the surveillance of critical infrastructures, comprising objects of interest at greater heights, mainly rely on manually capturing large quantities of RGB image data using an Unmanned Aerial Vehicle (UAV), which entails a high subsequent effort for the manual human evaluation of thousands of images. Moreover, the evaluation process on extensive amounts of data is prone to error, has a higher demand for Information Technology (IT) infrastructure such as data storage, and causes higher computational effort when conducting fault detection using computer vision methods. Consequently, this thesis presents a method to estimate the absolute six degrees of freedom (6DOF) pose of a single RGB camera in a predefined map coordinate system based on its captured images, giving a camera-equipped UAV a better understanding of the spatial realities in a real-world scene. Accordingly, the work contributes to a refined future process of critical infrastructure surveillance aiming for a higher level of autonomy during the acquisition of close-proximity images for fault detection, pursuing the collection of low-quantity but high-quality data and therefore lowering the subsequent manual human or computational evaluation effort. The estimation of the capturing camera’s pose is based on an existing three-dimensional (3D) representation of the target infrastructure, two-dimensional (2D) object center coordinates of objects of interest, estimated using a reliable but fast object detection model, a calibrated camera system, the mathematical model of the image formation process, and a photogrammetric system capable of estimating the desired pose through least-squares parameter adjustment. Multiple experiments on a carefully designed hardware prototype, which serves as the target infrastructure during development, reveal that the precision of the determined pose depends on the number of detections, the alignment of the calculated object center coordinates on the image plane, and their distance to the ideal projections of the corresponding 3D object coordinates. The results prove the feasibility of the developed process for pose estimation on single imagery, show its limitations, and point out crucial future work. Notably, the pose estimations conducted within this work show a mean distance of about 6 cm from the ground-truth capturing camera’s position using 2D image coordinates derived from estimated bounding boxes of the chosen object detection model. Combined with a reasonably small average deviation of less than 2.26 degrees from the expected orientation degrees of freedom, the pose estimation quality shows promising practicality for achieving a higher level of autonomy in UAV navigation.
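As a hedged illustration of the pose estimation principle, the sketch below uses OpenCV’s PnP solver as a stand-in for the photogrammetric least-squares adjustment described above: known 3D object centers in the map frame plus their detected 2D image centers yield the camera’s 6DOF pose. All coordinates and intrinsics are made up.

```python
# Sketch: 6DOF camera pose from 3D object centers and their detected 2D image centers.
import numpy as np
import cv2

object_points = np.array([[0, 0, 0], [1.2, 0, 0], [1.2, 0.8, 0], [0, 0.8, 0],
                          [0.6, 0.4, 0.3], [0.2, 0.6, 0.5]], dtype=np.float64)  # map frame, meters
image_points = np.array([[320, 400], [780, 390], [790, 160], [310, 150],
                         [560, 270], [420, 200]], dtype=np.float64)             # detected centers, px

K = np.array([[1000, 0, 640], [0, 1000, 360], [0, 0, 1]], dtype=np.float64)     # calibrated intrinsics
dist = np.zeros(5)                                                              # no lens distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)
camera_position = (-R.T @ tvec).ravel()          # camera center in the map coordinate system
print(ok, camera_position)
```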

Physics-informed neural networks (PINNs) are an emerging class of deep learning methods capable of solving both forward and inverse problems of differential equations. They have gained great popularity due to the seamless integration of both observational data and prior information about the underlying physical system into a combined multi-objective cost function. As a result of the additional physics loss term, PINNs can be employed in applications where purely data-driven methods are doomed to failure due to insufficient data quantity and quality. Despite extensive research, PINNs are still difficult to train, especially when little data is available and the optimization relies heavily on the physics loss term. In particular, PINNs suffer from severe convergence problems when simulating dynamical systems with high-frequency components or chaotic or turbulent behavior. In this work, we discuss the question of whether PINNs are a suitable method for predicting chaotic motion by conducting several experiments on the undamped double pendulum. The experimental results demonstrate that the additional information of the physics loss term effectively improves a purely data-driven approach in the presence of noisy, incomplete, or only partially observed data. However, their prediction accuracy degrades immensely in the chaotic regime. In contrast to the behavior of a chaotic system, PINNs do not exhibit any sensitivity to perturbations in the initial condition. Instead, PINNs consistently converge to certain highly attractive solutions that deviate strongly from the reference but display significantly lower values for the physics loss. We find that only a reduced computational domain combined with an appropriate loss weighting scheme allows convergence to the correct solution.
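The sketch below illustrates only the multi-objective loss structure on a much simpler system than the double pendulum (a single frictionless pendulum, theta'' + (g/l) sin(theta) = 0): a data term on a few observations plus a physics residual on collocation points, with a weighting factor as one of the tuning knobs.

```python
# Sketch: combined data + physics loss of a PINN for a simple pendulum.
import torch
import torch.nn as nn

g, l = 9.81, 1.0
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))

t_data = torch.tensor([[0.0], [0.5], [1.0]])                  # few observed time points
theta_data = torch.tensor([[0.3], [0.05], [-0.28]])           # placeholder angle observations
t_phys = torch.linspace(0, 2, 100).reshape(-1, 1).requires_grad_(True)  # collocation points

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    theta = net(t_phys)
    dtheta = torch.autograd.grad(theta, t_phys, torch.ones_like(theta), create_graph=True)[0]
    ddtheta = torch.autograd.grad(dtheta, t_phys, torch.ones_like(dtheta), create_graph=True)[0]
    loss_physics = ((ddtheta + (g / l) * torch.sin(theta)) ** 2).mean()   # ODE residual
    loss_data = ((net(t_data) - theta_data) ** 2).mean()
    loss = loss_data + 1e-2 * loss_physics                    # loss weighting is a tuning knob
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss_data), float(loss_physics))
```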

Generative Adversarial Networks (GANs) are currently seeing wide application in data augmentation tasks. While most studies focus on the generation of image datasets, little research has thus far focused on geomorphological data. In this work, we propose an innovative way of applying GANs to the production of synthetic tridimensional scenes through the generation of RGB-depth images in the context of an industrial planetary exploration use case. These landscape objects not only encapsulate 3D coordinates but also colour textures, both drawn from the distribution of colour and depth values of real Martian landscapes. To enable this, we present an end-to-end pipeline, consisting of an RGB-depth data collection strategy using widely available open-source 3D computer graphics software, a preprocessing and data preparation strategy, the Spatial GAN (SGAN) neural architecture, and a 3D conversion post-processing module. With the help of this pipeline, we manage to generate artificial tridimensional Martian environments that look strikingly realistic to the human eye. Lastly, we explore the limits of this approach and possible improvements to it.
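A sketch of the 3D conversion post-processing idea: turning a generated RGB-depth image into a colored point cloud with a pinhole camera model. The image data and intrinsics below are random placeholders, not outputs of the thesis’s SGAN.

```python
# Sketch: back-projecting an RGB-depth image into a colored 3D point cloud.
import numpy as np

h, w = 128, 128
rgb = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)      # stand-in for a generated texture
depth = 5.0 + np.random.rand(h, w)                              # stand-in for generated depth [m]
fx = fy = 100.0                                                 # assumed focal lengths (pixels)
cx, cy = w / 2, h / 2                                           # assumed principal point

u, v = np.meshgrid(np.arange(w), np.arange(h))
x = (u - cx) * depth / fx                                       # back-project each pixel
y = (v - cy) * depth / fy
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)        # (h*w, 3) scene coordinates
colors = rgb.reshape(-1, 3)                                     # matching RGB per point
print(points.shape, colors.shape)
```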

The rise of modern DNA sequencing methods and tools has led to an abundance of readily available genomic data. Since identifying the locations of genes and coding regions in novel organisms is a time-intensive process, we endeavored to create a pipeline, which produces informative embeddings from raw DNA sequences. Salient features are learned using autoencoder neural networks. Models with different parameter values and combinations of layer types were trained and evaluated. The autoencoders transform a given genome into a point cloud in the latent space. We implemented and evaluated various sampling methods, which compress this point cloud into a compact representation. The quality of the embeddings was validated on a downstream task of taxonomic realm prediction of novel organisms from their raw DNA sequences. Furthermore, we propose several embedding visualizations for intuitive genome understanding and comparison.
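A hedged sketch of one simple way to compress the latent point cloud of a genome into a fixed-length embedding via summary statistics; this is just one plausible pooling scheme, not necessarily among the sampling methods evaluated in the thesis, and the latent points are random placeholders.

```python
# Sketch: pooling a genome's latent point cloud into one compact embedding.
import numpy as np

latent_points = np.random.rand(5000, 32)     # stand-in: 5000 encoded sequence windows, 32-dim latent

def pool_point_cloud(z: np.ndarray) -> np.ndarray:
    parts = [z.mean(axis=0), z.std(axis=0)] + [np.percentile(z, q, axis=0) for q in (10, 50, 90)]
    return np.concatenate(parts)             # one genome-level embedding

genome_embedding = pool_point_cloud(latent_points)
print(genome_embedding.shape)                # (160,) = 5 summaries x 32 latent dimensions
```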

The proliferation of alternative energy sources and the advancing use of electric mobility have increased the need to stabilize the power grid these technologies depend on. This work explored whether machine learning in general, or Bayesian deep learning in particular, could be utilized to facilitate the deployment of private photovoltaic installations with an attached battery energy storage system as part of the primary frequency response reserve. In order to amend the data set and gain a better understanding of the underlying mechanics, two simulations were created. The first emulates the energy consumption of a private household based on the devices which are assumed to be used within it. The second replicates the behaviour of a charge controller in a private household with an attached photovoltaic installation and battery. Several prediction methods were matched against each other to find good estimators for the energy output of photovoltaic installations and the energy consumption of private households. The simulations, in conjunction with the estimators found, were then used to find good strategies for providing frequency response reserve using genetic optimization. The results showed that, in theory, households with an attached photovoltaic installation and battery energy storage system can be used to provide frequency response reserve in a profitable manner. However, for practical applications the accuracy of the prediction models would need to be higher. Knowing now that profitable strategies exist, further research can be done into increasing the prediction accuracy.

Despite the existence of numerous algorithms in the anomaly detection field, enhancing the performance of such algorithms still remains an open research topic. This thesis proposes using causal relationships between variables to improve the accuracy of anomaly identification tasks. An algorithm using a structural causal model as a basis for anomaly detection is introduced and evaluated using synthetic data sets. Unlike numerous other anomaly detection algorithms relying on measuring distance or density between data points, the proposed algorithm compares the actual value against the predicted expected value for the same data point and, using this difference, labels some data points as anomalies. The results of the causally informed anomaly detection algorithm are compared with three well-performing unsupervised machine learning algorithms for anomaly detection to understand whether this approach is useful and capable of detecting anomalies which the other algorithms miss. The algorithm using the structural causal model achieved an up to 33.5% higher F1 score compared to the next best performing machine learning model, indicating that this approach can be used for anomaly detection and provide good results.
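A minimal sketch of the causally informed idea under simplifying assumptions (a known two-variable graph and a linear mechanism, with synthetic data): predict each variable from its causal parents and flag points whose residual is unexpectedly large.

```python
# Sketch: anomaly detection via residuals of a (linear) structural causal model x -> y.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(0, 0.3, 1000)        # structural equation: y = 2x + noise
y[::100] += 4.0                               # inject a few anomalies that break the mechanism

model = LinearRegression().fit(x.reshape(-1, 1), y)       # learn the mechanism y = f(parents)
residuals = y - model.predict(x.reshape(-1, 1))
z = np.abs(residuals - residuals.mean()) / residuals.std()
anomalies = np.where(z > 3)[0]                             # points inconsistent with the SCM
print(anomalies)
```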

During the manufacturing process in the semiconductor industry, a large amount of data is produced. Measurements can be visualized with so-called wafer maps. Different problems in the manufacturing process may cause a decrease in production yield. Very often, patterns on the wafer map indicate these problems at an early stage; a malfunctioning production tool may, for example, be indicated by a well-known pattern. For standard pattern recognition tasks, a set of labelled data is needed in order to train classification architectures. This master’s thesis proposes a method using a generative adversarial network (GAN) called BigBiGAN, which creates a low-dimensional representation of the input wafer map. By applying different clustering methods, sets of similar wafer maps can be clustered without prior knowledge. If the network is correctly parametrized, the data is properly pre-processed, and the clustering methods are well suited, it is possible to generate labelled data sets of any size from real-world data. This approach is able to replace the manual creation of data sets, which is a time-consuming and error-prone task.

The influence of customer reviews on purchase decisions cannot be denied nowadays. People use the experiences and opinions of others to inform themselves about products and services before buying. However, this behaviour can be exploited by deliberately placing fake reviews in order to cast products in a different light. This work deals with the identification of fake reviews. The goal is to gain a better understanding of how potentially fake reviews are handled in Austria. This raises the question of whether there is an awareness that fakes exist at all. Furthermore, considerations regarding the harmful influence of fake reviews and how to deal with them are examined. Through interviews with experts, the strategies applied in practice are surveyed and discussed against the possibilities described in the current literature. The selection of suitable features is intended to show that determining behavioural data yields a clear improvement over considering the content alone. The interviews show that, due to the low number of reviews, individually approving each review is a popular solution, which leaves large potential for automation as the numbers grow. Using suitable feature engineering on a data set of 3,845 public reviews, it can be shown that, despite seemingly good moderation, 10% of the products in a top-rated product list obtained their place through suspicious accounts and reviews. A recommendation for action is presented which, through human detection combined with the provided behavioural features, delivers higher detection rates in a simple way. Further research could show which steps are necessary for a structured transition from human identification to fully automatic detection.

As a result of the introduction of new technologies such as cloud computing and big data, we are seeing fundamental changes in traditional business models as well as the emergence of new ones. More and more companies collect and use data to expand their competitive advantage. However, many of them struggle to use the available data effectively and to implement a data-driven business model. To carry out such a fundamental organizational change, an assessment of the current business model is necessary in order to determine its maturity with regard to data collection and use. The aim of this thesis was to examine the scientific literature on maturity models for assessing data-driven companies, as well as the bulk of the grey literature produced by consulting firms, in order to summarize the state of research, construct a maturity model and, finally, give data-driven organizations a tool to better understand their current data-related capabilities and to identify business areas with future potential. The evaluation of 16 existing maturity models revealed the following limitations: the lack of a systematic framework for developing maturity models and insufficient documentation of the creation process. The last step of the thesis comprised the creation of a maturity model that takes the limitations of the evaluated maturity models into account, and the testing of the newly created model in an interview study. No such assessment can replace the institutional knowledge of long-standing executives, but a well-designed maturity model, divided into manageable dimensions, can help organizations move from observations to practical, profitable courses of action.

Evaluation of Job Recommendations for the Studo Jobs Platform

Today the internet is growing fast as users generate an increasing amount of data. Therefore, finding relevant information is getting more and more time-consuming, since the data is distributed over various information sources. Search engines filter data and reduce the time required to find relevant information. We focus on scientific literature search, where search engines help to find scientific articles. An advantage of scientific articles is that they share a common structure to increase their readability. This structure is known as IMRaD (Introduction, Method, Results and Discussion). We tackle the question of whether it is possible to improve search result quality for scientific works by leveraging IMRaD structure information. We use several state-of-the-art ranking algorithms and compare them against each other in our experiments. Our results show that the importance of IMRaD chapter features depends on the complexity of the query. Finally, we focus on structured text retrieval and the influence of single chapters on the search result. In doing so, we set out to improve the quality of the results produced by state-of-the-art ranking algorithms for scientific literature research.
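
The following sketch illustrates one plausible way to leverage IMRaD structure in ranking, assuming the rank_bm25 package and a toy two-document corpus; the per-chapter weights and the weighted-sum combination are illustrative assumptions, not the ranking models evaluated in the thesis.

```python
from rank_bm25 import BM25Okapi

# Toy corpus: each document is split into IMRaD chapters (token lists).
docs = [
    {"introduction": "we study neural ranking models".split(),
     "method": "bm25 and learning to rank are compared".split(),
     "results": "bm25 performs well on short queries".split(),
     "discussion": "structure information helps complex queries".split()},
    {"introduction": "a survey of text segmentation".split(),
     "method": "graph based segmentation with embeddings".split(),
     "results": "embeddings improve boundary detection".split(),
     "discussion": "future work on multilingual corpora".split()},
]

chapters = ["introduction", "method", "results", "discussion"]
weights = {"introduction": 0.2, "method": 0.3, "results": 0.3, "discussion": 0.2}

# One BM25 index per IMRaD chapter
indexes = {c: BM25Okapi([d[c] for d in docs]) for c in chapters}

def rank(query):
    tokens = query.lower().split()
    # Weighted sum of per-chapter BM25 scores for every document
    scores = [sum(weights[c] * indexes[c].get_scores(tokens)[i] for c in chapters)
              for i in range(len(docs))]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

print(rank("bm25 ranking for complex queries"))
```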

Automatically separating text into coherent segments sharing the same topic is a nontrivial task in the research area of Natural Language Processing. Over time, text segmentation approaches have been improved by applying existing knowledge from various fields, including linguistics, statistics and graph theory. At the same time, obtaining a corpus of textual data that varies in structure and vocabulary is problematic. The currently emerging application of neural network models in Natural Language Processing shows promise, as can be seen in particular in Open Information Extraction. However, the influence of knowledge obtained by an Open Information Extraction system on a text segmentation task remains unknown. This thesis introduces a text segmentation pipeline supported by word embeddings and Open Information Extraction. Additionally, a fictional text corpus consisting of two parts, novels and subtitles, is presented. Given a baseline text segmentation algorithm, the effect of replacing word tokens with word embeddings is examined. Subsequently, neural Open Information Extraction is applied to the corpus, and the information contained in the extractions is transformed into word token weights used on top of the baseline text segmentation algorithm. The evaluation shows that applying the pipeline to the corpus increased the performance for more than half of the novels and less than half of the subtitle files in comparison to the baseline text segmentation algorithm. Similar results are observed in a preliminary step in which word tokens were substituted by their word embedding representations. Taking into account the complex structural features of the corpus, this work demonstrates that text segmentation may benefit from incorporating knowledge provided by an Open Information Extraction system.

Portable Document Format (PDF) is one of the most commonly used file formats. Many current PDF viewers support copy-and-paste for ordinary text, but not for mathematical expressions, which appear frequently in scientific documents. If one were able to extract a mathematical expression and convert it into another format, such as LaTeX or MathML, the information contained in this expression would become accessible to a wide array of applications, for instance screen readers. An important step towards this goal is finding the precise location of mathematical expressions, since this is the only unsolved step in the formula extraction pipeline. Accurately performing this crucial step is the main objective of this thesis. Unlike previous research, we use a novel whitespace analysis technique to demarcate coherent regions within a PDF page. We then use the identified regions to compute carefully selected features from two sources: the grayscale matrix of the rendered PDF file and the list of objects within the parsed PDF file. The computed features can be used as input for various classifiers based on machine learning techniques. In our experiments we contrast four different variants of our method, where each uses a different machine learning algorithm for classification. Further, we also aim to compare our approach with three state-of-the-art formula detectors. However, the low reproducibility of these three methods, combined with logical inconsistencies in their documentation, greatly complicated a faithful comparison with our method, leaving the true state of the art unclear and warranting further research.

This thesis presents a novel way of creating grid-based word puzzles, named the AI Cruciverbalist. These word puzzles have a large fan base of recreational players and are widespread in education. The puzzle creation process, an NP-hard problem, is not an effortless task, and even though some algorithms exist, manual puzzle creation has achieved the best results so far. Since new technologies have arisen, especially in the field of data science and machine learning, the time had come to evaluate new possibilities, replace existing algorithms and improve the quality and performance of puzzle generation. In particular, neural networks and constraint programming were evaluated for feasibility, and the results were compared. The black box of a trained model makes it hard to ensure positive results, and due to the impossibility of modelling some requirements and constraints, neural networks are rated unsuitable for puzzle generation. The significance of correct values in puzzle fields, the approximative nature of neural networks, and the need for an extensive training set additionally make neural networks impractical. On the other hand, precisely modelling the requirements as a constraint satisfaction problem has been shown to create excellent results, finding an exact solution if one exists. The results achieved with the constraint programming approach are rated as successful by domain experts, and the algorithm has been successfully integrated into an existing puzzle generator software for use in production.
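
To make the constraint-satisfaction formulation concrete, here is a minimal backtracking sketch that fills a tiny 2x3 grid from a word list under crossing constraints; the grid layout, the slot definitions and the word list are invented for illustration and are not the constraint model used in the thesis.

```python
# A minimal constraint-satisfaction sketch for grid-based puzzle filling,
# using plain backtracking over word "slots".
WORDS = ["cat", "are", "car", "ca", "ar", "te", "at", "to"]

# Each slot is the list of (row, col) cells it covers in a 2x3 grid.
SLOTS = [
    [(0, 0), (0, 1), (0, 2)],   # across, row 0
    [(1, 0), (1, 1), (1, 2)],   # across, row 1
    [(0, 0), (1, 0)],           # down, col 0
    [(0, 1), (1, 1)],           # down, col 1
    [(0, 2), (1, 2)],           # down, col 2
]

def fits(word, cells, grid):
    """Check length and consistency with letters already placed."""
    if len(word) != len(cells):
        return False
    return all(grid.get(c) in (None, ch) for c, ch in zip(cells, word))

def solve(slot_idx, grid, used):
    if slot_idx == len(SLOTS):
        return grid
    cells = SLOTS[slot_idx]
    for word in WORDS:
        if word in used or not fits(word, cells, grid):
            continue
        placed = {c: ch for c, ch in zip(cells, word) if grid.get(c) is None}
        grid.update(placed)
        used.add(word)
        result = solve(slot_idx + 1, grid, used)
        if result is not None:
            return result
        # Backtrack: undo only the cells this word newly placed
        used.discard(word)
        for c in placed:
            del grid[c]
    return None

print(solve(0, {}, set()))
```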

People use different styles of writing according to their personalities. These distinctions can be used to find out who wrote an unknown text, given some texts of known authorship. Many different aspects of a text and its writing style can serve as features for this task. The focus of this thesis lies on topic-agnostic phrases that authors use mostly unconsciously. Two methods to extract these phrases from texts of authors are proposed, which work for different types of input data. The first method uses n-gram tf-idf calculations to weight phrases, while the second method detects them using sequential pattern mining algorithms. The text data set used is gathered from a source of unstructured text covering a plethora of topics, the online forum Reddit. The first of the two proposed methods achieves average F1-scores (correct author predictions) per section of the data set ranging from 0.961 to 0.92 within the same topic and from 0.817 to 0.731 when different topics were used for attribution testing. The second method scores in the range from 0.652 to 0.073, depending on configuration parameters. Given the massive amount of content created on such platforms today, using a data set like this and features that work for authorship attribution on texts of this nature is worth exploring. Since these phrases have been shown to work for specific configurations, they can now be used as a viable option or in addition to other commonly used features.
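
A small sketch of the n-gram tf-idf idea with scikit-learn, using an invented toy corpus and a linear SVM; the n-gram range, classifier choice and texts are illustrative assumptions rather than the configurations evaluated in the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny toy corpus: texts with known authorship (labels are author names).
train_texts = [
    "to be honest I think this is fine as far as I can tell",
    "as far as I can tell the whole thing is fine to be honest",
    "in my humble opinion the result speaks for itself at the end of the day",
    "at the end of the day the result is what it is in my humble opinion",
]
train_authors = ["alice", "alice", "bob", "bob"]

# Word n-grams (1-3) weighted by tf-idf capture short recurring phrases.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), analyzer="word"),
    LinearSVC(),
)
model.fit(train_texts, train_authors)

print(model.predict(["to be honest I can tell this is fine"]))
```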

Political debates today are increasingly being held online, through social media and other channels. In times of Donald Trump, the American president, who mostly announces his messages via Twitter, it is important to clearly separate facts from falsehoods. Although there is an almost infinite amount of information online, tools such as recommender systems, filters and search encourage the formation of so-called filter bubbles. People who have similar opinions on polarizing topics group themselves and block other, challenging opinions. This leads to a deterioration of the general debate, as false facts are difficult to disprove within these groups. With this thesis, we want to provide an approach on how to propose different opinions to users in order to increase the diversity of viewpoints regarding a political topic. We classify users into a political spectrum, either pro-Trump or contra-Trump, and then suggest Tweets from the other spectrum. We then measure the impact of this process on diversity and serendipity. Our results show that the diversity and serendipity of the recommendations can be increased by including opinions from the other political spectrum. In doing so, we want to contribute to improving the overall discussion and reducing the formation of groups that tend to be radical in extreme cases.

This thesis deals with the application of data mining algorithms for information gain in software support. Data mining algorithms are tools of so-called knowledge discovery, the interactive and iterative discovery of useful knowledge. They are used to analyse data and to find valuable information about a domain via statistical models. The domain in this thesis is software support, the department in software development companies that assists customers in solving problems. Most of these support departments are organized as call centres and additionally work with ticket systems (an e-mail-based communication system). The purpose of this thesis is to examine to what extent data mining algorithms can be applied in software support and whether valuable information can actually be identified. The expectation is to discover information about the support behaviour of customers as well as the influence of external factors such as weather, public holidays and vacation periods. The literature review of this thesis covers, among other things, workforce planning in software support and data science (an umbrella term for data mining, data engineering, data-driven decision making, etc.). The experimental setup consists of interviews on the status quo and key figures in software support with leading Austrian software companies, as well as a case study on the application of a data mining process model. Finally, a field experiment examines whether it is actually possible to discover information for software support with data mining algorithms. The results of this thesis include, on the one hand, the identification of opportunities to save costs and gain efficiency in support and, on the other hand, the discovery of valuable information about processes and relationships in support. The information gained can subsequently be fed into the support process to create more effective and efficient processes. A further result of the information gain is an improvement in the quality of management decisions.

Due to a rapid increase in the development of information technology, adding computing power to everyday objects has become a major discipline of computer science, known as "The Internet of Things". Smart environments such as smart homes are networks of connected devices with sensors attached to detect what is going on inside the house and what actions can be taken automatically to assist the resident. In this thesis, artificial intelligence algorithms to classify human activities of daily living (having breakfast, playing video games, etc.) are investigated. The problem is a time series classification task for sensor-based human activity recognition. In total, nine different standard machine learning algorithms (support vector machine, logistic regression, decision trees, etc.) and three deep learning models (multilayer perceptron, long short-term memory neural network, convolutional neural network) were compared. The algorithms were trained and tested on the UCAmI Cup 2018 data set, consisting of sensor inputs captured in a smart lab over ten days. The data set contains sensor data from four different sources: an intelligent floor, proximity and binary sensors, and acceleration data from a smart watch. The multilayer perceptron reached a testing accuracy of 50.31%. The long short-term memory neural network showed an accuracy of 57.41% (+/-13.4) and the convolutional neural network 70.06% (+/-2.3) on average, resulting in only slightly higher scores than the best standard algorithm, logistic regression, with 65.63%. To sum up the observations of this thesis, deep learning is indeed suitable for human activity recognition. However, the convolutional neural network did not significantly outperform the best standard machine learning algorithm on this particular data set. Unexpectedly, the long short-term memory neural network and the basic multilayer perceptron performed poorly. The key drawback of finding a fitting machine learning algorithm for a problem such as the one presented in this thesis is that there is no trivial solution. Experiments have to be conducted to empirically evaluate which technique and which hyperparameters yield the best results. Thus the results found in this thesis are valuable for other researchers to build on and to develop further approaches based on the new insights.
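
The comparison of standard classifiers can be sketched as below, assuming windowed sensor data has already been turned into fixed-length feature vectors; the random toy data, the feature dimensionality and the three chosen models are illustrative stand-ins, not the UCAmI Cup setup.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Toy stand-in for windowed sensor features: 300 windows, 20 features each,
# labelled with one of 4 activity classes.
X = rng.normal(size=(300, 20))
y = rng.integers(0, 4, size=300)
X[y == 1] += 0.8  # make one class slightly separable

models = {
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
}

# 5-fold cross-validated accuracy, analogous to comparing standard algorithms
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```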

The automatic classification of audio samples to an abstraction of the recording location (e.g., Park, Public Square, etc.), denoted as Acoustic Scene Classification (ASC), represents an active field of research, popularized, inter alia, as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. In contrast, we are concerned with automatically assigning audio samples directly to their location of origin, i.e., to the location where the corresponding audio sample was recorded, which we denote as Acoustic Location Classification (ALC). Evidence for the feasibility of ALC contributes a supplementary challenge for acoustics-based Artificial Intelligence (AI) and enhances the capabilities of location-dependent applications in terms of context-aware computing. Thus, we established a client-server infrastructure with an Android application as recording solution, and propose a dataset which provides audio samples recorded at different locations on multiple consecutive dates. Based on this dataset, and on the dataset proposed for the DCASE 2019 ASC challenge, we evaluated ALC alongside ASC, with a special focus on constraining training and test sets temporally and locally, respectively, to ensure reasonable generalization estimates with respect to the underlying Convolutional Neural Network (CNN). As indicated by our outcomes, ALC constitutes a comprehensive challenge, resulting in decent classification estimates, and hence motivates further research. However, increasing the number of samples within the proposed dataset, thus providing daily recordings over a comparatively long period of time, e.g., several weeks or months, seems necessary to investigate the practicality and limitations of ALC to a sufficient degree.

Community detection is an essential tool for the analysis of complex social, biological and information networks. Among the numerous community detection algorithms published to date, Infomap is a prominent and well-established framework. In this master's thesis we present a new method for community detection inspired by Infomap. Infomap takes an analytical approach to the community detection problem by minimizing the expected description length of a random walk on a network. In contrast, our method minimizes the dissimilarity, quantified via the Kullback-Leibler divergence, between a graph-induced and a synthetic random walker in order to obtain a partition into communities. We therefore call our method Synthesizing Infomap. More specifically, we address community detection in undirected networks with non-overlapping communities and two-level hierarchies. In this thesis we present a formalization as well as a detailed derivation of the Synthesizing Infomap objective function. By applying Synthesizing Infomap to a set of standard graphs, we explore its properties and qualitative behaviour. Our experiments on artificially generated benchmark networks show that Synthesizing Infomap outperforms its original counterpart in terms of adjusted mutual information on networks with weak community structure. Both methods behave comparably when applied to a selection of real-world networks, indicating that Synthesizing Infomap also delivers meaningful results in practical applications. The promising results of Synthesizing Infomap motivate a more extensive evaluation on real-world networks, as well as possible extensions to multi-level hierarchies and overlapping communities.
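
For reference, the Kullback-Leibler divergence mentioned above, here between a graph-induced walker distribution P and a synthetic walker distribution Q, has the standard form below; the exact Synthesizing Infomap objective derived in the thesis is not reproduced here.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
```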

As the complexity of a software project rises, it can become difficult to add new features. In addition to maintainability, other quality attributes such as reliability and usability may suffer from the increased complexity. To prevent complexity from becoming an overwhelming issue we use principles of good programming and resort to well-known software architectures. We often do so by choosing to use specific frameworks. However, we can only subjectively judge whether or not the usage of a specific framework resulted in less perceived complexity and an improvement in other quality attributes. In our work, we investigated the applicability of existing software measurements for measuring desired quality attributes and their applicability for framework comparison. We chose a set of quantitative software measurements which are aimed at specific quality attributes, namely maintainability and flexibility. Additionally, we used well-established software measurements such as McCabe's Cyclomatic Complexity [44] and Halstead's metrics [32] to measure the complexity of a software system. By developing the same application using two different web frameworks, namely ReactJS and Laravel, over a set of predefined 'sprints', each containing a specific set of features, we were able to investigate the evolution of different software measurements. Our results show that some of the measurements are more applicable to the chosen frameworks than others. Especially measurements aimed at quantitative attributes of the code, such as the coupling measures by Martin [43] and the Cyclomatic Complexity by McCabe [44], proved particularly useful, as there is a clear connection between the results of the measurements and attributes of the code. However, there is still a need for additional work which focuses on defining the exact scale each of the measurements operates on, as well as a need for the development of tools which can be used to seamlessly integrate software measurements into existing software projects.
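
As a rough illustration of cyclomatic complexity, the sketch below approximates McCabe's metric by counting decision points in Python source via the standard ast module; this is a simplified approximation applied to Python, not the tooling or the ReactJS/Laravel code analysed in the thesis.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe's cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # each additional boolean operand adds a branch
            decisions += len(node.values) - 1
    return 1 + decisions

sample = """
def classify(x, y):
    if x > 0 and y > 0:
        return "both"
    for i in range(3):
        if i == x:
            return "match"
    return "none"
"""
print(cyclomatic_complexity(sample))  # 1 + (if, and, for, if) = 5
```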

Traffic accident prediction has been a hot research topic in the last decades. With the rise of big data, machine learning, deep learning and the real-time availability of traffic flow data, this research field becomes more and more interesting. In this thesis, different data sources such as traffic flow, weather, population and the crash data set from the city of Graz are collected over three years, between 01.01.2015 and 31.12.2017. In this period 5,416 accidents, recorded by Austrian police officers, happened. Furthermore, these data sets are matched to two different spatial road networks. Besides feature engineering and crash likelihood prediction, different imputation strategies are applied for missing values in the data sets; missing value prediction for traffic flow measurements is a particularly big topic. To tackle the class imbalance between crash and no-crash samples, an informative sampling strategy is applied. Once the inference model is trained, the crash likelihood for a given street link at a certain hour of the day can be estimated. Experimental results reveal the efficiency of the Gradient Boosting approach when incorporating these data sources. Especially the different districts of Graz and street graph related features, like centrality measures and the number of road lanes, play an important role. In contrast, including traffic flow measurements as pointwise explanatory variables did not lead to a more accurate prediction.
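
A minimal sketch of gradient boosting with majority-class undersampling for an imbalanced crash/no-crash problem; the synthetic features, the 3:1 undersampling ratio and the plain random sampling stand in for the engineered features and the informative sampling strategy described above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(7)

# Toy stand-in for street-link/hour samples: 10 engineered features,
# heavily imbalanced labels (1 = crash, 0 = no crash).
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=5000) > 3.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Undersample the majority (no-crash) class before training
crash_idx = np.where(y_train == 1)[0]
no_crash_idx = rng.choice(np.where(y_train == 0)[0],
                          size=3 * len(crash_idx), replace=False)
balanced = np.concatenate([crash_idx, no_crash_idx])

model = GradientBoostingClassifier()
model.fit(X_train[balanced], y_train[balanced])

print(classification_report(y_test, model.predict(X_test)))
```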

The entry point of this master's thesis is the context-based Web Information Agent Back to the Future Search (bttfs), which was developed with the goal of shortening the period of vocational adjustment while working on different projects at once, as well as providing different functionalities for finding and re-finding relevant sources of information. bttfs supports the learning of a context-based user profile in two different ways. The first way is to learn the user profile by applying a cosine-distance function to the Term Frequency-Inverse Document Frequency (tf-idf) document vectors, and the second approach is to learn the user profile with a one-class Support Vector Machine (svm). Furthermore, the Information Retrieval methods Best Matching 25 (bm25), Term Frequency (tf), and tf-idf are used on the created model to determine the most relevant search queries for the user's context. The central question answered in this thesis is: "Is it possible to anticipate a user's future information need by exploiting the past browsing behavior regarding a defined context of information need?" To answer this question the methods above were applied to the AOL dataset, a collection of query logs that consists of roughly 500,000 anonymous user sessions. The evaluation showed that a combination of the cosine-distance learning function and the tf weighting function yielded promising results, ranging between an 18.22% and 19.85% matching rate on average for the first three single-word queries that appeared in advancing order on the timeline of the user actions. While the difference in performance between the cosine-distance method and the svm method appeared to be insignificant, tf and tf-idf outperformed bm25 in both of the tested scenarios. Based on these results, it can be stated that the future information need of a particular user can be derived from prior browsing behavior in many cases, as long as the context of the information need remains the same. Therefore, there are scenarios in which systems like bttfs can aid and accelerate the user's information generation process by providing automated context-based queries.
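
A minimal sketch of the tf-idf profile plus cosine-similarity idea, assuming a toy browsing history and a hand-picked set of candidate queries; the averaging of document vectors into a profile is an illustrative simplification, not the bttfs implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy browsing history for one context and a few candidate queries.
history = [
    "introduction to information retrieval ranking functions",
    "bm25 versus tf idf for document ranking",
    "evaluating search result quality with query logs",
]
candidate_queries = ["ranking functions", "cooking recipes", "query log evaluation"]

# Build a tf-idf user profile from the browsing history (mean of document vectors)
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(history)
profile = np.asarray(doc_vectors.mean(axis=0))

# Score candidate queries by cosine similarity to the profile
query_vectors = vectorizer.transform(candidate_queries)
scores = cosine_similarity(profile, query_vectors)[0]
for query, score in sorted(zip(candidate_queries, scores), key=lambda t: -t[1]):
    print(f"{score:.3f}  {query}")
```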

In this thesis, we present a system to recognise naturally appearing gestures using a self-built smartglove prototype. We explain the nature of gestures and the anatomy of the human arm, and go into the theory of gesture recognition. A user study serves as the basis of a data-driven approach to gesture recognition, in which all possible features from human activity recognition are generated and automatic methods to select a good set of features are explored. We extend this approach even further with a novel algorithm for selecting sensors for a specific target system: Recursive Sensor Elimination (RSE) selects sensors recursively, using a heuristic function to find the best configuration for a given subset of gestures. We explain the use cases, the details of the RSE algorithm and first experimental results. A smartwatch experiment shows the problems that arise when applying the insights of this work to consumer hardware and which design decisions have to be made. Within this experiment, we present a possible method to augment IMU time series data, provided the labels are not corrupted, by speeding up or slowing down the time series and adding some noise. With this, it is possible to train a simple system that allows steering, e.g., a slide set with your watch.
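
The sketch below shows one plausible form of recursive sensor elimination: repeatedly drop the sensor whose removal hurts a cross-validated score least. The toy data, the per-sensor feature layout and the accuracy-based heuristic are assumptions for illustration; the thesis's RSE heuristic may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Toy data: 4 sensors, 5 features each (20 columns), 3 gesture classes.
n_sensors, feats_per_sensor = 4, 5
X = rng.normal(size=(240, n_sensors * feats_per_sensor))
y = rng.integers(0, 3, size=240)
X[y == 2, :feats_per_sensor] += 1.0  # sensor 0 is informative for class 2

def score(sensors):
    """Heuristic: cross-validated accuracy using only the given sensors' features."""
    cols = [s * feats_per_sensor + f for s in sensors for f in range(feats_per_sensor)]
    return cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                           X[:, cols], y, cv=3).mean()

# Recursive elimination: repeatedly drop the sensor whose removal hurts least.
remaining = list(range(n_sensors))
while len(remaining) > 1:
    candidates = [(score([s for s in remaining if s != drop]), drop) for drop in remaining]
    best_score, drop = max(candidates)
    print(f"dropping sensor {drop}, score without it: {best_score:.3f}")
    remaining.remove(drop)

print("last remaining sensor:", remaining)
```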

Propaganda is one of the biggest problems in the modern world because it provokes conflicts which can lead to a great loss of human life. The annexation of Crimea and the following conflict in Eastern Ukraine is a prime example: this conflict led to thousands of lost lives and millions of displaced people. The lack of research on the topic of unsupervised propaganda detection led us to devise methods for analysing propaganda that do not rely on fact checking or make use of a dedicated ground truth. Instead, we base our measures on a set of guiding principles that constitute the intention of propagandist authors. For each of these principles we propose techniques from the fields of Natural Language Processing and Machine Learning. We have chosen the Russian military intervention in Ukraine as our focus, and the Russian News and Information Agency as our data source. We found the representation of Ukraine to be remarkably different from that of other countries, hinting that the principles of propaganda might be applicable in this case. Our quantitative analysis paves the way to a more in-depth qualitative analysis.

With the increasing development of technology nowadays, a diverse number of possibilities have arisen, but new challenges come into play too. These developments have made it possible to move towards Industry 4.0 and the so-called Smart Factories, the new manufacturing paradigm where everything is supposed to be connected. This can have a big impact, for example in supporting decision making, in shortening the production life-cycle or in enabling highly customizable product manufacturing, which can be achieved by making use of the right data. The data that flows within a Smart Factory can be of an enormous volume, is heterogeneous and does not come from a single data source. However, the systems have to bring the created data into play somehow. The challenge here is to transform the created Big Data into the more valuable Smart Data, so that later in the process, analytics like Predictive Maintenance or Retrospective Analysis can be performed successfully on those data. This is also the aim of this Master's Thesis. In order to solve this problem, a prototype service called Smart Data Service has been developed, so that the raw incoming data streams are aggregated and put together in a reduced but more valuable format, known as Smart Data. For testing purposes and the evaluation of the work, it was necessary to additionally develop a Smart Factory Simulator, which emulates different scenarios of a manufacturing setup. Two use cases have been taken into consideration for evaluating the Smart Data Service: aggregating data that would be useful for applying Retrospective Analysis, and aggregating data that would be useful for Predictive Maintenance. Finally, the results show that the aggregated Smart Data can have considerable value for performing Retrospective Analysis as well as Predictive Maintenance.

The modern economy heavily relies on data as a resource for advancement and growth. A huge amount of data is produced continuously, and only a fraction of it is handled properly and efficiently. Data marketplaces are increasingly gaining attention. They provide possibilities to exchange, trade and access different kinds of datasets across organizations, between interested data providers and data buyers. Data marketplaces need a stable and efficient infrastructure for their operations, and a suitable business model in order to provide and gain value. Due to the rapid development of the field and its recent surge in popularity, the research on business models of data marketplaces is fragmented. This thesis aims to address the issue by identifying dimensions and characteristics of data marketplaces, which outline the characteristics of their business models. Following a rigorous taxonomy-building process, a business model taxonomy for data marketplaces is proposed. Using the evidence from a final sample of twenty available data marketplaces, the frequency of characteristics of data marketplaces is analyzed. In addition, four data marketplace business model archetypes are identified. The findings reveal the impact of the structure of data marketplaces as well as the relevance of infrastructure, regulations and the handling of security issues for the identified business model archetypes. Therefore, this study contributes to the growing body of literature on digital business strategies.

The automotive industry is undergoing significant changes due to technological developments such as autonomous driving and the electrification of the powertrain. Accompanying these changes is a marked growth in generated data, produced in all phases of the automotive value chain. The goal of many companies is to exploit this available data economically. The two most important options for doing so are data-based revenue increases, which include, for example, the sale of data or the offering of data-based services, and cost reduction based on the knowledge generated from existing data. The large economic potential predicted by various companies and institutions, including McKinsey (2016c, p.7ff), prompts companies from different business areas to become active in this field. In addition to the conventional players in the automotive industry, such as OEMs and engineering service providers, new market participants such as IT companies and start-ups are trying to gain a foothold in the automotive data business. The aim of this thesis is to identify a selection of engineering service providers, IT companies and start-ups relevant to AVL, to analyse their market offering of data-based services, products, platforms and other data-based activities, such as research, cooperations or acquisitions, and to interpret the results. The selection of companies to be analysed is based on rankings identifying the engineering service providers with the highest revenues in the automotive industry as well as the IT companies with the highest revenues in the German automotive industry. Relevant start-ups were determined with the help of a start-up query by the company Innospot. Companies from these three groups were analysed on the basis of publicly available information. Relevant information regarding data-based services, products and other data-based activities was categorized using clusters and recorded together with additional information. In this thesis, a cluster can be understood as a topic area, such as "autonomous driving" or "testing". The evaluation of the data obtained through the analysis led to a variety of results. Through the clustering method, the areas of activity of the companies, as well as those areas in which no activity was detected, were determined. A comparison of the areas of activity of the analysed companies with those of AVL identifies companies according to their cluster overlap with AVL. Those clusters in which no AVL activity could be detected were subjected to a separate analysis in order to identify companies that are active in these areas. A further analysis shows the activity of the analysed company groups across the phases of the automotive value chain: engineering service providers are active in the development, validation, production and after-sales phases; IT companies focus on production and after-sales; start-ups focus mainly on the after-sales area.
This thesis also addresses the question of whether engineering service providers and IT companies work on the same data-based topics or whether a clear differentiation is possible. To answer this question, a competitive landscape was created that depicts the current position of previously defined engineering service providers, IT companies and start-ups. Larger engineering service providers in particular, which are active in many clusters, are increasingly also active in IT areas.

The subject area of automated Information Extraction from PDF documents is highly relevant, since the PDF standard is still one of the most popular document formats for information representation and exchange. There is no structuring blueprint for PDF documents, which makes automated information gathering a complex task. Since tables are structuring elements with a very high information density, the field of Table Detection is highly relevant in the context of Information Extraction. Due to the high variety of formats and layouts, it is hard to choose the tool that is optimally suited for every specific scenario. In this thesis, the added value of techniques used to identify table structures in scanned PDF documents is evaluated. To this end, two algorithms were implemented to allow an objective comparison of Table Extraction applied to different types of PDF documents. While the algorithm developed to treat native PDFs is based on heuristics, the second approach relies on deep-learning techniques. The evaluation of both implementations showed that the heuristic approach performs very well in detecting tables. However, it shows weaknesses in distinguishing non-tabular areas that resemble table structures from tabular areas; therefore, the Recall metric shows better results than the Precision for the heuristic method. When applying Table Detection to scanned PDFs using the second approach, the low number of False Positives, and therefore the superior Precision value compared to the first approach, is notable. On the other hand, as a trade-off for the high Precision, the number of undetected tables results in a lower Recall for single- as well as multi-column documents if partial detections are classified as correct results. Furthermore, limitations that reduce the detection ratio were identified. This concerns structures that share similarities with tables, like figures, formulas and pseudo-code. These limitations are particularly relevant for the heuristic and less so for the deep-learning based approach. All in all, there were several findings concerning advantages and disadvantages of applying Table Detection to scanned and native documents. Based on the evaluation results, strategies were elaborated for when to preferably use a specific approach depending on the document type, layout and structuring elements.

Forecasts in today's supply chains depend on more and more influencing factors, making it increasingly difficult to predict delivery times. For this reason, external systems often have to be queried, which is usually resource-intensive. The goals of this thesis are the development and introduction of a decision tree in order to eliminate the direct dependency on external services and to perform the prediction based on historical data. A data generator can produce synthetic as well as constant test data, allowing the performance of the developed decision tree to be tested. The tree itself distinguishes between decision questions and manual questions. Decision questions are defined entirely in the learning phase based on the parameter objects, whereas manual questions are programmed in advance. Decision making is based on the principle of creating as few levels as possible. The simplification of the tree is achieved by means of mathematical operations and statistical tools, such as ignoring improbable outcomes. This thesis shows that it is possible to use a NoSQL database for storing decision models. Furthermore, it demonstrates that predicting the delivery date in an online shop by means of a decision tree is possible.

Machine learning is widely used in the field of condensed matter, especially in conjunction with traditional quantum mechanical methods such as density functional theory (DFT). One possible application is learning the potential energy surface of solids for the prediction of crystal structures. In general, the efficiency and accuracy of machine learning depend on the available data, the learning algorithm and the data representation. The data representation is necessary to quantitatively capture relevant information about the system so that it can be processed by the learning algorithm. In this thesis we apply different machine learning methods to learn the internal energies of polymorphic mono-elemental crystal structures of carbon and boron, which were previously generated by crystal structure prediction. We investigate different learning algorithms and develop a physically motivated data representation that describes the crystal structure. We optimize and evaluate the performance of the learning algorithms on data sets containing relaxed and mixed, i.e., relaxed and unrelaxed, crystal structures. Our results show that kernel-based regression methods with the developed data representation provide accurate predictions of the energies of mixed crystal structures, comparable to quantum mechanical methods. With an obtained mean absolute error (MAE) of approximately 10 meV/atom, the developed method could replace expensive calculations required in costly crystal structure predictions.
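
As a generic illustration of kernel-based regression on structural descriptors, the sketch below fits kernel ridge regression with an RBF kernel to a synthetic target; the random 12-dimensional descriptor and the hyperparameters are placeholders, not the physically motivated representation developed in the thesis.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)

# Toy stand-in: 400 structures described by a 12-dimensional descriptor,
# with a smooth synthetic "energy" per atom as the regression target.
X = rng.normal(size=(400, 12))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.05, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Kernel ridge regression with an RBF kernel, as an example of a
# kernel-based regression method on structural descriptors.
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on held-out structures: {mae:.4f}")
```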

Thermal processes in the manufacturing industry involve highly optimized production equipment. In order to run the process, the equipment has to be maintained, replaced and adjusted in its settings regularly. This requires a certain amount of effort in terms of cost and time. The goal of this thesis was to propose an approach for further improving equipment efficiency based on data-driven methods. Initially, historical product and process data were collected, mapped and pre-processed. In order to train the selected machine learning algorithms, features were engineered and extracted. To ensure that the state of the equipment can be represented through the available data, several models were trained and evaluated. The presented heuristic approach dealt with the quality of the collected data and included a predictive maintenance model. This model was further analyzed to identify the parameters influencing the lifespan of the equipment. Besides the prediction of maintenance actions, a proposal to optimize the utilization of the equipment was presented. Based on the finding that the state of the equipment can be represented with the appropriate techniques, there appears to be potential for further improvement of the processes through data-driven models.

Dramatic tragedies with many deaths at major events in recent years have shown how important it is to develop a security solution to prevent such catastrophes. In the context of this master's thesis, a development concept for a mobile multi-sensor solution was developed, tested and evaluated to support safety and risk tasks at major events. After detailed hardware research, a first prototype was developed and tested at the Frequency Festival in St. Pölten. The impressions and results from this test were evaluated, and a second prototype was then developed, tested and subsequently evaluated. In addition to the detailed research on the various hardware components, Global Positioning System (GPS) and Inertial Measurement Unit (IMU) accuracy tests were conducted comparing professional sensors and smartphone sensors. Finally, a ready-to-use mobile multi-sensor solution was developed to support security and risk tasks at major events, designed to help security personnel at urban locations and major events and thereby avoid potentially dramatic tragedies.

Decision trees are one of the most intuitive models for decision making used in machine learning. However, the greedy nature of state-of-the-art decision tree building algorithms can lead to subpar results. This thesis aimed to use the non-greedy nature of reinforcement learning to overcome this limitation. The novel approach of using reinforcement learning to grow decision trees for classification tasks resulted in a new algorithm that is competitive with state-of-the-art methods and is able to produce optimal trees for simple problems requiring a non-greedy solution. We argue that it is well suited for data exploration purposes due to its diverse results and the direct influence on the trade-off between tree size and performance.

Whether it is a posting spreading hate about a group of people, a comment insulting another person or a status containing obscenities, such types of toxic content have become a common issue for many online platforms. Owners of platforms like blogs, forums or social networks are highly interested in detecting this negative content. The goal of this thesis is to evaluate the general suitability of convolutional neural networks (CNNs) for classifying toxicity in textual online comments. For this purpose, different CNN architectures are developed and their performance is compared to state-of-the-art methods on a data set containing comments from Wikipedia discussion pages. For a better understanding of this type of neural network, this thesis addresses three subquestions: a) Which patterns do CNNs learn and which features are important for the classification when applied to this task? b) Which preprocessing techniques are beneficial to the performance? c) Are CNNs well suited for comments from sources other than Wikipedia discussion pages? The evaluation showed a performance similar to other classifiers on the same data set. Moreover, the model showed a comparable performance on a second data set created for this thesis. The best single preprocessing technique in this work improved the F1 score from 0.636 to 0.645 compared to the baseline. An analysis of a trained model revealed that some patterns detected by the convolutional layer are interpretable by humans. The analysis of the influence of words on the prediction highlighted struggles with negations in the text and also revealed a severe bias included in the model.
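
A minimal sketch of a CNN text classifier for binary toxicity, assuming comments have already been integer-encoded and padded; the random toy data, vocabulary size and layer sizes are illustrative and do not correspond to the architectures evaluated in the thesis.

```python
import numpy as np
import tensorflow as tf

# Toy data: integer-encoded comments (vocab of 1000 tokens, padded to length 50)
# with binary toxicity labels.
rng = np.random.default_rng(0)
X = rng.integers(1, 1000, size=(200, 50))
y = rng.integers(0, 2, size=(200,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=32),
    tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

print(model.predict(X[:3], verbose=0))
```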

In order to meet the current trends and challenges in the industrial sector, production logistics is one of the focal points in the optimization of assembly systems. To increase the efficiency of the internal material supply, milk-run systems were introduced. The milk-run is responsible for the replenishment and transport of parts from the warehouse to the workplaces within a company and is part of the intralogistics system. The aim of this thesis was to digitize such a milk-run system with the help of an RFID system and to test it afterwards. In the course of this digitization, a software application was developed which simulates the complete production and logistics process of an assembly line. In order to test this simulation, a suitable institution had to be found where the digitized milk-run system could be implemented and tested to generate a meaningful comparative value for the simulator. With the IIM LEAD Factory, a suitable learning factory was found in which it was possible to implement the digitized milk-run system. The digitized milk-run system consists of an order management sub-system, which gives the logistics employee an overview of open orders and suggests where the parts to be picked are located on the shelf. The picking process is completed in connection with a pick-to-light system, which visually shows the employee exactly the compartment in the warehouse that is needed for the active order. In addition, the digitized milk-run system was enhanced by a route calculation, which makes it possible to find the most suitable path from the warehouse to the workplace. One of the tasks of the already mentioned simulator is to simulate real production in such a way that it is possible to suggest to the employee orders that would ideally be placed in the near future. In order to verify that these simulated orders are correct, it was important to compare them with real orders from the learning factory. The result was not only a fully functional digitized milk-run system, but also an evaluation of how well the digitized system works in comparison to the old system and how precise the results of the simulator are. With the completion of this project, a digitized milk-run system is available which has been tested and evaluated in a university institution.

Transport mode detection (TMD) is the process of recognizing the means of transportation (such as walking, cycling, driving, taking a bus, riding a metro, etc.) from a given sensory input. When this input consists exclusively of audio data, it is called acoustic TMD. This thesis researches and presents the methodology for creating datasets which fulfill all critical requirements for the highly complex task of acoustic TMD. It provides a step-by-step guideline on what needs to be considered when designing, producing and enhancing the dataset. In order to compile this guideline, a recording application was developed, a 9-class dataset with 245 hours of recordings was created, and experiments were run using this dataset. Those experiments aimed to shed light on the required number and diversity of recordings, the ideal number of total classes, the appropriate sample length, how to remove samples of low quality and which evaluation strategy should be used. Finally, existing external datasets were used to evaluate the classification capabilities. With the help of our findings, it should be easier for future projects to create their own acoustic datasets, especially for TMD.

Efficient siting of public charging infrastructure is critical for the economic success of the expansion and utilization of electromobility. The research questions posed by this thesis are, firstly, what are the key criteria for the siting of charging points (CP) today and, secondly, what characterizes optimal locations for future charging stations (CS) in Austria and Germany? To answer these research questions, a literature review was conducted to understand existing approaches to siting charging infrastructure and to identify tools and practices already in use. Secondly, nine expert interviews were held with planners, operators and promoters of charging infrastructure from Germany and Austria. How existing companies and official authorities plan and develop charging infrastructure is currently the subject of scientific research; various approaches and models exist, but they still require empirical and practical validation. The aim of the thesis is to ascertain whether a predefined procedure exists for the positioning of future charging infrastructure in public spaces, as well as to examine which quality criteria are the most important for siting charging infrastructure that is both profitable and customer-oriented in the future. To accomplish this, results from the interviews are contrasted with the current literature. Findings show that there is no predefined procedure for the positioning of charging infrastructure. However, there are criteria that are of particular relevance for efficient positioning. The aspects considered by both literature and experts to be most relevant in finding the right location for future charging infrastructure for EVs are: points of interest nearby, participation of society (demand-based positioning) and use case orientation (normal vs. fast charging). Once a CP is set up, there are three key parameters that define a profitable CP: high workload, high fluctuation and high energy turnover.

In most companies, business management software has become omnipresent in recent years. These systems have been introduced to streamline productivity and handle data in a more centralized fashion. While younger staff, who grew up with computers and smartphones, navigate newly introduced IT services with ease, it can be challenging for more mature employees to understand and efficiently use those systems. To increase the efficiency of usage, we propose the introduction of a chatbot to assist users in performing complex tasks. Users can achieve their goals by writing messages to the conversational system in natural language. In this work, we focus on the German language in order to deploy the chatbot to a mid-sized Austrian company. To build a meaningful and helpful chatbot, we first elaborate on the backgrounds of customer-relationship management (CRM) software, the general structure of conversations and related work regarding chatbots. With this information in mind, we outline useful features a chatbot for a German CRM software should exhibit. We evaluate existing Natural Language Processing (NLP) components for German and choose to implement a hybrid approach consisting of machine learning for intent classification and rule-based methods in a frame-based approach. After an evaluation period, we conducted a technical and an empirical evaluation. For the empirical evaluation, questionnaires were sent out to collect seven metrics. A major finding was that, while this system was text-based only, users wished for voice-based interaction to use the otherwise dead time when driving to and from the customer. The empirical evaluation also found users preferring a more rigid syntax over natural text, which reduces ambiguity for the chatbot and therefore improves conversation efficiency.
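
A minimal sketch of such a hybrid setup: a machine-learned intent classifier combined with a rule-based slot extractor for a frame-based dialogue. The toy German utterances, intent labels and the weekday rule are invented for illustration and are not the components used in the deployed chatbot.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy German training utterances with intent labels.
utterances = [
    "lege einen neuen kontakt an", "bitte neuen kunden anlegen",
    "zeige mir alle offenen angebote", "welche angebote sind offen",
    "erstelle einen termin am montag", "termin für dienstag eintragen",
]
intents = ["create_contact", "create_contact",
           "list_offers", "list_offers",
           "create_appointment", "create_appointment"]

# Machine-learned intent classifier
intent_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
intent_clf.fit(utterances, intents)

# Rule-based slot filling for the frame (weekday extraction)
WEEKDAYS = r"(montag|dienstag|mittwoch|donnerstag|freitag|samstag|sonntag)"

def parse(message: str):
    intent = intent_clf.predict([message])[0]
    slots = {"weekday": m.group(1)} if (m := re.search(WEEKDAYS, message.lower())) else {}
    return intent, slots

print(parse("bitte einen termin am freitag erstellen"))
```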

Nowadays more and more devices are being connected to the internet, and it is therefore important to provide a reliable bridge between them. Gathering and routing the data is the foundation for many different business processes and is therefore highly important. The goal of this thesis was to build a scalable infrastructure for sensor data that only uses open source components and is easy to use for users who provide sensor data. To make this system scalable, different container orchestrators were evaluated; the container orchestration tool Kubernetes was chosen as the basis. Additional system components for system maintenance were selected to improve maintainability. Further components include a load balancer, certificates for secure communication and monitoring. For the persistence of data, a solution was evaluated and included. The platform can be deployed to different IaaS providers via a Terraform script. The web UI for user and application management is written in Java and based on the high-performance web framework Vert.x; its performance was evaluated using current web frameworks as a reference point. Applications from categories such as data input, data output and data computation/processing can be consumed by users, with at least one reference application configured for every application category. For the data input category, available MQTT servers were tested with regard to performance and the most suitable server solution was selected. The data output layer was evaluated and the best databases were used. For the data computation layer, a HSTM based computational intelligence library was selected to showcase inter-connectivity between the components. The framework is extensible to include new applications that provide additional functionality to the users of the system. The system was tested in full action with two sensor types for input and output. Additional hardware sensors can be included by providing a template and base values; code can then be uploaded to these sensors based on the values the user provided. Thus the developed system allows and facilitates the setup of a full-blown scalable sensor data framework on multiple cloud providers.

The problem of information overload is widely recognized today. Living in an information society, we are all affected by the increasing amounts of information becoming available every day. The impact of this phenomenon shows itself in several information-related tasks, such as conducting a literature search, by making it difficult for people to find information relevant to their interests. In this work, we develop a recommender system capable of providing relevant literature recommendations for a pending citation in a scientific paper. We employ a content-based recommendation approach based on information retrieval techniques. The input to our system consists of the citation context around the pending citation while the output comprises a ranked list of documents serving as citation candidates. Within our experimental setup, we experiment with different query formulation strategies and retrieval models in order to improve the performance of the system. The evaluation of our system shows the potential of this approach, reaching a peak MRR of 0.416. This is further emphasized by the results gained from our contribution to the CL-SciSumm Shared Task 2017 where we achieve top results among all participating systems.
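
For reference, the Mean Reciprocal Rank (MRR) metric reported above can be computed as sketched below; the document IDs and rankings are made-up toy values.

```python
def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """MRR over queries: average of 1/rank of the first relevant candidate."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        rr = 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# Three pending citations: the ranked candidates and the actually cited document
rankings = [["d3", "d7", "d1"], ["d2", "d5", "d9"], ["d8", "d4", "d6"]]
correct = ["d7", "d2", "d6"]
print(mean_reciprocal_rank(rankings, correct))  # (1/2 + 1/1 + 1/3) / 3 ≈ 0.611
```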

Anomaly detection is a common research topic in data science. Detecting anomalies that occur collectively in a sequence is useful for many applications such as intrusion or fault detection. In this thesis, I developed a parameter-free solution for detecting collective anomalies in sequential data based on stationarity and volatility estimation (STAVE). The STAVE algorithm extracts subsequences of a full sequence with a sliding window and clusters them according to a stationarity and volatility distance function. Collective anomalies are then detected by extracting the longest connected sequence within the smallest cluster. In a practical evaluation, STAVE achieved results comparable to commonly used parametric alternatives, while retaining low computational complexity and requiring no input other than the sequence to be investigated.
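
A loose sketch of the sliding-window idea described above: windows are described by a simple mean-drift (stationarity proxy) and a standard deviation (volatility), clustered, and the longest connected run in the smallest cluster is reported. The synthetic burst, the window parameters, the two descriptors and the use of k-means are illustrative assumptions and not STAVE's exact distance function or clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic sequence with a collective anomaly: a high-volatility burst in the middle
sequence = rng.normal(scale=0.5, size=600)
sequence[280:330] += rng.normal(scale=3.0, size=50)

# Sliding windows and simple stationarity/volatility descriptors per window
window, step = 30, 5
starts = list(range(0, len(sequence) - window, step))
features = np.array([
    [abs(sequence[s:s + window][:window // 2].mean()
         - sequence[s:s + window][window // 2:].mean()),   # mean drift (stationarity proxy)
     sequence[s:s + window].std()]                          # volatility
    for s in starts
])

# Cluster windows and take the smallest cluster as the anomalous one
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
smallest = np.argmin(np.bincount(labels))

# Longest connected run of windows belonging to the smallest cluster
best, current, best_end = 0, 0, 0
for i, lab in enumerate(labels):
    current = current + 1 if lab == smallest else 0
    if current > best:
        best, best_end = current, i
anomalous_starts = starts[best_end - best + 1:best_end + 1]
print(f"anomalous region roughly spans samples {anomalous_starts[0]}..{anomalous_starts[-1] + window}")
```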

Systems that extract information from natural language texts usually need to consider language-dependent aspects like vocabulary and grammar. Compared to the development of individual systems for different languages, the development of multilingual information extraction (IE) systems has the potential to reduce cost and effort. One path towards IE from different languages is to port an IE system from one language to another. PropsDE is an open IE (OIE) system that has been ported from the English system PropS to the German language. Only few OIE methods are available for German. Our goal is to develop a neural network that mimics the rules of an existing rule-based OIE system. For that, we need to learn about OIE from German text. By performing an analysis and a comparison of the rule-based systems PropS and PropsDE, we can observe a step towards multilinguality and learn about German OIE. We then present a deep-learning based OIE system for German which mimics the behaviour of PropsDE. The precision in directly imitating PropsDE is 28.1%. Our model produces many extractions that appear promising but are not fully correct.

Feature selection has become an important focus in machine learning. Especially in the area of text classification, using n-gram language models leads to high-dimensional datasets. In this thesis we propose a new method of dimensionality reduction. Starting with a small subset of features, an iterative forward selection method is performed to extend our feature space. The main idea is to interpret the results from a trained classifier in order to determine feature importance. Our experimental results over various classification algorithms show that with this approach it is possible to improve prediction performance over other state-of-the-art dimension reduction methods, while providing a more cost-effective feature space.
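One plausible reading of this procedure is sketched below: start from a small feature subset, repeatedly train a classifier on the current subset plus a batch of candidate features, and keep the candidates the model deems most important. A random forest's feature_importances_ serves here as the "interpretation" of the trained classifier; the thesis's exact criterion may differ.

```python
# Sketch of an iterative forward selection driven by classifier feature
# importance; the random forest's feature_importances_ is an assumption,
# standing in for the thesis's interpretation of the trained classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=200, n_informative=20, random_state=0)

selected = list(range(5))                 # small starting subset
remaining = [f for f in range(X.shape[1]) if f not in selected]

for _ in range(10):                       # grow the feature space in rounds
    batch = remaining[:50]                # candidate features for this round
    cols = selected + batch
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, cols], y)
    imp = clf.feature_importances_[len(selected):]        # importance of candidates only
    keep = [batch[i] for i in np.argsort(imp)[::-1][:5]]  # add the 5 strongest candidates
    selected += keep
    remaining = [f for f in remaining if f not in keep]

print("selected features:", sorted(selected))
```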

For various stakeholders such as operators, stewards, law enforcement, etc., capturing and presenting crowd flows and local densities on the grounds of a large-scale event is of great importance. To achieve this goal, a multi-sensor data fusion framework is built that feeds a model of the visitor population on a defined event site. The use of different types of sensors (Bluetooth scanners, counting sensors, video, and GSM cell information) yields statements about person counts over different spatial extents and with differing reliability. After determining the reliability of each sensor, counts can be produced for the covered areas of the site. Overlapping areas are counted with higher accuracy by means of data fusion. To make statements about areas of the site that are not directly covered, a simple world model is employed, which draws its information from the counts of the monitored areas as well as from the modelled behaviour of event visitors.

This thesis demonstrates the potential and benefits of unsupervised learning with Self-Organizing Maps for stress detection in laboratory and free-living environments. The general increase in the pace of life, both in the personal and work environment, leads to an intensification of work, constant time pressure and pressure to excel. It can cause psychosocial problems and negative health outcomes. Providing personal information about one's stress level can counteract the adverse health effects of stress. Currently the most common way to detect stress is by means of questionnaires. This is time consuming, subjective and only captures discrete moments in time. Literature has shown that in a laboratory environment physiological signals can be used to detect stress in a continuous and objective way. Advances in wearable technology now make it feasible to continuously monitor physiological signals in daily life, allowing stress detection in a free-living environment. Ambulant stress detection is associated with several challenges. The data acquisition with wearables is less accurate compared to sensors used in a controlled environment, and physical activity influences the physiological signals. Furthermore, the validation of stress detection with questionnaires provides an unreliable labelling of the data as it is subjective and delayed. This thesis explores an unsupervised learning technique, the Self-Organizing Map (SOM), to avoid the use of subjective labels. The provided data set originated from stress-inducing experiments in a controlled environment and ambulant data measured during daily-life activities. Blood volume pulse (BVP), skin temperature (ST), galvanic skin response (GSR), electromyogram (EMG), respiration, electrocardiogram (ECG) and acceleration were measured using both wearable and static devices. First, supervised learning with Random Decision Forests (RDF) was applied to the laboratory data to provide a gold standard for the unsupervised learning outcomes. A classification accuracy of 83.04% was reached using ECG and GSR features and 76.89% using ECG features only. Then the feasibility of the SOMs was tested on the laboratory data and compared a posteriori with the objective labels. Using a subset of ECG features, the classification accuracy was 76.42%. This is similar to supervised learning with ECG features, indicating the principal functioning of the SOMs for stress detection. In the last phase of this thesis the SOM was applied to the ambulant data. Training the SOM with ECG features from the ambulant data enabled clustering of the feature space. The clusters were well separated with large cohesion (average silhouette coefficient of 0.49). Moreover, the clusters were similar over different test persons and days. According to the literature, the centre values of the features in each cluster can indicate stress and relax phases. By mapping test samples on the trained and clustered SOM, stress predictions were made. Comparison against the subjective stress levels was however poor, with a root mean squared error (RMSE) of 0.50. It is suggested to further explore the use of Self-Organizing Maps as the approach relies solely on the physiological data, excluding subjective labelling. Improvements can be made by applying multimodal feature sets, including for example GSR.
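As a rough illustration of the unsupervised step, the sketch below trains a Self-Organizing Map on a feature matrix and maps each sample to its best-matching unit, which can then be compared a posteriori with stress and relax phases. MiniSom is just one available SOM implementation, and the random numbers stand in for the ECG features used in the thesis.

```python
# Sketch of the unsupervised step: train a SOM on a feature matrix (random
# numbers standing in for ECG features) and inspect the winning node per
# sample. MiniSom is one common implementation, not prescribed by the thesis.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))          # placeholder ECG feature vectors

som = MiniSom(x=10, y=10, input_len=features.shape[1], sigma=1.0, learning_rate=0.5)
som.random_weights_init(features)
som.train_random(features, num_iteration=5000)

# Map each sample to its best-matching unit; clusters of units can then be
# compared against stress/relax phases a posteriori.
winners = [som.winner(f) for f in features]
print(winners[:5])
```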

The electric power industry is undergoing a transition. Both energy producers and grid operators are affected by the shift towards renewable energies. Higher costs for generation and transmission face regulated revenues. Maintenance costs are a significant cost factor. The question arises whether predictive analytics in general, and predictive maintenance in particular, are an option for reducing these costs while maintaining or improving reliability. After a review of the technological, economic and legal framework conditions, a narrative scenario is created using scenario techniques. This scenario serves to stimulate experts from different areas of the electric power industry, who are subsequently interviewed about their opinions. Even though legal concerns currently exist, there is agreement that predictive maintenance will come to the electric power industry. These changes are not limited to energy suppliers; supplier companies, service providers and customers will also be affected.

Question and answer (Q&A) systems are and will remain crucial in digital life. Famous Q&A systems succeeded by offering text, images and markup language as input possibilities. While this is sufficient for most questions, I think that this is not always the case for questions with a complex background. By implementing and evaluating a prototype of a domain-tailored Q&A tool I want to tackle the problem that formulating complex questions in text only, and subsequently finding them, can be a hard task. Testing several non-text input possibilities, including parsing standardized documents to populate metadata automatically and mixing exploratory and faceted search, should lead to a more satisfying user experience when creating and searching questions. Choosing the StarCraft II community ensures many questions with a complex background belonging to one domain. The evaluation results show that the implemented Q&A system, in the form of a website, can hardly be compared to existing ones without having big data. Regardless, users do see potential for the website to succeed within the community, which suggests that domain-tailored Q&A systems, where questions with metadata exist, can succeed in other fields of application as well.

During large-scale events, a command and control team consisting of the leading members of the organizations involved must ensure the safety of the visitors. The commanding staff continuously requires information in order to maintain awareness of the current situation and to take measures when necessary. Situational information is crucial for averting imminent dangers and resolving ongoing incidents. Once information has reached the staff, it must be distributed within it efficiently and without errors. This allows a shared situational awareness to emerge that is equally and unambiguously available to all members. To support these tasks, a command support system was developed whose functions were determined following the principles of design case studies, through iterative prototype improvements, qualitative interviews with security personnel, and field studies at large-scale events. The use of ground-based and airborne sensors for the fused processing and presentation of the current situation regarding crowd distributions in a geographic information system (GIS) was discussed with domain experts. For this purpose, the prototype was presented to them with a synthetic dataset for evaluation. After observing the command team's work processes during event security operations in order to find weak points, the GIS system was geared towards the efficient provision of master data and the visualization of situations for all active staff members. Identified weaknesses could be mitigated by supporting prototype functions, as the comparative re-enactment of observed incidents with the command support system in the concluding workshop showed.

Location-based games are currently more popular than ever with the general public. Games such as Geocaching, Ingress and Pokemon Go have created a high demand in the app market and established themselves as a major category in the mobile gaming sector. Since location-based games rely on mobile sensors, battery life, cellular data connections and even environmental conditions, many problems can arise while playing and hence reduce user experience and player enjoyment. The aim of this thesis is to improve the gaming experience of location-based games, which use map information to place virtual content at appropriate physical locations, with the assistance of a user-centered design approach. Therefore, a game named Geo Heroes was designed and implemented in order to evaluate it with existing quantitative and qualitative methods from research. The game was assessed in an empirical study with nine participants including a game-play session of about one hour. Participants were divided into an experimental and a control group to compare the implemented content placement algorithms. An already established questionnaire for traditional computer games, and one created by the author based on existing research in location-based games, were used to measure common factors in gaming experience. Additionally, participants sent log data with their current emotions during game-play after various interactions with game objects. Different outcome scenarios of interactions were considered to ensure a better analysis. Furthermore, an open group discussion was held to gather qualitative information from participants to reveal still undiscovered issues and to provide evidence for the results of the conducted quantitative methods. Results have shown that the questionnaire for location-based games is a useful tool to measure player enjoyment. In combination with the tracked emotions and a group interview, relevant information can be obtained in order to improve game design and mechanics.

Texts are of crucial importance for communicating and managing information. However, text composition is still a challenge for many people: in order to effectively convey their message, writers need skills in planning and structuring, linguistic ability, and also the ability to evaluate their own work. In this thesis, we look at how writers can be supported in all the tasks encompassed in the writing process. To this end, and in addition to literature research, we conducted an experiment to analyse the characteristics of the writing processes as well as difficulties writers typically encounter when they search for information, plan the structure of their text, translate their ideas to words, and review their writing. We formulate requirements for aiding these tasks and propose support possibilities, with a special focus on digital solutions. Issues with existing tools are that they generally support only one aspect and interrupt the writing task. This was our motivation for developing a prototype of a comprehensive text composition tool which supports writers in all stages of their task. We chose to implement it as a Google Docs add-on, which means that it can be integrated seamlessly into the Google Docs text editor. The add-on offers a number of features specifically tailored to each writing process. Finally, we performed a user study to evaluate the features and the workflow while using the add-on.

This thesis develops a tool to collaboratively explore a collection of EEG signals and identify events. Certain data require events to be tagged in a post-hoc process. Current state-of-the-art tools used in research allow a single user to manually label events or artifacts in signal data. Although automatic methods can be applied, they usually have a precision below 80% and require subsequent manual labelling steps. We propose a tool to collaboratively label data. It allows several users to work together in identifying events/artifacts in the signal space. This tool offers several advantages, from saving time by splitting up work between users to obtaining a consensus between experts on the occurrence of events. This thesis describes the collaborative aspects of labelling events in signal data.

As part of this master's thesis, a prototype of an assistance system for construction vehicles for detecting endangered persons in the construction site area was developed and evaluated. In preliminary investigations, selected sensor principles were analysed for use in person detection. A selection of camera-based and distance sensors provided data from the vehicle's surroundings. The focus of the work was on designing a suitable architecture to fuse all components and modules for person detection algorithms used in the assistance system. In the prototype setup, the human-machine interface was integrated in the form of a live camera stream with overlaid warnings in an easy-to-understand and usable user interface. In a series of tests, the performance of the system was examined at different vehicle speeds. For combinations of the deployed sensors, the maximum permissible speeds were determined at which the vehicle can still be brought to a standstill in order to avoid an accident. Test runs under conditions as realistic as possible showed that person detection can be performed in real time, but also that there is much room for improvement. Drivers are well supported by the system in situations with a high risk of accidents and are thus able to avoid them. Furthermore, the strengths and weaknesses of the person detection system were analysed, and detailed and important information was gained about work situations and processes, driver behaviour, individual components and the system as a whole.

In forest fire situations, the crisis management team often faces problems regarding coordination, the development of an operational strategy, and maintaining an overview during the operation. The goal of this work was a basic prototype to demonstrate support possibilities for the operator in the command and control centre. Usability was the primary focus in the development of this prototype. To improve usability, methods of user-centered design (UCD) were applied during the software development process. While developing software for a small user group, it was found that, due to the users' niche position, different methods must be applied than for a larger user group. For the final presentation of the prototype, an international expert workshop was chosen, at which the software was demonstrated and subsequently discussed with the experts. From the discussions, it could be concluded that such software does not yet exist and is needed for many tasks of the command staff. In general, it can be said that UCD methods form a good basis for the development of disaster management software, and that the further development of this software prototype is a good starting point for building a forest fire management system.

During the last decades, the amount of information available for researchers has increased several fold, making searches more difficult. Thus, Information Retrieval (IR) systems are needed. In this master thesis, a tool has been developed to create a dataset with metadata of scientific articles. This tool parses the articles of Pubmed, extracts metadata from them and saves the metadata in a relational database. Once all the articles have been parsed, the tool generates three XML files with that metadata: Articles.xml, ExtendedArticles.xml and Citations.xml. The first file contains the title, authors and publication date of the parsed articles and the articles referenced by them. The second one contains the abstract, keywords, body and reference list of the parsed articles. Finally, the Citations.xml file contains the citations found within the articles and their context. The tool has been used to parse 45,000 articles. After the parsing, the database contains 644,906 articles with their title, authors and publication date. The articles of the dataset form a digraph where the articles are the nodes and the references are the arcs of the digraph. The in-degree of the network follows a power law distribution: there is a small set of articles referenced very often while most of the articles are rarely referenced. Two IR systems have been developed to search the dataset: the Title Based IR and the Citation Based IR. The first one compares the query of the user to the title of the articles, computes the Jaccard index as a similarity measure and ranks the articles according to their similarity. The second IR compares the query to the paragraphs where the citations were found. The analysis of both IRs showed that the execution time needed by the Citation Based IR was higher. Nevertheless, the recommendations given were much better, which proved that the parsing of the citations was worthwhile.
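The Title Based IR step can be illustrated directly from the description above: tokenize the query and each title, compute the Jaccard index, and rank by that similarity. Titles and the query below are toy examples.

```python
# Sketch of the Title Based IR step: compare the query to each article title
# with the Jaccard index and rank articles by that similarity.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

titles = {
    1: "Deep learning for protein structure prediction",
    2: "A survey of citation recommendation systems",
    3: "Protein folding with Monte Carlo methods",
}

def rank_titles(query):
    scores = {doc_id: jaccard(query, title) for doc_id, title in titles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_titles("protein structure prediction"))
```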

Open Information Extraction (OIE) targets domain- and relation-independent discovery of relations in text, scalable to the Web. Although German is a major European language, no research has been conducted in German OIE yet. In this paper we fill this knowledge gap and present GerIE, the first German OIE system. As OIE has received increasing attention lately and various potent approaches have already been proposed, we surveyed to what extent these methods can be applied to the German language and which additional principles could be valuable in a new system. The most promising approach, hand-crafted rules working on dependency parsed sentences, was implemented in GerIE. We also created two German OIE evaluation datasets, which showed that GerIE achieves at least 0.88 precision and recall with correctly parsed sentences, while errors made by the used dependency parser can reduce precision to 0.54 and recall to 0.48.

Networked data and structures are attracting growing interest and are pushing established data management methods into the background. Graph databases offer a new approach to the challenges posed by managing large and highly connected data volumes. This master's thesis evaluates the performance of graph databases compared to an established relational database. Performance is determined through benchmark tests on the processing of highly connected data, taking into account an implemented fine-grained authorization concept. The theoretical part first covers the fundamentals of databases and graph theory. These provide the basis for assessing the functional scope and functionality of the graph databases selected for evaluation. The described authorization concepts give an overview of different access concepts as well as the implementation of access controls in the graph databases. Based on the information gained, a Java framework is implemented that makes it possible to test the graph databases as well as the relational database while taking the implemented fine-grained authorization concept into account. By running suitable test runs, the performance of read and write operations can be determined. Benchmark tests for write access are carried out for datasets of different sizes. Individually defined search queries for the different data sizes allow the read performance to be determined. It was shown that the relational database scales better than the graph databases when writing data. Creating nodes and edges is more expensive in graph databases than creating a new table entry in the relational database. The evaluation of the search queries under the implemented access concept showed that graph databases scale significantly better than the relational database for large and highly connected data volumes. The more connected the data, the more pronounced the JOIN problem of the relational database becomes.

Information validation is the process of determining whether a certain piece of information is true or false. Existing research in this area focuses on specific domains, but neglects cross-domain relations. This work attempts to fill this gap and examine how various domains deal with the validation of information, providing a big picture across multiple domains. Therefore, we study how research areas, application domains and their definitions of related terms in the field of information validation are related to each other, and show that there is no uniform use of the key terms. In addition we give an overview of existing fact finding approaches, with a focus on the data sets used for evaluation. We show that even baseline methods already achieve very good results, and that more sophisticated methods often improve the results only when they are tailored to specific data sets. Finally, we present the first step towards a new dynamic approach for information validation, which generates a data set for existing fact finding methods on the fly by utilizing web search engines and information extraction tools. We show that, with some limitations, it is possible to use existing fact finding methods to validate facts without a preexisting data set. We generate four different data sets with this approach, and use them to compare seven existing fact finding methods to each other. We discover that the performance of the fact validation process is strongly dependent on the type of fact that has to be validated as well as on the quality of the used information extraction tool.

This thesis aims to shed light on the problem of early classification of time series by deriving the trade-off between classification accuracy and time series length for a number of different time series types and classification algorithms. Previous research on early classification of time series focused on keeping the classification accuracy of reduced time series roughly at the level of the complete ones. Furthermore, that research work does not employ cutting-edge approaches like Deep Learning. This work fills that research gap by computing trade-off curves on classification "earliness" vs. accuracy and by empirically comparing algorithm performance in that context, with a focus on the comparison of Deep Learning with classical approaches. Such early classification trade-off curves are calculated for univariate and multivariate time series and the following algorithms: 1-Nearest Neighbor search with both the Euclidean and Frobenius distance, 1-Nearest Neighbor search with forecasts from ARIMA and linear models, and Deep Learning. The results obtained indicate that early classification is feasible in all types of time series considered. The derived trade-off curves all share the common trait of slowly decreasing at first, and featuring sharp drops as time series lengths become exceedingly short. Results showed that Deep Learning models were able to maintain higher classification accuracies for larger time series length reductions than other algorithms. However, their long run-times, coupled with the complexity of parameter configuration, imply that faster, albeit less accurate, baseline algorithms like 1-Nearest Neighbor search may still be a sensible choice on a case-by-case basis. This thesis draws its motivation from areas like predictive maintenance, where the early classification of multivariate time series data may boost the performance of early warning systems, for example in manufacturing processes.
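A minimal version of the trade-off computation could look like the sketch below: truncate each series to a fraction of its length, classify with 1-Nearest Neighbor and the Euclidean distance, and record accuracy per truncation level. The data are synthetic, and the forecast-based and Deep Learning variants studied in the thesis are not shown.

```python
# Sketch of the trade-off computation: truncate the series to a fraction of
# their length, classify with 1-NN (Euclidean) and record accuracy per level.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, length = 400, 100
y = rng.integers(0, 2, n)
# Two classes differing in trend, plus noise (synthetic placeholder data).
X = np.array([np.linspace(0, 3 if c else 1, length) for c in y]) + rng.normal(0, 1, (n, length))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for frac in (1.0, 0.5, 0.25, 0.1):
    k = max(1, int(length * frac))                 # keep only the first k points
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, :k], y_tr)
    acc = clf.score(X_te[:, :k], y_te)
    print(f"using {frac:.0%} of the series: accuracy {acc:.2f}")
```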

Research on recommender systems has gained tremendous popularity in recent years. Although various recommender approaches are available nowadays, there is still a lack of work that tackles real-time recommendation on large and sparse data. To tackle the data sparsity problem, this thesis analyzes different trust-based approaches which improve the accuracy of the commonly used Collaborative Filtering recommendation approaches. To show how the trust-based approaches can also be applied to generate real-time recommendations, this thesis extended ScaR, a scalable recommendation framework, with recommendation approaches which calculate the trust values between users using the Apache Solr search engine. Experimental results showed that using trust-based approaches, high-quality recommendations can be served in real time.
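The following is a generic illustration of the trust-based idea, not the ScaR/Solr implementation: a user's rating for an item is predicted as the trust-weighted average of the ratings of the users they trust. Users, items and trust values are invented.

```python
# Generic illustration (not the ScaR/Solr implementation): predict a user's
# rating for an item as the trust-weighted average of trusted users' ratings.
ratings = {                      # user -> {item: rating}
    "alice": {"i1": 4.0, "i2": 2.0},
    "bob":   {"i1": 5.0, "i3": 3.0},
}
trust = {"carol": {"alice": 0.8, "bob": 0.4}}   # carol's trust in other users

def predict(user, item):
    num = den = 0.0
    for other, t in trust.get(user, {}).items():
        if item in ratings.get(other, {}):
            num += t * ratings[other][item]
            den += t
    return num / den if den else None

print(predict("carol", "i1"))    # -> trust-weighted estimate
```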

Everyone knows the annoying situation when personal items of appreciated value disappear and several precious minutes, or even hours, are wasted on frantic searching. Modern technologies are useful for assisting in such moments. For example, electronic key finders triggered by whistling or remote controls have been available for years, but the acceptance of these gadgets is rather low. An important field of research on this topic is the use of modern smartphones for locating everyday objects. Today's smartphones are equipped with various hardware that can be used for retrieving locations. The primary technology used for this purpose is the Global Positioning System (GPS). For example, Apple is successfully offering the service Find my iPhone, which can locate a misplaced iPhone with the usage of GPS. The limitation of GPS is the lack of accuracy in urban areas and especially inside of buildings. To counteract this limitation, GPS is often used together with WiFi triangulation, which needs a well-developed WiFi infrastructure for proper operation, which is difficult to achieve in private households. The goal of this thesis is to develop an easy-to-use application for individuals for retrieving their lost items indoors and outdoors using only technologies present in their smartphones. A hybrid solution of localization and motion sensing is used for tracing the user's location. The focus is on indoor tracing using accelerometer, gyroscope and compass data. The prototype is implemented as an iPhone application to record motion and location data and a web application to calculate and visualize the user's trace. The web application also provides a user interface for backtracking the user's trace to a lost item by time filtering or by tagging items.

During a typical day, we have several social interactions with different people belonging to different semantic groups (e.g. friends, family, co-workers). In this paper we try to find promising hypotheses to link data collected from a mobile sensing application running on the user's smartphone to the social interactions they have during a typical day. We search for possibilities to reliably determine (1) the number of interactions during the day, (2) the length of social interactions, (3) the number of participants, (4) who the participants were and (5) the semantic context of the interaction, using data collected in a pilot study where, in addition to the data collected by the framework, users label their interactions during the day.

With this thesis we try to determine the feasibility of detecting face-to-face social interactions based on standard smartphone sensors like Bluetooth, Global Positioning System (GPS) data, the microphone or the magnetic field sensor. We try to detect the number of social interactions by leveraging Mobile Sensing on modern smartphones. Mobile Sensing is the use of smartphones as ubiquitous sensing devices to collect data. Our focus lies on the standard smartphone sensors provided by the Android Software Development Kit (SDK) as opposed to previous work which mostly leverages only audio signal processing or Bluetooth data. To mine data and collect ground truth data, we write an Android 2 app that collects sensor data using the Funf Open Sensing Framework[1] and additionally allows the user to label their social interactions as they take place. With the app we perform two user studies over the course of three days with three participants each. We collect the data and add additional meta-data for every user during an interview. This meta-data consists of semantic labels for location data and the distinction of social interactions into private and business social interactions. We collected a total of 16M data points for the first group and 35M data points for the second group. Using the collected data and the ground truth labels collected by our participants, we then explore how time of day, audio data, calendar appointments, magnetic field values, Bluetooth data and location data interact with the number of social interactions of a person. We perform this exploration by creating various visualizations for the data points and use time correlation to determine if they influence the social interaction behavior. We find that only calendar appointments provide some correlation with the social interactions and could be used in a detection algorithm to boost the accuracy of the result. The other data points show no correlation during our exploratory evaluation of the collected data. We also find that visualizing the interactions in the form of a heatmap on a map is a visualization that most participants find very interesting. Our participants also made clear that labeling all social interactions over the course of a day is a very tedious task. We recommend that further research include audio signal processing and a carefully designed study setup. This design has to include what data needs to be sampled at what frequency and accuracy and must provide further assistance to the user for labeling the data. We release the data mining app and the code used to analyze the data as open source under the MIT License.

Many people face the problem of misplaced personal items in their daily routine, especially when they are in a hurry, and often waste a lot of time searching for these items. There are different gadgets and applications available on the market which try to help people find lost items. Most often, help is given by creating an infrastructure that can locate lost items. This thesis presents a novel approach to finding lost items, namely by helping people re-trace their movements throughout the day. Movements are logged by indoor localization based on mobile phone sensing. An external infrastructure is not needed. The application is based on a step-based pedestrian dead reckoning system, which is developed to collect real-time localization data. This data is used to draw a live visualization of the whole trace the user has covered, from which the user can retrieve the position of the lost personal items after they were tagged using simple speech commands. The results from the field experiment, which was performed with twelve participants of different ages and genders, showed that the application could successfully visualize the covered route of the pedestrians and reveal the position of the placed items.
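The core update of a step-based pedestrian dead reckoning trace can be sketched as follows: for every detected step, advance the position by the step length along the current heading. Step detection and step-length estimation from the accelerometer are simplified away; headings and step lengths are illustrative.

```python
# Core update of a step-based pedestrian dead reckoning trace: for every
# detected step, advance the position by the step length along the heading.
import math

def dead_reckon(steps, start=(0.0, 0.0)):
    """steps: iterable of (heading_deg, step_length_m) per detected step."""
    x, y = start
    trace = [(x, y)]
    for heading_deg, step_len in steps:
        h = math.radians(heading_deg)
        x += step_len * math.sin(h)   # east component
        y += step_len * math.cos(h)   # north component
        trace.append((x, y))
    return trace

# Walk 5 steps north, turn, then 3 steps east.
print(dead_reckon([(0, 0.7)] * 5 + [(90, 0.7)] * 3))
```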

The amount of multimedia content being created is growing tremendously. In addition, the number of applications for processing, consuming, and sharing multimedia content is growing. Being able to create and process metadata describing this content is an important prerequisite to ensure a correct workflow of applications. The MPEG-7 standard enables the description of different types of multimedia content by creating standardized metadata descriptions. When using MPEG-7 in practice, two major drawbacks are identified, namely complexity and fuzziness. Complexity is mainly based on the comprehensiveness of MPEG-7, while fuzziness is a result of the syntax variability. The notion of MPEG-7 profiles was introduced in order to address and possibly solve these issues. A profile defines the usage and semantics of MPEG-7 tailored to a particular application domain. Thus usage instructions and explanations, denoted as semantic constraints, can be expressed as English prose. However, these textual explanations leave space for potential misinterpretations since they have no formal grounding. While checking the conformance of an MPEG-7 profile description is possible on a syntactical level, the semantic constraints currently cannot be checked in an automated way. If the semantic constraints cannot be handled, inconsistent MPEG-7 profile descriptions can be created or processed, leading to potential interoperability issues. Thus an approach for formalizing the semantic constraints of MPEG-7 profiles using ontologies and logical rules is presented in this thesis. Ontologies are used to model the characteristics of the different profiles with respect to the semantic constraints, while validation rules detect and flag violations of these constraints. In a similar manner, profile-independent temporal semantic constraints are also formalized. The presented approach is the basis for a semantic validation service for MPEG-7 profile descriptions, called VAMP. VAMP verifies the conformance of a given MPEG-7 profile description with a selected MPEG-7 profile specification in terms of syntax and semantics. Three different profiles are integrated in VAMP. The temporal semantic constraints are also considered. As a proof of concept, VAMP is implemented as a web application for human users and as a RESTful web service for software agents.

The goal of this thesis is to improve query suggestions for rare queries on faceted documents. While there has been extensive work on query suggestions for single-facet documents, little is known about how to provide query suggestions in the context of faceted documents. The constraint to provide suggestions also for uncommon or even previously unseen queries (so-called rare queries) increases the difficulty of the problem, as the commonly used technique of mining query logs cannot be easily applied.

In this thesis it was further assumed that the user of the information retrieval system always searches for one specific document, leading to uniformly distributed queries. Under these constraints, the structure of the faceted documents was exploited to provide helpful query suggestions. In addition to the theoretical exploration of such improvements, a custom data structure was developed to efficiently provide interactive query suggestions. The developed query suggestion algorithms were evaluated on multiple document collections by comparing them to a baseline algorithm that reduces faceted documents to single-facet documents. Results are promising, as the final version of the new query suggestion algorithm consistently outperformed the baseline.

Motivation for and potential application of this work can be found in call centers for customer support. For call center employees it is crucial to quickly locate relevant customer information - information that is available in structured form (and can thus easily be transformed into faceted documents).
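The custom data structure itself is not spelled out in this abstract; as a hedged illustration of the kind of per-keystroke lookup involved, a prefix trie (a common baseline for interactive suggestions, not the thesis's structure) could look like this:

```python
# A prefix trie as a common baseline for interactive query suggestions; this
# is an illustrative stand-in, not the custom data structure of the thesis.
from collections import defaultdict

class Trie:
    def __init__(self):
        self.children = defaultdict(Trie)
        self.terms = []                    # completions reachable from this node

    def insert(self, term):
        node = self
        for ch in term:
            node = node.children[ch]
            node.terms.append(term)

    def suggest(self, prefix, k=5):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.terms[:k]

trie = Trie()
for q in ["invoice status", "invoice address", "insurance number"]:
    trie.insert(q)
print(trie.suggest("inv"))
```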

“Wiktionary” is a free dictionary that is part of the Wikimedia Foundation. The website contains translations, etymologies, synonyms and pronunciations of words in multiple languages; in this work we focus on English only.

A syntactic analyser (parser) turns the entry text into other structures, which makes it easier to analyse and capture nested entries.

Twitter is a very popular social network among scientists. There, they discuss a wide variety of topics, promote new ideas or present results of their current research. The experiments conducted in this work are based on a Twitter dataset consisting of the tweets of computer scientists whose research fields are known. This thesis can be roughly divided into four parts: First, it is described how the Twitter dataset was created. Then, various statistics on this dataset are presented. For example, most tweets were created during working hours, and the users differ considerably in how active they are. A network was built from the users' follower relationships, which demonstrably has small-world properties. Moreover, the different research fields are also visible in this network. The third part of this work is devoted to the investigation of hashtag usage. It was found that most hashtags are used only rarely. Over the entire observation period, hashtag usage hardly changes, but there are many short-term fluctuations. Since the research fields of the users are known, the fields of the hashtags can also be determined. This allows the hashtags to be divided into field-specific and general hashtags. The spread of hashtags through the Twitter network is examined in the fourth part by means of so-called information flow trees. Based on these information flow trees, it can be measured how well a user spreads and generates information. The hypothesis was confirmed that these properties depend on the number of tweets and retweets and on the position in the social network. However, this relationship is only strongly pronounced in individual cases.

The presented research provides an answer for the detection of anomalies in big data when the processing of the information has to be done in “quasi” real-time. An overview of the following topics is given:
• what big data is
• available tools for processing big data, as well as doing it in real-time
• existing approaches to detecting anomalies
Different outlier detection algorithms are not only studied theoretically, but also practically. The strengths and flaws of each approach are evaluated to see which are more likely to be used in each instance to deal with the data at its arrival time. Furthermore, those algorithms are tested with a dataset in order to observe a practical application of anomaly detection. In conclusion, a statistical approach to the outlier detection problem gives the most accurate results when using “near” real-time systems. The depth and density approaches also obtain quality results, but run into problems when clusters of outliers are formed.
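As a minimal example of the statistical flavour of outlier detection favoured in the evaluation, the sketch below flags a point as an outlier if it lies several standard deviations away from the mean of a sliding window. Window size and threshold are illustrative, not the settings examined in the thesis.

```python
# Minimal statistical outlier test on a stream: flag a point if it lies more
# than `z` standard deviations from the mean of a sliding window.
import random
from collections import deque
from statistics import mean, stdev

def streaming_outliers(stream, window=50, z=3.0):
    buf = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(buf) == window:
            mu, sigma = mean(buf), stdev(buf)
            if sigma > 0 and abs(x - mu) > z * sigma:
                yield i, x                 # report index and value
        buf.append(x)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(500)]
data[200] = 12.0                           # injected anomaly
print(list(streaming_outliers(data)))
```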

In this thesis an approach for authorship attribution is presented with a focus on Web forums. The approach is based on distance metrics for comparison between frequency vectors of multiple feature spaces, which are extracted by existing Natural Language Processing tools and used in the existing literature on authorship attribution. An algorithm trains a model using these features obtained for each of the authors within the data set. The source of the data are Web forum messages, which are crawled with existing tools for a subsequent HTML parse and further analysis. The classifier decides the authorship by weighting each of the features. In total, three approaches were tested, taking into account different feature space weighting strategies. To allow the conclusions to generalise, the evaluated data sets were assembled for multiple languages (English, German and Spanish), as well as multiple topics. The results achieved are promising, especially with longer messages, where more data is available. Contrary to existing research, n-gram features do not appear to be the best feature for authorship attribution for Web forums.
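The distance-based mechanism can be illustrated as follows: represent each author by a character n-gram frequency vector and attribute a new message to the author with the smallest cosine distance. The thesis combines several feature spaces and weighting strategies; only the basic mechanism is shown, with invented texts.

```python
# Illustration of the distance-based scheme: character n-gram frequency
# vectors per author, nearest author by cosine distance.
from collections import Counter
import math

def ngrams(text, n=3):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_distance(a, b):
    common = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / norm if norm else 1.0

author_profiles = {
    "author_a": ngrams("i totally agree with the previous poster on this"),
    "author_b": ngrams("meiner meinung nach ist das nicht korrekt"),
}
message = ngrams("i agree with this poster")
print(min(author_profiles, key=lambda a: cosine_distance(author_profiles[a], message)))
```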

Knowledge workers are exposed to many influences which have the potential to interrupt work. The impact of these influences on individuals, not only knowledge workers, often causes detrimental effects on physical health and well-being. Twelve knowledge workers took part as participants in the experiment conducted for this thesis. The focus of the experiment was to analyse whether the sound level and computer interactions of knowledge workers can predict their self-reported stress levels. A software system was developed using sensors on knowledge workers' mobile and desktop devices. Records of PC activity contain information about foreground windows and computer idle times. Foreground window records include the timestamp when a window received focus, the duration the window was held in the foreground, the window title and the unique number identifying the window. Computer idle time records contain information about the timestamp when idle time began and its duration. Computer idle time was recorded only after a minimum idle interval of one minute. Sound levels were recorded using a smartphone's microphone (Android). The average sound pressure level from the audio samples was computed over a one-minute timeframe. Once initialized with an anonymous participant code, the sensors record PC activity and sound level and upload the records enriched with the code to a remote service. The service uses a key-value based database system with the code as key and the collection of records as value. The service stores the records for each knowledge worker over a period of ten days. After this period, the preprocessing component of the system splits the records of PC activity and sound level into working days and computes measures approximating worktime fragmentation and noise. Foreground window records were used to compute the average time a window was held in the foreground and the average time an application was held in the foreground. Applications are sets of foreground window records which share the same window title. Computer idle time records were used to compute the number of idle times between one and five minutes and the period of those idle times which lasted more than twenty minutes. From the sound pressure levels, the average level and the period of all levels which exceeded 60 decibels were computed. The figures were computed within the scope of a participant's working day for five different temporal resolutions. Additionally, the stress levels are computed from midday and evening scales. Participants recorded stress levels twice per working day and entered them manually into the system. The first self-report was made close to lunch break and the second at the end of a day at work. Since participants forgot to enter self-assessed stress levels, the number of working days containing data of all types ranges between eight and ten. As a result, the preprocessing component stores the measures and stress levels used by the stress prediction analysis component. The correlation of the measures with the self-reported stress levels showed that a prediction of those stress levels is possible. The state of well-being (mood, calm) increased the higher the number of idle times between one and five minutes in combination with a sound pressure level not exceeding 60 decibels.
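The final analysis step essentially correlates the computed measures with the self-reported stress scales. A Pearson correlation per measure, as sketched below with invented values and placeholder column names, is one straightforward way to express that step; it is not the exact analysis performed in the thesis.

```python
# Correlate computed measures with self-reported stress; values and column
# names are placeholders, not data from the study.
import pandas as pd

df = pd.DataFrame({
    "short_idle_count":    [3, 8, 5, 12, 7, 10],   # idle times of 1-5 minutes
    "loud_minutes":        [40, 10, 25, 5, 20, 8], # minutes above 60 dB
    "avg_window_duration": [55, 90, 60, 120, 80, 100],
    "reported_stress":     [4, 2, 3, 1, 3, 2],     # evening self-report
})

for col in df.columns.drop("reported_stress"):
    r = df[col].corr(df["reported_stress"])        # Pearson by default
    print(f"{col}: r = {r:.2f}")
```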

Recently, the potential of Twitter for research-relevant applications has been increasingly recognized. Among other things, this leads to the use of Twitter in the course of scientific conferences. From this, it can be concluded that the corresponding communities provide interesting information during scientific conferences. However, it is almost impossible to read all tweets published during a conference, or to manually extract interesting information from tweets in the first place. During the WWW2012 conference, for example, 6,901 tweets were published with the conference's designated hashtag #www2012. This work describes the implementation and evaluation of a system that clusters tweets published in the context of a scientific conference. The resulting clusters were visualized in order to make them more understandable for humans. The evaluation of the system based on the tweets published during WWW2012 shows that both topics and organizational events can be extracted. Furthermore, the results show the need to evaluate further clustering techniques and to implement additional techniques for establishing relationships between the clusters.

Looking at the development of mobile devices in recent years, one can see that smartphones and tablets are becoming increasingly important. In Austria alone, smartphones already account for about one third of all mobile phones. However, these devices not only bring faster processors, more powerful graphics cards and more memory from generation to generation, but also more and more sensors that can be read out via APIs. This offers scientists who want to record data about human behaviour in real life (movements, communication, daily routines, etc.) a very simple way of obtaining user data from the broadest possible target group. But once these sensors have collected every conceivable piece of information, the question arises of how to visualize all these numbers appropriately. This master's thesis deals with exactly this visualization of mobile sensor data directly on mobile devices. In a first step, a detailed analysis of the task is carried out, addressing the sensor data to be visualized, the prevailing limitations of mobile devices with regard to hardware resources, and the special user interaction paradigms on mobile devices. Furthermore, this thesis presents fundamental visualizations that make it possible to display many different types of data efficiently. After a closer examination and a comparison of related work, the main part of this master's thesis describes the implementation of a visualization framework that allows a performant and interactive display of mobile sensor data directly on the smartphone or tablet. This visualization framework was combined with a sensing framework into a fully functional prototype called iPeeper, which records, displays and synchronizes sensor data across multiple devices.

The Internet is evolving from a collection of interlinked documents into an interactive medium in which the notion of "meaning" plays a major role, driven by the increasing publication of structured, interlinked, machine-understandable data. In the context of this work, the development of a "Semantic Web" and its related technologies, such as the Resource Description Framework (RDF) or the query language SPARQL, is explained, and a wizard for the automated generation of queries to repositories of the Linked Open Data Cloud is developed. This SPARQL wizard is intended to make it as easy as possible for a user to exploit the advantages of the Semantic Web when gathering information.
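To give an impression of what such a wizard ultimately produces, the sketch below sends a hand-written SPARQL query to a public Linked Open Data endpoint. The DBpedia endpoint and the query are illustrative examples only; they are not generated by the wizard described here.

```python
# Illustrative only: send a SPARQL query to a public Linked Open Data
# endpoint (DBpedia) and print the results. Not output of the wizard itself.
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Austria ;
        dbo:populationTotal ?population .
} ORDER BY DESC(?population) LIMIT 5
"""

resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```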

The concept of the Semantic Web envisions offering information in a structured way based on its semantic relationships, making it findable via the keywords assigned to it. Access to these data works via so-called SPARQL queries, which specify exactly which data should be extracted from this large pool. Once retrieved, these data can be further processed and, among other things, visualized for a user-friendly presentation. The visualization of semantic data has meanwhile become very important in the field of knowledge management and is the topic of this master's thesis. Quite a lot has already been realized in the area of visualizing heterogeneous data, yet the topic of reuse has received little attention. The goal of this work is to offer a generic approach to the visual representation of heterogeneous data. The idea of the generic solution is based on the concept of systematic reuse, namely software product lines. The framework developed for this purpose supports a number of interactive charts. With this framework, the user can create a visualization for a SPARQL query and then save it. To reuse the saved charts, the framework can be contacted with a query and the finished chart is displayed on the client side. The framework was designed so that the client can create a chart with little effort and, in the role of the developer, can even extend the framework with new charts, which are then offered as visualization templates. The quantitative evaluation carried out at the end showed that the approach taken for this framework is more efficient than the traditional visualization method.

The central challenge in developing software for work-integrated learning (WIL) is to provide learning content that is adapted to the situational conditions and the prior knowledge of the users (adaptive systems). To realize adaptivity, a user model is required that is continuously updated to reflect the learning progress. In contrast to the school and university context, hardly any adaptive systems exist to support WIL. The goal of my master's thesis was to develop a WIL user model, WIL user model services and a software architecture to support WIL. The WIL system should adapt to the work task and the prior knowledge of the users, use real work documents as learning content, and be integrated into the users' work environment. Requirements for the system were derived from the theory on WIL on the one hand and from existing use cases on the other. The requirements analysis showed that three types of functionality appear central to supporting WIL: non-invasive knowledge diagnosis, content recommendations and expert recommendations. In my master's thesis, these functionalities were conceptualized via different types of user model services (logging, production, inference and control services), which together form the WIL User Model Services (WIL UMS). The WIL UMS were prototypically implemented in the adaptive WIL system APOSDLE. APOSDLE's user model is updated automatically via log data ("Knowledge Indicating Events"). Based on the user model, APOSDLE recommends real work documents and experts. APOSDLE and the WIL UMS were installed as an intelligent solution for supporting WIL in four companies and are integrated into the users' work environment.

Handling a huge and constantly growing amount of personal data is becoming increasingly difficult. By unobtrusively observing the user, the current user context can be captured, thereby enabling better support for the knowledge worker. The goal of this master's thesis is to determine entities relevant to a current task within a user interaction context model. A spreading activation approach is applied to the graph structure of a user interaction context model in order to find, based on the current user context, relevant tasks of the same and of another user. The user interaction context model, the resulting ontology and the automatic population mechanisms were realized by Andreas Rath as part of his research. The goals of this master's thesis are (a) the identification of relevant tasks in a user interaction context model, (b) the determination of concepts and properties of the user interaction context ontology for the spreading activation approach, (c) the evaluation of the required number of iterations, (d) the evaluation of a combination of activation decay, threshold and relation weighting that yields good results for the spreading activation approach, and (e) the visualization of the spreading activation graph based on the user interaction context graph.
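A minimal spreading activation pass of the kind applied to the user interaction context graph is sketched below: activation flows from the nodes of the current context along weighted relations, attenuated by a decay factor and cut off by a threshold. The graph, relation weights and parameter values are purely illustrative.

```python
# Minimal spreading activation over a small weighted graph; graph contents,
# weights, decay, threshold and iteration count are illustrative only.
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("task:report", "doc:budget.xls", 0.9),
    ("doc:budget.xls", "task:planning", 0.7),
    ("task:report", "person:alice", 0.5),
    ("person:alice", "task:review", 0.6),
])

def spread(graph, sources, decay=0.8, threshold=0.1, iterations=3):
    activation = {n: 1.0 for n in sources}
    for _ in range(iterations):
        new = dict(activation)
        for node, act in activation.items():
            if act < threshold:
                continue
            for nbr in graph.neighbors(node):
                w = graph[node][nbr]["weight"]
                new[nbr] = new.get(nbr, 0.0) + act * w * decay
        activation = new
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)

print(spread(g, ["task:report"]))
```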

In the dynamic world of IT, adaptivity has become one of the most important characteristics of a system. To guarantee a system's adaptivity, an appropriate infrastructure must be ensured, and every component of the system should support adaptivity to a certain degree. How should a system be designed that, on the one hand, is abstracted as far as possible and offers a high degree of flexibility, while on the other hand not affecting the accuracy of the recommendations? To solve this problem, other systems such as CUMULATE (the user model component in the KnowledgeTree system) introduced so-called intelligent inference agents. Each of these agents was responsible for one property of the user (e.g., the user's motivation or knowledge). The present work follows a similar concept. Instead of the properties of the user profile, the focus is placed on the circumstances and situations in which the users work. One option would be to use several types of inference agents (configurations of the user profile component) that are preconfigured for different situations. Different situations arise from new systems, new domains, different domain states, and new work and behaviour patterns. If the current configuration is insufficient for any reason, it should be relatively easy to replace it with a better-suited configuration. The problem, however, is that it is not known which configuration is the most suitable for the current situation. A verification mechanism must therefore be found that takes care of this issue. This mechanism is presented in this master's thesis as a simulation framework. The practical part of this master's thesis consists of the implementation of the UPS Prototype 3 and the simulation framework and, building on this, the simulation of user behaviour patterns in order to calibrate the UPS component of the APOSDLE system. The simulations clearly show that the algorithms that take the so-called aging factor into account achieve the best results. With this insight, the number of possible configurations in the system was reduced from originally six to ultimately two.

Nowadays, the amount of digital data grows every day. Through the Internet, part of it is accessible to the broad public at any time. Search engines support the user in filtering the desired information out of a seemingly inexhaustible amount of data. Likewise, in the intranet of a large company, it is becoming increasingly difficult to organize and structure the vast amounts of data adequately in order to find the sought-after information quickly. In this work, a knowledge management system was developed for the Dutch semiconductor manufacturer NXP Semiconductors to optimize access to the specifications of the internally developed JCOP operating system. Specifications can be managed, grouped and searched by the user. As the basis for the full-text search, a proven information retrieval technique called the vector space model was used. When indexing the specifications, the text is extracted, filtered and embedded into an index. This enables the user to search the full text of the specifications. Building on the full-text search of the specifications, a machine learning method called k-nearest-neighbour was used to compare the grouping of individual specifications performed by the user with the results of the k-nearest-neighbour classification. After several optimization steps, the recall of the classification could be improved to over 70% and the accuracy to over 90%.
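The described combination of a vector space index and k-nearest-neighbour classification can be illustrated with a few lines of scikit-learn; the specification texts and groups below are toy examples, not JCOP content.

```python
# Illustration of the described combination: a TF-IDF vector space model plus
# a k-nearest-neighbour classifier assigning specifications to groups.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

specs = [
    "secure channel protocol for contactless interface",
    "applet firewall and object sharing rules",
    "contactless transmission error handling",
    "memory management for applet instances",
]
groups = ["communication", "runtime", "communication", "runtime"]

model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(specs, groups)
print(model.predict(["retransmission on the contactless interface"]))
```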

The enormous growth of data of the most diverse types and origins has led in recent years to huge, partly confusing and unstructured volumes of data. In light of this, the appropriate preparation and efficient handling of large data volumes is of particular relevance. Computer-supported visualization, and especially the visualization of semantic graph structures, plays a central role here. Both the current situation and the forecast of future developments underline the topicality and particular importance of this subject. The thesis first examines the theoretical background of selected topics in graph and information visualization. A subsequent evaluation of existing tools, packages, and frameworks provides an overview of the software solutions currently available for visualizing graphs and semantic graph structures. In the practical part of the thesis, taking the results of the evaluation into account, a system for the visualization and dynamic aggregation of RDF graphs is implemented.
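As an illustration of the aggregation idea only (not the thesis implementation), the following sketch collapses an RDF graph by rdf:type before visualization, assuming rdflib and networkx as tools and a placeholder input file.

import rdflib
import networkx as nx

g = rdflib.Graph()
g.parse("data.ttl", format="turtle")          # placeholder input file

# Collapse all subjects and objects of the same rdf:type into one aggregate node
type_of = {s: o for s, _, o in g.triples((None, rdflib.RDF.type, None))}
agg = nx.DiGraph()
for s, p, o in g:
    src, dst = type_of.get(s, s), type_of.get(o, o)
    if src != dst:
        agg.add_edge(str(src), str(dst), label=str(p))

print(agg.number_of_nodes(), "aggregated nodes,", agg.number_of_edges(), "edges")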

The increasing interconnection of companies and their opening towards the outside world means that a large share of the factors determining success lies outside a company's own sphere of influence. Successful management therefore requires, more than ever, gathering information on the viewpoints and motivational factors of the central external stakeholders in such a way that it can be merged with the company's own views and processed further. This information can be obtained in various ways; Web 2.0 in particular offers opportunities to enrich structured, company-internal information with further input. This enrichment is the central focus of the thesis and derives from a specific use case in which success factors and their interconnections within companies are examined. The thesis investigates how existing structured information can be enriched by third parties through discussion on common Web 2.0 platforms such as blogs or wikis. As a solution, a platform for discussing structured information is designed and developed as a Rich Internet Application. It has the character of a search engine and, in contrast to conventional discussion platforms such as blogs, the discussion is conducted in a structured way by means of ratings. The platform is evaluated through a pilot deployment and an expert survey. This master's thesis shows that Rich Internet Applications can achieve a high degree of usability for enriching structured information, in the form of success factors, with further information contributed by third parties. The thesis is based on a contract awarded by SUCCON Schachner & Partner KG to TU Graz; selected design and development activities carried out under this contract were released, in agreement with SUCCON, for treatment in this thesis.

Collaborative tagging systems allow users to annotate different kinds of web resources (URLs, photos, publications, etc.) with a freely chosen, open vocabulary of so-called "tags". While early research focused primarily on analyzing the structure and dynamics of collaborative tagging systems, more recent work has begun to examine the motivational structures underlying tagging. This master's thesis aims at a deeper understanding of the tagging characteristics of two fundamentally different types of motivation: categorization versus description. So-called "categorizers" use tags primarily to build and maintain a helpful navigational structure for their resources. To this end, they establish a personal tag vocabulary that tends to stabilize quickly and exhibits a uniform tag-usage frequency. "Describers" primarily aim to annotate resources in great detail in order to support search as well as possible. Because they apply their tags ad hoc and descriptively, their tag vocabulary typically grows much faster and shows an uneven usage distribution. Based on 10 tagging datasets acquired from 6 different collaborative tagging systems (BibSonomy, CiteULike, Delicious, Flickr, Diigo, and Movielens), this thesis systematically compares the tagging practices of categorizers and describers. To this end, a pragmatic analysis was conducted based on selected statistical metrics that reflect different intuitions about the tagging characteristics of categorizers and describers. The thesis also includes empirical results of a qualitative user study: in a binary classification task estimating whether users are categorizers or describers, it was investigated which statistical metrics correspond most closely to human judgement. The central results of this thesis thus concern a set of selected tagging characteristics analyzed comparatively for categorizers and describers. The results show that simple yet robust statistical measures make it possible to automatically identify differences in users' tagging pragmatics.
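The exact metrics used in the thesis are not reproduced here; the following sketch shows two plausible measures in the same spirit, vocabulary growth rate and tag-usage entropy, applied to a toy tag stream.

import math
from collections import Counter

def vocabulary_growth(tag_assignments):
    """Fraction of tag assignments that introduce a previously unseen tag (describers tend to score higher)."""
    seen, new = set(), 0
    for tag in tag_assignments:           # tags in chronological order
        if tag not in seen:
            new += 1
            seen.add(tag)
    return new / len(tag_assignments)

def tag_entropy(tag_assignments):
    """Shannon entropy of the tag-usage distribution (more uniform usage gives higher entropy)."""
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

stream = ["python", "web", "python", "ir", "python", "web"]   # toy example
print(vocabulary_growth(stream), tag_entropy(stream))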

The 21st century is characterized by the waste of energy and resources, the effects of which can be observed throughout the world. Society's shift towards the sustainable use of available resources can be seen as the impetus for this work. This master's thesis deals with the development of a web-based approach to monitoring and visualizing plant data from the field of energy technology. The thesis first defines the general conditions of the work and explains the functional and non-functional requirements. Building on this, the software-engineering decisions required for implementing the RIA application are presented. The application is part of a larger product realized as a distributed system; the agile software development process Scrum was used, and the RIA application was implemented with Silverlight. The practical part of the thesis illustrates the visualization of plant data in temporal, domain-specific, and geo-specific contexts. It begins with an overview of the RIA application, followed by code excerpts that illustrate the various visualizations, and concludes with an explanation of how the individual modules of the RIA application interact. A review of the technologies and practices employed concludes the thesis.

Due to the ever-growing amount of information that has to be processed, meaningful information visualizations are becoming increasingly important. At the same time, ever faster computer hardware and growing Internet bandwidth are making sophisticated three-dimensional content on the web more and more widespread. Presenting information visualizations on the web is therefore a good way to reach many users, and extending two-dimensional visualizations by a third dimension can be used to structure the information more clearly. However, displaying information in three dimensions on the web also requires suitable technologies. This thesis therefore surveys the currently available web-based 3D formats, compares them against a set of criteria, and selects Flash, owing to its wide adoption, for a prototypical implementation of an information visualization. Since several 3D engines exist for Flash, these are examined more closely in order to choose the one best suited for the visualization. The thesis shows that an information visualization can be implemented with Flash, even though, due to limited hardware support, compromises often have to be made with regard to speed and imprecise depth calculation.

Data mining is a buzzword onto which many expectations in computer science are projected today. The machine-supported "digging for" and "extraction of" complex relationships in large datasets is also of interest to a Graz-based software company working on data analysis in the production domain. This thesis addresses a first scenario the company wants to pursue: detecting patterns in the state history of a production machine. Apart from the sequence of states and their durations, only little additional information is available. The question is therefore whether significant correlations exist between individual machine states. The sequential nature of the data suggests two different approaches: classical classification methods on the one hand, and sequence or episode mining methods on the other. This thesis first presents several possible approaches from both fields, then takes up one method and reports the results of first experiments. These experiments are intended to show that finding such patterns is possible.
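The specific method adopted in the experiments is not detailed in the abstract; the following minimal sketch only illustrates the episode-mining idea of counting how often one machine state follows another within a sliding window (window size and state names are illustrative).

from collections import Counter

def count_episodes(states, window=3):
    """states: chronological list of machine states; returns counts of ordered state pairs
    that occur within the given window."""
    pairs = Counter()
    for i, a in enumerate(states):
        for b in states[i + 1:i + window]:
            pairs[(a, b)] += 1
    return pairs

history = ["RUN", "IDLE", "RUN", "ERROR", "MAINTENANCE", "RUN", "ERROR"]
for (a, b), n in count_episodes(history).most_common(3):
    print(f"{a} -> {b}: {n}")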

Search engines allow users of the World Wide Web to formulate their information needs, but the original intent is often lost while formulating a search query. This master's thesis addresses this problem by examining a method for constructing graphs that contain search goals extracted from search logs. While previous work has mainly dealt with classifying search queries into taxonomies, this thesis examines search queries that contain explicit goals. To derive relationships between search goals, a new type of graph that can be constructed from search logs is introduced: bipartite goal-tag graphs. The thesis shows how these graphs can be used to infer the intent behind a query issued by a user or to determine goals related to a given goal. One of the main contributions of this work is a parameterized method for constructing the graphs, together with the corresponding qualitative and quantitative evaluations. Furthermore, SearchGoalNet, a network containing 57,562 user search goals, is presented and applications built on it are described.
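The parameterized construction method of the thesis is not reproduced here; the following sketch only illustrates the bipartite goal-tag idea, relating goals through shared tags (the goals and tags shown are hypothetical).

from collections import defaultdict

goal_tags = {                                   # hypothetical extracted goals -> tags
    "book a cheap flight": {"travel", "flight", "cheap"},
    "find flight deals":   {"travel", "flight", "deal"},
    "learn python":        {"programming", "python"},
}

tag_goals = defaultdict(set)                    # invert the bipartite edges
for goal, tags in goal_tags.items():
    for tag in tags:
        tag_goals[tag].add(goal)

def related_goals(goal):
    """Goals reachable via at least one shared tag, ranked by tag overlap."""
    scores = defaultdict(int)
    for tag in goal_tags[goal]:
        for other in tag_goals[tag]:
            if other != goal:
                scores[other] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(related_goals("book a cheap flight"))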

The currently observable, rapidly growing volume of data to be processed automatically makes new methods necessary to cope with it. In addition to improving search capabilities, the desire to better understand and exploit textual content can be identified as the driving force. Just as encyclopedias in the 19th century helped people communicate using a shared vocabulary, in the 21st century there is a need for machines to draw on a universal vocabulary for communication. Encyclopedias are regarded as a comprehensive record of the human knowledge of an epoch. The wish to extract this knowledge from textual sources and to prepare it for further automated processing is the central focus of this thesis. To this end, selected ontology-learning methods are applied to derive taxonomies and concept hierarchies from encyclopedia texts. The extracted information is evaluated and automatically improved using further techniques such as online validation. The thesis shows that, given a suitable methodology, high-quality semantic information can be extracted from encyclopedic data and used as the basis for constructing an ontology.
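The abstract does not name the exact extraction methods; the following sketch shows one common ontology-learning technique, Hearst-style lexico-syntactic patterns, which derives candidate hyponym/hypernym pairs from encyclopedia-like sentences (the pattern and example text are illustrative).

import re

# "X is a Y" sentences yield candidate (hyponym, hypernym) pairs for a taxonomy
PATTERN = re.compile(r"(?P<hypo>[A-Z][\w\s-]*?) is an? (?P<hyper>[a-z][\w\s-]*?)[.,]")

text = ("Graz is a city in Austria. The violin is a string instrument. "
        "Photosynthesis is a process used by plants.")

taxonomy = [(m.group("hypo").strip(), m.group("hyper").strip())
            for m in PATTERN.finditer(text)]
print(taxonomy)   # e.g. [('Graz', 'city in Austria'), ('The violin', 'string instrument'), ...]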

With the ever-growing flood of information, especially in the digital sector, comes a constantly growing desire to organize and master this data, whether it concerns general information of the kind that appears daily on billions of web pages or specific information as it arises in schools and universities. Digital libraries are a good way of collecting and preparing information in a controlled manner. However, once the holdings of such libraries grow beyond a certain size and also contain data that must be treated confidentially, the use of access-control systems that include asset and rights management becomes indispensable. This thesis describes the options currently available for building a digital rights management (DRM) system for use in digital libraries. Because today's DRM solutions are highly heterogeneous, the individual standards are largely incompatible with one another. Consequently, a DRM system is presented that is able to bridge these incompatibilities to a large extent on the basis of ontologies. The practical applicability of DRM in combination with information retrieval is finally demonstrated by a prototype implementation of a digital course reserve collection.
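As a purely illustrative sketch of the ontology-based bridging idea (the vocabulary terms below are hypothetical and not taken from real DRM standards), two license vocabularies are mapped onto shared rights concepts so that permissions expressed in one can be interpreted in the other.

# Hypothetical shared ontology of rights concepts linking two made-up DRM vocabularies
SHARED_CONCEPTS = {
    "render": {"vocabA": "A:play",  "vocabB": "B:render"},
    "print":  {"vocabA": "A:print", "vocabB": "B:print"},
    "lend":   {"vocabA": "A:lend",  "vocabB": "B:loan"},
}

def translate(term, source, target):
    """Map a permission term from one vocabulary to another via the shared concept."""
    for concept, terms in SHARED_CONCEPTS.items():
        if terms.get(source) == term:
            return terms.get(target)
    return None          # no shared concept found: the incompatibility remains

print(translate("A:play", "vocabA", "vocabB"))   # -> B:render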