The presented research provides an answer for the detection of anomalies in big data when the information has to be processed in “quasi” real-time. An overview of the following topics is given:
• what big data is
• available tools for processing big data, including in real time
• existing approaches to detecting anomalies
Different outlier detection algorithms are studied not only theoretically, but also practically. The strengths and flaws of each approach are evaluated to see which are more likely to be used in each instance to deal with the data at its arrival time.
Furthermore, those algorithms are tested with a dataset in order to observe a practical application of anomaly detection.
In conclusion, a statistical approach to the outlier detection problem gives the most accurate results when using “near” real-time systems. The depth and density approaches also obtain quality results, but run into problems when clusters of outliers are formed.
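As an illustration of what a statistical approach can look like when data must be judged at its arrival time, the following is a minimal sketch of a streaming z-score detector. It is not the method evaluated in the thesis: the class name `RunningZScoreDetector`, the threshold of three standard deviations, and the warm-up length are all assumptions chosen for the example. The running mean and variance are maintained with Welford's online update, so each value is processed in constant time.

```python
import math


class RunningZScoreDetector:
    """Flags values that lie far from the running mean of the stream so far."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # assumed cut-off, in standard deviations
        self.warmup = warmup        # assumed number of values seen before flagging starts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean

    def update(self, x):
        """Return True if x is an outlier relative to the values seen before it."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_outlier = True
        # Welford's online update. Flagged values are still included in the
        # statistics here, which is a simplification of this sketch.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier


def detect(stream, threshold=3.0, warmup=10):
    """Return the values of the stream that were flagged on arrival."""
    detector = RunningZScoreDetector(threshold, warmup)
    return [x for x in stream if detector.update(x)]
```

A detector of this kind judges each value in isolation against a single distribution, so its behaviour on the clustered outliers mentioned above would have to be checked separately.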

In this thesis an approach for Authorship Attribution is presented with a focus on Web forums. The approach is based on distance metrics for comparison between frequency vectors of multiple feature spaces, which are extracted with existing Natural Language Processing tools and have been used in the existing literature on authorship attribution. An algorithm trains a model using these features obtained for each of the authors within the data set. The source of the data is Web forum messages, which are crawled with existing tools for subsequent HTML parsing and further analysis. The classifier decides the authorship by weighting each of the features. In total, three approaches were tested, taking into account different feature space weighting strategies. To allow the conclusions to generalise, the evaluated data sets were assembled for multiple languages (English, German and Spanish), as well as multiple topics. The results achieved are promising, especially with longer messages, where more data is available. In contrast to existing research, n-gram features do not appear to be the best feature for authorship attribution for Web forums.
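The general idea of comparing weighted frequency vectors across several feature spaces can be sketched as follows. This is a simplified illustration, not the thesis implementation: the two feature spaces (lower-cased words and character trigrams), the cosine distance, and the names `train` and `attribute` are assumptions made for the example, and the weights are supplied by the caller rather than learned.

```python
from collections import Counter
import math


def words(text):
    return text.lower().split()


def char_trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]


# Hypothetical feature spaces; the thesis extracts its features with NLP tools.
FEATURE_SPACES = {"words": words, "char_trigrams": char_trigrams}


def frequency_vector(items):
    """Relative frequency of each item, stored as a sparse dict."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def cosine_distance(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def profile(texts):
    """One frequency vector per feature space for a collection of texts."""
    joined = " ".join(texts)
    return {name: frequency_vector(extract(joined))
            for name, extract in FEATURE_SPACES.items()}


def train(messages_by_author):
    return {author: profile(texts) for author, texts in messages_by_author.items()}


def attribute(model, message, weights):
    """Return the author whose profile has the smallest weighted distance."""
    query = profile([message])

    def score(author):
        return sum(weights[name] * cosine_distance(model[author][name], query[name])
                   for name in FEATURE_SPACES)

    return min(model, key=score)
```

Different weighting strategies then correspond to different `weights` dictionaries, for example equal weights versus weights that favour one feature space.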

Knowledge workers are exposed to many influences which have the potential to interrupt work. The impact of these influences on individuals, not only knowledge workers, often has detrimental effects on physical health and well-being. Twelve knowledge workers took part in the experiment conducted for this thesis. The focus of the experiment was to analyse whether the sound level and computer interactions of knowledge workers can predict their self-reported stress levels. A software system was developed using sensors on the knowledge workers' mobile and desktop devices. Records of PC activity contain information about foreground windows and computer idle times. Foreground window records include the timestamp when a window received focus, the duration the window was held in the foreground, the window title and the unique number identifying the window. Computer idle time records contain the timestamp when the idle time began and its duration. Computer idle time was recorded only after a minimum idle interval of one minute. Sound levels were recorded using a smartphone's microphone (Android). The average sound pressure level from the audio samples was computed over a one-minute timeframe. Once initialized with an anonymous participant code, the sensors record PC activity and sound level and upload the records, enriched with the code, to a remote service. The service uses a key-value database system with the code as key and the collection of records as value. The service stores the records for each knowledge worker over a period of ten days. After this period, the preprocessing component of the system splits the records of PC activity and sound level into working days and computes measures approximating worktime fragmentation and noise. Foreground window records were used to compute the average time a window was held in the foreground and the average time an application was held in the foreground.
Applications are sets of foreground window records which share the same window title. Computer idle time records were used to compute the number of idle times between one and five minutes and the period of those idle times which lasted more than twenty minutes. From the sound pressure levels, the average level and the period of all levels which exceeded 60 decibels were computed. The figures were computed within the scope of a participant's working day for five different temporal resolutions. Additionally, the stress levels were computed from midday and evening scales. Participants recorded stress levels twice per working day and entered them manually into the system. The first self-report was made close to the lunch break and the second at the end of the working day. Since participants sometimes forgot to enter self-assessed stress levels, the number of working days containing data of all types ranges between eight and ten. As a result, the preprocessing component stores the measures and stress levels used by the stress prediction analysis component. The correlation of the measures with the self-reported stress levels showed that a prediction of those stress levels is possible. The state of well-being (mood, calm) increased with the number of idle times between one and five minutes, in combination with a sound pressure level not exceeding 60 decibels.
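The fragmentation and noise measures described above can be sketched for a single working day as follows. This is an illustrative reading of the description, not the thesis code: the record types, the function names, and the choice to average the total foreground time per application are assumptions, and the sound levels are averaged arithmetically as a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WindowRecord:
    timestamp: float  # seconds, when the window received focus
    duration: float   # seconds the window was held in the foreground
    title: str
    window_id: int


@dataclass
class IdleRecord:
    timestamp: float  # seconds, when the idle time began
    duration: float   # seconds; idle times under one minute are never recorded


def average_window_time(windows):
    """Average time a window was held in the foreground."""
    return sum(w.duration for w in windows) / len(windows) if windows else 0.0


def average_application_time(windows):
    """Average total foreground time per application (records sharing a title).

    Assumption: durations are summed per title and then averaged over titles.
    """
    totals = defaultdict(float)
    for w in windows:
        totals[w.title] += w.duration
    return sum(totals.values()) / len(totals) if totals else 0.0


def count_short_idles(idles, low=60.0, high=300.0):
    """Number of idle times between one and five minutes."""
    return sum(1 for i in idles if low <= i.duration <= high)


def long_idle_period(idles, minimum=1200.0):
    """Total duration of idle times lasting more than twenty minutes."""
    return sum(i.duration for i in idles if i.duration > minimum)


def noise_measures(levels_db, limit=60.0):
    """Average level and number of one-minute samples above the limit.

    levels_db holds one average sound pressure level per minute.
    """
    average = sum(levels_db) / len(levels_db) if levels_db else 0.0
    minutes_above = sum(1 for level in levels_db if level > limit)
    return average, minutes_above
```

The resulting per-day figures could then be correlated with the self-reported stress levels of the same day.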