Evaluation of Job Recommendations for the Studo Jobs Platform

Today the internet is growing fast as users generate an increasing amount of data. Therefore, finding relevant information is getting more and more time-consuming, since the internet consists of a large amount of data distributed over various information sources. Search engines filter data and reduce the time required to find relevant information. We focus on scientific literature search, where search engines help to find scientific articles. An advantage of scientific articles is that they share a common structure intended to increase their readability. This structure is known as IMRaD (Introduction, Method, Results and Discussion). We tackle the question of whether it is possible to improve search result quality for scientific works by leveraging IMRaD structure information. We use several state-of-the-art ranking algorithms and compare them against each other in our experiments. Our results show that the importance of IMRaD chapter features depends on the complexity of the query. Finally, we focus on structured text retrieval and the influence of single chapters on the search result. Overall, we set out to improve the quality of the results produced by state-of-the-art ranking algorithms for scientific literature research.
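
The following is a minimal sketch of what chapter-aware ranking can look like, not the thesis implementation: each IMRaD chapter is indexed separately with BM25 (via the rank_bm25 package, an assumption) and the per-chapter scores are combined with tunable weights. The documents and weights are illustrative.

# Chapter-aware ranking sketch; corpus, chapters and weights are illustrative.
from rank_bm25 import BM25Okapi

papers = [
    {"introduction": "ranking scientific articles with structure",
     "method": "bm25 retrieval over chapter fields",
     "results": "structure features help for complex queries",
     "discussion": "chapter weights depend on query complexity"},
    {"introduction": "text segmentation with embeddings",
     "method": "neural open information extraction",
     "results": "improvements on a fictional corpus",
     "discussion": "structural features matter"},
]
chapters = ["introduction", "method", "results", "discussion"]
weights = {"introduction": 0.3, "method": 0.3, "results": 0.2, "discussion": 0.2}

# Build one BM25 index per chapter.
indexes = {c: BM25Okapi([p[c].split() for p in papers]) for c in chapters}

def rank(query):
    tokens = query.lower().split()
    combined = [0.0] * len(papers)
    # Weighted sum of per-chapter BM25 scores for every paper.
    for c in chapters:
        for i, s in enumerate(indexes[c].get_scores(tokens)):
            combined[i] += weights[c] * s
    return sorted(range(len(papers)), key=lambda i: combined[i], reverse=True)

print(rank("ranking with structure features"))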

Automatically separating text into coherent segments sharing the same topic is a nontrivial task in the research area of Natural Language Processing. Over the course of time, text segmentation approaches have been improved by applying existing knowledge from various fields of science, including linguistics, statistics and graph theory. At the same time, obtaining a corpus of textual data varying in structure and vocabulary is problematic. The currently emerging application of neural network models in Natural Language Processing shows promise, as can be seen, for example, in Open Information Extraction. However, the influence of knowledge obtained by an Open Information Extraction system on a text segmentation task remains unknown. This thesis introduces a text segmentation pipeline supported by word embeddings and Open Information Extraction. Additionally, a fictional text corpus consisting of two parts, novels and subtitles, is presented. Given a baseline text segmentation algorithm, the effect of replacing word tokens with word embeddings is examined. Subsequently, neural Open Information Extraction is applied to the corpus, and the information contained in the extractions is transformed into a word token weighting used on top of the baseline text segmentation algorithm. The evaluation shows that applying the pipeline to the corpus increased performance for more than half of the novels and for fewer than half of the subtitle files in comparison to the baseline text segmentation algorithm. Similar results are observed in a preliminary step in which word tokens were substituted by their word embedding representations. Taking into account the complex structural features of the corpus, this work demonstrates that text segmentation may benefit from incorporating knowledge provided by an Open Information Extraction system.
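
As a rough illustration of the embedding-based idea (not the thesis pipeline, which additionally uses Open Information Extraction weighting), adjacent sentence blocks can be compared via cosine similarity of averaged word vectors, with boundaries proposed at low-similarity points. The tiny embedding table and threshold below are illustrative assumptions.

# TextTiling-style segmentation sketch with toy word embeddings.
import numpy as np

EMB = {  # toy word embeddings (assumption)
    "ship": np.array([1.0, 0.1]), "sea": np.array([0.9, 0.2]),
    "storm": np.array([0.8, 0.3]), "castle": np.array([0.1, 1.0]),
    "king": np.array([0.2, 0.9]), "feast": np.array([0.3, 0.8]),
}

def sent_vec(sentence):
    vecs = [EMB[w] for w in sentence.split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def segment(sentences, threshold=0.5):
    boundaries = []
    for i in range(len(sentences) - 1):
        sim = cosine(sent_vec(sentences[i]), sent_vec(sentences[i + 1]))
        if sim < threshold:          # low similarity -> likely topic shift
            boundaries.append(i + 1)
    return boundaries

sents = ["ship sea storm", "storm sea ship", "king castle feast", "feast king castle"]
print(segment(sents))  # expected boundary before the castle sentences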

Portable Document Format (PDF) is one of the most commonly used file formats. Many current PDF viewers support copy-and-paste for ordinary text, but not for mathematical expressions, which appear frequently in scientific documents. If one were able to extract a mathematical expression and convert it into another format, such as LaTeX or MathML, the information contained in this expression would become accessible to a wide array of applications, for instance screen readers. An important step towards this goal is finding the precise location of mathematical expressions, since this is the only unsolved step in the formula extraction pipeline. Accurately performing this crucial step is the main objective of this thesis. Unlike previous research, we use a novel whitespace analysis technique to demarcate coherent regions within a PDF page. We then use the identified regions to compute carefully selected features from two sources: the grayscale matrix of the rendered PDF file and the list of objects within the parsed PDF file. The computed features can be used as input for various classifiers based on machine learning techniques. In our experiments we contrast four different variants of our method, each using a different machine learning algorithm for classification. Further, we also aim to compare our approach with three state-of-the-art formula detectors. However, the low reproducibility of these three methods, combined with logical inconsistencies in their documentation, greatly complicated a faithful comparison with our method, leaving the true state of the art unclear and warranting further research.
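
The classification stage can be pictured with the following hedged sketch: each candidate page region is described by a small feature vector and labelled as formula or text. The feature names and the toy training data are assumptions; the thesis derives its features from the rendered grayscale matrix and the parsed PDF objects.

# Region classification sketch; features and data are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# feature vector per region: [math-symbol density, font-size variance, sub/superscript ratio]
X_train = np.array([
    [0.40, 2.5, 0.30],   # formula-like region
    [0.35, 3.0, 0.25],   # formula-like region
    [0.02, 0.2, 0.01],   # ordinary text
    [0.05, 0.4, 0.02],   # ordinary text
])
y_train = np.array([1, 1, 0, 0])  # 1 = formula, 0 = text

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

candidate_region = np.array([[0.33, 2.1, 0.22]])
print("formula" if clf.predict(candidate_region)[0] == 1 else "text")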

This thesis presents a novel way of creating grid-based word puzzles, named the AI Cruciverbalist. These word puzzles have a large fan base of recreational players and are widespread in education. The puzzle creation process, an NP-hard problem, is not an effortless task, and even though some algorithms exist, manual puzzle creation has achieved the best results so far. Since new technologies have arisen, especially in the fields of data science and machine learning, the time had come to evaluate new possibilities, replace existing algorithms and improve the quality and performance of puzzle generation. In particular, neural networks and constraint programming were evaluated for feasibility, and the results were compared. The black box of a trained model makes it hard to ensure positive results, and due to the impossibility of modelling some requirements and constraints, neural networks are rated unsuitable for puzzle generation. The significance of correct values in puzzle fields, the approximative nature of neural networks, and the need for an extensive training set additionally make neural networks impractical. On the other hand, precisely modelling the requirements as a constraint satisfaction problem has been shown to produce excellent results, finding an exact solution if one exists. The results achieved with the constraint programming approach are rated as successful by domain experts, and the algorithm has been successfully integrated into an existing puzzle generator software for use in production.
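
To make the constraint satisfaction formulation concrete, here is a minimal sketch using the python-constraint package (an assumption, not the solver used in the thesis): slots are variables whose domains are word lists, and crossing cells impose equality constraints on the shared letters.

# Tiny crossword CSP: two across slots crossing one down slot.
from constraint import Problem

words3 = ["cat", "car", "art", "rat"]  # illustrative word list

problem = Problem()
problem.addVariable("across1", words3)   # row 0, columns 0-2
problem.addVariable("across2", words3)   # row 2, columns 0-2
problem.addVariable("down1", words3)     # column 0, rows 0-2

# Crossing constraints: shared grid cells must hold the same letter.
problem.addConstraint(lambda a, d: a[0] == d[0], ("across1", "down1"))
problem.addConstraint(lambda a, d: a[0] == d[2], ("across2", "down1"))
# All entries should be distinct words.
problem.addConstraint(lambda a1, a2, d: len({a1, a2, d}) == 3,
                      ("across1", "across2", "down1"))

print(problem.getSolution())  # e.g. {'across1': 'cat', 'across2': 'rat', 'down1': 'car'}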

People use different styles of writing according to their personalities. These distinctions can be used to find out who wrote an unknown text, given some texts of known authorship. Many different parts of the texts and of the writing style can be used as features for this. The focus of this thesis lies on topic-agnostic phrases that are used mostly unconsciously by authors. Two methods to extract these phrases from authors' texts are proposed, which work for different types of input data. The first method uses n-gram tf-idf calculations to weight phrases, while the second method detects them using sequential pattern mining algorithms. The text data set used is gathered from a source of unstructured text covering a plethora of topics, the online forum Reddit. The first of the two proposed methods achieves average F1-scores (correct author predictions) per section of the data set ranging from 0.961 to 0.92 within the same topic, and from 0.817 to 0.731 when different topics were used for attribution testing. The second method scores in the range from 0.652 to 0.073, depending on configuration parameters. Given the massive amount of content created on such platforms today, using a data set like this and features that work for authorship attribution on texts of this nature is worth exploring. Since these phrases have been shown to work for specific configurations, they can now be used as a viable option or in addition to other commonly used features.
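
A small sketch of the first method's general idea follows: weight word n-grams with tf-idf and attribute a disputed text to the closest author profile. The texts, n-gram range and similarity choice are illustrative assumptions; the thesis works on Reddit posts and reports F1-scores.

# n-gram tf-idf authorship attribution sketch with toy texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

author_texts = {
    "author_a": "to be fair it seems to me that this is the case as far as i can tell",
    "author_b": "in my opinion the whole thing boils down to one simple point at the end of the day",
}
unknown = "as far as i can tell this seems to me like the same case"

vectorizer = TfidfVectorizer(ngram_range=(2, 4), analyzer="word")
profiles = vectorizer.fit_transform(author_texts.values())
query = vectorizer.transform([unknown])

scores = cosine_similarity(query, profiles)[0]
print(list(author_texts)[scores.argmax()], scores)  # predicted author and similarities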

Political debates today are increasingly being held online, through social media and other channels. In times of Donald Trump, the American president, who mostly announces his messages via Twitter, it is important to clearly separate facts from falsehoods. Although there is an almost infinite amount of information online, tools such as recommender systems, filters and search encourage the formation of so-called filter bubbles. People who have similar opinions on polarizing topics group themselves and block other, challenging opinions. This leads to a deterioration of the general debate, as false facts are difficult to disprove for these groups. With this thesis, we want to provide an approach for proposing different opinions to users in order to increase the diversity of viewpoints regarding a political topic. We classify users into a political spectrum, either pro-Trump or contra-Trump, and then suggest tweets from the other spectrum. We then measure the impact of this process on diversity and serendipity. Our results show that the diversity and serendipity of the recommendations can be increased by including opinions from the other political spectrum. In doing so, we want to contribute to improving the overall discussion and to reducing the formation of groups that tend to be radical in extreme cases.
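
One common way to quantify the diversity of a recommendation list, given here only as a hedged sketch with synthetic vectors, is the average pairwise cosine distance between the recommended items; the thesis measures diversity and serendipity on real tweet recommendations.

# Intra-list diversity sketch; vectors are illustrative tweet representations.
import numpy as np

def intra_list_diversity(vectors):
    """Mean pairwise cosine distance; higher means a more diverse list."""
    dists = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            a, b = vectors[i], vectors[j]
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            dists.append(1.0 - cos)
    return float(np.mean(dists))

same_side = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([1.0, 0.0])]
mixed     = [np.array([1.0, 0.1]), np.array([0.1, 1.0]), np.array([0.9, 0.2])]
print(intra_list_diversity(same_side), intra_list_diversity(mixed))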

This thesis deals with the application of data mining algorithms for information discovery in software support. Data mining algorithms are tools of so-called knowledge discovery, the interactive and iterative discovery of useful knowledge. They are used to analyse data and to find valuable information about a domain through statistical models. The domain in this thesis is software support, the department in software development companies that assists customers in solving problems. These support departments are usually organized as call centres and additionally work with ticket systems (an e-mail-based communication system). The purpose of this thesis is to examine to what extent data mining algorithms can be applied in software support and whether valuable information can actually be identified. The expectation is to discover information about the support behaviour of customers as well as the influence of external factors such as weather, public holidays and vacation periods. The literature review of this thesis covers, among other topics, workforce planning in software support and data science (an umbrella term for data mining, data engineering, data-driven decision making, etc.). In the experimental setup, interviews on the status quo and key figures in software support are conducted with leading Austrian software companies, and a case study on the application of a data mining process model is carried out. Finally, a field experiment examines whether it is actually possible to discover information for software support with data mining algorithms. The results of this thesis include, on the one hand, the identification of opportunities to save costs and gain efficiency in support and, on the other hand, the discovery of valuable information about processes and relationships in support. The information gained can subsequently flow into the support process in order to create more effective and more efficient processes. A further result of this information gain is an increase in the quality of management decisions.

Due to the rapid increase in the development of information technology, adding computing power to everyday objects has become a major discipline of computer science, known as “The Internet of Things”. Smart environments such as smart homes are networks of connected devices with attached sensors that detect what is going on inside the house and what actions can be taken automatically to assist its residents. In this thesis, artificial intelligence algorithms for classifying human activities of daily living (having breakfast, playing video games, etc.) are investigated. The problem is a time series classification task for sensor-based human activity recognition. In total, nine different standard machine learning algorithms (support vector machine, logistic regression, decision trees, etc.) and three deep learning models (multilayer perceptron, long short-term memory network, convolutional neural network) were compared. The algorithms were trained and tested on the UCAmI Cup 2018 data set of sensor inputs captured in a smart lab over ten days. The data set contains sensor data from four different sources: an intelligent floor, proximity sensors, binary sensors and acceleration data from a smart watch. The multilayer perceptron reached a testing accuracy of 50.31%. The long short-term memory network showed an accuracy of 57.41% (+/-13.4) and the convolutional neural network 70.06% (+/-2.3) on average, resulting in only slightly higher scores than the best standard algorithm, logistic regression, with 65.63%. To sum up the observations of this thesis, deep learning is indeed suitable for human activity recognition. However, the convolutional neural network did not significantly outperform the best standard machine learning algorithm when using this particular data set. Unexpectedly, the long short-term memory network and the basic multilayer perceptron performed poorly. The key difficulty in finding a fitting machine learning algorithm for a problem such as the one presented in this thesis is that there is no trivial solution. Experiments have to be conducted to empirically evaluate which technique and which hyperparameters yield the best results. Thus the results found in this thesis are valuable for other researchers to build on and to develop further approaches based on the new insights.
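
A compact sketch of a 1D convolutional network for sensor windows, in the spirit of the models compared here, is shown below; the window length, channel count, class count and random data are illustrative assumptions, not the UCAmI setup.

# 1D CNN sketch for sensor-based activity recognition on synthetic windows.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_windows, timesteps, channels, n_classes = 128, 50, 6, 4
X = np.random.rand(n_windows, timesteps, channels).astype("float32")
y = np.random.randint(0, n_classes, size=n_windows)

model = Sequential([
    Conv1D(32, kernel_size=5, activation="relu", input_shape=(timesteps, channels)),
    MaxPooling1D(pool_size=2),
    Conv1D(64, kernel_size=3, activation="relu"),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
print(model.evaluate(X, y, verbose=0))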

The automatic classification of audio samples into an abstraction of the recorded location (e.g., park, public square, etc.), denoted as Acoustic Scene Classification (ASC), represents an active field of research, popularized, inter alia, as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. In contrast, we are concerned with assigning audio samples directly to their location of origin, i.e., to the location where the corresponding audio sample was recorded, which we denote as Acoustic Location Classification (ALC). Evidence for the feasibility of ALC contributes a supplementary challenge for acoustics-based Artificial Intelligence (AI) and enhances the capabilities of location-dependent applications in terms of context-aware computing. Thus, we established a client-server infrastructure with an Android application as the recording solution and propose a dataset which provides audio samples recorded at different locations on multiple consecutive dates. Based on this dataset, and on the dataset proposed for the DCASE 2019 ASC challenge, we evaluated ALC alongside ASC, with a special focus on constraining training and test sets temporally and locally, respectively, to ensure reasonable generalization estimates with respect to the underlying Convolutional Neural Network (CNN). As indicated by our outcomes, ALC constitutes a comprehensive challenge, resulting in decent classification estimates, and hence motivates further research. However, increasing the number of samples within the proposed dataset, i.e., providing daily recordings over a comparatively long period of time such as several weeks or months, seems necessary to investigate the practicality and limitations of ALC to a sufficient degree.
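
For orientation, the typical input preparation for such CNN-based acoustic classifiers is a log-mel spectrogram per recording; the sketch below uses librosa, and the file path and parameters are illustrative assumptions, not the configuration used in this thesis.

# Log-mel spectrogram extraction sketch (librosa).
import numpy as np
import librosa

def log_mel(path, sr=22050, n_mels=64):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# spec = log_mel("recordings/location_07_day_3.wav")  # hypothetical file
# print(spec.shape)  # (n_mels, frames) -> fed to the CNN as a 2D "image"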

Community detection is an essential tool for the analysis of complex social, biological and information networks. Among the numerous community detection algorithms published so far, Infomap is a prominent and well-established framework. In this master's thesis, we present a new method for detecting communities which is inspired by Infomap. Infomap takes an analytical approach to the community detection problem by minimizing the expected description length of a random walk on a network. In contrast, our method minimizes the dissimilarity, quantified via the Kullback-Leibler divergence, between a graph-induced and a synthetic random walker in order to obtain a partition into communities. We therefore call our method Synthesizing Infomap. More specifically, we address community detection in undirected networks with non-overlapping communities and two-level hierarchies. In this work, we present a formalization as well as a detailed derivation of the Synthesizing Infomap objective function. By applying Synthesizing Infomap to a set of standard graphs, we explore its properties and qualitative behaviour. Our experiments on artificially generated benchmark networks show that Synthesizing Infomap outperforms its original counterpart in terms of Adjusted Mutual Information on networks with weak community structure. Both methods behave equivalently when applied to a selection of real-world networks. This indicates that Synthesizing Infomap also delivers meaningful results in practical applications. The promising results of Synthesizing Infomap motivate further evaluation on real-world networks, as well as possible extensions to multi-level hierarchies and overlapping communities.
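
For reference, the standard definition of the Kullback-Leibler divergence between two discrete distributions P (here the graph-induced walker) and Q (the synthetic walker) is given below; the exact objective derived in the thesis builds on this quantity.

\[
  D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x) \,\log \frac{P(x)}{Q(x)}
\]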

As the complexity of a software project rises, it can become difficult to add new features. In addition to maintainability, other quality attributes such as reliability and usability may suffer from the increased complexity. To prevent complexity from becoming an overwhelming issue, we apply principles of good programming and resort to well-known software architectures, often by choosing to use specific frameworks. However, we can only subjectively judge whether or not the usage of a specific framework resulted in less perceived complexity and an improvement in other quality attributes. In our work, we investigated the applicability of existing software measurements for measuring desired quality attributes and their suitability for framework comparison. We chose a set of quantitative software measurements aimed at specific quality attributes, namely maintainability and flexibility. Additionally, we used well-established software measurements such as McCabe's Cyclomatic Complexity [44] and Halstead's metrics [32] to measure the complexity of a software system. By developing the same application using two different web frameworks, namely ReactJS and Laravel, over a set of predefined ‘sprints’, each containing a specific set of features, we were able to investigate the evolution of different software measurements. Our results show that some of the measurements are more applicable to the chosen frameworks than others. Especially measurements aimed at quantitative attributes of the code, such as the coupling measures by Martin [43] and the Cyclomatic Complexity by McCabe [44], proved particularly useful, as there is a clear connection between the results of the measurements and attributes of the code. However, there is still a need for additional work that focuses on defining the exact scale each of the measurements operates on, as well as for tools which seamlessly integrate software measurements into existing software projects.
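
As an illustration only, cyclomatic complexity can be computed for a Python snippet with the radon package (an assumption; the thesis measures ReactJS and Laravel code with other tooling).

# Cyclomatic complexity of a small function via radon.
from radon.complexity import cc_visit

source = """
def classify(value):
    if value < 0:
        return "negative"
    elif value == 0:
        return "zero"
    else:
        return "positive"
"""

for block in cc_visit(source):
    print(block.name, block.complexity)  # roughly: number of decision points + 1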

Traffic accident prediction has been a hot research topic in the last decades. With the rise of Big Data, Machine Learning, Deep Learning and the real-time availability of traffic flow data, this research field becomes more and more interesting. In this thesis, different data sources such as traffic flow, weather, population and the crash data set of the city of Graz are collected over three years, between 01.01.2015 and 31.12.2017. In this period 5416 accidents, recorded by Austrian police officers, happened. Further, these data sets are matched to two different spatial road networks. Besides feature engineering and crash likelihood prediction, different imputation strategies are applied for missing values in the data sets; in particular, missing value prediction for traffic flow measurements is a big topic. To tackle the class imbalance between crash and no-crash samples, an informative sampling strategy is applied. Once the inference model is trained, the crash likelihood for a given street link at a certain hour of the day can be estimated. Experimental results reveal the efficiency of the Gradient Boosting approach when incorporating these data sources. Especially the different districts of Graz and street graph related features, such as centrality measures and the number of road lanes, play an important role. In contrast, including traffic flow measurements as pointwise explanatory variables does not lead to more accurate predictions.
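 
The overall modelling idea can be sketched as follows: downsample the dominant no-crash class as a stand-in for informative sampling, then fit a gradient boosting classifier on tabular road and weather features. The data, feature names and sampling ratio are synthetic assumptions, not the thesis setup.

# Gradient boosting with majority-class downsampling on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.random((n, 4))                     # e.g. [centrality, lanes, hour, rainfall]
y = (rng.random(n) < 0.02).astype(int)     # rare crash events (~2%)

# Keep all crash rows, subsample no-crash rows.
crash_idx = np.where(y == 1)[0]
no_crash_idx = rng.choice(np.where(y == 0)[0], size=5 * len(crash_idx), replace=False)
idx = np.concatenate([crash_idx, no_crash_idx])

X_tr, X_te, y_tr, y_te = train_test_split(X[idx], y[idx], test_size=0.3,
                                          random_state=0, stratify=y[idx])
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("crash likelihood:", model.predict_proba(X_te[:1])[0, 1])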

The entry point of this master's thesis is the context-based Web Information Agent Back to the Future Search (bttfs), which was developed with the goal of shortening the period of vocational adjustment while working on different projects at once, as well as providing different functionalities for finding and re-finding relevant sources of information. bttfs supports learning a context-based user profile in two different ways: the first learns the user profile using a cosine-distance function applied to Term Frequency-Inverse Document Frequency (tf-idf) document vectors, and the second learns it with a one-class Support Vector Machine (svm). Furthermore, the Information Retrieval methods Best Matching 25 (bm25), Term Frequency (tf), and tf-idf are used on the created model to determine the most relevant search queries for the user's context. The central question answered in this thesis is stated as follows: "Is it possible to anticipate a user's future information need by exploiting past browsing behavior regarding a defined context of information need?" To answer this question, the methods above were applied to the AOL dataset, a collection of query logs consisting of roughly 500,000 anonymous user sessions. The evaluation showed that a combination of the cosine-distance learning function and the tf weighting function yielded promising results, ranging between an 18.22% and 19.85% matching rate on average for the first three single-word queries that appeared in advancing order on the timeline of the user actions. While the difference in performance between the cosine-distance method and the svm method appeared to be insignificant, tf and tf-idf outperformed bm25 in both of the tested scenarios. Based on these results, it can be stated that the future information need of a particular user can be derived from prior browsing behavior in many cases, as long as the context of information need remains the same. Therefore, there are scenarios in which systems like bttfs can aid and accelerate the user's information generation process by providing automated context-based queries.
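
A minimal sketch of the profile idea follows: represent visited pages as tf-idf vectors, average them into a context profile, and propose the highest-weighted terms as candidate queries. The page texts are illustrative assumptions; the thesis evaluates its methods on the AOL query logs.

# tf-idf context profile and candidate query terms on toy page texts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

visited_pages = [
    "context based web information agent for refinding sources",
    "learning user profiles from browsing behaviour and document vectors",
    "tf idf weighting for context aware query suggestion",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(visited_pages)

profile = np.asarray(doc_vectors.mean(axis=0)).ravel()   # averaged context profile
terms = vectorizer.get_feature_names_out()
top = profile.argsort()[::-1][:3]
print([terms[i] for i in top])   # candidate single-word queries for this context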