Scientific Work - Know-Center | Trusted AI and Data Science

Rebol Manuel

2017

Automatic Classification of Business Intent on Social Platforms

Bakk

People spend hours each day on social media and similar web platforms, and the texts they post express many of their feelings and desires. Data analysts are constantly looking for clever ways to make use of this information. The aim of this thesis is first to detect business intent in the different types of information users post on the internet. In a second step, the identified business intent is grouped into two classes: buyers and sellers. This supports the idea of linking the two groups. Machine learning algorithms are used for classification. All the data needed to train the classifiers is retrieved and preprocessed using a Python tool developed for this purpose. The data was taken from the web platforms Twitter and HolidayCheck. Results show that classification works accurately when focusing on a specific platform and domain. On Twitter, 96 % of the test data is classified correctly, whereas on HolidayCheck accuracy reaches 67 %. When considering cross-platform multiclass classification, the scores drop to 50 %. Although individual scores increase up to 95 % when performing binary classification, the findings suggest that features need to be improved further in order to achieve acceptable accuracy for cross-platform multiclass classification. The challenge for future work is to fully link buyers and sellers automatically. This would create business opportunities without the parties needing to know about each other beforehand.

Valentan Stephan

2017

How Design Patterns Impact Code Quality: A Controlled Experiment

Bakk

While design patterns are proposed as a standard way to achieve good software design, little research has been done on the actual impact of these strategies on code quality. Many books suggest that such methods increase flexibility and maintainability; however, they often lack any evidence. This bachelor thesis intends to demonstrate empirically that the use of design patterns actually improves code quality. To gather data about the code, two applications were implemented that are designed to meet the same requirements. While one application was developed following widespread guidelines and principles of object-oriented programming, the other was implemented without paying attention to software maintenance. After the basic requirements were met, a number of additional features were implemented in two phases. First, support for a new graphical user interface was added; then a different data tier was added. The results show that the initial effort of implementing the program version that follows object-oriented programming guidelines is noticeably higher in terms of lines of code and number of files. However, during the implementation of the additional features, fewer files needed to be modified. In one phase transition considerably less code had to be written, while this version performed no worse in the other, and the cyclomatic complexity of the code increased less rapidly.
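As an illustration of the cyclomatic complexity metric mentioned in the abstract above, the following minimal sketch approximates it for a piece of Python source by counting decision points and adding one. It is not the measurement tooling used in the thesis, whose implementation language and tools are not stated here; the function name `cyclomatic_complexity` and the chosen set of branch nodes are assumptions of this sketch.

```python
import ast

# Node types treated as one decision point each (an assumption of this sketch;
# real tools may count additional constructs).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths, one per additional operand.
            complexity += len(node.values) - 1
    return complexity


EXAMPLE = '''
def grade(score):
    if score >= 90 and score <= 100:
        return "A"
    elif score >= 50:
        return "pass"
    return "fail"
'''

print(cyclomatic_complexity(EXAMPLE))
```

Tracking this number before and after each added feature gives a rough sense of how quickly the branching logic of a code base grows.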

Suppan Johannes

2017

Semantischer RESTFul Web Service für die Visualisierung und Verwaltung von automotiven Entwicklungs-Tätigkeiten in einem Informations-Cockpit

Bakk

Product development starts with the product requirements. Once these are defined, solutions are created for the individual components, which together then satisfy the overall product requirements. The process of devising and refining solutions is repeated over many iterations until the quality demanded by the product requirements is achieved. This entire "knowledge process" is to be transferred into knowledge management. We therefore show ways to make the new information technologies of Web 2.0 usable for knowledge management in the automotive industry. The work is based on a research project of the Virtual Vehicle Competence Center, which includes a software prototype (the "information cockpit"). The "information cockpit" links both the product requirements and the development tasks with the project organization. Thus a Product Data Management (PDM) system as well as a Requirement Management System (RQM) is mapped. The networking has succeeded in uniting the individual systems, which is a novelty in this area. By networking the product data, requirements data and project organization, the user is able to obtain a quick overview of different data in automotive development. As a result, management as well as design is able to use existing knowledge quickly and to provide newly generated knowledge to others in an unconventional manner. At present only the visualization is implemented. The data to be used are made available by "Link-Nodes" from the data system. The goal is to transfer the demonstrator to the "information cockpit" application. The ontology PROTARES (PROject TAsks RESources) is used here as a basis. This ontology includes the entire data schema. A semantic RESTful (Representational State Transfer) web service was designed and implemented accordingly. The data storage layer is a triple-store database. The "information cockpit" can be used to query the system, which displays the information to the user graphically and in a structured form.
Through the use of these technologies it was possible to create a modular overall system architecture. In the near future, data management can be tackled, so that data can be changed as well as visualized. After that, user administration, access control, and similar features can be considered.

Veigl Robert

2017

Multiplatform Mobile App for Data Acquisition from External Sensors

Bakk

Mobile apps are becoming more and more important for companies, because apps are needed to sell or operate their products. To serve a wide range of customers, apps must be available for the most common platforms, at least Android and iOS. Considering Windows Phone as well, a company would need to provide three identical apps, one for each platform. As each platform comes with its own tools for app development, the apps must be implemented separately. That means development costs may rise by a factor of three in the worst case. The Qt framework promises multi-platform capability: an app needs to be implemented just once but still runs on several platforms. This bachelor's thesis sets out to prove this by developing such a multi-platform app using the Qt framework. The app is to collect data from sensors connected to the mobile device and store the retrieved data on the phone. For the proof, the supported platforms are limited to the most common ones, Android and iOS. Using the app to record data in a real-life scenario demonstrates that it functions properly.

Frank Sarah

2017

Automatic Generation of Regular Expressions

Bakk

This thesis deals with the creation of regular expressions from a list of input strings that the resulting expression should match. Since regular expressions match a pattern, they can be used to speed up work that involves large amounts of data, under the assumption that the user knows some examples of the pattern that should be matched. In the program discussed here, a regular expression is created iteratively by working away from a very rudimentary regular expression, with an adjustable threshold to mitigate the effect of not having any negative matches as input. The result is the easy creation of a sufficiently well-working regular expression, assuming a representative collection of input strings, while requiring no negative examples from the user.
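The general idea of deriving a regular expression from positive examples only can be sketched as follows. This is not the iterative, threshold-based algorithm of the thesis, which the abstract does not specify in detail; it is a simplified stand-in that collapses each example into runs of coarse character classes and emits a quantified class per run. The function names `char_class`, `signature` and `generalize` are hypothetical.

```python
import re
import string
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse regex class (an assumption of this sketch)."""
    if ch in string.digits:
        return "[0-9]"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)  # punctuation and everything else stays literal


def signature(text):
    """Collapse a string into (class, run_length) pairs."""
    return [(cls, len(list(run))) for cls, run in groupby(text, key=char_class)]


def generalize(examples):
    """Build one regex that matches every positive example."""
    if not examples:
        raise ValueError("at least one example is required")
    sigs = [signature(e) for e in examples]
    shapes = {tuple(cls for cls, _ in sig) for sig in sigs}
    if len(shapes) != 1:
        # The examples do not share one shape: fall back to literal alternation.
        return "|".join(re.escape(e) for e in sorted(set(examples)))
    parts = []
    for position in zip(*sigs):
        cls = position[0][0]
        lengths = [n for _, n in position]
        lo, hi = min(lengths), max(lengths)
        if lo == hi == 1:
            quantifier = ""
        elif lo == hi:
            quantifier = "{%d}" % lo
        else:
            quantifier = "{%d,%d}" % (lo, hi)
        parts.append(cls + quantifier)
    return "".join(parts)


EXAMPLES = ["AB-123", "XY-4567", "Q-89"]
PATTERN = generalize(EXAMPLES)
print(PATTERN)
```

Because no negative examples are available, the sketch has no way of knowing whether a run of letters should stay literal or be generalized; it always generalizes. A tunable threshold, as described in the abstract, is one way to control how far such generalization goes.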

Kuhs Stefan Claudio

2017

DART: The Dataset Anonymization and Replication Tool

Bakk

Due to persistent issues concerning sensitive information when working with big data, we present a new approach to generating artificial data in the form of datasets. For this purpose, we define the term dataset to mean a UNIX directory structure consisting of various files and folders. Especially in computer science, there is a distinct need for data. Mostly, this data already exists but contains sensitive information, so such critical data is meant to stay protected against third parties. This withholding of data leads to a lack of available data for open source developers as well as for researchers. We therefore devised a way to produce replicated datasets, given an origin dataset as input. Such replicated datasets represent the origin dataset as accurately as possible without leaking any sensitive information. We introduce the Dataset Anonymization and Replication Tool, DART for short, a Python-based framework that allows the replication of datasets. Since we aim to encourage the data science community to participate in our work, we constructed DART as a framework with a high degree of adaptability and extensibility. We started with the analysis of datasets and various file and MIME types to find suitable properties that characterize datasets. We defined a broad range of properties, or characteristics, ranging from the number of files to file-specific characteristics such as permissions. In the next step, we explored several mathematical and statistical approaches to replicate the selected characteristics. We chose to model characteristics using relative frequency distributions (unigrams) as well as discrete and continuous random variables. Finally, we produced replicated datasets and compared the replicated characteristics with the characteristics of the corresponding origin dataset.
Thus, the comparison between origin and replicated datasets is based exclusively on the selected characteristics. The achieved results depend heavily on the origin dataset as well as on the characteristics of interest. Origin datasets with a simple structure are more likely to deliver usable results. In contrast, large and complex origin datasets may be difficult to replicate sufficiently. Nevertheless, the results give hope that tools like DART will be utilized to provide artificial data for persistent use cases.
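Modeling a characteristic with a relative frequency distribution and sampling from it can be sketched in a few lines. The snippet below is not DART's code or API; the function names `relative_frequencies` and `replicate`, and the toy list of file extensions, are assumptions made purely for illustration.

```python
import random
from collections import Counter


def relative_frequencies(values):
    """Return the share of each distinct value among all observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def replicate(values, size, rng):
    """Draw `size` synthetic values that follow the observed distribution."""
    dist = relative_frequencies(values)
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=size)


# Toy characteristic of a hypothetical origin dataset: file extensions.
ORIGIN_EXTENSIONS = [".txt"] * 6 + [".csv"] * 3 + [".png"]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
REPLICA = replicate(ORIGIN_EXTENSIONS, size=1000, rng=rng)

print(relative_frequencies(ORIGIN_EXTENSIONS))
print(relative_frequencies(REPLICA))
```

Only the distribution of the characteristic carries over to the replica, not the individual files, which is what allows origin and replica to be compared on the selected characteristics alone.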

Kurzmann Lukas

2017

Data Mining - Variables and Feature Selection with Greedy and Non Greedy Algorithm

Bakk

This paper compares variable and feature selection with greedy and non-greedy algorithms. For the greedy solution, the ID3 algorithm [J. Quinlan, 1986] is used, which serves as a baseline. This algorithm is fast and provides good results for smaller datasets. However, if the dataset gets larger and the information we want to extract from it has to be more precise, several combinations should be checked. A non-greedy solution is a possible way to achieve that goal. This approach tries every possibility/combination to obtain the optimal results. These results may contain combinations of variables. One variable on its own may provide no information about the dataset, but in combination with another variable it does. That is one reason why it is useful to check every combination. While its precision is very good, the algorithm needs more computation time, at least Ω(n!). The more attributes a dataset has, the higher the computational complexity. The results show that, even for smaller datasets, the non-greedy algorithm finds more precise results, especially with regard to combinations of several attributes/variables. In summary, if the dataset needs to be analysed more precisely and the hardware allows it, the non-greedy version of the algorithm is a tool that provides precise results, especially where combinations of variables are concerned.
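The point that a single variable may carry no information while a combination does can be illustrated with an XOR-style toy dataset. The sketch below is not the algorithm from the paper: it only compares the information gain that an ID3-style greedy step sees for single attributes with the gain found by exhaustively scoring attribute subsets. The function names and the toy data are assumptions of this sketch.

```python
from collections import Counter, defaultdict
from itertools import combinations, product
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attrs):
    """Entropy reduction when splitting on the joint value of `attrs`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[tuple(row[a] for a in attrs)].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_subset(rows, labels, max_size):
    """Exhaustively score every attribute subset up to `max_size`."""
    n_attrs = len(rows[0])
    candidates = [
        subset
        for size in range(1, max_size + 1)
        for subset in combinations(range(n_attrs), size)
    ]
    return max(candidates, key=lambda s: information_gain(rows, labels, s))


# Toy data: three binary attributes, label = attribute 0 XOR attribute 1.
ROWS = list(product([0, 1], repeat=3))
LABELS = [a ^ b for a, b, _ in ROWS]

single_gains = [information_gain(ROWS, LABELS, (i,)) for i in range(3)]
print(single_gains)                               # what a greedy step sees
print(best_subset(ROWS, LABELS, max_size=2))      # what exhaustive search finds
```

In this toy dataset every single attribute has zero gain, so a greedy choice has nothing to go on, whereas the pair of the first two attributes determines the label completely. The price is that the number of candidate subsets in this sketch grows exponentially with the number of attributes.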