Gursch Heimo, Körner Stefan, Thaler Franz, Waltner Georg, Ganster Harald, Rinnhofer Alfred, Oberwinkler Christian, Meisenbichler Reinhard, Bischof Horst, Kern Roman
Refuse separation and sorting is currently done by recycling plants that are manually optimised for a fixed refuse composition. Since the refuse composition constantly changes, these plants either deliver suboptimal sorting performance or require constant monitoring and adjustment by the plant operators. Image recognition offers the possibility to continuously monitor the refuse composition on the conveyor belts in a sorting facility. When information about the refuse composition is combined with parameters and measurements of the sorting machinery, the sorting performance of a plant can be continuously monitored, problems detected, optimisations suggested and trends predicted. This article describes solutions for multispectral and 3D image capturing of refuse streams and evaluates the performance of image segmentation models. The image segmentation models are trained with synthetic training data to reduce the manual labelling effort and thus the cost of introducing image recognition. Furthermore, an outlook on the combination of image recognition data with parameters and measurements of the sorting machinery in a combined time series analysis is provided.
Liu Xinglan, Hussain Hussain, Razouk Houssam, Kern Roman
Graph embedding methods have emerged as effective solutions for knowledge graph completion. However, such methods are typically tested on benchmark datasets such as Freebase, and show limited performance when applied to sparse knowledge graphs with orders of magnitude lower density. To compensate for the lack of structure in a sparse graph, low-dimensional representations of textual information such as word2vec or BERT embeddings have been used. This paper proposes a BERT-based method (BERT-ConvE) that exploits transfer learning of BERT in combination with the convolutional network model ConvE. Compared to existing text-aware approaches, we effectively make use of the context dependency of BERT embeddings by optimizing the feature extraction strategies. Experiments on ConceptNet show that the proposed method outperforms strong baselines by 50% on knowledge graph completion tasks. The proposed method is suitable for sparse graphs, as also demonstrated by empirical studies on the ATOMIC and sparsified-FB15k-237 datasets. Its effectiveness and simplicity make it appealing for industrial applications.
Salhofer Eileen, Liu Xinglan, Kern Roman
State-of-the-art performance for entity extraction tasks is achieved by supervised learning, specifically by fine-tuning pretrained language models such as BERT. As a result, annotating application-specific data is the first step in many use cases. However, no practical guidelines are available for annotation requirements. This work supports practitioners by empirically answering two frequently asked questions: (1) how many training samples to annotate, and (2) which examples to annotate. We found that BERT achieves up to 80% F1 when fine-tuned on only 70 training examples, especially in the biomedical domain. The key features for guiding the selection of high-performing training instances are identified to be pseudo-perplexity and sentence length. The best training dataset constructed using our proposed selection strategy shows an F1 score that is equivalent to a random selection with twice the sample size. The requirement of only a small amount of training data implies cheaper implementations and opens the door to a wider range of applications.
Gursch Heimo, Pramhas Martin, Bernhard Knopper, Daniel Brandl, Markus Gratzl, Schlager Elke, Kern Roman
In the project COMFORT (Comfort Orientated and Management Focused Operation of Room condiTions), the comfort of office rooms is investigated with simulations and data-driven methods. While the data-driven methods rely on measurement data, the simulation requires extensive descriptions of the office rooms, which largely coincide with the information captured in the Building Information Model (BIM). Despite great progress in recent years, the integration of BIM and simulation is not yet fully automated. Using the case study of a storey extension to an office building of Thomas Lorenz ZT GmbH, the transfer of BIM data to Building Energy Simulation (BES) and Computational Fluid Dynamics (CFD) simulations is examined. For the building under study, the entire planning process was carried out on the basis of the BIM. This made it possible to derive the submission planning, the tender planning for all trades including quantity take-off, and execution plans such as foreman, formwork and reinforcement plans from the model, and to link the building services model with the architectural and structural planning models at an early stage. Starting from the BIM, the required data could be passed to the BES in IFC format. However, the software used could not yet perform an automatic transfer, so manual post-processing of the rooms was necessary. For the CFD simulation only selected rooms were considered, because the additional effort for the transfer in STEP format is still very high when the BIM is prepared in the usual way. The free air volume has to be modelled separately in the BIM and certain geometric boundary conditions have to be met. Likewise, information on heat sources and furniture has to be available at a very high level of planning detail.
The exchange of boundary conditions at the interfaces between air and building envelope still had to be done manually. In terms of their informative value, the BES and CFD simulation results can be regarded as identical to those from conventional, manually created simulation models. An automatic transfer of parameter values currently still fails because the values cannot be adequately interpreted or mapped in the simulation software. In the future, the establishment of IFC 4 and of additional Industry Foundation Class (IFC) parameters should make it easier to store the required data in the model in a structured way. Particular attention should be paid to the integration of room book data into BIM, since this information is of great use not only for simulation. These information integrations are not limited to a one-time transfer, but aim at an integration in which changes are automatically propagated between BIM, simulation and adjoining areas.
Lovric Mario, Kern Roman, Fadljevic Leon, Gerdenitsch Johann, Steck Thomas, Peche Ernst
In industrial electro galvanizing lines, the performance of the dimensionally stable anodes (Ti + IrOx) is a crucial factor for product quality. Ageing of the anodes causes a worsened zinc coating distribution on the steel strip and a significant increase in production costs due to a higher resistivity of the anodes. Up to now, the end of the anode lifetime has been detected by visual inspection every several weeks. The voltage of the rectifiers increases much earlier, indicating the deterioration of anode performance. Therefore, monitoring the rectifier voltage has the potential for an early determination of the end of anode lifetime. Anode condition is only one of many parameters affecting the rectifier voltage. In this work we employed machine learning to predict expected baseline rectifier voltages for a variety of steel strips and operating conditions at an industrial electro galvanizing line. In the plating section the strip passes twelve “Gravitel” cells, and zinc from the electrolyte is deposited on the surface at high current densities. Data collected on one exemplary rectifier unit equipped with two anodes have been studied for a period of two years. The dataset consists of one target variable (rectifier voltage) and nine predictive variables describing electrolyte, current and steel strip characteristics. For predictive modelling, we selected Random Forest Regression. Training was conducted on intervals after the plating cell was equipped with new anodes. Our results show a Normalized Root Mean Square Error of Prediction (NRMSEP) of 1.4 % for the baseline rectifier voltage during good anode condition. When the anode condition was estimated as bad (by manual inspection), we observe a large, distinctive deviation from the predicted baseline voltage. The information gained about the observed deviation can be used for early detection or classification of anode ageing, to recognize the onset of damage and reduce total operation cost.
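The error measure and the deviation check described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the NRMSEP is normalised, so normalisation by the observed voltage range is an assumption here, and the function names `nrmsep` and `exceeds_baseline` as well as the `tolerance` parameter are hypothetical.

```python
import math


def nrmsep(observed, predicted):
    """Root mean square error of prediction in percent.

    Assumption: the error is normalised by the range of the observed
    values; the abstract does not specify the normalisation.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    value_range = max(observed) - min(observed)
    if value_range == 0:
        raise ValueError("observed values must not all be identical")
    return 100.0 * math.sqrt(mse) / value_range


def exceeds_baseline(observed, predicted, tolerance):
    """Indices where the measured voltage deviates from the predicted
    baseline by more than `tolerance` (a hypothetical threshold in volts)."""
    return [
        i
        for i, (o, p) in enumerate(zip(observed, predicted))
        if abs(o - p) > tolerance
    ]
```

In such a setup, `predicted` would come from a regression model trained only on intervals with new anodes, so that a growing list of flagged indices hints at anode ageing.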
Schrunner Stefan, Geiger Bernhard, Zernig Anja, Kern Roman
Classification has been tackled by a large number of algorithms, predominantly following a supervised learning setting. Surprisingly little research has been devoted to the problem setting where a dataset is only partially labeled, including even instances of entirely unlabeled classes. Algorithmic solutions that are suited for such problems are especially important in practical scenarios, where the labelling of data is prohibitively expensive, or the understanding of the data is lacking, including cases, where only a subset of the classes is known. We present a generative method to address the problem of semi-supervised classification with unknown classes, whereby we follow a Bayesian perspective. In detail, we apply a two-step procedure based on Bayesian classifiers and exploit information from both a small set of labeled data in combination with a larger set of unlabeled training data, allowing that the labeled dataset does not contain samples from all present classes. This represents a common practical application setup, where the labeled training set is not exhaustive. We show in a series of experiments that our approach outperforms state-of-the-art methods tackling similar semi-supervised learning problems. Since our approach yields a generative model, which aids the understanding of the data, it is particularly suited for practical applications.
Gursch Heimo, Cemernek David, Wuttei Andreas, Kern Roman
The increasing potential of Information and Communications Technology (ICT) drives higher degrees of digitisation in the manufacturing industry. Catchphrases such as “Industry 4.0” and “smart manufacturing” reflect this tendency. The implementation of these paradigms is not merely an end in itself, but a new way of collaborating across existing department and process boundaries. Converting the process input, internal and output data into digital twins offers the possibility to test and validate parameter changes via simulations, whose results can be used to update guidelines for shop-floor workers. The result is a Cyber-Physical System (CPS) that brings together the physical shop-floor, the digital data created in the manufacturing process, the simulations, and the human workers. The CPS offers new ways of collaboration on a shared data basis: the workers can annotate manufacturing problems directly in the data, obtain updated process guidelines, and use knowledge from other experts to address issues. Although the CPS cannot replace manufacturing management, since it is formalised through various approaches, e.g., Six Sigma or Advanced Process Control (APC), it is a new tool for validating decisions in simulation before they are implemented, allowing the guidelines to be improved continuously.
Remonda Adrian, Krebs Sarah, Luzhnica Granit, Kern Roman, Veas Eduardo Enrique
This paper explores the use of reinforcement learning (RL) models for autonomous racing. In contrast to passenger cars, where safety is the top priority, a racing car aims to minimize the lap-time. We frame the problem as a reinforcement learning task with a multidimensional input consisting of the vehicle telemetry, and a continuous action space. To find out which RL methods better solve the problem and whether the obtained models generalize to driving on unknown tracks, we put 10 variants of deep deterministic policy gradient (DDPG) to race in two experiments: i) studying how RL methods learn to drive a racing car and ii) studying how the learning scenario influences the capability of the models to generalize. Our studies show that models trained with RL are not only able to drive faster than the baseline open source handcrafted bots but also generalize to unknown tracks.
Kowald Dominik, Traub Matthias, Theiler Dieter, Gursch Heimo, Lacic Emanuel, Lindstaedt Stefanie, Kern Roman, Lex Elisabeth
Toller Maximilian, Geiger Bernhard, Kern Roman
Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily “bad” contamination and has a worst-case computational complexity of O(n log n). We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.
Winter Kevin, Kern Roman
This paper presents the Know-Center system submitted for task 5 of the SemEval-2019 workshop. Given a Twitter message in either English or Spanish, the task is first to detect whether it contains hateful speech and second to determine the target and level of aggression used. For this purpose our system utilizes word embeddings and a neural network architecture consisting of both dilated and traditional convolution layers. We achieved average F1-scores of 0.57 and 0.74 for English and Spanish, respectively.
Geiger Bernhard, Schrunner Stefan, Kern Roman
Schrunner and Geiger have contributed equally to this work.
Gursch Heimo, Silva Nelson, Reiterer Bernhard, Paletta Lucas, Bernauer Patrick, Fuchs Martin, Veas Eduardo Enrique, Kern Roman
The project Flexible Intralogistics for Future Factories (FlexIFF) investigates human-robot collaboration in intralogistics teams in the manufacturing industry, which form a cyber-physical system consisting of human workers, mobile manipulators, manufacturing machinery, and manufacturing information systems. The workers use Virtual Reality (VR) and Augmented Reality (AR) devices to interact with the robots and machinery. The right information at the right time is key for making this collaboration successful. Hence, task scheduling for mobile manipulators and human workers must be closely linked with the enterprise’s information systems, offering all actors on the shop floor a common view of the current manufacturing status. FlexIFF will provide useful, well-tested, and sophisticated solutions for cyber-physical systems in intralogistics, with humans and robots making the most of their strengths, working collaboratively and helping each other.
Cuder Gerald, Breitfuß Gert, Kern Roman
Electric vehicles have enjoyed substantial growth in recent years. One essential part of ensuring their success in the future is a well-developed and easy-to-use charging infrastructure. Since charging stations generate a lot of (big) data, gaining useful information from this data can help to push the transition to E-Mobility. In a joint research project, the Know-Center, together with the GmbH applied data analytics methods and visualization technologies on the provided data sets. One objective of the research project is to provide a consumption forecast based on the historical consumption data. Based on this information, the operators of charging stations are able to optimize the energy supply. Additionally, the infrastructure data were analysed with regard to "predictive maintenance", aiming to optimize the availability of the charging stations. Furthermore, advanced prediction algorithms were applied to provide services to the end user regarding the availability of charging stations.
Andrusyak Bohdan, Kugi Thomas, Kern Roman
The stock and foreign exchange markets are the two fundamental financial markets in the world and play a crucial role in international business. This paper examines the possibility of predicting the foreign exchange market via machine learning techniques, taking the stock market into account. We compare prediction models based on algorithms from the fields of shallow and deep learning. Our models of foreign exchange markets based on information from the stock market have been shown to be able to predict the future of foreign exchange markets with an accuracy of over 60%. This can be seen as an indicator of a strong link between the two markets. Our insights offer a chance of a better understanding guiding the future of market predictions. We found the accuracy depends on the time frame of the forecast and the algorithms used, where deep learning tends to perform better for farther-reaching forecasts.
Lovric Mario, Krebs Sarah, Cemernek David, Kern Roman
The use of big data technologies has a deep impact on today’s research (Tetko et al., 2016) and industry (Li et al., n.d.), but also on public health (Khoury and Ioannidis, 2014) and the economy (Einav and Levin, 2014). These technologies are particularly important for manufacturing sites, where complex processes are coupled with large amounts of data, for example in the chemical and steel industries. This data originates from sensors, processes, and quality testing. Typical applications of these technologies are related to predictive maintenance and the optimisation of production processes. The media have made the term “big data” a hot buzzword without going too deep into the topic. We noted a lack of understanding among users of the technologies and techniques behind it, making the application of such technologies challenging. In practice the data is often unstructured (Gandomi and Haider, 2015) and a lot of resources are devoted to cleaning and preparation, but also to understanding causalities and relevance among features. The latter requires domain knowledge, making big data projects challenging not only from a technical perspective, but also from a communication perspective. Therefore, there is a need to rethink the big data concept among researchers and manufacturing experts, including topics like data quality, knowledge exchange and the technology required. The scope of this presentation is to present the main pitfalls in applying big data technologies amongst users from industry, explain scaling principles in big data projects, and demonstrate common challenges in an industrial big data project.
Santos Tiago, Kern Roman
Semiconductor manufacturing processes critically depend on hundreds of highly complex process steps, which may cause critical deviations in the end-product. Hence, a better understanding of wafer test data patterns, which represent stress tests conducted on devices in semiconductor material slices, may lead to an improved production process. However, the shapes and types of these wafer patterns, as well as their relation to single process steps, are unknown. In a first step to address these issues, we tailor and apply a variational auto-encoder (VAE) to wafer pattern images. We find the VAE's generator allows for explorative wafer pattern analysis, and its encoder provides an effective dimensionality reduction algorithm, which, in a clustering application, performs better than several baselines such as t-SNE and yields interpretable clusters of wafer patterns.
Urak Günter, Ziak Hermann, Kern Roman
The task of federated search is to combine results from multiple knowledge bases into a single, aggregated result list, where the items typically range from textual documents to images. These knowledge bases are also called sources, and the process of choosing the actual subset of sources for a given query is called source selection. A scenario where these sources do not provide information about their content in a standardized way is called an uncooperative setting. In our work we focus on knowledge bases providing long tail content, i.e., rather specialized sources offering a low number of relevant documents. These sources are often neglected in favor of more popular knowledge sources, both by today’s Web users and by most of the existing source selection techniques. We propose a system for source selection which i) could be utilized to automatically detect long tail knowledge bases and ii) generates aggregated search results that tend to incorporate results from these long tail sources. Starting from the current state of the art, we developed components that allow the amount of contribution from long tail sources to be adjusted. Our evaluation is conducted on the TREC 2014 Federated Web Search dataset. As this dataset also favors the most popular sources, systems that include many long tail knowledge bases will yield low performance measures. Here, we propose a system where just a few relevant long tail sources are integrated into the list of more popular knowledge bases. Additionally, we evaluated the implications of an uncooperative setting, where only minimal information about the sources is available to the federated search system. Here a severe drop in performance is observed once the share of long tail sources is higher than 40%. Our work is intended to steer the development of federated search systems that aim at increasing the diversity and coverage of the aggregated search result.
Kern Roman, Falk Stefan, Rexha Andi
This paper describes our participation in SemEval-2017 Task 10, named ScienceIE (Machine Reading for Scientist). We competed in Subtasks 1 and 2, which consist, respectively, of identifying all the key phrases in scientific publications and labelling them with one of three categories: Task, Process, and Material. These scientific publications are selected from the Computer Science, Material Sciences, and Physics domains. We followed a supervised approach for both subtasks by using a sequential classifier (CRF - Conditional Random Fields). For generating our solution we used a web-based application implemented in the EU-funded research project named CODE. Our system achieved an F1 score of 0.39 for Subtask 1 and 0.28 for Subtask 2.
Rexha Andi, Kern Roman, Ziak Hermann, Dragoni Mauro
Retrieval of domain-specific documents became attractive for the Semantic Web community due to the possibility of integrating classic Information Retrieval (IR) techniques with semantic knowledge. Unfortunately, the gap between the construction of a full semantic search engine and the possibility of exploiting a repository of ontologies covering all possible domains is far from being filled. Recent solutions focused on the aggregation of different domain-specific repositories managed by third parties. In this paper, we present a semantic federated search engine developed in the context of the EEXCESS EU project. Through the developed platform, users are able to perform federated queries over repositories in a transparent way, i.e. without knowing how their original queries are transformed before being actually submitted. The platform implements a facility for plugging in new repositories and for creating, with the support of general purpose knowledge bases, knowledge graphs describing the content of each connected repository. Such knowledge graphs are then exploited for enriching queries performed by users.
Schrunner Stefan, Bluder Olivia, Zernig Anja, Kaestner Andre, Kern Roman
In the semiconductor industry it is of paramount importance to check whether a manufactured device fulfills all quality specifications and is therefore suitable for being sold to the customer. The occurrence of specific spatial patterns within the so-called wafer test data, i.e. analog electric measurements, might point to production issues. However, the shape of these critical patterns is unknown. In this paper different kinds of process patterns are extracted from wafer test data by an image processing approach using Markov Random Field models for image restoration. The goal is to develop an automated procedure to identify visible patterns in wafer test data to improve pattern matching. This step is a necessary precondition for a subsequent root-cause analysis of these patterns. The developed pattern extraction algorithm yields a more accurate discrimination between distinct patterns, resulting in a better pattern comparison than on the original dataset. In a next step, pattern classification will be applied to improve the production process control.
Cemernek David, Gursch Heimo, Kern Roman
The catchphrase “Industry 4.0” is widely regarded as a methodology for succeeding in modern manufacturing. This paper provides an overview of the history, technologies and concepts of Industry 4.0. One of the biggest challenges in implementing the Industry 4.0 paradigms in manufacturing is the heterogeneity of system landscapes and the integration of data from various sources, such as different suppliers and different data formats. These issues have been addressed in the semiconductor industry since the early 1980s and some solutions have become well-established standards. Hence, the semiconductor industry can provide guidelines for a transition towards Industry 4.0 in other manufacturing domains. In this work, the methodologies of Industry 4.0, cyber-physical systems and Big Data processes are discussed. Based on a thorough literature review and experiences from the semiconductor industry, we offer implementation recommendations for Industry 4.0 using the manufacturing process of an electronics manufacturer as an example.
Gursch Heimo, Cemernek David, Kern Roman
In manufacturing environments today, automated machinery works alongside human workers. In many cases computers and humans oversee different aspects of the same manufacturing steps, sub-processes, and processes. This paper identifies and describes four feedback loops in manufacturing and organises them in terms of their time horizon and degree of automation versus human involvement. The data flow in the feedback loops is further characterised by features commonly associated with Big Data. Velocity, volume, variety, and veracity are used to establish, describe and compare differences in the data flows.
Traub Matthias, Gursch Heimo, Lex Elisabeth, Kern Roman
New business opportunities in the digital economy are established when datasets describing a problem, data services solving the said problem, and the required expertise and infrastructure come together. For most real-world problems, finding the right data sources, services, consulting expertise, and infrastructure is difficult, especially since the market players change often. The Data Market Austria (DMA) offers a platform to bring datasets, data services, consulting, and infrastructure offers to a common marketplace. The recommender system included in DMA analyses all offerings to derive suggestions for collaboration between them, such as which dataset could be best processed by which data service. The suggestions should help the customers on DMA to identify new collaborations reaching beyond traditional industry boundaries and to get in touch with new clients or suppliers in the digital domain. Human brokers will work together with the recommender system to match different offers and set up data value chains that solve problems in various domains. In its final expansion stage, DMA is intended to be a central hub for all actors participating in the Austrian data economy, regardless of their industrial and research domain, to overcome traditional domain boundaries.
Ziak Hermann, Kern Roman
The combination of different knowledge bases in the field of information retrieval is called federated or aggregated search. It has several benefits over single source retrieval but poses some challenges as well. This work focuses on the challenge of result aggregation, especially in a setting where the final result list should include a certain degree of diversity and serendipity. Both concepts have been shown to have an impact on how users perceive an information retrieval system. In particular, we want to assess whether common procedures for result list aggregation can be utilized to introduce diversity and serendipity. Furthermore, we study whether blocking or interleaving for result aggregation yields better results. In a cross vertical aggregated search the so-called verticals could be news, multimedia content or text. Block ranking is one approach to combine such heterogeneous results. It relies on the idea that these verticals are combined into a single result list as blocks of several adjacent items. An alternative approach is interleaving. Here the verticals are blended into one result list on an item by item basis, i.e. adjacent items in the result list may come from different verticals. To generate the diverse and serendipitous results we relied on a query reformulation technique which we showed in previous work to be beneficial for generating diversified results. To conduct this evaluation we created a dedicated dataset. This dataset served as a basis for three different evaluation settings on a crowd sourcing platform, with over 300 participants. Our results show that query based diversification can be adapted to generate serendipitous results in a similar manner. Further, we discovered that both approaches, interleaving and block ranking, appear to be beneficial for introducing diversity and serendipity. However, it seems that queries benefit either from one approach or from the other, but not from both.
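The difference between block ranking and interleaving can be sketched in a few lines. This is a simplified round-robin illustration under stated assumptions, not the aggregation procedure evaluated in the paper: the function name `merge_in_blocks` and the `block_size` parameter are hypothetical, and each vertical is assumed to arrive as an already ranked list.

```python
def merge_in_blocks(verticals, block_size):
    """Merge ranked result lists by taking `block_size` adjacent items
    from each vertical in turn.

    block_size == 1 gives item-by-item interleaving; larger values give
    a simple form of block ranking.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    # Copy the inputs so the caller's lists are not modified.
    queues = [list(results) for results in verticals]
    merged = []
    while any(queues):
        for queue in queues:
            merged.extend(queue[:block_size])
            del queue[:block_size]
    return merged


news = ["n1", "n2", "n3"]
images = ["i1", "i2"]
text = ["t1"]

interleaved = merge_in_blocks([news, images, text], block_size=1)
blocked = merge_in_blocks([news, images, text], block_size=2)
```

With these toy lists, `interleaved` alternates between verticals after every item, while `blocked` keeps pairs of items from the same vertical next to each other.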
Toller Maximilian, Kern Roman
The in-depth analysis of time series has gained a lot of research interest in recent years, with the identification of periodic patterns being one important aspect. Many of the methods for identifying periodic patterns require the time series’ season length as an input parameter. There exist only a few algorithms for automatic season length approximation. Many of these rely on simplifications such as data discretization. This paper presents an algorithm for season length detection that is designed to be sufficiently reliable to be used in practical applications. The algorithm estimates a time series’ season length by interpolating, filtering and detrending the data. This is followed by analyzing the distances between zeros in the directly corresponding autocorrelation function. Our algorithm was tested against a comparable algorithm and outperformed it by passing 122 out of 165 tests, while the existing algorithm passed 83 tests. The robustness of our method can be jointly attributed to both the algorithmic approach and also to design decisions taken at the implementational level.
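The core idea of reading a season length from the zeros of the autocorrelation function can be sketched as follows. This is a strongly simplified illustration, not the published algorithm: interpolation and filtering are omitted, detrending is reduced to removing a least-squares line, and the estimate is assumed to be twice the median distance between sign changes of the autocorrelation. All function names are hypothetical.

```python
import math
from statistics import median


def detrend_linear(values):
    """Subtract the least-squares straight line from the series."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return [v - (mean_y + slope * (i - mean_x)) for i, v in enumerate(values)]


def autocorrelation(values):
    """Normalised autocorrelation for lags 0 .. len(values) // 2 - 1."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    variance = sum(c * c for c in centred)
    if variance == 0:
        return [0.0] * (n // 2)
    return [
        sum(centred[i] * centred[i + lag] for i in range(n - lag)) / variance
        for lag in range(n // 2)
    ]


def estimate_season_length(values):
    """Estimate the season length, or return None if the autocorrelation
    has fewer than two sign changes."""
    acf = autocorrelation(detrend_linear(values))
    # A sign change between two neighbouring lags marks a zero of the ACF.
    zeros = [lag for lag in range(1, len(acf)) if acf[lag - 1] * acf[lag] < 0]
    if len(zeros) < 2:
        return None
    gaps = [b - a for a, b in zip(zeros, zeros[1:])]
    # Zeros of a periodic ACF are half a season apart.
    return round(2 * median(gaps))


# Synthetic example: a sine wave with period 22 on top of a linear trend.
series = [math.sin(2 * math.pi * t / 22) + 0.05 * t for t in range(220)]
season = estimate_season_length(series)
```

The median is used so that a single misplaced zero does not distort the estimate; real data would additionally need the interpolation and filtering steps named in the abstract.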
Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman
Our work is motivated by the idea of extending the retrieval of related scientific literature to cases where the relatedness also incorporates the writing style of individual scientific authors. Therefore we conducted a pilot study to answer the question of whether humans can identify authorship once the topological clues have been removed. As a first result, we found that this task is challenging, even for humans. We also found some agreement between the annotators. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis. Here, we compared the decisions against a number of topological and stylometric features. The outcome of our work should help to improve automatic authorship identification algorithms and to shape potential follow-up studies.
Rexha Andi, Kern Roman, Dragoni Mauro, Kröll Mark
With different social media and commercial platforms, users express their opinion about products in a textual form. Automatically extracting the polarity (i.e. whether the opinion is positive or negative) of a user can be useful for both actors: the online platform incorporating the feedback to improve its product, as well as the client who might get recommendations according to his or her preferences. Different approaches for tackling the problem have been suggested, mainly using syntactic features. The “Challenge on Semantic Sentiment Analysis” aims to go beyond word-level analysis by using semantic information. In this paper we propose a novel approach that employs the semantic information of the grammatical unit called a proposition. We try to derive the target of the review from the summary information, which serves as an input to identify the proposition in it. Our implementation relies on the hypothesis that the proposition expressing the target of the summary usually contains the main polarity information.
Ziak Hermann, Kern Roman
This work documents our approach to the Social Book Search Lab 2016, where we took part in the suggestion track. The main goal of the track was to create book recommendations for readers based only on their stated request within a forum. The forum entry contained further contextual information, like the user’s catalogue of already read books and the list of example books mentioned in the user’s request. The presented approach is mainly based on the metadata included in the book catalogue provided by the organizers of the task. With the help of a dedicated search index we extracted several potential book recommendations, which were re-ranked by an SVD-based approach. Although our results did not meet our expectations, we consider this a first iteration towards a competitive solution.
Gursch Heimo, Körner Stefan, Krasser Hannes, Kern Roman
Painting a modern car involves applying many coats during a highly complex and automated process. The individual coats not only serve a decorative purpose but are also crucial for protection from damage due to environmental influences, such as rust. For an optimal paint job, many parameters have to be optimised simultaneously. A forecasting model was created, which predicts the paint flaw probability for a given set of process parameters, to help the production managers modify the process parameters to achieve an optimal result. The mathematical model was based on historical process and quality observations. Production managers who are not familiar with the mathematical concept of the model can use it via an intuitive Web-based Graphical User Interface (Web-GUI). The Web-GUI offers production managers the ability to test process parameters and forecast the expected quality. The model can be used for optimising the process parameters in terms of quality and costs.
Gursch Heimo, Kern Roman
Many different sensing, recording and transmitting platforms are offered on today’s market for Internet of Things (IoT) applications. But taking and transmitting measurements is just one part of a complete system. Long-term storage and processing of recorded sensor values are also vital for IoT applications. Big Data technologies provide a rich variety of processing capabilities to analyse the recorded measurements. In this paper an architecture for recording, searching, and analysing sensor measurements is proposed. This architecture combines existing IoT and Big Data technologies to bridge the gap between recording, transmission, and persistency of raw sensor data on one side, and the analysis of data on Hadoop clusters on the other side. The proposed framework emphasises scalability and persistence of measurements as well as easy access to the data from a variety of different data analytics tools. To achieve this, a distributed architecture is designed offering three different views on the recorded sensor readouts. The proposed architecture is not targeted at one specific use-case, but is able to provide a platform for a large number of different services.
Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman
To bring bibliometrics and information retrieval closer together, we propose to add the concept of author attribution into the pre-processing of scientific publications. Presently, common bibliographic metrics often attribute the entire article to all the authors, affecting author-specific retrieval processes. We envision a more fine-grained analysis of scientific authorship by attributing particular segments to authors. To realize this vision, we propose a new feature representation of scientific publications that captures the distribution of stylometric features. In a classification setting, we then seek to predict the number of authors of a scientific article. We evaluate our approach on a data set of ~6100 PubMed articles and achieve best results by applying random forests, i.e., 0.76 precision and 0.76 recall averaged over all classes.
Rexha Andi, Kröll Mark, Kern Roman
Monitoring (social) media represents one means for companies to gain access to knowledge about, for instance, competitors, products as well as markets. As a consequence, social media monitoring tools have been gaining attention to handle the amounts of data nowadays generated in social media. These tools also include summarisation services. However, most summarisation algorithms tend to focus on (i) first and last sentences respectively or (ii) sentences containing keywords. In this work we approach the task of summarisation by extracting 4W (who, when, where, what) information from (social) media texts. Presenting 4W information allows for a more compact content representation than traditional summaries. In addition, we depart from mere named entity recognition (NER) techniques to answer these four question types by including non-rigid designators, i.e. expressions which do not refer to the same thing in all possible worlds such as “at the main square” or “leaders of political parties”. To do that, we employ dependency parsing to identify grammatical characteristics for each question type. Every sentence is then represented as a 4W block. We perform two different preliminary studies: selecting sentences that better summarise texts, achieving an F1-measure of 0.343, as well as a 4W block extraction for which we achieve F1-measures of 0.932; 0.900; 0.803; 0.861 for the “who”, “when”, “where” and “what” categories respectively. In a next step the 4W blocks are ranked by relevance. The top three ranked blocks, for example, then constitute a summary of the entire textual passage. The relevance metric can be customised to the user’s needs, for instance, ranked by up-to-dateness where the sentences’ tense is taken into account. In a user study we evaluate different ranking strategies including (i) up-to-dateness, (ii) text sentence rank, (iii) selecting the first and last sentences or (iv) coverage of named entities, i.e. based on the number of named entities in the sentence.
Our 4W summarisation method presents a valuable addition to a company’s (social) media monitoring toolkit, thus supporting decision making processes.
Pimas Oliver, Rexha Andi, Kröll Mark, Kern Roman
The PAN 2016 author profiling task is a supervised classification problem on cross-genre documents (tweets, blog and social media posts). Our system makes use of concreteness, sentiment and syntactic information present in the documents. We train a random forest model to identify gender and age of a document’s author. We report the evaluation results received by the shared task.
Kern Roman, Klampfl Stefan, Rexha Andi
This report describes our contribution to the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016), which asked participants to identify the relevant text span in a reference paper that corresponds to a citation in another document that cites this paper. We developed three different approaches based on summarisation and classification techniques. First, we applied a modified version of an unsupervised summarisation technique, TextSentenceRank, to the reference document, which incorporates the similarity of sentences to the citation on a textual level. Second, we employed classification to select from candidates previously extracted through the original TextSentenceRank algorithm. Third, we used unsupervised summarisation of the relevant sub-part of the document that was previously selected in a supervised manner.
Gursch Heimo, Ziak Hermann, Kröll Mark, Kern Roman
Modern knowledge workers need to interact with a large number of different knowledge sources with restricted or public access. Knowledge workers are thus burdened with the need to familiarise themselves with and query each source separately. The EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project aims at developing a recommender system providing relevant and novel content to its users. Based on the user’s work context, the EEXCESS system can either automatically recommend useful content, or support users by providing a single user interface for a variety of knowledge sources. In the design process of the EEXCESS system, recommendation quality, scalability and security were the three most important criteria. This paper investigates the scalability aspect achieved by the federated design of the EEXCESS recommender system. This means that content in different sources is not replicated but is managed in each source individually. Recommendations are generated based on the context describing the knowledge worker’s information need. Each source offers result candidates which are merged and re-ranked into a single result list. This merging is done in a vector representation space to achieve high recommendation quality. To ensure security, user credentials can be set individually by each user for each source. Hence, access to the sources can be granted and revoked for each user and source individually. The scalable architecture of the EEXCESS system handles up to 100 requests querying up to 10 sources in parallel without notable performance deterioration. The re-ranking and merging of results have a smaller influence on the system's responsiveness than the average source response rates. The EEXCESS recommender system offers a common entry point for knowledge workers to a variety of different sources with only marginally slower response times than the individual sources on their own.
Hence, familiarisation with individual sources and their query language is not necessary.
Rexha Andi, Dragoni Mauro, Kern Roman, Kröll Mark
Ontology matching in a multilingual environment consists of finding alignments between ontologies modeled by using more than one language. Such a research topic combines traditional ontology matching algorithms with the use of multilingual resources, services, and capabilities for easing multilingual matching. In this paper, we present a multilingual ontology matching approach based on Information Retrieval (IR) techniques: ontologies are indexed through an inverted index algorithm and candidate matches are found by querying such indexes. We also exploit the hierarchical structure of the ontologies by adopting the PageRank algorithm for our system. The approaches have been evaluated using a set of domain-specific ontologies belonging to the agricultural and medical domains. We compare our results with existing systems following an evaluation strategy closely resembling a recommendation scenario. The version of our system using PageRank showed an increase in performance in our evaluations.
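The step of indexing one ontology and querying the index with the labels of another can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system described above: concepts are reduced to lists of labels in several languages, tokens are lower-cased whitespace splits, and candidates are ranked by the number of shared tokens. The PageRank component is left out, and all identifiers and example labels are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(concepts):
    """Map every label token to the set of concept ids whose labels contain it."""
    index = defaultdict(set)
    for concept_id, labels in concepts.items():
        for label in labels:
            for token in label.lower().split():
                index[token].add(concept_id)
    return index


def candidate_matches(index, labels):
    """Rank indexed concepts by how many query tokens they share."""
    tokens = {t for label in labels for t in label.lower().split()}
    scores = Counter()
    for token in tokens:
        for concept_id in index.get(token, ()):
            scores[concept_id] += 1
    # Highest score first; ties are broken by concept id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical target ontology with English and Italian labels per concept.
target = {
    "t1": ["wheat flour", "farina di grano"],
    "t2": ["rice", "riso"],
}
index = build_index(target)
print(candidate_matches(index, ["wheat", "grano"]))
```

Because every language's labels share one index, a source concept can match on any language it has in common with the target ontology.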
Mutlu Belgin, Sabol Vedran, Gursch Heimo, Kern Roman
Graphical interfaces and interactive visualisations are typical mediators between human users and data analytics systems. HCI researchers and developers have to be able to understand both human needs and back-end data analytics. Participants of our tutorial will learn how visualisation and interface design can be combined with data analytics to provide better visualisations. In the first of three parts, the participants will learn about visualisations and how to appropriately select them. In the second part, restrictions and opportunities associated with different data analytics systems will be discussed. In the final part, the participants will have the opportunity to develop visualisations and interface designs under given scenarios of data and system settings.
Santos Tiago, Kern Roman
This paper provides an overview of current literature on time series classification approaches, in particular of early time series classification. A very common and effective time series classification approach is the 1-Nearest Neighbor classifier, with different distance measures such as the Euclidean or dynamic time warping distances. This paper starts by reviewing these baseline methods. More recently, with the gain in popularity in the application of deep neural networks to the field of computer vision, research has focused on developing deep learning architectures for time series classification as well. The literature in the field of deep learning for time series classification has shown promising results. Early time series classification aims to classify a time series with as few temporal observations as possible, while keeping the loss of classification accuracy at a minimum. Prominent early classification frameworks reviewed by this paper include, but are not limited to, ECTS, RelClass and ECDIRE. These works have shown that early time series classification may be feasible and performant, but they also show room for improvement.
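The 1-Nearest Neighbor baseline with dynamic time warping (DTW) mentioned above can be sketched as follows. This is a minimal textbook version, assuming univariate series, the absolute difference as local cost and no warping window; the function names and the toy data are illustrative only.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of the three admissible predecessor alignments.
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def classify_1nn(query, training):
    """Return the label of the training series closest to the query under DTW."""
    _, label = min(training, key=lambda item: dtw_distance(query, item[0]))
    return label


# Toy training set of (series, label) pairs.
training = [([0, 1, 2, 3], "rising"), ([3, 2, 1, 0], "falling")]
print(classify_1nn([0, 0, 1, 2, 3, 3], training))
```

For early classification, the same classifier could be applied to a prefix of the query, such as `query[:k]`, trading accuracy for earliness.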
Kern Roman, Ziak Hermann
Context-driven query extraction for content-based recommender systems faces the challenge of dealing with queries of multiple topics. In contrast to manually entered queries, this is a more frequent problem for automatically generated queries, for instance if the information need is inferred indirectly via the user's current context. Especially for federated search systems, where connected knowledge sources might react vastly differently to such queries, an algorithmic way of dealing with such queries is of high importance. One such method is to split mixed queries into their individual subtopics. To gain insight into how a multi-topic query can be split into its subtopics, we conducted an evaluation where we compared a naive approach against two more complex approaches based on word embedding techniques: one created using Word2Vec and one created using GloVe. To evaluate these approaches we used the Webis-QSeC-10 query set, consisting of about 5,000 multi-term queries. Queries of this set were concatenated and passed through the algorithms with the goal of splitting those queries again. The naive approach splits the queries into several groups, according to the number of joined queries, assuming the topics are of equal query term count. In the case of the Word2Vec and GloVe based approaches we relied on already pre-trained models: the Google News model and a model trained on a Wikipedia dump and the English Gigaword newswire text archive. The query term vectors resulting from these models were grouped into subtopics using k-Means clustering. We show that a clustering approach based on word vectors achieves better results, in particular when the query is not in topical order. Furthermore, we could demonstrate the importance of the underlying dataset.
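The naive baseline described above, which cuts a concatenated query into equally sized groups, can be sketched as follows. How leftover terms are distributed when the term count is not divisible by the number of topics is not stated in the abstract; giving them to the first groups is an assumption of this sketch, as is the function name.

```python
def naive_split(terms, topic_count):
    """Split a term list into topic_count contiguous groups of near-equal size."""
    size, remainder = divmod(len(terms), topic_count)
    groups, start = [], 0
    for i in range(topic_count):
        # Assumption: the first `remainder` groups receive one extra term each.
        end = start + size + (1 if i < remainder else 0)
        groups.append(terms[start:end])
        start = end
    return groups


mixed_query = "cheap flights vienna python list sort".split()
print(naive_split(mixed_query, 2))
```

This baseline only works when the terms of each topic are adjacent, which is consistent with the reported advantage of the clustering approaches for queries that are not in topical order.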
Klampfl Stefan, Kern Roman
Semantic enrichment of scientific publications has an increasing impact on scholarly communication. This document describes our contribution to Semantic Publishing Challenge 2016, which aims at investigating novel approaches for improving scholarly publishing through semantic technologies. We participated in Task 2 of this challenge, which requires the extraction of information from the content of a paper given as PDF. The extracted information allows answering queries about the paper’s internal organisation and the context in which it was written. We build upon our contribution to the previous edition of the challenge, where we categorised meta-data, such as authors and affiliations, and extracted funding information. Here we use unsupervised machine learning techniques in order to extend the analysis of the logical structure of the document as to identify section titles and captions of figures and tables. Furthermore, we employ clustering techniques to create the hierarchical table of contents of the article. Our system is modular in nature and allows a separate training of different stages on different training sets.
Urak Günter, Ziak Hermann, Kern Roman
The core approach to distributed knowledge bases is federated search. Two of the main challenges for federated search are the source representation and source selection. Different solutions to these problems were proposed in the literature. Within this work we present our novel approach for query-based sampling by relying on knowledge bases. We show the basic correctness of our approach and we came to the insight that the ambiguity of the probing terms has just a minor impact on the representation of the collection. Finally, we show that our method can be used to distinguish between niche and encyclopedic knowledge bases.
Horn Christopher, Gursch Heimo, Kern Roman, Cik Michael
Models describing human travel patterns are indispensable to plan and operate road, rail and public transportation networks. For most kinds of analysis in the field of transportation planning, there is a need for origin-destination (OD) matrices, which specify the travel demands between the origin and destination zones in the network. The preparation of OD matrices is traditionally a time-consuming and cumbersome task. The presented system, QZTool, reduces the necessary effort as it is capable of generating OD matrices automatically. These matrices are produced starting from floating phone data (FPD) as raw input. This raw input is processed by a Hadoop-based big data system. A graphical user interface allows for easy usage and hides the complexity from the operator. For evaluation, we compare an FPD-based OD matrix to an OD matrix created by a traffic demand model. Results show that both matrices agree to a high degree, indicating that FPD-based OD matrices can be used to create new, or to validate or amend existing OD matrices.
Falk Stefan, Rexha Andi, Kern Roman
This paper describes our participation in SemEval-2016 Task 5 for Subtask 1, Slot 2. The challenge demands to find domain specific target expressions on sentence level that refer to reviewed entities. The detection of target words is achieved by using word vectors and their grammatical dependency relationships to classify each word in a sentence into target or non-target. A heuristic based function then expands the classified target words to the whole target phrase. Our system achieved an F1 score of 56.816% for this task.
Dragoni Mauro, Rexha Andi, Kröll Mark, Kern Roman
Twitter is one of the most popular micro-blogging services on the web. The service allows sharing, interaction and collaboration via short, informal and often unstructured messages called tweets. Polarity classification of tweets refers to the task of assigning a positive or a negative sentiment to an entire tweet. Quite similar is predicting the polarity of a specific target phrase, for instance @Microsoft or #Linux, which is contained in the tweet. In this paper we present a Word2Vec approach to automatically predict the polarity of a target phrase in a tweet. In our classification setting, we thus do not have any polarity information but use only semantic information provided by a Word2Vec model trained on Twitter messages. To evaluate our feature representation approach, we apply well-established classification algorithms such as the Support Vector Machine and Naive Bayes. For the evaluation we used the Semeval 2016 Task #4 dataset. Our approach achieves F1-measures of up to ∼90 % for the positive class and ∼54 % for the negative class without using polarity information about single words.
Pimas Oliver, Klampfl Stefan, Kohl Thomas, Kern Roman, Kröll Mark
Patents and patent applications are important parts of a company’s intellectual property. Thus, companies put a lot of effort into designing and maintaining an internal structure for organizing their own patent portfolios, but also into keeping track of competitors’ patent portfolios. Yet, official classification schemas offered by patent offices (i) are often too coarse and (ii) are not mappable, for instance, to a company’s functions, applications, or divisions. In this work, we present a first step towards generating tailored classification. To automate the generation process, we apply key term extraction and topic modelling algorithms to 2,131 publications of German patent applications. To infer categories, we apply topic modelling to the patent collection. We evaluate the mapping of the topics found via the Latent Dirichlet Allocation method to the classes present in the patent collection as assigned by the domain expert.
Ziak Hermann, Rexha Andi, Kern Roman
This paper describes our system for the mining task of the Social Book Search Lab in 2016. The track consisted of two tasks, the classification of book request postings and the task of linking book identifiers with references mentioned within the text. For the classification task we used text mining features like n-grams and vocabulary size, but also included advanced features like average spelling errors found within the text. Here two datasets were provided by the organizers for this task, which were evaluated separately. The second task, the linking of book titles to a work identifier, was addressed by an approach based on lookup tables. For the dataset of the first task our approach was ranked third, following two baseline approaches of the organizers with an accuracy of 91 percent. For the second dataset we achieved second place with an accuracy of 82 percent. Our approach secured the first place with an F-score of 33.50 for the second task.
Gursch Heimo, Ziak Hermann, Kern Roman
The objective of the EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project is to develop a system that can automatically recommend helpful and novel content to knowledge workers. The EEXCESS system can be integrated into existing software user interfaces as plugins which will extract topics and suggest the relevant material automatically. This recommendation process simplifies the information gathering of knowledge workers. Recommendations can also be triggered manually via web frontends. EEXCESS hides the potentially large number of knowledge sources by semi- or fully automatically providing content suggestions. Hence, users only have to be able to use the EEXCESS system and not all sources individually. For each user, relevant sources can be set or auto-selected individually. EEXCESS offers open interfaces, making it easy to connect additional sources and user program plugins.
Schulze Gunnar, Horn Christopher, Kern Roman
This paper presents an approach for matching cell phone trajectories of low spatial and temporal accuracy to the underlying road network. In this setting, only the position of the base station involved in a signaling event and the timestamp are known, resulting in a possible error of several kilometers. No additional information, such as signal strength, is available. The proposed solution restricts the set of admissible routes to a corridor by estimating the area within which a user is allowed to travel. The size and shape of this corridor can be controlled by various parameters to suit different requirements. The computed area is then used to select road segments from an underlying road network, for instance OpenStreetMap. These segments are assembled into a search graph, which additionally takes the chronological order of observations into account. A modified Dijkstra algorithm is applied for finding admissible candidate routes, from which the best one is chosen. We performed a detailed evaluation of 2249 trajectories with an average sampling time of 260 seconds. Our results show that, in urban areas, on average more than 44% of each trajectory are matched correctly. In rural and mixed areas, this value increases to more than 55%. Moreover, an in-depth evaluation was carried out to determine the optimal values for the tunable parameters and their effects on the accuracy, matching ratio and execution time. The proposed matching algorithm facilitates the use of large volumes of cell phone data in Intelligent Transportation Systems, in which accurate trajectories are desirable.
Ziak Hermann, Kern Roman
Cross vertical aggregated search is a special form of meta search, where multiple search engines from different domains and with varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how to best evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights on how different users compare the quality of search result lists. In regard to the pseudo relevance feedback we found that using Wikipedia as knowledge base generally provides a benefit, except for entity-centric queries, which target single persons or organisations. Our results will help to steer the development of cross vertical aggregated search engines and will also help to guide large scale evaluation strategies, for example using crowd sourcing techniques.
Pimas Oliver, Kröll Mark, Kern Roman
Our system for the PAN 2015 authorship verification challenge is based upon a two step pre-processing pipeline. In the first step we extract different features that observe stylometric properties, grammatical characteristics and pure statistical features. In the second step of our pre-processing we merge all those features into a single meta feature space. We train an SVM classifier on the generated meta features to verify the authorship of an unseen text document. We report the results from the final evaluation as well as on the training datasets.
Rubien Raoul, Ziak Hermann, Kern Roman
Underspecified search queries can be handled via result list diversification approaches, which are often computationally complex and require longer response times. In this paper, we explore an alternative and more efficient way to diversify the result list based on query expansion. To that end, we used a knowledge base pseudo-relevance feedback algorithm. We compared our algorithm to IA-Select, a state-of-the-art diversification method, using its intent-aware version of the NDCG (Normalized Discounted Cumulative Gain) metric. The results indicate that our approach can guarantee a similar extent of diversification as IA-Select. In addition, we showed that the supported query language of the underlying search engines plays an important role in the query expansion based on diversification. Therefore, query expansion may be an alternative when result diversification is not feasible, for example in federated search systems where latency and the quantity of handled search results are critical issues.
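An intent-aware NDCG is commonly computed as the probability-weighted average of per-intent NDCG values, which can be sketched as follows. Several details here are assumptions rather than facts from the abstract: the linear gain with a log2 discount, the ideal ranking obtained by reordering the returned list only, and the hypothetical intent names and probabilities in the example.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG normalised by the DCG of the same judgements in ideal order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def ndcg_ia(relevances_by_intent, intent_probabilities, k):
    """Intent-aware NDCG: per-intent NDCG weighted by the intent probability."""
    return sum(
        probability * ndcg(relevances_by_intent[intent], k)
        for intent, probability in intent_probabilities.items()
    )


# One ranked list of two results, judged separately for two hypothetical intents.
judgements = {"animal": [1, 0], "car": [0, 1]}
probabilities = {"animal": 0.5, "car": 0.5}
print(round(ndcg_ia(judgements, probabilities, 2), 3))
```

A ranking that serves only one intent scores well for that intent and poorly for the others, so the weighted average rewards result lists that cover the likely intents near the top.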
Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman
The overwhelming majority of scientific publications are authored by multiple persons; yet, bibliographic metrics are only assigned to individual articles as single entities. In this paper, we aim at a more fine-grained analysis of scientific authorship. We therefore adapt a text segmentation algorithm to identify potential author changes within the main text of a scientific article, which we obtain by using existing PDF extraction techniques. To capture stylistic changes in the text, we employ a number of stylometric features. We evaluate our approach on a small subset of PubMed articles consisting of an approximately equal number of research articles written by a varying number of authors. Our results indicate that the more authors an article has the more potential author changes are identified. These results can be considered as an initial step towards a more detailed analysis of scientific authorship, thereby extending the repertoire of bibliometrics.
Klampfl Stefan, Kern Roman
Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors' affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information we have developed a processing pipeline that analyses the structure of a PDF document incorporating a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different meta-data categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then utilised to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.
Horn Christopher, Kern Roman
In this paper, we propose an approach to deriving public transportation timetables of a region (i.e. country) based on (i) large-scale, non-GPS cell phone data and (ii) a dataset containing geographic information of public transportation stations. The presented algorithm is designed to work with movement data, which are scarce and have a low spatial accuracy but exist in vast amounts (large-scale). Since only aggregated statistics are used, our algorithm copes well with anonymized data. Our evaluation shows that 89% of the departure times of popular train connections are correctly recalled with an allowed deviation of 5 minutes. The timetable can be used as a feature for transportation mode detection to separate public from private transport when no public timetable is available.
Kern Roman, Frey Matthias
Table recognition and table extraction are important tasks in information extraction, especially in the domain of scholarly communication. In this domain tables are commonplace and contain valuable information. Many different automatic approaches for table recognition and extraction exist. Common to many of these approaches is the need for ground truth datasets, to train algorithms or to evaluate the results. In this paper we present the PDF Table Annotator, a web based tool for annotating elements and regions in PDF documents, in particular tables. The annotated data is intended to serve as a ground truth useful to machine learning algorithms for detecting table regions and table structure. To make the task of manual table annotation as convenient as possible, the tool is designed to allow an efficient annotation process that may span multiple sessions by multiple users. An evaluation is conducted where we compare our tool to three alternative ways of creating ground truth of tables in documents. Here we found that our tool overall provides an efficient and convenient way to annotate tables. In addition, our tool is particularly suitable for complex table structures, where it provided the lowest annotation time and the highest accuracy. Furthermore, our tool allows tables to be annotated following a logical or a functional model. Given that ground truth datasets for table recognition and extraction are easier to produce with our tool, the quality of automatic table extraction should greatly benefit.
Kern Roman, Zechner Mario, Granitzer Michael
Author disambiguation is a prerequisite for utilizing bibliographic metadata in citation analysis. Automatic disambiguation algorithms mostly rely on cluster-based disambiguation strategies for identifying unique authors given their names and publications. However, most approaches rely on knowing the correct number of unique authors a priori, which is rarely the case in real world settings. In this publication we analyse cluster-based disambiguation strategies and develop a model selection method to estimate the number of distinct authors based on co-authorship networks. We show that, given clean textual features, the developed model selection method provides accurate guesses of the number of unique authors.
Kern Roman, Granitzer Michael, Muhr M.
Word sense induction and discrimination (WSID) identifies the senses of an ambiguous word and assigns instances of this word to one of these senses. We have built a WSID system that exploits syntactic and semantic features based on the results of a natural language parser component. To achieve high robustness and good generalization capabilities, we designed our system to work on a restricted, but grammatically rich set of features. Based on the results of the evaluations, our system provides promising performance and robustness.
Kern Roman, Granitzer Michael, Muhr M.
Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties and therefore, the impact of hierarchical structures on the labeling accuracy is still unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, in the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, χ² Test, and Information Gain, to make use of those relationships and evaluate their impact on 4 different datasets, namely the Open Directory Project, Wikipedia, TREC Ohsumed and the CLEF-IP European Patent dataset. We show that hierarchical relationships can be exploited to increase labeling accuracy, especially on high-level nodes.
Lindstaedt Stefanie, Pammer-Schindler Viktoria, Mörzinger Roland, Kern Roman, Mülner Helmut, Wagner Claudia
Imagine you are a member of an online social system and want to upload a picture into the community pool. In current social software systems, you can probably tag your photo, share it, send it to a photo printing service and do many other things. The system creates around you a space full of pictures, other interesting content (descriptions, comments) and full of users as well. The one thing current systems do not do is understand what your pictures are about. We present here a collection of functionalities that make a step in that direction when put together to be consumed by a tag recommendation system for pictures. We use the data richness inherent in social online environments for recommending tags by analysing different aspects of the same data (text, visual content and user context). We also give an assessment of the quality of the tags thus recommended.