Publications

Here you will find scientific publications authored by Know-Center staff

2018

Andrusyak Bohdan, Kugi Thomas, Kern Roman

Daily Prediction of Foreign Exchange Rates Based on the Stock Market

Proceedings of the PEFNet 2017 conference, Jana Stávková, Mendel University Press, Brno, 2018

Conference
The stock and foreign exchange markets are the two fundamental financial markets in the world and play a crucial role in international business. This paper examines the possibility of predicting the foreign exchange market via machine learning techniques, taking the stock market into account. We compare prediction models based on algorithms from the fields of shallow and deep learning. Our models of foreign exchange markets based on information from the stock market have been shown to be able to predict the future of foreign exchange markets with an accuracy of over 60%. This can be seen as an indicator of a strong link between the two markets. Our insights offer a chance of a better understanding guiding the future of market predictions. We found the accuracy depends on the time frame of the forecast and the algorithms used, where deep learning tends to perform better for farther-reaching forecasts.
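A minimal sketch of the prediction setting described above, using synthetic daily return series and a plain logistic regression as a stand-in classifier; the paper's actual features, data and shallow/deep models are not reproduced here.

```python
# Toy setup: predict the next day's FX direction from today's stock-market
# returns. All data below is synthetic; names and shapes are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
stock_returns = rng.normal(0, 0.01, size=(500, 3))        # e.g. three index returns per day
fx_returns = 0.4 * stock_returns[:, 0] + rng.normal(0, 0.01, size=500)

X = stock_returns[:-1]                   # today's stock returns ...
y = (fx_returns[1:] > 0).astype(int)     # ... predict tomorrow's FX direction

split = 400
clf = LogisticRegression().fit(X[:split], y[:split])
print("directional accuracy:", clf.score(X[split:], y[split:]))
```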
2018

Lovric Mario, Krebs Sarah, Cemernek David, Kern Roman

BIG DATA IN INDUSTRIAL APPLICATION

XII Meeting of Young Chemical Engineers, Zagreb, Croatia, 2018

Conference
The use of big data technologies has a deep impact on today’s research (Tetko et al., 2016) and industry (Li et al., n.d.), but also on public health (Khoury and Ioannidis, 2014) and economy (Einav and Levin, 2014). These technologies are particularly important for manufacturing sites, where complex processes are coupled with large amounts of data, for example in the chemical and steel industries. This data originates from sensors, processes, and quality testing. Typical applications of these technologies are related to predictive maintenance and the optimisation of production processes. The media have made the term “big data” a hot buzzword without going too deep into the topic. We noted a lack in users’ understanding of the technologies and techniques behind it, making the application of such technologies challenging. In practice the data is often unstructured (Gandomi and Haider, 2015) and a lot of resources are devoted to cleaning and preparation, but also to understanding causalities and relevance among features. The latter requires domain knowledge, making big data projects challenging not only from a technical perspective, but also from a communication perspective. Therefore, there is a need to rethink the big data concept among researchers and manufacturing experts, including topics like data quality, knowledge exchange and the technology required. The scope of this presentation is to present the main pitfalls in applying big data technologies amongst users from industry, explain scaling principles in big data projects, and demonstrate common challenges in an industrial big data project.
2018

Santos Tiago, Kern Roman

Understanding semiconductor production with variational auto-encoders

European Symposium on Artificial Neural Networks (ESANN) 2018, 2018

Conference
Semiconductor manufacturing processes critically depend on hundreds of highly complex process steps, which may cause critical deviations in the end-product. Hence, a better understanding of wafer test data patterns, which represent stress tests conducted on devices in semiconductor material slices, may lead to an improved production process. However, the shapes and types of these wafer patterns, as well as their relation to single process steps, are unknown. In a first step to address these issues, we tailor and apply a variational auto-encoder (VAE) to wafer pattern images. We find the VAE's generator allows for explorative wafer pattern analysis, and its encoder provides an effective dimensionality reduction algorithm, which, in a clustering application, performs better than several baselines such as t-SNE and yields interpretable clusters of wafer patterns.
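A minimal VAE sketch in PyTorch illustrating the encoder-as-dimensionality-reduction idea from the abstract; the input size, architecture and hyper-parameters are assumptions, not the authors' actual model.

```python
# Minimal VAE for hypothetical 64x64 grayscale wafer maps (flattened inputs).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaferVAE(nn.Module):
    def __init__(self, input_dim=64 * 64, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of q(z|x)
        self.dec1 = nn.Linear(latent_dim, hidden_dim)
        self.dec2 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        return torch.sigmoid(self.dec2(F.relu(self.dec1(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the unit Gaussian prior.
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld

# After training, the latent means mu(x) serve as low-dimensional features,
# e.g. as input to a k-means clustering of wafer patterns.
```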
2018

Urak Günter, Ziak Hermann, Kern Roman

Source Selection of Long Tail Sources for Federated Search in an Uncooperative Setting

SAC, 2018

Conference
The task of federated search is to combine results from multiple knowledge bases into a single, aggregated result list, where the items typically range from textual documents to images. These knowledge bases are also called sources, and the process of choosing the actual subset of sources for a given query is called source selection. A scenario where these sources do not provide information about their content in a standardized way is called an uncooperative setting. In our work we focus on knowledge bases providing long tail content, i.e., rather specialized sources offering a low number of relevant documents. These sources are often neglected in favor of more popular knowledge sources, both by today’s Web users as well as by most of the existing source selection techniques. We propose a system for source selection which i) could be utilized to automatically detect long tail knowledge bases and ii) generates aggregated search results that tend to incorporate results from these long tail sources. Starting from the current state of the art we developed components that allow adjusting the amount of contribution from long tail sources. Our evaluation is conducted on the TREC 2014 Federated Web Search dataset. As this dataset also favors the most popular sources, systems that include many long tail knowledge bases will yield low performance measures. Here, we propose a system where just a few relevant long tail sources are integrated into the list of more popular knowledge bases. Additionally, we evaluated the implications of an uncooperative setting, where only minimal information about the sources is available to the federated search system. Here a severe drop in performance is observed once the share of long tail sources is higher than 40%. Our work is intended to steer the development of federated search systems that aim at increasing the diversity and coverage of the aggregated search result.
2018

Bassa Akim, Kröll Mark, Kern Roman

GerIE - An Open Information Extraction System for the German Language

Journal of Universal Computer Science, 2018

Journal
Open Information Extraction (OIE) is the task of extracting relations from text without the need of domain specific training data. Currently, most of the research on OIE is devoted to the English language, but little or no research has been conducted on other languages including German. We tackled this problem and present GerIE, an OIE parser for the German language. Therefore we started by surveying the available literature on OIE with a focus on concepts which may also apply to the German language. Our system is built upon the output of a dependency parser, on which a number of hand crafted rules are executed. For the evaluation we created two dedicated datasets, one derived from news articles and one based on texts from an encyclopedia. Our system achieves F-measures of up to 0.89 for sentences that have been correctly preprocessed.
2018

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Authorship Identification of Documents with High Content Similarity

Scientometrics, Wolfgang Glänzel, Springer Link, 2018

Journal
The goal of our work is inspired by the task of associating segments of text to their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and to thus simulate/mimic such behavior accordingly. Unlike the majority of the work done in this field (i.e., authorship attribution, plagiarism detection, etc.), which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine if humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features, while in the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (i) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (ii) assist forensic experts or linguists to create profiles of writers, (iii) support intelligence applications to analyze aggressive and threatening messages and (iv) help editor conformity by adhering to, for instance, journal specific writing style.
2017

Cemernek David, Gursch Heimo, Kern Roman

Big Data as a Promoter of Industry 4.0: Lessons of the Semiconductor Industry

IEEE 15th International Conference of Industrial Informatics - INDIN'2017, IEEE, Emden, Germany, 2017

Conference
The catchphrase “Industry 4.0” is widely regarded as a methodology for succeeding in modern manufacturing. This paper provides an overview of the history, technologies and concepts of Industry 4.0. One of the biggest challenges to implementing the Industry 4.0 paradigms in manufacturing is the heterogeneity of system landscapes and the integration of data from various sources, such as different suppliers and different data formats. These issues have been addressed in the semiconductor industry since the early 1980s and some solutions have become well-established standards. Hence, the semiconductor industry can provide guidelines for a transition towards Industry 4.0 in other manufacturing domains. In this work, the methodologies of Industry 4.0, cyber-physical systems and big data processes are discussed. Based on a thorough literature review and experiences from the semiconductor industry, we offer implementation recommendations for Industry 4.0 using the manufacturing process of an electronics manufacturer as an example.
2017

Traub Matthias, Gursch Heimo, Lex Elisabeth, Kern Roman

Data Market Austria - Austria's First Digital Ecosystem for Data, Businesses, and Innovation

Exploring a changing view on organizing value creation: Developing New Business Models. Contributions to the 2nd International Conference on New Business Models, Institute of Systems Sciences, Innovation and Sustainability Research, Merangasse 18, 8010 Graz, Austria, Graz, 2017

Conference
New business opportunities in the digital economy are established when datasets describing a problem, data services solving the said problem, and the required expertise and infrastructure come together. For most real-world problems, finding the right data sources, services, consulting expertise, and infrastructure is difficult, especially since the market players change often. The Data Market Austria (DMA) offers a platform to bring datasets, data services, consulting, and infrastructure offers to a common marketplace. The recommender system included in DMA analyses all offerings to derive suggestions for collaboration between them, like which dataset could be best processed by which data service. The suggestions should help the customers on DMA to identify new collaborations reaching beyond traditional industry boundaries and to get in touch with new clients or suppliers in the digital domain. Human brokers will work together with the recommender system to match different offers and set up data value chains solving problems in various domains. In its final expansion stage, DMA is intended to be a central hub for all actors participating in the Austrian data economy, regardless of their industrial and research domain, overcoming traditional domain boundaries.
2017

Ziak Hermann, Kern Roman

Evaluation of Contextualization and Diversification Approaches in Aggregated Search

TIR @ DEXA International Conference on Database and Expert Systems Applications, 2017

Conference
The combination of different knowledge bases in the field of information retrieval is called federated or aggregated search. It has several benefits over single source retrieval but poses some challenges as well. This work focuses on the challenge of result aggregation, especially in a setting where the final result list should include a certain degree of diversity and serendipity. Both concepts have been shown to have an impact on how users perceive an information retrieval system. In particular, we want to assess whether common procedures for result list aggregation can be utilized to introduce diversity and serendipity. Furthermore, we study whether blocking or interleaving for result aggregation yields better results. In a cross vertical aggregated search the so-called verticals could be news, multimedia content or text. Block ranking is one approach to combine such heterogeneous results. It relies on the idea that these verticals are combined into a single result list as blocks of several adjacent items. An alternative approach for this is interleaving. Here the verticals are blended into one result list on an item by item basis, i.e. adjacent items in the result list may come from different verticals. To generate the diverse and serendipitous results we relied on a query reformulation technique which we showed to be beneficial for generating diversified results in previous work. To conduct this evaluation we created a dedicated dataset. This dataset served as a basis for three different evaluation settings on a crowd sourcing platform, with over 300 participants. Our results show that query based diversification can be adapted to generate serendipitous results in a similar manner. Further, we discovered that both approaches, interleaving and block ranking, appear to be beneficial to introduce diversity and serendipity, though it seems that queries benefit from either one approach or the other but not from both.
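A toy illustration of the two aggregation strategies contrasted in the abstract, block ranking and interleaving; the vertical names and items below are made up.

```python
def block_ranking(verticals):
    """Append each vertical's results as one contiguous block."""
    merged = []
    for results in verticals:
        merged.extend(results)
    return merged

def interleave(verticals):
    """Blend the verticals into one list item by item (round robin)."""
    merged = []
    for i in range(max(len(v) for v in verticals)):
        for results in verticals:
            if i < len(results):
                merged.append(results[i])
    return merged

news = ["n1", "n2", "n3"]
multimedia = ["m1", "m2"]
text = ["t1", "t2", "t3"]
print(block_ranking([news, multimedia, text]))  # ['n1','n2','n3','m1','m2','t1','t2','t3']
print(interleave([news, multimedia, text]))     # ['n1','m1','t1','n2','m2','t2','n3','t3']
```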
2017

Toller Maximilian, Kern Roman

Robust Parameter-Free Season Length Detection in Time Series

MILETS 2017 @ International Conference on Knowledge Discovery and Data Mining, Halifax, Nova Scotia, Canada, 2017

Conference
The in-depth analysis of time series has gained a lot of research interest in recent years, with the identification of periodic patterns being one important aspect. Many of the methods for identifying periodic patterns require a time series' season length as input parameter. There exist only a few algorithms for automatic season length approximation. Many of these rely on simplifications such as data discretization. This paper presents an algorithm for season length detection that is designed to be sufficiently reliable to be used in practical applications. The algorithm estimates a time series' season length by interpolating, filtering and detrending the data. This is followed by analyzing the distances between zeros in the directly corresponding autocorrelation function. Our algorithm was tested against a comparable algorithm and outperformed it by passing 122 out of 165 tests, while the existing algorithm passed 83 tests. The robustness of our method can be jointly attributed to both the algorithmic approach and also to design decisions taken at the implementational level.
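A rough numpy-only sketch of the core idea, estimating season length from the spacing of autocorrelation zeros; this is a simplification that omits the interpolation and filtering steps of the paper's pipeline.

```python
import numpy as np

def estimate_season_length(y):
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    trend = np.polyfit(t, y, 1)                # simple linear detrending
    resid = y - np.polyval(trend, t)
    acf = np.correlate(resid, resid, mode="full")[len(resid) - 1:]
    acf /= acf[0]                              # normalise to lag 0
    zero_crossings = np.where(np.diff(np.sign(acf)) != 0)[0]
    if len(zero_crossings) < 2:
        return None                            # no periodicity detected
    # Consecutive zeros of the ACF of a periodic signal are roughly half a period apart.
    half_period = np.median(np.diff(zero_crossings))
    return int(round(2 * half_period))

print(estimate_season_length(np.sin(2 * np.pi * np.arange(500) / 25)))  # roughly 25
```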
2017

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Extending Scientific Literature Search by Including the Author’s Writing Style

Fifth Workshop on Bibliometric-enhanced Information Retrieval, Atanassova, I.; Bertin, M.; Mayr, P., Springer, Aberdeen, UK, 2017

Conference
Our work is motivated by the idea to extend the retrieval of related scientific literature to cases where the relatedness also incorporates the writing style of individual scientific authors. Therefore we conducted a pilot study to answer the question whether humans can identify authorship once the topical clues have been removed. As a first result, we found that this task is challenging, even for humans. We also found some agreement between the annotators. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis. Here, we compared the decisions against a number of topical and stylometric features. The outcome of our work should help to improve automatic authorship identification algorithms and to shape potential follow-up studies.
2017

Gursch Heimo, Cemernek David, Kern Roman

Multi-Loop Feedback Hierarchy Involving Human Workers in Manufacturing Processes

Mensch und Computer 2017 - Workshopband, Manuel Burghardt, Raphael Wimmer, Christian Wolff, Christa Womser-Hacker, Gesellschaft für Informatik e.V., Regensburg, 2017

Conference
In manufacturing environments today, automated machinery works alongside human workers. In many cases computers and humans oversee different aspects of the same manufacturing steps, sub-processes, and processes. This paper identifies and describes four feedback loops in manufacturing and organises them in terms of their time horizon and degree of automation versus human involvement. The data flow in the feedback loops is further characterised by features commonly associated with Big Data. Velocity, volume, variety, and veracity are used to establish, describe and compare differences in the data flows.
2017

Rexha Andi, Kern Roman, Ziak Hermann, Dragoni Mauro

A semantic federated search engine for domain-specific document retrieval

SAC '17 Proceedings of the Symposium on Applied Computing, Sung Y. Shin, Dongwan Shin, Maria Lencastre, ACM, Marrakech, Morocco, 2017

Conference
Retrieval of domain-specific documents became attractive for the Semantic Web community due to the possibility of integrating classic Information Retrieval (IR) techniques with semantic knowledge. Unfortunately, the gap between the construction of a full semantic search engine and the possibility of exploiting a repository of ontologies covering all possible domains is far from being filled. Recent solutions focused on the aggregation of different domain-specific repositories managed by third parties. In this paper, we present a semantic federated search engine developed in the context of the EEXCESS EU project. Through the developed platform, users are able to perform federated queries over repositories in a transparent way, i.e. without knowing how their original queries are transformed before being actually submitted. The platform implements a facility for plugging in new repositories and for creating, with the support of general purpose knowledge bases, knowledge graphs describing the content of each connected repository. Such knowledge graphs are then exploited for enriching queries performed by users.
2017

Schrunner Stefan, Bluder Olivia, Zernig Anja, Kaestner Andre, Kern Roman

Markov Random Fields for Pattern Extraction in Analog Wafer Test Data

International Conference on Image Processing Theory, Tools and Applications (IPTA 2017), IEEE, Montreal, Canada, 2017

Conference
In the semiconductor industry it is of paramount importance to check whether a manufactured device fulfills all quality specifications and is therefore suitable for being sold to the customer. The occurrence of specific spatial patterns within the so-called wafer test data, i.e. analog electric measurements, might point to production issues. However, the shape of these critical patterns is unknown. In this paper different kinds of process patterns are extracted from wafer test data by an image processing approach using Markov Random Field models for image restoration. The goal is to develop an automated procedure to identify visible patterns in wafer test data to improve pattern matching. This step is a necessary precondition for a subsequent root-cause analysis of these patterns. The developed pattern extraction algorithm yields a more accurate discrimination between distinct patterns, resulting in an improved pattern comparison relative to the original dataset. In a next step pattern classification will be applied to improve the production process control.
2017

Kern Roman, Falk Stefan, Rexha Andi

Know-Center at SemEval-2017 Task 10: Sequence Classification with the CODE Annotator

Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, Andrew McCallum, ACL, Vancouver, Canada, 2017

Conference
This paper describes our participation in SemEval-2017 Task 10, named ScienceIE (Machine Reading for Scientist). We competed in Subtasks 1 and 2, which consist respectively in identifying all the key phrases in scientific publications and labelling them with one of the three categories: Task, Process, and Material. These scientific publications are selected from the Computer Science, Material Sciences, and Physics domains. We followed a supervised approach for both subtasks by using a sequential classifier (CRF - Conditional Random Fields). For generating our solution we used a web-based application implemented in the EU-funded research project named CODE. Our system achieved an F1 score of 0.39 for Subtask 1 and 0.28 for Subtask 2.
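A hedged sketch of a CRF sequence labeller for the key phrase categories (Task, Process, Material) in a BIO scheme, using sklearn-crfsuite; the feature set and the toy training sentence are placeholders, not the CODE-based system described above.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "suffix3": word[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentences = [["We", "anneal", "the", "silicon", "wafer"]]            # toy example
labels = [["O", "B-Process", "O", "B-Material", "I-Material"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```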
2017

Seifert Christin, Bailer Werner, Orgel Thomas, Gantner Louis, Kern Roman, Ziak Hermann, Petit Albin, Schlötterer Jörg, Zwicklbauer Stefan, Granitzer Michael

Ubiquitous Access to Digital Cultural Heritage

Journal on Computing and Cultural Heritage (JOCCH) - Special Issue on Digital Infrastructure for Cultural Heritage, Part 1, Roberto Scopigno, ACM, New York, NY, US, 2017

Journal
The digitization initiatives in the past decades have led to a tremendous increase in digitized objects in the cultural heritage domain. Although digitally available, these objects are often not easily accessible for interested users because of the distributed allocation of the content in different repositories and the variety of data structures and standards. When users search for cultural content, they first need to identify the specific repository and then need to know how to search within this platform (e.g., usage of specific vocabulary). The goal of the EEXCESS project is to design and implement an infrastructure that enables ubiquitous access to digital cultural heritage content. Cultural content should be made available in the channels that users habitually visit and be tailored to their current context without the need to manually search multiple portals or content repositories. To realize this goal, open-source software components and services have been developed that can either be used as an integrated infrastructure or as modular components suitable to be integrated in other products and services. The EEXCESS modules and components comprise (i) Web-based context detection, (ii) information retrieval-based, federated content aggregation, (iii) meta-data definition and mapping, and (iv) a component responsible for privacy preservation. Various applications have been realized based on these components that bring cultural content to the user in content consumption and content creation scenarios. For example, content consumption is realized by a browser extension generating automatic search queries from the current page context and the focus paragraph and presenting related results aggregated from different data providers. A Google Docs add-on allows retrieval of relevant content aggregated from multiple data providers while collaboratively writing a document. These relevant resources can then be included in the current document either as a citation, an image, or a link (with preview) without having to disrupt the current writing task with an explicit search in the various content providers’ portals.
2017

Breitfuß Gert, Kaiser René, Kern Roman, Kowald Dominik, Lex Elisabeth, Pammer-Schindler Viktoria, Veas Eduardo Enrique

i-Know Workshops 2017

CEUR Workshop Proceedings for the i-Know 2017 conference, CEUR, Graz, Austria, 2017

Book
Proceedings of the Workshop Papers of i-Know 2017, co-located with International Conference on Knowledge Technologies and Data-Driven Business 2017 (i-Know 2017), Graz, Austria, October 11-12, 2017.
2017

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Pilot study: Ranking of textual snippets based on the writing style

Zenodo, 2017

In this pilot study, we tried to capture humans' behavior when identifying the authorship of text snippets. At first, we selected textual snippets from the introduction of scientific articles written by single authors. Later, we presented a source and four target snippets to the evaluators and asked them to rank the target snippets from the most to the least similar in terms of writing style. The dataset is composed of 66 experiments, manually checked for not containing any clear hint for the evaluators during the ranking. For each experiment, we have evaluations from three different evaluators. We present each experiment in a single line (in the CSV file), where we first give the metadata of the Source-Article (Journal, Title, Authorship, Snippet), then the metadata for the 4 target snippets (Journal, Title, Authorship, Snippet, Written From the same Author, Published in the same Journal) and the ranking given by each evaluator. This task was performed on the crowdsourcing platform CrowdFlower. The headers of the CSV are self-explanatory. In the TXT file, you can find a human-readable version of the experiment. For more information about the extraction of the data, please consider reading our paper: "Extending Scientific Literature Search by Including the Author’s Writing Style" @BIR: http://www.gesis.org/en/services/events/events-archive/conferences/ecir-workshops/ecir-workshop-2017
2016

Rexha Andi, Kröll Mark, Kern Roman

Social Media Monitoring for Companies: A 4W Summarisation Approach

European Conference on Knowledge Management, Dr. Sandra Moffett and Dr. Brendan Galbraith, Academic Conferences and Publishing International Limited, Belfast, Northern Ireland, UK, 2016

Conference
Monitoring (social) media represents one means for companies to gain access to knowledge about, for instance, competitors, products as well as markets. As a consequence, social media monitoring tools have been gaining attention to handle the amounts of data nowadays generated in social media. These tools also include summarisation services. However, most summarisation algorithms tend to focus on (i) first and last sentences respectively or (ii) sentences containing keywords. In this work we approach the task of summarisation by extracting 4W (who, when, where, what) information from (social) media texts. Presenting 4W information allows for a more compact content representation than traditional summaries. In addition, we depart from mere named entity recognition (NER) techniques to answer these four question types by including non-rigid designators, i.e. expressions which do not refer to the same thing in all possible worlds, such as “at the main square” or “leaders of political parties”. To do that, we employ dependency parsing to identify grammatical characteristics for each question type. Every sentence is then represented as a 4W block. We perform two different preliminary studies: selecting sentences that better summarise texts, achieving an F1-measure of 0.343, as well as a 4W block extraction for which we achieve F1-measures of 0.932, 0.900, 0.803 and 0.861 for the “who”, “when”, “where” and “what” categories respectively. In a next step the 4W blocks are ranked by relevance. The top three ranked blocks, for example, then constitute a summary of the entire textual passage. The relevance metric can be customised to the user’s needs, for instance ranked by up-to-dateness where the sentences’ tense is taken into account. In a user study we evaluate different ranking strategies including (i) up-to-dateness, (ii) text sentence rank, (iii) selecting the first and last sentences or (iv) coverage of named entities, i.e. based on the number of named entities in the sentence. Our 4W summarisation method presents a valuable addition to a company’s (social) media monitoring toolkit, thus supporting decision making processes.
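A simplified sketch of mapping dependency relations and entity types to 4W slots with spaCy; the rule set here is illustrative only and far cruder than the system described above (it assumes the en_core_web_sm model is installed).

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_4w(sentence):
    doc = nlp(sentence)
    block = {"who": [], "when": [], "where": [], "what": []}
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            block["who"].append(token.text)          # grammatical subject
        elif token.dep_ == "ROOT" and token.pos_ == "VERB":
            block["what"].append(token.text)         # main predicate
        elif token.ent_type_ in ("TIME", "DATE"):
            block["when"].append(token.text)
        elif token.ent_type_ in ("GPE", "LOC", "FAC"):
            block["where"].append(token.text)
    return block

print(extract_4w("Leaders of political parties met at the main square on Friday."))
```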
2016

Gursch Heimo, Körner Stefan, Krasser Hannes, Kern Roman

Parameter Forecasting for Vehicle Paint Quality Optimisation

Mensch und Computer 2016 – Workshopband, Benjamin Weyers, Anke Dittmar, Gesellschaft für Informatik e.V., Aachen, 2016

Conference
Painting a modern car involves applying many coats during a highly complex and automated process. The individual coats not only serve a decoration purpose but are also crucial for protection from damage due to environmental influences, such as rust. For an optimal paint job, many parameters have to be optimised simultaneously. A forecasting model was created, which predicts the paint flaw probability for a given set of process parameters, to help the production managers modify the process parameters to achieve an optimal result. The mathematical model was based on historical process and quality observations. Production managers who are not familiar with the mathematical concept of the model can use it via an intuitive Web-based Graphical User Interface (Web-GUI). The Web-GUI offers production managers the ability to test process parameters and forecast the expected quality. The model can be used for optimising the process parameters in terms of quality and costs.
2016

Gursch Heimo, Kern Roman

Internet of Things meets Big Data: An Infrastructure to Collect, Connect, and Analyse Sensor Data

VDE Kongress 2016: Internet der Dinge (VDE Kongress 2016), VDE Verlag GmbH, Berlin - Offenbach, Congress Center Rosengarten, Mannheim, Germany, 2016

Conference
Many different sensing, recording and transmitting platforms are offered on today’s market for Internet of Things (IoT) applications. But taking and transmitting measurements is just one part of a complete system. Long-term storage and processing of recorded sensor values are also vital for IoT applications. Big Data technologies provide a rich variety of processing capabilities to analyse the recorded measurements. In this paper an architecture for recording, searching, and analysing sensor measurements is proposed. This architecture combines existing IoT and Big Data technologies to bridge the gap between recording, transmission, and persistency of raw sensor data on one side, and the analysis of data on Hadoop clusters on the other side. The proposed framework emphasises scalability and persistence of measurements as well as easy access to the data from a variety of different data analytics tools. To achieve this, a distributed architecture is designed offering three different views on the recorded sensor readouts. The proposed architecture is not targeted at one specific use-case, but is able to provide a platform for a large number of different services.
2016

Ziak Hermann, Kern Roman

KNOW At The Social Book Search Lab 2016 Suggestion Track

CLEF 2016 Social Book Search Lab, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, CEUR Workshop Proceeding, Évora, Portugal, 2016

Conference
This work documents our approach to the Social Book Search Lab 2016, where we took part in the suggestion track. The main goal of the track was to create book recommendations for readers based only on their stated request within a forum. The forum entry contained further contextual information, like the user's catalogue of already read books and the list of example books mentioned in the user's request. The presented approach is mainly based on the metadata included in the book catalogue provided by the organizers of the task. With the help of a dedicated search index we extracted several potential book recommendations which were re-ranked by the use of an SVD based approach. Although our results did not meet our expectations, we consider it a first iteration towards a competitive solution.
2016

Rexha Andi, Kern Roman, Dragoni Mauro , Kröll Mark

Exploiting Propositions for Opinion Mining

ESWC-16 Challenge on Semantic Sentiment Analysis, Springer Link, Springer-Verlag, Crete, Greece, 2016

Conference
With different social media and commercial platforms, users express their opinion about products in a textual form. Automatically extracting the polarity (i.e. whether the opinion is positive or negative) of a user can be useful for both actors: the online platform incorporating the feedback to improve their product as well as the client who might get recommendations according to his or her preferences. Different approaches for tackling the problem have been suggested, mainly using syntactic features. The “Challenge on Semantic Sentiment Analysis” aims to go beyond word-level analysis by using semantic information. In this paper we propose a novel approach that employs the semantic information of the grammatical unit called the proposition. We try to derive the target of the review from the summary information, which serves as an input to identify the proposition in it. Our implementation relies on the hypothesis that the proposition expressing the target of the summary usually contains the main polarity information.
2016

Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman

Towards a more fine grained analysis of scientific authorship: Predicting the number of authors using stylometric features

BIR 2016 Workshop on Bibliometric-enhanced Information Retrieval, Atanassova, I.; Bertin, M.; Mayr, P., Springer, Padova, Italy, 2016

Conference
To bring bibliometrics and information retrieval closer together, we propose to add the concept of author attribution into the pre-processing of scientific publications. Presently, common bibliographic metrics often attribute the entire article to all the authors, affecting author-specific retrieval processes. We envision a more fine-grained analysis of scientific authorship by attributing particular segments to authors. To realize this vision, we propose a new feature representation of scientific publications that captures the distribution of stylometric features. In a classification setting, we then seek to predict the number of authors of a scientific article. We evaluate our approach on a data set of ~6100 PubMed articles and achieve best results by applying random forests, i.e., 0.76 precision and 0.76 recall averaged over all classes.
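A hedged sketch of the classification setting described above: stylometric feature vectors per article, number of authors as class label. The feature extraction and the two toy documents are placeholders, not the authors' feature set or the PubMed corpus.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stylometric_features(text):
    # Toy features: average sentence length, average word length, type-token ratio.
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return [
        np.mean([len(s.split()) for s in sentences]) if sentences else 0.0,
        np.mean([len(w) for w in words]) if words else 0.0,
        len(set(words)) / len(words) if words else 0.0,
    ]

articles = ["Short toy article. Two sentences here.",
            "Another toy article with a somewhat longer single sentence in it."]  # placeholders
author_counts = [2, 3]                                                            # placeholder labels

X = np.array([stylometric_features(a) for a in articles])
y = np.array(author_counts)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))  # with a real corpus, evaluate via cross-validation instead
```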
2016

Ziak Hermann, Rexha Andi, Kern Roman

KNOW At The Social Book Search Lab 2016 Mining Track

CLEF 2016 Social Book Search Lab, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, Springer, Évora, Portugal, 2016

Conference
This paper describes our system for the mining task of the Social Book Search Lab in 2016. The track consisted of two tasks: the classification of book request postings and the task of linking book identifiers with references mentioned within the text. For the classification task we used text mining features like n-grams and vocabulary size, but also included advanced features like average spelling errors found within the text. Two datasets were provided by the organizers for this task, which were evaluated separately. The second task, the linking of book titles to a work identifier, was addressed by an approach based on lookup tables. For the dataset of the first task our approach was ranked third, following two baseline approaches of the organizers, with an accuracy of 91 percent. For the second dataset we achieved second place with an accuracy of 82 percent. Our approach secured first place with an F-score of 33.50 for the second task.
2016

Pimas Oliver, Rexha Andi, Kröll Mark, Kern Roman

Profiling microblog authors using concreteness and sentiment - Know-Center at PAN 2016 author profiling

PAN 2016, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, Springer, Evora, Portugal, 2016

Conference
The PAN 2016 author profiling task is a supervised classification problem on cross-genre documents (tweets, blog and social media posts). Our system makes use of concreteness, sentiment and syntactic information present in the documents. We train a random forest model to identify the gender and age of a document's author. We report the evaluation results received by the shared task.
2016

Kern Roman, Klampfl Stefan, Rexha Andi

Identifying Referenced Text in Scientific Publications by Summarisation and Classification Techniques

BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries, G. Cabanac, Muthu Kumar Chandrasekaran, Ingo Frommholz , Kokil Jaidka, Min-Yen Kan, Philipp Mayr, Dietmar Wolfram, ACM, New Jersey, USA, 2016

Conference
This report describes our contribution to the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016), which asked to identify the relevant text span in a reference paper that corresponds to a citation in another document that cites this paper. We developed three different approaches based on summarisation and classification techniques. First, we applied a modified version of an unsupervised summarisation technique, TextSentenceRank, to the reference document, which incorporates the similarity of sentences to the citation on a textual level. Second, we employed classification to select from candidates previously extracted through the original TextSentenceRank algorithm. Third, we used unsupervised summarisation of the relevant sub-part of the document that was previously selected in a supervised manner.
2016

Dragoni Mauro, Rexha Andi, Kröll Mark, Kern Roman

Polarity Classification for Target Phrases in Tweets: A Word2Vec approach

The Semantic Web, ESWC 2016 Satellite Events, ESWC 2016, Springer-Verlag, Crete, Greece, 2016

Conference
Twitter is one of the most popular micro-blogging services on the web. The service allows sharing, interaction and collaboration via short, informal and often unstructured messages called tweets. Polarity classification of tweets refers to the task of assigning a positive or a negative sentiment to an entire tweet. Quite similar is predicting the polarity of a specific target phrase, for instance @Microsoft or #Linux, which is contained in the tweet. In this paper we present a Word2Vec approach to automatically predict the polarity of a target phrase in a tweet. In our classification setting, we thus do not have any polarity information but use only semantic information provided by a Word2Vec model trained on Twitter messages. To evaluate our feature representation approach, we apply well-established classification algorithms such as the Support Vector Machine and Naive Bayes. For the evaluation we used the SemEval 2016 Task #4 dataset. Our approach achieves F1-measures of up to ~90% for the positive class and ~54% for the negative class without using polarity information about single words.
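A minimal sketch of the feature idea: represent a phrase by averaged Word2Vec vectors and train a standard classifier on top. The toy corpus, labels and parameters below are placeholders (gensim 4.x API assumed), not the SemEval setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

tweets = [["i", "love", "linux", "on", "my", "laptop"],
          ["microsoft", "update", "broke", "my", "machine"]]   # toy, tokenised
labels = [1, 0]                                                # 1 = positive towards the target

# Train a tiny Word2Vec model on the toy corpus (min_count=1 only for the toy data).
w2v = Word2Vec(sentences=tweets, vector_size=50, min_count=1, seed=0)

def phrase_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.array([phrase_vector(t) for t in tweets])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X))
```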
2016

Falk Stefan, Rexha Andi, Kern Roman

Know-Center at SemEval-2016 Task 5: Using Word Vectors with Typed Dependencies for Opinion Target Expression Extraction

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), ACL Anthology, San Diego, USA, 2016

Conference
This paper describes our participation in SemEval-2016 Task 5 for Subtask 1, Slot 2. The challenge demands to find domain-specific target expressions on sentence level that refer to reviewed entities. The detection of target words is achieved by using word vectors and their grammatical dependency relationships to classify each word in a sentence as target or non-target. A heuristic-based function then expands the classified target words to the whole target phrase. Our system achieved an F1 score of 56.816% for this task.
2016

Horn Christopher, Gursch Heimo, Kern Roman, Cik Michael

QZTool – Automatically generated Origin-Destination Matrices from Cell Phone Trajectories

Advances in The Human Side of Service Engineering: Proceedings of the AHFE 2016 International Conference on Human Factors and Sustainable Infrastructure, July 27-31, 2016, Walt Disney World®, Florida, USA, Jerzy Charytonowicz (series Editor), Neville A. Stanton and Steven Landry and Giuseppe Di Bucchianico and Andrea Vallicelli, Springer International Publishing, Cham, Switzerland, 2016

Conference
Models describing human travel patterns are indispensable to plan and operate road, rail and public transportation networks. For most kinds of analyses in the field of transportation planning, there is a need for origin-destination (OD) matrices, which specify the travel demands between the origin and destination zones in the network. The preparation of OD matrices is traditionally a time consuming and cumbersome task. The presented system, QZTool, reduces the necessary effort as it is capable of generating OD matrices automatically. These matrices are produced starting from floating phone data (FPD) as raw input. This raw input is processed by a Hadoop-based big data system. A graphical user interface allows for an easy usage and hides the complexity from the operator. For evaluation, we compare an FPD-based OD matrix to an OD matrix created by a traffic demand model. Results show that both matrices agree to a high degree, indicating that FPD-based OD matrices can be used to create new, or to validate or amend existing, OD matrices.
2016

Urak Günter, Ziak Hermann, Kern Roman

Do Ambiguous Words Improve Probing For Federated Search?

International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Springer-Verlag, 2016

Conference
The core approach to distributed knowledge bases is federated search. Two of the main challenges for federated search are the source representation and source selection. Different solutions to these problems were proposed in the literature. Within this work we present our novel approach for query-based sampling by relying on knowledge bases. We show the basic correctness of our approach and we came to the insight that the ambiguity of the probing terms has just a minor impact on the representation of the collection. Finally, we show that our method can be used to distinguish between niche and encyclopedic knowledge bases.
2016

Klampfl Stefan, Kern Roman

Reconstructing the Logical Structure of a Scientific Publication using Machine Learning

Semantic Web Challenges, Communications in Computer and Information Science, Springer Link, Springer-Verlag, 2016

Conference
Semantic enrichment of scientific publications has an increasing impact on scholarly communication. This document describes our contribution to Semantic Publishing Challenge 2016, which aims at investigating novel approaches for improving scholarly publishing through semantic technologies. We participated in Task 2 of this challenge, which requires the extraction of information from the content of a paper given as PDF. The extracted information allows answering queries about the paper’s internal organisation and the context in which it was written. We build upon our contribution to the previous edition of the challenge, where we categorised meta-data, such as authors and affiliations, and extracted funding information. Here we use unsupervised machine learning techniques in order to extend the analysis of the logical structure of the document as to identify section titles and captions of figures and tables. Furthermore, we employ clustering techniques to create the hierarchical table of contents of the article. Our system is modular in nature and allows a separate training of different stages on different training sets.
2016

Gursch Heimo, Ziak Hermann, Kröll Mark, Kern Roman

Context-Driven Federated Recommendations for Knowledge Workers

Proceedings of the 17th European Conference on Knowledge Management (ECKM), Dr. Sandra Moffett and Dr. Brendan Galbraith, Academic Conferences and Publishing International Limited, Belfast, Northern Ireland, UK, 2016

Conference
Modern knowledge workers need to interact with a large number of different knowledge sources with restricted or public access. Knowledge workers are thus burdened with the need to familiarise themselves with and query each source separately. The EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project aims at developing a recommender system providing relevant and novel content to its users. Based on the user’s work context, the EEXCESS system can either automatically recommend useful content, or support users by providing a single user interface for a variety of knowledge sources. In the design process of the EEXCESS system, recommendation quality, scalability and security were the three most important criteria. This paper investigates the scalability aspect achieved by the federated design of the EEXCESS recommender system. This means that content in different sources is not replicated but its management is done in each source individually. Recommendations are generated based on the context describing the knowledge worker’s information need. Each source offers result candidates which are merged and re-ranked into a single result list. This merging is done in a vector representation space to achieve high recommendation quality. To ensure security, user credentials can be set individually by each user for each source. Hence, access to the sources can be granted and revoked for each user and source individually. The scalable architecture of the EEXCESS system handles up to 100 requests querying up to 10 sources in parallel without notable performance deterioration. The re-ranking and merging of results have a smaller influence on the system's responsiveness than the average source response rates. The EEXCESS recommender system offers a common entry point for knowledge workers to a variety of different sources with only marginally lower response times than the individual sources on their own. Hence, familiarisation with individual sources and their query languages is not necessary.
2016

Kern Roman, Ziak Hermann

Query Splitting For Context-Driven Federated Recommendations

Database and Expert Systems Applications (DEXA), 2016 27th International Workshop on, IEEE, Porto, Portugal, 2016

Conference
Context-driven query extraction for content-based recommender systems faces the challenge of dealing with queries of multiple topics. In contrast to manually entered queries, this is a more frequent problem for automatically generated queries, for instance if the information need is inferred indirectly via the user's current context. Especially for federated search systems, where connected knowledge sources might react vastly differently to such queries, an algorithmic way of dealing with such queries is of high importance. One such method is to split mixed queries into their individual subtopics. To gain insight into how a multi-topic query can be split into its subtopics we conducted an evaluation where we compared a naive approach against more complex approaches based on word embedding techniques: one created using Word2Vec and one created using GloVe. To evaluate these approaches we used the Webis-QSeC-10 query set, consisting of about 5,000 multi-term queries. Queries of this set were concatenated and passed through the algorithms with the goal to split those queries again. The naive approach splits the queries into several groups, according to the number of joined queries, assuming the topics are of equal query term count. In the case of the Word2Vec and GloVe based approaches we relied on already pre-trained datasets: the Google News model and a model trained on a Wikipedia dump and the English Gigaword newswire text archive. The query term vectors resulting from these datasets were grouped into subtopics using k-Means clustering. We show that a clustering approach based on word vectors achieves better results, in particular when the query is not in topical order. Furthermore we could demonstrate the importance of the underlying dataset.
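A sketch of the clustering idea for splitting a mixed query into subtopics: embed each query term and group the terms with k-means. A tiny hand-made embedding lookup stands in for the pre-trained Google News / Wikipedia+Gigaword models mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

embeddings = {                       # placeholder vectors, 3-dimensional for brevity
    "jaguar": [0.9, 0.1, 0.0], "engine": [0.8, 0.2, 0.1],
    "python": [0.1, 0.9, 0.2], "pandas": [0.2, 0.8, 0.1],
}

def split_query(terms, n_subtopics=2):
    vectors = np.array([embeddings[t] for t in terms])
    cluster_ids = KMeans(n_clusters=n_subtopics, n_init=10, random_state=0).fit_predict(vectors)
    subtopics = {}
    for term, cid in zip(terms, cluster_ids):
        subtopics.setdefault(cid, []).append(term)
    return list(subtopics.values())

print(split_query(["jaguar", "engine", "python", "pandas"]))
# e.g. [['jaguar', 'engine'], ['python', 'pandas']]
```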
2016

Santos Tiago, Kern Roman

A Literature Survey of Early Time Series Classification and Deep Learning

SamI40 workshop at i-KNOW'16, 2016

Conference
This paper provides an overview of current literature on time series classification approaches, in particular of early time series classification. A very common and effective time series classification approach is the 1-Nearest Neighbor classifier, with different distance measures such as the Euclidean or dynamic time warping distances. This paper starts by reviewing these baseline methods. More recently, with the gain in popularity in the application of deep neural networks to the field of computer vision, research has focused on developing deep learning architectures for time series classification as well. The literature in the field of deep learning for time series classification has shown promising results. Early time series classification aims to classify a time series with as few temporal observations as possible, while keeping the loss of classification accuracy at a minimum. Prominent early classification frameworks reviewed by this paper include, but are not limited to, ECTS, RelClass and ECDIRE. These works have shown that early time series classification may be feasible and performant, but they also show room for improvement.
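A plain-numpy sketch of the baseline mentioned above, 1-nearest-neighbour classification with a dynamic time warping distance; it is not optimised (no warping window, no lower bounds) and the toy series are made up.

```python
import numpy as np

def dtw_distance(a, b):
    # Classic DTW with squared point-wise costs and full warping path.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

def one_nn_classify(query, train_series, train_labels):
    distances = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(distances))]

train = [np.sin(np.linspace(0, 6, 50)), np.linspace(0, 1, 50)]
labels = ["periodic", "trend"]
print(one_nn_classify(np.sin(np.linspace(0, 6, 60)), train, labels))  # "periodic"
```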
2016

Mutlu Belgin, Sabol Vedran, Gursch Heimo, Kern Roman

From Data to Visualisations and Back: Selecting Visualisations Based on Data and System Design Considerations

arXiv, 2016

Conference
Graphical interfaces and interactive visualisations are typical mediators between human users and data analytics systems. HCI researchers and developers have to be able to understand both human needs and back-end data analytics. Participants of our tutorial will learn how visualisation and interface design can be combined with data analytics to provide better visualisations. In the first of three parts, the participants will learn about visualisations and how to appropriately select them. In the second part, restrictions and opportunities associated with different data analytics systems will be discussed. In the final part, the participants will have the opportunity to develop visualisations and interface designs under given scenarios of data and system settings.
2016

Rexha Andi, Dragoni Mauro, Kern Roman, Kröll Mark

An Information Retrieval Based Approach for Multilingual Ontology Matching

International Conference on Applications of Natural Language to Information Systems, Métais E., Meziane F., Saraee M., Sugumaran V., Vadera S. , Springer , Salford, UK, 2016

Conference
Ontology matching in a multilingual environment consists of finding alignments between ontologies modeled by using more than one language. Such a research topic combines traditional ontology matching algorithms with the use of multilingual resources, services, and capabilities for easing multilingual matching. In this paper, we present a multilingual ontology matching approach based on Information Retrieval (IR) techniques: ontologies are indexed through an inverted index algorithm and candidate matches are found by querying such indexes. We also exploit the hierarchical structure of the ontologies by adopting the PageRank algorithm for our system. The approaches have been evaluated using a set of domain-specific ontologies belonging to the agricultural and medical domain. We compare our results with existing systems following an evaluation strategy closely resembling a recommendation scenario. The version of our system using PageRank showed an increase in performance in our evaluations.
2016

Pimas Oliver, Klampfl Stefan, Kohl Thomas, Kern Roman, Kröll Mark

Generating Tailored Classification Schemas for German Patents

21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Springer-Verlag, Salford, UK, 2016

Conference
Patents and patent applications are important parts of a company's intellectual property. Thus, companies put a lot of effort in designing and maintaining an internal structure for organizing their own patent portfolios, but also in keeping track of competitors' patent portfolios. Yet, official classification schemas offered by patent offices (i) are often too coarse and (ii) are not mappable, for instance, to a company's functions, applications, or divisions. In this work, we present a first step towards generating tailored classification. To automate the generation process, we apply key term extraction and topic modelling algorithms to 2,131 publications of German patent applications. To infer categories, we apply topic modelling to the patent collection. We evaluate the mapping of the topics found via the Latent Dirichlet Allocation method to the classes present in the patent collection as assigned by the domain expert.
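An illustrative sketch of the topic-modelling step, Latent Dirichlet Allocation over a bag-of-words representation; the three toy documents and the number of topics are placeholders, not the 2,131-patent collection used in the paper (scikit-learn 1.x assumed).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

patent_texts = [
    "battery cell charging circuit voltage control",
    "combustion engine fuel injection valve timing",
    "lithium battery electrode coating process",
]  # placeholder documents

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(patent_texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top terms per inferred topic as a rough, human-readable category label.
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_id}: {top}")
```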
2015

Schulze Gunnar, Horn Christopher, Kern Roman

Map-Matching Cell Phone Trajectories of Low Spatial and Temporal Accuracy

2015 IEEE 18th International Conference on Intelligent Transportation Systems, IEEE, 2015

Conference
This paper presents an approach for matching cell phone trajectories of low spatial and temporal accuracy to the underlying road network. In this setting, only the position of the base station involved in a signaling event and the timestamp are known, resulting in a possible error of several kilometers. No additional information, such as signal strength, is available. The proposed solution restricts the set of admissible routes to a corridor by estimating the area within which a user is allowed to travel. The size and shape of this corridor can be controlled by various parameters to suit different requirements. The computed area is then used to select road segments from an underlying road network, for instance OpenStreetMap. These segments are assembled into a search graph, which additionally takes the chronological order of observations into account. A modified Dijkstra algorithm is applied for finding admissible candidate routes, from which the best one is chosen. We performed a detailed evaluation of 2249 trajectories with an average sampling time of 260 seconds. Our results show that, in urban areas, on average more than 44% of each trajectory are matched correctly. In rural and mixed areas, this value increases to more than 55%. Moreover, an in-depth evaluation was carried out to determine the optimal values for the tunable parameters and their effects on the accuracy, matching ratio and execution time. The proposed matching algorithm facilitates the use of large volumes of cell phone data in Intelligent Transportation Systems, in which accurate trajectories are desirable.
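A very small sketch of the corridor idea with networkx: keep only road nodes within a maximum distance of the observed base-station positions, then search a route on the remaining subgraph. Real road data, the ellipse-shaped corridor and the modified Dijkstra of the paper are not reproduced; graph, positions and distances below are made up.

```python
import networkx as nx

def within_corridor(node_pos, observations, max_dist):
    # Manhattan distance as a crude stand-in for a geographic distance check.
    return any(abs(node_pos[0] - ox) + abs(node_pos[1] - oy) <= max_dist
               for ox, oy in observations)

def candidate_route(road_graph, positions, observations, start, end, max_dist=2.0):
    allowed = [n for n in road_graph.nodes
               if within_corridor(positions[n], observations, max_dist)]
    corridor = road_graph.subgraph(allowed)            # restrict the search space
    return nx.dijkstra_path(corridor, start, end, weight="length")

G = nx.Graph()
G.add_edge("a", "b", length=1.0)
G.add_edge("b", "c", length=1.0)
G.add_edge("a", "c", length=5.0)
pos = {"a": (0, 0), "b": (1, 0), "c": (2, 0)}
print(candidate_route(G, pos, observations=[(0, 0), (2, 0)], start="a", end="c"))
# ['a', 'b', 'c'] -- the shorter route through the corridor
```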
2015

Ziak Hermann, Kern Roman

Evaluation of Pseudo Relevance Feedback Techniques for Cross Vertical Aggregated Search

6th International Conference of the CLEF Association, CLEF'15, Toulouse, France, September 8-11, 2015, Proceedings, Springer, 2015

Conference
Cross vertical aggregated search is a special form of meta search, where multiple search engines from different domains and with varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how to best evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights on how different users compare the quality of search result lists. In regard to the pseudo relevance feedback we found that using Wikipedia as knowledge base generally provides a benefit, except for entity-centric queries, which target single persons or organisations. Our results will help steer the development of cross vertical aggregated search engines and will also help to guide large scale evaluation strategies, for example using crowd sourcing techniques.
2015

Pimas Oliver, Kröll Mark, Kern Roman

Know-Center at PAN 2015 author identification

Lecture Notes in Computer Science, Working Notes Papers of the CLEF 2015 Evaluation Labs, Springer Link, Toulouse, France, 2015

Conference
Our system for the PAN 2015 authorship verification challenge is based upon a two step pre-processing pipeline. In the first step we extract different features that observe stylometric properties, grammatical characteristics and pure statistical features. In the second step of our pre-processing we merge all those features into a single meta feature space. We train an SVM classifier on the generated meta features to verify the authorship of an unseen text document. We report the results from the final evaluation as well as on the training datasets.
2015

Gursch Heimo, Ziak Hermann, Kern Roman

Unified Information Access for Knowledge Workers via a Federated Recommender System

Mensch und Computer 2015 – Workshopband, Anette Weisbecker, Michael Burmester, Albrecht Schmidt, De Gruyter Oldenbourg, Berlin, 2015

Konferenz
The objective of the EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project is to develop a system that can automatically recommend helpful and novel content to knowledge workers. The EEXCESS system can be integrated into existing software user interfaces as plugins, which extract topics and suggest relevant material automatically. This recommendation process simplifies the information gathering of knowledge workers. Recommendations can also be triggered manually via web frontends. EEXCESS hides the potentially large number of knowledge sources by semi- or fully automatically providing content suggestions. Hence, users only have to use the EEXCESS system rather than each source individually. For each user, relevant sources can be set or auto-selected individually. EEXCESS offers open interfaces, making it easy to connect additional sources and user program plugins.
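The general pattern of hiding several knowledge sources behind a single recommendation call could look roughly like the sketch below, which fans an extracted topic out to all configured sources and merges the ranked results. The source interfaces and names are purely illustrative and do not reflect the actual EEXCESS APIs.

```python
def recommend(topic, sources, limit=5):
    """Fan the extracted topic out to all configured sources and merge results.

    sources: dict mapping source name -> callable(topic) -> list of (title, score)
    """
    merged = []
    for name, search in sources.items():
        for title, score in search(topic):
            merged.append((score, title, name))
    merged.sort(reverse=True)                 # best-scoring items first
    return [(title, name) for _, title, name in merged[:limit]]

# Illustrative stand-ins for cultural, educational and scientific sources:
sources = {
    "europeana_stub": lambda t: [("Painting about " + t, 0.9)],
    "library_stub":   lambda t: [("Book about " + t, 0.7)],
}
print(recommend("baroque architecture", sources))
```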
2015

Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman

Towards Authorship Attribution for Bibliometrics using Stylometric Features

Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, Atanassova, I.; Bertin, M.; Mayr, P., ACL Anthology, Istanbul, Turkey, 2015

Konferenz
The overwhelming majority of scientific publications are authored by multiple persons; yet, bibliographic metrics are only assigned to individual articles as single entities. In this paper, we aim at a more fine-grained analysis of scientific authorship. We therefore adapt a text segmentation algorithm to identify potential author changes within the main text of a scientific article, which we obtain by using existing PDF extraction techniques. To capture stylistic changes in the text, we employ a number of stylometric features. We evaluate our approach on a small subset of PubMed articles consisting of an approximately equal number of research articles written by a varying number of authors. Our results indicate that the more authors an article has, the more potential author changes are identified. These results can be considered as an initial step towards a more detailed analysis of scientific authorship, thereby extending the repertoire of bibliometrics.
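To illustrate the underlying idea of detecting potential author changes, the following sketch compares simple stylometric vectors of adjacent text segments and flags boundaries where the style differs strongly. The features and the threshold are illustrative; the paper adapts a proper text segmentation algorithm rather than this naive pairwise comparison.

```python
import numpy as np

def stylometric_vector(segment):
    """Very small stylometric profile of one text segment (illustrative only)."""
    words = segment.split()
    sentences = max(segment.count("."), 1)
    return np.array([
        np.mean([len(w) for w in words]) if words else 0.0,  # average word length
        len(words) / sentences,                              # average sentence length
        len(set(words)) / len(words) if words else 0.0,      # type-token ratio
    ])

def potential_author_changes(segments, threshold=1.5):
    """Return indices of boundaries between adjacent segments with a strong style shift."""
    changes = []
    for i in range(len(segments) - 1):
        a = stylometric_vector(segments[i])
        b = stylometric_vector(segments[i + 1])
        if np.linalg.norm(a - b) > threshold:
            changes.append(i)
    return changes
```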
2015

Klampfl Stefan, Kern Roman

Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications

Semantic Web Evaluation Challenges. SemWebEval 2015 at ESWC 2015, Portorož, Slovenia, May 31 – June 4, 2015, Revised Selected Papers, Gandon, F.; Cabrio, E.; Stankovic, M.; Zimmermann, A. , Springer International Publishing, 2015

Konferenz
Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors' affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information we have developed a processing pipeline that analyses the structure of a PDF document incorporating a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different meta-data categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then utilised to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.
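The supervised block-classification step can be pictured with the following sketch, which maps simple block features (font size, position, token counts) to metadata categories using a standard classifier. The features, labels and training examples are invented for illustration and do not reflect the actual models or feature set of the pipeline.

```python
from sklearn.ensemble import RandomForestClassifier

def block_features(block):
    """Very simple features of a contiguous text block (illustrative only)."""
    text = block["text"]
    return [block["font_size"], block["y_position"],
            len(text.split()), text.count("@"), text.count(",")]

# Hypothetical training blocks with labelled metadata categories
training_blocks = [
    {"text": "Jane Doe, John Smith", "font_size": 11, "y_position": 120},
    {"text": "University of Examples, Graz", "font_size": 9, "y_position": 140},
    {"text": "Abstract. We present ...", "font_size": 9, "y_position": 200},
]
labels = ["authors", "affiliation", "body"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([block_features(b) for b in training_blocks], labels)
print(clf.predict([block_features(
    {"text": "Max Mustermann, Erika Musterfrau", "font_size": 11, "y_position": 118})]))
```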
2015

Horn Christopher, Kern Roman

Deriving Public Transportation Timetables with Large-Scale Cell Phone Data

Procedia Computer Science, 2015

Konferenz
In this paper, we propose an approach to deriving public transportation timetables of a region (i.e. country) based on (i) large-scale, non-GPS cell phone data and (ii) a dataset containing geographic information of public transportation stations. The presented algorithm is designed to work with movement data, which is scarce and has low spatial accuracy but exists in vast amounts (large-scale). Since only aggregated statistics are used, our algorithm copes well with anonymized data. Our evaluation shows that 89% of the departure times of popular train connections are correctly recalled with an allowed deviation of 5 minutes. The timetable can be used as a feature for transportation mode detection to separate public from private transport when no public timetable is available.
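A toy sketch of how recurring departure times could be recovered from aggregated, anonymized event counts: minutes of the day with sufficiently many observed events near a station are treated as departure candidates, and nearby peaks are merged. The thresholds and data layout are assumptions for illustration, not the paper's actual procedure.

```python
from collections import Counter

def departure_candidates(event_minutes, min_count=20, merge_window=5):
    """Derive candidate departure times (minutes after midnight) from aggregated
    counts of cell events observed near a station across many days."""
    counts = Counter(event_minutes)
    peaks = sorted(m for m, c in counts.items() if c >= min_count)
    merged = []
    for minute in peaks:
        if merged and minute - merged[-1] <= merge_window:
            continue                  # merge peaks that fall within the window
        merged.append(minute)
    return merged

# e.g. events clustered around 07:15 (minute 435) and 17:45 (minute 1065)
observed = [435] * 30 + [436] * 25 + [1065] * 40 + [600] * 3
print(departure_candidates(observed))   # -> [435, 1065]
```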
2015

Kern Roman, Frey Matthias

Efficient Table Annotation for Digital Articles

4th International Workshop on Mining Scientific Publications, D-Lib, 2015

Konferenz
Table recognition and table extraction are important tasks in information extraction, especially in the domain of scholarly communication. In this domain tables are commonplace and contain valuable information. Many different automatic approaches for table recognition and extraction exist. Common to many of these approaches is the need for ground truth datasets, to train algorithms or to evaluate the results. In this paper we present the PDF Table Annotator, a web-based tool for annotating elements and regions in PDF documents, in particular tables. The annotated data is intended to serve as a ground truth useful to machine learning algorithms for detecting table regions and table structure. To make the task of manual table annotation as convenient as possible, the tool is designed to allow an efficient annotation process that may span multiple sessions by multiple users. An evaluation is conducted where we compare our tool to three alternative ways of creating ground truth of tables in documents. Here we found that our tool overall provides an efficient and convenient way to annotate tables. In addition, our tool is particularly suitable for complex table structures, where it provided the lowest annotation time and the highest accuracy. Furthermore, our tool allows annotating tables following a logical or a functional model. Given that by the use of our tool ground truth datasets for table recognition and extraction are easier to produce, the quality of automatic table extraction should greatly benefit.
2015

Rubien Raoul, Ziak Hermann, Kern Roman

Efficient Search Result Diversification via Query Expansion Using Knowledge Bases

Proceedings of 12th International Workshop on Text-based Information Retrieval (TIR), 2015

Konferenz
Underspecified search queries can be handled via result list diversification approaches, which are often computationally complex and require longer response times. In this paper, we explore an alternative, and more efficient, way to diversify the result list based on query expansion. To that end, we used a knowledge base pseudo-relevance feedback algorithm. We compared our algorithm to IA-Select, a state-of-the-art diversification method, using its intent-aware version of the NDCG (Normalized Discounted Cumulative Gain) metric. The results indicate that our approach can guarantee a similar extent of diversification as IA-Select. In addition, we showed that the supported query language of the underlying search engines plays an important role in query-expansion-based diversification. Therefore, query expansion may be an alternative when result diversification is not feasible, for example in federated search systems where latency and the quantity of handled search results are critical issues.
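The contrast to post-retrieval diversification can be sketched as follows: instead of re-ranking a result list (as IA-Select does), the query itself is expanded so that several aspects are covered up front. This presumes the underlying engine supports a boolean OR operator, which is exactly the kind of query-language dependency noted above; the function name and aspect terms are illustrative, not taken from the paper.

```python
def diversify_by_expansion(query, aspect_terms, max_terms=4, operator=" OR "):
    """Build one expanded query that covers several query aspects up front,
    instead of diversifying the result list after retrieval."""
    clauses = [query] + [f"({query} {term})" for term in aspect_terms[:max_terms]]
    return operator.join(clauses)

# Aspect terms would come from knowledge-base pseudo-relevance feedback.
print(diversify_by_expansion("jaguar", ["car", "animal", "operating system"]))
# -> jaguar OR (jaguar car) OR (jaguar animal) OR (jaguar operating system)
```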