Publications

Here you will find scientific publications written by Know-Center staff.

2018

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Authorship Identification of Documents with High Content Similarity

Scientometrics, Wolfgang Glänzel, Springer Link, 2018

Journal
The goal of our work is inspired by the task of associating segments of text to their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior accordingly. Unlike the majority of the work done in this field (i.e., authorship attribution, plagiarism detection, etc.), which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features, while in the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (i) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (ii) assist forensic experts or linguists to create profiles of writers, (iii) support intelligence applications to analyze aggressive and threatening messages and (iv) help editor conformity by adhering to, for instance, journal specific writing style.
2018

Urak Günter, Ziak Hermann, Kern Roman

Source Selection of Long Tail Sources for Federated Search in an Uncooperative Setting

SAC, 2018

Conference
The task of federated search is to combine results from multiple knowledge bases into a single, aggregated result list, where the items typically range from textual documents to images. These knowledge bases are also called sources, and the process of choosing the actual subset of sources for a given query is called source selection. A scenario where these sources do not provide information about their content in a standardized way is called an uncooperative setting. In our work we focus on knowledge bases providing long tail content, i.e., rather specialized sources offering a low number of relevant documents. These sources are often neglected in favor of more popular knowledge sources, both by today’s Web users and by most of the existing source selection techniques. We propose a system for source selection which i) could be utilized to automatically detect long tail knowledge bases and ii) generates aggregated search results that tend to incorporate results from these long tail sources. Starting from the current state-of-the-art, we developed components that allow adjusting the amount of contribution from long tail sources. Our evaluation is conducted on the TREC 2014 Federated Web Search dataset. As this dataset also favors the most popular sources, systems that include many long tail knowledge bases will yield low performance measures. Here, we propose a system where just a few relevant long tail sources are integrated into the list of more popular knowledge bases. Additionally, we evaluated the implications of an uncooperative setting, where only minimal information about the sources is available to the federated search system. Here a severe drop in performance is observed once the share of long tail sources is higher than 40%. Our work is intended to steer the development of federated search systems that aim at increasing the diversity and coverage of the aggregated search result.
2017

Ziak Hermann, Kern Roman

Evaluation of Contextualization and Diversification Approaches in Aggregated Search

TIR @ DEXA International Conference on Database and Expert Systems Applications, 2017

Conference
The combination of different knowledge bases in the field of information retrieval is called federated or aggregated search. It has several benefits over single source retrieval but poses some challenges as well. This work focuses on the challenge of result aggregation, especially in a setting where the final result list should include a certain degree of diversity and serendipity. Both concepts have been shown to have an impact on how users perceive an information retrieval system. In particular, we want to assess whether common procedures for result list aggregation can be utilized to introduce diversity and serendipity. Furthermore, we study whether blocking or interleaving for result aggregation yields better results. In a cross vertical aggregated search the so-called verticals could be news, multimedia content or text. Block ranking is one approach to combine such heterogeneous results. It relies on the idea that these verticals are combined into a single result list as blocks of several adjacent items. An alternative approach is interleaving. Here the verticals are blended into one result list on an item by item basis, i.e. adjacent items in the result list may come from different verticals. To generate the diverse and serendipitous results we relied on a query reformulation technique which we showed in previous work to be beneficial for generating diversified results. To conduct this evaluation we created a dedicated dataset. This dataset served as a basis for three different evaluation settings on a crowd sourcing platform, with over 300 participants. Our results show that query based diversification can be adapted to generate serendipitous results in a similar manner. Further, we discovered that both approaches, interleaving and block ranking, appear to be beneficial for introducing diversity and serendipity, though it seems that queries benefit either from one approach or the other, but not from both.
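The two aggregation strategies named in this abstract can be illustrated with a minimal Python sketch. The function names `block_ranking` and `interleave`, the block size of two, and the example verticals are assumptions made for illustration only; they are not taken from the paper, which may implement both strategies differently.

```python
from itertools import zip_longest


def block_ranking(verticals, block_size=2):
    """Concatenate the verticals as blocks of up to block_size adjacent items."""
    merged = []
    for items in verticals.values():
        merged.extend(items[:block_size])
    return merged


def interleave(verticals):
    """Blend the verticals item by item (round-robin over all verticals)."""
    merged = []
    for row in zip_longest(*verticals.values()):
        # zip_longest pads shorter verticals with None; skip the padding.
        merged.extend(item for item in row if item is not None)
    return merged


# Hypothetical result lists, one per vertical, already ranked.
verticals = {
    "news": ["n1", "n2", "n3"],
    "images": ["i1", "i2"],
    "text": ["t1", "t2", "t3"],
}
blocks = block_ranking(verticals)
blended = interleave(verticals)
```

In the block-ranked list, adjacent items tend to share a vertical, whereas in the interleaved list neighbouring items usually come from different verticals.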
2017

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Extending Scientific Literature Search by Including the Author’s Writing Style

Fifth Workshop on Bibliometric-enhanced Information Retrieval, Atanassova, I.; Bertin, M.; Mayr, P., Springer, Aberdeen, UK, 2017

Conference
Our work is motivated by the idea of extending the retrieval of related scientific literature to cases where the relatedness also incorporates the writing style of individual scientific authors. Therefore we conducted a pilot study to answer the question whether humans can identify authorship once the topological clues have been removed. As a first result, we found that this task is challenging, even for humans. We also found some agreement between the annotators. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis. Here, we compared the decisions against a number of topological and stylometric features. The outcome of our work should help to improve automatic authorship identification algorithms and to shape potential follow-up studies.
2017

Rexha Andi, Kern Roman, Ziak Hermann, Dragoni Mauro

A semantic federated search engine for domain-specific document retrieval

SAC '17 Proceedings of the Symposium on Applied Computing, Sung Y. Shin, Dongwan Shin, Maria Lencastre, ACM, Marrakech, Morocco, 2017

Conference
Retrieval of domain-specific documents became attractive for the Semantic Web community due to the possibility of integrating classic Information Retrieval (IR) techniques with semantic knowledge. Unfortunately, the gap between the construction of a full semantic search engine and the possibility of exploiting a repository of ontologies covering all possible domains is far from being filled. Recent solutions focused on the aggregation of different domain-specific repositories managed by third-parties. In this paper, we present a semantic federated search engine developed in the context of the EEXCESS EU project. Through the developed platform, users are able to perform federated queries over repositories in a transparent way, i.e. without knowing how their original queries are transformed before being actually submitted. The platform implements a facility for plugging new repositories and for creating, with the support of general purpose knowledge bases, knowledge graphs describing the content of each connected repository. Such knowledge graphs are then exploited for enriching queries performed by users.
2017

Seifert Christin, Bailer Werner, Orgel Thomas, Gantner Louis, Kern Roman, Ziak Hermann, Petit Albin, Schlötterer Jörg, Zwicklbauer Stefan, Granitzer Michael

Ubiquitous Access to Digital Cultural Heritage

Journal on Computing and Cultural Heritage (JOCCH) - Special Issue on Digital Infrastructure for Cultural Heritage, Part 1, Roberto Scopign, ACM, New York, NY, US, 2017

Journal
The digitization initiatives in the past decades have led to a tremendous increase in digitized objects in the cultural heritage domain. Although digitally available, these objects are often not easily accessible for interested users because of the distributed allocation of the content in different repositories and the variety in data structure and standards. When users search for cultural content, they first need to identify the specific repository and then need to know how to search within this platform (e.g., usage of specific vocabulary). The goal of the EEXCESS project is to design and implement an infrastructure that enables ubiquitous access to digital cultural heritage content. Cultural content should be made available in the channels that users habitually visit and be tailored to their current context without the need to manually search multiple portals or content repositories. To realize this goal, open-source software components and services have been developed that can either be used as an integrated infrastructure or as modular components suitable to be integrated in other products and services. The EEXCESS modules and components comprise (i) Web-based context detection, (ii) information retrieval-based, federated content aggregation, (iii) meta-data definition and mapping, and (iv) a component responsible for privacy preservation. Various applications have been realized based on these components that bring cultural content to the user in content consumption and content creation scenarios. For example, content consumption is realized by a browser extension generating automatic search queries from the current page context and the focus paragraph and presenting related results aggregated from different data providers. A Google Docs add-on allows retrieval of relevant content aggregated from multiple data providers while collaboratively writing a document. These relevant resources can then be included in the current document either as a citation, an image, or a link (with preview) without having to disrupt the current writing task for an explicit search in various content providers’ portals.
2017

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Pilot study: Ranking of textual snippets based on the writing style

Zenodo, 2017

In this pilot study, we tried to capture humans' behavior when identifying authorship of text snippets. At first, we selected textual snippets from the introduction of scientific articles written by single authors. Later, we presented to the evaluators a source and four target snippets, and then asked them to rank the target snippets from the most to the least similar in terms of writing style. The dataset is composed of 66 experiments, manually checked to ensure that they contain no clear hint for the evaluators during the ranking. For each experiment, we have evaluations from three different evaluators. We present each experiment in a single line (in the CSV file), where we first present the metadata of the Source-Article (Journal, Title, Authorship, Snippet), then the metadata for the 4 target snippets (Journal, Title, Authorship, Snippet, Written From the same Author, Published in the same Journal), and finally the ranking given by each evaluator. This task was performed in the open source platform, Crowd Flower. The headers of the CSV are self-explanatory. In the TXT file, you can find a human-readable version of the experiment. For more information about the extraction of the data, please consider reading our paper: "Extending Scientific Literature Search by Including the Author’s Writing Style" @BIR: http://www.gesis.org/en/services/events/events-archive/conferences/ecir-workshops/ecir-workshop-2017
2016

Ziak Hermann, Rexha Andi, Kern Roman

KNOW At The Social Book Search Lab 2016 Mining Track

CLEF 2016 Social Book Search Lab, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, Springer, Évora, Portugal, 2016

Conference
This paper describes our system for the mining task of the Social Book Search Lab in 2016. The track consisted of two tasks: the classification of book request postings and the task of linking book identifiers with references mentioned within the text. For the classification task we used text mining features like n-grams and vocabulary size, but also included advanced features like average spelling errors found within the text. Here two datasets were provided by the organizers for this task, which were evaluated separately. The second task, the linking of book titles to a work identifier, was addressed by an approach based on lookup tables. For the dataset of the first task our approach was ranked third, following two baseline approaches of the organizers with an accuracy of 91 percent. For the second dataset we achieved second place with an accuracy of 82 percent. Our approach secured the first place with an F-score of 33.50 for the second task.
2016

Urak Günter, Ziak Hermann, Kern Roman

Do Ambiguous Words Improve Probing For Federated Search?

International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Springer-Verlag, 2016

Conference
The core approach to distributed knowledge bases is federated search. Two of the main challenges for federated search are the source representation and source selection. Different solutions to these problems were proposed in the literature. Within this work we present our novel approach for query-based sampling by relying on knowledge bases. We show the basic correctness of our approach and we came to the insight that the ambiguity of the probing terms has just a minor impact on the representation of the collection. Finally, we show that our method can be used to distinguish between niche and encyclopedic knowledge bases.
2016

Kern Roman, Ziak Hermann

Query Splitting For Context-Driven Federated Recommendations

Database and Expert Systems Applications (DEXA), 2016 27th International Workshop on, IEEE, Porto, Portugal, 2016

Conference
Context-driven query extraction for content-based recommender systems faces the challenge of dealing with queries of multiple topics. In contrast to manually entered queries, this is a more frequent problem for automatically generated queries, for instance if the information need is inferred indirectly via the user's current context. Especially for federated search systems, where connected knowledge sources might react vastly differently to such queries, an algorithmic way to deal with such queries is of high importance. One such method is to split mixed queries into their individual subtopics. To gain insight into how a multi topic query can be split into its subtopics, we conducted an evaluation where we compared a naive approach against two more complex approaches based on word embedding techniques: one created using Word2Vec and one created using GloVe. To evaluate these approaches we used the Webis-QSeC-10 query set, consisting of about 5,000 multi term queries. Queries of this set were concatenated and passed through the algorithms with the goal of splitting those queries again. The naive approach splits the queries into several groups, according to the number of joined queries, assuming the topics are of equal query term count. In the case of the Word2Vec and GloVe based approaches we relied on already pre-trained models: the Google News model and a model trained on a Wikipedia dump and the English Gigaword newswire text archive. The query term vectors resulting from these models were grouped into subtopics using k-Means clustering. We show that a clustering approach based on word vectors achieves better results, in particular when the query is not in topical order. Furthermore, we could demonstrate the importance of the underlying dataset.
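The naive baseline described in this abstract, splitting a concatenated query into as many contiguous groups as there were joined queries under the assumption of equal term counts, can be sketched as follows. The function name `naive_split`, the handling of leftover terms (earlier groups receive one extra term), and the example query are assumptions for illustration, not details from the paper. The embedding-based variants are not shown because they depend on pre-trained models.

```python
def naive_split(terms, num_queries):
    """Split query terms into num_queries contiguous groups of near-equal size."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    base, extra = divmod(len(terms), num_queries)
    groups, start = [], 0
    for i in range(num_queries):
        # Assumption: leftover terms are given to the first groups.
        size = base + (1 if i < extra else 0)
        groups.append(terms[start:start + size])
        start += size
    return groups


# Hypothetical concatenation of two three-term queries.
joined = "cheap flights vienna python list sort".split()
parts = naive_split(joined, 2)
```

Because the split is purely positional, it can only recover the original queries when their terms arrive in topical order, which is consistent with the abstract's observation that clustering on word vectors performs better when the query is not in topical order.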
2016

Gursch Heimo, Ziak Hermann, Kröll Mark, Kern Roman

Context-Driven Federated Recommendations for Knowledge Workers

Proceedings of the 17th European Conference on Knowledge Management (ECKM), Dr. Sandra Moffett and Dr. Brendan Galbraith, Academic Conferences and Publishing International Limited, Belfast, Northern Ireland, UK, 2016

Conference
Modern knowledge workers need to interact with a large number of different knowledge sources with restricted or public access. Knowledge workers are thus burdened with the need to familiarise themselves with and query each source separately. The EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project aims at developing a recommender system providing relevant and novel content to its users. Based on the user’s work context, the EEXCESS system can either automatically recommend useful content, or support users by providing a single user interface for a variety of knowledge sources. In the design process of the EEXCESS system, recommendation quality, scalability and security were the three most important criteria. This paper investigates the scalability aspect achieved by the federated design of the EEXCESS recommender system. This means that content in different sources is not replicated; instead, it is managed in each source individually. Recommendations are generated based on the context describing the knowledge worker’s information need. Each source offers result candidates which are merged and re-ranked into a single result list. This merging is done in a vector representation space to achieve high recommendation quality. To ensure security, user credentials can be set individually by each user for each source. Hence, access to the sources can be granted and revoked for each user and source individually. The scalable architecture of the EEXCESS system handles up to 100 requests querying up to 10 sources in parallel without notable performance deterioration. The re-ranking and merging of results have a smaller influence on the system's responsiveness than the average source response rates. The EEXCESS recommender system offers a common entry point for knowledge workers to a variety of different sources with only marginally lower response times than the individual sources on their own. Hence, familiarisation with individual sources and their query language is not necessary.
2016

Ziak Hermann, Kern Roman

KNOW At The Social Book Search Lab 2016 Suggestion Track

CLEF 2016 Social Book Search Lab, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, CEUR Workshop Proceedings, Évora, Portugal, 2016

Conference
This work documents our approach to the Social Book Search Lab 2016, where we took part in the suggestion track. The main goal of the track was to create book recommendations for readers based only on their stated request within a forum. The forum entry contained further contextual information, like the user’s catalogue of already read books and the list of example books mentioned in the user’s request. The presented approach is mainly based on the metadata included in the book catalogue provided by the organizers of the task. With the help of a dedicated search index we extracted several potential book recommendations, which were re-ranked by the use of an SVD based approach. Although our results did not meet our expectations, we consider it a first iteration towards a competitive solution.
2015

Rubien Raoul, Ziak Hermann, Kern Roman

Efficient Search Result Diversification via Query Expansion Using Knowledge Bases

Proceedings of 12th International Workshop on Text-based Information Retrieval (TIR), 2015

Conference
Underspecified search queries can be handled via result list diversification approaches, which are often computationally complex and require longer response times. In this paper, we explore an alternative and more efficient way to diversify the result list based on query expansion. To that end, we used a knowledge base pseudo-relevance feedback algorithm. We compared our algorithm to IA-Select, a state-of-the-art diversification method, using its intent-aware version of the NDCG (Normalized Discounted Cumulative Gain) metric. The results indicate that our approach can guarantee a similar extent of diversification as IA-Select. In addition, we showed that the supported query language of the underlying search engines plays an important role in the query expansion based on diversification. Therefore, query expansion may be an alternative when result diversification is not feasible, for example in federated search systems where latency and the quantity of handled search results are critical issues.
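As a rough illustration of the intent-aware NDCG metric mentioned in this abstract, the sketch below computes NDCG separately for each query intent and averages the values weighted by the intent probabilities. The function names, the log2 rank discount, the simplification of taking the ideal ranking as the re-sorted gains of the same result list, and the example data are all assumptions; the paper's evaluation setup may differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """DCG normalised by the DCG of the same gains sorted in descending order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def ndcg_ia(gains_per_intent, intent_probs):
    """Intent-aware NDCG: per-intent NDCG weighted by the intent probability."""
    return sum(p * ndcg(gains_per_intent[intent]) for intent, p in intent_probs.items())


# Hypothetical ambiguous query with two equally likely intents; each list
# holds the relevance of the ranked results with respect to that intent.
gains_per_intent = {"animal": [1, 0, 1], "car": [0, 1, 0]}
intent_probs = {"animal": 0.5, "car": 0.5}
score = ndcg_ia(gains_per_intent, intent_probs)
```

A result list that serves only one intent scores well for that intent alone, so the weighted average rewards rankings that cover several plausible interpretations of the query.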
2015

Ziak Hermann, Kern Roman

Evaluation of Pseudo Relevance Feedback Techniques for Cross Vertical Aggregated Search

6th International Conference of the CLEF Association, CLEF'15, Toulouse, France, September 8-11, 2015, Proceedings, Springer, 2015

Conference
Cross vertical aggregated search is a special form of meta search, where multiple search engines from different domains and with varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how to best evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights on how different users compare the quality of search result lists. In regard to the pseudo relevance feedback we found that using Wikipedia as knowledge base generally provides a benefit, except for entity centric queries, which target single persons or organisations. Our results will help to steer the development of cross vertical aggregated search engines and will also help to guide large scale evaluation strategies, for example using crowd sourcing techniques.
2015

Gursch Heimo, Ziak Hermann, Kern Roman

Unified Information Access for Knowledge Workers via a Federated Recommender System

Mensch und Computer 2015 – Workshopband, Anette Weisbecker, Michael Burmester, Albrecht Schmidt, De Gruyter Oldenbourg, Berlin, 2015

Conference
The objective of the EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project is to develop a system that can automatically recommend helpful and novel content to knowledge workers. The EEXCESS system can be integrated into existing software user interfaces as plugins which will extract topics and suggest the relevant material automatically. This recommendation process simplifies the information gathering of knowledge workers. Recommendations can also be triggered manually via web frontends. EEXCESS hides the potentially large number of knowledge sources by semi or fully automatically providing content suggestions. Hence, users only have to be able to use the EEXCESS system and not all sources individually. For each user, relevant sources can be set or auto-selected individually. EEXCESS offers open interfaces, making it easy to connect additional sources and user program plugins.