Ziak Hermann, Kern Roman
2016
Within this work represents the documentation of our ap-proach on the Social Book Search Lab 2016 where we took part in thesuggestion track. The main goal of the track was to create book recom-mendation for readers only based on their stated request within a forum.The forum entry contained further contextual information, like the user’scatalogue of already read books and the list of example books mentionedin the user’s request. The presented approach is mainly based on themetadata included in the book catalogue provided by the organizers ofthe task. With the help of a dedicated search index we extracted severalpotential book recommendations which were re-ranked by the use of anSVD based approach. Although our results did not meet our expectationwe consider it as first iteration towards a competitive solution.
Gursch Heimo, Ziak Hermann, Kröll Mark, Kern Roman
2016
Modern knowledge workers need to interact with a large number of different knowledge sources with restricted or public access. Knowledge workers are thus burdened with the need to familiarise and query each source separately. The EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project aims at developing a recommender system providing relevant and novel content to its users. Based on the user’s work context, the EEXCESS system can either automatically recommend useful content, or support users by providing a single user interface for a variety of knowledge sources. In the design process of the EEXCESS system, recommendation quality, scalability and security where the three most important criteria. This paper investigates the scalability aspect achieved by federated design of the EEXCESS recommender system. This means that, content in different sources is not replicated but its management is done in each source individually. Recommendations are generated based on the context describing the knowledge worker’s information need. Each source offers result candidates which are merged and re-ranked into a single result list. This merging is done in a vector representation space to achieve high recommendation quality. To ensure security, user credentials can be set individually by each user for each source. Hence, access to the sources can be granted and revoked for each user and source individually. The scalable architecture of the EEXCESS system handles up to 100 requests querying up to 10 sources in parallel without notable performance deterioration. The re-ranking and merging of results have a smaller influence on the system's responsiveness than the average source response rates. The EEXCESS recommender system offers a common entry point for knowledge workers to a variety of different sources with only marginally lower response times as the individual sources on their own. Hence, familiarisation with individual sources and their query language is not necessary.
Kern Roman, Ziak Hermann
2016
Context-driven query extraction for content-basedrecommender systems faces the challenge of dealing with queriesof multiple topics. In contrast to manually entered queries, forautomatically generated queries this is a more frequent problem. For instances if the information need is inferred indirectly viathe user's current context. Especially for federated search systemswere connected knowledge sources might react vastly differentlyon such queries, an algorithmic way how to deal with suchqueries is of high importance. One such method is to split mixedqueries into their individual subtopics. To gain insight how amulti topic query can be split into its subtopics we conductedan evaluation where we compared a naive approach against amore complex approaches based on word embedding techniques:One created using Word2Vec and one created using GloVe. Toevaluate these two approaches we used the Webis-QSeC-10 queryset, consisting of about 5,000 multi term queries. Queries of thisset were concatenated and passed through the algorithms withthe goal to split those queries again. Hence the naive approach issplitting the queries into several groups, according to the amountof joined queries, assuming the topics are of equal query termcount. In the case of the Word2Vec and GloVe based approacheswe relied on the already pre-trained datasets. The Google Newsmodel and a model trained with a Wikipedia dump and theEnglish Gigaword newswire text archive. The out of this datasetsresulting query term vectors were grouped into subtopics usinga k-Means clustering. We show that a clustering approach basedon word vectors achieves better results in particular when thequery is not in topical order. Furthermore we could demonstratethe importance of the underlying dataset.
Urak Günter, Ziak Hermann, Kern Roman
2016
The core approach to distributed knowledge bases is federated search. Two of the main challenges for federated search are the source representation and source selection. Different solutions to these problems were proposed in the literature. Within this work we present our novel approach for query-based sampling by relying on knowledge bases. We show the basic correctness of our approach and we came to the insight that the ambiguity of the probing terms has just a minor impact on the representation of the collection. Finally, we show that our method can be used to distinguish between niche and encyclopedic knowledge bases.
Ziak Hermann, Rexha Andi, Kern Roman
2016
This paper describes our system for the mining task of theSocial Book Search Lab in 2016. The track consisted of two task, theclassification of book request postings and the task of linking book identifierswith references mentioned within the text. For the classificationtask we used text mining features like n-grams and vocabulary size, butalso included advanced features like average spelling errors found withinthe text. Here two datasets were provided by the organizers for this taskwhich were evaluated separately. The second task, the linking of booktitles to a work identifier, was addressed by an approach based on lookuptables. For the dataset of the first task our approach was ranked third,following two baseline approaches of the organizers with an accuracy of91 percent. For the second dataset we achieved second place with anaccuracy of 82 percent. Our approach secured the first place with anF-score of 33.50 for the second task.