Urak Günter, Ziak Hermann, Kern Roman
Source Selection of Long Tail Sources for Federated Search in an Uncooperative Setting
The task of federated search is to combine results from multiple knowledge bases into a single, aggregated result list, where the items typically range from textual documents toimages. These knowledge bases are also called sources, and the process of choosing the actual subset of sources for a given query is called source selection. A scenario wherethese sources do not provide information about their content in a standardized way is called uncooperative setting. In our work we focus on knowledge bases providing long tail content, i.e., rather specialized sources offering a low number of relevant documents. These sources are often neglected in favor of more popular knowledge sources, both by today’s Web users as well as by most of the existing source selection techniques. We propose a system for source selection which i) could be utilized to automatically detect long tail knowledge bases and ii) generates aggregated search results that tend to incorporate results from these long tail sources. Starting from the current state-of-the-art we developed components that allowed to adjust the amount of contribution from long tail sources. Our evaluation is conducted on theTREC 2014 Federated WebSearch dataset. As this dataset also favors the most popular sources, systems that include many long tail knowledge bases will yield low performancemeasures. Here, we propose a system where just a few relevant long tail sources are integrated into the list of more popular knowledge bases. Additionally, we evaluated the implications of an uncooperative setting, where only minimal information of the sources is available to the federated search system. Here a severe drop in performance is observed once the share of long tail sources is higher than 40%. Our work is intended to steer the development of federated search systems that aim at increasing the diversity and coverage of the aggregated search result.