Gursch Heimo, Ziak Hermann, Kern Roman
2015
The objective of the EEXCESS (Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources) project is to develop a system that can automatically recommend helpful and novel content to knowledge workers. The EEXCESS system can be integrated into existing software user interfaces as plugins, which extract topics and suggest relevant material automatically. This recommendation process simplifies information gathering for knowledge workers. Recommendations can also be triggered manually via web frontends. EEXCESS hides the potentially large number of knowledge sources by providing content suggestions semi- or fully automatically. Hence, users only have to be able to use the EEXCESS system and not each source individually. For each user, relevant sources can be set manually or selected automatically. EEXCESS offers open interfaces, making it easy to connect additional sources and user program plugins.
Schulze Gunnar, Horn Christopher, Kern Roman
2015
This paper presents an approach for matching cell phone trajectories of low spatial and temporal accuracy to the underlying road network. In this setting, only the position of the base station involved in a signaling event and the timestamp are known, resulting in a possible error of several kilometers. No additional information, such as signal strength, is available. The proposed solution restricts the set of admissible routes to a corridor by estimating the area within which a user is allowed to travel. The size and shape of this corridor can be controlled by various parameters to suit different requirements. The computed area is then used to select road segments from an underlying road network, for instance OpenStreetMap. These segments are assembled into a search graph, which additionally takes the chronological order of observations into account. A modified Dijkstra algorithm is applied to find admissible candidate routes, from which the best one is chosen. We performed a detailed evaluation of 2249 trajectories with an average sampling time of 260 seconds. Our results show that, in urban areas, on average more than 44% of each trajectory is matched correctly. In rural and mixed areas, this value increases to more than 55%. Moreover, an in-depth evaluation was carried out to determine the optimal values for the tunable parameters and their effects on the accuracy, matching ratio and execution time. The proposed matching algorithm facilitates the use of large volumes of cell phone data in Intelligent Transportation Systems, in which accurate trajectories are desirable.
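A minimal sketch of the corridor idea (not the authors' implementation): a toy road graph is restricted to the nodes that fall inside an estimated travel corridor, and a route is then searched with plain Dijkstra. The coordinates, the corridor polygon, and the use of networkx and shapely are illustrative assumptions; the paper's modified Dijkstra additionally respects the chronological order of observations.

```python
# Hypothetical sketch: restrict a road graph to a travel corridor and
# search for a route with Dijkstra.  Node names, coordinates and the
# corridor polygon are made up for illustration.
import networkx as nx
from shapely.geometry import Point, Polygon

# Toy road network: node -> (lon, lat), edges weighted by length.
coords = {"a": (0.0, 0.0), "b": (1.0, 0.1), "c": (2.0, 0.0), "d": (1.0, 2.0)}
road = nx.Graph()
road.add_edge("a", "b", weight=1.0)
road.add_edge("b", "c", weight=1.0)
road.add_edge("a", "d", weight=2.2)
road.add_edge("d", "c", weight=2.2)

# Corridor estimated from the observed cell positions (here: a fixed polygon).
corridor = Polygon([(-0.5, -0.5), (2.5, -0.5), (2.5, 0.5), (-0.5, 0.5)])

# Keep only road nodes that lie inside the corridor.
admissible = [n for n, (x, y) in coords.items() if corridor.contains(Point(x, y))]
search_graph = road.subgraph(admissible)

# Plain Dijkstra on the restricted graph stands in for the modified variant.
route = nx.dijkstra_path(search_graph, "a", "c", weight="weight")
print(route)  # ['a', 'b', 'c'] -- node "d" falls outside the corridor
```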
Ziak Hermann, Kern Roman
2015
Cross-vertical aggregated search is a special form of meta search, where multiple search engines from different domains and with varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how to best evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo-relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo-relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights into how different users compare the quality of search result lists. With regard to pseudo-relevance feedback, we found that using Wikipedia as knowledge base generally provides a benefit, except for entity-centric queries, which target single persons or organisations. Our results will help steer the development of cross-vertical aggregated search engines and will also help guide large-scale evaluation strategies, for example using crowdsourcing techniques.
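As an illustration of the general pseudo-relevance feedback idea (not the exact technique of the paper), the following sketch expands a query with the highest-weighted TF-IDF terms of the top-ranked feedback documents. The query, the feedback documents, and the number of expansion terms are made up; the paper draws such terms from Wikipedia or from the federated sources themselves.

```python
# Illustrative pseudo-relevance feedback via TF-IDF term weighting.
from sklearn.feature_extraction.text import TfidfVectorizer

query = "graz museum exhibition"
top_ranked = [
    "the universalmuseum joanneum in graz hosts a large exhibition on archduke johann",
    "modern art exhibition opens at the kunsthaus graz museum",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(top_ranked)

# Average TF-IDF weight of each term across the feedback documents.
weights = tfidf.mean(axis=0).A1
terms = vectorizer.get_feature_names_out()
ranked_terms = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)

# Add the three most salient new terms to the original query.
expansion = [t for t, _ in ranked_terms if t not in query.split()][:3]
expanded_query = query + " " + " ".join(expansion)
print(expanded_query)
```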
Pimas Oliver, Kröll Mark, Kern Roman
2015
Our system for the PAN 2015 authorship verification challenge is based upon a two-step pre-processing pipeline. In the first step we extract different features that observe stylometric properties, grammatical characteristics and pure statistical features. In the second step of our pre-processing we merge all those features into a single meta feature space. We train an SVM classifier on the generated meta features to verify the authorship of an unseen text document. We report the results from the final evaluation as well as on the training datasets.
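A minimal sketch of the two-step idea, assuming synthetic feature values: the individual feature groups are concatenated into one meta feature vector and a scikit-learn SVM is trained on the result. The feature names, values and labels are invented for illustration.

```python
# Sketch of merging feature groups into a meta feature space and
# training an SVM on it.  All numbers are synthetic.
import numpy as np
from sklearn.svm import SVC

def meta_features(stylometric, grammatical, statistical):
    """Merge the individual feature groups into a single vector."""
    return np.concatenate([stylometric, grammatical, statistical])

# Toy training data: each row represents one verification problem.
X = np.array([
    meta_features([0.1, 0.3], [0.7], [12.0, 4.5]),
    meta_features([0.8, 0.2], [0.1], [30.0, 9.0]),
    meta_features([0.2, 0.4], [0.6], [14.0, 5.0]),
    meta_features([0.9, 0.1], [0.2], [28.0, 8.5]),
])
y = np.array([1, 0, 1, 0])  # 1 = same author, 0 = different author

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)
print(clf.predict([meta_features([0.15, 0.35], [0.65], [13.0, 4.8])]))
```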
Rubien Raoul, Ziak Hermann, Kern Roman
2015
Underspecified search queries can be addressed via result list diversification approaches, which are, however, often computationally complex and require longer response times. In this paper, we explore an alternative, more efficient way to diversify the result list based on query expansion. To that end, we used a knowledge base pseudo-relevance feedback algorithm. We compared our algorithm to IA-Select, a state-of-the-art diversification method, using the intent-aware version of the NDCG (Normalized Discounted Cumulative Gain) metric. The results indicate that our approach can guarantee a similar extent of diversification as IA-Select. In addition, we showed that the supported query language of the underlying search engines plays an important role in diversification based on query expansion. Therefore, query expansion may be an alternative when result diversification is not feasible, for example in federated search systems where latency and the quantity of handled search results are critical issues.
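For reference, a small sketch of the intent-aware NDCG used in such comparisons: the per-intent NDCG values of a ranked list are averaged, weighted by the intent probabilities P(t|q). The relevance grades and intent probabilities below are invented.

```python
# Sketch of intent-aware NDCG: NDCG-IA(q, k) = sum_t P(t|q) * NDCG_t(q, k).
import math

def dcg(gains):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k):
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def ndcg_ia(per_intent_gains, intent_probs, k):
    return sum(p * ndcg(per_intent_gains[t], k)
               for t, p in intent_probs.items())

# Relevance of a single ranked list with respect to two query intents.
gains = {"intent_a": [3, 0, 2, 0], "intent_b": [0, 2, 0, 1]}
probs = {"intent_a": 0.6, "intent_b": 0.4}
print(ndcg_ia(gains, probs, k=4))
```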
Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman
2015
The overwhelming majority of scientific publications are authored by multiple persons; yet, bibliographic metrics are only assigned to individual articles as single entities. In this paper, we aim at a more fine-grained analysis of scientific authorship. We therefore adapt a text segmentation algorithm to identify potential author changes within the main text of a scientific article, which we obtain by using existing PDF extraction techniques. To capture stylistic changes in the text, we employ a number of stylometric features. We evaluate our approach on a small subset of PubMed articles consisting of an approximately equal number of research articles written by a varying number of authors. Our results indicate that the more authors an article has, the more potential author changes are identified. These results can be considered an initial step towards a more detailed analysis of scientific authorship, thereby extending the repertoire of bibliometrics.
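A simplified illustration of the underlying intuition (not the adapted segmentation algorithm of the paper): stylometric vectors of consecutive text segments are compared, and a potential author change is flagged wherever adjacent segments differ strongly. The chosen features and the threshold are hypothetical.

```python
# Flag potential author changes where adjacent segments diverge stylometrically.
import numpy as np

def stylometric_vector(segment):
    words = segment.split()
    sentences = [s for s in segment.split(".") if s.strip()]
    avg_word_len = np.mean([len(w) for w in words])
    avg_sent_len = len(words) / max(len(sentences), 1)
    type_token_ratio = len(set(words)) / len(words)
    return np.array([avg_word_len, avg_sent_len, type_token_ratio])

def author_change_points(segments, threshold=1.5):
    vectors = [stylometric_vector(s) for s in segments]
    changes = []
    for i in range(1, len(vectors)):
        if np.linalg.norm(vectors[i] - vectors[i - 1]) > threshold:
            changes.append(i)  # boundary before segment i
    return changes

segments = [
    "Short words here. Simple text here. Easy to read indeed.",
    "Comparatively elaborate formulations characterise subsequent paragraphs considerably.",
]
print(author_change_points(segments))  # [1]
```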
Klampfl Stefan, Kern Roman
2015
Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors' affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information we have developed a processing pipeline that analyses the structure of a PDF document incorporating a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different meta-data categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then utilised to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.
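A structural sketch of such a modular pipeline, where each stage consumes the output of the previous one. The stage bodies are mere placeholders standing in for the unsupervised and supervised learning, heuristics, sequence classification and NER described above; names and behaviour are illustrative only.

```python
# Modular pipeline sketch: stages chained on the output of the previous stage.
from typing import Callable, List

def extract_text_blocks(chars: str) -> List[str]:
    return chars.split("\n\n")                          # stand-in for unsupervised block merging

def classify_blocks(blocks: List[str]) -> dict:
    return {"authors": blocks[:1], "body": blocks[1:]}  # stand-in for supervised labelling

def segment_references(doc: dict) -> dict:
    doc["references"] = [b for b in doc["body"] if b.startswith("[")]
    return doc                                          # stand-in for reference heuristics

Stage = Callable[[object], object]

def run_pipeline(raw: str, stages: List[Stage]):
    result = raw
    for stage in stages:
        result = stage(result)
    return result

doc = run_pipeline("A. Author\n\nSome body text.\n\n[1] A reference string.",
                   [extract_text_blocks, classify_blocks, segment_references])
print(doc["references"])
```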
Horn Christopher, Kern Roman
2015
In this paper, we propose an approach to deriving public transportation timetables of a region (i.e. a country) based on (i) large-scale, non-GPS cell phone data and (ii) a dataset containing geographic information of public transportation stations. The presented algorithm is designed to work with movement data that is sparse and of low spatial accuracy but exists in vast amounts (large-scale). Since only aggregated statistics are used, our algorithm copes well with anonymized data. Our evaluation shows that 89% of the departure times of popular train connections are correctly recalled with an allowed deviation of 5 minutes. The timetable can be used as a feature for transportation mode detection to separate public from private transport when no public timetable is available.
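One way to picture the aggregation step, as a hedged sketch with synthetic observations: the times at which anonymised devices are last observed in a station cell are binned, and recurring peaks are treated as candidate departure times. The bin size and the support threshold are assumptions, not the paper's parameters.

```python
# Derive candidate departure times from aggregated observation counts.
from collections import Counter

# Minutes of day at which devices were last seen at the station cell
# before appearing at the next station along the line (synthetic).
observations = [480, 481, 479, 510, 511, 480, 512, 540, 541, 480, 510]

# Aggregate into 5-minute bins, matching the evaluation tolerance.
bins = Counter((t // 5) * 5 for t in observations)

min_support = 3  # require a minimum number of devices per bin
departures = sorted(b for b, c in bins.items() if c >= min_support)
print([f"{b // 60:02d}:{b % 60:02d}" for b in departures])  # ['08:00', '08:30']
```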
Kern Roman, Frey Matthias
2015
Table recognition and table extraction are important tasks in information extraction, especially in the domain of scholarly communication. In this domain tables are commonplace and contain valuable information. Many different automatic approaches for table recognition and extraction exist. Common to many of these approaches is the need for ground truth datasets, to train algorithms or to evaluate the results. In this paper we present the PDF Table Annotator, a web based tool for annotating elements and regions in PDF documents, in particular tables. The annotated data is intended to serve as a ground truth useful to machine learning algorithms for detecting table regions and table structure. To make the task of manual table annotation as convenient as possible, the tool is designed to allow an efficient annotation process that may span multiple sessions by multiple users. An evaluation is conducted where we compare our tool to three alternative ways of creating ground truth of tables in documents. Here we found that our tool overall provides an efficient and convenient way to annotate tables. In addition, our tool is particularly suitable for complex table structures, where it provided the lowest annotation time and the highest accuracy. Furthermore, our tool allows tables to be annotated following either a logical or a functional model. Given that ground truth datasets for table recognition and extraction are easier to produce with our tool, the quality of automatic table extraction should greatly benefit.
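To illustrate the distinction between a logical and a functional table model, the following hypothetical data structure records both the grid position of each cell and its functional role; it is not the tool's actual annotation or storage format.

```python
# Hypothetical annotation record combining a logical model (row/column grid)
# with a functional model (role of each cell).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    row: int
    col: int
    text: str
    role: str = "data"          # functional model: "header" or "data"

@dataclass
class TableAnnotation:
    page: int
    region: tuple               # bounding box (x0, y0, x1, y1) on the PDF page
    cells: List[Cell] = field(default_factory=list)

table = TableAnnotation(page=3, region=(72, 400, 520, 560), cells=[
    Cell(0, 0, "Year", role="header"),
    Cell(0, 1, "Accuracy", role="header"),
    Cell(1, 0, "2015"),
    Cell(1, 1, "0.92"),
])
print([c.text for c in table.cells if c.role == "header"])
```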