Here you will find scientific publications written by Know-Center staff


Lovric Mario, Molero Perez Jose Manuel, Kern Roman

PySpark and RDKit: moving towards Big Data in QSAR

Molecular Informatics, Wiley, 2019

The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from commodity scalable hardware can be employed for cheminformatic calculations and query operations with basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark-RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed with current low-end workstations.

Tiago Santos, Stefan Schrunner, Geiger Bernhard, Olivia Pfeiler, Anja Zernig, Andre Kaestner, Kern Roman

Feature Extraction From Analog Wafermaps: A Comparison of Classical Image Processing and a Deep Generative Model

IEEE Transactions on Semiconductor Manufacturing, IEEE, 2019

Semiconductor manufacturing is a highly innovative branch of industry, where a high degree of automation has already been achieved. For example, devices tested to be outside of their specifications in electrical wafer test are automatically scrapped. In this paper, we go one step further and analyze test data of devices still within the limits of the specification, by exploiting the information contained in the analog wafermaps. To that end, we propose two feature extraction approaches with the aim of detecting patterns in the wafer test dataset. Such patterns might indicate the onset of critical deviations in the production process. The studied approaches are: 1) classical image processing and restoration techniques in combination with sophisticated feature engineering and 2) a data-driven deep generative model. The two approaches are evaluated on both a synthetic and a real-world dataset. The synthetic dataset has been modeled based on real-world patterns and characteristics. We found both approaches to provide similar overall evaluation metrics. Our in-depth analysis helps to choose one approach over the other depending on data availability as a major aspect, as well as on available computing power and required interpretability of the results.

Toller Maximilian, Santos Tiago, Kern Roman

SAZED: parameter-free domain-agnostic season length estimation in time series data

Data Mining and Knowledge Discovery, Springer US, 2019

Season length estimation is the task of identifying the number of observations in the dominant repeating pattern of seasonal time series data. As such, it is a common pre-processing task crucial for various downstream applications. Inferring season length from a real-world time series is often challenging due to phenomena such as slightly varying period lengths and noise. These issues may, in turn, lead practitioners to dedicate considerable effort to preprocessing of time series data since existing approaches either require dedicated parameter-tuning or their performance is heavily domain-dependent. Hence, to address these challenges, we propose SAZED: spectral and average autocorrelation zero distance density. SAZED is a versatile ensemble of multiple, specialized time series season length estimation approaches. The combination of various base methods selected with respect to domain-agnostic criteria and a novel seasonality isolation technique allows broad applicability to real-world time series of varied properties. Further, SAZED is theoretically grounded and parameter-free, with a computational complexity of O(n log n), which makes it applicable in practice. In our experiments, SAZED was statistically significantly better than every other method on at least one dataset. The datasets we used for the evaluation consist of time series data from various real-world domains, sterile synthetic test cases and synthetic data that were designed to be seasonal and yet have no finite statistical moments of any order.

Bassa Akim, Kröll Mark, Kern Roman

GerIE - An Open Information Extraction System for the German Language

Journal of Universal Computer Science, 2018

Open Information Extraction (OIE) is the task of extracting relations from text without the need of domain-specific training data. Currently, most of the research on OIE is devoted to the English language, but little or no research has been conducted on other languages including German. We tackled this problem and present GerIE, an OIE parser for the German language. Therefore, we started by surveying the available literature on OIE with a focus on concepts which may also apply to the German language. Our system is built upon the output of a dependency parser, on which a number of hand-crafted rules are executed. For the evaluation we created two dedicated datasets, one derived from news articles and one based on texts from an encyclopedia. Our system achieves F-measures of up to 0.89 for sentences that have been correctly preprocessed.

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Authorship Identification of Documents with High Content Similarity

Scientometrics, Wolfgang Glänzel, Springer Link, 2018

The goal of our work is inspired by the task of associating segments of text to their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior accordingly. Unlike the majority of the work done in this field (i.e., authorship attribution, plagiarism detection, etc.), which uses content features, we focus only on the stylometric, i.e., content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features. In the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (i) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (ii) assist forensic experts or linguists to create profiles of writers, (iii) support intelligence applications to analyze aggressive and threatening messages and (iv) support editorial conformity by adhering to, for instance, a journal-specific writing style.

Hojas Sebastian, Kröll Mark, Kern Roman

GerMeter - A Corpus for Measuring Text Reuse in the Austrian Journalistic Domain

Language Resources and Evaluation, Springer, 2018


Santos Tiago, Walk Simon, Kern Roman, Strohmaier M., Helic Denis

Activity in Questions & Answers Websites

ACM Transactions on Social Computing, 2018

Millions of users on the Internet discuss a variety of topics on Question and Answer (Q&A) instances. However, not all instances and topics receive the same amount of attention, as some thrive and achieve self-sustaining levels of activity while others fail to attract users and either never grow beyond being a small niche community or become inactive. Hence, it is imperative to not only better understand but also to distill deciding factors and rules that define and govern sustainable Q&A instances. We aim to empower community managers with quantitative methods for them to better understand, control and foster their communities, and thus contribute to making the Web a more efficient place to exchange information. To that end, we extract, model and cluster user activity-based time series from 50 randomly selected Q&A instances from the StackExchange network to characterize user behavior. We find four distinct types of user activity temporal patterns, which vary primarily according to the users' activity frequency. Finally, by breaking down total activity in our 50 Q&A instances by the previously identified user activity profiles, we classify those 50 Q&A instances into three different activity profiles. Our categorization of Q&A instances aligns with the stage of development and maturity of the underlying communities, which can potentially help operators of such instances not only to quantitatively assess status and progress, but also to optimize community building efforts.

Seifert Christin, Bailer Werner, Orgel Thomas, Gantner Louis, Kern Roman, Ziak Hermann, Petit Albin, Schlötterer Jörg, Zwicklbauer Stefan, Granitzer Michael

Ubiquitous Access to Digital Cultural Heritage

Journal on Computing and Cultural Heritage (JOCCH) - Special Issue on Digital Infrastructure for Cultural Heritage, Part 1, Roberto Scopign, ACM, New York, NY, US, 2017

The digitization initiatives in the past decades have led to a tremendous increase in digitized objects in the cultural heritage domain. Although digitally available, these objects are often not easily accessible for interested users because of the distributed allocation of the content in different repositories and the variety in data structure and standards. When users search for cultural content, they first need to identify the specific repository and then need to know how to search within this platform (e.g., usage of specific vocabulary). The goal of the EEXCESS project is to design and implement an infrastructure that enables ubiquitous access to digital cultural heritage content. Cultural content should be made available in the channels that users habitually visit and be tailored to their current context without the need to manually search multiple portals or content repositories. To realize this goal, open-source software components and services have been developed that can either be used as an integrated infrastructure or as modular components suitable to be integrated in other products and services. The EEXCESS modules and components comprise (i) Web-based context detection, (ii) information retrieval-based, federated content aggregation, (iii) metadata definition and mapping, and (iv) a component responsible for privacy preservation. Various applications have been realized based on these components that bring cultural content to the user in content consumption and content creation scenarios. For example, content consumption is realized by a browser extension generating automatic search queries from the current page context and the focus paragraph and presenting related results aggregated from different data providers. A Google Docs add-on allows retrieval of relevant content aggregated from multiple data providers while collaboratively writing a document. These relevant resources can then be included in the current document either as a citation, an image, or a link (with preview) without having to disrupt the current writing task for an explicit search in various content providers' portals.

Strohmaier M., Helic Denis, Benz D., Körner C., Kern Roman

Evaluation of Folksonomy Induction Algorithms

ACM Transactions on Intelligent Systems and Technology, 3(4), 2012


Kern Roman, Seifert Christin, Granitzer Michael

A Hybrid System for German Encyclopedia Alignment

International Journal on Digital Libraries, Springer, 2010

Collaboratively created on-line encyclopedias have become increasingly popular. Especially in terms of completeness they have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge and started an initiative to merge their corpora to create a single, more complete encyclopedia. The crucial step in this merging process is the alignment of articles. We have developed a two-step hybrid system to provide highly accurate alignments with low manual effort. First, we apply an information retrieval based, automatic alignment algorithm. Second, the articles with a low confidence score are revised using a manual alignment scheme carefully designed for quality assurance. Our evaluation shows that a combination of weighting and ranking techniques utilizing different facets of the encyclopedia articles effectively reduces the number of necessary manual alignments. Further, the setup of the manual alignment turned out to be robust against inter-indexer inconsistencies. As a result, the developed system empowered us to align four encyclopedias with high accuracy and low effort.

Neidhart T., Granitzer Michael, Kern Roman, Weichselbraun A., Wohlgenannt G., Scharl A., Juffinger A.

Distributed Web2.0 Crawling for Ontology Evolution

Journal of Digital Information Management, 2009
