Publications

Here you will find scientific publications written by Know-Center staff

2019

Lovric Mario, Molero Perez Jose Manuel, Kern Roman

PySpark and RDKit: moving towards Big Data in QSAR

Molecular Informatics, Wiley, 2019

Journal
The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from commodity scalable hardware can be employed for cheminformatic calculations and query operations with basic knowledge in Python programming and understanding of the resilient distributed datasets (RDD). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark provides a reasonable scalability depending on the use case and can be a suitable choice for datasets too big to be processed with current low‐end workstations.
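As an illustration of the "fingerprint similarity" use case named in the abstract above, the following is a minimal sketch of the Tanimoto coefficient, a common similarity measure for molecular fingerprints. It is not the authors' PySpark-RDKit code: fingerprints are assumed here to be plain Python sets of on-bit indices, and the function names `tanimoto` and `most_similar` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints.

    Assumption: each fingerprint is an iterable of on-bit indices.
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


def most_similar(query, library, top_k=3):
    """Rank library entries (name -> fingerprint) by similarity to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy data; the bit indices are made up for illustration only.
library = {"mol_a": {1, 2, 3}, "mol_b": {2, 3, 4}, "mol_c": {7, 8}}
ranking = most_similar({1, 2, 3}, library, top_k=2)
```

A per-record function of this kind is the sort of operation that could be mapped over a distributed dataset; how the published implementation distributes the work is not described here.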
2019

Geiger Bernhard, Schrunner Stefan, Kern Roman

An Information-Theoretic Measure for Pattern Similarity in Analog Wafermap

European Advanced Process Control and Manufacturing Conf. (apc|m), Villach, 2019

Conference
Schrunner and Geiger have contributed equally to this work.
2019

Winter Kevin, Kern Roman

Know-Center at SemEval-2019 Task 5: Multilingual Hate Speech Detection on Twitter using CNNs

Proceedings of the Thirteenth International Workshop on Semantic Evaluation, 2019

Conference
This paper presents the Know-Center system submitted for task 5 of the SemEval-2019 workshop. Given a Twitter message in either English or Spanish, the task is to first detect whether it contains hateful speech and second, to determine the target and level of aggression used. For this purpose our system utilizes word embeddings and a neural network architecture, consisting of both dilated and traditional convolution layers. We achieved average F1-scores of 0.57 and 0.74 for English and Spanish respectively.
2019

Santos Tiago, Schrunner Stefan, Geiger Bernhard, Pfeiler Olivia, Zernig Anja, Kaestner Andre, Kern Roman

Feature Extraction From Analog Wafermaps: A Comparison of Classical Image Processing and a Deep Generative Model

IEEE Transactions on Semiconductor Manufacturing, IEEE, 2019

Journal
Semiconductor manufacturing is a highly innovative branch of industry, where a high degree of automation has already been achieved. For example, devices tested to be outside of their specifications in electrical wafer test are automatically scrapped. In this paper, we go one step further and analyze test data of devices still within the limits of the specification, by exploiting the information contained in the analog wafermaps. To that end, we propose two feature extraction approaches with the aim to detect patterns in the wafer test dataset. Such patterns might indicate the onset of critical deviations in the production process. The studied approaches are: 1) classical image processing and restoration techniques in combination with sophisticated feature engineering and 2) a data-driven deep generative model. The two approaches are evaluated on both a synthetic and a real-world dataset. The synthetic dataset has been modeled based on real-world patterns and characteristics. We found both approaches to provide similar overall evaluation metrics. Our in-depth analysis helps to choose one approach over the other depending on data availability as a major aspect, as well as on available computing power and required interpretability of the results.
2019

Lovric Mario, Žuvela Petar, Kern Roman, Lucic Bono, Liu J. Jay, Bączek Tomasz

Machine learning methods for cross-column prediction of retention time in reversed-phased liquid chromatography

8th World Conference on Physico Chemical Methods in Drug Discovery and Development, IAPC, Split, Croatia, 2019

Conference
Quantitative structure-retention relationships (QSRR) were employed to build global models for prediction of chromatographic retention time of synthetic peptides across six RP-LC-MS/MS columns and varied experimental conditions. The global QSRR models were based on only three a priori selected molecular descriptors: sum of gradient retention times of 20 natural amino acids (logSumAA), van der Waals volume (logvdWvol.), and hydrophobicity (clogP) related to the retention mechanism of RP-LC separation of peptides. Three machine learning regression methods were compared: random forests (RF), partial least squares (PLS), and adaptive boosting (ADA). All the models were comprehensively optimized through 3-fold cross-validation (CV) and validated through an external validation set. The chemical domain of applicability was also defined. Percentage root mean square error of prediction (%RMSEP) was used as an external validation metric. Results have shown that RF exhibited a %RMSEP of 14.99 %; PLS exhibited a %RMSEP of 40.561 %; whereas ADA exhibited a %RMSEP of 26.35 %. The ensemble models considerably outperform the conventional PLS-based QSRR model. Novel methods of tree-based model explainability were employed to reveal mechanisms behind black-box global ensemble QSRR models. The models revealed the highest feature importance for sum of gradient retention times (logSumAA), followed by van der Waals volume (logvdWvol.), and hydrophobicity (clogP). The promising results of this study show the potential of machine learning for improved peptide identification, retention time standardization and integration into state-of-the-art LC-MS/MS proteomics workflows.
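The abstract above reports %RMSEP without spelling out the formula. The sketch below assumes one common definition, the root mean square error of prediction divided by the mean observed value and multiplied by 100; the exact normalization used in the study is not stated here, and the function name `percent_rmsep` is hypothetical.

```python
import math


def percent_rmsep(observed, predicted):
    """Percentage root mean square error of prediction.

    Assumed definition: 100 * RMSE / |mean(observed)|.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    mean_observed = sum(observed) / len(observed)
    if mean_observed == 0:
        raise ValueError("mean of observed values is zero; %RMSEP is undefined")
    return 100.0 * math.sqrt(mse) / abs(mean_observed)


# Toy retention times (made-up numbers, for illustration only).
observed_rt = [10.0, 20.0, 30.0]
predicted_rt = [12.0, 18.0, 30.0]
error_percent = percent_rmsep(observed_rt, predicted_rt)
```

Under this definition a lower value means predictions lie closer to the observed retention times relative to their typical magnitude.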
2019

Toller Maximilian, Geiger Bernhard, Kern Roman

A Formally Robust Time Series Distance Metric

Mile'TS @ SIGKDD, Anchorage, Alaska USA, 2019

Conference
Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various different distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: Robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily “bad” contamination and has a worst-case computational complexity of O(n log n). We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.
2019

Toller Maximilian, Santos Tiago, Kern Roman

SAZED: parameter-free domain-agnostic season length estimation in time series data

Data Mining and Knowledge Discovery, Springer US, 2019

Journal
Season length estimation is the task of identifying the number of observations in the dominant repeating pattern of seasonal time series data. As such, it is a common pre-processing task crucial for various downstream applications. Inferring season length from a real-world time series is often challenging due to phenomena such as slightly varying period lengths and noise. These issues may, in turn, lead practitioners to dedicate considerable effort to preprocessing of time series data since existing approaches either require dedicated parameter-tuning or their performance is heavily domain-dependent. Hence, to address these challenges, we propose SAZED: spectral and average autocorrelation zero distance density. SAZED is a versatile ensemble of multiple, specialized time series season length estimation approaches. The combination of various base methods selected with respect to domain-agnostic criteria and a novel seasonality isolation technique allows a broad applicability to real-world time series of varied properties. Further, SAZED is theoretically grounded and parameter-free, with a computational complexity of O(n log n), which makes it applicable in practice. In our experiments, SAZED was statistically significantly better than every other method on at least one dataset. The datasets we used for the evaluation consist of time series data from various real-world domains, sterile synthetic test cases and synthetic data that were designed to be seasonal and yet have no finite statistical moments of any order.
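To make the task of season length estimation concrete, here is a minimal sketch of a single autocorrelation-based estimator: it returns the first lag at which the sample autocorrelation has a positive local maximum. This is not SAZED, which the abstract describes as an ensemble of several methods; it is a naive quadratic-time baseline, and the names `autocorrelation` and `estimate_season_length` are hypothetical.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    if variance == 0:
        return 0.0  # a constant series has no seasonal structure
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def estimate_season_length(series, max_lag=None):
    """Return the first lag with a positive local autocorrelation maximum."""
    if max_lag is None:
        max_lag = len(series) // 2
    acf = [autocorrelation(series, lag) for lag in range(max_lag + 1)]
    for lag in range(1, max_lag):
        if acf[lag] > 0 and acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]:
            return lag
    return None  # no seasonal peak found


# Noise-free sine wave with a period of 12 observations (synthetic example).
series = [math.sin(2 * math.pi * i / 12) for i in range(120)]
season_length = estimate_season_length(series)
```

On noisy or drifting real-world data a single estimator like this can easily pick a wrong peak, which is the kind of difficulty the abstract mentions as motivation for combining several methods.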
2019

Kowald Dominik, Traub Matthias, Theiler Dieter, Gursch Heimo, Lacic Emanuel, Lindstaedt Stefanie, Kern Roman, Lex Elisabeth

Using the Open Meta Kaggle Dataset to Evaluate Tripartite Recommendations in Data Markets

REVEAL Workshop co-located with RecSys'2019, ACM, 2019

Conference
2019

Remonda Adrian, Krebs Sarah, Luzhnica Granit, Kern Roman, Veas Eduardo Enrique

Formula RL: Deep Reinforcement Learning for Autonomous Racing using Telemetry Data

Workshop on Scaling-Up Reinforcement Learning (SURL) @ Int. Joint Conf. on Artificial Intelligence, 2019

Conference
This paper explores the use of reinforcement learning (RL) models for autonomous racing. In contrast to passenger cars, where safety is the top priority, a racing car aims to minimize the lap-time. We frame the problem as a reinforcement learning task with a multidimensional input consisting of the vehicle telemetry, and a continuous action space. To find out which RL methods better solve the problem and whether the obtained models generalize to driving on unknown tracks, we put 10 variants of deep deterministic policy gradient (DDPG) to race in two experiments: i) studying how RL methods learn to drive a racing car and ii) studying how the learning scenario influences the capability of the models to generalize. Our studies show that models trained with RL are not only able to drive faster than the baseline open source handcrafted bots but also generalize to unknown tracks.
2019

Gursch Heimo, Cemernek David, Wuttei Andreas, Kern Roman

Cyber-Physical Systems as Enablers in Manufacturing Communication and Worker Support

Mensch und Computer 2019, Frank Steinicke and Katrin Wolf, Gesellschaft für Informatik e.V., Bonn, Germany, 2019

Conference
The increasing potential of Information and Communications Technology (ICT) drives higher degrees of digitisation in the manufacturing industry. Such catchphrases as “Industry 4.0” and “smart manufacturing” reflect this tendency. The implementation of these paradigms is not merely an end in itself, but a new way of collaboration across existing department and process boundaries. Converting the process input, internal and output data into digital twins offers the possibility to test and validate the parameter changes via simulations, whose results can be used to update guidelines for shop-floor workers. The result is a Cyber-Physical System (CPS) that brings together the physical shop-floor, the digital data created in the manufacturing process, the simulations, and the human workers. The CPS offers new ways of collaboration on a shared data basis: the workers can annotate manufacturing problems directly in the data, obtain updated process guidelines, and use knowledge from other experts to address issues. Although the CPS cannot replace manufacturing management since it is formalised through various approaches, e.g., Six-Sigma or Advanced Process Control (APC), it is a new tool for validating decisions in simulation before they are implemented, allowing the guidelines to be continuously improved.
2019

Santos Tiago, Walk Simon, Kern Roman, Strohmaier Markus, Helic Denis

Activity Archetypes in Question-and-Answer (Q&A) Websites—A Study of 50 Stack Exchange Instances

ACM Transactions on Social Computing, 2019

Journal
Millions of users on the Internet discuss a variety of topics on Question-and-Answer (Q&A) instances. However, not all instances and topics receive the same amount of attention, as some thrive and achieve self-sustaining levels of activity, while others fail to attract users and either never grow beyond being a small niche community or become inactive. Hence, it is imperative to not only better understand but also to distill deciding factors and rules that define and govern sustainable Q&A instances. We aim to empower community managers with quantitative methods for them to better understand, control, and foster their communities, and thus contribute to making the Web a more efficient place to exchange information. To that end, we extract, model, and cluster a user activity-based time series from 50 randomly selected Q&A instances from the Stack Exchange network to characterize user behavior. We find four distinct types of user activity temporal patterns, which vary primarily according to the users’ activity frequency. Finally, by breaking down total activity in our 50 Q&A instances by the previously identified user activity profiles, we classify those 50 Q&A instances into three different activity profiles. Our parsimonious categorization of Q&A instances aligns with the stage of development and maturity of the underlying communities, and can potentially help operators of such instances: We not only quantitatively assess progress of Q&A instances, but we also derive practical implications for optimizing Q&A community building efforts, as we, e.g., recommend which user types to focus on at different developmental stages of a Q&A community.
2019

Al-Ubaidi Tarek, Khodachenko Maxim, Kern Roman, Granitzer Michael, Poedts Stefaan

Advanced Techniques for Signal Search and Automatic Classification of Observational Space Data

European Planetary Science Congress, 2019

Conference
The presentation will outline various approaches in machine learning and content based search investigated by members of the former IMPEx-FP7 (http://impex-fp7.oeaw.ac.at/) project consortium, in close cooperation with partners Know-Center, Graz University of Technology, and University of Passau, and discuss some of the numerous possibilities that open up, using these or equivalent techniques in the emerging field of e-Science in conjunction with space science. In particular, the presentation will focus on applications that allow systems to automatically classify and pre-select scientific data and hence speed up scientific workflows significantly by supporting scientists with the cumbersome task of going through vast amounts of data manually, looking for specific patterns, signals and phenomena of interest prior to selecting specific data for closer examination and analysis.
2019

Schrunner Stefan, Jenul Anna, Scheider Michael, Zernig Anja, Kaestner Andre, Kern Roman

A Health Factor for Process Patterns - Enhancing Semiconductor Manufacturing by Pattern Recognition in Analog Wafermaps

2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 2019

Conference
Electrical measurement data at the end of semiconductor frontend production, so-called wafer test data, provide deep insight into the preceding manufacturing process. Patterns in these datasets, such as spatial regularities on the wafer, frequently indicate that deviations occurred during production, potentially leading to failures in the produced devices. As such patterns of interest differ w.r.t. their shapes and, equally important, their intensities, pattern recognition is challenging, but crucial as a prerequisite for production environments in Industry 4.0. In this work, we propose an indicator for the presence and development of process patterns, a so-called “Health Factor for Process Patterns”, embedded in a framework of statistical decision theory. We provide adequate machine learning components, focusing on the recognition and assessment of known patterns in analog wafer test data. Finally, we conduct experiments using simulated as well as real-world datasets to demonstrate that our method yields competitive results and can be extended to a decision support system for industrial usage.