Razouk Houssam, Liu Xinglan, Kern Roman
2023
The Failure Mode Effect Analysis (FMEA) process is widely used in industry for risk assessment, as it effectively captures and documents domain-specific knowledge. This process is mainly concerned with causal domain knowledge. In practical applications, FMEAs encounter challenges in terms of comprehensibility, particularly related to inadequate coverage of the listed failure modes and their corresponding effects and causes. This can be attributed to the limitations of the traditional brainstorming approaches typically employed in the FMEA process. Depending on the size of the team conducting the analysis and its diversity in terms of disciplines, these approaches may not adequately capture a comprehensive range of failure modes, leading to gaps in coverage. To this end, methods for improving FMEA knowledge comprehensibility are highly needed. A potential approach to address this gap is rooted in recent advances in common-sense knowledge graph completion, which have demonstrated the effectiveness of text-aware graph embedding techniques. However, the applicability of such methods in an industrial setting is limited. This paper addresses this issue for FMEA documents in an industrial environment, studying the application of common-sense knowledge graph completion methods on FMEA documents from semiconductor manufacturing. These methods achieve over 20% MRR on the test set, and 70% of the top 10 predictions were manually assessed to be plausible by domain experts. Based on the evaluation, this paper confirms that text-aware knowledge graph embeddings for common-sense knowledge graph completion are more effective than structure-only knowledge graph embeddings for improving FMEA knowledge comprehensibility. Additionally, we found that in-domain fine-tuning of the language model is beneficial for extracting more meaningful embeddings, thus improving overall model performance.
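As a hedged illustration of the text-aware idea described above, the sketch below ranks candidate effect entities for an FMEA query by embedding their textual labels with a pre-trained sentence encoder. The model name and the toy triples are assumptions for demonstration, not the paper's actual setup.

```python
# Minimal sketch of text-aware knowledge graph completion for FMEA triples.
# Model name and toy data are illustrative assumptions, not the paper's setup.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

# Candidate effect entities and a (head + relation) query, all as text.
known_tails = ["wafer scratch", "voltage drift", "layer delamination"]
query = "particle contamination causes"

tail_emb = model.encode(known_tails)   # embed candidate effects
query_emb = model.encode([query])      # embed the query context

# Rank candidate tail entities by textual similarity to the query.
scores = cosine_similarity(query_emb, tail_emb)[0]
for tail, score in sorted(zip(known_tails, scores), key=lambda t: -t[1]):
    print(f"{score:.3f}  {tail}")
```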
Malinverno Luca, Barros Vesna, Ghisoni Francesco, Visonà Giovanni, Kern Roman, Nickel Philip, Ventura Barbara Elvira, Simic Ilija, Stryeck Sarah, Manni Francesca, Ferri Cesar, Jean-Quartier Clair, Genga Laura, Schweikert Gabriele, Lovric Mario, Rosen-Zvi Michal
2023
Understanding the inner workings of machine-learning models has become a crucial point of discussion in the fairness and reliability of artificial intelligence (AI). In this perspective, we reveal insights from recently published scientific works on explainable AI (XAI) within the biomedical sciences. Specifically, we speculate that the COVID-19 pandemic is associated with the rate of publications in the field. Current research efforts seem to be directed more toward explaining black-box machine-learning models than designing novel interpretable architectures. Notably, an inflection period in the publication rate was observed in October 2020, when the quantity of XAI research in biomedical sciences surged upward significantly. While a universally accepted definition of explainability is unlikely, ongoing research efforts are pushing the biomedical field toward improving the robustness and reliability of applied machine learning, which we consider a positive trend.
Siddiqi Shafaq, Qureshi Faiza, Lindstaedt Stefanie, Kern Roman
2023
Outlier detection in non-independent and identically distributed (non-IID) data refers to identifying unusual or unexpected observations in datasets that do not follow the independent and identically distributed (IID) assumption. This presents a challenge in real-world datasets where correlations, dependencies, and complex structures are common. In recent literature, several methods have been proposed to address this issue; each method has its own strengths and limitations, and the selection depends on the data characteristics and application requirements. However, there is a lack of a comprehensive categorization of these methods in the literature. This study addresses this gap by systematically reviewing methods for outlier detection in non-IID data published from 2015 to 2023. This study focuses on three major aspects: data characteristics, methods, and evaluation measures. In data characteristics, we discuss the differentiating properties of non-IID data. Then we review the recent methods proposed for outlier detection in non-IID data, covering their theoretical foundations and algorithmic approaches. Finally, we discuss the evaluation metrics proposed to measure the performance of these methods. Additionally, we present a taxonomy for organizing these methods and highlight the application domains of outlier detection in non-IID categorical data, outlier detection in federated learning, and outlier detection in attribute graphs. We provide a comprehensive overview of datasets used in the selected literature. Moreover, we discuss open challenges in outlier detection for non-IID data to shed light on future research directions. By synthesizing the existing literature, this study contributes to advancing the understanding and development of outlier detection techniques in non-IID data settings.
Jantscher Michael, Gunzer Felix, Kern Roman, Hassler Eva, Tschauner Sebastian, Reishofer Gernot
2023
Recent advances in deep learning and natural language processing (NLP) have opened many new opportunities for automatic text understanding and text processing in the medical field. This is of great benefit, as many clinical downstream tasks rely on information from unstructured clinical documents. However, for low-resource languages like German, the use of modern text processing applications that require a large amount of training data proves to be difficult, as only a few data sets are available, mainly due to legal restrictions. In this study, we present an information extraction framework that was initially pre-trained on real-world computed tomographic (CT) reports of head examinations, followed by domain-adaptive fine-tuning on reports from different imaging examinations. We show that in the pre-training phase, the semantic and contextual meaning of one clinical reporting domain can be captured and effectively transferred to foreign clinical imaging examinations. Moreover, we introduce an active learning approach with an intrinsic strategic sampling method to generate highly informative training data with low human annotation cost. We see that the model performance can be significantly improved by an appropriate selection of the data to be annotated, without the need to train the model on a specific downstream task. With a general annotation scheme that can be used not only in the radiology field but also in a broader clinical setting, we contribute to a more consistent labeling and annotation process that also facilitates the verification and evaluation of language models in the German clinical setting.
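The active learning step described above can be illustrated with a generic uncertainty-sampling sketch; the entropy criterion and the toy data below are assumptions for illustration, not necessarily the paper's intrinsic strategic sampling method.

```python
import numpy as np

def select_for_annotation(token_probs, k=10):
    """Uncertainty-sampling sketch: pick the k documents whose token-level
    predictive distributions have the highest mean entropy.

    token_probs: list of arrays, one per document, shape (n_tokens, n_labels),
    with rows summing to 1 (e.g., from any token classifier).
    """
    eps = 1e-12
    scores = [
        float(np.mean(-np.sum(p * np.log(p + eps), axis=1)))
        for p in token_probs
    ]
    return np.argsort(scores)[::-1][:k]   # indices of most informative docs

# Example: three toy documents with per-token label distributions.
docs = [np.array([[0.9, 0.1], [0.8, 0.2]]),
        np.array([[0.5, 0.5], [0.6, 0.4]]),
        np.array([[0.99, 0.01]])]
print(select_for_annotation(docs, k=1))   # -> [1], the most uncertain document
```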
Gabler Philipp, Geiger Bernhard, Schuppler Barbara, Kern Roman
2023
Superficially, read and spontaneous speech—the two main kinds of training data for automatic speech recognition—appear as complementary but equal: pairs of texts and acoustic signals. Yet, spontaneous speech is typically harder to recognize. This is usually explained by different kinds of variation and noise, but there is a more fundamental deviation at play: for read speech, the audio signal is produced by recitation of the given text, whereas in spontaneous speech, the text is transcribed from a given signal. In this review, we embrace this difference by presenting a first introduction of causal reasoning into automatic speech recognition, and by describing causality as a tool to study speaking styles and training data. After breaking down the data generation processes of read and spontaneous speech and analysing the domain from a causal perspective, we highlight how data generation by annotation must affect the interpretation of inference and performance. Our work discusses how various results from the causality literature regarding the impact of the direction of data generation mechanisms on learning and prediction apply to speech data. Finally, we argue how a causal perspective can support the understanding of models in speech processing regarding their behaviour, capabilities, and limitations.
Hoffer Johannes Georg, Geiger Bernhard, Kern Roman
2023
This research presents an approach that combines stacked Gaussian processes (stacked GP) with target vector Bayesian optimization (BO) to solve multi-objective inverse problems of chained manufacturing processes. In this context, GP surrogate models represent individual manufacturing processes and are stacked to build a unified surrogate model that represents the entire manufacturing process chain. Using stacked GPs, epistemic uncertainty can be propagated through all chained manufacturing processes. To perform target vector BO, acquisition functions make use of a noncentral χ-squared distribution of the squared Euclidean distance between a given target vector and the surrogate model output. In BO of chained processes, one can either use a single unified surrogate model that represents the entire joint chain, or use one surrogate model for each individual process and cascade the optimization from the last to the first process. Literature suggests that a joint optimization approach using stacked GPs overestimates uncertainty, whereas a cascaded approach underestimates it. For improved target vector BO results on chained processes, we present an approach that combines methods which under- or overestimate uncertainties in an ensemble for rank aggregation. We present a thorough analysis of the proposed methods and evaluate them on two artificial use cases and on a typical manufacturing process chain: preforming and final pressing of an Inconel 625 superalloy billet.
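To make the noncentral χ-squared connection concrete, here is a minimal sketch assuming independent surrogate outputs with a shared predictive variance (a simplification); the function and its inputs are illustrative rather than the paper's actual acquisition functions.

```python
import numpy as np
from scipy.stats import ncx2

def prob_within_target(mu, sigma2, target, radius):
    """P(||Y - target||^2 <= radius^2) for Y ~ N(mu, sigma2 * I).

    Under this simplifying equal-variance assumption, the scaled squared
    Euclidean distance to the target follows a noncentral chi-squared
    distribution, the quantity target-vector acquisition functions build on.
    """
    mu, target = np.asarray(mu), np.asarray(target)
    df = mu.size                               # degrees of freedom
    nc = np.sum((mu - target) ** 2) / sigma2   # noncentrality parameter
    return ncx2.cdf(radius ** 2 / sigma2, df, nc)

# Surrogate predicts mean (1.0, 2.0) with variance 0.1 per output dimension;
# how likely is the outcome to land within 0.5 of the target (1.2, 1.9)?
print(prob_within_target([1.0, 2.0], 0.1, [1.2, 1.9], 0.5))
```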
Razouk Houssam, Kern Roman
2022
Digitalization of causal domain knowledge is crucial, especially since the inclusion of causal domain knowledge in data analysis processes helps to avoid biased results. To extract such knowledge, Failure Mode Effect Analysis (FMEA) documents represent a valuable data source. Originally, FMEA documents were designed to be exclusively produced and interpreted by human domain experts. As a consequence, these documents often suffer from data consistency issues. This paper argues that, due to the transitive perception of causal relations, discordant and merged information cases are likely to occur. Thus, we propose to improve the consistency of FMEA documents as a step towards more efficient use of causal domain knowledge. In contrast to other work, this paper focuses on the consistency of the causal relations expressed in FMEA documents. To this end, based on an explicit scheme of types of inconsistencies derived from the causal perspective, novel methods to enhance the data quality in FMEA documents are presented. Data quality improvement will significantly improve downstream tasks, such as root cause analysis and automatic process control.
Sousa Samuel, Kern Roman
2022
Deep learning (DL) models for natural language processing (NLP) tasks often handle private data, demanding protection against breaches and disclosures. Data protection laws, such as the European Union’s General Data Protection Regulation (GDPR), thereby enforce the need for privacy. Although many privacy-preserving NLP methods have been proposed in recent years, no categories to organize them have been introduced yet, making it hard to follow the progress of the literature. To close this gap, this article systematically reviews over sixty DL methods for privacy-preserving NLP published between 2016 and 2020, covering theoretical foundations, privacy-enhancing technologies, and analysis of their suitability for real-world scenarios. First, we introduce a novel taxonomy for classifying the existing methods into three categories: data safeguarding methods, trusted methods, and verification methods. Second, we present an extensive summary of privacy threats, datasets for applications, and metrics for privacy evaluation. Third, throughout the review, we describe privacy issues in the NLP pipeline in a holistic view. Further, we discuss open challenges in privacy-preserving NLP regarding data traceability, computation overhead, dataset size, the prevalence of human biases in embeddings, and the privacy-utility tradeoff. Finally, this review presents future research directions to guide successive research and development of privacy-preserving NLP models.
Koutroulis Georgios, Mutlu Belgin, Kern Roman
2022
In prognostics and health management (PHM), the task of constructing comprehensive health indicators (HI) from huge amounts of condition monitoring data plays a crucial role. HIs may influence both the accuracy and reliability of remaining useful life (RUL) prediction, and ultimately the assessment of the system's degradation status. Most of the existing methods assume a priori an oversimplified degradation law of the investigated machinery, which in practice may not appropriately reflect reality. Especially for safety-critical engineered systems with a high level of complexity that operate under time-varying external conditions, degradation labels are not available, and hence, supervised approaches are not applicable. To address the above-mentioned challenges for extrapolating HI values, we propose a novel anticausal-based framework with reduced model complexity, by predicting the cause from the causal model's effects. Two heuristic methods are presented for inferring the structural causal models. First, the causal driver is identified from the complexity estimate of the time series, and second, the set of effect-measuring parameters is inferred via Granger causality. Once the causal models are known, off-line anticausal learning with only a few healthy cycles ensures strong generalization capabilities that help obtain robust online predictions of HIs. We validate and compare our framework on NASA's N-CMAPSS dataset with real-world operating conditions as recorded on board a commercial jet, which are utilized to further enhance the CMAPSS simulation model. The proposed framework with anticausal learning outperforms existing deep learning architectures by reducing the average root-mean-square error (RMSE) across all investigated units by nearly 65%.
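The Granger-causality step can be sketched as follows with statsmodels on synthetic data; the series and lag settings are assumptions for illustration, not the N-CMAPSS setup.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Illustrative sketch of the Granger-causality step: does signal x help
# predict signal y? (Synthetic data; not the paper's sensor set.)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.roll(x, 2) + 0.1 * rng.normal(size=500)   # y lags x by two steps

# Column order matters: tests whether the 2nd column Granger-causes the 1st.
res = grangercausalitytests(np.column_stack([y, x]), maxlag=4, verbose=False)
for lag, (tests, _) in res.items():
    f_stat, p_value = tests["ssr_ftest"][:2]
    print(f"lag {lag}: F={f_stat:.1f}, p={p_value:.2g}")
```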
Hoffer Johannes Georg, Ofner Andreas Benjamin, Rohrhofer Franz Martin, Lovric Mario, Kern Roman, Lindstaedt Stefanie, Geiger Bernhard
2022
Most engineering domains abound with models derived from first principles that have been proven to be effective for decades. These models are not only a valuable source of knowledge, but they also form the basis of simulations. The recent trend of digitization has complemented these models with data in all forms and variants, such as process monitoring time series, measured material characteristics, and stored production parameters. Theory-inspired machine learning combines the available models and data, reaping the benefits of established knowledge and the capabilities of modern, data-driven approaches. Compared to purely physics- or purely data-driven models, the models resulting from theory-inspired machine learning are often more accurate and less complex, extrapolate better, or allow faster model training or inference. In this short survey, we introduce and discuss several prominent approaches to theory-inspired machine learning and show how they were applied in the fields of welding, joining, additive manufacturing, and metal forming.
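A minimal sketch of the theory-inspired idea, assuming a toy cooling process: a flexible data-driven model is fitted with a loss that adds the residual of a first-principles law to the data fit. All constants and data below are illustrative, not from the surveyed applications.

```python
import numpy as np
from scipy.optimize import minimize

# Theory-inspired loss sketch: fit a flexible (polynomial) model of a cooling
# curve, penalizing violations of Newton's cooling law dT/dt = -k (T - T_env).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 50)
T_env, k_true = 20.0, 0.3
T_obs = T_env + 80.0 * np.exp(-k_true * t) + rng.normal(0, 1.0, t.size)

def loss(coeffs, lam=1.0):
    T_pred = np.polyval(coeffs, t)                  # data-driven model
    data_term = np.mean((T_pred - T_obs) ** 2)      # fit to observations
    dTdt = np.gradient(T_pred, t)                   # physics residual term
    physics_term = np.mean((dTdt + k_true * (T_pred - T_env)) ** 2)
    return data_term + lam * physics_term           # theory-inspired loss

fit = minimize(loss, x0=np.zeros(4), method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-8})
print(fit.fun)   # the physics term regularizes the polynomial fit
```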
Hoffer Johannes Georg, Geiger Bernhard, Kern Roman
2022
The avoidance of scrap and the adherence to tolerances is an important goal in manufacturing. This requires a good engineering understanding of the underlying process. To achieve this, real physical experiments can be conducted. However, they are expensive in time and resources, and can slow down production. A promising way to overcome these drawbacks is process exploration through simulation, where the finite element method (FEM) is a well-established and robust simulation method. While FEM simulation can provide high-resolution results, it requires extensive computing resources to do so. In addition, the simulation design often depends on unknown process properties. To circumvent these drawbacks, we present a Gaussian Process surrogate model approach that accounts for real physical manufacturing process uncertainties and acts as a substitute for expensive FEM simulation, resulting in a fast and robust method that adequately depicts reality. We demonstrate that active learning can easily be applied with our surrogate model to save computational resources. On top of that, we present a novel optimization method that treats aleatoric and epistemic uncertainties separately, allowing for greater flexibility in solving inverse problems. We evaluate our model using a typical manufacturing use case, the preforming of an Inconel 625 superalloy billet on a forging press.
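A minimal sketch of the surrogate-plus-active-learning loop, with a toy function standing in for the FEM simulator; the kernel choice and query criterion (maximum predictive standard deviation) are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A GP replaces the expensive FEM simulation; active learning queries the
# simulator where the GP is most uncertain. fem_simulate is a placeholder.
def fem_simulate(x):
    return np.sin(3 * x) + 0.5 * x            # stand-in for an FEM response

X = np.array([[0.0], [1.0], [2.0]])            # initial design points
y = fem_simulate(X).ravel()
candidates = np.linspace(0.0, 3.0, 200).reshape(-1, 1)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
for _ in range(5):                             # active-learning loop
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(std)]        # point of maximum uncertainty
    X = np.vstack([X, [x_next]])
    y = np.append(y, fem_simulate(x_next))
print(X.round(2).ravel())                      # sampled FEM evaluations
```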
Egger Jan, Pepe Antonio, Gsaxner Christina, Jin Yuan, Li Jianning, Kern Roman
2021
Deep learning belongs to the field of artificial intelligence, where machines perform tasks that typically require some kind of human intelligence. Deep learning tries to achieve this by drawing inspiration from the learning of a human brain. Similar to the basic structure of a brain, which consists of (billions of) neurons and connections between them, a deep learning algorithm consists of an artificial neural network, which resembles the biological brain structure. Mimicking the learning process of humans with their senses, deep learning networks are fed with (sensory) data, like texts, images, videos or sounds. These networks outperform the state-of-the-art methods in different tasks and, because of this, the whole field has seen exponential growth in recent years. This growth has resulted in well over 10,000 publications per year. For example, the search engine PubMed alone, which covers only a subset of all publications in the medical field, already provides over 11,000 results for the search term 'deep learning' as of Q3 2020, and around 90% of these results are from the last three years. Consequently, a complete overview of the field of deep learning is already impossible to obtain and, in the near future, it will potentially become difficult to obtain an overview of a subfield. However, there are several review articles about deep learning, which are focused on specific scientific fields or applications, for example deep learning advances in computer vision or in specific tasks like object detection. With these surveys as a foundation, the aim of this contribution is to provide a first high-level, categorized meta-survey of selected reviews on deep learning across different scientific disciplines and to outline the research impact they have already had during this short period of time. The categories (computer vision, language processing, medical informatics and additional works) have been chosen according to the underlying data sources (image, language, medical, mixed). In addition, we review the common architectures, methods, pros, cons, evaluations, challenges and future directions for every sub-category.
Lovric Mario, Duricic Tomislav, Tran Thi Ngoc Han, Hussain Hussain, Lacic Emanuel, Rasmussen Morten A., Kern Roman
2021
Methods for dimensionality reduction are showing significant contributions to knowledge generation in high-dimensional modeling scenarios throughout many disciplines. By achieving a lower dimensional representation (also called embedding), fewer computing resources are needed in downstream machine learning tasks, thus leading to a faster training time, lower complexity, and statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques (principal component analysis—PCA, uniform manifold approximation and projection—UMAP, and variational autoencoders—VAEs) for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not utilize additional pre-processing of features. Inspired by the success of transfer learning in several fields, we further study the performance of embedders when trained on an external dataset of chemical compounds. To gain a better understanding of their characteristics, we evaluate the embedders with different embedding dimensionalities, and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be utilized alongside known techniques such as PCA and VAE as a pre-compression technique in the toxicology domain. Nevertheless, the generative model of VAE shows an advantage in pre-compressing the data with respect to classification accuracy.
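The pre-compression idea can be sketched as below with PCA as the embedder (UMAP from umap-learn or a VAE could be swapped in); the random fingerprint-like data and model choices are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Embedding-as-pre-compression sketch. Random bit vectors stand in for
# molecular fingerprints; labels are random, so both scores sit near chance.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 1024)).astype(float)   # fingerprint-like
y = rng.integers(0, 2, size=300)                          # toxicity labels

baseline = RandomForestClassifier(random_state=0)
compressed = make_pipeline(PCA(n_components=32),
                           RandomForestClassifier(random_state=0))

print(cross_val_score(baseline, X, y, cv=5).mean())
print(cross_val_score(compressed, X, y, cv=5).mean())   # far fewer features
```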
Hoffer Johannes Georg, Geiger Bernhard, Ofner Patrick, Kern Roman
2021
The technical world of today fundamentally relies on structural analysis in the form of design and structural mechanic simulations. A traditional and robust simulation method is the physics-based Finite Element Method (FEM) simulation. FEM simulations in structural mechanics are known to be very accurate; however, the higher the desired resolution, the more computational effort is required. Surrogate modeling provides a robust approach to address this drawback. Nonetheless, finding the right surrogate model and its hyperparameters for a specific use case is not a straightforward process. In this paper, we discuss and compare several classes of mesh-free surrogate models based on traditional and thriving Machine Learning (ML) and Deep Learning (DL) methods. We show that relatively simple algorithms (such as k-nearest neighbor regression) can be competitive in applications with low geometrical complexity and extrapolation requirements. With respect to tasks exhibiting higher geometric complexity, our results show that recent DL methods at the forefront of literature (such as physics-informed neural networks) are complicated to train and to parameterize and thus require further research before they can be put to practical use. In contrast, we show that already well-researched DL methods such as the multi-layer perceptron are superior with respect to interpolation use cases and can be easily trained with available tools. With our work, we thus present a basis for selection and practical implementation of surrogate models.
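A toy comparison in the spirit of this study, using scikit-learn's k-nearest neighbor regressor and multi-layer perceptron as mesh-free surrogates; the synthetic response and settings are assumptions, not the paper's benchmark.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# Two mesh-free surrogates of a smooth response, compared on held-out points
# in the interpolation regime. Function and settings are illustrative only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))        # e.g., load and geometry inputs
y = np.sin(4 * X[:, 0]) * X[:, 1]           # stand-in for an FEM output

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000,
                   random_state=0).fit(X, y)

X_test = rng.uniform(0, 1, size=(100, 2))
y_test = np.sin(4 * X_test[:, 0]) * X_test[:, 1]
print("kNN MSE:", np.mean((knn.predict(X_test) - y_test) ** 2))
print("MLP MSE:", np.mean((mlp.predict(X_test) - y_test) ** 2))
```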
Lovric Mario, Meister Richard, Steck Thomas, Fadljevic Leon, Gerdenitsch Johann, Schuster Stefan, Schiefermüller Lukas, Lindstaedt Stefanie, Kern Roman
2020
In industrial electro galvanizing lines, aged anodes deteriorate the zinc coating distribution over the strip width, leading to an increase in electricity and zinc cost. We introduce a data-driven approach to predictive maintenance of anodes to replace the cost- and labor-intensive manual inspection, which is still common for this task. The approach is based on parasitic resistance as an indicator of anode condition, i.e., whether an anode is aged or mis-installed. The parasitic resistance is indirectly observable via the voltage difference between the measured voltage and the baseline (theoretical) voltage for a healthy anode. Here we calculate the baseline voltage by means of two approaches: (1) a physical model based on electrical and electrochemical laws, and (2) advanced machine learning techniques including boosting and bagging regression. The data was collected on one exemplary rectifier unit equipped with two anodes, studied for a total period of two years. The dataset consists of one target variable (rectifier voltage) and nine predictive variables used in the models, observing electrical current, electrolyte, and steel strip characteristics. For predictive modelling, we used Random Forest, Partial Least Squares and AdaBoost regression. The model training was conducted on intervals where the anodes were in good condition and validated on other segments, which served as a proof of concept that bad anode conditions can be identified using the parasitic resistance predicted by our models. Our results show an RMSE of 0.24 V for the baseline rectifier voltage, with a mean ± standard deviation of 11.32 ± 2.53 V, for the best model on the validation set. The best-performing model is a hybrid version of a Random Forest which incorporates meta-variables computed from the physical model. We found that a large predicted parasitic resistance coincides well with the results of the manual inspection. The results of this work will be implemented in online monitoring of anode conditions to reduce operational cost at a production site.
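A hedged sketch of the residual idea: train a regressor for the baseline voltage on healthy-anode intervals and treat the measured-minus-predicted difference as a parasitic-resistance indicator. The data, model, and threshold below are illustrative assumptions, not the plant setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Learn the baseline rectifier voltage on healthy-anode intervals, then flag
# segments where the measured voltage exceeds the prediction. Synthetic data.
rng = np.random.default_rng(0)
X_healthy = rng.normal(size=(1000, 9))              # nine process variables
v_healthy = 11.3 + X_healthy @ rng.normal(0, 0.2, 9) + rng.normal(0, 0.1, 1000)

model = RandomForestRegressor(random_state=0).fit(X_healthy, v_healthy)

# New segments: the last 49 carry a simulated parasitic voltage drop of 0.8 V.
X_new = rng.normal(size=(200, 9))
v_measured = model.predict(X_new) + np.where(np.arange(200) > 150, 0.8, 0.0)
residual = v_measured - model.predict(X_new)         # parasitic indicator
print(np.where(residual > 0.5)[0][:5])               # candidate bad-anode segments
```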
Arslanovic Jasmina, Löw Ajana, Lovric Mario, Kern Roman
2020
Previous studies have suggested that artistic (synchronized) swimming athletes might show eating disorder symptoms. However, systematic research on eating disorders in artistic swimming is limited, and knowledge about the nature and antecedents of the development of eating disorders in this specific population of athletes is still scarce. Hence, the aim of our research was to investigate eating disorder symptoms in artistic swimming athletes using the EAT-26 instrument, and to examine the relation of the incidence and severity of these symptoms to body mass index and body image dissatisfaction. Furthermore, we wanted to compare artistic swimmers with athletes of a non-leanness (but also aquatic) sport, therefore we also included a group of female water polo athletes of the same age. The sample consisted of 36 artistic swimmers and 34 female water polo players (both aged 13-16). To test for the presence of eating disorder symptoms, the EAT-26 was used. The Mann-Whitney U Test (MWU) was used to test for differences in EAT-26 scores. The EAT-26 total score and the Dieting subscale (one of the three subscales) showed significant differences between the two groups. The median value for the EAT-26 total score was higher in the artistic swimmers' group (C = 11) than in the water polo players' group (C = 8). A decision tree classifier was used to discriminate the artistic swimmers and female water polo players based on the features from the EAT-26 and calculated features. The most discriminative features were the BMI, the Dieting subscale and the habit of post-meal vomiting. Our results suggest that artistic swimmers, at their typical competing age, show a higher risk of developing eating disorders than female water polo players, and that they are also prone to dieting weight-control behaviors to achieve a desired weight. Furthermore, the results indicate that purgative behaviors, such as binge eating or self-induced vomiting, might not be a common weight-control behavior among these athletes. The results corroborate the findings that the sport environment in leanness sports might contribute to the development of eating disorders. The results are also in line with evidence that leanness sports athletes are more at risk of developing restrictive than purgative eating behaviors, as the latter usually do not contribute to body weight reduction. As sport environment factors in artistic swimming include judging criteria that emphasize a specific body shape and performance, it is important to raise awareness of the mental health risks that such an environment might encourage.
Santos Tiago, Schrunner Stefan, Geiger Bernhard, Pfeiler Olivia, Zernig Anja, Kaestner Andre, Kern Roman
2019
Semiconductor manufacturing is a highly innovative branch of industry, where a high degree of automation has already been achieved. For example, devices tested to be outside of their specifications in electrical wafer test are automatically scrapped. In this paper, we go one step further and analyze test data of devices still within the limits of the specification, by exploiting the information contained in the analog wafermaps. To that end, we propose two feature extraction approaches with the aim to detect patterns in the wafer test dataset. Such patterns might indicate the onset of critical deviations in the production process. The studied approaches are: 1) classical image processing and restoration techniques in combination with sophisticated feature engineering and 2) a data-driven deep generative model. The two approaches are evaluated on both a synthetic and a real-world dataset. The synthetic dataset has been modeled based on real-world patterns and characteristics. We found both approaches to provide similar overall evaluation metrics. Our in-depth analysis helps to choose one approach over the other depending on data availability as a major aspect, as well as on available computing power and required interpretability of the results.
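The classical feature-engineering approach might look like the following minimal sketch, computing summary statistics and a coarse center-versus-edge contrast from a wafermap array; the concrete features are assumptions, not the paper's engineered feature set.

```python
import numpy as np

def wafermap_features(wafermap):
    """Classical feature-engineering sketch for an analog wafermap: summary
    statistics plus a coarse radial contrast (center vs. edge), in the spirit
    of the first approach. The feature choice here is illustrative."""
    h, w = wafermap.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)        # distance from wafer center
    center = wafermap[r < r.max() / 2].mean()
    edge = wafermap[r >= r.max() / 2].mean()
    return {
        "mean": wafermap.mean(),
        "std": wafermap.std(),
        "center_edge_diff": center - edge,      # ring/center patterns show here
    }

print(wafermap_features(np.random.default_rng(0).normal(size=(64, 64))))
```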
Toller Maximilian, Santos Tiago, Kern Roman
2019
Season length estimation is the task of identifying the number of observations in the dominant repeating pattern of seasonal time series data. As such, it is a common pre-processing task crucial for various downstream applications. Inferring season length from a real-world time series is often challenging due to phenomena such as slightly varying period lengths and noise. These issues may, in turn, lead practitioners to dedicate considerable effort to preprocessing of time series data since existing approaches either require dedicated parameter-tuning or their performance is heavily domain-dependent. Hence, to address these challenges, we propose SAZED: spectral and average autocorrelation zero distance density. SAZED is a versatile ensemble of multiple, specialized time series season length estimation approaches. The combination of various base methods selected with respect to domain-agnostic criteria and a novel seasonality isolation technique allows a broad applicability to real-world time series of varied properties. Further, SAZED is theoretically grounded and parameter-free, with a computational complexity of O(n log n), which makes it applicable in practice. In our experiments, SAZED was statistically significantly better than every other method on at least one dataset. The datasets we used for the evaluation consist of time series data from various real-world domains, sterile synthetic test cases and synthetic data that were designed to be seasonal and yet have no finite statistical moments of any order.
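One of the autocorrelation-based base methods that an ensemble like SAZED builds on can be sketched as follows; this simplified peak-picking estimator is illustrative and is not SAZED itself.

```python
import numpy as np

def estimate_season_length(x):
    """Autocorrelation-based season length sketch: take the lag of the
    highest autocorrelation peak (beyond lag 0) as the season length.
    SAZED combines several such estimators plus seasonality isolation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[x.size - 1:]   # lags 0..n-1
    acf /= acf[0]
    # Local maxima, searched from lag 2 up to half the series length.
    peaks = [k for k in range(2, x.size // 2)
             if acf[k] > acf[k - 1] and acf[k] > acf[k + 1]]
    return max(peaks, key=lambda k: acf[k]) if peaks else None

t = np.arange(600)
x = np.sin(2 * np.pi * t / 50) + 0.3 * np.random.default_rng(0).normal(size=600)
print(estimate_season_length(x))   # ~50
```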
Lovric Mario, Molero Perez Jose Manuel, Kern Roman
2019
The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from commodity scalable hardware can be employed for cheminformatic calculations and query operations with basic knowledge of Python programming and an understanding of resilient distributed datasets (RDD). Three use cases of cheminformatic computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity and calculating molecular descriptors. The source code for the PySpark-RDKit implementation is provided. The use cases show that Spark provides reasonable scalability depending on the use case and can be a suitable choice for datasets too big to be processed with current low-end workstations.
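A minimal sketch of the described setup, distributing an RDKit descriptor calculation over an RDD via PySpark; the SMILES list and descriptor choice are illustrative, and RDKit is assumed to be installed on the Spark workers.

```python
from pyspark.sql import SparkSession
from rdkit import Chem
from rdkit.Chem import Descriptors

# Distribute an RDKit molecular-weight calculation over SMILES strings with
# Spark RDDs. On a Hadoop cluster the RDD would be read from HDFS instead.
spark = SparkSession.builder.appName("rdkit-sketch").getOrCreate()
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "not_a_smiles"]

def mol_weight(smi):
    mol = Chem.MolFromSmiles(smi)          # returns None for unparsable SMILES
    return (smi, Descriptors.MolWt(mol)) if mol else None

results = (spark.sparkContext.parallelize(smiles)
                .map(mol_weight)
                .filter(lambda r: r is not None)
                .collect())
print(results)   # molecular weights for the three valid molecules
```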
Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman
2018
The goal of our work is inspired by the task of associating segments of text to their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate/mimic such behavior accordingly. Unlike the majority of the work done in this field (i.e., authorship attribution, plagiarism detection, etc.), which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine if humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features, while in the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (i) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (ii) assist forensic experts or linguists in creating profiles of writers, (iii) support intelligence applications in analyzing aggressive and threatening messages, and (iv) help with editorial conformity by adhering to, for instance, a journal-specific writing style.
Bassa Akim, Kröll Mark, Kern Roman
2018
Open Information Extraction (OIE) is the task of extracting relations from text without the need for domain-specific training data. Currently, most of the research on OIE is devoted to the English language, but little or no research has been conducted on other languages, including German. We tackled this problem and present GerIE, an OIE parser for the German language. We started by surveying the available literature on OIE with a focus on concepts which may also apply to the German language. Our system is built upon the output of a dependency parser, on which a number of hand-crafted rules are executed. For the evaluation, we created two dedicated datasets, one derived from news articles and one based on texts from an encyclopedia. Our system achieves F-measures of up to 0.89 for sentences that have been correctly preprocessed.
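A hedged sketch of the rule-over-dependency-parse idea (not GerIE's actual rule set), extracting subject-verb-object triples with spaCy's German model and its TIGER dependency labels; the model name, rules, and expected output are assumptions.

```python
import spacy

# Extract (subject, verb, object) triples from German text using spaCy's
# German model; 'sb' is the TIGER label for subjects, 'oa' for accusative
# objects. A single toy rule, purely for illustration.
nlp = spacy.load("de_core_news_sm")

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subj = [c for c in token.children if c.dep_ == "sb"]
                obj = [c for c in token.children if c.dep_ == "oa"]
                if subj and obj:
                    triples.append((subj[0].text, token.lemma_, obj[0].text))
    return triples

print(extract_triples("Die Katze jagt die Maus."))  # e.g. [('Katze', 'jagen', 'Maus')]
```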
Seifert Christin, Bailer Werner, Orgel Thomas, Gantner Louis, Kern Roman, Ziak Hermann, Petit Albin, Schlötterer Jörg, Zwicklbauer Stefan, Granitzer Michael
2017
The digitization initiatives in the past decades have led to a tremendous increase in digitized objects in the cultural heritage domain. Although digitally available, these objects are often not easily accessible for interested users because of the distributed allocation of the content in different repositories and the variety in data structures and standards. When users search for cultural content, they first need to identify the specific repository and then need to know how to search within this platform (e.g., usage of a specific vocabulary). The goal of the EEXCESS project is to design and implement an infrastructure that enables ubiquitous access to digital cultural heritage content. Cultural content should be made available in the channels that users habitually visit and be tailored to their current context, without the need to manually search multiple portals or content repositories. To realize this goal, open-source software components and services have been developed that can either be used as an integrated infrastructure or as modular components suitable to be integrated in other products and services. The EEXCESS modules and components comprise (i) Web-based context detection, (ii) information retrieval-based, federated content aggregation, (iii) metadata definition and mapping, and (iv) a component responsible for privacy preservation. Various applications have been realized based on these components that bring cultural content to the user in content consumption and content creation scenarios. For example, content consumption is realized by a browser extension generating automatic search queries from the current page context and the focus paragraph, and presenting related results aggregated from different data providers. A Google Docs add-on allows retrieval of relevant content aggregated from multiple data providers while collaboratively writing a document. These relevant resources can then be included in the current document either as a citation, an image, or a link (with preview), without having to disrupt the current writing task for an explicit search in various content providers' portals.
Neidhart T., Granitzer Michael, Kern Roman, Weichselbraun A., Wohlgenannt G., Scharl A., Juffinger A.
2009