Rohrhofer Franz Martin, Posch Stefan, Gößnitzer Clemens, Geiger Bernhard
Physics-informed neural networks (PINNs) have emerged as a promising deep learning method, capable of solving forward and inverse problems governed by differential equations. Despite their recent advance, it is widely acknowledged that PINNs are difficult to train and often require a careful tuning of loss weights when data and physics loss functions are combined by scalarization of a multi-objective (MO) problem. In this paper, we aim to understand how parameters of the physical system, such as characteristic length and time scales, the computational domain, and coefficients of differential equations affect MO optimization and the optimal choice of loss weights. Through a theoretical examination of where these system parameters appear in PINN training, we find that they effectively and individually scale the loss residuals, causing imbalances in MO optimization with certain choices of system parameters. The immediate effects of this are reflected in the apparent Pareto front, which we define as the set of loss values achievable with gradient-based training and visualize accordingly. We empirically verify that loss weights can be used successfully to compensate for the scaling of system parameters, and enable the selection of an optimal solution on the apparent Pareto front that aligns well with the physically valid solution. We further demonstrate that by altering the system parameterization, the apparent Pareto front can shift and exhibit locally convex parts, resulting in a wider range of loss weights for which gradient-based training becomes successful. This work explains the effects of system parameters on MO optimization in PINNs, and highlights the utility of proposed loss weighting schemes.
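As a toy illustration of how a system parameter can rescale the physics residual (an assumed example, not taken from the paper), consider the decay equation du/dt = -ku, a hypothetical time scale T, and assumed loss weights \lambda_data and \lambda_phys in the scalarized loss, with the physics loss taken as the mean squared residual over N collocation points:

```latex
\begin{align*}
\mathcal{L}(\theta) &= \lambda_{\mathrm{data}}\,\mathcal{L}_{\mathrm{data}}(\theta) + \lambda_{\mathrm{phys}}\,\mathcal{L}_{\mathrm{phys}}(\theta), \\
r(t) &= \frac{du}{dt} + k\,u, \qquad \tau = t/T \;\Rightarrow\; \tilde r(\tau) = \frac{du}{d\tau} + kT\,u = T\,r(t), \\
\tilde{\mathcal{L}}_{\mathrm{phys}} &= \frac{1}{N}\sum_{i=1}^{N} \tilde r(\tau_i)^2 = T^2\,\mathcal{L}_{\mathrm{phys}}.
\end{align*}
```

In this sketch, rescaling time multiplies the physics loss by T^2, so dividing \lambda_phys by T^2 would restore the original balance between the two objectives.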
Rohrhofer Franz Martin, Posch Stefan, Gößnitzer Clemens, Geiger Bernhard
This paper empirically studies commonly observed training difficulties of Physics-Informed Neural Networks (PINNs) on dynamical systems. Our results indicate that fixed points which are inherent to these systems play a key role in the optimization of the physics loss function embedded in PINNs. We observe that the loss landscape exhibits local optima that are shaped by the presence of fixed points. We find that these local optima contribute to the complexity of the physics loss optimization, which can explain common training difficulties and the resulting nonphysical predictions. Under certain settings, e.g., initial conditions close to fixed points or long simulation times, we show that those optima can even become better than that of the desired solution.
Hoffer Johannes G., Ranftl Sascha, Geiger Bernhard
We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on (i) maximization/minimization rather than target value optimization or (ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.
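The target-value criterion can be made concrete with a minimal sketch. It is not one of the acquisition functions derived in the paper; it only evaluates the expected squared error under the assumption that the surrogate's predictive variance and the aleatoric noise variance are independent and add up. All names and numbers below are hypothetical.

```python
def expected_squared_error(mean, var_model, var_noise, target):
    """E[(Y - target)^2] for an output Y with the given mean and total variance.

    Assumes model (epistemic) and noise (aleatoric) variances are independent,
    so the total variance is their sum.
    """
    return (mean - target) ** 2 + var_model + var_noise


def pick_candidate(candidates, var_noise, target):
    """Return the input whose predicted output has the smallest expected squared error."""
    return min(
        candidates,
        key=lambda c: expected_squared_error(c[1], c[2], var_noise, target),
    )[0]


# Hypothetical surrogate predictions: (input, predictive mean, predictive variance).
candidates = [(0.1, 0.9, 0.50), (0.4, 1.2, 0.01), (0.7, 1.0, 0.30)]
best = pick_candidate(candidates, var_noise=0.05, target=1.0)
```

In this toy setting the input 0.7 matches the target in the mean, yet the input 0.4 is preferred because its lower variance gives a smaller expected squared error.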
Gabler Philipp, Geiger Bernhard, Schuppler Barbara, Kern Roman
Superficially, read and spontaneous speech—the two main kinds of training data for automatic speech recognition—appear as complementary, but are equal: pairs of texts and acoustic signals. Yet, spontaneous speech is typically harder for recognition. This is usually explained by different kinds of variation and noise, but there is a more fundamental deviation at play: for read speech, the audio signal is produced by recitation of the given text, whereas in spontaneous speech, the text is transcribed from a given signal. In this review, we embrace this difference by presenting a first introduction of causal reasoning into automatic speech recognition, and describing causality as a tool to study speaking styles and training data. After breaking down the data generation processes of read and spontaneous speech and analysing the domain from a causal perspective, we highlight how data generation by annotation must affect the interpretation of inference and performance. Our work discusses how various results from the causality literature regarding the impact of the direction of data generation mechanisms on learning and prediction apply to speech data. Finally, we argue how a causal perspective can support the understanding of models in speech processing regarding their behaviour, capabilities, and limitations.
Hoffer Johannes Georg, Geiger Bernhard, Kern Roman
This research presents an approach that combines stacked Gaussian processes (stacked GP) with target vector Bayesian optimization (BO) to solve multi-objective inverse problems of chained manufacturing processes. In this context, GP surrogate models represent individual manufacturing processes and are stacked to build a unified surrogate model that represents the entire manufacturing process chain. Using stacked GPs, epistemic uncertainty can be propagated through all chained manufacturing processes. To perform target vector BO, acquisition functions make use of a noncentral χ-squared distribution of the squared Euclidean distance between a given target vector and surrogate model output. In BO of chained processes, one can either use a single unified surrogate model that represents the entire joint chain, or use a surrogate model for each individual process and cascade the optimization from the last to the first process. Literature suggests that a joint optimization approach using stacked GPs overestimates uncertainty, whereas a cascaded approach underestimates it. For improved target vector BO results of chained processes, we present an approach that combines methods which under- or overestimate uncertainties in an ensemble for rank aggregation. We present a thorough analysis of the proposed methods and evaluate them on two artificial use cases and on a typical manufacturing process chain: preforming and final pressing of an Inconel 625 superalloy billet.
Lovric Mario, Antunović Mario, Šunić Iva, Vuković Matej, Kecorius Simon, Kröll Mark, Bešlić Ivan, Godec Ranka, Pehnec Gordana, Geiger Bernhard, Grange Stuart K, Šimić Iva
In this paper, the authors investigated changes in mass concentrations of particulate matter (PM) during the Coronavirus Disease of 2019 (COVID-19) lockdown. Daily samples of PM1, PM2.5 and PM10 fractions were measured at an urban background sampling site in Zagreb, Croatia from 2009 to late 2020. For the purpose of meteorological normalization, the mass concentrations were fed alongside meteorological and temporal data to Random Forest (RF) and LightGBM (LGB) models tuned by Bayesian optimization. The models’ predictions were subsequently de-weathered by meteorological normalization using repeated random resampling of all predictive variables except the trend variable. Three pollution periods in 2020 were examined in detail: January and February as the pre-lockdown period, April as the lockdown period, and June and July as the “new normal”. An evaluation using normalized mass concentrations of particulate matter and analysis of variance (ANOVA) was conducted. The results showed no significant differences for PM1, PM2.5 and PM10 in April 2020 compared to the same period in 2018 and 2019. No significant changes were observed for the “new normal” either. The results thus indicate that a reduction in mobility during the COVID-19 lockdown in Zagreb, Croatia, did not significantly affect particulate matter concentrations in the long term.
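The de-weathering step can be sketched as follows. This is a simplified illustration under assumptions, not the authors' pipeline: the function name, the `predict` callable standing in for a trained RF or LGB model, the joint resampling of whole observed rows, and the toy data are all hypothetical.

```python
import random


def meteorological_normalization(predict, rows, trend_key, n_samples=100, seed=0):
    """Average model predictions over resampled predictors, keeping the trend fixed.

    For each row, all predictive variables except `trend_key` are replaced by
    those of a randomly drawn observed row; the predictions are then averaged.
    """
    rng = random.Random(seed)
    normalized = []
    for row in rows:
        predictions = []
        for _ in range(n_samples):
            sample = dict(rng.choice(rows))     # resample all predictors jointly
            sample[trend_key] = row[trend_key]  # keep the trend variable of this row
            predictions.append(predict(sample))
        normalized.append(sum(predictions) / len(predictions))
    return normalized


def toy_model(sample):
    """Hypothetical stand-in for a fitted model: concentration falls with wind speed."""
    return 10.0 + sample["trend"] - 0.5 * sample["wind"]


rows = [
    {"trend": 0, "wind": 1.0},
    {"trend": 1, "wind": 3.0},
    {"trend": 2, "wind": 5.0},
]
normalized = meteorological_normalization(toy_model, rows, trend_key="trend")
```

Averaging over resampled weather leaves a series that varies mainly with the trend variable, which is what the subsequent period comparison relies on.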
Hoffer Johannes Georg, Ofner Andreas Benjamin, Rohrhofer Franz Martin, Lovric Mario, Kern Roman, Lindstaedt Stefanie, Geiger Bernhard
Most engineering domains abound with models derived from first principles that have been proven to be effective for decades. These models are not only a valuable source of knowledge, but they also form the basis of simulations. The recent trend of digitization has complemented these models with data in all forms and variants, such as process monitoring time series, measured material characteristics, and stored production parameters. Theory-inspired machine learning combines the available models and data, reaping the benefits of established knowledge and the capabilities of modern, data-driven approaches. Compared to purely physics- or purely data-driven models, the models resulting from theory-inspired machine learning are often more accurate and less complex, extrapolate better, or allow faster model training or inference. In this short survey, we introduce and discuss several prominent approaches to theory-inspired machine learning and show how they were applied in the fields of welding, joining, additive manufacturing, and metal forming.
Ofner Andreas Benjamin, Kefalas Achilles, Posch Stefan, Geiger Bernhard
This article introduces a method for the detection of knock occurrences in an internal combustion engine (ICE) using a 1-D convolutional neural network trained on in-cylinder pressure data. The model architecture is based on expected frequency characteristics of knocking combustion. All cycles were reduced to 60° CA long windows with no further processing applied to the pressure traces. The neural networks were trained exclusively on in-cylinder pressure traces from multiple conditions, with labels provided by human experts. The best-performing model architecture achieves an accuracy of above 92% on all test sets in a tenfold cross-validation when distinguishing between knocking and non-knocking cycles. In a multiclass problem where each cycle was labeled by the number of experts who rated it as knocking, 78% of cycles were labeled perfectly, while 90% of cycles were classified at most one class from ground truth. They thus considerably outperform the broadly applied maximum amplitude of pressure oscillation (MAPO) detection method, as well as references reconstructed from previous works. Our analysis indicates that the neural network learned physically meaningful features connected to engine-characteristic resonances, thus verifying the intended theory-guided data science approach. Deeper performance investigation further shows remarkable generalization ability to unseen operating points. In addition, the model proved to classify knocking cycles in unseen engines with increased accuracy of 89% after adapting to their features via training on a small number of exclusively non-knocking cycles. The algorithm takes below 1 ms to classify individual cycles, effectively making it suitable for real-time engine control.
Hoffer Johannes Georg, Geiger Bernhard, Kern Roman
The avoidance of scrap and the adherence to tolerances is an important goal in manufacturing. This requires a good engineering understanding of the underlying process. To achieve this, real physical experiments can be conducted. However, they are expensive in time and resources, and can slow down production. A promising way to overcome these drawbacks is process exploration through simulation, where the finite element method (FEM) is a well-established and robust simulation method. While FEM simulation can provide high-resolution results, it requires extensive computing resources to do so. In addition, the simulation design often depends on unknown process properties. To circumvent these drawbacks, we present a Gaussian Process surrogate model approach that accounts for real physical manufacturing process uncertainties and acts as a substitute for expensive FEM simulation, resulting in a fast and robust method that adequately depicts reality. We demonstrate that active learning can be easily applied with our surrogate model to improve computational resources. On top of that, we present a novel optimization method that treats aleatoric and epistemic uncertainties separately, allowing for greater flexibility in solving inverse problems. We evaluate our model using a typical manufacturing use case, the preforming of an Inconel 625 superalloy billet on a forging press.
Amjad Rana Ali, Liu Kairen, Geiger Bernhard
In this work, we investigate the use of three information-theoretic quantities--entropy, mutual information with the class variable, and a class selectivity measure based on Kullback-Leibler (KL) divergence--to understand and study the behavior of already trained fully connected feedforward neural networks (NNs). We analyze the connection between these information-theoretic quantities and classification performance on the test set by cumulatively ablating neurons in networks trained on MNIST, FashionMNIST, and CIFAR-10. Our results parallel those recently published by Morcos et al., indicating that class selectivity is not a good indicator for classification performance. However, looking at individual layers separately, both mutual information and class selectivity are positively correlated with classification performance, at least for networks with ReLU activation functions. We provide explanations for this phenomenon and conclude that it is ill-advised to compare the proposed information-theoretic quantities across layers. Furthermore, we show that cumulative ablation of neurons with ascending or descending information-theoretic quantities can be used to formulate hypotheses regarding the joint behavior of multiple neurons, such as redundancy and synergy, with comparably low computational cost. We also draw connections to the information bottleneck theory for NNs.
Hoffer Johannes Georg, Geiger Bernhard, Ofner Patrick, Kern Roman
The technical world of today fundamentally relies on structural analysis in the form of design and structural mechanics simulations. A traditional and robust simulation method is the physics-based Finite Element Method (FEM) simulation. FEM simulations in structural mechanics are known to be very accurate; however, the higher the desired resolution, the more computational effort is required. Surrogate modeling provides a robust approach to address this drawback. Nonetheless, finding the right surrogate model and its hyperparameters for a specific use case is not a straightforward process. In this paper, we discuss and compare several classes of mesh-free surrogate models based on traditional and thriving Machine Learning (ML) and Deep Learning (DL) methods. We show that relatively simple algorithms (such as $k$-nearest neighbor regression) can be competitive in applications with low geometrical complexity and extrapolation requirements. With respect to tasks exhibiting higher geometric complexity, our results show that recent DL methods at the forefront of literature (such as physics-informed neural networks) are complicated to train and to parameterize and thus require further research before they can be put to practical use. In contrast, we show that already well-researched DL methods such as the multi-layer perceptron are superior with respect to interpolation use cases and can be easily trained with available tools. With our work, we thus present a basis for selection and practical implementation of surrogate models.
Smieja Marek, Wolczyk Maciej, Tabor Jacek, Geiger Bernhard
We propose a semi-supervised generative model, SeGMA, which learns a joint probability distribution of data and their classes and is implemented in a typical Wasserstein autoencoder framework. We choose a mixture of Gaussians as a target distribution in latent space, which provides a natural splitting of data into clusters. To connect Gaussian components with correct classes, we use a small amount of labeled data and a Gaussian classifier induced by the target distribution. SeGMA is optimized efficiently due to the use of the Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and, thus, obviates the need of sampling. While SeGMA preserves all properties of its semi-supervised predecessors and achieves at least as good generative performance on standard benchmark data sets, it presents additional features: 1) interpolation between any pair of points in the latent space produces realistically looking samples; 2) combining the interpolation property with disentangling of class and style information, SeGMA is able to perform continuous style transfer from one class to another; and 3) it is possible to change the intensity of class characteristics in a data point by moving the latent representation of the data point away from specific Gaussian components.
Geiger Bernhard
We review the current literature concerned with information plane (IP) analyses of neural network (NN) classifiers. While the underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, empirical evidence was found to be both supporting and conflicting. We review this evidence together with a detailed analysis of how the respective information quantities were estimated. Our survey suggests that compression visualized in IPs is not necessarily information-theoretic but is rather often compatible with geometric compression of the latent representations. This insight gives the IP a renewed justification. Aside from this, we shed light on the problem of estimating mutual information in deterministic NNs and its consequences. Specifically, we argue that, even in feedforward NNs, the data processing inequality need not hold for estimates of mutual information. Similarly, while a fitting phase, in which the mutual information between the latent representation and the target increases, is necessary (but not sufficient) for good classification performance, depending on the specifics of mutual information estimation, such a fitting phase need not be visible in the IP.
Basirat Mina, Geiger Bernhard, Roth Peter
Information plane analysis, describing the mutual information between the input and a hidden layer and between a hidden layer and the target over time, has recently been proposed to analyze the training of neural networks. Since the activations of a hidden layer are typically continuous-valued, this mutual information cannot be computed analytically and must thus be estimated, resulting in apparently inconsistent or even contradicting results in the literature. The goal of this paper is to demonstrate how information plane analysis can still be a valuable tool for analyzing neural network training. To this end, we complement the prevailing binning estimator for mutual information with a geometric interpretation. With this geometric interpretation in mind, we evaluate the impact of regularization and interpret phenomena such as underfitting and overfitting. In addition, we investigate neural network learning in the presence of noisy data and noisy labels.
Schweimer Christoph, Geiger Bernhard, Wang Meizhu, Gogolenko Sergiy, Mahmood Imran, Jahani Alireza, Suleimenova Diana, Groen Derek
Automated construction of location graphs is instrumental but challenging, particularly in logistics optimisation problems and agent-based movement simulations. Hence, we propose an algorithm for automated construction of location graphs, in which vertices correspond to geographic locations of interest and edges to direct travelling routes between them. Our approach involves two steps. In the first step, we use a routing service to compute distances between all pairs of L locations, resulting in a complete graph. In the second step, we prune this graph by removing edges corresponding to indirect routes, identified using the triangle inequality. The computational complexity of this second step is O(L^3), which enables the computation of location graphs for all towns and cities on the road network of an entire continent. To illustrate the utility of our algorithm in an application, we constructed location graphs for four regions of different size and road infrastructures and compared them to manually created ground truths. Our algorithm simultaneously achieved precision and recall values around 0.9 for a wide range of the single hyperparameter, suggesting that it is a valid approach to create large location graphs for which a manual creation is infeasible.
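The pruning step can be sketched as follows. The exact pruning criterion and the role of the hyperparameter are not stated in the abstract, so the rule below (drop an edge when a detour via a third location is at most `beta` times as long as the direct distance) and the name `beta` are assumptions made for illustration only.

```python
from itertools import combinations


def prune_location_graph(dist, beta=1.0):
    """Return the edges (i, j), i < j, kept after triangle-inequality pruning.

    `dist` is a symmetric matrix of pairwise route distances (the complete graph).
    An edge is treated as an indirect route and removed if some third location k
    satisfies dist[i][k] + dist[k][j] <= beta * dist[i][j]. `beta` is a
    hypothetical tolerance. Checking all pairs against all k costs O(L^3).
    """
    n = len(dist)
    kept = set()
    for i, j in combinations(range(n), 2):
        indirect = any(
            dist[i][k] + dist[k][j] <= beta * dist[i][j]
            for k in range(n)
            if k != i and k != j
        )
        if not indirect:
            kept.add((i, j))
    return kept


# Three towns on a line: 0 --1-- 1 --1-- 2; the route from 0 to 2 passes through 1.
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
edges = prune_location_graph(dist)
```

With these toy distances only the two direct links survive, since the route between towns 0 and 2 is exactly as long as the detour via town 1.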
Kefalas Achilles, Ofner Andreas Benjamin, Pirker Gerhard, Posch Stefan, Geiger Bernhard, Wimmer Andreas
The phenomenon of knock is an abnormal combustion occurring in spark-ignition (SI) engines and forms a barrier that prevents an increase in thermal efficiency while simultaneously reducing CO2 emissions. Since knocking combustion is highly stochastic, a cyclic analysis of in-cylinder pressure is necessary. In this study we propose an approach for efficient and robust detection and identification of knocking combustion in three different internal combustion engines. The proposed methodology includes a signal processing technique, called continuous wavelet transformation (CWT), which provides a simultaneous analysis of the in-cylinder pressure traces in the time and frequency domains with coefficients. These coefficients serve as input for a convolutional neural network (CNN) which extracts distinctive features and performs an image recognition task in order to distinguish between non-knock and knock. The results revealed the following: (i) the CWT delivered a stable and effective feature space with the coefficients that represents the unique time-frequency pattern of each individual in-cylinder pressure cycle; (ii) the proposed approach was superior to the state-of-the-art threshold value exceeded (TVE) method with a maximum amplitude pressure oscillation (MAPO) criterion, improving the overall accuracy by 6.15 percentage points (up to 92.62%); and (iii) the CWT + CNN method does not require calibrating threshold values for different engines or operating conditions as long as sufficient and diverse data are used to train the neural network.
Geiger Bernhard, Kubin Gernot
Guest editorial for a special issue.
Geiger Bernhard, Fischer Ian
In this short note, we relate the variational bounds proposed in Alemi et al. (2017) and Fischer (2020) for the information bottleneck (IB) and the conditional entropy bottleneck (CEB) functional, respectively. Although the two functionals were shown to be equivalent, it was empirically observed that optimizing bounds on the CEB functional achieves better generalization performance and adversarial robustness than optimizing those on the IB functional. This work tries to shed light on this issue by showing that, in the most general setting, no ordering can be established between these variational bounds, while such an ordering can be enforced by restricting the feasible sets over which the optimizations take place. The absence of such an ordering in the general setup suggests that the variational bound on the CEB functional is either more amenable to optimization or a relevant cost function for optimization in its own regard, i.e., without justification from the IB or CEB functionals.
Amjad Rana Ali, Geiger Bernhard
In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, either the IB functional is infinite for almost all values of network parameters, making the optimization problem ill-posed, or it is piecewise constant, hence not admitting gradient-based optimization methods. Second, the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification, such as robustness and simplicity. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results indicate limitations of the IB framework for the analysis of DNNs. We also note that rather than trying to repair the inherent problems in the IB functional, a better approach may be to design regularizers on latent representation enforcing the desired properties directly.
Amjad Rana Ali, Bloechl Clemens, Geiger Bernhard
We propose an information-theoretic Markov aggregation framework that is motivated by two objectives: 1) The Markov chain observed through the aggregation mapping should be Markov. 2) The aggregated chain should retain the temporal dependence structure of the original chain. We analyze our parameterized cost function and show that it contains previous cost functions as special cases, which we critically assess. Our simple optimization heuristic for deterministic aggregations characterizes the optimization landscape for different parameter values.
Koncar Philipp, Fuchs Alexandra, Hobisch Elisabeth, Geiger Bernhard, Scholger Martina, Helic Denis
Spectator periodicals contributed to spreading the ideas of the Age of Enlightenment, a turning point in human history and the foundation of our modern societies. In this work, we study the spirit and atmosphere captured in the spectator periodicals about important social issues from the 18th century by analyzing text sentiment of those periodicals. Specifically, based on a manually annotated corpus of over 3 700 issues published in five different languages and over a period of more than one hundred years, we conduct a three-fold sentiment analysis: First, we analyze the development of sentiment over time as well as the influence of topics and narrative forms on sentiment. Second, we construct sentiment networks to assess the polarity of perceptions between different entities, including periodicals, places and people. Third, we construct and analyze sentiment word networks to determine topological differences between words with positive and negative polarity allowing us to make conclusions on how sentiment was expressed in spectator periodicals. Our results depict a mildly positive tone in spectator periodicals underlining the positive attitude towards important topics of the Age of Enlightenment, but also signaling stylistic devices to disguise critique in order to avoid censorship. We also observe strong regional variation in sentiment, indicating cultural and historic differences between countries. For example, while Italy perceived other European countries as positive role models, French periodicals were frequently more critical towards other European countries. Finally, our topological analysis depicts a weak overrepresentation of positive sentiment words corroborating our findings about a general mildly positive tone in spectator periodicals. We believe that our work based on the combination of the sentiment analysis of spectator periodicals and the extensive knowledge available from literary studies sheds interesting new light on these publications. Furthermore, we demonstrate the inclusion of sentiment analysis as another useful method in the digital humanist’s distant reading toolbox.
Santos Tiago, Schrunner Stefan, Geiger Bernhard, Pfeiler Olivia, Zernig Anja, Kaestner Andre, Kern Roman
Semiconductor manufacturing is a highly innovative branch of industry, where a high degree of automation has already been achieved. For example, devices tested to be outside of their specifications in electrical wafer test are automatically scrapped. In this paper, we go one step further and analyze test data of devices still within the limits of the specification, by exploiting the information contained in the analog wafermaps. To that end, we propose two feature extraction approaches with the aim to detect patterns in the wafer test dataset. Such patterns might indicate the onset of critical deviations in the production process. The studied approaches are: 1) classical image processing and restoration techniques in combination with sophisticated feature engineering and 2) a data-driven deep generative model. The two approaches are evaluated on both a synthetic and a real-world dataset. The synthetic dataset has been modeled based on real-world patterns and characteristics. We found both approaches to provide similar overall evaluation metrics. Our in-depth analysis helps to choose one approach over the other depending on data availability as a major aspect, as well as on available computing power and required interpretability of the results.
Geiger Bernhard, Koch Tobias
In 1959, Rényi proposed the information dimension and the d-dimensional entropy to measure the information content of general random variables. This paper proposes a generalization of information dimension to stochastic processes by defining the information dimension rate as the entropy rate of the uniformly quantized stochastic process divided by minus the logarithm of the quantizer step size 1/m in the limit as m → ∞. It is demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the rate-distortion function R(D) of the stochastic process divided by -log(D) in the limit as D ↓ 0. It is further shown that among all multivariate stationary processes with a given (matrix-valued) spectral distribution function (SDF), the Gaussian process has the largest information dimension rate and the information dimension rate of multivariate stationary Gaussian processes is given by the average rank of the derivative of the SDF. The presented results reveal that the fundamental limits of almost zero-distortion recovery via compressible signal pursuit and almost lossless analog compression are different in general.
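The two verbal definitions above can be written out as follows. The symbols d, \dim_R and H' (entropy rate) are assumed notation, and the floor-based uniform quantizer with step size 1/m is an assumption about the quantizer's exact form; note that minus the logarithm of the step size 1/m equals \log m.

```latex
\begin{align*}
[X_t]_m &= \frac{\lfloor m X_t \rfloor}{m}, \\
d(\{X_t\}) &= \lim_{m \to \infty} \frac{H'\big(\{[X_t]_m\}\big)}{\log m}, \\
\dim_R(\{X_t\}) &= \lim_{D \downarrow 0} \frac{2\,R(D)}{-\log D}.
\end{align*}
```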
Bloechl Clemens, Amjad Rana Ali, Geiger Bernhard
We present an information-theoretic cost function for co-clustering, i.e., for simultaneous clustering of two sets based on similarities between their elements. By constructing a simple random walk on the corresponding bipartite graph, our cost function is derived from a recently proposed generalized framework for information-theoretic Markov chain aggregation. The goal of our cost function is to minimize relevant information loss, hence it connects to the information bottleneck formalism. Moreover, via the connection to Markov aggregation, our cost function is not ad hoc, but inherits its justification from the operational qualities associated with the corresponding Markov aggregation problem. We furthermore show that, for appropriate parameter settings, our cost function is identical to well-known approaches from the literature, such as “Information-Theoretic Co-Clustering” by Dhillon et al. Hence, understanding the influence of this parameter admits a deeper understanding of the relationship between previously proposed information-theoretic cost functions. We highlight some strengths and weaknesses of the cost function for different parameters. We also illustrate the performance of our cost function, optimized with a simple sequential heuristic, on several synthetic and real-world data sets, including the Newsgroup20 and the MovieLens100k data sets.