Tackling Gravitational Waves with Neural Networks

Nan Jiang

Due Thu May 13

SYS 6018 Course Project| Spring 2021 | University of Virginia

What’s wrong with the old method?

As is well known, recent years have seen great triumphs of gravitational wave detection of compact binaries, like binary black hole (e.g.GW150914) and binary neutron star (e.g.GW170817). Statistical inference is the eyepiece of gravitational-wave observatories: it maps the noise-dominated detector output into probabilistic assessments of candidate significance, and into posterior probability densities for the physical parameters of confirmed detections. The mathematical setup of gravitational-wave (GW) data analysis is simple, with most of its salient features apparent in the classical time-series likelihood: \[ \mathcal{L}(\theta) := p(x|\theta) \propto \exp\left\{-\frac{1}{2}\langle x - h(\theta)|x-h(\theta)\rangle\right\}. \] Here, \(x\) is the detector output, \(h\) is the modeled detector response to an incoming GW with source parameters \(\theta\), and \(\langle|\rangle\) is the complex noise-weighted inner product \[ \langle a(t)|b(t) \rangle := 4 \int_0^\infty df \frac{\tilde{a}^*(f)\tilde{b}(f)}{S_n(f)}, \] with tildes denoting Fourier transforms and \(S_n\) the one-sided power spectral density of detector noise \(n\). The first equation essentially describes \(n\) as Gaussian and additive, with the sampling distribution \(p(n)\propto\exp{\langle n|n\rangle^2/2}\).

The challenge in using the likelihood equation for detection and parameter estimation is entirely practical. In the Bayesian analysis of signals immersed in noise, we seek a representation for the posterior probability of one or more parameters that govern the shape of the signals. Unless the parameter-to-signal map is very simple, the analysis comes at significant computational cost, as it requires the stochastic exploration of the likelihood surface over parameter space. Such is the case, for example, of parameter estimation for the gravitational-wave sources such as the compact binaries detected by LIGO-Virgo; here each likelihood evaluation requires that we generate the gravitational waveform corresponding to a set of source parameters, and compute its noise-weighted correlation with detector data. Waveform generation is usually the costlier operation. Also extending the analysis from the data we have obtained to the data we might expect (scoping out parameter estimation for future experiments) further increases the expense, since we need to explore posteriors for many noise realizations, and across the domain of possible source parameters. To see this more clearly, we price the evaluation of a single Bayesian posterior at \(\gtrsim 10^6\) times the cost of generating a waveform, and the characterization of parameter estimation for a single source type at \(\gtrsim 10^6\) times the cost of a posterior. With current computational resources, this means that accurate component-mass estimation is only available after hours or days after the detection of a binary black-hole merger, while any extensive study of parameter estimation prospects must rely on less accurate and reliable techniques such as Fisher-matrix approximations.

To speed up the process of Bayesian inference for gravitational-wave astronomy, using deep-learning techniques could instantly produce the posterior for a certain set of source parameters given the detector data. Below we would talk about what neural network is and how it could be applied to gravitational-wave posterior inference.

Motivation of Neural Networks

a. Where random forest fails

For non-image/text data, gradient boosting and random forest models could do a pretty good job on classification problems. Gradient descent finds the model parameters that minimize the loss function. From any starting point, we move towards the optimum model guided by gradient descent, illustrated in the below equation \[ f_{i+1} = f_i - \lambda_i \frac{d \mathcal{L}}{d f_i} \] where parameter \(\lambda_i\) is the step size of the current iteration. So gradient boost pushes decision trees to a pretty high accuracy.

But there is a problem with the model \(f_i\): decision tree algorithm is greedy. A decision tree is reasonable when each base features carries some reasonable amount of information. But this is not the case if you consider image recognition. We could treat the pixel bases as base feature, then random forest will first choose a set of pixels randomly and then start making a decision tree by cutting on the value fo these pixels. If we would like to distinguish between cats and dogs, decision trees will search for the pixel that best splits cats and dogs, while in reality we know there is no such pixel. Whether an image is cat or dog has nothing to do with any particular pixel. It is the correlation with local pixels that matter, not the values of pixels.

If you shift images, it doesn’t change whether it’s a frog or not

b. Where neural network comes in

Neural networks are not greedy in the same sense. They do not make choices on a certain parameter independent of others. When neural network makes a gradient step, it updates parameters that effect all parts of the image at all scales. A decision tree just looks at one pixel. By looking at one pixel at a time, a decision tree could not really discover a good separation boundary. The real separation boundary in this high dimensional space is rarely perpendicular with the original basis. The neural networks differentiate from random forests in

Neural network can be given hints early on what kind of transformation it should be invariant to, which allows it to avoid choosing bad initial features.
Neural network does not assume base features to be informative individually or that it can progress by greedily optimizing along each base feature direction at a time.
Neural network naturally builds features hierarchically.

Neural network is an information processing model inspired by biological neuron system. It follows a non-linear path and process information in parallel throughout the system. It was designed to solve problems which are easy for humans and difficult for machines such as identifying pictures. As we will see in the later part of this tutorial, it can also be used to spot gravitational wave event in a stochastic background and infer physical parameters from this noisy data set.

What is a neural network?

Introduction

A neural network is a two-stage regression or classification model, typically represented by a network diagram shown below. This network applies to both regression and classification. It consists of various simple but highly connected elements or nodes, called "neurons", which are organized in layers which process information using dynamic state response to external inputs. Patterns are introduced to the neural network by the input layer that has one neuron for each component present in the input data and is communicated to one or more hidden layers present in the network. It is the hidden layers where all the processing happens through a system of connections characterized by weights and biases. Once the input is received, the neuron will calculate a weighted sum adding also the bias. Then according to the result and a pre-set activation function, it decides whether it should be ‘fired’ or activated. Afterwards, the neuron transmits the information downstream to other connected neurons in a process called forward pass. In the end, the last hidden layer is linked to the output layer which has one neuron for each possible expected output.

A schematic neural network with two hidden layers and a single neuron output

Components of Neural Networks

Below is an example of a simple, one-layer network for \(K\)-class classification problem.

There are \(K\) units on the top with \(k\)th unit modeling the probability of class \(k\).
The target measurements \(Y_k\) are being coded as 0- 1 variable for \(k\)th class.
Features \(Z_m\) are derived from linear combinations of the input.
Target \(Y_k\) is modeled as a function of linear combinations of \(Z_m\).

Neural network with one hidden layer and feed-forward

Activation Function

As we notice in hidden units \(Z_m\), there is a function \(\sigma(\nu)\). This function is called activation function. It allows the network to learn complex mapping functions. Since if \(\sigma\) is the identity function, the entire model would collapse to a linear model in the inputs. A widely used choice is \(sigmoid\) function \[ \sigma(\nu) = \frac{1}{1 + e^{-\nu}} \]

However, a general problem with the sigmoid function is that it saturates, which means that it is only very sensitive to changes around its mid-point such as 0.5. The limited sensitivity and saturation of the function happen regardless of whether the summed activation fromt he nodes provided as input contains information or not. Once saturated, it becomes challenging for the learning algorithm to continue to adapt to the weights to improve the performance of the model.

In what we will discuss in later sections, which are applications of neural nets to gravitational wave data analysis, the activation function to be employed is called rectified linear activation function or ReLU for short. It will output the input directly if it is positive, otherwise, it will output zero. It acts like a linear function, but is, in fact, a nonlinear function allowing complex relationships in the data to be learned.

The output function \(g_k(T)\)

The output function allows a final transformation of the vector of outputs \(T\). For regression problems, people normally choose the identity function \(g_k(T) = T_k\). For classification problem, a softmax function is a more popular choice. \[ g_k(T) = \frac{e^{T_k}}{\sum_{i=1}^K e^{T_i}} \]

How does a neural network learn?

Weight and Bias

The unknown parameters in the neural network model are often called weights. And these are the values to make the model fit the training data well. In this tutorial, the complete set of weights will be denoted by \(\theta\). Still using the previous example, the weights are \[ \{\alpha_{0m}, \alpha_m; m = 1,2,...,M\},~~ M(p+1) ~\mathrm{weights~in~total},\\ \{\beta_{0k}, \beta_k; k = 1,2,...,K\},~~ K(M+1) ~\mathrm{weights~in~total}. \] Note that \(\alpha_{0m}\) and \(\beta_{0k}\) are intercepts, which are often captured by bias units.

Loss Function

Next, we need a loss function to penalize errors in prediction. For regression problems, people usually use sum-of-squared errors as measure of fit \[ R(\theta) = \sum_{i = 1}^N R_i= \sum_{k=1}^K \sum_{i = 1}^N (y_{ik} - f_k(x_i))^2. \] For classification, we either use squared error or cross-entropy (deviance) : \[ R(\theta) = -\sum_{k=1}^K \sum_{i = 1}^N y_{ik}\log f_k(x_i). \] with the corresponding classifier \(G(x) = \mathrm{argmax}_k f_k(x)\).

Back propagation

The generic approach to minimizing \(R(\theta)\) is by gradient descent, called back-propagation in this setting. The gradient could be easily derived through chain rule. This can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit. We shall see this in detail for squared error loss.

A gradient descent update at the \((r + 1)\)st iteration has the form \[ \beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i = 1}^N \frac{\partial R_i}{\partial \beta_{km}^{(r)}},~k=1,..,K,m=1,..,M+1,\\ \alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i = 1}^N \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}},~m = 1,..,M,l=1,..p+1, \] where \(\gamma_r\) is the learning rate, the derivatives are obtained from chain rule, \[ \frac{\partial R_i}{\partial \beta_{km}} = -2(y_ik - f_k(x_i))g_k' (\beta_k^T z_i) z_{mi} = \delta_{ki}z_{mi},\\ \frac{\partial R_i}{\partial \alpha_{ml}} = -\sum_{k = 1}^K2(y_ik - f_k(x_i)) g_k' (\beta_k^T z_i) \beta_{km} \sigma'(\alpha_m^T x_i) x_{il} = s_{mi} x_{il}, \] where \(z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)\). The quantities \(\delta_{ki}\) and \(s_{mi}\) are “errors” from the current model at the ouput and hidden layer units respectively. From this definition, these errors satisfy \[ s_{mi} = \sigma'(\alpha_m^T x_i)\sum_{k=1}^K \beta_{km}\delta_{ki}, \] known as back-propagation equations.

Back-propagation is a two-pass procedure. In the forward pass, the current weights are fixed and the predicted values \(\hat{f}_k(x_i)\) are computed. In the backward pass, the errors \(\delta_{ki}\) are computed, and then back-propagated to give the errors \(s_{mi}\). Then both sets of errors are used to compute gradients for updates. The advantages of back-propagation its simple, local nature. Since each hidden unit passes and receives information only to and from units that share a connection, the algorithm could be implemented efficiently on a parallel architecture computer.

Overfitting

Typically, we do not want the global minimizer of \(R(\theta)\), since it is likely to lead to an overfitted solution. This will be achieved by a penalty term or early stopping.

In early development of neural networks, either by design or by accident, an early stopping rule is used to avoid overfitting. The model is trained for a while, and stopped well before the global minimum is reached. A validation data set is useful for determining when to stop, since we expect the validation error to start increasing.

A more explicit method for regularization is weight decay, which is similar to ridge regression. After adding the penalty, the error function became \[ R(\theta) + \lambda J(\theta) = R(\theta) + \lambda \left(\sum_{k,m} \beta^2_{km} + \sum_{m,l} \alpha^2_{ml}\right). \] Typically, cross-validation is used to estimate \(\lambda\). There are other kinds of penalty, like weight elimination penalty. We do not elaborate on this for simplicity here.

Example of Implementation in R

In this section, we demonstrate a simple example of neural network application in R.

Package Required

library(neuralnet)
library(tidyverse)
library(GGally)
library(ggplot2)

The training dataset

The training data set will be Haberman’s Survival data set from UCI’s Machine Learning Repository. It contains cases from a 1958 and 1970 study conducted on the survival of 306 patients who had undergone surgery for breast cancer. We will use this data set to predict a patient’s 5-year survival as a function of their age at date of operation, year of operation and the number of positive Axillary Lymph nodes detected.

We are concerned with the Axillary Lymph Nodes

Survival is imported as integer. A value of 1 means that the patient survived for at least 5 years after the operation, while 0 means that the patient does not survive for five years.

url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases//haberman/haberman.data' 
#download data
Haberman_Data <- read_csv(file = url,
                     col_names = c('Age', 'Operation_Year', 
                                   'Num_Pos_Nodes','Survival')) %>%
  na.omit() %>% #drop NA 
  mutate(Survival = ifelse(Survival == 2, 0, 1),
         Survival = factor(Survival))

An overview of what the data looks like is as below.

The training process

First, all features should be scaled to \([0,1]\) interval. Survival is transformed to the Boolean feature.

scale_func <- function(x){
  (x - min(x)) / (max(x) - min(x))
}

Haberman_Data <- Haberman_Data %>%
  mutate(Age = scale_func(Age), 
         Operation_Year = scale_func(Operation_Year), 
         Num_Pos_Nodes = scale_func(Num_Pos_Nodes), 
         Survival = as.numeric(Survival)-1)
Haberman_Data <- Haberman_Data %>%
  mutate(Survival = as.integer(Survival) - 1, 
         Survival = ifelse(Survival == 1, TRUE, FALSE))

One hidden layer with one neuron

set.seed(2021)
Haberman_NN1 <- neuralnet(Survival ~ Age + Operation_Year + Num_Pos_Nodes, 
                     data = Haberman_Data, 
                     linear.output = FALSE, #we are performing classification
                     err.fct = 'ce',   #cross entropy used as error
                     likelihood = TRUE)  #we would like to see AIC and BIC metrics

The black lines show connections with weights, the blue lines correspond to the bias terms.

Use Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to evaluate this neural network model.

NN1_Train_Error <- Haberman_NN1$result.matrix[1,1]
paste("CE Error: ", round(NN1_Train_Error, 3))

## [1] "CE Error:  0.009"

NN1_AIC <- Haberman_NN1$result.matrix[4,1]
paste("AIC: ", round(NN1_AIC,3))

## [1] "AIC:  12.018"

NN2_BIC <- Haberman_NN1$result.matrix[5,1]
paste("BIC: ", round(NN2_BIC, 3))

## [1] "BIC:  34.36"

Two hidden layer with one or two neurons

set.seed(2022)

Haberman_NN2 <- neuralnet(Survival ~ Age + Operation_Year + Num_Pos_Nodes, 
                     data = Haberman_Data, 
                     linear.output = FALSE, 
                     err.fct = 'ce', 
                     likelihood = 
                       TRUE, hidden = c(2,1)) #Layer1 : 2 neurons, Layer2 : 1 neuron

Haberman_NN3 <- Haberman_NN2 <- neuralnet(Survival ~ Age + Operation_Year + Num_Pos_Nodes, 
                                data = Haberman_Data, 
                                linear.output = FALSE, 
                                err.fct = 'ce', 
                                likelihood = TRUE, 
                                hidden = c(2,2)) #Layer1 : 2 neurons, Layer2 : 2 neurons

Haberman_NN4 <- Haberman_NN2 <- neuralnet(Survival ~ Age + Operation_Year + Num_Pos_Nodes, 
                                data = Haberman_Data, 
                                linear.output = FALSE, 
                                err.fct = 'ce', 
                                likelihood = TRUE, 
                                hidden = c(1,2)) #Layer1 : 1 neuron, Layer2 : 2 neurons

AIC is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. BIC is similar to AIC, but with a different penalty for the number of parameters. With AIC the penalty is \(2k\) (\(k\) is the number of parameters), while for BIC the penalty is \(\ln(n)k\) (\(n\) is the sample size). As we can see from the above plot, in this case, the best classification neural network is the simplest one.

Application 1: Learning Bayesian posteriors for graviational-wave inference

This work is done by Alvin J. L. Chua and Michele Vallisneri, their paper is available on Physical Review Letters.

In this paper, the authors show how one or two-dimensional projections of Bayesian posteriors may be produced using deep neural networks trained on large ensembles of signal + noise data streams. Specifically, they adopt multilayer perceptrons. The network for each source parameter or parameter pair takes as input a noisy signal and instantly ouputs an approximate posterior, represented as either a histogram or a parametric Gaussian mixture. The network is called PERCIVAL: Posterior Estimation Results Computed Instantaneously Via Artificial Learning.

Reconstructed waveform of GW150914 embedded in noise [Credit:LVC]

Posterior PDFs for component mass inferred from GW150914 [Credit:LVC]

Now, let’s talk about how the neural networks are trained to produce posteriors. A perceptron classifier is a network which takes a data vector as input and ouputs the estimated probabilities that the input belongs to each member of a universal set of disjoint classes. The ’Bayesian optimal discriminant` is the classifier that returns the posterior probabilities. It is a well-established result in machine learning literature that perceptron classifiers can approximate Bayesian optimal discriminants when they are trained on a population of inputs distribution. To move from classifiers to posterior densities for continuous parameters, classes are defined based on a binning of the parameter domain of interest, in which case the network will output histograms of the target posterior. This coarse-graining of parameter space highlights the relationship between classification and regression, but is not actually necessary in this case, since the network can instead output a parametric posterior representation, for example, a Gaussian mixture, to be fed into an analogous loss function. The loss function in the training process is \[ L \simeq - \sum_{j} \ln q(\theta_j | D_j), \] where \(\theta\) denotes the parameters for which we seek a posterior, \(D\) is the input data vector, and \(q(\theta|D)\) is the network-estimated posterior.

The PERCIVAL network consists of the input layer for the 482 ROMAN weights (241 complex numbers); of 8 1024-wide hidden layers (with leaky ReLU activation function); and of an output layer (with linear activation) describing the six parameters of a two-dimensional joint normal distribution. By default, the work is in single precision; softmax=True is required for Gaussian-mixture posteriors. The construction is done with Pytorch.

Net_roman_G2 = truebayes.network.makenet([241*2] + [1024] * 8 + [1*6], softmax=False)

Training is performed using batch gradient descent with Adam optimization and a manually decayed learning rate. A new batch of \(10^5\) signals is generated at every training epoch to eliminate overfitting.

The example result is shown in the plot below. The estimated posteriors are marked as yellow, while true ones are marked in black. This is a joint estimation of posterior of parameters \(\theta = (M_c, \eta)\). \(\rho\) is the signal-to-noise ratio.

Estimated posteriors VS true posteriors [Credit:LVC]

Application 2: Test of Strong Gravity using Deep Learning

This work is done by Swetha Bhagwat and Costantino Pacilio, their paper is available on arXiv. This work presents another way to infer Bayesian posteriors, which is a convolutional neural network trained on parameter-signal pairs to ouput samples.

In this work, the authors proposed the feasibility of test of General Relativity (GR) using a deep learning network, setting a precedence for performing precision tests of gravity with neural network. This test is particular useful for third-generation gravitational wave detectors where ringdowns are expected to be loud and the number of detections can be \(\sim 10^3 - 10^4\)/year. Since analyzing such a large number of of events demands a rapid and computationally efficient inference algorithm. The authors train a neural network architecture called conditional variational autocoder (CVAE) to infer posterior distribution of parameter set \(\{M, \chi_f,q\}\) (mass,spin and mass ratio) from a set of simulated ringdown waveforms.

The authors simulate a data set of \(10^5\) ringdwons by sampling waveform parameters uniformly in certain ranges. The waveforms are embedded in simulated Einstein-Telescope-like noise segments. For tests, a new data set of \(10^3\) simulated ringdown waveforms is generated. For each input waveform, CVAE samples \(10^4\) distinct points to build the posterior.

CVAE is consisted of three neural networks. There are encoder and decoder networks to first map input into the latent layer, then maps the latent representation of the input into the output probability distribution. For training part, besides encoder and decoder parts, there is also a guide part to train the network. An overall representation is as below.

A schematic representation of CVAE [Credit: S.Bhagwat and C.Pacilio (2021)]

The specific hyper-parameters of each component network is shown in below table. We could see the layers and what type of layers is used.

Hyper-parameters of the networks [Credit: H. Gabbard et.al. (2019)]

The total loss function to be optimised is \[ \mathcal{L}_\mathrm{tot} = \mathcal{L}_\mathrm{recon} + \beta \mathcal{L}_\mathrm{KL} \] in which \(\mathcal{L}_\mathrm{KL}\) quantifies the ability of the encoder to produce a meaningful mapping of the input into the latent space, \(\mathcal{L}_\mathrm{recon}\) measures the probability that the true values \(x_{true}\) falls within the decoder distribution. \(\beta\) gives flexibility in implementing effective training strategies.

A quantitative diagnostic of the network performances is presented in the P-P plot. The plot shows marginalized cumulative distribution (CDF) of the true values \(x_\mathrm{true}\) as a function of p% confidence interval. A diagnal P-P plot means that \(x_\mathrm{true}\) is contained p% of the times within p% confidence interval of the marginalized posterior for \(x_\mathrm{pred}\). In the plot, we see that all CDFs are consistent with the diagonal. This means that the CVAE recovers the posterior samples for mass, spins and mass ratio from the ringdown reliably.

P-P plot of test data set [Credit: S.Bhagwat and C.Pacilio (2021)]

After an inference on certain phyiscal parameters from ringdown waveforms is accomplished with neural networks, people could compare it with the parameters predicted by General Relativity. A conclusion could be drawn on whether GR is valid by checking the consistency between the values drawn from observations and from GR.

Reference

[1] Alvin J. K. Chua and Michele Vallisneri, PhysRevLett.124.041102 (2019), [https://doi.org/10.1103/PhysRevLett.124.041102]

[2] Swetha Bhagwat and Costantino Pacilioy, arXiv:2101.07817 (2021),[https://arxiv.org/pdf/2101.07817.pdf]

[3] Hunter Gabbard, Chris Messenger, Ik Siong Heng1, Francesco Tonolini, and Roderick Murray-Smith, arXiv:1909.06296 (2019), [https://arxiv.org/pdf/1909.06296.pdf].

[4] Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Ed), Springer (2008).

[5] “Classification Artificial Neural Network”, UC Business Analytics R Programming Guide, [http://uc-r.github.io/ann_classification]

[6]“A Gentle Introduction to the Rectified Linear Unit (ReLU)”, Machine Learning Mastery, [https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/]

[7]“Understanding Neural Networks: What, How and Why?”, towards data science, [https://towardsdatascience.com/understanding-neural-networks-what-how-and-why-18ec703ebd31]

[8] “In what situations do neural networks outperform gradient boosting and random forest models on regular numeric and categorical data (non-image or text data) if any?”, Quora, [https://www.quora.com/In-what-situations-do-neural-networks-outperform-gradient-boosting-and-random-forest-models-on-regular-numeric-and-categorical-data-non-image-or-text-data-if-any]