



Deep Learning for

Counterfactual Inference

Michael Weisz

Green Templeton College

Introduction


Motivation

The technological advances of recent years have resulted in an increasing availability of data in various fields such as healthcare, education, and economics.

The potential of this data is enormous, as it helps us identify, understand, and describe the underlying mechanics and correlations within the data. Furthermore, it allows us to create predictive models that can be used to estimate the effect of actions we take, which is highly relevant for future decision-making.

When considering causal relations within the data, we are often interested in answering counterfactual questions such as "What would have happened if a different action had been taken?". Such questions typically arise from observational studies which investigate the observed effect an intervention of interest has on a given subject. In a medical context, for instance, this might concern how a number of patients — each with individual features such as blood pressure, heart rate, and age — have responded to being treated with one of two possible treatments.

Answering counterfactual questions provides a way to estimate the individualised treatment effect (ITE) of a treatment, which is the difference between the factual (observed) outcome and the counterfactual (unobserved) outcome. This quantity can be estimated by fitting a model to the observed data in order to predict the ITE for a given individual, helping us to make an informed decision concerning which treatment is preferable over the other.
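As a minimal illustration of this idea, the sketch below estimates the ITE for a single subject from two outcome models, one per treatment arm. The feature names and model coefficients are hypothetical stand-ins for models that would in practice be fitted to the observed data:

```python
# Sketch: estimating the individualised treatment effect (ITE) for one subject.
# The two outcome models below are hypothetical stand-ins for models fitted
# to the observed (factual) outcomes.

def predict_outcome_treated(x):
    # hypothetical fitted model for the outcome under treatment
    return 2.0 * x["blood_pressure"] + 0.5 * x["age"]

def predict_outcome_control(x):
    # hypothetical fitted model for the outcome without treatment
    return 2.5 * x["blood_pressure"] + 0.4 * x["age"]

def estimated_ite(x):
    # ITE = (predicted outcome under treatment) - (predicted outcome under control)
    return predict_outcome_treated(x) - predict_outcome_control(x)

subject = {"blood_pressure": 120.0, "age": 50.0}
print(estimated_ite(subject))  # -55.0 (negative: treatment predicted to lower the outcome)
```

Only one of the two predicted outcomes is ever observed for a real subject; the other must be inferred, which is precisely what makes the problem hard.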

A number of statistical models have been applied to counterfactual reasoning, achieving different levels of success depending on the amount of available training data, the nature of the problem, and the expressiveness of the models themselves. In addition, these models often depend on close collaboration with domain experts, who incorporate their specific knowledge into the models.

In contrast, the field of machine learning tries to automatically infer appropriate models from the data with no (or minimal) need for human intervention, aiming to minimise the assumptions that have to be imposed on the models. In particular, the sub-field of deep learning, which makes use of deep neural networks, has been applied to the problem of counterfactual inference only very recently but has already achieved promising first results. This class of models, however, introduces additional challenges regarding the selection of appropriate architectures and hyper-parameters, and represents an open area of research.

This dissertation aims to investigate existing methods and to improve the state of the art in counterfactual reasoning. It focuses on deep neural networks as the model of choice, as they are able to deal with large amounts of data and have proven able to capture complex non-linear dependencies, making them highly suitable for the task. In addition, only limited research has been done in this field so far.

The applications of counterfactual inference are ubiquitous and highly relevant to a variety of fields including treatment-planning in healthcare, policy-making for organisations, or even ad-placement for online platforms. As a consequence, research in this field can have a significant impact on a number of disciplines and potentially affect many people’s lives by helping important institutions, governments, and industries to make more informed decisions.

Scope

This dissertation presents the findings of the master's project Deep Learning for Counterfactual Inference, conducted as part of the MSc in Computer Science programme at the University of Oxford in 2016/17.

The goal of the project is to improve the state of the art in counterfactual reasoning using deep neural networks. This includes evaluating the effectiveness of existing methods and architectures and developing new approaches for the task of counterfactual inference.

We evaluate our proposed methods by running a number of experiments using synthetic and real-world datasets and demonstrate how they compete with existing state-of-the-art methods.

Investigated questions include how to effectively train large networks in order to achieve a high accuracy and make meaningful predictions.

Contribution

The contributions of the project findings are twofold:

Firstly, we propose deep counterfactual networks (DCNs) — a novel architecture for counterfactual inference that conceptualises it as a multi-task learning problem using separate outcomes for the treated and untreated subjects. In addition, we introduce propensity-dropout — a novel way of regularising our model to avoid over-fitting by using a variation of standard dropout that is dependent on the subject's propensity score (a measure quantifying a subject's probability of being treated). These contributions have been submitted in the form of a paper to the Workshop on Principled Approaches to Deep Learning (PADL) at ICML 2017.

Secondly, we introduce a novel and efficient way to automatically learn an appropriate architecture for a DCN by exploiting specific characteristics of the dataset without the need for computationally expensive hyper-parameter optimisation.

Using experiments on synthetic and real-world data, we show that our approaches outperform the state of the art in counterfactual inference.

Thesis Structure

The thesis consists of three parts and is structured into 8 chapters, which are briefly described below. The first part, Introduction, comprises chapters 1 and 2 and introduces the problem background, relevant theory, and related works. The second part, Methodology, comprising chapters 3 to 5, introduces our own contributions — namely the concept of deep counterfactual networks, a corresponding dropout scheme, and an efficient way to derive its architecture. The third and last part of the thesis, Conclusion and Future Work, comprising chapters 6, 7, and 8, contains a summary of our findings, discusses the results and our contribution, and concludes with future work.

Chapter 2: Background

The second chapter describes the theoretical foundations of counterfactual inference using deep learning. We briefly describe core concepts of causal inference, machine learning, deep learning, architecture learning, and data generation.

In addition, we discuss the status quo in counterfactual inference and related works.

Chapter 3: Deep Counterfactual Networks

The third chapter introduces our model of deep counterfactual networks — a deep neural network conceptualising causal inference as a multi-task learning problem. We describe its architecture, the concept of propensity-dropout and discuss its contribution and challenges.

Chapter 4: Learning DCN Architectures

The fourth chapter covers the second part of our main contributions — an efficient way to automatically derive an appropriate architecture for DCNs by exploiting specific characteristics of the dataset. We introduce each characteristic separately, formalise a corresponding metric, and show an algorithm that uses these metrics to derive a suitable architecture.

Chapter 5: Experiments

In the fifth chapter, we conduct a series of experiments on our proposed models in order to evaluate their performance. Firstly, we use a synthetic model for which we have full access to the counterfactual outcomes and are able to parametrise the characteristics of the data to see how they affect the performance of our models. Secondly, we apply our models to a real-world dataset to see how well they generalise to unknown data.

Chapter 6: Conclusion

The sixth chapter summarises our findings and discusses the results of the experiment section. We compare our models' performances to competing methods and critically evaluate the underlying factors.

Chapter 7: Contribution

In the penultimate chapter we discuss the implications of our approach and describe the contributions of the project to the task of causal inference.

Chapter 8: Future Work

The final chapter of the thesis discusses open areas of research in the field of counterfactual inference using neural networks. In particular, we describe current limitations of our proposed models and how they could be overcome.

Theoretical Background

Introduction

This chapter describes the theoretical background of machine learning and counterfactual inference including their formalisation, core concepts, state-of-the-art, and challenges.

We start by giving a general introduction to machine learning and  its core concepts.

The second part of the chapter deals with  deep learning — a particular subset of machine learning that uses deep neural networks.

The last part of the chapter is dedicated to the problem of counterfactual inference. We describe the importance of the problem and its application areas, and give a formalisation. Furthermore, we outline the challenges of the problem and the different approaches that have been applied to it so far. We conclude by relating counterfactual inference to deep learning and describing its open research questions, which form the very foundation of this thesis.

Machine Learning

Motivation

When conceptualising computer programs, we often think of them as series of unambiguous, (mostly) atomic instructions that are executed in a deterministic way and have to be explicitly programmed by a human programmer.

While certain problem areas — in particular those for which a well-defined algorithm or effective step-by-step solution strategy exists — can successfully be expressed and ultimately solved this way, there are various problems where this seems infeasible. A typical example of the second category is autonomous driving, for which there are far too many situations and eventualities to encode the desired behaviour of an autonomous vehicle as a finite sequence of conditional instructions.

Machine learning is a subfield of computer science that deals with the question of how to teach computer programs to learn without being explicitly programmed what to do.

More formally, an algorithm "is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E". (Mitchell)

Recalling our example of autonomous driving, for instance, we could define the task in terms of "moving the car from A to B", while our performance measure might include aspects such as the travel time and the number of people or objects harmed. The experience would consist of the training data that is acquired from previously covered distances (either autonomously or by "observing" a human driver).

Types of Learning

Machine learning is typically categorised into supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning

In supervised learning the task is to infer a function over a number of samples from a set of training data that is said to be labelled. In other words, for each sample in the training data, we have access to the actual function value we want to predict.

For instance, we might be interested in predicting housing prices, where we are given a dataset consisting of historic information regarding houses (e.g. number of rooms, floor area, etc.), commonly referred to as features or covariates, and the actual price for which each house has been sold, called the outcome, target, or label. Once a model has been trained using the historic labelled dataset, we can use the same model to predict housing prices on unseen data points, i.e. new houses for which the price is unknown. Such a task of predicting a continuous variable is called a regression task, whereas predicting a discrete output is referred to as a classification task.
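A minimal sketch of this supervised regression setting, using closed-form simple linear regression on a toy labelled dataset (the numbers are invented purely for illustration):

```python
# Sketch: supervised regression on a toy "housing" dataset (features -> price).
# Closed-form simple linear regression: price ~ a * rooms + b.

rooms = [2, 3, 4, 5]          # feature (covariate)
price = [100, 150, 200, 250]  # label (outcome), in thousands

n = len(rooms)
mean_x = sum(rooms) / n
mean_y = sum(price) / n

# least-squares estimates for slope and intercept
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, price)) / \
    sum((x - mean_x) ** 2 for x in rooms)
b = mean_y - a * mean_x

def predict(n_rooms):
    # apply the fitted model to an unseen data point
    return a * n_rooms + b

print(predict(6))  # 300.0 -- prediction for an unseen house
```

Replacing the continuous price with a discrete label (e.g. "sold" vs "not sold") would turn the same setting into a classification task.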

Unsupervised Learning

In contrast to supervised learning, the training set in an unsupervised setting does not contain the target labels. Typical tasks in this field include finding particular patterns in the data and clustering it accordingly. It is important to note, however, that in unsupervised learning there is normally no ground truth, meaning that the performance (or quality) of the outcome cannot easily be measured or even defined in absolute terms but might depend on the underlying use-case of the clustering. Another important task in unsupervised learning is dimensionality reduction, for which a (in terms of the features) high-dimensional dataset is reduced to a target lower dimension while trying to minimise the information loss.
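Clustering can be sketched with a minimal k-means on unlabelled one-dimensional data; the points, initial centres, and choice of k = 2 below are illustrative:

```python
# Sketch: unsupervised clustering with a minimal k-means (k = 2) on 1-D data.
# No labels are given; the algorithm groups the points by proximity alone.

def kmeans_1d(points, centres, iterations=10):
    for _ in range(iterations):
        # assignment step: each point joins its nearest centre
        clusters = [[], []]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # update step: each centre moves to the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0])
print(centres)  # roughly [1.0, 9.0]
```

Whether this grouping is "good" cannot be read off a loss against ground-truth labels; it depends on what the clusters are used for, which is exactly the point made above.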

Reinforcement Learning

In reinforcement learning, a software agent has to learn appropriate actions in a dynamic environment in which the consequences of an action might not be immediately accessible but add up to a long-term reward that ought to be maximised.

A most prominent example was mentioned earlier with the task of autonomous driving in which a vehicle is governed by a piece of software that has to find adequate series of actions (steering, regulating speed, etc.) in order to reach a target destination in a complex and ever-changing environment.

Machine Learning Models

As illustrated in figure , machine learning is based on the concept of selecting an appropriate model and fitting it to a given (training) data set using a training algorithm.

For instance, we might want to use a linear model for our task of predicting housing prices. We can then make use of linear regression, which allows us to fit our linear model to our labelled dataset of existing houses and their prices.

The selection of an appropriate model is of great importance and determines important factors such as the trade-off between the expressiveness (i.e. the ability to capture complex relationships) and the computational complexity of our model (i.e. how difficult it is to train it).

There are a number of models, each with individual pros and cons depending on the task and the desired characteristics. Typical models include linear models, decision trees, and neural networks.

This dissertation focuses on using deep learning for the problem of counterfactual inference. Therefore, a dedicated part of this chapter deals exclusively with deep neural networks and describes their characteristics in more detail.

Regularisation

When training our model we have to find the right balance between fitting it as accurately as possible to the training data while making sure that the model generalises well to unseen data points.

Given that the expressiveness of the model is sufficiently high, we might naively fit our model to the training data perfectly, resulting in a training error of zero. Such a model, however, would merely "memorise" the training data and might perform poorly on unseen future data points. This phenomenon, where we overfit the training data, is often referred to as high variance.

In contrast, if the model is too simple, we might not be able to accurately capture the relationships in our data, leading to equally poor results. In this case, we are under-fitting the data and our model has a so-called high bias.

Therefore, the goal is to train the model in a way that neither overfits nor underfits and generalises well to unseen data points. This can be achieved by a concept called regularisation.
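One classical instance of regularisation is an L2 penalty on the model parameters. For a one-parameter linear model the effect has a closed form, sketched below with invented data:

```python
# Sketch: L2 regularisation on a one-parameter model y ~ a * x.
# The penalty term lam * a**2 shrinks the fitted slope towards zero,
# trading a little bias for lower variance.

def fit_slope(xs, ys, lam=0.0):
    # minimiser of sum((y - a*x)^2) + lam * a^2  (closed form)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(fit_slope(xs, ys, lam=0.0))   # 2.0 -- unregularised slope
print(fit_slope(xs, ys, lam=14.0))  # 1.0 -- slope shrunk towards zero
```

The penalty strength lam is a hyper-parameter: lam = 0 recovers the unregularised (potentially high-variance) fit, while a very large lam forces the model towards the trivial (high-bias) solution.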

Deep Learning

Introduction

Deep learning refers to a field of machine learning that is based on the use of so called deep neural networks.

The approach is inspired by how the human brain works, modelling biological neurons and their interconnections to other neurons in terms of an artificial neuron that has similar properties.

The details of our understanding of the inner mechanics and biochemical processes of the human brain are beyond the scope of this thesis. On a high level, however, each neuron receives input signals on its dendrites, which are connected to neighbouring neurons' axons via synapses. The inputs are accumulated and processed within the cell body, causing the cell to output a signal on its own axon, which in turn represents a potential input for other cells.

Through these interconnections the neurons form a highly complex structure that can be conceptualised as a biological neural network. It is estimated that the human brain possesses about 100 billion neurons, allowing it to process complex signals, form abstract concepts, and perform the general process people refer to as thinking.

In a bionic fashion an artificial neural network adopts this architecture in a simplified way by defining a network of artificial neurons that are connected according to a certain topology. The analogy between the human and the artificial neurons is illustrated in figure .

This way, an artificial neural network (henceforth called neural network), is able to capture complex correlations and interdependencies within the data.

The concept of neural networks has been in existence since the middle of the 20th century.

After an initial enthusiasm, however, neural networks lost traction in the following two decades due to the realisation that the existing hardware did not allow for a degree of scalability that would have been required to solve the desired problems.

This changed, however, when in 2012 a deep neural network outperformed all existing methods on the ImageNet challenge, which seeks to recognise and label a set of objects in pictures.

Ever since, deep neural networks have been responsible for some of the most recent successes including self-driving cars and defeating the world champion in Go.

According to Andrew Ng, researcher at Stanford University and former chief scientist at Baidu, two key factors are responsible for the renaissance of neural networks and have enabled their recent success: firstly, the advances in computational capacities, which include highly optimised processing units such as GPUs and ASICs and corresponding architectures such as distributed clusters and cloud computing; secondly, the availability of large datasets that can be used for training the models, such as web-scale text corpora for natural language processing or large databases of images such as ImageNet.

Today, deep neural networks represent the state of the art in many areas such as natural language processing and computer vision. They are widely considered one of the most promising areas of machine learning and artificial intelligence in general and have received a high level of attention in society, media, and politics. Despite their recent success, the usage of deep neural networks poses a multitude of computational, architectural, and domain-specific challenges and therefore represents one of the most dynamic areas of research.

The Multilayer Perceptron

In order to understand how neural networks work, we first investigate one of the most basic forms of neural networks: the multilayer perceptron or MLP.

Figure  depicts the schematics of an MLP. As the name suggests, it consists of multiple layers, each containing a fixed number of artificial neurons that are connected exclusively to neurons of neighbouring layers. The first, so-called input layer, is followed by one or more hidden layers leading to a final, so-called output layer.

The individual artificial neuron or unit is illustrated in figure . Its output y is computed as

y = \varphi\Big(\sum_{i} w_i x_i + b\Big)

where x_i corresponds to the inputs of the neuron, w_i to the weights, and b to a bias — trainable parameters of the model — and \varphi to a non-linear function referred to as the activation function. Typical choices for \varphi include

\sigma(x) = \frac{1}{1 + e^{-x}} \quad \text{or} \quad \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}

where \sigma refers to the sigmoid function and \tanh to the hyperbolic tangent. While each of these activation functions has different properties as illustrated in figure , a typical characteristic is that they map the input value to a bounded interval ((0, 1) for the sigmoid, (-1, 1) for the hyperbolic tangent).

Despite the relative simplicity of this model, it can be shown that an MLP with appropriate weights and bias parameters is able to approximate arbitrarily complex non-linear functions.
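The unit-level computation above can be sketched as a forward pass through a tiny MLP; the weights are hand-picked for illustration, not trained:

```python
import math

# Sketch: forward pass of a tiny MLP (2 inputs -> 2 hidden units -> 1 output)
# with sigmoid activations, matching y = phi(sum_i w_i * x_i + b) per unit.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # one unit: weighted sum of inputs plus bias, passed through the activation
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_forward(x):
    # hypothetical, hand-picked weights for illustration
    h1 = neuron(x, [1.0, -1.0], 0.0)   # hidden layer
    h2 = neuron(x, [-1.0, 1.0], 0.0)
    return neuron([h1, h2], [2.0, 1.0], -1.5)  # output layer

print(mlp_forward([0.5, 0.5]))  # 0.5
```

Training would consist of adjusting these weights and biases so that the outputs match the labelled data, as described in the training section below.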

Types of Neural Networks

There are different types of neural networks defined by a number of characteristics such as the network's architecture and the direction of information flow.

Feed-Forward Neural Networks Closely related to the MLP described in the previous section, feed-forward neural networks (FFNNs) represent a class of networks characterised by a set of hidden layers that have a similar shape and are often fully connected (i.e. every node in one layer is connected to every node in the neighbouring layers). As the name suggests, the information flow is strictly uni-directional from nodes in layers with lower indices to nodes in layers with higher indices (i.e. there are no loops). These networks represent the most basic type of network; they make no assumptions about the input data and are used in regression and classification tasks.

Convolutional Neural Networks While FFNNs make no assumptions about the input data whatsoever, it is often useful to exploit domain-specific knowledge about the specific input data. For instance, in computer vision the input of a network is typically an image encoded as a pixmap with the intensity values of each pixel. In this case, it seems naive and inefficient to assume independence of the inputs, ignoring aspects like the principle of locality: neighbouring pixels are more likely to have a similar colour or intensity than two randomly selected pixels.

A convolutional neural network or ConvNet is a special kind of feed-forward neural network whose architecture is designed to exploit the principle of locality in the data. This is typically achieved by alternating convolutional layers and pooling layers. The convolutional layers run a filter or kernel across their input layer, performing an image convolution, which can be thought of as a way to detect features (such as edges) in the image. The pooling layers typically perform some kind of aggregation (such as taking the maximum of multiple values) over the previous layer to reduce the dimensionality.

While convnets are particularly useful in computer vision and represent the state of the art in image classification, they can also be applied in other fields in which the data expresses some principle of locality.
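The convolution-and-pooling pipeline can be sketched in plain Python on a tiny greyscale image; the kernel below is a simple vertical-edge detector and all values are illustrative:

```python
# Sketch: one convolution (3x3 vertical-edge kernel) followed by 2x2 max
# pooling on a tiny greyscale "image"; all values are illustrative.

def conv2d(image, kernel):
    # slide the kernel over every valid position and sum the products
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool2(fmap):
    # aggregate each non-overlapping 2x2 window by its maximum
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

image = [[0, 0, 1, 1]] * 4       # vertical edge down the middle
kernel = [[-1, 0, 1]] * 3        # responds strongly to vertical edges

fmap = conv2d(image, kernel)     # 2x2 feature map
print(max_pool2(fmap))           # [[3]]
```

The strong positive response shows the kernel "detecting" the vertical edge, and the pooling step reduces the feature map to a single summary value.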

Recurrent Neural Networks In contrast to FFNNs and convnets, for which the information flow is strictly uni-directional, recurrent neural networks (RNNs) are characterised by some kind of feedback loop that allows the output of a unit in one layer to be processed as input by the same or an earlier layer.

This allows the network to keep an internal state which can be conceptualised as memory. Such a memory enables the network to effectively deal with sequences of data such as time-series values, natural language, or even music.

[Figure: LSTM cell (figures/chapter-2/lstm.png)]

In spite of the benefits the recurrent architecture of the network provides, it also introduces a number of computational challenges such as the so-called vanishing gradient problem, which stems from the additional distance the gradient is backpropagated (see next section) through the computation graph of the network, causing the gradient to shrink towards zero. In order to circumvent this problem, a number of architectures have been proposed for the individual cell within the RNN, most notably long short-term memory (LSTM) cells and gated recurrent units (GRUs). As illustrated in figure , an LSTM achieves this using a number of gates that allow the cell to decide when to store and when to reset (i.e. forget) its internal state.

Today, RNNs represent the state of the art for many machine learning problems dealing with sequential data such as natural language and time series.
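The internal state described above can be sketched with a single-unit vanilla RNN; the recurrence and its weights are illustrative and much simpler than LSTM gating:

```python
import math

# Sketch: a single-unit vanilla RNN maintaining an internal state h across a
# sequence: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b). Weights are illustrative.

def rnn_run(sequence, w_x=1.0, w_h=0.5, b=0.0):
    h = 0.0  # initial state ("empty memory")
    for x in sequence:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
    return h

# the final state depends on the whole sequence, including its order
print(rnn_run([1.0, 0.0, 0.0]))
print(rnn_run([0.0, 0.0, 1.0]))
```

The two printed values differ even though both sequences contain the same elements, which is exactly the "memory" property that makes RNNs suitable for sequential data.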

Training

Training a neural network refers to the process of fitting the parameters of the model (i.e. the weights and bias for each cell) to the training data with respect to a given objective function.

Objective Function

[Figure: Local minima (figures/chapter-2/local-minima.png)]

The objective function (also called loss function or error function) defines a metric of the performance of the model. The goal of the training procedure is to find appropriate parameters for the model that maximise its performance or minimise its loss respectively.

Figure  illustrates this concept by depicting the graph of an objective function that takes a one-dimensional input (a real-valued scalar). As we can see, the function possesses multiple extreme values — in particular, one local minimum and one global minimum.

In cases where an algebraically obtained closed-form solution for the minimum is not feasible or too computationally expensive, the minimum is often obtained by optimisation algorithms (see below). Since these algorithms often operate in a greedy manner, i.e. as a series of local operations with no knowledge of the global surface of the function, we face the potential problem of finding only a local optimum and missing the global one. As a consequence, it is desirable to use convex objective functions which, by definition, have only one (global) minimum.

The actual loss function is dependent on the problem task and can be derived by a maximum likelihood estimation (MLE). For regression tasks, a common choice is the mean squared error between the prediction \hat{Y} and the actual output value Y, defined as

L_{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2.

For (binary) classification problems, we typically use the cross-entropy as loss function.
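Both loss functions can be sketched directly, applied here to invented toy predictions:

```python
import math

# Sketch: the two loss functions from this section on toy predictions.

def mse(y_pred, y_true):
    # mean squared error for regression: (1/n) * sum (y_hat_i - y_i)^2
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n

def cross_entropy(y_pred, y_true):
    # binary cross-entropy: -(1/n) * sum [t*log(p) + (1-t)*log(1-p)]
    n = len(y_true)
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(y_pred, y_true)) / n

print(mse([1.0, 2.0], [0.0, 4.0]))        # (1 + 4) / 2 = 2.5
print(cross_entropy([0.9, 0.2], [1, 0]))  # about 0.164 -- small for good predictions
```

Both are differentiable in the predictions, which is what makes them usable with the gradient-based optimisation described next.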

Gradient Descent We typically make use of an optimisation algorithm that tries to minimise the output of our loss function, which captures the degree to which our prediction differs from the target values. A widely used example is gradient descent, an iterative optimisation algorithm that can be used to find a minimum of a function. This is achieved by repeatedly computing the gradient of the function at the current position and subtracting a proportion of the gradient from it until the gradient converges towards zero (which is the case at a minimum) or a stopping condition is satisfied. Intuitively, the subtraction of the gradient can be conceptualised as taking iterative steps in the direction of steepest descent until a minimum is reached.

Formally, we are looking for \arg\min_{w} L(w) by iteratively computing

w_{t+1} = w_t - \eta \nabla L(w_t)

where w_t refers to our position at iteration t and \eta is a parameter called the learning rate that defines the step size of our descent. Choosing an appropriate learning rate typically represents a trade-off between reducing the number of required iterations (high learning rate) and making sure the function converges and does not overshoot the actual minimum or oscillate around it. Consequently, it is often desirable to use an adaptive learning rate instead, i.e. making \eta a function of the current iteration.

There are two main types of gradient descent: batch gradient descent computes the gradient for the entire dataset before applying an update rule as defined in equation , which is most exact but computationally expensive as it has to iterate through all n data points in the dataset. In contrast, stochastic gradient descent computes the gradient of a random (hence the name) sample and applies an update according to this sample alone, which is less computationally expensive. The choice between batch gradient descent and stochastic gradient descent therefore represents a trade-off between accuracy and computational costs. In addition to this dichotomy, there exists a hybrid version called minibatch gradient descent which computes the gradient on a batch, i.e. a subset of the entire dataset.
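The update rule and the batch/stochastic/minibatch distinction can be sketched on a one-parameter model; data, learning rate, and batch size are illustrative:

```python
import random

# Sketch: minibatch gradient descent minimising the MSE of y ~ a * x.
# batch_size = n recovers batch GD; batch_size = 1 is stochastic GD.

def minibatch_gd(xs, ys, lr=0.05, batch_size=2, steps=200, seed=0):
    rng = random.Random(seed)
    a = 0.0
    data = list(zip(xs, ys))
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # gradient of (1/m) * sum (a*x - y)^2 with respect to a
        grad = sum(2 * (a * x - y) * x for x, y in batch) / len(batch)
        a -= lr * grad  # step against the gradient
    return a

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]    # true slope is 2
print(minibatch_gd(xs, ys))  # close to 2.0
```

Each update only sees a random subset of the data, so individual steps are noisy, yet the iterates still converge towards the minimiser, which is the trade-off described above.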

Backpropagation For gradient-based optimisation algorithms it is mandatory to compute the partial derivatives of the objective function L with respect to any weight w or bias b in the neural network that needs to be learnt. In a deep neural network with multiple hidden layers this might seem rather complex, as a change of weights influences the outcome (and therefore the loss function) only indirectly by propagating its change through subsequent layers.

In the 1980s, an efficient method to achieve this was introduced, called the backpropagation algorithm. The algorithm conceptualises the network as a concatenation of functions and is based on the simple idea of repeatedly applying the chain rule, as known from calculus, to compute the partial derivatives with respect to each parameter (weights and biases) of interest.

[Figure: Backpropagation (figures/chapter-2/backpropagation.png)]

The training is then executed in two alternating phases as illustrated in figure : In the forward pass the inputs are propagated through the network from the input layer through the hidden layers until a predicted outcome \hat{Y} is available in the output layer. Using the predicted outcome \hat{Y}, the actual outcome Y, and the loss function, we are able to compute a loss.

In the second phase — the backward pass — the partial derivatives are computed for the weights and biases of each unit, all the way back to the first layer after the input layer.

Once the partial derivatives are computed, we can update each parameter according to our optimisation algorithm (e.g. using gradient descent as described above).

The backpropagation algorithm therefore represents an effective means to compute the gradient of the loss function and can be considered an essential part of the training of neural networks.
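The two phases can be sketched on a minimal network; the architecture and weights below are illustrative, and the analytic gradient is checked against a finite-difference approximation:

```python
import math

# Sketch: backpropagation through a minimal network
#   h = sigmoid(w1 * x), y_hat = w2 * h, loss = (y_hat - y)^2,
# computing dloss/dw1 by repeatedly applying the chain rule.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, y, w1, w2):
    # forward pass
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # backward pass (chain rule, applied layer by layer)
    dloss_dyhat = 2 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h
    dloss_dh = dloss_dyhat * w2
    dloss_dw1 = dloss_dh * h * (1 - h) * x   # sigmoid'(z) = h * (1 - h)
    return loss, dloss_dw1, dloss_dw2

loss, g1, g2 = forward_backward(x=1.0, y=1.0, w1=0.5, w2=1.0)

# numerical check of dloss/dw1 with finite differences
eps = 1e-6
loss_plus, _, _ = forward_backward(1.0, 1.0, 0.5 + eps, 1.0)
print(abs((loss_plus - loss) / eps - g1) < 1e-4)  # True
```

In a deep network the same chain-rule factors are simply accumulated through every layer on the backward pass, which is what makes the method efficient.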

Dropout

As described in section , regularisation is an important concept in machine learning in order to avoid overfitting the training data. While traditional approaches like L1 (lasso) and L2 (weight decay) regularisation are applicable in theory, a number of regularisation techniques have been proposed that are specific to the mechanics of neural networks.

One of the most widely used approaches, called dropout, was proposed in 2014. The basic idea is that during training every neuron is kept active only with a certain probability p; with probability 1 - p its output is set to zero. Intuitively, this prevents the network from becoming too dependent on a particular neuron, as it is forced to learn an alternative representation whenever that neuron is disabled. As a consequence, the network learns a representation that generalises better, giving dropout a regularising influence on the network. Similar to the regularisation hyper-parameter described in , the dropout probability p represents a hyper-parameter that has to be chosen appropriately.

Figure  illustrates this concept: The original network (left side) is thinned out by dropout, resulting in the network on the right side.

Today, dropout represents the de-facto standard for regularising neural networks and preventing them from overfitting.
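The mechanism can be sketched as a mask over one layer's activations; this is the "inverted" variant, in which kept activations are rescaled so that no scaling is needed at test time (the layer values are illustrative):

```python
import random

# Sketch: (inverted) dropout applied to one layer's activations during training.
# Each unit is kept with probability p; kept activations are scaled by 1/p so
# the expected activation matches test time, when dropout is switched off.

def dropout(activations, p, rng):
    out = []
    for a in activations:
        if rng.random() < p:
            out.append(a / p)   # kept, rescaled
        else:
            out.append(0.0)     # dropped
    return out

rng = random.Random(42)
layer = [0.3, 0.7, 1.2, 0.5, 0.9]
print(dropout(layer, p=0.8, rng=rng))  # each unit kept (scaled) or zeroed at random
```

At test time the function is simply not applied, so the full network is used with all units active.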

Multi-Task Learning

Neural networks are typically used to perform a single task that is formalised by minimising an appropriate objective function. However, sometimes it is desirable to train the network on multiple (related) tasks simultaneously.

[Figure: Multi-task learning (figures/chapter-2/multitask-learning.png)]

As illustrated in figure , such a multi-task neural network is characterised by a number of layers that are shared among all tasks and a number of layers that are task-specific. Each task typically provides its own objective function which is optimised jointly with the objective functions of the other tasks.

Multi-task learning can have a variety of desirable properties: Firstly, multi-task neural networks are typically less prone to overfitting as the shared layers have a regularising effect on the network forcing it to learn a shared representation among all tasks. In addition, they allow influencing the way a task is learnt on a more fine-grained level by introducing inductive biases that are provided by the other tasks. This means that we can make a primary task more inclined towards considering related sub-tasks.

As a typical example of multi-task learning, consider training a spam classifier for emails. Different users might have different distributions over the features of the emails they receive, due to factors such as language, contacts, age, and interests, which can be conceptualised as a number of different tasks. Despite their differences, these tasks are highly related, allowing us to treat the classifier as a multi-task learning problem. Ideally, the shared layers would learn common characteristics of spam emails, whereas the individual task-specific layers would take into account the specific characteristics of each user.
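The architecture described above can be sketched as a forward pass with one shared layer and per-task output heads. This is a hand-rolled NumPy illustration with made-up layer sizes and untrained random weights, not a full training loop; in joint training, the per-task losses would be summed and backpropagated through both the heads and the shared layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared layer: learns a representation common to all tasks.
W_shared = rng.normal(size=(10, 8))

# Task-specific heads, e.g. one spam classifier per user.
W_task_a = rng.normal(size=(8, 1))
W_task_b = rng.normal(size=(8, 1))

def forward(x, W_task):
    h = np.tanh(x @ W_shared)                # shared representation
    return 1 / (1 + np.exp(-(h @ W_task)))   # task-specific sigmoid output

x = rng.normal(size=(3, 10))                 # a batch of 3 feature vectors
p_a = forward(x, W_task_a)                   # task A's spam probabilities
p_b = forward(x, W_task_b)                   # task B's spam probabilities
```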

Model Selection and Architecture Learning

In section  we described the three main types of neural networks, each of which is suitable for different kinds of tasks. However, even after deciding on a specific type, there are a number of architectural choices that are of great importance for the performance of the model. These architectural choices can be considered model-level hyper-parameters that are typically defined a priori and have to be chosen appropriately, in addition to the hyper-parameters that were already discussed in the previous sections (e.g. the learning rate, the dropout probability p, the activation function, etc.). The following list describes typical model-level hyper-parameters and how they might affect the network.

Model Hyperparameters

i) Number of layers The number of layers (or depth) of the network is a key design decision for the model's architecture and greatly influences its performance. In general, a higher number of layers results in a more complex model with higher variance, increasing its expressiveness and ability to capture complex non-linear dependencies. On the flip side, this leads to an increased probability of over-fitting the training data and therefore requires stronger regularisation. From a practical perspective, a higher number of layers makes training more computationally expensive and requires more memory to store the trained parameters. Consequently, the choice reflects the classical trade-off between bias and variance discussed in  and depends on the amount of available data and computational resources.

ii) Number of units per layer In addition to the number of hidden layers, we need to decide on an appropriate number of units per hidden layer (network width). The decision follows a rationale similar to the number of layers: A high number of units per layer increases the variance of the model, allowing it to accommodate more complex functions at the cost of more computational complexity and the danger of over-fitting.
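The memory cost mentioned above can be made concrete by counting the parameters of a fully connected network; the example widths below are arbitrary. Each layer contributes one weight per input-output pair plus one bias per output unit, so parameter count grows rapidly with both depth and width:

```python
def mlp_param_count(layer_sizes):
    """Number of weights and biases in a fully connected network
    with the given layer widths (input, hidden..., output)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Deeper and wider networks require more memory and computation:
mlp_param_count([10, 32, 1])      # one hidden layer of 32 units → 385
mlp_param_count([10, 64, 64, 1])  # two hidden layers of 64 units → 4929
```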

Hyper-Parameter Optimisation

In contrast to the ordinary parameters (weights and biases) of the model, hyper-parameters are typically not learnt during training but need to be chosen appropriately a priori. There are three main techniques for deriving suitable hyper-parameters:

i) Grid Search The most basic way to find appropriate hyper-parameters is to perform a grid search. This means iteratively trying out a set of candidate values for each hyper-parameter, thereby exhaustively searching the space of potential hyper-parameters. In each iteration, the model is trained and then evaluated on a dedicated cross-validation set.

While this approach is feasible for a small search space, it becomes computationally infeasible if the number of candidate values (or their combinations) is large.
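The exhaustive procedure can be sketched as follows. The grid, the hyper-parameter names, and the `evaluate` callback (standing in for "train the model and score it on the validation set") are all hypothetical; here a simple quadratic loss plays that role so the example is self-contained.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively evaluate every combination in param_grid and
    return the best configuration and its validation score."""
    best_cfg, best_score = None, float("inf")
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)  # stands in for: train model, score on validation set
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical validation loss, minimised at lr=0.01, dropout=0.5.
loss = lambda c: (c["lr"] - 0.01) ** 2 + (c["dropout"] - 0.5) ** 2
grid = {"lr": [0.001, 0.01, 0.1], "dropout": [0.3, 0.5, 0.8]}
best, _ = grid_search(grid, loss)
# → {'lr': 0.01, 'dropout': 0.5}
```

Note that the number of evaluations is the product of the candidate counts, which is exactly why the approach scales poorly.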

ii) Random Search An alternative to grid search that is computationally less expensive is random search. Instead of trying every possible hyper-parameter in the search space exhaustively, this approach randomly samples hyper-parameter configurations from the search space a fixed number of times.
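A sketch of the sampling variant, with the same hypothetical validation loss standing in for a real train-and-evaluate step. The sampling distributions are illustrative assumptions; log-uniform sampling for the learning rate is a common choice because sensible values span several orders of magnitude.

```python
import random

def random_search(sample_cfg, evaluate, n_iter=50, seed=0):
    """Evaluate n_iter randomly sampled configurations instead of
    exhaustively enumerating a grid."""
    rnd = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_iter):
        cfg = sample_cfg(rnd)
        score = evaluate(cfg)  # stands in for: train model, score on validation set
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Sample the learning rate log-uniformly and the dropout rate uniformly.
sample = lambda r: {"lr": 10 ** r.uniform(-4, -1),
                    "dropout": r.uniform(0.2, 0.9)}
loss = lambda c: (c["lr"] - 0.01) ** 2 + (c["dropout"] - 0.5) ** 2
best, _ = random_search(sample, loss, n_iter=200)
```

The budget (`n_iter`) is fixed up front regardless of how many hyper-parameters there are, which is what makes random search cheaper than an exhaustive grid.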

iii) Bayesian Optimisation Bayesian optimisation treats hyper-parameter search as the optimisation of an expensive black-box function. A probabilistic surrogate model (commonly a Gaussian process) is fitted to the configurations evaluated so far, and an acquisition function such as expected improvement selects the next configuration to try, balancing exploration of uncertain regions of the search space against exploitation of promising ones. This typically requires far fewer model evaluations than grid or random search.

Architecture Learning

Deriving a proper architecture a priori is difficult: What is the right number of hidden layers? What is the right number of hidden units per layer? Nevertheless, a number of approaches to automatically deriving suitable architectures have been proposed under the concept of architecture learning. These include pruning techniques such as Optimal Brain Damage and Optimal Brain Surgeon, which remove unimportant weights from a trained network, as well as treating architectural choices themselves as hyper-parameters and applying grid search, random search, or Bayesian optimisation to them.

Counterfactual Inference

Motivation

The concept of causality represents the very foundation of reasoning and logic. In many ways, all scientific research can be considered an attempt to investigate the causal relations between different concepts in order to understand what course of action leads to what effect.

Despite being of the greatest importance, the study of causality is inherently difficult, since most phenomena are governed by a complex network of interdependencies and correlations, making it almost impossible to isolate pure causal relations.

In practice, causal inference is often performed in the scope of so-called observational studies that try to investigate the effect y a certain intervention or treatment t has on a given context x. For instance, consider a medical setting in which such a study might investigate how the blood pressure (y) of a given patient (x) changes after being administered a medication (t) of interest.

When considering causal relations within the data, we are often interested in answering counterfactual questions such as "What would have happened if a different action had been taken?". In order to understand the importance of such counterfactual questions, we have to introduce the concept of the individualised treatment effect or ITE.

The ITE is a quantity defined as the difference between the observed (factual) outcome and the unobserved (counterfactual) outcome. Intuitively, this means that the ITE tells us how the outcome would change depending on whether or not we perform the intervention of interest.

As a consequence, if we had access to this quantity, we could immediately decide which outcome (treated or untreated) is closer to the desired effect. In the medical context mentioned above, having access to the ITE of a medication for a given patient would greatly aid clinicians in their treatment planning.

Unfortunately, by its very definition it is infeasible to obtain the ITE: The observational study only gives us access to the observed outcome — the counterfactual outcome is never available.

Consider the example illustrated in table . We are given a dataset of six subjects, half of which received a treatment, indicated by t = 1, while the other half did not receive it, i.e. t = 0. Depending on their treatment assignment, we can either directly observe the treated outcome y_1 or the untreated outcome y_0, but never both, thus making it impossible to calculate the ITE.

As a consequence, the objective of counterfactual inference is to estimate the unobserved outcome which gives us a way to estimate the ITE and make an informed decision regarding the treatment assignment.
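The fundamental problem can be demonstrated with a small synthetic population. In this toy example (all numbers invented for illustration) both potential outcomes are known by construction, so the true ITE can be computed; masking one outcome per subject, as an observational study implicitly does, is precisely what counterfactual inference must undo.

```python
import numpy as np

# Synthetic population where, unlike in practice, BOTH potential
# outcomes are known, so the true ITE can be computed directly.
y0 = np.array([1.0, 2.0, 3.0, 1.5, 2.5, 0.5])  # untreated outcomes
y1 = np.array([2.0, 2.5, 2.0, 3.0, 2.0, 1.5])  # treated outcomes
t  = np.array([1, 1, 1, 0, 0, 0])              # treatment assignment

true_ite = y1 - y0  # difference between treated and untreated outcome

# In an observational study only the factual outcome is visible:
y_factual = np.where(t == 1, y1, y0)
y_counterfactual = np.full_like(y0, np.nan)    # never observed

# Counterfactual inference aims to estimate y1 - y0 for each subject
# from (x, t, y_factual) alone.
```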

Finally, the applications of counterfactual inference are ubiquitous and highly relevant to a variety of fields, including treatment planning in healthcare, policy-making for organisations, and ad placement on online platforms. As a consequence, research in this field can have a significant impact on a number of disciplines and potentially affect many people's lives by helping important institutions, governments, and industries to make more informed decisions.

Formalisation

We represent each subject i in our population with a d-dimensional feature vector x_i, and two potential outcomes y_0(x_i) and y_1(x_i) which are drawn from a distribution p(y_0, y_1 | x_i).

This way, the individualised treatment effect for subject i can be expressed as

ITE(x_i) = y_1(x_i) − y_0(x_i).

Given this definition, the objective is to approximate the function ITE(x) using an observational dataset D consisting of n independent samples.

Each sample is comprised of a tuple (x_i, t_i, y_i^F, y_i^CF), where x_i represents the subject's features, t_i the treatment assignment indicator, and y_i^F and y_i^CF the respective factual and counterfactual outcomes.

The treatment assignment (covered in detail in section ) is modelled as a random variable depending on the subjects' features, i.e. t ~ p(t | x). The assignment reflects a domain-specific policy which can be captured in terms of the probability p(t = 1 | x), called the propensity score.
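The propensity score is itself unknown in practice and is commonly estimated from the observed (x, t) pairs, for instance with a logistic-regression model. The sketch below is a minimal hand-rolled illustration on synthetic data (the treatment policy, feature dimensions, and hyper-parameters are all invented), not a recipe from any particular library.

```python
import numpy as np

def fit_propensity(X, t, lr=0.1, epochs=500):
    """Estimate the propensity score p(t=1|x) with a simple
    logistic-regression model trained by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # current predicted p(t=1|x)
        grad = p - t                          # gradient of the log-loss
        w -= lr * X.T @ grad / len(t)
        b -= lr * grad.mean()
    return lambda X_new: 1 / (1 + np.exp(-(X_new @ w + b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Hypothetical assignment policy: treatment is more likely for
# subjects with a large first feature.
t = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)

propensity = fit_propensity(X, t)
scores = propensity(X)   # estimated p(t=1|x) for each subject
```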
