
Brain imaging data


Introduction

1.1 Description and Motivation

Brain research has been highlighted as an area of national interest in recent years. In 2013, President Obama launched the BRAIN research initiative, seeking to do for neuroscience what the Human Genome Project did for genomics (Tripp & Grueber, 2011).

This research has the potential to impact many important areas, from the detection, treatment, and increased understanding of diseases such as Alzheimer’s and epilepsy, to the neural control of devices to aid the handicapped, to a greater understanding of how the human brain functions on a basic level.

The classification of brain signals recorded by imaging devices using machine learning approaches is a very powerful tool in many of these areas of research. For example, machine learning techniques show promise in the early detection of Alzheimer’s and in giving warning before an epileptic seizure. These techniques are already being used in devices such as the P300 speller (Guan et al., 2004) to provide a communication device for the severely handicapped.

In addition, neuroscience problems present a unique set of challenges that require innovation in machine learning. The data obtained from brain activity monitoring devices are noisy, have high dimensionality, and are costly to collect, which limits the number of data samples that can be collected. The combination of these factors leads to exceedingly complex data, which are difficult to analyze or classify, even using the most sophisticated and modern techniques.

This thesis presents several novel results in the broad area of brain signal classification. First, it provides a comparative evaluation of standard machine learning and data preprocessing techniques in brain signal classification. Second, the use of deep learning techniques for brain signal classification is explored in detail. While these techniques are state of the art in many other applications of machine learning, there are relatively few published results of their use in brain signal classification. In particular, recurrent neural networks, which have proven to be powerful in time series analysis, and convolutional networks, which are remarkably efficient in image and video classification, are explored in this thesis (Prasad, 2014; Simonyan, 2014). Third, to address the relatively small number of samples that can be collected in neuroimaging tasks, a novel application of transfer learning within the dataset is explored.

1.2 Applications

The classification of brain signals is a growing area of research, with emerging applications in both applied and theoretical neuroscience. These applications can generally be divided into a few main areas including device control, brain state detection, medical diagnosis, and basic research.

1.2.1 Basic Research

In neuroscience, the use of machine learning techniques to classify brain signals is seen as a form of multivariate pattern analysis (MVPA). MVPA is used to examine phenomena that are difficult to measure with traditional techniques. For example, support vector machines (SVMs) were used to examine “to what extent item-specific information about complex natural scenes is represented in several category-selective areas of human extrastriate visual cortex during visual perception and visual mental imagery” (Johnson, McCarthy, Muller, Brudner, & Johnson, 2015). Other examples of basic research involving the classification of brain signals include topics such as affect recognition, semantic language representation in bilingual speakers, and exploring individual differences in pain tolerance (AlZoubi, Calvo, & Stevens, 2009; Correia, Jansma, Hausfeld, Kikkert, & Bonte, 2015; Schulz, Zherdin, Tiemann, Plant, & Ploner, 2012).

1.2.2 Brain State Detection

Another application of classification involves the continuous monitoring of brain states using imaging devices such as the EEG. Rather than controlling an external device, these techniques aim to determine the subject’s inner state. These applications tend to consider longer periods of data and focus on frequency analysis.

These techniques are being researched for use in areas such as seizure detection and prevention (Gabor, Leach, & Dowla, 1996; Ramgopal et al., 2014) and truth detection (Gao et al., 2013). Brain state detection is also used as part of larger applications, such as monitoring mood to allow a larger human-computer interface system to adapt its display to the user’s current mental state (Molina, Nijholt, & Twente, 2009). There is even interest in the classification of more disparate states, such as whether the subject is resting quietly, remembering events from their day, performing subtraction, or silently singing lyrics (Shirer, Ryali, Rykhlevskaia, Menon, & Greicius, 2012).

1.2.3 Medical Diagnosis

Brain signal classification is likely to play an increasing role in the diagnosis of brain diseases in the future. While convergent evidence will always be necessary, classification could be useful as a screening tool or another point of reference for diagnosis.

For example, efforts have been made to use SVM techniques to aid in Alzheimer’s disease diagnosis (Trambaiolli et al., 2011). Other applications in diagnosis include drug addiction (Zhang, Samaras, Tomasi, Volkow, & Goldstein, 2005) and the diagnosis of psychiatric disorders such as schizophrenia (Koutsouleris et al., 2015), ADHD (Fair, Bathula, Nikolas, & Nigg, 2012), and bipolar disorder (Fair, Bathula, Nikolas, & Nigg, 2012).

1.2.4 Brain Computer Interface

Brain computer interfaces (BCIs) use monitored brain activity and computation to achieve an external activity. Classification of mental states and intentions based on patterns of brain signals is a common goal in such applications. While early methods of BCIs have tended to use manually determined features and relied on the user adapting to the machine, more modern techniques generally involve the use of machine learning and allow the machine to adapt to the user.

These BCIs have a wide range of applications, from control of robotic arms and other prosthetic devices (McFarland & Wolpaw, 2008) to speech synthesizers (Lotte, Congedo, Lécuyer, Lamarche, & Arnaldi, 2007) and other communication devices (Guan et al., 2004). In general, most BCI applications are geared toward providing capability to the disabled.

1.3 Brain Imaging Techniques

A variety of devices and techniques capable of measuring brain signals are available.

These techniques either measure primary signals of activity, such as the electrical or magnetic signals produced by neural activity, or secondary signals, such as the blood flow to regions of the brain that are active.

Functional Magnetic Resonance Imaging (fMRI) measures what is known as the blood oxygen level dependent (BOLD) signal. It is capable of high spatial resolution and imaging deep brain structures, but requires a room free of electromagnetic interference and very expensive equipment. Furthermore, the temporal resolution of the signal is quite poor since it relies on measuring blood flow rather than a direct marker of brain activity.

The output of fMRI, after several statistical methods are applied, is a series of voxel-based images of the BOLD signal.

Magnetoencephalography (MEG) measures the magnetic fields created by electrical currents moving through the brain. It has both a relatively high spatial resolution and a very high temporal resolution, since it measures a direct and quickly propagated marker of brain activity that is largely unaffected by the scalp. While the spatial resolution is fairly good, only outer portions of the brain can be accurately measured due to the rapid drop-off in strength of magnetic fields with distance. Additionally, it requires a highly magnetically shielded room and a constant supply of liquid helium to function, leading to very high costs and a lack of portability. The output of MEG is one time series per channel (generally 306 channels) representing the strength of the magnetic field at the channel.

Other brain imaging modalities such as standard magnetic resonance imaging (MRI) and positron emission tomography (PET) do not offer the temporal resolution necessary to monitor brain activity at time scales useful in most classification applications and may have other drawbacks, including exposure to ionizing radiation.

Thus, due to the disadvantages of other brain imaging modalities, this thesis will focus on data collected by electroencephalography (EEG). EEG functions by attaching electrodes to a subject’s scalp in order to measure the changes in electrical potential that occur as a result of brain activity. Due to the near instantaneous propagation of these voltage changes, information acquired by the EEG can be sampled with high temporal precision. However, the human skull and scalp are insulators, which have the effect of dispersing the signal, thus limiting the spatial resolution achievable by EEG. Furthermore, localizing the source of activity involves solving an inverse problem with an infinite number of solutions, further limiting the spatial resolution. However, EEG is comparatively inexpensive and does not require onerous protections such as magnetically shielded rooms. It is also, unlike the previously discussed modalities, practical in applications that require mobility, such as BCI. The output of EEG is one time series per channel (anywhere from 32 to 256 channels is common) that represents the electrical potential on the scalp at the given channel with respect to a reference electrode, typically recorded at rates from 250 Hz to 1000 Hz.
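As a concrete illustration of the data shape just described, a single EEG trial can be represented as a channels-by-samples array. The channel count, sampling rate, and trial length below are hypothetical choices within the ranges mentioned above, not values from any particular dataset:

```python
import numpy as np

# Hypothetical EEG recording: 64 channels sampled at 250 Hz for a 2-second trial.
# Real montages range from roughly 32 to 256 channels and sampling rates from
# 250 Hz to 1000 Hz, as noted in the text.
n_channels, sfreq, duration_s = 64, 250, 2

# One trial is a (channels x samples) array of scalp potentials.
trial = np.zeros((n_channels, sfreq * duration_s))

# If the raw samples are used directly as features, the dimensionality is
# already substantial, even for this modest configuration.
n_raw_features = trial.size
```

With 256 channels at 1000 Hz, the same arithmetic quickly reaches the hundreds of thousands of raw features per trial discussed later in this chapter.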

1.4 Research Challenges and Contribution

Brain imaging data presents several challenging obstacles to machine learning, all of which are current topics of open research:

• Brain signals are noisy. The information is polluted by a variety of factors, including muscle movement, measurement error, brain activity that is not of interest, and electromagnetic interference from the environment.

• Brain signals have high dimensionality. There are frequently hundreds of channels, sampled at up to 1000 Hz. The raw data can present up to hundreds of thousands of features per trial.

• The data are a time series and have spatial interactions, potentially requiring investigation of temporal and frequency components in conjunction with spatial analyses.

• Data collection is expensive and time consuming. Thus, it can be difficult to collect sufficient data for many of the most powerful machine learning techniques. Modern deep learning applications generally call for thousands or tens of thousands of samples per class, but it is often impractical to produce more than a few hundred samples per subject in brain imaging tasks, particularly when the data must come from clinical or medical studies.

• Brain activity can vary significantly between subjects and even between data acquisition sessions within a subject. Thus, classifiers that can address high levels of variability are needed.

The goal of this thesis is to investigate techniques for addressing these challenges.

In this thesis we have demonstrated the effectiveness of various deep learning architectures, particularly convolutional and recurrent, in classifying brain signals. We have shown that certain recurrent architectures outperform traditional techniques.

Furthermore, transfer learning strengthens the effectiveness of all deep learning architectures.

1.5 Thesis Outline

The rest of the thesis is organized as follows: Chapter 2 will review the related work that forms the basis for this research. Chapter 3 will define the specific problems and approaches used in this thesis. Chapter 4 will include a summary of the datasets used, the implementation of the techniques used, and the results of the experiments.

Chapter 5 will present a summary of the findings and a discussion on future work.

Chapter 2

Related Works and Background

This chapter is made up of two sections: one covering the basic concepts behind the deep learning algorithms used in this thesis, and the other covering the current state of EEG classification.

2.1 Deep Learning

2.1.1 History of Neural Networks

Deep learning is a subfield of machine learning that has evolved out of traditional approaches to artificial neural networks. Artificial neural networks are computational systems originally inspired by the human brain. They consist of many computational units, called neurons, which perform a basic operation and pass the result of that operation on to further neurons. The operation is generally a summation of the information received by the neuron followed by the application of a simple, non-linear function. In most neural networks, these neurons are organized into units called layers. The processing of neurons in one layer usually feeds into the calculations of the next, though certain types of networks allow information to pass within a layer or even to previous layers. The final layer of a neural network outputs a result, which is interpreted for classification or regression. Figure 1 shows a depiction of the structure of a simple neural network.

Figure 1 Structure of a basic neural network

The basic concepts date back to the McCulloch-Pitts neuron of the 1940s (McCulloch & Pitts, 1943). This model was very simple compared to modern neural networks: it only allowed for binary outputs from each neuron, summing the inputs and comparing the result to 0. Furthermore, there was no update rule defined. An update rule is a mathematical rule that allows a neural network to adapt to new information. Without an update rule, all of the values in a neural network must be handcrafted.

In the 1950s, the perceptron algorithm was introduced. It generalized the McCulloch-Pitts neuron to have continuous valued weights on the connections and introduced a basic update rule to compute the weights at time t + 1 from the weights at time t:

w_i(t + 1) = w_i(t) + η (d_j − y_j) x_{j,i},

where w_i is the weight of the ith input, d_j is the expected output for the jth training example, y_j is the output calculated for the jth example, x_{j,i} is the ith value of the jth input, and η is the learning rate.

The perceptron learning rule was only capable of training networks with a single layer, greatly limiting the power of the model. It was originally conceived as a hardware implementation, though it was also the first neural network to be implemented in software (Rosenblatt, 1958).
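The update rule above can be sketched in a few lines of code. The example below trains a single-layer perceptron on the logical AND function; the data, learning rate, and number of passes are illustrative choices, not values from the text:

```python
import numpy as np

# Inputs for the logical AND function, with a bias input appended.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)   # expected outputs d_j
X = np.hstack([X, np.ones((4, 1))])       # bias column

w = np.zeros(3)   # one weight per input, plus the bias weight
eta = 0.1         # learning rate (an arbitrary illustrative value)

for _ in range(20):  # a few passes over the data suffice here
    for x_j, d_j in zip(X, d):
        y_j = float(w @ x_j > 0)          # thresholded perceptron output y_j
        w += eta * (d_j - y_j) * x_j      # w_i(t+1) = w_i(t) + eta*(d_j - y_j)*x_{j,i}

predictions = (X @ w > 0).astype(float)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop reaches a separating set of weights; the XOR function discussed below has no such weights for a single layer.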

In the late 1960s, it was mathematically proven that a network with only a single layer lacks the representative power to classify many common types of problems, including the Exclusive OR (XOR) function (Minsky & Papert, 1969). Furthermore, attempts to use neural networks in speech recognition during this era were largely considered failures, capable of only recognizing a very limited vocabulary of words for a single user (Pierce, 1969). This combination of theoretical limitations and practical failures led to an “AI Winter”. Funding for neural networks and other forms of AI research largely dried up over this period. The 1970s saw very little progress in the development of neural networks.

The ability to train a network with multiple layers was crucial for the continued development of neural networks. While Minsky and Papert’s book was most famous for showing that a single layer network was lacking in representative power, it also showed that a network with even two hidden layers could model almost any function. A method known as automatic differentiation was proposed in Seppo Linnainmaa’s Master’s thesis in 1970 for calculating the derivative of a differentiable composite function represented by a graph (Linnainmaa, 1970). This method would later form the basis for backpropagation, a learning rule capable of training neural networks with multiple layers (Werbos, 1982). In essence, backpropagation allows multiple layer networks to be trained by iteratively using the gradient of a loss function with respect to the weights of the network to assign updates to previous neurons in the network, starting from the output neurons. It can be described in pseudocode as:

do
    prediction = networkOutput(training_examples)
    actual = label(training_examples)
    error = averageErrorFunction(prediction, actual)
    grad = outputGradient(error)
    for layer l in reverse(net):
        new_l = update(l, grad)
        grad = propagateGradient(l, grad)
        l = new_l
while not (all samples classified correctly or stopping criterion reached)
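To make the gradient flow concrete, here is a small worked example of one backpropagation pass for a two-layer sigmoid network with a mean squared error loss, checked against a numerical gradient. The network sizes and data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # one input sample (illustrative)
t = np.array([1.0])              # its target
W1 = rng.normal(size=(4, 3))     # first layer weights
W2 = rng.normal(size=(1, 4))     # second layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, W2):
    h = sigmoid(W1 @ x)          # hidden activations
    y = sigmoid(W2 @ h)          # network output
    return 0.5 * np.sum((y - t) ** 2)

# Forward pass, then propagate the error gradient backwards, layer by layer,
# starting from the output neurons as described above.
h = sigmoid(W1 @ x)
y = sigmoid(W2 @ h)
delta2 = (y - t) * y * (1 - y)             # error signal at the output layer
gW2 = np.outer(delta2, h)                  # gradient for the second layer
delta1 = (W2.T @ delta2) * h * (1 - h)     # error signal passed back to layer 1
gW1 = np.outer(delta1, x)                  # gradient for the first layer

# Finite-difference check of one first-layer weight's gradient.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numerical = (loss(W1p, W2) - loss(W1, W2)) / eps
```

The agreement between the analytic and numerical gradients is exactly what automatic differentiation formalizes for arbitrary computation graphs.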

With the advent of backpropagation, neural networks began to see renewed interest and significant theoretical advancement in the 1980s. New forms of networks arose, including Hopfield networks, which set the groundwork for modern recurrent neural networks (RNNs), and the Neocognitron, which inspired modern convolutional neural networks (CNNs) (Hopfield, 1982; Fukushima, 1980). However, the lack of raw processing power and of techniques to handle the vanishing gradient problem (the tendency for the backpropagated error gradient to approach zero, causing early layers of the network to fail to train) led to disappointing performance from neural networks.

By the beginning of the 1990s, neural network research had fallen into disfavor again.

New techniques, including support vector machines, were introduced and proved far easier to train at the time (Cortes & Vapnik, 1995).

In 2006, Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh published a paper titled “A Fast Learning Algorithm for Deep Belief Nets”, which showed promising results in the classification of the Modified National Institute of Standards and Technology handwritten digit dataset (MNIST), a standard dataset in the field. The paper proposed a greedy technique to train a neural network built out of statistical units, known as Restricted Boltzmann Machines, layer by layer. By only training one layer at a time, this technique avoided the vanishing gradient problem and allowed for deeper networks than previously viable, as well as far faster training. Hinton’s paper reignited interest in neural networks and paved the way for modern deep learning.

2.1.2 Deep Learning Overview

Deep Learning is a set of techniques that form a natural progression of traditional neural network techniques. These include the following.

Stochastic Gradient Descent (SGD) and its descendants (e.g., Bottou, 2010). Gradient descent is a first-order iterative optimization algorithm used for finding the minimum of a function. In neural networks, it is used in conjunction with backpropagation to update the weights in the network. It is formally defined by the update rule:

x_t = x_{t−1} − η ∇f(x_{t−1}),

where x_t is the current point in the space, x_{t−1} is the previous point in the space, η is the learning rate, and ∇f(x_{t−1}) is the gradient of the function being optimized, evaluated at the previous point.

SGD is a derivative of traditional gradient descent, differing in that the error function is calculated using only a subsample of the available data. This is both easier to use and more efficient for training on datasets that do not fit in memory. Furthermore, adding randomness to the optimization can aid in breaking through plateaus and avoiding local minima. The addition of a momentum term, which biases the gradient in the direction of recently calculated gradients, greatly improved the ability to train deep models by further increasing the speed of convergence (Sutskever, Martens, Dahl, & Hinton, 2013). Newer SGD-derived algorithms, such as Adaptive Moment Estimation (ADAM), calculate per-parameter adaptive learning rates, enabling even more efficient training at the cost of memory (Kingma & Ba, 2015).
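The update rule and the momentum variant can be sketched on a one-dimensional quadratic, f(x) = x², whose minimum is at x = 0. The step size and momentum coefficient below are arbitrary illustrative choices:

```python
def grad(x):
    # Gradient of the toy objective f(x) = x^2.
    return 2.0 * x

x = 5.0      # starting point x_0
eta = 0.1    # learning rate (illustrative)
mu = 0.9     # momentum coefficient (illustrative)
v = 0.0      # velocity: a decaying accumulation of recent gradients

for _ in range(200):
    v = mu * v - eta * grad(x)   # bias the step toward recent gradient directions
    x = x + v                    # x_t = x_{t-1} + v_t
```

With mu = 0 this reduces to plain gradient descent; the momentum term mainly pays off on loss surfaces with ravines and plateaus, as noted above.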

New Activation Functions: One of the largest and most persistent problems in the development of neural networks is known as the vanishing gradient problem. In essence, as the gradient is propagated back along the network, the error is repeatedly multiplied by values between 0 and 1. This causes the error, and thus the update, to trend toward 0 exponentially, resulting in little to no ability to update the early layers in a multilayer network (Hochreiter, 1991). Sigmoid activation functions, which were historically the most prominent activation functions, are especially susceptible to this problem because their first derivative rapidly tends toward zero as a neuron saturates. The sigmoid function is defined as:

σ(x) = 1 / (1 + e^(−x)).

The Rectified Linear Unit (ReLU), which maps the input to 0 if it is below 0 and to itself otherwise, and other modern activation functions have larger gradients and saturate less quickly, thus avoiding the vanishing gradient problem more effectively (Jarrett, Kavukcuoglu, Ranzato, & LeCun, 2009).
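The difference in gradient behavior can be shown numerically. At a strongly positive (saturated) input, the sigmoid derivative s(x)(1 − s(x)) is tiny, while the ReLU derivative remains 1; the input value below is an arbitrary illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

# At a saturated input, the sigmoid gradient nearly vanishes,
# while the ReLU gradient stays at 1.
g_sig = sigmoid_grad(10.0)
g_relu = relu_grad(10.0)
```

Multiplying many such sigmoid-sized factors across layers is precisely what drives the backpropagated error toward zero.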

New types of layers. Perhaps the biggest difference between traditional neural networks and deep learning is the adoption of new types of layers in the network.

Traditional neural network research focused on fully connected layers, in which every neuron in one layer is connected to every neuron in the next. While many of the newer layer types existed in the past, they usually could not be used to significant effect due to various issues in training.

Convolutional neural networks learn filter banks that are convolved with the original data. The filters can also be represented as a fully connected layer where the weights of the edges are tied together in a way that replicates the convolution operation.

This weight sharing structure allows for fewer parameters than having each weight be unique, and directly accounts for structure in the data. As shown in Figure 2, each filter creates a new, processed version of the image.

Figure 2 Depiction of convolutional network.

(A) is the original data with a convolutional filter being applied. (B) is the transformed data with each filter applied, showing where the operation in (A) mapped. (C) shows the data flattened for use in MLP or classification layers.
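The convolution operation and the parameter savings from weight sharing can be sketched as follows; the image, filter, and layer sizes below are made up for illustration:

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Slide the same small kernel over every position of the image
    # ("valid" mode: no padding), producing one feature map.
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)   # a tiny 4x4 "image"
edge = np.array([[1.0, -1.0]])        # a hypothetical 1x2 difference filter
fmap = conv2d_valid(img, edge)        # a new, processed version of the image

# Weight sharing: a 3x3 filter has 9 parameters regardless of image size,
# whereas a fully connected layer mapping a 28x28 image to 28x28 outputs
# has (28*28)^2 weights (ignoring biases).
conv_params = 3 * 3
fc_params = (28 * 28) ** 2
```

The same 9 weights are reused at every image position, which is both why the parameter count stays small and why the layer directly exploits spatial structure in the data.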


Recurrent Neural Networks (RNN), networks with connections within a layer or from a layer to a previous layer, have contributed to the success of deep learning in many fields, particularly in time series classification. Training an RNN is a difficult task.

Backpropagation must be modified to function in RNNs, since there are cycles in the graph. This is frequently handled through a technique known as backpropagation through time, wherein the network is “unrolled” for a discrete number of steps. This process creates a network with only forward connections, allowing backpropagation to work as normal, at the cost of limiting the impact of the recurrent connections (Werbos, 1990).

Because of this, only a few architectures have seen widespread use and success in classification tasks.

In particular, Long Short-Term Memory (LSTM) networks have proven successful in recent years, though they were introduced in the 1990s (Hochreiter & Schmidhuber, 1997). It was not viable to train an LSTM network on the hardware available in the 1990s due to the large number of computations necessitated by recurrent architectures. The specific architecture of an LSTM network is shown in Figure 3.

LSTMs are made to function specifically with time series or sequential data. Each time point in the data is processed by its own LSTM unit, and the results are both passed to the next layer and to the processing of the next time point within the same layer. The information passed to the next layer goes through an activation function, such as the tanh unit in Figure 3; however, the recurrent connection within the layer is not subjected to an activation function. The lack of an activation function on the recurrent connection is critical in avoiding the vanishing gradient problem, as it allows the gradient along that path to be a constant value of one. Each LSTM unit has a number of gates that control the flow of information. A gate is a combination of a sigmoidal activation unit and a pointwise multiplication. These gates control the amount of information that flows from one time point to the next, the amount of information that is output by each unit, and other functions of the LSTM unit.
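The gating arithmetic described above can be sketched as a single LSTM unit processing a short sequence. The weights are random, the sizes are illustrative, and biases are omitted for brevity; a practical implementation would use a deep learning framework rather than this hand-rolled version:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4   # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the concatenated [h, x] vector.
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
                  for _ in range(4))

def lstm_step(x, h, c):
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z)                  # forget gate: how much old cell state to keep
    i = sigmoid(Wi @ z)                  # input gate: how much new information to write
    o = sigmoid(Wo @ z)                  # output gate: how much of the state to emit
    c_new = f * c + i * np.tanh(Wc @ z)  # cell state carry: no activation on this path
    h_new = o * np.tanh(c_new)           # output to the next layer and next time step
    return h_new, c_new

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):     # a sequence of 5 time points
    h, c = lstm_step(x, h, c)
```

The `f * c` carry path is the additive, activation-free recurrent connection that lets gradients flow across many time steps without vanishing.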

Figure 3 Architecture of an LSTM network, as illustrated by Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
