2.2.4 Long Short-term Memory (LSTMs)
Long Short-Term Memory is a recurrent neural network architecture designed specifically to deal with the Vanishing Gradient Problem, which we discussed in Section 2.2.2. LSTM units are among the most widely used models in Deep Learning for a wide variety of problems involving sequential data. They are more effective than simple recurrent neural networks because they can capture long-term temporal dependencies and are robust against long time lags between important events.
The main idea behind LSTMs is a memory cell which can maintain its state over time with non-linear gating units which regulate the information flow into and out of the cell [10]. See Figure 8.
Figure 8: Vanilla Long Short-term Memory Cell [10]
All gates (input, output and forget) use the sigmoid activation function, while the block input usually uses tanh. The forget gate, which was not part of the initial design by [11], enables the cell to reset its own state, i.e. to “forget”. The input gate controls to what extent the input flows into the cell, and the output gate does the same for the output. The peephole connections (in blue) were also added to the design later, to let the cell state control the gates themselves, making precise timings easier to learn. In the design above, all input, recurrent and peephole connections are weighted, and their weights are usually learned using Backpropagation Through Time (BPTT). We mentioned that gradients tend to become very small when using BPTT, which causes certain layers to learn too slowly. However, because the recurrent connection of the LSTM cell state carries no squashing activation function, gradients propagated along it do not tend to vanish.
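To make the gate equations concrete, the following is a minimal NumPy sketch of one forward step of a vanilla LSTM cell (peephole connections omitted); the packed weight layout and the variable names are illustrative assumptions, not notation from [10]:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of a vanilla LSTM cell (no peepholes).

    W has shape (input_dim + hidden_dim, 4 * hidden_dim) and packs the
    weights of all four blocks; b has shape (4 * hidden_dim,).
    """
    H = h_prev.shape[0]
    z = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(z[:H])         # input gate
    f = sigmoid(z[H:2 * H])    # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:])     # block input
    c = f * c_prev + i * g     # additive cell-state update
    h = o * np.tanh(c)         # block output
    return h, c
```

Note the additive form of the cell-state update: gradients flow through `c` via the forget gate alone, with no activation function applied along the recurrent path, which is why they do not vanish the way they do in a plain RNN.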
We will be making extensive use of LSTM blocks due to their ability to successfully maintain temporal behaviour, which comes in very handy when learning and generating data sequences.
2.3 Basic Probability Concepts
2.3.1 Probability Theory
A very important concept in Machine Learning is that of uncertainty, which arises from noise in our training data and from the limited size of our data sets. Probability theory is a fundamental part of pattern recognition as it allows us to quantify and manipulate uncertainty; this is vital for making correct predictions and therefore optimal decisions through decision theory, covered in the following section.
Let’s now illustrate the basic concepts of probability theory using a simple example. Imagine we have two bags, one made of silk and one made of cotton. The silk bag contains 2 white stones and 2 black ones, while the cotton one contains 3 purple stones and 1 orange one. We repeatedly pick a bag at random, draw a stone from it, note its colour and put it back. Suppose that in 30% of the cases we choose the cotton bag and in 70% the silk one (because it feels nicer). We can now introduce the concept of a random variable – a variable whose value depends on the occurrence of random events. The identity of the bag we choose can be represented as a random variable B with values c for cotton and s for silk. We can do the same for the colour of the stones – C with values w, b, p and o. The probability of an event is the fraction of times the event occurs over the total number of trials. For example, the probability of selecting the silk bag is p(B = s) = 7/10 and the cotton one p(B = c) = 3/10. We can also introduce the conditional probability p(X = x|Y = y), which can be read as “what is the probability of x given that y has already occurred?”. For example, the probability of drawing a white stone given that we chose the silk bag is p(C = w|B = s) = 1/2. Furthermore, the joint probability p(X = x, Y = y) can be read as “what is the probability of both x and y occurring at the same time?”. For example, p(C = w, B = s) = p(C = w|B = s) p(B = s) = 1/2 · 7/10 = 7/20. Most of the time we do not really care about one specific probability but about the whole distribution of probabilities, e.g. p(B), called the probability distribution.
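The probabilities in the example above can be verified with a short Python sketch (the dictionary layout is purely an illustrative choice):

```python
# Exact probabilities from the bag-and-stone example.
p_bag = {"s": 0.7, "c": 0.3}        # p(B): silk 70%, cotton 30%
p_color_given_bag = {
    "s": {"w": 0.5, "b": 0.5},      # silk bag: 2 white, 2 black stones
    "c": {"p": 0.75, "o": 0.25},    # cotton bag: 3 purple, 1 orange stone
}

def joint(color, bag):
    # Product rule: p(C = color, B = bag) = p(C = color | B = bag) * p(B = bag)
    return p_color_given_bag[bag].get(color, 0.0) * p_bag[bag]
```

For instance, `joint("w", "s")` returns 0.35, i.e. 7/20.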
We can now introduce the following relationship between conditional probabilities, Bayes’ theorem:
p(Y|X) = p(X|Y) p(Y) / p(X),

where p(Y) is the prior, p(X|Y) the likelihood, p(X) the evidence and p(Y|X) the posterior. The prior probability expresses our knowledge before we observe the event Y. For example, if we are asked which bag has been chosen before being given the colour of the stone drawn, all the information we have is the probability distribution p(B). However, once we know the actual colour drawn, we can use Bayes’ theorem to compute the posterior probability p(B|C). The likelihood and the evidence can easily be computed from the data.
2.3.2 Decision Theory
Suppose we have an input vector x and a corresponding target vector t. For classification tasks, t represents the class labels to which each element of x belongs. The joint probability p(x, t) introduced above provides a complete summary of the uncertainty associated with these variables [12]. Determining p(x, t) is a pure example of inference and is the underlying problem of generative models, which we will discuss in the next section.
However, let’s first illustrate the purpose of decision theory using a simpler classification task. Consider, for example, an HIV diagnosis in which we have taken a blood sample from a patient and wish to determine whether the patient has HIV or not. The input vector x represents the blood sample, and the target t can take only two values (classes), P (positive) and N (negative). The inference problem we need to solve is determining the joint distribution p(x, {P, N}). Decision theory then supplies the decision step: whether to give treatment to the patient or not.
However, just for the classification task we are interested in the probabilities of the two classes given the blood sample, which are given by p(P|x) and p(N|x). We can again use Bayes’ theorem to find those:
p(P|x) = p(x|P) p(P) / p(x)
Note that any of the quantities needed for applying Bayes’ theorem can be obtained from the joint distribution by marginalizing. p(P) above is our prior knowledge and p(P|x) is the posterior probability of the patient being HIV-positive. For correct classification we can simply choose the class with the highest posterior probability. For pure classification tasks there are several compelling reasons for estimating the conditional probabilities directly from the data rather than from the joint probability, one of which is that “one should solve the classification problem directly and never solve a more general problem as an intermediate step” [13]. For generative problems, however, we are particularly interested in understanding the distribution behind the input data, which we can retrieve from the joint distribution.
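As a small illustration of the decision step, the sketch below computes the posterior p(P|x) from Bayes’ theorem, with the evidence obtained by marginalizing over the two classes; the prior and the two likelihood values in the usage example are made-up numbers, not data from any real test:

```python
def posterior(prior_P, lik_P, lik_N):
    """Posterior p(P|x) for a two-class problem via Bayes' theorem.

    prior_P: prior probability p(P) of the positive class.
    lik_P, lik_N: likelihoods p(x|P) and p(x|N) of the observed sample.
    """
    # Evidence p(x) by marginalization: p(x|P)p(P) + p(x|N)p(N)
    evidence = lik_P * prior_P + lik_N * (1.0 - prior_P)
    return lik_P * prior_P / evidence
```

For example, with a (hypothetical) prior of 0.01 and likelihoods 0.99 and 0.05, `posterior(0.01, 0.99, 0.05)` gives roughly 0.167 – the classic illustration that a sensitive test on a rare condition still yields a modest posterior.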
2.4 Discriminative vs. Generative Models
In the section above we broke the classification problem down into two separate stages: the inference stage, in which we try to learn a model for p(t|x), and the subsequent decision stage, in which we calculate the posterior probabilities and decide the optimal class assignments for future input data. An alternative is to solve this problem in one go using a discriminant function, which learns a mapping from x directly to decisions. For example, we can achieve that with an ANN (Section 2.2.1). There are, in fact, three approaches for solving decision problems.
The first one, and simplest, is to find a function f(x) that maps each input x onto a class label with as low error as possible.
The second one, which we discussed above, is to find the posterior probabilities and then use decision theory to make class assignments for new data. Models that infer the posterior p(t|x) directly from the data are called discriminative.
The third one is to model the joint distribution of probabilities and then normalize it to find the posterior probabilities, so that we can again use decision theory to make class assignments. However, modelling the joint distribution gives us one more very interesting opportunity – we can actually sample from it to generate synthetic data points in the input space. In other words, we can “dream up” possible input data that maps to a certain class. Such models are called generative, and those are the ones we will focus on in the rest of the report.
2.5 Generative Adversarial Networks (GAN)
Generative modelling is an active area of deep learning research, and many researchers believe that breakthroughs in generative models are key to solving “unsupervised learning”. Above we mentioned that in order to generate synthetic samples from the input space we need to model the joint probability distribution between the input data and the class labels. Ian Goodfellow’s paper Generative Adversarial Nets [14] proposes an elegant way for a neural network to directly learn the sampler function for the joint distribution, by pitting it against another neural network that tries to reject samples from the generator distribution while only accepting samples from the “true” distribution.
Let’s illustrate this with an example. Suppose that George is a Monet art forger and Davina is an art advisor who tries to tell fake paintings from real ones. Let’s simplify things a bit and focus on only one specific feature. Suppose that all real Monet paintings have a certain number of trees painted on them, and that number is sampled randomly from a probability distribution whose density function only Monet knew; neither Davina nor George is aware of that distribution. George’s goal is to generate synthetic samples x′ from that distribution so that his fake art is indistinguishable from the real thing. As George is not aware of the true underlying generative process, he needs to approximate it using some kind of approximator, e.g. a neural network. Naturally, George’s model (G) is generative, while Davina’s (D) is discriminative, trying to tell whether a sample x′ is real or fake.
In more technical terms, we train D to maximize the probability of assigning the correct label to both training examples and samples from G, while G is trained to minimize:
log(1 − D(G(z))),
where z ∼ uniform(0, 1) are input noise variables. This setting closely resembles the Minimax algorithm, popular in game theory and game AI, where a player tries to minimize its potential loss in the worst case. The problem we are trying to solve is therefore (with a value function V(D, G)):
min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
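The value function splits into the two training objectives, which can be sketched in NumPy as follows, assuming `d_real` and `d_fake` hold the discriminator’s outputs on a batch of real and generated samples (the small `eps` is just numerical-stability padding, our own addition):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    # D maximises log D(x) + log(1 - D(G(z))); we minimise the negative.
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    # G minimises log(1 - D(G(z))), i.e. it wants D(G(z)) close to 1.
    return np.mean(np.log(1.0 - d_fake + eps))
```

When the discriminator is maximally confused (outputs of 0.5 everywhere), the discriminator loss is 2 log 2 – the equilibrium value derived in [14].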
We can use gradient-based methods to find the optimal G and D. Let us denote by p_g the distribution approximated by the generator G, and by p_data the true distribution. In Figure 9 we can see the distribution p_g before and after training. After training, the discriminative model can barely distinguish between the generator distribution and the true underlying one (shown by the decision boundary). In the figure, gradient descent was run for 10,000 iterations, after pretraining the discriminative model D using a simple MSE loss function to fit p_data.
Figure 9: Generative Adversarial Nets before and after 10000 iterations [15].
However, GANs have one big disadvantage. The images are generated from arbitrary noise variables z, sampled from a uniform distribution, so there is no way to generate images with specific features, other than repeatedly generating samples until one happens to fit your needs. We are still going to explore this model for generating single bars, but it is highly likely that it will not work for long sequences as part of a higher-level RNN architecture.
Furthermore, GANs are very difficult to optimize as, for large problems, the two networks have to stay in sync. If the generator wins by a big margin, it will start learning too slowly, because the discriminator is unable to teach it anything new. On the other hand, if the discriminator is almost optimal, the generator cannot improve, because the error signal it receives is too weak. There are, however, ways to mitigate this, such as adding L2 regularization to the discriminator model to prevent it from overfitting the data and becoming too “strong”. Limiting the expressiveness of the discriminator (by reducing the number of model parameters) is another way to weaken it.
2.6 Auto-encoding Variational Bayes (VAE)
Another very popular framework for generative models is the Variational Autoencoder, introduced in 2014 by D. Kingma and M. Welling [16]. The main idea behind the VAE is to try to maximize the probability of each x in the training set under the entire generative process, as in:
P(X) = ∫ P(X|z, θ) P(z) dz,
where z are the latent variables and θ are the model parameters.
In other words, we need to perform maximum likelihood estimation to find the parameters θ. P(X|z, θ) is normally a Gaussian or a Bernoulli distribution, whose mean is a parameter to be learned. In the case of a Gaussian, the variance is a hyperparameter. Since our image data is binary, i.e. pixels are either black or white, we will be using the Bernoulli distribution.
Furthermore, for the prior P(z) we use a Normal distribution with mean 0 and covariance I, i.e. N(0, I), from which we can sample values of z and have the network map them to the latent values that maximize the probability of each x. However, that space is far too big and, in fact, for most z, P(X|z) will be close to 0; in other words, this would be very inefficient. We are actually more interested in the posterior distribution P(z|X). However, due to the integral in the marginal likelihood, this is intractable to compute. That is why we try to find an approximation Q(z|X) to the posterior which, given an x, yields a distribution over the z values that are likely to produce x.
Therefore, we want to minimize the difference between the two conditional distributions Q(z|X) and P(z|X), measured by their KL-divergence. Rearranging that divergence shows that minimizing it amounts to maximizing a lower bound on log P(X), which contains the term known as the latent loss:

D_KL[Q(z|X) || P(z)].
Naturally, we would also like to reduce the reconstruction loss, yielding the total loss function for a single sample x^(i):

L(θ; x^(i)) = D_KL[Q(z^(i)|x^(i)) || P(z^(i))] − (1/L) Σ_{l=1}^{L} log P(x^(i)|z^(i,l))
Therefore, we can consider Q as the “encoder”, as it gives us a code z for an input image x, and P(X|z) as the “decoder”, since given a code z it produces a distribution over the possible corresponding values of x. See Figure 10.
Figure 10: Variational Autoencoder
The KL-divergence can be computed analytically as shown in [16], thus only the re- construction loss has to be estimated by sampling. Therefore, we can use stochastic gradient descent to find the optimal parameters θ.
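As a sketch, assuming Q(z|X) is a diagonal Gaussian with mean `mu` and log-variance `log_var` (the parametrisation used in [16]) and a Bernoulli decoder, the two loss terms can be written as:

```python
import numpy as np

def latent_loss(mu, log_var):
    # Analytic KL-divergence D_KL[N(mu, sigma^2) || N(0, I)], as derived in [16].
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def reconstruction_loss(x, x_probs, eps=1e-12):
    # Negative Bernoulli log-likelihood for binary (black/white) pixels;
    # x_probs are the decoder's predicted pixel probabilities.
    return -np.sum(x * np.log(x_probs + eps)
                   + (1.0 - x) * np.log(1.0 - x_probs + eps))
```

Note that the latent loss is exactly zero when the approximate posterior already equals the prior N(0, I), i.e. mu = 0 and log_var = 0.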
2.7 Adversarial Autoencoders
Adversarial Autoencoders can be seen as a hybrid between VAEs and GANs, where a GAN is used to perform variational inference by matching the aggregated posterior of the hidden code z with an arbitrary prior distribution [17]. TBC
2.8 Deep Recurrent Attentive Writer (DRAW)
Google DeepMind’s DRAW [18] is another generative model architecture, which evolved from the family of Variational Autoencoders. DRAW proposes a model that captures visual structure using a sequence of partial glimpses, i.e. focusing on parts of the image while ignoring the rest, similar to what humans do. Furthermore, the encoder and decoder neural networks are recurrent, which allows the encoder to be aware of the decoder’s previous output and thus amend the codes appropriately. To accommodate this idea, the outputs of the decoder at each unrolling step are added together to obtain the final generative distribution P(X|z). See Figure 11.
Figure 11: Deep Recurrent Attentive Writer Architecture
The intuition behind this unrolling in time is that the network improves the reconstruction a little with every time-step. The RNN at the previous time-step also specifies where to read, signified in the figure by the connection from the decoder to the focus block. h^enc_{t−1} and h^dec_{t−1} are the encoder and decoder hidden vectors, respectively. The latent loss is very similar to that of VAEs, with the only difference that we sum the KL-divergence over all time-steps:
L_z = Σ_{t=1}^{T} D_KL[Q(z_t|h^enc_t) || P(z_t)].
The c_t above is an intermediate canvas matrix, initialized to 0. At every time-step we add the output of the decoder network to the previous canvas matrix c_{t−1}, obtaining after the final step the canvas matrix c_T, which is used to parametrise a model D(x|c_T) representing the sampling distribution P(x|z). The reconstruction loss is therefore L_x = −log D(x|c_T). To get the total loss we add the latent loss to the reconstruction loss:
L = ⟨L_x + L_z⟩_{z∼Q} = Σ_{t=1}^{T} D_KL[Q(z_t|h^enc_t) || P(z_t)] − log D(x|c_T)
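The canvas accumulation described above can be illustrated with a toy NumPy loop; the random linear “decoder”, the prior sampling in place of Q, and all sizes here are stand-ins, not the recurrent networks of [18]:

```python
import numpy as np

rng = np.random.default_rng(0)
T, canvas_size, z_dim = 5, 16, 4

# Stand-in "decoder": a fixed random linear map from z_t to a canvas update.
W_dec = rng.normal(size=(z_dim, canvas_size))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

c = np.zeros(canvas_size)           # canvas c_0 initialised to 0
for t in range(T):
    z_t = rng.normal(size=z_dim)    # sample z_t (stand-in for drawing from Q)
    c = c + z_t @ W_dec             # add the decoder output to c_{t-1}
x_probs = sigmoid(c)                # c_T parametrises Bernoulli probabilities D(x|c_T)
```

Each pass through the loop nudges the canvas a little, matching the intuition that the reconstruction is improved incrementally rather than produced in one shot.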
2.9 TensorFlow
All generative models discussed in the previous sections belong to the family of deep learning models; they all require vast amounts of data to produce good results, and therefore serious computational power. It will be vital for us to have a stable tool-set with a robust pipeline that can fully utilize all of our GPU resources. That is why I will resort to using TensorFlow [19], a deep learning framework designed by Google’s Brain team. There are quite a few advantages to using this framework over similar ones, which we will cover in the next few paragraphs.
Firstly, being designed by Google, TensorFlow quickly became popular and managed to attract a big community of researchers. As a result, there are plenty of discussions around many of the issues that arise when implementing machine learning models in this particular framework, which can only benefit us. In fact, quite a few of the models that I will be exploring have already been implemented and open-sourced for the simpler problem of generating hand-written digits (MNIST). Furthermore, it is still being actively developed by the team that created it, together with several open-source contributors.
Secondly, TensorFlow is considered quite “fat”, as it comes with a lot of features and implementations of basic deep learning building blocks such as LSTMs, which can be used as black boxes. This will be particularly useful for my project, as I will be spared the hassle of implementing basic features such as convolutional and deconvolutional layers, LSTM blocks, gradient-based optimizers and so on. Furthermore, it provides useful facilities such as TensorBoard, which can be used to monitor the training performance of a model, or even to inspect particular convolutions.
Thirdly, TensorFlow has built-in GPU support, enabled automatically for all capable GPUs on the system. The framework automatically decides whether to run each operator, e.g. a gradient update, on the GPU or the CPU, depending on whether a GPU implementation exists alongside the CPU one.
Fourthly, TensorFlow is a Python-based framework, which makes development much easier and also allows us to take advantage of other useful Python libraries such as NumPy and SciPy.
3 Evaluation
As part of the evaluation we will perform 3 different types of tests to assess the originality of the generated music pieces and also how indistinguishable they are from real music.
The first test will be a plagiarism test similar to the one performed by [5]: for each generated music piece we calculate a measure of plagiarism – the length of the longest chorale subsequence that can be found identically in the training set. This will give us an idea of how original and creative our models are and whether they have overfitted the training set. To perform this evaluation we will generate between 50 and 100 music pieces using each of the seemingly successful models, and then for each of them find the longest common subsequence in the training set. This can then be plotted on a simple histogram to visually and empirically assess the creativity of the generated chorales. To perform this plagiarism test, however, we need a way to compare two bars. Initially we will use the Euclidean distance between any two bar images. If this does not work, we can look into other methods, e.g. feature histograms.
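A minimal sketch of this plagiarism measure, assuming each piece is a sequence of bar images and treating two bars as identical when their Euclidean distance falls below a tolerance (the function names are our own):

```python
import numpy as np

def bar_distance(a, b):
    # Euclidean distance between two bar images, flattened to vectors.
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return np.linalg.norm(a - b)

def longest_copied_run(piece, corpus, tol=1e-6):
    """Length of the longest contiguous subsequence of `piece` found
    (within `tol` per bar) in the training sequence `corpus`."""
    best = 0
    for i in range(len(piece)):
        for j in range(len(corpus)):
            k = 0
            while (i + k < len(piece) and j + k < len(corpus)
                   and bar_distance(piece[i + k], corpus[j + k]) <= tol):
                k += 1
            best = max(best, k)
    return best
```

The brute-force double loop is quadratic in the number of bars, which should be acceptable at the scale of 50–100 generated pieces; a suffix-based method could replace it if it proves too slow.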
The second test will be a listening test, where subjects categorized by their level of musical proficiency will be asked to classify real and generated music pieces between two binary options – “Human” and “Computer”. This will give us a robust method to empirically tell how human-sounding the generated music pieces are. Before beginning, the subjects will have to state their level of proficiency on a scale from 1 to 5, where 1 is “I rarely listen to music” and 5 is “I am a professional musician”. To reach individuals who might be interested in taking part in our test, we will get in touch with the Royal College of Music, Goldsmiths (University of London) and other similar institutions.
A third test that we might explore, also similar to one performed in [5], is a “perception” test, where we provide subjects with music pieces, both real and generated, from a certain time period, e.g. Romantic. Random pairs are drawn and presented to the subjects, who must say which of the two “sounds more Romantic” (for the example above). After collecting this data we can compare empirically which model is able to generate the most Romantic-sounding music pieces.