The Accuracy of Multi-Layer Recurrent Neural Networks for Musical Lyric Generation
Abstract
With the rise of machine learning in artificial content creation for art, poetry, and more, many questions remain about the optimal neural network configurations for producing the highest-quality content possible. This paper examines the relationship between training data size and the accuracy of generated text content, as defined by both objective and perceived measures of quality. To assess the quality and accuracy of the generated text, a standardised measurement was developed to consistently score different artists with varying sample data sizes. Following this formulaic, empirical measurement of accuracy relative to training data size, a survey was completed by participants from the University of Auckland to gather data on the relationship between training set size and artist recognition. Google's open-source machine learning framework, TensorFlow, was used to generate mock lyrics trained on a specific musical artist's work.
ACM Reference Format:
Josue Espinosa Godinez. 2018. The Accuracy of Multi-Layer Recurrent Neural Networks for Musical Lyric Generation. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
To generate lyrics emulating a particular artist, a multi-layer recurrent neural network using a long short-term memory (LSTM) model is trained on a subset of an artist's lyrical discography in basic plain text format. This plain text file contains varying amounts of the discography for a particular artist and is used to train the neural network. Once trained, the network generates text resembling the training data by statistically predicting the next tokens in a sequence. Upon completion, the RNN-generated lyrics are compared against a validation data set and listed on a survey for human analysis.
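As a minimal, purely illustrative sketch of next-token prediction, consider a character-level bigram counter. This is far simpler than the LSTM actually used in this paper, and the function names are hypothetical; it only shows the idea of sampling the next character from observed statistics:

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each character, how often each character follows it."""
    model = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        model[a][b] += 1
    return model

def sample_next(model, ch, rng=random.Random(0)):
    """Sample a successor character proportionally to its observed count."""
    counts = model.get(ch)
    if not counts:
        return None  # character never seen in the training data
    chars, weights = zip(*counts.items())
    return rng.choices(chars, weights=weights)[0]
```

Repeatedly feeding each sampled character back in yields generated text; the LSTM plays the same role but conditions on a much longer history than a single preceding character.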
∗Research conducted for COMPSCI 380 at the University of Auckland under the super- vision of Professor Gillian Dobbie and David Huang.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference’17, July 2017, Washington, DC, USA
© 2018 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn
2 RELATED RESEARCH
There are many examples of machine learning being used to generate artificial content resembling human work. One example is Karpathy's work on training a neural network on Shakespeare's writings to establish a similar structure and style in artificially generated content resembling the legendary author [1]. One of the most interesting and more complex examples of artificially generated content comes from attentional generative adversarial networks [6]. Using an attentional generative adversarial network, researchers created a system that attempts to produce an image described by a piece of text. Another related example is the Neural Doodle project, which lets you draw a quick doodle, upload a source image, and generate an actual artistic work, with very good results if you annotate patches from the style image [7]. In a different context, Google DeepMind's work on programs that generate images uses two interacting neural networks to capture the main traits of an image with meaningful, realistic brush strokes, much as artists initially capture the broad colours and shapes of their subjects [3].
3 OBTAINING ACCURATE MODELS
It can be challenging to obtain accurate models for a particular dataset. Generally, it is preferable to make networks larger if you have the computational power and a reasonably sized dataset. When models overfit, adjusting dropout values and picking the model with the best validation performance can prove useful. If the dataset is too small for validation, TensorBoard output will be noisy and not very accurate or helpful.
Josue Espinosa Godinez University of Auckland jesp142@aucklanduni.ac.nz
Figure 1: Graph depicting error on the training set of data over time
3.1 TensorBoard
TensorBoard is a visualisation toolkit used to debug and optimise TensorFlow programs. It was used to visually assess the performance of the recurrent neural network and tweak parameters to obtain the optimal configuration for lyric generation. Figure 1 shows the decrease in training loss over time for the training set containing 3 Red Hot Chili Peppers albums.
3.2 Input Sanitisation
In order to minimise the independent variables across varying artists and data sizes, a consistent data format is maintained between the different training/validation data sets. Dictated by practicality, the data sets consist of a plain text file with no formatting except standard spacing, line breaks for stanzas, and structural song parts denoted in square brackets (e.g. [Chorus]).
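A sanitisation pass of this kind could look like the following sketch. The function name and the exact normalisation rules are assumptions for illustration, not the paper's actual preprocessing script:

```python
import re

def sanitise_lyrics(raw):
    """Normalise a lyrics file to plain text: collapse runs of spaces,
    keep single blank lines between stanzas, and leave bracketed
    structural markers such as [Chorus] untouched."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line per stanza break
    return text.strip() + "\n"
```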
3.3 RNN size
Choosing the hidden state size of an RNN is challenging, since there is no "set in stone" method for deciding hyperparameter values and much of the process is based on feel and experience. I alternated between high and low extremes: large vectors were computationally challenging and overkill for the comparatively tiny datasets, while very low dimensions were not effective at learning at all. I therefore increased the size in factors of 2 before arriving at a reasonable value of 128 for the relatively small amounts of data I was working with.
3.4 Number of Layers
Due to the relatively small size of the data, 2 layers were suitable for my purposes. For the 3-album Guns N' Roses dataset, there was a low error on the training set and a higher error on the validation set. To deal with this overfitting, a small amount of dropout was added.
3.5 Sequence Length
Songs generally have specific themes and a tone. Contextually, all songs vary in their composition and story-telling, especially in their complexity. I found one stanza to be sufficient for establishing context while remaining relevant yet interesting, so 50 was chosen as the number of time steps to unroll for. The LSTM cells can remember for more than 50 steps, but the effect falls off for longer sequences.
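Unrolling for 50 time steps corresponds to training on (input, target) windows of 50 characters, where the target is the input shifted one step ahead. A sketch of that windowing (the helper name is an assumption):

```python
def make_sequences(text, seq_length=50):
    """Split character text into (input, target) pairs of seq_length steps;
    the target is the input shifted one character ahead, so the network
    learns to predict the next character at every position."""
    pairs = []
    for i in range(0, len(text) - seq_length, seq_length):
        pairs.append((text[i:i + seq_length], text[i + 1:i + seq_length + 1]))
    return pairs
```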
3.6 Dropout
Adding dropout was unnecessary the vast majority of the time, since the network usually trained without overfitting. For the 3-album Guns N' Roses training set, however, the probability of keeping weights was dropped to 90% in the output layer and 95% in the input layer.
4 IMPLEMENTATION
(Brief text overview with main structure in a flow chart diagram)
5 MEASUREMENTS OF ACCURACY (EMPIRICAL ACCURACY ANALYSIS)
Due to the artistic nature of music, it is difficult to assess the "quality" of generated lyrics, because any assessment of their value is inherently subjective and individual to personal taste. The goal is to develop a consistent measurement of similarity across different trial runs, input data sets, and outputs. Beyond the subjective human tests performed later, an objective test based on the factors listed below is performed, with varying weights combined in a basic algorithmic/formulaic format. Since this evaluation should tend to be more objective than the later human experiment, it may be interesting to compare any similarities or trends between the two results.
5.1 Repetitiveness/Structure
One could argue that repetition is a key element of songs [2]. Following this train of thought, it is natural to see a relationship between repetitiveness and compressibility. Take, for example, the Lempel-Ziv algorithm: because it is heavily based on repeated sequences, the higher the compressibility, the higher the repetitiveness. One can compare compression rates within a particular artist's hits, for example the compression rates of "Can't Buy Me Love", "Help!", "Yesterday", and "Yellow Submarine", against that of a generated song, or perhaps an average over several quintessential Beatles songs.
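Such a compression-based repetitiveness score could be sketched with a Lempel-Ziv-style compressor from the Python standard library; the scoring formula below is an assumption for illustration:

```python
import zlib

def repetitiveness(lyrics):
    """Score lyrics by how well they compress: 1 - compressed/raw size.
    DEFLATE (zlib) exploits repeated substrings, so more repetitive
    lyrics compress further and score closer to 1."""
    raw = lyrics.encode("utf-8")
    return 1 - len(zlib.compress(raw, 9)) / len(raw)
```

A repeated chorus scores far higher than a stanza of mostly distinct words, which is exactly the contrast the proposed measurement relies on.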
General poetic structure should also be investigated: line length, sentence length, paragraphs, stanzas, and overall structure.
5.2 Word-based
Artist-specific vocabulary is one signal – see "goo goo g'joob" from "I Am the Walrus" by The Beatles, California themes from the Red Hot Chili Peppers, English words from Ed Sheeran, and sex/women themes from Bruno Mars. Guitar solo sections and specific words from very distinctive songs are another ("Brownstone" was listed in one participant's reasoning for selecting an artist).
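One simple way to quantify artist-specific vocabulary could be the Jaccard similarity between the word sets of generated lyrics and the artist's real corpus. The function and its tokenisation rule are illustrative assumptions, not part of the paper's measurement:

```python
import re

def vocab_overlap(generated, reference):
    """Jaccard similarity between the word sets of generated lyrics and an
    artist's reference corpus; higher means the generated text reuses more
    of the artist's characteristic vocabulary."""
    tokens = lambda s: set(re.findall(r"[a-z']+", s.lower()))
    g, r = tokens(generated), tokens(reference)
    return len(g & r) / len(g | r) if g | r else 0.0
```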
5.3 Character-based
The development and accuracy of the language model for the English language itself, including words and punctuation, can also be assessed.
6 ANALYSIS RESULTS
7 BLIND SURVEY (SUBJECTIVE ACCURACY ANALYSIS)
7.1 Pre-Survey
A pre-survey questionnaire was sent out asking 6 participants about their ability to recognise artists exclusively via song lyrics. The survey also asked for their levels of familiarity with particular genres/artists, both to assist in selecting the artists to generate lyrics for and to increase familiarity with the lyrical habits and traits of their personal musical taste. Demographics were also collected to evaluate how age/gender might influence musical taste and judgement, and to list possible skews/biases in the results. The demographics heavily influenced the results, with all of the
participants being over the age of 25 and a majority being between 35 and 44. There was an even 50% split between men and women. Pop was selected as the most popular genre, with Rock a close second. Specific artists mentioned several times included Guns N' Roses, Red Hot Chili Peppers, Bruno Mars, and Ed Sheeran, so these were the artists selected for use in the survey. Conveniently, the 2 most popular genres, rock and pop, each had 2 artists available for sampling in the survey that followed.
7.2 Survey Design
A blind test was conducted on a group of 9 people from the Computer Science department at the University of Auckland to determine the optimal amount of training data for participants to recognise artists. The most popular artists listed in the pre-survey were selected for the survey, with 2 artists from each of the 2 most popular genres: Red Hot Chili Peppers and Guns N' Roses for rock, and Bruno Mars and Ed Sheeran for pop. Because Bruno Mars and Ed Sheeran each have only 3 albums, 3 levels of data sets were tested for all of the artists – 1 album, 2 albums, and 3 albums. All artists had similar sizes between their albums, e.g. all 3 of Ed Sheeran's albums shared very similar lyrical sizes. The recurrent neural network was fed each of the 3 data sizes for each artist, generating lyrics each time a new set was inputted. Consequently, 12 "songs" were generated for the 4 artists at the 3 levels of data sizes.
The survey was 12 pages with one page for each “song” and the same questions for all of the lyrics:
1) Which artist do these lyrics most closely resemble?
2) How confident is your guess?
3) What made you guess this particular artist?
The answer options for question 1 were selected via the Spotify page for each artist, taking the top artists from the "Fans Also Like" page to construct an answer set of similar artists, in order to determine whether the generated lyrics truly resemble the artist. This answer set is reused for every question regarding that particular artist, so overall there are 4 answer sets (a unique answer set for each of the 4 artists). Both the songs and the answer order are randomly shuffled to avoid bias/preference for a particular letter, e.g. 'C', someone selecting 'A' for every question, or someone getting used to an artist's answer being a particular letter and not reading all of the answers.
The purpose of question 2 is to reduce the weight of answers from participants who do not feel confident about their familiarity with the artist, or who are simply guessing and do not have an answer that demonstrates similarity to an artist. The options were no confidence, low confidence, average confidence, and high confidence.
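This down-weighting could be implemented as a confidence-weighted accuracy. The numeric weights below (no = 0, low = 1, average = 2, high = 3) are an assumption for illustration, since the survey analysis does not specify them:

```python
def weighted_accuracy(answers):
    """Accuracy with each answer weighted by stated confidence, so that
    unconfident guesses contribute less to the final score.
    `answers` is a list of (correct: bool, confidence: str) pairs."""
    weights = {"no": 0, "low": 1, "average": 2, "high": 3}
    total = sum(weights[conf] for _, conf in answers)
    if total == 0:
        return 0.0  # every answer was a zero-confidence guess
    return sum(weights[conf] for ok, conf in answers if ok) / total
```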
The purpose of question 3 is to establish whether there is any consistent pattern that biases the results. For example, many participants marked their reasoning for picking Red Hot Chili Peppers as "because it mentioned California", perhaps unfairly skewing the correct response rate upwards (it is well known that the Red Hot Chili Peppers reference California in many of their songs).
Figure 2: Graph depicting % of correct answers at varying confidence levels for the 4 artists and their data set sizes.
8 SURVEY RESULTS/FINDINGS
There were 9 participants in the survey. The majority of the participants were male and over the age of 35. 73% of all answers for the 4 questions with only 1 album of training data that were marked with average confidence or higher were correct. Comparatively, 80% of answers marked with average or higher confidence were correct for questions with 2 albums of training data. Similarly, 80% of 3-album-trained questions marked with average or higher confidence were correct.
Since this is relatively uninformative, I investigated artist-specific results to determine which artist had the highest results, to use as a more pointed example. Including all confidence levels and all 3 questions for every artist, Ed Sheeran scored 37% correct overall, Bruno Mars 48%, Red Hot Chili Peppers 59%, and Guns N' Roses 63%. Given that the participants were mainly 30-40 year old men, these results are consistent and unsurprising: RHCP and GnR are both roughly 30-year-old bands that were very popular at their peak, whereas Ed Sheeran and Bruno Mars are both very modern artists.
Using exclusively Guns N' Roses and the Red Hot Chili Peppers, the following metrics were determined. 80% of Guns N' Roses questions marked with an average or higher confidence level were correct, while 100% of the Red Hot Chili Peppers songs were correct at this same confidence level. This may also be due to lyrical content bias, as covered in the pre-survey explanation of the third question (explaining the reasoning for choosing a particular artist). However, since the purpose of this survey is to determine how data size impacts lyrical artist resemblance, let us investigate the relationship between data sizes and correct guesses. Using only one album and
all answers regardless of confidence levels, Guns N’ Roses songs were correct 67% of the time while the Red Hot Chili Peppers were correct 78% of the time. Using 2 albums and all confidence level answers, Guns N’ Roses was correct again 67% of the time while RHCP was correct only 44% of the time. Finally, using 3 albums at all confidence level answers, Guns N’ Roses was correct 56% of the time and RHCP was also at 56%. It seems that the more training data provided, the harder it is to guess which artist it is.
One possible reason behind this finding may be not that the recurrent network loses accuracy with more data, but that artists evolve over time and change styles between albums and eras. When fed a data set with many different types of lyrics and themes, the network mixes and mashes them and produces an average that is difficult to differentiate – akin to several bright colours that each stand alone well and are easy to identify, but when mixed together create a grimy brown that is difficult to distinguish.
As a small segue, the confidence/correctness relationship is very intuitive. Questions answered with no confidence were correct 43% of the time, low confidence 39% of the time, average confidence 75% of the time, and high confidence 82% of the time.
Make apparent the relationship between size of training data and perceived/measured accuracy of generated lyrics. Regarding both empirical and subjective analysis.
9 CONCLUSION AND FUTURE RESEARCH
These sorts of challenges suggest additional levels of functionality for neural networks: utilising context and potentially word/sentence structure through the Stanford Parser [5]. Summarise, establish findings and what was learned/surprising, and what can be done from here. Potential places for further research/expansion. Potential limitations of the current implementation and/or things that could have been done differently/more efficiently. Justify arbitrary choices, etc.
References
[1] A. Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks. 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[2] The Pudding. Song repetition analysis. 2017. https://pudding.cool/2017/05/song-repetition/
[3] DeepMind. Learning to generate images. https://deepmind.com/blog/learning-to-generate-images/
[4] SQuAD: The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/
[5] The Stanford Parser. https://nlp.stanford.edu/software/lex-parser.shtml
[6] AttnGAN. https://github.com/taoxugit/AttnGAN/
[7] neural-doodle. https://github.com/alexjc/neural-doodle