ABSTRACT—Music generated by recurrent neural networks (RNNs) generally suffers from a lack of global structure. Although networks can learn note-by-note transition probabilities and even repeat phrases, attempts at learning an entire musical form and using that knowledge to guide composition have been unsuccessful. Human composers have written pieces that are both creative and precise; classical music, for example, is well known for its meticulous structure and emotional impact. RNNs are powerful models that have achieved strong performance on difficult learning tasks involving temporal dependencies. We propose generative RNN models that produce sheet music with well-formed structure and expressive conventions, without predefining rules of musical form to the models.
1. INTRODUCTION
Music is itself a difficult language, so beginners wanting to compose must first learn the language of music. Like other languages, music is a form of expression, with its own key notes and melodies. RNNs have been very successful at problems such as sentiment analysis, chatbot construction, and text generation. So that beginners need not first master musical notation, such as time signatures, beats, and notes, we use deep learning to build a dynamic model of musical structure and effects.
We extend the application of RNNs to building a music generator using character-based neural language models built from LSTM cells, including bi-directionality and attention. We wanted the network design to have a few properties:
• Understand time signatures: understanding the time signature is fundamental to composing music. The network should know its current position in time relative to the time signature, since most music is composed with a fixed one.
• Be time-invariant: we wanted the network to be able to compose indefinitely, so it needed to be identical for every time step.
• Be (for the most part) note-invariant: music can be freely transposed up and down and remains fundamentally the same, so the structure of the network should be nearly identical for each note.
• Enable multiple notes to be played simultaneously, and permit selection of coherent chords.
• Allow the same note to be repeated.
2. RELATED WORK
Previous work has used a variety of models, including Recurrent Neural Networks combined with Restricted Boltzmann Machines (RNN-RBM) and character-level RNNs (Char-RNNs). The RNN-RBMs concentrated on generating polyphonic music from piano rolls, while the Char-RNNs concentrated solely on re-creating Irish folk music rather than arbitrary music generation. Our focus is on general monophonic music composition across various genres.
Most past work in music composition makes use of MIDI or raw audio sequences to learn the complex polyphonic structure in music. In the case of raw audio, it is common to map the piece onto mel-frequency cepstral coefficients (MFCCs).
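For illustration, MFCCs can be computed from raw audio with a standard signal-processing library. The following is a minimal sketch, assuming the librosa library and a placeholder file name; it is not part of our model:

    import librosa

    # Load raw audio (y: waveform samples, sr: sample rate).
    y, sr = librosa.load('piece.wav', sr=None)   # 'piece.wav' is a placeholder path

    # Map the waveform onto mel-frequency cepstral coefficients:
    # one 20-dimensional MFCC vector per analysis frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    print(mfcc.shape)  # (20, n_frames)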
3. METHODOLOGY
Because each time step in a network is a single iteration, most existing RNN-based music composition approaches are invariant in time. They are generally not invariant in note, however: there is typically some particular output node that represents each note, so transposing everything up by, say, one whole step would produce a completely different output. For most sequence tasks this is what you would want: shifting every letter of a word by one produces a completely different word. But in music you want to emphasize the relative relationships over the absolute positions: a
C major chord sounds more similar to a D major chord than to a C minor chord, even though the C minor chord is closer in terms of absolute note positions.
There is one kind of neural network in wide use today that has this invariance property along certain directions: the convolutional neural network used for image recognition. It works by learning a convolution kernel and then applying that same kernel across every pixel of the input image.
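This translation invariance can be seen in a small sketch: convolving a shifted image with the same kernel yields a correspondingly shifted output. A minimal numpy/scipy example, separate from our model:

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))
    kernel = rng.random((3, 3))

    # The same learned kernel is applied at every pixel position,
    # so shifting the input simply shifts the output.
    out = convolve2d(image, kernel, mode='same')
    out_shifted = convolve2d(np.roll(image, 1, axis=1), kernel, mode='same')
    # Interior values match up to the shift (boundaries differ):
    print(np.allclose(out[:, 1:-2], out_shifted[:, 2:-1]))  # True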
3.1. Algorithm
We use MIDI files as training data for the LSTM. In preprocessing, we convert each MIDI file into a matrix to feed to the network (a sketch of reading these events follows the feature list below).
In MIDI, an event is characterized by the following features:
• Font: Integer in [0,127]
• Note: Integer in [0,127]
• Velocity: Integer in [0,127]
• Duration: Integer in (0,infinity)
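As an illustration of the preprocessing step, these event features can be read from a MIDI file. The following is a minimal sketch assuming the mido library and a placeholder file name; durations would additionally require matching each note-on with its note-off:

    import mido

    mid = mido.MidiFile('input.mid')       # placeholder path
    events = []
    for track in mid.tracks:
        program = 0                        # the "font" (instrument program) for this track
        for msg in track:
            if msg.type == 'program_change':
                program = msg.program      # integer in [0, 127]
            elif msg.type == 'note_on' and msg.velocity > 0:
                # msg.time is the delta time in ticks since the previous event
                events.append((msg.time, program, msg.note, msg.velocity))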
3.2. Implementation
Every input vector x is 66-dimensional and structured as follows:
– x[0]: relative event time since the previous input (in ticks)
– x[1]: BPM
– x[2-5]: Channel 1: font, note, velocity, duration
– x[6-9]: Channel 2
– …
– x[62-65]: Channel 16
We implemented the model using the Python library Theano. Theano makes compiling the network easy: it automatically calculates gradients and compiles the network to GPU-optimized code for you.
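As a toy illustration of what Theano provides (not our actual model), a single gradient-descent step can be defined symbolically and compiled:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')                       # symbolic 66-dimensional input
    y = T.dscalar('y')                       # symbolic target
    w = theano.shared(np.zeros(66), 'w')     # shared weights (GPU-resident if available)
    cost = (T.dot(x, w) - y) ** 2
    grad = T.grad(cost, w)                   # gradient computed automatically
    step = theano.function([x, y], cost,
                           updates=[(w, w - 0.01 * grad)])  # compiled update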
We use a value of -1 in all features to indicate a silenced channel. The 16 per-channel font features are fixed for the full duration of a single track, as is the BPM.
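Putting the layout above together, one time step can be encoded as follows. This is a sketch; encode_step and the channel_events structure are our own illustrative names:

    import numpy as np

    N_CHANNELS = 16

    def encode_step(tick, bpm, channel_events):
        """channel_events maps channel index (0-15) to (font, note, velocity, duration)."""
        x = np.full(2 + 4 * N_CHANNELS, -1.0)   # -1 in all features marks a silenced channel
        x[0] = tick                             # relative event time from the previous input
        x[1] = bpm                              # fixed for the full duration of a track
        for ch, feats in channel_events.items():
            x[2 + 4 * ch : 6 + 4 * ch] = feats  # font, note, velocity, duration
        return x

    # Example: only channel 1 (index 0) plays.
    v = encode_step(tick=24, bpm=120, channel_events={0: (0, 60, 90, 48)})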
The solution we came across is the “Biaxial RNN.” The idea is that we have two axes (and one pseudo-axis): the time axis and the note axis (and the direction-of-computation pseudo-axis). Each recurrent layer transforms inputs to outputs and also sends recurrent connections along one of these axes. But there is no reason why they all have to send their connections along the same axis!
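In shape terms, the idea looks like the following sketch, with toy sizes and a plain tanh cell standing in for the real LSTM layers:

    import numpy as np

    T_steps, N_notes, F, H = 16, 88, 80, 64      # assumed toy sizes
    rng = np.random.default_rng(0)
    W_time = rng.normal(0, 0.1, (H, F + H))      # stand-in for a time-axis LSTM cell
    W_note = rng.normal(0, 0.1, (H, H + H))      # stand-in for a note-axis LSTM cell

    def cell(W, x, h):
        return np.tanh(W @ np.concatenate([x, h]))

    x = rng.normal(0, 1, (T_steps, N_notes, F))

    # Time-axis layers: recurrent connections run along time, independently per note.
    h_time = np.zeros((T_steps + 1, N_notes, H))  # index t+1 holds the state after step t
    for t in range(T_steps):
        for n in range(N_notes):
            h_time[t + 1, n] = cell(W_time, x[t, n], h_time[t, n])

    # Note-axis layers: recurrent connections run along notes, independently per time step.
    h_note = np.zeros((T_steps, N_notes + 1, H))
    for t in range(T_steps):
        for n in range(N_notes):
            h_note[t, n + 1] = cell(W_note, h_time[t + 1, n], h_note[t, n])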
3.3. Input and Output Details
Our network is based on this design idea, though the actual implementation is more intricate. First, there is the input to the first time-axis layer at each time step (the number in brackets is the number of elements in the input vector that correspond to each part; a sketch assembling the full vector follows the list):
• Position [1]: the MIDI note value of the current note. Used to get a rough idea of how high or low a given note is, to allow for differences in behavior (e.g., lower notes are typically chords, upper notes are typically melody).
• Pitch-class [12]: 1 at the position of the current note's pitch class, starting at A for 0 and increasing by 1 for each half-step, and 0 for all the others. Used to allow selection of more common chords (e.g., it is more common to have a C major chord than an E-flat major chord).
• Previous Vicinity [50]: gives context for the surrounding notes in the last time step, one octave in each direction. The value at index 2(i+12) is 1 if the note at offset i from the current note was played last time step, and 0 if it was not. The value at index 2(i+12)+1 is 1 if that note was articulated last time step, and 0 if it was not. (So if you play a note and hold it, the first time step has a 1 in both positions and the second has a 1 only in the first; if you repeat the note, the second position is 1 both times.)
• Previous Context [12]: the value at index i is the number of times any note with pitch class (current note + i) mod 12 was played last time step. Thus if the current note is C and there were two E's last time step, the value at index 4 (since E is 4 half-steps above C) would be 2.
• Beat [4]: essentially a binary representation of the position within the measure, assuming 4/4 time. With each row being one of the beat inputs and each column being a time step, it simply repeats the following pattern:
0101010101010101
0011001100110011
0000111100001111
0000000011111111
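As promised above, the following sketch assembles these parts (1 + 12 + 50 + 12 + 4 = 79 values) for a single note; all names are our own illustrative choices:

    import numpy as np

    def pc(n):
        # Pitch class with A = 0 (MIDI note 21 is A0), as described above.
        return (n - 9) % 12

    def note_input(midi_note, prev_played, prev_articulated, prev_class_counts, t):
        """prev_played / prev_articulated map offsets i in [-12, 12] to 0/1;
        prev_class_counts holds length-12 counts of the pitch classes
        played last time step."""
        position = [midi_note]                                # Position (1)
        pitchclass = [0] * 12                                 # Pitch-class (12)
        pitchclass[pc(midi_note)] = 1
        vicinity = []                                         # Previous Vicinity (50)
        for i in range(-12, 13):
            vicinity += [prev_played.get(i, 0), prev_articulated.get(i, 0)]
        context = [prev_class_counts[pc(midi_note + i)]       # Previous Context (12)
                   for i in range(12)]
        beat = [(t >> k) & 1 for k in range(4)]               # Beat (4): pattern above
        return np.array(position + pitchclass + vicinity + context + beat, float)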
4. RESULTS & EXPERIMENTS
We used piano MIDI files and preprocessed them into feature vectors with our desired notes and musical structure. We can make training faster by exploiting the fact that we already know exactly which output we will pick at each time step. Essentially, we can first batch all of the notes together and train the time-axis layers, and then reorder the output to batch all of the time steps together and train all the note-axis layers. This lets us use the GPU more effectively, since it is good at multiplying huge matrices.
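In array terms, the trick is just a transpose/reshape between the two passes. A sketch, with identity stand-ins for the recurrent layers:

    import numpy as np

    def run_time_axis(seq):   # identity stand-in for the real time-axis LSTM layers
        return seq

    def run_note_axis(seq):   # identity stand-in for the real note-axis LSTM layers
        return seq

    B, T_steps, N_notes, F = 8, 128, 88, 80
    x = np.zeros((B, T_steps, N_notes, F))

    # Fold the note axis into the batch axis: every note's time sequence
    # becomes one row, so the time-axis layers run as one big matrix product.
    time_in = x.transpose(0, 2, 1, 3).reshape(B * N_notes, T_steps, F)
    h = run_time_axis(time_in)

    # Reorder so the time axis is folded into the batch axis instead,
    # and run the note-axis layers the same way.
    H = h.shape[-1]
    note_in = (h.reshape(B, N_notes, T_steps, H)
                .transpose(0, 2, 1, 3)
                .reshape(B * T_steps, N_notes, H))
    out = run_note_axis(note_in)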
To keep our model from overfitting (which would mean learning specific parts of specific pieces rather than general patterns and features), we use a technique called dropout. Applying dropout means randomly removing half of the hidden nodes from each layer during each training step. This prevents the nodes from developing fragile dependencies on each other and instead promotes specialization. (We implement this by multiplying a mask with the outputs of each layer; dropped nodes are "removed" by zeroing their output in the given time step.)
During composition, we unfortunately cannot batch as effectively. At each time step, we have to first run the time-axis layers by one tick, and then run an entire recurrent sequence of the note-axis layers to determine the input to the time-axis layers at the next tick. This makes generation slower. In addition, we have to add a correction factor to account for the dropout used during training. Practically, this means multiplying the output of each node by 0.5, which keeps the network from becoming overexcited due to the higher number of active nodes.
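The dropout mask and the generation-time correction together look like the following sketch:

    import numpy as np

    rng = np.random.default_rng(0)

    def apply_dropout(h, training):
        if training:
            # Randomly zero half of the hidden nodes at each training step.
            return h * (rng.random(h.shape) < 0.5)
        # During composition, halve every output instead, compensating
        # for having twice as many active nodes as during training.
        return h * 0.5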
5. FUTURE WORK
Chord notes are currently represented no differently from melody notes, and in future experiments we intend to blend the two in a more realistic manner.
In the future, we would like to improve the GAN model so it produces more musical and more syntactically correct output. We would also like to extend the project by trying different music notation formats, such as MusicXML, to build a more diverse dataset.
This would help us to ultimately create any kind of music dynamically.
6. CONCLUSION
Our music composition model based on LSTMs and GANs was able to learn the overall architecture of a musical form and to use that information to compose new piano tunes. The network was trained on MIDI files that were first preprocessed and converted to vectors; instead of using all 12-13 features, we used only 4. These experiments are preliminary and much more work is warranted. However, by demonstrating that an RNN can capture both the local structure of melody and the long-term structure of a musical style, they represent an advance in neural network music composition.
7. ACKNOWLEDGMENTS
We would like to thank our parents, our teachers, the National University of Computer and Emerging Sciences, and all the authors whose online research material we drew upon. Last but not least, thanks to Allah Almighty.