Fürst Steps

I started this whole project by trying out an existing implementation of the Music Transformer. Transformers are built around attention, although the full architecture is a lot more complicated. Here’s a good guide.

It was my first time working in depth with somebody else’s code, and it took some debugging to get it to work at all. However, it was fascinating to see the things I had researched actually implemented. I learned a lot about the inner workings of the Transformer architecture just by drawing a flowchart with matrix sizes based on the code. It also helped me realize what ‘query’, ‘key’, and ‘value’ vectors really meant.
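To make the matrix sizes concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not the code from the implementation I used: the toy inputs are made up, and it leaves out the causal mask and the relative position terms that the Music Transformer adds on top.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head.

    q, k: (seq_len x d_k); v: (seq_len x d_v).
    Returns (output, weights) with shapes (seq_len x d_v) and (seq_len x seq_len).
    """
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]                 # (d_k x seq_len)
    scores = matmul(q, k_t)                              # (seq_len x seq_len)
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]           # each row sums to 1
    return matmul(weights, v), weights

# Toy input: 3 positions, d_k = d_v = 2 (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(q, k, v)
```

Each row of `weights` says how much one position (its query) draws from every position (their keys), and `out` is the matching weighted mix of the value vectors.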

Due to computational limitations on my PC, I was only able to train at a quarter of the sequence length suggested by the paper (512 instead of 2048).
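A rough back-of-the-envelope sketch of why sequence length matters so much: standard self-attention builds a seq_len x seq_len score matrix per head, so that part of the memory grows with the square of the length. The head count below is an assumed placeholder, not a value taken from the implementation, and the count ignores everything else the model stores.

```python
def attention_score_entries(seq_len, num_heads=8):
    """Entries in the attention score matrices of one layer.

    num_heads=8 is an assumed placeholder, not the model's real setting.
    """
    return num_heads * seq_len * seq_len

short = attention_score_entries(512)
full = attention_score_entries(2048)
ratio = full // short  # 16: a 4x longer sequence needs 16x the score entries
```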

aaaaaaa (out of memory)

The model still reproduced the ~40% train accuracy on the Yamaha eComp dataset. As suggested in another paper, I tried training the model on a dataset of Beethoven pieces instead. This reached about 80% train accuracy and 75% validation accuracy. For reference, train accuracy is measured on the data the model trains on, while validation accuracy is measured on held-out data from the same dataset that the model never trains on.
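As a small sketch of what such a number could mean, assuming accuracy is the fraction of next-event predictions that match the target sequence (the function name and the padding handling are hypothetical, not taken from the implementation):

```python
def token_accuracy(predicted, target, pad_token=None):
    """Fraction of positions where the predicted token equals the target.

    Positions whose target equals pad_token are skipped (hypothetical convention).
    """
    pairs = [(p, t) for p, t in zip(predicted, target) if t != pad_token]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)

# Illustrative tokens only: three of the four predictions match.
example_acc = token_accuracy([60, 62, 64, 65], [60, 62, 64, 67])  # 0.75
```

The same function gives train accuracy when run on training sequences and validation accuracy when run on held-out ones.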

A sample generated without any notes as a prompt.

After that, I decided that I wanted to try emulating the style of a specific piece, namely Für Elise. However, training it on one piece only definitely wasn’t the move. Although I augmented the data to introduce variation, the final result wasn’t what I expected at all. The model just forgot the starting sequence and instantly went to playing the piece it was trained on:

Music from the Touhou games is used as a prompt.

Hence, I’d created the Für Elise Machine. It achieved 95%+ train accuracy, and knew nothing else.

Another sample.

Some key takeaways here:

  1. This model used millisecond timings for music, allowing it to generate performances effectively, but if you wanted to generate scores, you’d have to quantize the notes to beats.
  2. Augmentation of data is important in making the model more flexible, especially if the dataset is small. I ended up transposing pieces and making them slower and faster to create more data.
  3. Training on a single song/data point = bad.
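The first two takeaways can be sketched on a simple note list. This is a minimal illustration rather than the code I ran: the `Note` type, the shift and tempo ranges, and the grid size are all assumptions picked for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Note:
    pitch: int          # MIDI pitch, 0-127
    start_ms: float     # onset time in milliseconds
    duration_ms: float  # length in milliseconds

def transpose(notes, semitones):
    """Shift every pitch; notes pushed outside the MIDI range are dropped."""
    shifted = [replace(n, pitch=n.pitch + semitones) for n in notes]
    return [n for n in shifted if 0 <= n.pitch <= 127]

def stretch(notes, factor):
    """Scale all timings: factor > 1 plays slower, factor < 1 plays faster."""
    return [replace(n, start_ms=n.start_ms * factor,
                    duration_ms=n.duration_ms * factor) for n in notes]

def quantize(notes, grid_ms):
    """Snap each onset to the nearest multiple of grid_ms (for score-like output)."""
    return [replace(n, start_ms=round(n.start_ms / grid_ms) * grid_ms) for n in notes]

def augment(notes, semitone_shifts=(-2, -1, 0, 1, 2), tempo_factors=(0.95, 1.0, 1.05)):
    """One variant per (shift, tempo) pair; the default ranges are assumed."""
    return [stretch(transpose(notes, s), f)
            for s in semitone_shifts for f in tempo_factors]

# The first three notes of Für Elise (E5, D#5, E5) with made-up timings.
motif = [Note(76, 0.0, 250.0), Note(75, 250.0, 250.0), Note(76, 500.0, 250.0)]
variants = augment(motif)  # 5 shifts x 3 tempos = 15 versions of the motif
```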

In the future, I think it might be better to train on the large dataset first and then fine-tune on individual pieces, or to add a per-piece identifier to the data, rather than training on a single piece alone.
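One way the identifier idea could look, as an untested sketch: reserve extra token ids after the normal event vocabulary and prepend one to each sequence. The function name and the vocabulary size here are hypothetical.

```python
def add_piece_token(tokens, piece_index, vocab_size):
    """Prepend a piece-identifier token.

    Identifier ids start at vocab_size so they never collide with event tokens.
    """
    if piece_index < 0:
        raise ValueError("piece_index must be non-negative")
    return [vocab_size + piece_index] + list(tokens)

EVENT_VOCAB_SIZE = 400  # placeholder, not the real vocabulary size
tagged = add_piece_token([12, 57, 301], piece_index=3, vocab_size=EVENT_VOCAB_SIZE)
# tagged == [403, 12, 57, 301]
```

The embedding table would need to grow by the number of pieces, and at generation time the chosen identifier would go in front of the prompt.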
