Generating Musical Synthesizer Patches with Machine Learning
Music producers spend a lot of time designing sounds with an electronic instrument called the synthesizer. To do this, they select "patches": configurations of the instrument's parameters that determine the sound it generates. Generative machine learning algorithms have recently made it theoretically possible to generate synthesizer patches in the style of particular genres or artists automatically. Here I explain the theory and provide an open-source prototype implementation.
Note: the source code for this project is available here, feel free to try it yourself!
What is a synthesizer?
A synthesizer is an electronic musical instrument that generates audio signals. Synthesizers generate audio through methods including subtractive synthesis, additive synthesis, and frequency modulation synthesis. These sounds may be shaped and modulated by components such as filters, envelopes, and low-frequency oscillators. Synthesizers are typically played with keyboards or controlled by sequencers, software, or other instruments, often via MIDI. [Wikipedia]
Synthesizers are the backbone of modern music production; they give producers the power to generate a massive variety of sounds using a single device. While synthesizers were originally analog devices, it is more common today to use digital simulations due to their convenience and increased flexibility.
Synthesizers + Machine Learning = 🤘
From a “black-box” perspective, a synthesizer generates a sound waveform as a function of a set of configuration parameters and the desired musical pitches to be played. There are fairly large libraries of digital synthesizer patches available to music producers, organized by the style and tonal characteristics of each sound. These patch libraries can be used as training data for a generative model. A well-performing model would be a huge win for music producers, as it would give them access to an infinite library of expert-quality sounds.
Making it happen
I built a crude version of such a tool, though it's not yet polished enough to build into a product.
1. Reading the training data (and writing generated presets)
First, we need to read in the synth patches. There are a number of popular digital synthesizers, but unfortunately they all use a proprietary patch (AKA preset) format. Fortunately, Ableton Analog uses a simple XML schema to encode patches as presets:
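To make this concrete, a minimal sketch of reading one of these presets in Python might look like the following (it assumes the preset is compressed XML, as described below; the "Value" attribute is an illustrative guess rather than a documented schema detail):

```python
# Rough sketch: decompress an Analog preset and list its parameter values.
# Treats the container as gzip and assumes values live in "Value" attributes;
# both are assumptions to adjust against the real files.
import gzip
import xml.etree.ElementTree as ET

def load_preset(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return ET.parse(f)

def dump_parameters(tree):
    for elem in tree.iter():
        if "Value" in elem.attrib:
            print(elem.tag, elem.attrib["Value"])
```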
Simply unzip an Analog preset and you get this XML. The next task is to convert these documents into mathematical vectors suitable for use as training data. This is also harder than you might expect. For example, all values need to be normalized to the same 0-to-1 range for the network to make sense of them. As an example, a simplified set of parameters and values might look like this:
Volume: 0.123
Filter Frequency: 0.457
Oscillator Shape: 1 (indicates “Square Wave”)
The desired vector would unroll the categorical value “Oscillator Shape”, yielding the following vector format:
[volume, filter frequency, osc shape category sawtooth, osc shape category square, osc shape category sine]
A vector in this format would look like this:
[0.123, 0.457, 0, 1, 0]
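A tiny sketch of that encoding step, using the simplified example above (the full preset vector unrolls to many more dimensions):

```python
# Vectorize the simplified example: normalized continuous parameters followed by
# a one-hot encoding of the categorical oscillator shape.
OSC_SHAPES = ["Sawtooth", "Square", "Sine"]

def encode(params):
    vector = [params["Volume"], params["Filter Frequency"]]
    vector += [1.0 if shape == params["Oscillator Shape"] else 0.0
               for shape in OSC_SHAPES]
    return vector

print(encode({"Volume": 0.123,
              "Filter Frequency": 0.457,
              "Oscillator Shape": "Square"}))
# -> [0.123, 0.457, 0.0, 1.0, 0.0]
```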
2. Building the model
I tried generating presets using two different models: a variational autoencoder (VAE) and a generative adversarial network (GAN). The GAN had better performance, as it is a more sophisticated model.
A generative adversarial network consists of two separate machine learning models. First, a “generator” attempts to produce plausible presets. Next, a separate “discriminator” attempts to determine whether a given preset was plausibly generated by a human. These “adversarial” networks duel each other in the training process. The discriminator is shown both real synthesizer patches from the Ableton Analog preset library and fake ones produced by the generator. The discriminator learns based on its success at guessing real from fake and the generator learns based on its success fooling the discriminator.
The Ableton preset library buckets sounds into various categories. I used the category of the desired sound as an additional input to both models. This allows the generator to learn to make sounds in the style of these categories. To keep the categories relatively balanced, I consolidated the smaller buckets into a single category, since "Lead", "Pad", and "Bass" had the most presets in the data set.
Chosen Sound Categories:
- Lead
- Pad
- Bass
- Strings_Brass_Keys_Guitar_Other
I tried various architectures; one of the best performing was the Wasserstein GAN (WGAN). The one I implemented was based on this tutorial. Diagrams of the model in an example configuration are provided below.
As shown above, the generator takes the sound category as a one-hot vector input of length 4 and runs it through a feedforward neural network to attempt to make sense of what the different categories mean. It then takes a vector of 100 random numbers as a "seed" from which to build a unique preset and concatenates it with the category network's output. The result is fed through another feedforward neural network to generate the preset, and the output is a vector of 208 numbers between 0 and 1 that map to parameters in Ableton Analog.
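For illustration, a generator of this shape can be sketched in Keras roughly as follows (hidden-layer sizes and activations are placeholders, not necessarily the configuration actually used):

```python
# A rough Keras sketch of the generator described above; hidden-layer sizes and
# activations are placeholders rather than the exact configuration used.
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

NUM_CATEGORIES = 4   # Lead, Pad, Bass, Strings_Brass_Keys_Guitar_Other
SEED_DIM = 100       # length of the random "seed" vector
PRESET_DIM = 208     # parameters in an Ableton Analog preset vector

def build_generator():
    # One-hot sound category, run through a small feedforward network
    category = Input(shape=(NUM_CATEGORIES,), name="category")
    c = Dense(16, activation="relu")(category)

    # Random seed concatenated with the category network's output
    seed = Input(shape=(SEED_DIM,), name="seed")
    x = Concatenate()([seed, c])

    # Feedforward network producing 208 parameters squashed into [0, 1]
    x = Dense(256, activation="relu")(x)
    preset = Dense(PRESET_DIM, activation="sigmoid", name="preset")(x)
    return Model(inputs=[category, seed], outputs=preset, name="generator")
```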
The critic uses roughly the same architecture, but instead of taking a seed as the input prior to concatenation, it takes a vector of length 208: the preset to be evaluated. Also of note, it adds a small amount of Gaussian noise to the preset parameters. This helps prevent the critic from keying on irrelevant differences between the real and generated samples, such as the precision of the floating-point numbers output by the generator versus those in the real presets. The output is the critic's confidence that the sample is real, on a scale from 0 to 1.
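A corresponding sketch of the critic, under the same assumptions. The GaussianNoise layer plays the role of the noise injection described above; the linear output follows the usual WGAN convention of an unbounded score (swap in a sigmoid for a literal 0-to-1 confidence):

```python
# A matching Keras sketch of the critic described above.
from tensorflow.keras.layers import Concatenate, Dense, GaussianNoise, Input
from tensorflow.keras.models import Model

NUM_CATEGORIES, PRESET_DIM = 4, 208  # as in the generator sketch

def build_critic():
    # Category path mirrors the generator
    category = Input(shape=(NUM_CATEGORIES,), name="category")
    c = Dense(16, activation="relu")(category)

    # The preset to evaluate, lightly perturbed so the critic cannot key on
    # irrelevant numeric differences such as floating-point precision
    preset = Input(shape=(PRESET_DIM,), name="preset")
    p = GaussianNoise(0.01)(preset)

    x = Concatenate()([p, c])
    x = Dense(256, activation="relu")(x)
    # Unbounded score per the WGAN convention (higher = more "real")
    score = Dense(1, activation="linear", name="score")(x)
    return Model(inputs=[category, preset], outputs=score, name="critic")
```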
3. Evaluating results
To demonstrate human performance, two example patches from the training set are provided. One of the patches is simple; the other is more complex. Both are relatively simple compared to the average patch across all synthesizers, since Ableton Analog is a simple synthesizer and patches with additional effects were not included in the training set.
I trained the network in a variety of configurations using GPU machines via vast.ai.
When the network first starts training, the generated parameters all tend to stay very close to their initial values, which results in dead-center settings for every parameter.
In an extreme case, a haywire model may generate patches that don’t produce any sound because the network has set various values to their maximums and minimums and hasn’t learned that a plausible sound doesn’t have its output volume set to zero.
Additionally, you can see the "Semi" and "Detune" settings are set to extreme levels. One of the key concepts the model has to learn is that some parameters only require subtle adjustments, while others call for larger moves in specific increments. For example:
- A human would usually apply only subtle changes to the detune knob, as a sound that is far out of tune is usually undesirable.
- The "Semi" knob would most often be set to zero or to a pleasant interval such as 5 for a perfect fourth or 7 for a perfect fifth; 1 is rarely a good choice.
- High values for vibrato don't usually sound good.
- If a low-pass filter has its frequency set too low, you can't hear anything.
- The LFO rate doesn't do anything unless the LFO is mapped to another parameter.
These are just a few examples of the concepts the model must learn.
A good model must also learn more sophisticated interactions to replicate more abstract concepts present in the training set. For example, modulating the filter with an LFO on a driven, low frequency sawtooth wave would create a dubstep wobble bass. Or, using sawtooth waveforms in unison would create an orchestra-like sound.
Fortunately the GAN also generated many very plausible presets. Notice how the model has learned to safely deviate from dead center for parameters such as Filter Frequency and Resonance, Oscillator Shape, Vibrato, and Amplitude Envelope.
Of note, the GAN learned that filter resonance and vibrato are often set to relatively low values. It learned to play with the oscillator waveform shapes, with each preset using a different shape. It learned that the "error" feature on vibrato is usually set to zero, and seemingly various other relationships, since all of the above patches sound plausible.
As will be discussed later in “Challenges”, I’ve not been able to avoid mode collapse, so the variety between generated presets is low.
Loss curves are used to evaluate the learning performance of a model. This plot shows the training loss for the critic on real samples in blue, fake samples in orange, and when updating the generator with fake samples in green.
The benefit of the WGAN is that its loss correlates with generated patch quality: for a stable training process, lower loss means better-quality patches.
In this case, lower loss specifically refers to lower Wasserstein loss for generated patches as reported by the critic, shown as the orange line. A well-performing WGAN should show this line trending down as the quality of the generated patches improves.
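For reference, the Wasserstein loss used in many Keras WGAN implementations (including tutorials in the style referenced earlier) reduces to a one-liner; the opposite-sign labeling convention noted in the comment is an assumption about this particular implementation:

```python
# Wasserstein loss as commonly implemented in Keras WGANs: real and fake samples
# are given opposite labels (e.g. -1 and +1), so minimizing the mean of
# (label * critic score) pushes the two score distributions apart.
import tensorflow.keras.backend as K

def wasserstein_loss(y_true, y_pred):
    return K.mean(y_true * y_pred)
```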
In actual training, a steady decline in the crit_fake loss is observed, which should correspond to higher patch quality. However, the generated patches do not improve accordingly. This indicates that the training process is not stable and leads me to believe there is either an error in the model implementation or that the critic is optimizing for irrelevant differences between the real and fake samples.
Challenges
Difficulty removing irrelevant differences between generated and real samples
Machine learning models are dumb. The critic's goal is to determine whether a sample was plausibly generated by a human, and it may accomplish this goal using information other than the tonal and aesthetic characteristics of the sound. For example, if the training set and the generated samples use different precision to represent floating-point numbers, the critic could use this one fact alone to make its decision with perfect accuracy. This gives the generator no opportunity to learn, as there's nothing it can do to fool the critic with respect to this attribute. There are many other problematic cases like this.
Mode collapse of the GAN onto a single good preset
A common problem I experienced is that the GAN would learn to generate a single good preset given any input. No matter the input seed or category, the output would sound roughly the same. This is not supposed to happen with WGANs, which are designed in part to resist mode collapse.
Lack of data
When it comes to machine learning, the more data the better. The Ableton preset library only has a few hundred presets. Much better performance would be achieved with 1000–10,000+ presets. Other synths are more complex but also have much larger preset libraries available.
Applying this to more popular/powerful synths such as Serum
More powerful synths will make training the GAN more challenging as they have many more configuration parameters than Ableton Analog. Additionally, there is the aforementioned issue of software interoperability: we can’t easily read and write presets for the most popular synths. What would it look like to get this working with a synth like Serum?
VST fxp presets follow a predefined structure based on a specification. However, Serum presets use the "Opaque Chunk" format, meaning the data we care about is an opaque sequence of ones and zeros.
Fortunately, it is still possible to make some sense of them. I was able to figure out that the chunk data is compressed with zlib. We can decompress a preset, change a single parameter at a time, and compare the result to the initial patch to reverse engineer the format.
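A rough sketch of that workflow is below. Scanning for a zlib header avoids assuming the exact fxp layout, and the file names in the usage example are hypothetical:

```python
# Sketch of the reverse-engineering workflow described above: locate the
# zlib-compressed chunk inside a Serum .fxp file, decompress it, and diff two
# presets that differ by a single knob change. The zlib-header scan is a
# heuristic; it does not assume knowledge of the exact fxp layout.
import zlib

def extract_chunk(path):
    data = open(path, "rb").read()
    # 0x78 0x9C and 0x78 0xDA are common zlib stream headers
    for marker in (b"\x78\x9c", b"\x78\xda"):
        start = data.find(marker)
        while start != -1:
            try:
                # decompressobj tolerates trailing bytes after the stream
                return zlib.decompressobj().decompress(data[start:])
            except zlib.error:
                start = data.find(marker, start + 1)
    raise ValueError("no zlib stream found in %s" % path)

def diff(a, b):
    # Byte offsets where two decompressed chunks differ
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Hypothetical usage: compare a baseline preset to one with a single knob changed
# changes = diff(extract_chunk("init.fxp"), extract_chunk("init_filter_up.fxp"))
```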
It would also theoretically be possible to build a VST host that loads the synthesizers, manipulates parameters, and writes presets using the VST interface, but writing a custom VST host seemed like a lot of work, so I figured it would be easier to start with the XML.
A message for makers of software synthesizers: interoperability would be great. If you could allow other software to read and write presets from your synth it could enable some big advances in music production tools with the advent of machine learning.
Future Work
The networks I've built rely entirely on the preset configuration parameters as training data. An additional possibility, which would be much harder to set up, would be to also use the audio waveform generated by each preset as input to the model. I suspect this could significantly improve performance, because the waveform is what a human would use to judge the aesthetic desirability of a sound. However, rendering these waveforms from entire preset libraries in a suitable format would require a lot of scripting work.
Assuming that an effective GAN architecture is possible for this problem, building a useful tool for music producers will be a matter of:
- Fully reverse engineering popular preset formats OR building a special VST host to generate input data
- Collecting a massive dataset of annotated synth patches
- Building the model into a desktop program, VST, or online service.
Source Code
I’ve decided to open source this project under GPL in case anyone else wants to give it a shot. I would love to hear ideas on how to improve the project.
Support Me
Like this post? Give me a follow on LinkedIn or Twitter for more like it:
https://www.linkedin.com/in/jakespracher
https://twitter.com/jakespracher