Evolution Strategies: Populations, Genomes, and… a Nature God?

Paper Summary: “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”

Dickson Wu
11 min read · Sep 9, 2021

Abstract:

For a long time, Markov Decision Process (MDP) based techniques have dominated RL. But this paper explores an alternative → Evolution Strategies (ES). They experiment on MuJoCo and Atari, and ES is competitive with RL!

ES scales very well (to over 1,000 parallel workers), is invariant to action frequency and delayed rewards, tolerates extremely long horizons, and needs no value discounting or value function approximation. ES was able to get a 3D humanoid walking in just 10 minutes!

Introduction:

One key goal for ML is performing complex tasks in uncertain environments. One approach is RL, which has traditionally been dominated by MDP concepts. But there's an alternative: black-box optimization! One family of black-box optimization methods is ES.

This paper takes ES (an old idea) and brings it back into the light! They’re able to show that ES is competitive with traditional RL techniques. The key findings of this paper are:

  1. They use Virtual Batch Normalization + other reparameterizations of the policy network to improve the reliability of ES (without them, ES can be brittle).
  2. Parallelize ES like crazy through a new communication strategy between workers to achieve linear speedups.
  3. The data efficiency of ES is pretty good (compared to A3C it only needs 3x to 10x the data). And we reduce computation by roughly 3x because we don't need backprop or value functions.
  4. ES explores the search space better than policy gradient methods.
  5. ES is quite robust to hyperparameter settings.

Black box optimizers also have quite attractive properties:

  • Indifference to the distribution of rewards (sparse/dense)
  • No need for backprop
  • Tolerance to arbitrarily long time horizons

But black-box methods have traditionally been less effective at solving hard RL tasks. This paper hopes to inspire more work in this field (because it's so cool!) now that ES is shown to be competitive with RL!

Evolution Strategies:

ES is a subset of Black Box Optimization that was inspired by (you guessed it) evolution. And it works like this:

  1. We have a generation of organisms (an iteration)
  2. Each organism has its own genome (parameters)
  3. And these genomes help the organisms be fit. The top organisms (ones with the highest fitness) get to reproduce and create the next generation
  4. But this next generation will have different genomes than the previous generation because their genes have been mutated (parameters being perturbed (by noise))
  5. We repeat this until we have organisms that perform well in their environment

Different algorithms handle reproduction, mutation, and population selection differently. But this work focuses on Natural Evolution Strategies (NES), which works like this:

Imagine we have a box. This box is called F. We put a ton of numbers (θ) into F, and we get out a single number (the loss/reward).

Now think of these numbers (θ) like a genome. The genome goes in and the black box (the simulator) dictates how fit the genome is. But a species doesn't only have 1 genome; it has an entire population of them. So we have a nature god (a distribution p) over here that creates a population:

Nature God

(We’ll get back to ψ later)

This nature god doesn’t just muck around and create random organisms — nope! It has a goal: to create the best organisms possible. AKA maximize the average value that comes out of the black box.
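In symbols (this is roughly how the paper writes it), the goal is to maximize the expected fitness over the population distribution:

```latex
\max_{\psi} \; \mathbb{E}_{\theta \sim p_{\psi}}\left[ F(\theta) \right]
```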

Alright, so now the nature god has a goal, and it works towards it. Think of ψ as the skill of the nature god (the parameters of the distribution p). We update the skill value by:

How the nature god learns
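In symbols, that update is plain gradient ascent on ψ (α is a learning rate):

```latex
\psi \leftarrow \psi + \alpha \, \nabla_{\psi} \, \mathbb{E}_{\theta \sim p_{\psi}}\left[ F(\theta) \right]
```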

And this is where the analogy breaks down. We can estimate this gradient (how we upgrade the skill value) by adapting the same trick that REINFORCE uses in RL:
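Using the score-function (log-likelihood) trick, the one behind REINFORCE, the gradient becomes an expectation we can estimate by sampling genomes:

```latex
\nabla_{\psi} \, \mathbb{E}_{\theta \sim p_{\psi}}\left[ F(\theta) \right]
  = \mathbb{E}_{\theta \sim p_{\psi}}\left[ F(\theta) \, \nabla_{\psi} \log p_{\psi}(\theta) \right]
```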

This work deals with RL. So the black box (F) is just the environment: run a policy for an episode and return the total reward. The genomes (θ) are the parameters of the policy network (the organisms!).

Most RL problems aren't differentiable (you can't just take gradients of the reward with respect to the weights). ES gets around this with the nature god, which in this case is an isotropic multivariate Gaussian (let's just say the German mathematician was a powerful guy).

Bow before the nature god
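Written out, the nature god is just:

```latex
p_{\psi}(\theta) = \mathcal{N}\!\left(\theta \;\middle|\; \psi, \, \sigma^{2} I\right)
```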

The mean of the Gaussian is ψ, and the covariance is fixed at σ²I. This lets us write the whole expectation directly in terms of the mean parameters θ instead of ψ:
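Sampling from that Gaussian is the same as taking the mean and adding scaled standard-normal noise, so (reusing θ for the mean, as the paper does):

```latex
\mathbb{E}_{\theta \sim p_{\psi}}\left[ F(\theta) \right]
  = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ F(\theta + \sigma \epsilon) \right]
```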

Where ϵ is drawn from a standard Gaussian. And we can see the parameters being injected with noise (scaled by σ).

Putting everything together, this is how we estimate the gradient with respect to θ:
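Combining the score-function trick with the Gaussian gives the ES gradient estimator (this matches the paper up to notation):

```latex
\nabla_{\theta} \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ F(\theta + \sigma \epsilon) \right]
  = \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ F(\theta + \sigma \epsilon) \, \epsilon \right]
```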

We can describe this in plain pseudocode:
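The pseudocode figure is missing from this post, so here's a minimal Python sketch of the same loop; the evaluate function (standing in for the black box F) and the hyperparameter values are placeholders, not the paper's exact settings:

```python
import numpy as np

def evolution_strategies(evaluate, theta, sigma=0.1, alpha=0.01, n=50, iterations=1000):
    """Basic ES loop: perturb theta, evaluate each perturbation, step toward the good ones.

    evaluate(params) plays the role of the black box F: run one episode, return total reward.
    """
    for _ in range(iterations):
        eps = np.random.randn(n, theta.size)                  # one noise vector per "organism"
        returns = np.array([evaluate(theta + sigma * e) for e in eps])
        # Weight each noise direction by the return it earned, then take a gradient-ascent step
        theta = theta + alpha / (n * sigma) * eps.T @ returns
    return theta
```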

Or in plain English:

  1. We start with some parameters θ that are randomly initialized.
  2. Then we make some copies of the original θ and inject them each with lots of noise.
  3. We evaluate them → The perturbations that do well get a bigger weight in the update; the ones that do poorly get a smaller weight. (Fitter organisms dictate the path of the future generation)
  4. Then we repeat steps 2 & 3 over and over again until we have a good θ.

On another note, ES also has a unique advantage — it’s insanely scalable because:

  1. Workers only need to communicate after finishing entire episodes, so communication is infrequent
  2. And when we do communicate, each worker only needs to send a single scalar (the episode's return). The random seeds are agreed on up front, so everyone already knows who's got which seed and can reconstruct the noise. Gradient methods, on the other hand, have to share entire gradient vectors
  3. No need for value functions! Value functions are not cool because they need several updates before they give a useful signal, and whenever the policy changes we have to wait a few iterations for the value function to catch up.

ES only needs random seeds. In pseudocode:
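The parallel pseudocode figure is also missing, so here's a rough Python sketch of one update as seen by a single worker. The broadcast/gather functions stand in for whatever communication layer you'd use (they're placeholders, not from the paper); the key point is that only scalar returns ever cross the network:

```python
import numpy as np

def parallel_es_step(theta, evaluate, worker_seeds, my_id, broadcast, gather,
                     sigma=0.1, alpha=0.01):
    """One ES update from the point of view of worker `my_id`.

    `worker_seeds` are the per-worker seeds agreed on for this iteration.
    """
    # 1. Sample my noise from my own seed and evaluate the perturbed policy
    my_noise = np.random.RandomState(worker_seeds[my_id]).randn(theta.size)
    my_return = evaluate(theta + sigma * my_noise)

    # 2. Share only the scalar return; gather everyone else's scalars
    broadcast(my_return)
    all_returns = gather()                        # one float per worker, ordered by worker id

    # 3. Reconstruct every worker's noise locally from the known seeds
    update = np.zeros_like(theta)
    for seed, ret in zip(worker_seeds, all_returns):
        noise = np.random.RandomState(seed).randn(theta.size)
        update += ret * noise

    # 4. Every worker applies the identical update, so parameters stay in sync
    return theta + alpha / (len(worker_seeds) * sigma) * update
```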

And in plain English:

  1. Each worker samples some noise, injects it into the parameters, and runs the simulation
  2. Each worker sends the result of its simulation (a single scalar return) to every other worker
  3. What does each worker know? Well, they have a list of the results from each simulation. And they know the random seed associated with each simulation. Therefore…
  4. They can reconstruct the injected noise using the random seeds
  5. Finally, every worker applies the same update to the parameters using the (injected noise, result) pairs
  6. Repeat the loop of 1–5 again and again.

In addition, we’re going to do a lot of other cool stuff to help the model:

  • In practice, we generate a big block of Gaussian noise at the start, and the random seed just indexes into it → it doesn't cause any problems, and the upside is that reconstructing the injected noise is super fast.
  • For bigger networks / more workers, we could just have them only perturb a subset of the parameters.
  • We also use mirrored sampling, where we evaluate pairs of perturbations (-ϵ, ϵ).
  • We also do fitness shaping through a rank transformation, to avoid being dominated by outliers and to avoid falling into local optima (see the sketch after this list).
  • Apply weight decay so the parameters aren’t crazy big compared to the perturbations.
  • Not adapting σ during training because it provided no value.
  • Cap each episode at a constant number of steps, and adjust this cap as we go along (one way is to set it to twice the mean episode length, which guarantees at least 50% CPU utilization in the worst case)
  • Virtual batch norm (regular batch norm, except the normalization statistics come from a fixed reference batch chosen at the start). It costs almost nothing (it's only used for the normalizing statistics), but it prevents a massive problem: sometimes when we perturb the network, all the perturbed policies end up taking the same action. Virtual batch norm makes the policy more sensitive to small perturbations.
  • Sometimes we discretize the actions in environments with continuous actions. This makes the policy non-smooth → causing a wider variety of behaviours (more exploration).
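To make the mirrored sampling and rank-based fitness shaping bullets concrete, here's a small sketch; the exact centered-rank scaling below is an assumption borrowed from common ES implementations, not quoted from the paper:

```python
import numpy as np

def centered_ranks(returns):
    """Replace raw returns with ranks spread evenly over [-0.5, 0.5], so outliers can't dominate."""
    ranks = np.empty(len(returns), dtype=float)
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

def mirrored_perturbations(theta, sigma, n_pairs):
    """Mirrored (antithetic) sampling: every +eps is paired with a -eps to reduce variance."""
    eps = np.random.randn(n_pairs, theta.size)
    noise = np.concatenate([eps, -eps])
    candidates = theta + sigma * noise
    return candidates, noise
```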

Smoothing in Parameter Space versus Smoothing in Action Space:

Otherwise known as: where do we inject the noise? Gradient methods do it in the action space, while ES does it in the parameter space.

Let's look at it from RL's perspective: the action space can be discrete and the policy can be deterministic, which means the objective is a non-smooth function of the parameters. And we can't do backpropagation the traditional way, since the function isn't differentiable.

Now suppose we have a reward function R that takes in some actions a. These actions are determined by a policy network, which is parameterized by θ. So we get:
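In symbols (with a being the whole sequence of actions in an episode):

```latex
F(\theta) = R\big(a(\theta)\big)
```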

Now we can try to make it smoother by injecting noise into the action space! Imagine the outputs of the network: just before we softmax them, we inject some noise to mess them up a bit (introduce an element of randomness). With the actions perturbed this way, the new black-box function looks like:
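Following the paper's notation, with ϵ being the injected action noise:

```latex
F_{PG}(\theta) = \mathbb{E}_{\epsilon}\left[ R\big(a(\epsilon, \theta)\big) \right]
```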

And the gradient looks like:
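Which, via the same score-function trick as before (p(a; θ) being the distribution over action sequences induced by the noisy policy):

```latex
\nabla_{\theta} F_{PG}(\theta)
  = \mathbb{E}_{\epsilon}\left[ R\big(a(\epsilon, \theta)\big) \, \nabla_{\theta} \log p\big(a(\epsilon, \theta); \theta\big) \right]
```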

As we saw in the last section, ES takes a different approach: it injects noise directly into the parameter space. We get new θ's by injecting noise into the old θ's:
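The ES counterparts, with ξ being the parameter noise:

```latex
F_{ES}(\theta) = \mathbb{E}_{\xi}\left[ R\big(a(\theta + \xi)\big) \right],
\qquad
\nabla_{\theta} F_{ES}(\theta)
  = \mathbb{E}_{\xi}\left[ R\big(a(\theta + \xi)\big) \, \nabla_{\theta} \log p\big(\theta + \xi; \theta\big) \right]
```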

When is ES better than Policy Gradients?

That depends on the problem + the type of gradient estimator. For hard RL problems, where the correlation between the return and individual actions is low, the variance for both gradient estimators will look like:
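The missing figure here showed the two variance expressions; roughly as in the paper, under that low-correlation assumption they factor as (with θ̃ = θ + ξ being the perturbed parameters):

```latex
\operatorname{Var}\!\left[ \nabla_{\theta} F_{PG}(\theta) \right]
  \approx \operatorname{Var}\!\left[ R(a) \right] \operatorname{Var}\!\left[ \nabla_{\theta} \log p(a; \theta) \right],
\qquad
\operatorname{Var}\!\left[ \nabla_{\theta} F_{ES}(\theta) \right]
  \approx \operatorname{Var}\!\left[ R(a) \right] \operatorname{Var}\!\left[ \nabla_{\theta} \log p(\tilde{\theta}; \theta) \right]
```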

For policy gradients, the variance of the gradient estimator grows roughly linearly with the length of the episode (we inject noise at every action, so the log-probability term is a sum over all timesteps). The variance of the ES estimator is independent of episode length, which is an attractive property.

To reduce the variance in policy gradients, we can discount rewards (which works if the effects of actions are short-lasting), or we can use value function approximations. But both start breaking down when the effects of actions are long-lasting.

Thus ES is advantageous when actions have long-lasting effects, when episodes have extremely many timesteps, and when no good value function estimates are available.

Problem Dimensionality:

ES is like estimating the gradient among all the dimensions!

Since ES resembles estimating the gradient with finite differences, intuition says it should scale poorly with the number of parameters (because the search space grows). In practice, this means larger networks may take longer to optimize, but it does not mean they do worse.
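The finite-difference connection is easy to see: the noise has zero mean, so subtracting the unperturbed return F(θ) doesn't change the estimator, and the ES gradient is really an average of finite-difference estimates along random directions:

```latex
\frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ F(\theta + \sigma \epsilon)\, \epsilon \right]
  = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ \big(F(\theta + \sigma \epsilon) - F(\theta)\big)\, \epsilon \right]
```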

The performance of ES only depends on the difficulty of the task, not the number of parameters. Example: If we had linear regression, doubling the number of parameters doesn't make the problem any harder.

ES performs better in larger networks compared to smaller ones — just like for gradient-based approaches — since larger networks have fewer local minima.

Advantages of not Calculating Gradients:

  1. Super scalable. We don’t need to communicate entire gradients — only the reward outcome + the random seed
  2. We can also deal with sparse or delayed rewards, since we only communicate the total reward of the episode, while gradient-based methods rely on individual rewards at exact timesteps.
  3. We reduce the computation per episode by ~66% (no backward pass), and memory by much more!
  4. No exploding gradients (like in RNNs).
  5. We can also use non-differentiable elements in the architecture!
  6. We can also use low-precision hardware (and TPUs) since we don't need that much memory.
  7. We're also not burdened by tuning the frameskip parameter or adding hierarchy (which MDP-based RL often needs).

Experiments:

First, they test it on MuJoCo, comparing TRPO and ES on the same environments with the same networks. For ES they sometimes had to discretize the action space to encourage more exploration. For complex environments, ES needed less than 10x the timesteps of TRPO, and for simple ones, less than 3x.

For Atari, they trained on ~3x more frames, but they were able to parallelize the system so that it could crunch all those frames in 1 hour per game; the published results for A3C took 1 day. The final result: ES performed better on 23 games and worse on 28.

For parallelization, there's no need for any special networking setup; it all ran on regular Amazon EC2 instances. For the 3D humanoid walking task (one of the hardest tasks), SOTA RL takes about 1 day to learn it on modern hardware.

ES takes 11 hours on a single 18-core machine. But ES scales to 80 machines (1,440 CPU cores), which solved the task in just 10 minutes. That's 2 orders of magnitude less wall-clock time!

Additionally, when experimenting with frameskip (which is a crucial hyperparameter for MDP-based RL), ES is robust and isn't affected much by the frameskip value (since it's invariant to the length of the episode).

Related Work:

[They literally cite 14 papers + their titles + brief description of them]

But the main contribution of this paper is that ES is super scalable + is competitive with SOTA RL in terms of performance for even the hardest problems + close data efficiency + less wallclock time to train.

Conclusion:

This paper presented ES as an alternative to MDP-based RL. Through experiments, it turns out that ES is competitive with traditional RL techniques! It's invariant to action frequency and delayed rewards, with no need for temporal discounting or value functions. Most importantly, ES is super scalable, which makes it fast.

Future work includes applying ES to areas where MDP-based RL struggles, exploring meta-learning in RL, and combining ES with fast low-precision neural network implementations to fully take advantage of ES's gradient-free nature.

If you want to find out more: Read the paper here, or check out OpenAI’s blog post. Shoutout to OpenAI for always coming up with awesome papers! I loved this paper!

Thanks for reading! I’m Dickson, an 18-year-old ML enthusiast who’s excited to use it to impact billions of people 🌎

If you want to follow along on my journey, you can join my monthly newsletter, check out my website, and connect on LinkedIn or Twitter 😃
