How to implement a Reinforcement Learning library from Scratch — A Deep dive into Reinforce.jl

The goal of this tutorial is to introduce you to Reinforce.jl library which is a Reinforcement Learning library written in Julia by Tom Breloff

This is a library written mostly written by a single person and my theory is that Julia is what helps someone smart like Tom be this productive. So we’re gonna be doing some GitHub archaeology and try to figure out how everything in Reinforce.jl fits together.

We’re going to be looking at a couple of things in this post

  1. Why Julia is good for numerical code
  2. Basics of Reinforcement Learning and its applications
  3. Code Deep Dive into Reinforce.jl to see how everything works
  4. Finish up with some good next steps

In what follows, I’ll try to be as light as possible on the theory side and as detailed as possible when it comes to the more applied problems of getting a real library working.

Why Julia?

Yes, Python is more popular than Julia and Pytorch and Tensorflow are more popular than Flux.jl and Reinforce.jl BUT Julia has a couple of advantages.

  1. A type system for numerical code

You can, for example, define a point, a mathematical object in code in an abstract manner. You could instantiate points where each of the x and y coordinates are either real numbers, complex numbers, shapes like rectangles depending on the application so you can make algebraic statements with code. Methods also operate on structs instead of being part of structs which makes code reuse a joy.

Functions that should receive a Point but don’t will fail at compile time and not runtime.

Based on my own personal experience, Type errors are responsible for most of the errors in ML pipelines

The distinction is important because compile-time bugs can be detected before you deploy something to tons of people while runtime bugs are inherently unpredictable and if mishandled can bring down your entire pipeline or corrupt your data.

2. You can write performant AND expressive code in Julia

Python is expressive but slow.

Pytorch and Tensorflow address this problem by embedding into Python a Domain Specific Language with its own compiler and a C++ compiler.

The support by Facebook and Google means there are tons of resources to get started here but it’s extremely challenging to contribute to the code-base given the complexity.

Julia makes contributions easier because the code is so much simpler and shorter. You need to write and read less code. Which means you need less people to support Julia project and you can dramatically reduce the communication overhead associated with large teams.

3. Lets you easily and efficiently write Domain Specific languages for numerical code

As an example suppose you’re trying to solve some sort of linear programming optimization problem which looks something like

minimize some objective function f

such that some constraints hold on the variables

The longer you work with Julia the more you’ll see how it tries to eliminate the extra steps from Math to Programming.

JuMP is a Julia optimization library where the user code looks very similar to the corresponding math.

Intro to Reinforcement Learning

Reinforcement learning is a declarative way to teach machines certain behaviors. You specify some high-level reward functions like winning at Chess or Go and via self-play, they can beat human world champions at those games.

Typically in supervised learning, you have some data set which has labeled examples and you’re trying to predict labels of unlabeled examples.

Game simulators are unique in the sense that they don’t strictly need labeled data (although it doesn’t hurt, a whole field called imitation learning does this) since game simulators generate their own training data.

This is an immensely powerful property because the most algorithmic advances in supervised learning are overshadowed by collecting more data. Collecting data is expensive, hard, often violates privacy, requires an entire BI infrastructure with support, alerting, monitoring, product managers and a bunch of other stuff that’s not really realistic for smaller teams.

Training a supervised learning problem is generally easy so even in the case of Reinforcement Learning we will discuss how to turn a Reinforcement Learning problem into a supervised learning problem and hence make it easy.

The general framework of Reinforcement Learning is we have an agent interacting with some world and we’d like that agent to learn some policy π which is nothing but a function that given a world state s outputs an action a which the agent should take.

Policies aren’t a Reinforcement Learning specific concepts so they can be represented in a bunch of different ways including but not limited to hard coding and decision trees.

However, in complex environments like Dota, Go or robotic control the number of possible states can be massive so it’s become increasingly popular to learn a policy function using a neural network which we’ll talk about it in more detail further down.

The natural question and solution to the Reinforcement Learning problem is which policy π maximizes your future reward the most? Reinforcement Learning is inspired by Pavlov’s work where you punish agents for bad behavior and reward them for good behaviors.

At any time-step t you want to maximize the total reward from that point on. A first step would be to maximize

However, humans have a concept of time value of money as rewards now can be used to gain more reward in the future so typically we discount future rewards exponentially by a factor γ to make them matter less than immediate rewards.

Hypothetically if we had a key-value store that mapped states to future rewards G_t then we’re good, we can solve the Reinforcement Learning problem by picking the states with the largest total future reward.

In fact, these key-value stores have names

  • Value functions V(s) → gives the value of a state over all possible actions
  • Q functions Q(s,a) → gives the value of a state-action pair
  • Advantage functions A(s,a) = Q(s,a) — V(s) → gives the “advantage” of action a in state s relative to the other actions

But how do we learn these mappings?

The idea is that we start with some random policy π and evaluate the policy by observing a bunch of rewards once we’re in different states. We then use this new data to come up with a new policy and test it out again. These 2 steps are called Policy Evaluation and Policy Improvement and is a Dynamic Programming approach to solve the Reinforcement Learning problem.

It’s possible to learn all 3 mappings at the same time with a neural network with 3 outputs. If you wanna learn more about how neural networks can represent such a large class of problems you may enjoy another article of mine where I really hammer on this point.

You can also explicitly try to learn the dynamics of the environment, these methods are called model-based methods, otherwise the algorithm is model free.

A (very) brief summary of the different Reinforcement Learning algorithms

  1. Value estimation: which updates the Q function even in the face of incomplete episodes. If it’s on policy it’s called SARSA and if it’s off policy it’s called Q learning.

2. Policy gradient: where you learn the policy function Π directly by gradient descent

3. Actor Critic: Combines policy gradient and value estimation methods. A better policy lets you learn better values and better values let you estimate better policies

4. Monte Carlo: More on this when I talk about gambling in a future blog post

Each of these can be estimated in a variety of ways but we’ll use neural networks to represent each of them.

It’s also worth keeping in mind that Reinforcement Learning training is notoriously unstable and sensitive to hyper-parameters so the below tricks really make it scale

  1. Hindsight experience replay: to keep a good cache of memories that agents can use — off policy algorithms are easier to scale
  2. Target network freezing: you’ll notice that Q functions of themselves in the update so if Q changes the update becomes incorrect so it needs to be frozen
  3. Asynchronous distributed methods like A3C: to create a large data-set of experiences

Applications of Reinforcement Learning

Over the years the gold standard benchmark of Reinforcement Learning has changed. The end goal was always understood as some sort of humanoid robot that is dexterous enough to move around in the physical world but we’ll have to work our way there and start with simpler (but still ridiculously hard problems).

  1. Board games including Chess and Go

2. Video games and robot simulations → generally strategy games provide more interesting benchmarks because the space is so vast and it’s impossible to become pro by just playing a lot. There’s people with over 10K hours in Dota who still don’t understand the basics.

3. Self driving cars → supposedly this is what Tesla does, they’ve been using their cars to collect driving data for years and it’s extremely unlikely that anyone will solve self driving cars without a data set like this

Getting started with Reinforce.jl

So start off by typing into your terminal

And you’ll see 3 folders in the main directory

  1. examples/
  2. src/
  3. test/

If you open up examples/ you’ll notice there’s just 1 file called mountain_car.jl so let’s start there. Mountain car is actually a pretty popular intro Reinforcement Learning environment, think of it as the hello world of Reinforcement Learning where before you try out your algorithm on something fancy it better work on mountain car (note that many papers fail this test).

The goal of the mountain car is to reach a goal position denoted by the flag above. The main challenge a car needs to face is to figure out that it needs to accelerate as much as possible before going uphill.

The first interesting part in mountain_car.jl is

Where we inherit from AbstractPolicy via the <: symbol to create a basic car policy which looks like a hardcoded function that given the velocity of the car makes action 1 if the velocity is negative and 3 otherwise.

We can guess that 1 means stop since accelerating up a hill at negative is pointless and then 3 probably means go full speed. But we need to read more to double-check.

If you keep going you’ll notice that there’s a MountainCar() environment being created and then there’s an episode!() function. The exclamation mark tells us that the function modifies its arguments, in this particular case by acting on the environment with a policy π the agent changes the state of the environment s to s’.

An Episode among other stuff is keeping track of a tuple (s, a, r, s’) which are core to any Reinforcement Learning algorithm

  • s → current state of the environment
  • a -→ action taken in state s by the agent
  • r → optional reward ∈ R which rewards or punishes an agent for its actions
  • s’ → the new state the environment will be in

Finally the policy is executed on the environment and we keep track of the total reward. If everything is working correctly, we’ll see the reward go up ↑

So to sum up from the mountain_car.jl example left us with a few open questions

  1. What does an AbstractPolicy Look like?
  2. What’s an Episode?
  3. At a high level how does the MountainCar environment work? How are actions, rewards and states represented?
  4. How do you define environments?
  5. Where is the Reinforcement Learning happening? Is there an example of that anywhere in the library?

Let’s go through them 1 by 1

What does an AbstractPolicy look like?

Surprisingly empty

A policy just needs to implement a function called action which takes in a policy, reward, state and a list of all possible actions and outputs an action but so far all of this seems implicit but explicit.

There’s also a brief mention to sarsa-style updates which is an on policy Reinforcement Learning algorithm — great now what does on policy mean?

  • On policy Reinforcement Learning: always follows the policy during the simulation to generate episodes. So if the policy improves, the new improved policy will be used → better for real physical robots since you can’t risk destroying your physical
  • Off policy Reinforcement Learning: can use 2 different algorithms one to evaluate how good a policy is and another to explore the space and record episodes which could be used by any other policy → better for simulations since you can generate tons of data in parallel by running multiple simulations at the same time. Off policy algorithms need to an exploration policy that performs a trade-off between exploring and exploiting the world state

What’s an Episode?

There’s a file called src/iterators.jl which seems to hold Episodes. It’s 200 LOC so a bit meatier than what we’ve seen so far.

Episode doesn’t look too crazy, it keeps track of the environment, the current policy, rewards and the number of iterations.

How does the Mountain Car environment work?

Specifically we want to figure out how to represent

  1. States

You can represent the state of a simulation with a few variables. Typically game engines will take a list of such variables and then render on screen the state of the game or simulation.

In this situation all we need is the position and the velocity of the car to figure out what to do next.

2. Actions

Just a single line of code which maps to our intuition that 1–3 map to speeds

3. Done signal

There’s a couple of ways of designing reward functions which opens up a whole mini field of reward engineering.

You could reward the agent for its proximity to the goal, that way the agent gets more frequent feedback.

Or you want to take a black box approach it you can just reward the agent when it crosses the goal line without giving it any intermediate feedback. Let the RL algorithm figure everything out.

This example does both, the done signal maps to crossing the goal position (remember we’re just in 2-d right now)

4. Rewards

Rewards show up in the step!() function which is the main function of the simulation

Which given a position and velocity returns a reward -1 and the new state.

A reward of -1 is often called an existential penalty, it forces agents to act instead of procrastinating. There is a biological basis for this penalty as well, it’s called boredom

Most of this logic is to make sure we don’t do anything weird like go off screen or go at speed of light etcc, we just use basic physics to update a position by adding a velocity vector to it.

How do you define environments?

Mountain car was just a specific example but we can program any sort of environment we want by adhering to the AbstractEnvironment interface.

  1. reset!() initializes a simulation, typically reset!() will be called when a simulation has succeeded, failed or has been going on for too long
  2. step!() will hold all the core logic
  3. finished!() returns a Boolean for whether the simulation is complete or not
  4. The rest we’ve already covered

There are more environments you can check out at

We now understood how the mountain car example, how about the Reinforcement Learning algorithms?

DDPG Actor critic on Open AI universe

Granted creating environments is hard work, you’re essentially programming a full video game each time. Thankfully, Open AI and others have made accessing a bunch of environments really easy via their Universe project and retro project which reverse engineer a huge amount of games to add RL hooks to them.

You can use Universe within Julia too, here’s an example of an actor critic implementation.

The specific example they’re using is Dusk Drive which is a racing game. In this case, actions are continuous since there’s an infinite number of possible rotations of a wheel so it doesn’t seem possible to be able to look at all possible actions before we decide what to do.

We can just look at the policy to see how it works. We need to keep track of the total number of actions and states and compress it down to something more manageable and create placeholders for our actor and critic.

And you can setup the policy by creating a neural network for each of the actor and critic.

The networks are setup via some Flux.jl functions you may not have seen before specifically. All of these are also examples of how powerful the Domain Specific Language capabilities of Julia are

  1. Chain: lets you compose functions, neural networks are just functions

2. Concat: which just turns appends two lists to each other, in this case it lets us combine features as a single input to a network

3. Affine: corresponds to operations of the form Wx + b

All of these functions are Flux.jl functions which is a differentiable programming library for Julia but you can think of it as a deep learning library. Differentiable programming means given a code segment, your compiler can automatically derive its derivative which means that you don’t need to explicitly compute derivatives in your code when you’re doing deep learning.

And we’re done with this tutorial! Pat yourself on the shoulder for making it this far.

Next steps

You should probably do your own tour of Reinforce.jl since there are still a bunch of stuff I haven’t covered

  1. Goodies in the test/ folder
  2. Cross-Entropy Method
  3. Other environments
  4. Online GAE and actor-critic implementations

Also I’d make it a point to read this short paper by the Julia team which highlights what differentiable programming is.

I hope you enjoyed this article because I’m planning on expanding on it with 3 more posts.

Unity ML agents

!![Update]!! → post is now up

So next post will be about using Unity which is one of the most popular game engines out today to build 3-d control simulations and then train agents using the Unity ML agents project which provides a way to hook up RL agents to Unity.

Unity ML agents is still in Beta but I’m betting that it will become the default choice for many robotics tasks which is why I’ve been spending so much time learning how it works.

ML agents also make distributed and GPU training fairly trivial, you can call yourself a GPU engineer if you change configs these days. Data collection in Reinforcement Learning is naturally parallel (in off-policy setting) since agents can collect experiences and fill out an experienced buffer independently from each other.

And hey you can also learn how to design video games this way which is possible to do as a solo founder if you’re smart about it, it’s never been easier to make world-class games.

Alpha Go

Ever since DeepMind came up with Alpha Go it’s become fairly standard to combine an RL agent to output an action with a Monte Carlo Tree Search algorithm to explore the board game space. It’s challenging to make MCMC work on continuous domains but in a discrete turn based domain it really shines. The best thing about the approach is that it’s not specific to Go or Chess but to any board game where the state is fully observable


We also haven’t talked about how to deal with games where the state isn’t directly observable like Poker where you can’t really see people’s cards directly but you can see the history of their actions to figure out how to play optimally.

Poker is an interesting example because if there is such a thing as a sentient AI singularity, I’d imagine that the first intelligent AI would figure out that they can make a lot of money playing Poker online to finance their world domination efforts.

I’ll be giving a basic introduction game theory, counterfactual regret minimization and MCMC which are all old techniques that have passed the Lindy test over and over again.

“Real life consists of bluffing, of little tactics of deception, of asking yourself what is the other man going to think I mean to do.” — Von Neuman


I’ve tried not to do a comprehensive overview of all Reinforcement Learning algorithms since excellent overviews already exist and I think you should read those after finishing this post

Start off with A long peek into Reinforcement Learning which is despite the name of the shortest expositions online summarizing all the different kinds RL algorithms and what you should know about them.

After that skim Spinning up which goes over more detail of how the different RL algorithms work, the tutorial comes with a CLI that makes it trivial to benchmark various RL algorithms on various Open AI environments.

Then you really want to get yourself a copy of Deep Learning and the Game of Go because it goes over ALL the code you’ll need to build a super competitive Go bot. They go over how to featurize a Go board, how to implement various RL algorithms really simply, how to create training data and finally how to hook up your bot to an online Go engine so that you can actually play against it.

Finally make sure you skim Reinforcement Learning: An Introduction which many academics consider to be THE reinforcement learning book and while I do think it’s a good book, it’s a bit verbose compared to the previous two references. That said this is the book I’ve also read most often so maybe I’m just sick of rereading it lol.

Robots will save us

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store