Automatic Differentiation lets you compute exact derivatives for roughly the cost of evaluating the function itself
Differentiation shows up everywhere, from the backprop algorithm in deep neural networks to the equations of motion in physics, and in pretty much any field that needs to quantify a rate of change.
Automatic Differentiation is the secret sauce that powers all the hottest and latest Machine Learning frameworks, from Flux.jl to PyTorch to TensorFlow. Differentiation in general is becoming a first-class citizen in programming languages, with early work started by Chris Lattner of LLVM fame — see the Differentiable Programming Manifesto for more detail.
Those frameworks essentially offer you a mini programming language embedded in a larger one where computing the derivative of a function takes as much time as evaluating the function. In the case of Deep Learning, you define a network with a loss function and get a gradient for free.
Automatic Differentiation != Numeric Differentiation
To make the above point clearer let’s go over Symbolic Differentiation and Numerical Differentiation and then introduce Automatic Differentiation and what makes it so great.
But just know that it doesn’t have to be complicated. In fact, a working implementation of Automatic Differentiation can fit in a single Tweet.
Symbolic differentiation works by breaking apart a complex expression into a bunch of simpler expressions by using various rules — very similar to a compiler.
Examples of some rules:
- Sum rule: (f + g)′ = f′ + g′
- Constant factor rule: (c·f)′ = c·f′
- Derivatives of powers rule: (xⁿ)′ = n·xⁿ⁻¹
And then you would use those rules to solve a differential expression. As an example, take f(x) = x³ + 5x² and let’s find its derivative.
By the sum rule we get f′(x) = (x³)′ + (5x²)′
If we combine the constant rule and derivatives of powers rule we get f′(x) = 3x² + 10x
The main issue with symbolic differentiation is that instead of simplifying the expression we could end up with an exponentially large expression that is prohibitively slow to evaluate — O(2ⁿ) in the worst case.
Symbolic differentiation isn’t used much in practice but it’s great to prototype with.
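To make the compiler analogy concrete, here's a toy symbolic differentiator sketched in Python. The tuple encoding of expressions and the function names are my own, purely for illustration; note how the unsimplified output of `diff` already hints at the expression swell problem:

```python
# Expressions are tuples: ('add', a, b), ('mul', a, b), ('pow', base, const_exp).
def diff(e):
    if isinstance(e, (int, float)):
        return 0                                    # constant rule
    if e == 'x':
        return 1                                    # d(x)/dx = 1
    op, a, b = e
    if op == 'add':
        return ('add', diff(a), diff(b))            # sum rule
    if op == 'mul':
        # product rule: (ab)' = a'b + ab'
        return ('add', ('mul', diff(a), b), ('mul', a, diff(b)))
    if op == 'pow':
        # power rule with chain rule: (a**n)' = n * a**(n-1) * a'
        return ('mul', ('mul', b, ('pow', a, b - 1)), diff(a))
    raise ValueError(op)

def evaluate(e, x):
    if isinstance(e, (int, float)):
        return e
    if e == 'x':
        return x
    op, a, b = e
    fa, fb = evaluate(a, x), evaluate(b, x)
    return {'add': fa + fb, 'mul': fa * fb, 'pow': fa ** fb}[op]

expr = ('add', ('pow', 'x', 3), ('mul', 5, ('pow', 'x', 2)))   # x**3 + 5*x**2
```

Nothing here simplifies the result, so nested `diff` calls keep multiplying out sub-terms, which is exactly where the exponential blowup comes from.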
Numeric Differentiation is far more popular and effectively the default way to compute derivatives in most applications. The usual formula approximates the derivative with a finite difference:

f′(x) ≈ (f(x + ϵ) − f(x)) / ϵ

This formula is derived via the famous Taylor Series expansion, which lets you write a function in terms of its higher order derivatives. I won’t cover why the Taylor series expansion works; you can check out this post to learn more if you’re interested.
The main issue with numeric differentiation is that if ϵ is too small, computers will run into floating point cancellation and give incorrect results! If ϵ is too large then the result is only an approximation. It’s also slow: computing a gradient with n inputs takes O(n) function evaluations.
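You can see both failure modes in a few lines of Python (`numeric_diff` is my own helper name, not a library function):

```python
# Forward-difference approximation: f'(x) ~ (f(x + eps) - f(x)) / eps.
def numeric_diff(f, x, eps):
    return (f(x + eps) - f(x)) / eps

f = lambda x: x * x + 1             # true derivative at x = 3 is exactly 6

print(numeric_diff(f, 3.0, 1e-5))   # close to 6, but only an approximation
print(numeric_diff(f, 3.0, 1e-15))  # far from 6: floating point cancellation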
Wow, differentiation algorithms really suck, eh? Well, maybe not.
Automatic Differentiation gives exact answers for roughly the cost of evaluating the function. It does require introducing some unfamiliar math, but it’s really simple after you skim it once.
Before we introduce Automatic Differentiation we need to talk about Dual Numbers.
Dual numbers are numbers of the form a + bϵ, where ϵ² = 0 but ϵ ≠ 0.
Dual numbers look like complex numbers if you replace i² = −1 with ϵ² = 0, but the two are unrelated, so don’t let the similarity distract you.
Suppose you have two dual numbers a + bϵ and c + dϵ. You can do arithmetic on them.
If you add 2 dual numbers you get a dual number:
(a + bϵ) + (c + dϵ) = (a + c) + (b + d)ϵ
If you multiply 2 dual numbers you also get a dual number, because the ϵ² term vanishes:
(a + bϵ)(c + dϵ) = ac + (ad + bc)ϵ + bdϵ² = ac + (ad + bc)ϵ
Addition and Multiplication in and of themselves are not particularly useful, but the power of dual numbers shines when we take a look at the Taylor series expansion of a function about a dual point instead of a regular point like what we saw in numeric differentiation.
Taylor Series about a Dual Point
Plain Taylor series approximates a function f about a point a by using all of its higher order derivatives:

f(a + h) = f(a) + f′(a)h + f″(a)h²/2! + f‴(a)h³/3! + …

Instead of expanding f about a real number a, we will expand it about the dual number a + ϵ, i.e. substitute h = ϵ:

f(a + ϵ) = f(a) + f′(a)ϵ + f″(a)ϵ²/2! + f‴(a)ϵ³/3! + …

Dual numbers have the convenient property that ϵ² = 0, which means ϵ³, ϵ⁴, … are all 0 too. So the expression simplifies to

f(a + ϵ) = f(a) + f′(a)ϵ
So if you evaluate a function f at a dual number a + ϵ you evaluate the function AND get its derivative for free 🤯
The solution we obtained is also EXACT because we are not ignoring the higher order derivatives like in numeric differentiation but we are eliminating them.
The solution can be computed FAST because we are just evaluating a function and not dealing with infinite sums or any such nonsense. Symbolic differentiation is also exact, but it’s slow because you need to expand out an exponential number of expressions to evaluate it.
A simple example
Let’s say we have a function f(x) = x² + 1 and we’d like to compute its derivative via Automatic Differentiation.
So we evaluate f(x + ϵ) = (x + ϵ)² + 1 = x² + 2xϵ + ϵ² + 1
ϵ² = 0, so f(x + ϵ) = (x² + 1) + 2xϵ
The coefficient of ϵ is 2x, which is exactly the derivative of x² + 1 🤯
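The dual-number trick really does fit in a few lines of code. Here's a minimal sketch in Python (the `Dual` class and `derivative` helper are my own names, not any library's API):

```python
# A minimal forward-mode AD sketch using dual numbers a + b*eps, eps**2 == 0.
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 == 0
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    # Evaluate f at x + eps; the eps coefficient is f'(x).
    return f(Dual(x, 1.0)).eps

print(derivative(lambda x: x * x + 1, 3.0))   # f'(3) for f(x) = x**2 + 1
```

To handle functions like sin or exp you'd overload them too, propagating the eps slot with the chain rule, but the structure stays exactly this small.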
And that’s it, you now understand Automatic Differentiation and can call yourself a Deep Learning OSS contributor.
While the above algorithm works it suffers from the same performance issues as Symbolic Differentiation where we need to expand out a potentially exponentially sized expression.
So next we’ll go over the two main algorithms for Automatic Differentiation, Forward Mode and Reverse Mode, with an example borrowed from Automatic Differentiation in Machine Learning: a Survey.
Forward Mode Differentiation
The first step in both the forward and Reverse Mode AD algorithms is to represent a function as a computational graph, also called a Wengert list.
Each node v in this list will represent an intermediate result of the computation. The intermediate results can then be assembled using the chain rule to get the final derivative we’re looking for.
As an example, suppose you have the function from the survey, f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂), and we’d like to evaluate its derivative at (x₁, x₂) = (2, 5).
The computational graph for this function has one node per elementary operation: v₁ = ln(x₁), v₂ = x₁x₂, v₃ = v₁ + v₂, v₄ = sin(x₂), y = v₃ − v₄.
And we’re interested in calculating ∂y/∂x₂.
We can break this problem down into calculating how much each intermediate node varies with respect to the input.
The primal trace simply stores all the intermediate computations of each node
But our goal is to compute ∂y/∂x₂, which we can do via the Dual Trace, where we differentiate each intermediate expression vᵢ with respect to x₂ to get v̇ᵢ = ∂vᵢ/∂x₂.
And since we compute each v̇ᵢ at the same time as vᵢ, we can calculate a derivative at the same time as evaluating the function with no memory overhead.
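The primal and dual traces can be written out by hand as straight-line code. Here's a sketch in Python, assuming the survey's example function f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂); seeding the input tangents as (ẋ₁, ẋ₂) = (0, 1) makes every dual value a derivative with respect to x₂:

```python
import math

# Primal (v) and dual (dv) traces computed side by side for
# f(x1, x2) = ln(x1) + x1*x2 - sin(x2), the example from the survey.
def f_forward(x1, x2, dx1, dx2):
    v1, dv1 = math.log(x1), dx1 / x1                 # v1 = ln(x1)
    v2, dv2 = x1 * x2,      dx1 * x2 + x1 * dx2      # v2 = x1 * x2
    v3, dv3 = v1 + v2,      dv1 + dv2                # v3 = v1 + v2
    v4, dv4 = math.sin(x2), math.cos(x2) * dx2       # v4 = sin(x2)
    return v3 - v4, dv3 - dv4                        # y  = v3 - v4

y, dy_dx2 = f_forward(2.0, 5.0, 0.0, 1.0)            # dy/dx2 = x1 - cos(x2)
```

Each dv line only needs the current v line's inputs, so nothing beyond the live variables has to be kept in memory.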
That said, you may have noticed that we only computed the derivative with respect to x₂. In fact we’d need to repeat this process for each input, which in deep learning applications is a non-starter given that input data is very often high dimensional.
Working with vector valued functions
More generally we’d like to work with vector valued functions f: ℝⁿ → ℝᵐ that take in n inputs and produce m outputs.
So we need a way to represent the derivative of each output yᵢ with respect to each input xⱼ, and the Jacobian matrix does exactly this: it’s the m × n matrix with entries Jᵢⱼ = ∂yᵢ/∂xⱼ.
The technique we’ve described so far is called Forward Mode Differentiation.
Forward Mode Differentiation really suffers when n is large because you need to do a primal and dual trace for each input variable xᵢ. But the algorithm scales for free when you increase the number of outputs m, so it’s still good for generative applications that need to generate large sequences from a small input seed.
As a result we need to learn about a different technique called Reverse Mode Differentiation. Most deep learning workflows have n >> m, so Reverse Mode Differentiation is the algorithm of choice for back-propagation in PyTorch, Flux.jl, TensorFlow and other Deep Learning libraries.
Reverse Mode Differentiation
Let’s use the same function f as an example of how this works
The Dual Trace for Reverse Mode AD is more complex than its forward counterpart but the main trick is the chain rule.
The chain rule is a technique to break apart a derivative we don’t know how to solve into derivatives that we do know how to solve.
Applied to the context of a computational graph we can write the chain rule as

∂y/∂v_i = (∂y/∂v_k)(∂v_k/∂v_i)

But the v_k we’ll be picking won’t be arbitrary. In fact v_k will be a parent of v_i in the computational graph. If v_i has more than one parent then we sum up the chain rule over all its parents. This is called the Multi-variable chain rule, you can find a proof here:

∂y/∂v_i = Σ_{k ∈ parents(i)} (∂y/∂v_k)(∂v_k/∂v_i)

The expression ∂y/∂v_i has a name: it’s called the adjoint of v_i, which we’ll denote ̅v_i.
Given this definition we can then rewrite the adjoint of a node in terms of the adjoints of its parents:

̅v_i = Σ_{k ∈ parents(i)} ̅v_k · (∂v_k/∂v_i)
Which gives us a recursive algorithm where we start from the output node y and go back all the way to the input nodes by going over adjoints.
To be clear on what we mean by parent, in the example we’re working with v_3 is a parent of both v_1 and v_2.
The Dual/Adjoint Trace
Even though the algorithm was completely different, we got back the same result, which you can verify by either symbolic or numeric differentiation.
You can also see how the process is kinda fiddly for a human but is ideally suited for computers.
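To show what the computer actually does, here's the same example written out by hand in Python, again assuming the survey's function f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂) with nodes v₁ = ln(x₁), v₂ = x₁x₂, v₃ = v₁ + v₂, v₄ = sin(x₂), y = v₃ − v₄: a forward sweep that stores every intermediate, followed by a reverse sweep over adjoints:

```python
import math

def f_reverse(x1, x2):
    # forward (primal) sweep: store every intermediate result
    v1 = math.log(x1)
    v2 = x1 * x2
    v3 = v1 + v2
    v4 = math.sin(x2)
    y = v3 - v4

    # reverse (adjoint) sweep, starting from ybar = dy/dy = 1
    ybar = 1.0
    v3bar = ybar                               # y = v3 - v4: dy/dv3 = +1
    v4bar = -ybar                              #              dy/dv4 = -1
    v1bar = v3bar                              # v3 = v1 + v2: dv3/dv1 = 1
    v2bar = v3bar                              #               dv3/dv2 = 1
    x1bar = v1bar / x1 + v2bar * x2            # x1 feeds v1 AND v2: sum adjoints
    x2bar = v2bar * x1 + v4bar * math.cos(x2)  # x2 feeds v2 AND v4
    return y, x1bar, x2bar                     # one sweep yields BOTH partials

y, dy_dx1, dy_dx2 = f_reverse(2.0, 5.0)
```

Notice the forward sweep has to keep v1 through v4 alive so the reverse sweep can read them, which is exactly the memory cost discussed below.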
The amazing thing about Reverse Mode AD is that one full iteration computed both ∂y/∂x₁ and ∂y/∂x₂,
while the example we worked through with Forward Mode AD only gave us back ∂y/∂x₂.
If you’d like to see one more example fully worked out, Step-by-step example of reverse-mode automatic differentiation is another great resource.
Wengert List for Logistic Regression
And if you want to generalize this idea to deep neural networks, you just need to apply either Forward or Reverse Mode AD on a Wengert List that looks something like the below. All of the operations are differentiable, and if you’re writing them in a deep learning library, it takes care of making them differentiable for you even if you have, e.g., control statements.
But that’s just code? What do you mean it’s a Wengert List? Code that does 1 operation per line is a Wengert List — in languages like Julia this is explicit.
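As a hypothetical sketch (my own variable names, and a squared-error loss just for simplicity), a logistic regression written in Wengert-List style looks like ordinary code with one operation per line:

```python
import math

# One elementary operation per line: each v_i is a node in the graph.
def logistic_loss(w, b, x, y):
    v1 = w * x            # weighted input
    v2 = v1 + b           # add the bias
    v3 = -v2
    v4 = math.exp(v3)
    v5 = 1.0 + v4
    v6 = 1.0 / v5         # v6 = sigmoid(w*x + b), the prediction
    v7 = v6 - y           # residual against the label
    v8 = v7 * v7          # squared-error loss
    return v8
```

Every line is an elementary differentiable operation, so either AD mode can walk this list node by node.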
Summary of Forward vs Reverse Mode AD
It’s hard to remember all the mechanics of both algorithms on a first read but you can remember their trade-offs so you know what to use when.
- Forward Mode AD can throw away intermediate results since it’s an iterative algorithm.
- Reverse Mode AD needs to keep all intermediate results in memory since it’s a recursive algorithm.
- Forward Mode AD needs to run once per input to compute the full Jacobian matrix.
- Reverse Mode AD needs to run once per output to compute the full Jacobian matrix.
But it’s also worth remembering that your human time is more valuable than computer storage, and deep learning applications mostly have large inputs and small outputs.
Now let’s go over some code so nothing feels like magic.
Implementing Automatic Differentiation in Julia
So far it looks like we’re just indulging ourselves, so we’ll now take a close look at how Zygote.jl (Julia’s AD library) implements Automatic Differentiation.
Mike Innes, one of the core maintainers of Zygote.jl, has an excellent tutorial on AD with Zygote: https://github.com/MikeInnes/diff-zoo — Mike is also my favorite person to follow on GitHub, all his repos are amazing.
Let’s go through some snippets of his project diff-zoo.
Remember the compiler analogy we talked about? This is how easy it is to build a compiler in Julia.
You can get a very visceral feel for how symbolic differentiation expands out an expression into something huge — there are also repeated subexpressions like 1 + x² which we’d be re-deriving from scratch each time.
y_1 through y_3 show the primal trace and then y_4 through y_9 the dual trace.
The definition of dual numbers is straightforward
Dual number arithmetic
Taking derivatives becomes as simple as
Concluding notes on AD with Julia
However, as an end user, dual numbers are just an abstraction to make generating a Wengert List more efficient, so once you’re working with an AD library you don’t ever need to think about dual numbers.
In fact you don’t even need to think about Wengert Lists explicitly, because Julia makes it a point to represent its code natively as a Wengert List, one that includes mathematical expressions but also control flow statements.
I highly recommend you go through diff-zoo since it goes deeper into some of the interesting features of Julia that make this all work. In particular I really like the way it describes the difference between the static graph model that TensorFlow 1.0 uses vs the eager execution model of PyTorch.
There’s a bunch of different next steps you could take from here depending on your interests.
Extra Automatic Differentiation References
Getting my head around Automatic Differentiation took a while but the process was a lot less painful with these 2 references.
- Automatic differentiation in Machine Learning: a survey — the best short comprehensive review. After reading this I grokked the insides of AD
- Machine Learning: Alchemy for the Modern Computer Scientist [talk] — There aren’t all that many talks around on Differentiable programming but after watching this I ironed out my high level understanding of AD and how powerful its applications are
I also own a textbook Evaluating Derivatives which looks good but I haven’t read it enough to really comment on how good it is. The angry Amazon review is about the Kindle version so you should form your own opinion here.
Haskell implementation of Automatic Differentiation
I’d also like to briefly show you the Haskell implementation because it made me religious about the potential of functional programming in numeric code.
The entire H2 Wiki is a fantastic resource if you’re interested in both Machine Learning and Programming Languages.
There is a deeper relationship, heavily under-explored, between Scientific Computing (like Machine Learning) and Functional Programming. A few references do a good job of showing how scientific code can be turned into a really short but expressive DSL:
- Backprop as Functor: A compositional perspective on supervised learning — a hard read but well worth it especially if you’re not sure how to use Category Theory in a useful way
- The Simple Essence of Automatic Differentiation — reads very similarly to the first post but is about AD instead of supervised learning
- Purely Functional Typed Models — shows how to model neural networks as types in Haskell
- Can I use Deep Learning for that? A Type Theoretic Heuristic — I talk about how thinking of networks as types can make it really easy to figure out which problems are solvable vs sci-fi
This entire discussion may have given you the impression that Automatic Differentiation is a technique for numeric code only. AD is part of a larger programming paradigm called Differentiable Programming where you write your code in a differentiable manner and get its derivative for free. Even control flow operations like for and if admit derivatives.
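As a small illustration of that last point, here's a sketch in Python that pushes a (value, derivative) pair through a for loop and an if branch. The tuple encoding mimics a dual number and the function is my own toy example:

```python
# Represent a dual number as a plain (value, derivative) pair.
def dmul(a, b):
    # product rule applied to the derivative slot
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def f(x):                      # x is a (value, derivative) pair
    acc = (1.0, 0.0)           # the constant 1 has derivative 0
    for _ in range(3):         # a for loop: acc becomes x**3
        acc = dmul(acc, x)
    if acc[0] > 0:             # a branch: overall f computes |x**3|
        return acc
    return (-acc[0], -acc[1])

value, deriv = f((2.0, 1.0))   # seed derivative 1 to differentiate w.r.t. x
```

The loop and the branch only inspect primal values, so the derivative flows straight through them; that's the sense in which control flow "admits" a derivative.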
Within the context of code, a derivative represents a small step forward or backward in the state and the range of applications this opens up is endless.
Compilers for Machine Learning
Given a computational graph there are many tricks you can perform to make it smaller and avoid duplicate work. This is a growing field in and of itself, as Deep Learning models are memory and compute intensive, so any improvements here immediately translate into dollars/bitcoins saved. The effort I’m following the closest is Swift for TensorFlow’s Graph Program Extraction, but I’m sure there are others, since compilers for ML are far from a solved problem.
If you know of any or are working on compilers for ML please reach out, I’d love to check it out and blog about it!
We programmers don’t use enough math in our work — if we did, we wouldn’t write as much code.
And if you enjoyed this post you’ll probably also enjoy my book “The Robot Overlord Manual”, please subscribe below and I’ll email you when new chapters are released.
Thank you to Adam Nemecek, Brendean, gabibbo97 for their early feedback.