The first time someone builds a neural network, it is often “from scratch”, meaning without the use of a Deep Learning framework like TensorFlow or PyTorch. A neural network itself is actually quite easy to implement. The hard part is the backpropagation algorithm, the code that actually trains the network. To implement backpropagation you need to compute the gradient of your neural network with respect to its parameters. This is doable by hand for a small network with only a few layers. Although this is fun to do and will give you a good intuition for how neural networks train, the approach soon needs to be abandoned.
The approach is not scalable: computing gradients by hand for very complicated and deep neural networks quickly becomes extremely tedious and error-prone.
The approach is not flexible: having to implement a new gradient every time you change something in your network makes it almost impossible to experiment with your network architecture.
This is bad, so in practice people do not actually compute gradients by hand. Instead, they use a Deep Learning framework. These frameworks do all the gradient computation automatically for you. But how?
The two most used algorithms in automatic differentiation are called forward mode AD and reverse mode AD. For reasons that will be explained later on, everyone in Deep Learning uses reverse mode AD. In fact, it would be incredibly stupid and strange to train a neural network with forward mode AD. So of course, that is precisely what we will do in this blog post.
Although it has no practical use in Deep Learning, forward mode AD is still a beautiful technique, and it is often considered the more approachable AD technique. It is based on the concept of dual numbers. Dual numbers form an alternative number system (similar to the complex numbers) that interacts very nicely with differentiable functions. In this blog post we will introduce them, implement them in Python and train a neural network with them on the famous MNIST dataset.
One last thing before we begin. One might ask: “why not just use finite differencing?”. That is, one could reasonably suggest to just use the definition of the derivative to compute it: \[\begin{align*}
f\pr(x) = \lim_{h\to 0} \dfrac{f(x+h) - f(x)}{h}.
\end{align*}\]
By just picking a small \(h\) we can compute an approximation of \(f\pr(x)\). In practice this turns out to be impractical for training deep neural nets. When \(h\) is too large, the error between the estimated value and the true value is too large (this is called the truncation error). When \(h\) is too small, the error caused by the fact that we are working with floating point numbers instead of actual real numbers is too large (the roundoff error). As neural nets involve a lot of computational steps, these errors accumulate and simply become too big.
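To see the trade-off in action, here is a quick (illustrative) experiment: estimate the derivative of \(\sin\) at \(x=1\) with a forward difference for several values of \(h\) and compare against the true value \(\cos(1)\). For large \(h\) the truncation error dominates; for very small \(h\) the roundoff error takes over.

```python
import numpy as np

x = 1.0
true_derivative = np.cos(x)  # the derivative of sin at x

for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    estimate = (np.sin(x + h) - np.sin(x)) / h  # forward difference
    error = abs(estimate - true_derivative)
    print(f"h = {h:.0e}   error = {error:.2e}")
```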
All the code in this blog post can be found in this repo.
Dual Numbers
We can introduce dual numbers by first recalling complex numbers. A complex number is a number of the form \(a + bi\) with \(a, b\in \R\) and \(i\) a special number satisfying \(i^2 = -1\). Similarly, a dual number is a number of the form \(a+b\eps\) with \(a, b\in\R\) and \(\eps\) a special number satisfying \(\eps^2 = 0\). Although this setup seems extremely similar, the structure of the dual numbers, \(\mathbb D\), is very different from the reals or the complex plane. This is because the duals do not form a field. With the real numbers and the complex numbers (and the rationals) we can compute \(x/y\) for any \(x\) and \(y\) as long as \(y\) is not equal to zero. With the dual numbers, we cannot. For example, we cannot divide by \(\eps\): if \(\eps\) had an inverse \(\eps^{-1}\), then multiplying both sides of \(\eps\eps^{-1} = 1\) by \(\eps\) would give \(\eps = \eps^2\eps^{-1} = 0\), a contradiction. If you are comfortable with abstract algebra, we can define the dual numbers to be the quotient ring \(\R[x]/(x^2)\), and as \((x^2)\) is not a maximal ideal, \(\mathbb D\) is not a field.
One of the cool things about dual numbers is their interaction with derivatives. Let \(a, b\in\R\). We have \[\begin{align*}
(a+b\eps)^2 &= a^2 + 2ab\eps + b^2\eps^2 \\
&= a^2 + 2ab\eps \\ \\
(a+b\eps)^3 &= (a^2 + 2ab\eps)(a+b\eps)\\
&= a^3 + a^2b\eps + 2a^2b\eps + 2ab^2\eps^2 \\
&= a^3 + 3a^2b\eps \\ \\
&\ldots \\ \\
(a+b\eps)^n &= (a^n + na^{n-1}b\eps).
\end{align*}\]
Generally, for any polynomial \(p\) we have \[\begin{align*}
p(a+b\eps) &= p(a) + p\pr(a)b\eps.
\end{align*}\]
Now let us consider any function \(f\c\R\to\R\). If \(f\) is cool enough, it can be approximated by a sequence of Taylor polynomials. It becomes natural to make the following definition.
Definition.
Let \(f\c\R\to\R\) be an analytic function. The dual extension of \(f\) is the function \(\tilde f\c\mathbb{D}\to\mathbb{D}\) given by \[\begin{align*}
a + b\eps \mapsto f(a) + f\pr(a)b\eps.
\end{align*}\]
If we think about this long enough, we catch a first glimpse of how dual numbers can be used for automatic differentiation. Suppose there is some way in which we can evaluate a function \(f\c\R\to\R\) implemented in Python at a dual number of the form \(a+\eps\); then, by looking at the result, we get both the values \(f(a)\) and \(f\pr(a)\). How we achieve this will be explained shortly. First, it’s about time we write some code. So let us implement dual numbers in Python¹!
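Here is a minimal sketch of what such a class can look like. The name DualNumber matches the footnote at the end of this post, but everything else is a simplified stand-in for the code in the repo (which, among other things, uses decorators to also get right addition, subtraction and more):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DualNumber:
    """A dual number a + b*eps, with eps**2 == 0."""
    real: float  # the 'a' part
    dual: float  # the 'b' part; this is where the derivative will show up

    def __add__(self, other: DualNumber) -> DualNumber:
        return DualNumber(self.real + other.real, self.dual + other.dual)

    def __mul__(self, other: DualNumber) -> DualNumber:
        # (a + b*eps) * (c + d*eps) = ac + (ad + bc)*eps, because eps**2 == 0
        return DualNumber(
            self.real * other.real,
            self.real * other.dual + self.dual * other.real,
        )
```

For example, evaluating the polynomial \(p(x) = x^2 + x\) at DualNumber(3.0, 1.0), that is at \(3+\eps\), gives DualNumber(12.0, 7.0), so \(p(3) = 12\) and \(p\pr(3) = 7\).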
Before we can start differentiating arbitrary functions, we need a few more ingredients. So here it is, behold the niceness of dual extensions.
Proposition.
Let \(f\c\R\to\R\) and \(g\c\R\to\R\) be analytic functions. We have \[\begin{align*}
\widetilde{f+g} &= \tilde f + \tilde g\\
\widetilde{f\cdot g} &= \tilde f \cdot \tilde g \\
\widetilde{f\circ g} &= \tilde f \circ \tilde g.
\end{align*}\]
Proof.
The result for pointwise addition is easy.
The result for pointwise multiplication is just the product rule: \[\begin{align*}
(\tilde f \cdot \tilde g)(a+b\eps) &= (f(a) + f\pr(a)b\eps)(g(a) + g\pr(a)b\eps)\\
&= f(a)g(a) + f(a)g\pr(a)b\eps + f\pr(a)g(a)b\eps + f\pr(a)g\pr(a)b^2\eps^2 \\
&= f(a)g(a) + f(a)g\pr(a)b\eps + f\pr(a)g(a)b\eps \\
&= f(a)g(a) + (f\pr(a)g(a) + f(a)g\pr(a))b\eps \\
&= (f\cdot g)(a) + (f\cdot g)\pr(a)b\eps \\
&= \widetilde{f\cdot g} (a + b\eps).
\end{align*}\]
The result for composition is just the chain rule: \[\begin{align*}
(\tilde f \circ \tilde g)(a + b\eps) &= \tilde f(g(a) + g\pr(a)b\eps) \\
&= f(g(a)) + f\pr(g(a))g\pr(a)b\eps \\
&= (f\circ g)(a) + (f\circ g)\pr(a)b\eps \\
&= \widetilde{f\circ g}(a + b\eps).
\end{align*}\]
This result suggests a very effective way to automatically differentiate a very big class of analytic functions. First, we start with a small set of functions and explicitly implement the dual extensions of these functions. Then, any function that can be built from this small set using only composition, pointwise addition and pointwise multiplication will automatically work as well!
Let’s just do it. The code below automatically computes the derivative of a function built from our primitives using only addition, multiplication and composition.
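A minimal sketch of what this can look like, assuming the DualNumber class from above; the primitive d_sin and the example function f are illustrative, not the actual code from the repo:

```python
import math


def d_sin(x: DualNumber) -> DualNumber:
    # dual extension of sin: sin(a + b*eps) = sin(a) + cos(a)*b*eps
    return DualNumber(math.sin(x.real), math.cos(x.real) * x.dual)


def f(x: DualNumber) -> DualNumber:
    # f(x) = sin(x * x) + x, built from primitives via +, * and composition
    return d_sin(x * x) + x


result = f(DualNumber(2.0, 1.0))  # evaluate at 2 + eps
print(result.real)  # f(2)  = sin(4) + 2
print(result.dual)  # f'(2) = 4*cos(4) + 1
```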
It would be nice if we were already almost at the end of this blog post, but alas, we are not done yet. We would be done if neural nets were functions \(\R\to\R\). But of course they are not. Neural nets are functions \(\R^n\to \R^m\) for some large \(n\) and some much smaller \(m\). We will go through the definitions quickly, without lingering on them too much. However, they should feel natural if one thinks about them for a bit. Note that any \(\bs x \in \mathbb{D}^n\) is of the form \(\bs a + \bs b\eps\) for some \(\bs a, \bs b \in \R^n\).
Definition.
Let \(f\c\R^n\to\R^m\) be an analytic function with \(n, m\ge 1\). The dual extension of \(f\) is the function \(\tilde f\c\mathbb{D}^n\to\mathbb{D}^m\) given by \[\begin{align*}
\bs a + \bs b\eps \mapsto f(\bs a) + J_f(\bs a)\bs b\eps.
\end{align*}\]
Here \(J_f(\bs a)\) denotes the Jacobian of \(f\) at \(\bs a\). The matrix-vector product \(J_f(\bs a)\bs b\) is called the Jacobian Vector Product, a term you will encounter frequently when reading about automatic differentiation. Note that there is a big difference between our dual extension for multivariate functions and the dual extension for single variable functions. In the single variable case, we automatically get the derivative at any point \(x\) when evaluating on \(x+\eps\). In the multivariate case, this is not true. If we want to compute the full Jacobian \(J_f(\bs a)\), then we need to evaluate the dual extension at \(\bs a + \bs e_i\eps\) for every \(1\le i \le n\). This is an extremely big problem if we want to use this method of automatic differentiation to train neural networks. For neural networks, \(n\) is the number of parameters of the network, which can be extremely big. This is the reason that forward mode AD is not used in deep learning. In fact, this problem is so big that we will shortly see me jumping through some hoops to get the training on MNIST done in OK time. People doing real deep learning use reverse mode AD, which needs only \(m\) evaluations, a much, much smaller number for neural networks. This point is important, so let me highlight it.
Turns out the goal of this blogpost is ridiculous
Training neural nets with forward mode AD as described in this blog post is extremely inefficient and only for educational and fun purposes. Do not do it for reals.
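To make the \(n\)-evaluations problem concrete, take the tiny function \(f\c\R^2\to\R^2\), \(f(x, y) = (xy,\ x + y)\), and evaluate its dual extension at \((2, 3) + \bs e_1\eps = (2 + \eps,\ 3)\): \[\begin{align*}
\tilde f(2 + \eps,\ 3) &= \big((2+\eps)\cdot 3,\ (2+\eps) + 3\big) = (6,\ 5) + (3,\ 1)\eps.
\end{align*}\] The dual part \((3, 1)\) is exactly the first column of \(J_f(2, 3)\). To get the second column we need a second evaluation, at \((2,\ 3 + \eps)\). A network with millions of parameters needs millions of such passes.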
Of course, neural networks are not functions \(\R^n\to\R^m\) either. They are actually functions \(\R^{\bs n}\to \R^{\bs m}\). Here \(\R^{\bs n}\) denotes the space of all tensors of shape \(\bs n\). Shit, I hear you thinking, should I now go and read that other long blog post just linked? Well, if you are like me and would like to read an overly, unnecessarily formal treatment of tensors, you will hopefully find happiness in the other blog post. Otherwise, you can just pretend that tensors are normal vectors and that all the superscripts \(\bs n\) are not bold at all. In fact, I can summarize the other blog post in one sentence:
Tensors are just multi dimensional arrays and their linear algebra works like you expect it to. You can just pretend that \(\R^{\bs n} = \R^n\) and most people do.
For those who were crazy enough to read the other post, and came back, let’s see the definition of the dual extension in the tensor case.
Definition.
Let \(f\c\R^{\bs n}\to\R^{\bs m}\) be an analytic function with \(\bs n\) a shape of length \(K\) and \(\bs m\) a shape of length \(L\). The dual extension of \(f\) is the function \(\tilde f\c\mathbb{D}^{\bs n}\to\mathbb{D}^{\bs m}\) given by \[\begin{align*}
\bs{\mathcal{A}} + \bs{\mathcal{B}}\eps \mapsto f(\bs{\mathcal{A}}) + \mathcal{J}_f(\bs{\mathcal{A}})\ast_L\bs{\mathcal{B}}\eps.
\end{align*}\]
Time to implement vectors (and tensors) of dual numbers in Python. For the components we will use NumPy arrays, which are vectors when np.ndarray(...).ndim == 1 and tensors otherwise.
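Here is a minimal sketch of what such a DualTensor class can look like. The name DualTensor is used later in this post; the rest of the details (a dataclass with real and dual arrays, and only addition, componentwise multiplication and matrix multiplication) are a simplified stand-in for the code in the repo:

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np
import numpy.typing as npt


@dataclass
class DualTensor:
    """A tensor of dual numbers, stored as real + dual * eps."""
    real: npt.NDArray
    dual: npt.NDArray

    def __add__(self, other: DualTensor) -> DualTensor:
        return DualTensor(self.real + other.real, self.dual + other.dual)

    def __mul__(self, other: DualTensor) -> DualTensor:
        # componentwise product; the eps**2 term vanishes
        return DualTensor(
            self.real * other.real,
            self.real * other.dual + self.dual * other.real,
        )

    def __matmul__(self, other: DualTensor) -> DualTensor:
        # (A + B*eps) @ (C + D*eps) = A @ C + (A @ D + B @ C) * eps
        return DualTensor(
            self.real @ other.real,
            self.real @ other.dual + self.dual @ other.real,
        )
```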
Note that we also implement division for tensors of dual numbers, even though we claimed not long ago in this blog post that this is impossible in general. However, for \(x, y \in \mathbb{D}\) with the real part of \(y\) nonzero, we can think of \(x/y\) as \(x\cdot\tilde f(y)\) where \(f\c a \mapsto 1/a\).
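Concretely, writing out \(x\cdot\tilde f(y)\) for \(x = a + b\eps\) and \(y = c + d\eps\) with \(c \ne 0\) gives \[\begin{align*}
\frac{a+b\eps}{c+d\eps} &= (a+b\eps)\left(\frac{1}{c} - \frac{d}{c^2}\eps\right) = \frac{a}{c} + \frac{bc - ad}{c^2}\eps,
\end{align*}\] which is exactly the quotient rule.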
The only thing left to do now is to implement the dual versions of some primitive functions. As an example, we consider the simple case of functions \(f\c\R^n \to \R^n\) that are defined pointwise by a function \(g\c\R\to\R\), that is, \(f(\bs{x}) = (g(x_i))_{i=1}^n\). As examples, we can think of np.sin, np.exp, np.sqrt, etc. In this case the Jacobian vector product \(J_f(\bs{x})\bs{v}\) equals \((g\pr(x_i)v_i)_{i=1}^n\) (check this!).
We could do this in the way we did before for single variable autodiff, by just creating new functions. However, it would be even cooler if we could make our DualTensor class interoperable with numpy. Luckily, numpy is actually very flexible in this regard. By defining a method __array_ufunc__ on a class, we can make numpy call this method whenever we call a numpy ufunc with an instance of our own class as an argument. This is best illustrated with a minimal example:
```python
import numpy as np


class Foo:
    def __array_ufunc__(self, ufunc: np.ufunc, method: str, *args, **kwargs) -> str:
        if ufunc is np.sin and method == "__call__":
            return "Wow Cool!"
        return NotImplemented


f = Foo()
np.sin(f)
```
'Wow Cool!'
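Applied to dual tensors, a stripped-down, self-contained variant can look like the sketch below. Only np.sin and np.exp are handled, and the class is repeated in a minimal form so the snippet runs on its own; a fuller implementation would register more ufuncs and handle the other ufunc methods:

```python
import numpy as np

# derivative of each supported pointwise ufunc; more can be added the same way
_POINTWISE_DERIVATIVES = {
    np.sin: np.cos,
    np.exp: np.exp,
}


class DualTensor:
    def __init__(self, real, dual):
        self.real = np.asarray(real)
        self.dual = np.asarray(dual)

    def __array_ufunc__(self, ufunc, method, *args, **kwargs):
        if ufunc in _POINTWISE_DERIVATIVES and method == "__call__":
            # pointwise rule: g(x + v*eps) = g(x) + g'(x) * v * eps
            return DualTensor(
                ufunc(self.real),
                _POINTWISE_DERIVATIVES[ufunc](self.real) * self.dual,
            )
        return NotImplemented


x = DualTensor(np.array([0.0, 1.0]), np.array([1.0, 1.0]))
y = np.sin(x)  # y.real == np.sin(x.real), y.dual == np.cos(x.real) * x.dual
```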
Training a Neural Net on MNIST with Dual Numbers
Now that we can automatically differentiate functions \(\R^{\bs n} \to \R^{\bs m}\), we can use this to train neural networks. We will train a small neural net to recognize handwritten digits in the MNIST dataset, the “Hello World” of Deep Learning. We start by implementing the data structure of a neural network. The code below is quite straightforward. The only noteworthy piece is the initialization function of the weight matrix of a layer. Initialization of your weights is actually a more delicate issue than it has any right to be. When initializing your weights in a naive way, there is a high probability of your gradient vanishing or exploding. The vanishing gradient problem, as it is often called, kicked me down a lot of times during training, so I ended up implementing more sophisticated initialization techniques. Which technique to use depends on the activation function of the layer. I implemented Xavier Initialization² (for use with softmax) and He Initialization (for use with relu).
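As an illustration, a drastically simplified version of such a data structure can look like the sketch below. The class layout and names are assumptions (the real NeuralNetwork lives in the repo); only the initialization scales follow He and Xavier/Glorot:

```python
import numpy as np
import numpy.typing as npt


def relu(x: npt.NDArray) -> npt.NDArray:
    return np.maximum(x, 0)


def softmax(x: npt.NDArray) -> npt.NDArray:
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()


class Layer:
    def __init__(self, n_in: int, n_out: int, activation=relu):
        if activation is relu:
            scale = np.sqrt(2.0 / n_in)            # He initialization
        else:
            scale = np.sqrt(2.0 / (n_in + n_out))  # Xavier/Glorot initialization
        self.weights = np.random.randn(n_out, n_in) * scale
        self.bias = np.zeros(n_out)
        self.activation = activation

    def forward(self, x: npt.NDArray) -> npt.NDArray:
        return self.activation(self.weights @ x + self.bias)


class NeuralNetwork:
    def __init__(self, layers: list[Layer]):
        self.layers = layers

    def forward(self, x: npt.NDArray) -> npt.NDArray:
        for layer in self.layers:
            x = layer.forward(x)
        return x
```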
We will now add all the code for backpropagation. This code required a bit more thinking. We can compute the gradient of the weight matrix of a layer as follows (a sketch of this procedure follows the list):
We are given an observation \((\bs x, y)\).
We compute the activation of the previous layer \(x\pr\).
For every position \(i, j\) in the weight matrix \(W\) of our layer, we do a forward pass through the remainder of the network using \(W + E_{i, j}\eps\) as the weight matrix. Here \(E_{i, j}\) is the matrix that is all zeros except for a one at position \((i, j)\).
We end up with a dual number \(a + b\eps\), which is the loss of the network.
The number \(b\) is now the value of our wanted gradient at position \((i, j)\).
The gradient of a bias vector is computed in a similar way.
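Here is a minimal sketch of that loop. The callable dual_loss is a hypothetical stand-in for “run the remainder of the network with the weight matrix replaced by \(W + E_{i,j}\eps\) and return the dual-valued loss”; the real code in the repo may look quite different:

```python
import numpy as np
import numpy.typing as npt


def weight_gradient(weights: npt.NDArray, dual_loss) -> npt.NDArray:
    """Forward-mode gradient of a scalar loss with respect to one weight matrix.

    dual_loss(seed) is a hypothetical callable that runs the remainder of the
    network with the weight matrix W + seed * eps and returns the loss as a
    dual number (an object with .real and .dual attributes).
    """
    grad = np.zeros_like(weights)
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            seed = np.zeros_like(weights)
            seed[i, j] = 1.0  # this is E_{i,j}: a one at (i, j), zeros elsewhere
            grad[i, j] = dual_loss(seed).dual  # derivative of the loss w.r.t. W[i, j]
    return grad
```

One full forward pass per weight entry: this is exactly why the whole approach is so slow.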
We are now ready to test our code against a real dataset: MNIST. We start by preparing our data. Images in the MNIST dataset are 28 by 28 pixels, resulting in an input size of 784. Suppose that we start with a layer with 100 neurons. That means that our first layer has a weight matrix with 78,400 entries. To compute the gradient with respect to this matrix, we need 78,400 passes through our network! So yeah, the conclusion that this was a stupid idea really was correct. I actually tried this, and computing one gradient took more than 2 seconds… Training on the MNIST training set would take around 33 hours. Even I thought this was a bit ridiculous for training on the Hello World dataset of Deep Learning. The solution is of course to make our network smaller, and the most efficient place to do that is the input size. So all images are resized to 12 by 12 pixels. Most digits I sampled were still easily recognizable.
```python
import numpy as np
import numpy.typing as npt
from PIL import Image
from datasets import load_dataset
from albumentations import Resize
import matplotlib.pyplot as plt

from dual import NeuralNetwork


def transform_image(image: Image) -> npt.NDArray:
    image = np.array(image)  # to numpy
    image = Resize(width=12, height=12)(image=image)["image"]  # resize
    image = image / 255  # normalize
    image = image.flatten()  # flatten
    return image


def one_hot_encode_label(label: int) -> npt.NDArray:
    result = np.zeros(10)
    result[label] = 1
    return result


def transform(batch):
    batch["input"] = [transform_image(image) for image in batch["input"]]
    batch["label"] = [one_hot_encode_label(label) for label in batch["label"]]
    return batch


ds = (
    load_dataset("mnist", keep_in_memory=True)
    .with_transform(transform=transform)
    .rename_column(original_column_name="image", new_column_name="input")
)
```
We now train a small network with two layers[^two-layer]. After training for only one hour, the accuracy on the test set is already up to 85%.
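For completeness, a heavily simplified training loop can look like the sketch below. The hidden layer size, the learning rate and the helper weight_gradient_for_layer (which wraps the per-entry loop from the previous sketch for one layer of the network) are all illustrative assumptions, not the repo’s actual code:

```python
import numpy as np

# assumed to exist from the sketches above: NeuralNetwork, Layer, softmax,
# and a hypothetical weight_gradient_for_layer(network, layer_index, x, y)
network = NeuralNetwork([
    Layer(144, 32),                     # 12 x 12 = 144 input pixels
    Layer(32, 10, activation=softmax),  # 10 digit classes
])
learning_rate = 0.1

for sample in ds["train"]:
    x = np.asarray(sample["input"])
    y = np.asarray(sample["label"])
    for k, layer in enumerate(network.layers):
        grad_w = weight_gradient_for_layer(network, k, x, y)
        layer.weights -= learning_rate * grad_w  # bias updates omitted for brevity
```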
The decorators on the DualNumber class are there so that we don’t have to implement both left and right addition, but can simply declare the operator to be commutative. This also holds for multiplication. Subtraction is simply defined by x-y = x + (-y). Code for the decorators can be found in dual.structure_decorators.↩︎
Named after Xavier Glorot and sometimes also referred to as Glorot Initialization. Still, Xavier Initialization is the more prominent name, and this is probably the only time I have ever seen an idea named after someone's first name.↩︎