In this blog post we will go over what we learned from the Third chapter of the fastai series. if you haven't checked out my previous post you can click here.
During this session,we'll discuss stochastic gradient descent (SGD) and how linear functions are layered with non-linear activation functions.This article focuses on a popular combination called Rectified linear function (ReLU).
Let's begin.
Click here to interact with the interactive features of the notebook
Fitting a function with gradient descent
A neural network is just a mathematical function. In the most standard kind of neural network, the function:
- Multiplies each input by a number of values. These values are known as parameters
- Adds them up for each group of values
- Replaces the negative numbers with zeros
This represents one "layer". Then these three steps are repeated, using the outputs of the previous layer as the inputs to the next layer. Initially, the parameters in this function are selected randomly. Therefore a newly created neural network doesn't do anything useful at all -- it's just random!
To get the function to "learn" to do something useful, we have to change the parameters to make them "better" in some way. We do this using gradient descent. Let's see how this works...
from ipywidgets import interact
from fastai.basics import *
plt.rc('figure', dpi=90)
def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
x = torch.linspace(min,max, 100)[:,None]
if ylim: plt.ylim(ylim)
plt.plot(x, f(x), color)
if title is not None: plt.title(title)
To learn how gradient descent works, we're going to start by fitting a quadratic, since that's a function most of us are probably more familiar with than a neural network. Here's the quadratic we're going to try to fit:
def f(x): return 3*x**2 + 2*x + 1
plot_function(f, "$3x^2 + 2x + 1$")
This quadratic is of the form $ax^2+bx+c$, with parameters $a=3$, $b=2$, $c=1$. To make it easier to try out different quadratics for fitting a model to the data we'll create, let's create a function that calculates the value of a point on any quadratic:
def quad(a, b, c, x): return a*x**2 + b*x + c
If we fix some particular values of a, b, and c, then we'll have made a quadratic. To fix values passed to a function in python, we use the partial
function, like so:
def mk_quad(a,b,c): return partial(quad, a,b,c)
So for instance, we can recreate our previous quadratic:
f2 = mk_quad(3,2,1)
plot_function(f2)
Now let's simulate making some noisy measurements of our quadratic f
. We'll then use gradient descent to see if we can recreate the original function from the data.
Here's a couple of functions to add some random noise to data:
def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)
def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)
Let's use the now to create our noisy measurements based on the quadratic above:
np.random.seed(42)
x = torch.linspace(-2, 2, steps=20)[:,None]
y = add_noise(f(x), 0.15, 1.5)
Here's the first few values of each of x and y:
x[:5],y[:5]
As you can see, they're tensors. A tensor is just like an array
in numpy (if you're not familiar with numpy, I strongly recommend reading this great book, because it's a critical foundation for nearly all numeric programming in Python. Furthermore, PyTorch, which most researchers use for deep learning, is modeled closely on numpy.) A tensor can be a single number (a scalar or rank-0 tensor), a list of numbers (a vector or rank-1 tensor), a table of numbers (a matrix or rank-2 tensor), a table of tables of numbers (a rank-3 tensor), and so forth.
We're not going to learn much about our data by just looking at the raw numbers, so let's draw a picture:
plt.scatter(x,y);
How do we find values of a, b, and c which fit this data? One approach is to try a few values and see what fits. Here's a function which overlays a quadratic on top of our data, along with some sliders to change a, b, and c, and see how it looks:
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
plt.scatter(x,y)
plot_function(mk_quad(a,b,c), ylim=(-3,13))
Try moving slider a
a bit to the left. Does that look better or worse? How about if you move it a bit to the right? Find out which direction seems to improve the fit of the quadratic to the data, and move the slider a bit in that direction. Next, do the same for slider b
: first figure out which direction improves the fit, then move it a bit in that direction. Then do the same for c
.
OK, now go back to slider a
and repeat the process. Do it again for b
and c
as well.
Did you notice that by going back and doing the sliders a second time that you were able to improve things a bit further? That's an important insight -- it's only after changing b
and c
, for instance, that you realise that a
actually needs some adjustment based on those new values.
One thing that's making this tricky is that we don't really have a great sense of whether our fit is really better or worse. It would be easier if we had a numeric measure of that. On easy metric we could use is mean absolute error -- which is the distance from each data point to the curve:
def mae(preds, acts): return (torch.abs(preds-acts)).mean()
We'll update our interactive function to print this at the top for us.
Use this to repeat the approach we took before to try to find the best fit, but this time just use the value of the metric to decide which direction to move each slider, and how far to move it.
This time around, try doing it in the opposite order: c, then b, then a.
You'll probably find that you have to go through the set of sliders a couple of times to get the best fit.
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
f = mk_quad(a,b,c)
plt.scatter(x,y)
loss = mae(f(x), y)
plot_function(f, ylim=(-3,12), title=f"MAE: {loss:.2f}")
In a modern neural network we'll often have tens of millions of parameters to fit, or more, and thousands or millions of data points to fit them to. We're not going to be able to do that by moving sliders around! We'll need to automate this process.
Thankfully, that turns out to be pretty straightforward. We can use calculus to figure out, for each parameter, whether we should increase or decrease it.
Uh oh, calculus! If you haven't touched calculus since school, you might be getting ready to run away at this point. But don't worry, we don't actually need much calculus at all. Just derivatives, which measure the rate of change of a function. We don't even need to calculate them ourselves, because the computer will do it for us! If you've forgotten what a derivitive is, then watch the first three of these fantastic videos by Professor Dave. It's only 15 minutes in total, so give it a go! Then come back here and we'll continue on our journey...
Automating gradient descent
The basic idea is this: if we know the gradient of our mae()
function with respect to our parameters, a
, b
, and c
, then that means we know how adjusting (for instance) a
will change the value of mae()
. If, say, a
has a negative gradient, then we know that increasing a
will decrease mae()
. Then we know that's what we need to do, since we trying to make mae()
as low as possible.
So, we find the gradient of mae()
for each of our parameters, and then adjust our parameters a bit in the opposite direction to the sign of the gradient.
To do this, first we need a function that takes all the parameters a
, b
, and c
as a single vector input, and returns the value mae()
based on those parameters:
def quad_mae(params):
f = mk_quad(*params)
return mae(f(x), y)
Lets try it:
quad_mae([1.1, 1.1, 1.1])
Yup, that's the same as the starting mae() we had in our plot before.
We're first going to do exactly the same thing as we did manually -- pick some arbritrary starting point for our parameters. We'll put them all into a single tensor:
abc = torch.tensor([1.1,1.1,1.1])
To tell PyTorch that we want it to calculate gradients for these parameters, we need to call requiresgrad():
abc.requires_grad_()
tensor([1.1000, 1.1000, 1.1000], requires_grad=True)
We can now calculate mae()
. Generally, when doing gradient descent, the thing we're trying to minimise is called the loss:
loss = quad_mae(abc)
loss
tensor(2.4219, dtype=torch.float64, grad_fn=)
To get PyTorch to now calculate the gradients, we need to call backward()
loss.backward()
The gradients will be stored for us in an attribute called grad
:
abc.grad
tensor([-1.3529, -0.0316, -0.5000])
According to these gradients, all our parameters are a little low. So let's increase them a bit. If we subtract the gradient, multiplied by a small number, that should improve them a bit:
with torch.no_grad():
abc -= abc.grad*0.01
loss = quad_mae(abc)
print(f'loss={loss:.2f}')
loss=2.40
Yes, our loss has gone down!
The "small number" we multiply is called the learning rate, and is the most important hyper-parameter to set when training a neural network.
BTW, you'll see we had to wrap our calculation of the new parameters in with torch.no_grad()
. That disables the calculation of gradients for any operations inside that context manager. We have to do that, because abc -= abc.grad*0.01
isn't actually part of our quadratic model, so we don't want derivitives to include that calculation.
We can use a loop to do a few more iterations of this:
for i in range(10):
loss = quad_mae(abc)
loss.backward()
with torch.no_grad(): abc -= abc.grad*0.01
print(f'step={i}; loss={loss:.2f}')
**step=0; loss=2.40
step=1; loss=2.36
step=2; loss=2.30
step=3; loss=2.21
step=4; loss=2.11
step=5; loss=1.98
step=6; loss=1.85
step=7; loss=1.72
step=8; loss=1.58
step=9; loss=1.46 ** As you can see, our loss keeps going down!
If you keep running this loop for long enough however, you'll see that the loss eventually starts increasing for a while. That's because once the parameters get close to the correct answer, our parameter updates will jump right over the correct answer! To avoid this, we need to decrease our learning rate as we train. This is done using a learning rate schedule, and can be automated in most deep learning frameworks, such as fastai and PyTorch.
How a neural network approximates any given function
But neural nets are much more convenient and powerful than this example showed, because we can learn much more than just a quadratic with them. How does that work?
The trick is that a neural network is a very expressive function. In fact -- it's infinitely expressive. A neural network can approximate any computable function, given enough parameters. A "computable function" can cover just about anything you can imagine: understand and translate human speech; paint a picture; diagnose a disease from medical imaging; write an essay; etc...
The way a neural network approximates a function actually turns out to be very simple. The key trick is to combine two extremely basic steps:
- Matrix multiplication, which is just multiplying things together and then adding them up
- The function $max(x,0)$, which simply replaces all negative numbers with zero.
In PyTorch, the function $max(x,0)$ is written as np.clip(x,0)
. The combination of a linear function and this max() is called a rectified linear function, and it can be implemented like this:
def rectified_linear(m,b,x):
y = m*x+b
return torch.clip(y, 0.)
Here's what it looks like:
plot_function(partial(rectified_linear, 1,1))
BTW, instead of torch.clip(y, 0.)
, we can instead use F.relu(x)
, which does exactly the same thing. In PyTorch, F
refers to the torch.nn.functional
module.
import torch.nn.functional as F
def rectified_linear2(m,b,x): return F.relu(m*x+b)
plot_function(partial(rectified_linear2, 1,1))
To understand how this function works, try using this interactive version to play around with the parameters m
and b
:
@interact(m=1.5, b=1.5)
def plot_relu(m, b):
plot_function(partial(rectified_linear, m,b), ylim=(-1,4))
As you see, m
changes the slope, and b
changes where the "hook" appears. This function doesn't do much on its own, but look what happens when we add two of them together:
def double_relu(m1,b1,m2,b2,x):
return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)
@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))
If you play around with that for a while, you notice something quite profound: with enough of these rectified linear functions added together, you could approximate any function with a single input, to whatever accuracy you like! Any time the function doesn't quite match, you can just add a few more additions to the mix to make it a bit closer. As an experiment, perhaps you'd like to try creating your own plot_triple_relu
interactive function, and maybe even include the scatter plot of our data from before, to see how close you can get?
This exact same approach can be expanded to functions of 2, 3, or more parameters.