[00:03] (3.24s)
you are currently watching an artificial
[00:05] (5.68s)
neural network
[00:07] (7.28s)
learn in this video as we watch neural
[00:10] (10.00s)
networks learn you will learn how they
[00:12] (12.12s)
learn this is a continuation of my
[00:14] (14.52s)
previous videos about neural networks as
[00:16] (16.72s)
universal function approximators and
[00:19] (19.08s)
today I will finally explain the
[00:21] (21.28s)
learning algorithm itself and already
[00:24] (24.24s)
there are 500 pedantic comments saying
[00:26] (26.60s)
uh it's not real learning cuz it's not
[00:28] (28.72s)
conscious and it's just matrix
[00:30] (30.24s)
multiplications and AI is neither A nor
[00:33] (33.20s)
I that sounds smart yeah yeah whatever
[00:35] (35.88s)
nerds nobody cares by learning I mean
[00:39] (39.36s)
optimizing we will compare two different
[00:41] (41.84s)
optimization algorithms for neural
[00:43] (43.88s)
networks the state-of-the-art stochastic
[00:46] (46.52s)
gradient descent and for contrast a
[00:49] (49.08s)
simple evolutionary algorithm we will
[00:51] (51.56s)
watch real neural networks learn simple
[00:53] (53.72s)
functions that let us visualize the
[00:55] (55.64s)
actual training process in action I will
[00:58] (58.44s)
also be using a little interactive web toy
[01:00] (60.72s)
that I made you can play with this right
[01:02] (62.44s)
now it is a free and open-source
[01:04] (64.52s)
website I hope to provide some insight
[01:07] (67.00s)
for both newbies and experts and show
[01:09] (69.48s)
the similarities and differences the
[01:11] (71.60s)
advantages and disadvantages of these
[01:14] (74.16s)
algorithms and ultimately explain why
[01:16] (76.64s)
one of them is the most optimized
[01:20] (80.24s)
Optimizer again I recommend you watch my
[01:22] (82.84s)
other neural network videos but in short
[01:25] (85.24s)
functions are input output machines
[01:27] (87.88s)
numbers in, numbers out. Neural networks are
[01:30] (90.80s)
also functions and can reconstruct or
[01:33] (93.72s)
approximate any other function to any
[01:36] (96.28s)
degree of precision given only a sample
[01:38] (98.64s)
of its data points they are Universal
[01:41] (101.84s)
function approximators and this makes
[01:43] (103.72s)
them extremely general purpose for real
[01:46] (106.04s)
world tasks because functions describe
[01:51] (111.00s)
the world yes the network's shape as a
[01:54] (114.56s)
function is defined by its set of
[01:56] (116.56s)
parameters the values of its weights and
[01:58] (118.96s)
biases parameters are designed to be
[02:01] (121.80s)
changeable tunable and different values
[02:04] (124.24s)
produce different shapes more parameters
[02:06] (126.68s)
let you build more complex functions
[02:09] (129.20s)
with four parameters I can fit an
[02:10] (130.76s)
elephant and with five I can make him
[02:12] (132.72s)
wiggle his
[02:13] (133.76s)
trunk this web toy has an extremely
[02:16] (136.52s)
simple network with two parameters A and
[02:18] (138.92s)
B which we can control down here the
[02:22] (142.00s)
green line is the approximation and the
[02:24] (144.32s)
blue line is the target function and
[02:26] (146.28s)
data set it's a sine wave the neural
[02:29] (149.00s)
Network's actual function definition is
[02:31] (151.24s)
here; it's using a tanh activation.
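As a rough sketch in Python: the toy's exact formula isn't spelled out in this transcript, so treating it as a single tanh neuron with two tunable parameters is an assumption on my part, but it would look something like this.

```python
import numpy as np

def toy_net(x, a, b):
    # Assumed form of the two-parameter toy: a single tanh neuron.
    # The real web toy may wire a and b up differently.
    return np.tanh(a * x + b)

# Different values of a and b bend and shift the green curve.
xs = np.linspace(-3, 3, 7)
print(toy_net(xs, a=1.5, b=-0.5))
```

And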
[02:34] (154.84s)
as we change the values we can see this
[02:37] (157.16s)
Green Point moving around down here this
[02:40] (160.04s)
is parameter space the space of all
[02:42] (162.52s)
possible combinations of all possible
[02:44] (164.72s)
values of A and B between -5 and +5 it
[02:49] (169.20s)
is a 2d plane where the parameters are
[02:52] (172.00s)
the coordinates 2D for 2 params each
[02:55] (175.72s)
point represents a different network and
[02:58] (178.04s)
collectively this is the space of all
[03:00] (180.36s)
possible networks with this particular
[03:02] (182.36s)
architecture with two parameters wired
[03:04] (184.40s)
up in this specific way note that this
[03:06] (186.80s)
space would be very difficult to
[03:08] (188.32s)
visualize if we had lots of parameters
[03:10] (190.64s)
like hundreds this problem will haunt us
[03:13] (193.00s)
throughout the video but we will keep it
[03:14] (194.60s)
simple for now so given a Target
[03:17] (197.52s)
function it is clear that some of these
[03:19] (199.40s)
parameters fit the target function
[03:21] (201.08s)
better than others we can search the
[03:23] (203.36s)
space manually to find good ones but how
[03:25] (205.56s)
do we automatically find the best ones
[03:28] (208.72s)
this is the job of our optimization
[03:31] (211.00s)
algorithm which is fundamentally solving
[03:33] (213.40s)
a search problem it must search the
[03:36] (216.00s)
space for the best set of parameters
[03:38] (218.12s)
that produces the best
[03:40] (220.16s)
approximation in order to see which is
[03:42] (222.44s)
best we need a way to evaluate a network
[03:45] (225.44s)
given a set of parameters tell me how
[03:47] (227.48s)
well a network with those parameters
[03:49] (229.64s)
fits the target function remember that
[03:51] (231.92s)
we don't actually know what the target
[03:53] (233.40s)
function is we only have a sample of
[03:55] (235.44s)
data points so let's take three data
[03:57] (237.28s)
points from a sine wave we can now give
[03:59] (239.84s)
our Network each input from the data set
[04:02] (242.08s)
and ask it to predict each output and
[04:04] (244.36s)
then compare the predicted outputs to
[04:06] (246.36s)
the true outputs we measure the
[04:08] (248.44s)
difference between these two values
[04:10] (250.00s)
using what's called a loss function in
[04:12] (252.48s)
this case we will use mean squared error
[04:14] (254.76s)
but there are many other loss functions
[04:16] (256.32s)
for different tasks we calculate loss
[04:18] (258.80s)
for all predicted versus true outputs
[04:21] (261.52s)
and then take the average across the
[04:23] (263.12s)
entire data set which gives us a single
[04:25] (265.60s)
final score for a given network: its
[04:28] (268.36s)
loss. Loss is a measure of error so
[04:31] (271.60s)
lower loss is better we're playing golf
[04:33] (273.84s)
here low loss means a small difference
[04:36] (276.20s)
between predicted and true outputs which
[04:38] (278.28s)
means you have a good approximation.
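In code, that evaluation is only a handful of lines; here's a minimal sketch reusing the assumed two-parameter tanh toy from earlier (the three sample points are illustrative, not the video's exact data):

```python
import numpy as np

def evaluate(params, xs, ys):
    # Mean squared error of the toy network's predictions against the data set.
    a, b = params
    predictions = np.tanh(a * xs + b)        # predicted outputs
    return np.mean((predictions - ys) ** 2)  # average squared difference = loss

# Three data points sampled from a sine wave.
xs = np.array([-1.0, 0.0, 1.0])
ys = np.sin(xs)
print(evaluate((1.5, -0.5), xs, ys))         # one number: lower is better
```

This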
[04:40] (280.76s)
big calculation can be generalized into
[04:43] (283.28s)
a function the loss landscape function
[04:46] (286.04s)
which takes a set of parameters as input
[04:48] (288.64s)
and outputs the loss for those
[04:50] (290.64s)
parameters on our current data set now
[04:53] (293.40s)
remember parameter space using this
[04:55] (295.64s)
function we can visit every point in
[04:57] (297.72s)
this space and calculate loss for every
[05:00] (300.08s)
set of parameters every Network and
[05:02] (302.52s)
visualize the loss as the height of the
[05:04] (304.64s)
plane at that point this is the loss
[05:08] (308.76s)
landscape it tells us how good every
[05:11] (311.44s)
possible Network in this space is at
[05:13] (313.84s)
approximating the target function.
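With only two parameters you really can chart the whole landscape by brute force; a sketch under the same assumptions as before (grid resolution and data points are my own choices):

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])
ys = np.sin(xs)

def loss(a, b):
    return np.mean((np.tanh(a * xs + b) - ys) ** 2)

# Sweep a grid over parameter space (a and b from -5 to +5) and record
# the loss at every grid point: the "height" of the landscape there.
a_vals = np.linspace(-5, 5, 101)
b_vals = np.linspace(-5, 5, 101)
landscape = np.array([[loss(a, b) for b in b_vals] for a in a_vals])

lowest = np.unravel_index(np.argmin(landscape), landscape.shape)
print("lowest point near a =", a_vals[lowest[0]], ", b =", b_vals[lowest[1]])
```

The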
[05:16] (316.48s)
Green Point represents our current
[05:18] (318.12s)
Network the highest points are the worst
[05:20] (320.20s)
networks the lowest points are the best
[05:22] (322.48s)
we want the lowest mathematically
[05:25] (325.08s)
optimization simply means you are either
[05:27] (327.08s)
minimizing or maximizing a function or
[05:30] (330.16s)
some kind of Min maxing but we're not
[05:31] (331.68s)
worrying about that in this video for
[05:33] (333.80s)
minimizing you're trying to find the
[05:35] (335.60s)
best inputs that produce the lowest
[05:37] (337.80s)
output we are trying to minimize the
[05:40] (340.24s)
loss landscape function whose inputs are
[05:43] (343.44s)
parameters the data set values are not
[05:46] (346.04s)
passed in as variables but rather
[05:47] (347.84s)
treated as constants numbers hardwired
[05:50] (350.32s)
into the function they don't change no
[05:52] (352.36s)
matter what your parameters are this
[05:54] (354.68s)
requires a conceptual rewiring of our
[05:57] (357.12s)
neural network such that what were
[05:59] (359.08s)
variables (data set inputs) are now
[06:01] (361.12s)
constants and what were constants
[06:03] (363.08s)
(parameters) are now variables thinking
[06:05] (365.68s)
this way allows us to optimize input
[06:07] (367.92s)
parameters to minimize output loss we
[06:11] (371.12s)
can now search parameter space using the
[06:13] (373.52s)
loss landscape as our map we can just
[06:16] (376.68s)
visually look at it and see that the
[06:18] (378.32s)
lowest point is here so these parameters
[06:20] (380.48s)
make for the best network problem solved
[06:22] (382.56s)
we're done except of course in order to
[06:24] (384.88s)
show this landscape I have to generate
[06:27] (387.00s)
and evaluate every possible variation of
[06:29] (389.32s)
the Network in this space at least with
[06:32] (392.00s)
two parameters this is easy but two
[06:34] (394.00s)
parameters is nothing even on easy
[06:36] (396.24s)
problems a good approximation requires
[06:38] (398.32s)
hundreds of parameters which is both
[06:40] (400.20s)
impossible to visualize and intractable
[06:42] (402.48s)
to compute you cannot realistically
[06:44] (404.56s)
generate every possible combination of
[06:46] (406.92s)
100 parameters in a 100 dimensional
[06:49] (409.28s)
hyperspace and that is still very small
[06:51] (411.36s)
for a neural network with that many
[06:53] (413.44s)
parameters the loss landscape function
[06:55] (415.40s)
can still be defined and the loss for an
[06:58] (418.08s)
individual point can be calculated
[07:00] (420.32s)
but the entire landscape can never be
[07:02] (422.48s)
explicitly calculated in the way that
[07:04] (424.28s)
we've done here so our optimization
[07:07] (427.36s)
algorithm must make do with this
[07:09] (429.32s)
limitation it must scale even with
[07:11] (431.92s)
absurdly high numbers of parameters
[07:14] (434.32s)
we've kept the dimensionality low so we
[07:16] (436.20s)
can visualize it but our algorithm must
[07:18] (438.72s)
navigate the landscape
[07:20] (440.30s)
[Music]
[07:24] (444.32s)
blind imagine you are near the top of a
[07:27] (447.04s)
mountain range the landscape is rugged
[07:29] (449.32s)
with many peaks and valleys the air
[07:31] (451.56s)
is cold and thin and you must find your
[07:34] (454.08s)
way down the mountain but there's a
[07:36] (456.72s)
hitch you are blind I'm going to
[07:39] (459.48s)
interview Erik Weihenmayer who climbed the
[07:42] (462.48s)
highest mountain in the world Mount
[07:43] (463.92s)
Everest but he's gay I mean he's gay
[07:47] (467.28s)
excuse me he's blind so we'll hear about
[07:49] (469.68s)
that coming up okay that's you they're
[07:51] (471.32s)
talking about blind and gay and stuck at
[07:53] (473.76s)
the top of Loss Mountain how do you find
[07:56] (476.08s)
your way down well one way you can try
[07:58] (478.76s)
is to Simply step in a random Direction
[08:01] (481.16s)
and if you go down commit to the step
[08:03] (483.64s)
otherwise do another random Step better
[08:05] (485.92s)
yet take like 10 random steps and only
[08:08] (488.60s)
commit to the best one repeat this over
[08:11] (491.04s)
and over this is the evolutionary way or
[08:14] (494.88s)
you could just feel the slope of the
[08:17] (497.04s)
ground under your feet and step in
[08:19] (499.48s)
exactly the direction that it feels
[08:21] (501.56s)
steepest downhill that is the gradient
[08:24] (504.64s)
descent way both are hill climbing
[08:27] (507.64s)
optimization algorithms step by step
[08:30] (510.56s)
they climb down loss Mountain seeking
[08:33] (513.40s)
the lowest point on the entire landscape
[08:35] (515.96s)
the global
[08:37] (517.72s)
minimum let's start with the
[08:39] (519.72s)
evolutionary way random guessing and
[08:42] (522.00s)
checking we'll use an algorithm known as
[08:44] (524.60s)
a local search and it is just about the
[08:46] (526.88s)
most Bare Bones genetic algorithm that
[08:49] (529.04s)
you could ask for start with a random
[08:51] (531.44s)
network, a random point in parameter
[08:53] (533.64s)
space copy it and mutate the copy's
[08:56] (536.20s)
parameters by adding a small random
[08:58] (538.48s)
value to all or some of them this moves
[09:00] (540.92s)
the point around a bit do this several
[09:03] (543.08s)
times so generate a population of
[09:05] (545.28s)
candidate random steps calculate loss
[09:07] (547.80s)
for all and choose the one with the
[09:09] (549.64s)
lowest loss for the next round and
[09:12] (552.00s)
repeat repeat repeat gradually step by
[09:15] (555.08s)
step its descendants descend the
[09:17] (557.48s)
mountain.
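A bare-bones sketch of that loop on the same assumed two-parameter toy (the population size, mutation scale, and round count are illustrative, not the video's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([-1.0, 0.0, 1.0])
ys = np.sin(xs)

def loss(params):
    a, b = params
    return np.mean((np.tanh(a * xs + b) - ys) ** 2)

params = rng.uniform(-5, 5, size=2)                    # random starting network
for round_number in range(150):
    # Mutate: a population of copies, each nudged by small random values.
    mutants = params + rng.normal(scale=0.3, size=(1000, 2))
    candidates = np.vstack([params, mutants])          # keep the parent too,
                                                       # so loss never goes up
    # Select: the candidate with the lowest loss survives to the next round.
    params = min(candidates, key=loss)

print("best parameters:", params, "loss:", loss(params))
```

The parameters are the genome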
[09:20] (560.44s)
of the neural network and our loss
[09:22] (562.32s)
function is the fitness function or
[09:24] (564.60s)
anti-fitness function as biological
[09:27] (567.52s)
life seeks the highest point on the
[09:29] (569.52s)
fitness landscape we seek the lowest
[09:31] (571.88s)
point on the loss landscape this website
[09:34] (574.80s)
is actually performing a local search
[09:36] (576.88s)
which works pretty well on this problem
[09:38] (578.76s)
it can quickly find the lowest point on
[09:40] (580.76s)
the landscape we say it converges on the
[09:43] (583.80s)
global minimum it reaches a point where
[09:45] (585.76s)
it just keeps getting closer and closer
[09:47] (587.68s)
and closer it is not guaranteed to
[09:50] (590.28s)
converge on the global minimum I will
[09:52] (592.16s)
explain more about that in a bit either
[09:54] (594.28s)
way it finds the best approximation but
[09:56] (596.56s)
it still isn't a great approximation and
[09:58] (598.56s)
of course it's not; two parameters does
[10:00] (600.64s)
not make for a very useful neural
[10:02] (602.40s)
network so let's switch over to a real
[10:05] (605.08s)
PyTorch neural network and evolve it
[10:07] (607.20s)
using a local search this network has
[10:11] (611.08s)
1,741 parameters so we can no longer
[10:13] (613.96s)
observe the loss landscape directly but
[10:16] (616.12s)
we can observe the Learned function it
[10:18] (618.60s)
mutates a random subset of parameters at
[10:21] (621.08s)
a time and you can actually see it
[10:22] (622.96s)
adjusting them as it learns the mutation
[10:26] (626.08s)
rate gets smaller over time so it can
[10:28] (628.44s)
hopefully learn finer and finer
[10:30] (630.28s)
details I ran this for 150 rounds each
[10:33] (633.84s)
with a population of a thousand so I had to
[10:36] (636.48s)
generate 150,000 networks for this
[10:38] (638.96s)
footage that is a sacrifice I'm willing
[10:41] (641.24s)
to make that's pretty good now let's try
[10:44] (644.76s)
a more complex thing an image we'll use
[10:48] (648.16s)
a smiley face refresher on how this
[10:50] (650.56s)
works the network takes the pixel
[10:52] (652.60s)
coordinates as input and outputs a
[10:54] (654.92s)
single Pixel value two inputs one output
[10:58] (658.36s)
this image is made of every output
[11:00] (660.52s)
for every pixel you treat every pixel in
[11:03] (663.52s)
the Target image as a data point in your
[11:05] (665.68s)
data set and approximate it.
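Turning an image into that kind of data set only takes a few lines; a sketch (the resolution and the coordinate scaling here are assumptions on my part):

```python
import numpy as np

# Stand-in for the target image: a small grayscale array with values in [0, 1].
image = np.random.rand(28, 28)

h, w = image.shape
rows, cols = np.mgrid[0:h, 0:w]

# Inputs: (x, y) pixel coordinates, scaled to roughly [-1, 1].
inputs = np.stack([cols / (w - 1) * 2 - 1,
                   rows / (h - 1) * 2 - 1], axis=-1).reshape(-1, 2)
# Targets: one brightness value per pixel.
targets = image.reshape(-1, 1)

print(inputs.shape, targets.shape)   # (784, 2) and (784, 1): 784 data points
```

Every frame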
[11:09] (669.08s)
is a successful mutation a strict
[11:11] (671.52s)
Improvement on the frame before loss can
[11:14] (674.44s)
only go down or stay the same this task
[11:17] (677.88s)
is harder than the previous one it takes
[11:20] (680.16s)
a while but it does eventually
[11:23] (683.20s)
converge now if we give it a proper
[11:25] (685.48s)
image like this picture of Mr Darwin
[11:27] (687.52s)
himself we can see just how much it
[11:30] (690.12s)
struggles to optimize I had to tune the
[11:33] (693.20s)
algorithm quite a bit to get it to
[11:34] (694.72s)
perform even decently the values that I
[11:37] (697.48s)
tuned: the population size, the number of
[11:39] (699.92s)
rounds the number of neurons these
[11:42] (702.00s)
values are called hyperparameters they
[11:44] (704.32s)
are not learned during training but set
[11:46] (706.64s)
beforehand you can tune them by hand as
[11:48] (708.84s)
I have or use a meta optimization
[11:51] (711.40s)
algorithm to find good hyperparameters
[11:53] (713.96s)
we're not doing that
[11:56] (716.24s)
here so it seems to have hit a wall this
[11:59] (719.56s)
may be due to the local search getting
[12:01] (721.56s)
stuck in one of the many shallow valleys
[12:04] (724.16s)
that are still high up on loss Mountain
[12:06] (726.92s)
these are local Minima and they are
[12:09] (729.44s)
where hill climbers go to die our local
[12:12] (732.52s)
search cannot climb up and out it can
[12:14] (734.96s)
only go down loss always goes down
[12:18] (738.64s)
likewise real genetic fitness always
[12:20] (740.92s)
goes up and it too can get stuck on low
[12:23] (743.80s)
hilltops it can only make do with what
[12:26] (746.20s)
it's got it cannot go back to the
[12:28] (748.08s)
drawing board our algorithm might be
[12:30] (750.76s)
getting stuck in one of these local
[12:32] (752.56s)
Minima but if you just keep running it
[12:35] (755.60s)
it does keep improving very slowly
[12:39] (759.60s)
gradualism baby we could improve this by
[12:42] (762.44s)
keeping a diverse population of good
[12:44] (764.64s)
candidates throughout the entire run
[12:46] (766.60s)
mixing their parameters with crossover
[12:48] (768.44s)
mutations we could be mutating the
[12:50] (770.72s)
neural architecture itself and other
[12:52] (772.96s)
hyperparameters along the way I will
[12:55] (775.40s)
explore these in a future video but for
[12:57] (777.36s)
now we must move on a local search works
[13:00] (780.00s)
pretty well if you have a billion years
[13:02] (782.12s)
but if you don't want to die before it
[13:04] (784.00s)
converges you will need something like
[13:06] (786.08s)
gradient descent stochastic gradient
[13:09] (789.08s)
descent or SGD stochastic means it
[13:12] (792.04s)
involves Randomness descent it's going
[13:14] (794.16s)
down the hill and the gradient is our
[13:16] (796.56s)
secret weapon the gradient is a vector a
[13:19] (799.84s)
list of numbers each holding the slope
[13:22] (802.04s)
of the corresponding parameter on loss
[13:24] (804.40s)
Mountain if you have two parameters you
[13:26] (806.84s)
have two gradient values collectively
[13:29] (809.44s)
they make a line seen here actually that
[13:32] (812.32s)
is the line tangent to the surface which
[13:34] (814.24s)
is not quite the same thing but it
[13:35] (815.92s)
includes the gradient the gradient
[13:37] (817.96s)
Vector points in the direction of the
[13:39] (819.84s)
steepest slope from the current position
[13:42] (822.16s)
on the landscape give it a negative sign
[13:44] (824.92s)
and it points to the steepest slope
[13:47] (827.28s)
downhill the gradient is our compass on
[13:50] (830.04s)
Loss Mountain it is the slope of the ground
[13:52] (832.64s)
that we can feel under our feet
[13:55] (835.00s)
calculating it requires some calculus
[13:57] (837.24s)
multivariable calc I defined the loss
[14:00] (840.08s)
landscape function a bit more generally
[14:02] (842.00s)
here we need to take its derivative with
[14:04] (844.64s)
respect to each parameter and we can do
[14:07] (847.12s)
this because our loss landscape function
[14:09] (849.28s)
is always well defined even though our
[14:11] (851.52s)
Target function is unknown so we start
[14:14] (854.72s)
by using the derivative law known as use
[14:16] (856.92s)
Wolfram Alpha and here's the
[14:18] (858.56s)
derivatives go watch 3Blue1Brown's
[14:18] (858.56s)
video for the mathy math there's
[14:22] (862.16s)
a chain rule in there somewhere that
[14:24] (864.16s)
gives us a partial derivative function
[14:26] (866.08s)
for each parameter which combined gives
[14:28] (868.48s)
us our gradient vector these
[14:30] (870.52s)
functions tell us how loss changes as
[14:32] (872.96s)
each parameter changes this gets
[14:35] (875.40s)
complicated for large neural networks
[14:37] (877.32s)
but it can be fully automated.
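In PyTorch that automation is autograd; a minimal sketch on the assumed two-parameter toy from earlier:

```python
import torch

# The data set is baked in as constants...
xs = torch.tensor([-1.0, 0.0, 1.0])
ys = torch.sin(xs)

# ...and the parameters are the variables we differentiate with respect to.
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(-0.5, requires_grad=True)

loss = torch.mean((torch.tanh(a * xs + b) - ys) ** 2)  # forward pass
loss.backward()                                        # autograd applies the chain rule

print(a.grad, b.grad)   # the gradient: the slope of loss along each parameter
```

Now we can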
[14:40] (880.48s)
calculate the gradient for any point and
[14:42] (882.64s)
step directly towards the steepest
[14:44] (884.80s)
downhill Direction by simply subtracting
[14:47] (887.76s)
the gradient from the parameters except
[14:50] (890.84s)
that's probably too much you will
[14:52] (892.44s)
overshoot it you want to scale the
[14:54] (894.72s)
gradient down with a learning rate
[14:56] (896.68s)
hyperparameter, something like 0.02.
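The update itself is just a scaled subtraction; a sketch of one step (the learning rate echoes the ballpark figure above, everything else is illustrative):

```python
import torch

xs = torch.tensor([-1.0, 0.0, 1.0]); ys = torch.sin(xs)
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(-0.5, requires_grad=True)

torch.mean((torch.tanh(a * xs + b) - ys) ** 2).backward()  # fills a.grad and b.grad

learning_rate = 0.02               # scales the step down so we don't overshoot
with torch.no_grad():
    a -= learning_rate * a.grad    # minus sign: step downhill, not uphill
    b -= learning_rate * b.grad
    a.grad.zero_()                 # clear the gradients before the next step
    b.grad.zero_()
```

The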
[14:59] (899.68s)
gradient tells you the optimal Direction
[15:01] (901.92s)
but not the optimal step size you're
[15:03] (903.96s)
still blind you don't know what the
[15:05] (905.88s)
function will look like in the next step
[15:08] (908.48s)
there are clever ways of choosing better
[15:10] (910.32s)
step sizes we'll talk more about that
[15:12] (912.20s)
later just know that because the
[15:14] (914.12s)
gradient tells you the best direction to
[15:16] (916.04s)
step there is no guessing and checking
[15:18] (918.64s)
this is a crucial Advantage we once
[15:21] (921.68s)
again randomly initialize our Network
[15:23] (923.88s)
and evaluate loss with a forward pass
[15:26] (926.88s)
then we go back through the entire
[15:28] (928.60s)
network calculate the gradient for each
[15:30] (930.88s)
parameter and apply the step which moves
[15:33] (933.36s)
the point this is the backwards pass AKA
[15:36] (936.84s)
back propagation and repeat this results
[15:40] (940.12s)
in a much smoother more precise stepping
[15:42] (942.92s)
algorithm than Evolution it is still a
[15:45] (945.56s)
gradualistic hill climber but more
[15:47] (947.56s)
direct and less Spazzy although it can
[15:49] (949.96s)
get jumpy on steep slopes it is not
[15:52] (952.44s)
guaranteed to always reduce loss the
[15:55] (955.84s)
stochastic part comes from the random
[15:58] (958.00s)
initialization but also something called
[16:00] (960.28s)
random batching rather than training on
[16:02] (962.64s)
the entire data set all at once we train
[16:05] (965.08s)
on small random subsets of the data
[16:07] (967.48s)
called batches we go through every batch
[16:09] (969.96s)
in the data set taking a step for each
[16:12] (972.28s)
and then randomize the batches for the
[16:14] (974.28s)
next pass you can simulate this effect
[16:16] (976.96s)
with simulate batching which purely
[16:19] (979.16s)
randomizes the data on every step
[16:21] (981.84s)
because the landscape is defined by the
[16:24] (984.12s)
data set it changes with every batch but
[16:26] (986.88s)
in the long run these variations will
[16:28] (988.96s)
average out to the same landscape
[16:31] (991.52s)
mathematically, in expectation, it is equivalent to train
[16:33] (993.52s)
on the entire data set all at once as it
[16:36] (996.12s)
is to train on small random batches
[16:38] (998.48s)
though in practice hyperparameters must
[16:40] (1000.40s)
be tuned batching is extremely useful
[16:43] (1003.28s)
because it allows us to spread out the
[16:45] (1005.20s)
expensive back propagation algorithm
[16:47] (1007.48s)
over more compute
[16:51] (1011.60s)
time.
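Put together, a PyTorch training loop in that style looks roughly like this sketch (the architecture, batch size, and other hyperparameters here are illustrative guesses, not the video's exact setup):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Data set: samples from a sine wave.
xs = torch.linspace(-3, 3, 256).unsqueeze(1)
ys = torch.sin(xs)
loader = DataLoader(TensorDataset(xs, ys), batch_size=32, shuffle=True)

# A small fully connected network; the layer sizes are placeholders.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.02)
loss_fn = nn.MSELoss()

for epoch in range(200):
    for batch_x, batch_y in loader:            # random batches, reshuffled each pass
        optimizer.zero_grad()                  # clear old gradients
        loss = loss_fn(net(batch_x), batch_y)  # forward pass on this batch
        loss.backward()                        # backward pass (backpropagation)
        optimizer.step()                       # one step downhill

print("final batch loss:", float(loss))
```

So let's train a real network using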
[16:55] (1015.20s)
SGD this is using the exact same
[16:57] (1017.76s)
architecture as our search but this time
[17:00] (1020.32s)
with PyTorch's SGD
[17:04] (1024.52s)
optimizer. This works well; it is smoother
[17:07] (1027.44s)
than the local search and able to refine
[17:09] (1029.88s)
details to get a slightly lower
[17:12] (1032.52s)
loss if we train it on an image we can
[17:15] (1035.40s)
see that it is also smoother and really
[17:18] (1038.32s)
pretty looks like a
[17:20] (1040.12s)
candle and be aware that this visual
[17:22] (1042.52s)
comparison is not scientific or Fair
[17:25] (1045.04s)
we're just eyeballing it though SGD did
[17:27] (1047.72s)
take significantly less compute time to
[17:29] (1049.96s)
end up with a lower loss but for the
[17:32] (1052.68s)
image of Darwin we seem to run into a
[17:34] (1054.88s)
similar problem it converges very slowly
[17:37] (1057.80s)
and not very well actually worse than
[17:40] (1060.04s)
the local search again we might be
[17:42] (1062.72s)
getting stuck in local Minima which are
[17:45] (1065.00s)
just as much of a problem for gradient
[17:46] (1066.72s)
descent as Evolution to help we can add
[17:49] (1069.76s)
momentum to the final step which helps
[17:51] (1071.80s)
it roll out of little gullies but we've
[17:54] (1074.64s)
actually already been using momentum a
[17:57] (1077.12s)
much more radical Improvement comes from
[17:59] (1079.20s)
using a variant of SGD called Adam. Adam
[18:03] (1083.20s)
is basically momentum on steroids it
[18:05] (1085.96s)
improves upon the final step size by
[18:08] (1088.04s)
using the first and second moments of
[18:10] (1090.32s)
the gradient I won't explain more about
[18:12] (1092.68s)
Adam but I will sing its praises: Adam
[18:15] (1095.32s)
rocks it makes the training process way
[18:18] (1098.00s)
more efficient way faster and ultimately
[18:20] (1100.52s)
more capable the final learned functions
[18:22] (1102.88s)
are much more precise.
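In PyTorch, swapping between these optimizers is a one-line change; a sketch (the learning rates are common defaults, not tuned values from the video):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Same training loop, different update rule:
plain_sgd = torch.optim.SGD(net.parameters(), lr=0.02)
sgd_momentum = torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9)
adam = torch.optim.Adam(net.parameters(), lr=1e-3)  # adapts the step per parameter
                                                    # using moments of the gradient
```

Hello, Mr. Darwin,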
[18:26] (1106.32s)
looking a little creepy today
[18:28] (1108.88s)
now remember that these are all
[18:30] (1110.20s)
generated with the exact same neural
[18:32] (1112.40s)
architecture just with different
[18:34] (1114.32s)
parameter values discovered by different
[18:37] (1117.53s)
[Music]
[18:39] (1119.72s)
algorithms there is another unintuitive
[18:42] (1122.56s)
way to escape local Minima scale up add
[18:46] (1126.36s)
more parameters which adds more
[18:48] (1128.40s)
dimensions to the loss landscape what is
[18:50] (1130.96s)
a minimum in one dimension may be a
[18:53] (1133.48s)
maximum in another dimension this is a
[18:56] (1136.36s)
saddle point and it means that there is
[18:58] (1138.32s)
another way downhill research has shown
[19:01] (1141.44s)
that as you increase the dimensionality
[19:03] (1143.52s)
of parameter space it becomes less and
[19:06] (1146.20s)
less common to find true local Minima
[19:08] (1148.76s)
where not a single parameter can be
[19:10] (1150.56s)
improved on the hyperdimensional loss
[19:13] (1153.16s)
landscape There is almost always another
[19:15] (1155.76s)
way down the mountain gradient descent
[19:18] (1158.76s)
is well poised to take advantage of this
[19:21] (1161.24s)
the gradient will usually point to the
[19:23] (1163.44s)
optimal direction to escape the would-be
[19:25] (1165.68s)
local Minima evolutionary algorithms can
[19:28] (1168.76s)
benefit from this but they're not
[19:30] (1170.64s)
able to take full advantage because they
[19:33] (1173.16s)
must guess at parameters to mutate, i.e.,
[19:36] (1176.08s)
directions to travel they are
[19:38] (1178.12s)
precipitously less likely to guess the
[19:40] (1180.48s)
right direction as the number of
[19:42] (1182.44s)
directions grows this is a more General
[19:45] (1185.72s)
problem with the evolutionary approach
[19:47] (1187.92s)
as you scale up the number of parameters
[19:50] (1190.36s)
parameter space grows exponentially it
[19:53] (1193.44s)
becomes explosively unlikely that you
[19:55] (1195.64s)
will guess the right values to mutate
[19:58] (1198.08s)
you will have to generate a lot of
[19:59] (1199.80s)
failed networks however note that it
[20:02] (1202.48s)
never becomes impossible just very
[20:05] (1205.28s)
improbable given enough time
[20:07] (1207.52s)
improvements will still accumulate it
[20:09] (1209.60s)
just might take literally billions of
[20:11] (1211.56s)
years we meet the curse of
[20:13] (1213.92s)
dimensionality again it has killed many
[20:16] (1216.72s)
other optimization algorithms here too
[20:19] (1219.64s)
the primary and unique advantage of
[20:22] (1222.04s)
gradient descent is scale no matter how
[20:25] (1225.20s)
many parameters you have you can always
[20:27] (1227.68s)
compute the gradient in linear time and
[20:30] (1230.20s)
need take only one step in the optimal
[20:32] (1232.76s)
Direction This lets you take advantage
[20:35] (1235.00s)
of massive neural networks with billions
[20:37] (1237.32s)
of parameters without exploding the
[20:39] (1239.72s)
number of computations it is the
[20:42] (1242.24s)
blessing of dimensionality and it is why
[20:44] (1244.72s)
gradient descent is the state of the
[20:48] (1248.80s)
art but this video is deeply unfair to
[20:52] (1252.08s)
evolutionary methods I am comparing a
[20:54] (1254.76s)
heavily optimized gradient descent
[20:56] (1256.72s)
implementation with a homegrown dumb
[20:59] (1259.16s)
guy baby time local search I will
[21:01] (1261.76s)
definitely make a follow-up video on The
[21:03] (1263.88s)
state-of-the-art evolutionary methods I
[21:06] (1266.40s)
think there may be untapped advantages
[21:08] (1268.28s)
to algorithms other than gradient
[21:10] (1270.16s)
descent the main limitation of gradient
[21:12] (1272.80s)
descent is that to calculate the
[21:14] (1274.56s)
gradient it must be that the loss
[21:16] (1276.56s)
landscape function the loss function and
[21:19] (1279.12s)
the neural network itself are continuous
[21:22] (1282.20s)
and differentiable or connected and
[21:25] (1285.12s)
smooth no holes no breaks no gaps no
[21:28] (1288.88s)
Jagged edges and no weird fractally bits
[21:31] (1291.48s)
that you can't differentiate the
[21:33] (1293.28s)
gradients must flow this applies to
[21:36] (1296.04s)
activation functions too like the
[21:38] (1298.44s)
ReLU uh this jagged point is not
[21:41] (1301.44s)
differentiable but we'll just say its
[21:43] (1303.52s)
derivative is one and this works
[21:45] (1305.12s)
perfectly fine in practice what are you
[21:46] (1306.96s)
going to do about it math nerd however
[21:49] (1309.36s)
you cannot backprop through a binary
[21:51] (1311.44s)
step function like this but you can
[21:54] (1314.24s)
evolve
[21:56] (1316.16s)
it.
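For instance, a binary step activation is flat almost everywhere, so backprop gets no gradient signal through it, but a mutate-and-keep-if-better loop doesn't care. This sketch is my own illustration, not the video's code:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 64)
ys = np.sin(xs)

def step(z):
    # Binary step activation: derivative is zero everywhere it is defined,
    # so gradient descent gets nothing useful to work with.
    return (z > 0).astype(float)

def net(params, x):
    w1, b1, w2, b2 = params.reshape(4, -1)      # one hidden layer of 16 step neurons
    hidden = step(np.outer(x, w1) + b1)
    return hidden @ w2 + b2[0]

def loss(params):
    return np.mean((net(params, xs) - ys) ** 2)

params = rng.normal(size=4 * 16)
for _ in range(5000):                           # evolve: mutate, keep only improvements
    candidate = params + rng.normal(scale=0.1, size=params.shape)
    if loss(candidate) < loss(params):
        params = candidate

print("evolved loss:", loss(params))
```

Pretty neat. And you can use it for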
[21:59] (1319.40s)
images too and make some really creepy
[22:01] (1321.64s)
looking smiley
[22:03] (1323.36s)
faces you can't do this with gradient
[22:05] (1325.92s)
descent it is possible to evolve very
[22:09] (1329.16s)
weird neural networks that are broken
[22:11] (1331.56s)
Jagged or fractally that you can't train
[22:14] (1334.40s)
with gradient descent and this might be
[22:16] (1336.64s)
useful it's a neat idea and it makes an
[22:19] (1339.28s)
interesting image but this is not really
[22:21] (1341.24s)
a better
[22:22] (1342.36s)
approximation there is a common view
[22:24] (1344.64s)
that evolution is just an inferior
[22:26] (1346.80s)
optimization algorithm and well that is
[22:29] (1349.40s)
true to some extent there's a reason we
[22:31] (1351.44s)
don't train language models with genetic
[22:33] (1353.64s)
algorithms but I do think this view is
[22:36] (1356.20s)
missing something important true
[22:38] (1358.48s)
biological evolution has something that
[22:40] (1360.64s)
none of these algorithms have even the
[22:42] (1362.76s)
genetic ones if you actually ran any of
[22:45] (1365.72s)
these programs for a billion years you'd
[22:48] (1368.32s)
end up with just a better and better
[22:50] (1370.88s)
approximation if you run true biological
[22:53] (1373.80s)
evolution for a billion years you end up
[22:56] (1376.24s)
with eyes and wings and brains and pill
[22:59] (1379.68s)
bugs Evolution diverges not just
[23:03] (1383.12s)
converges there's a lot more to say here
[23:05] (1385.60s)
but alas we're outside the scope of the
[23:07] (1387.72s)
video right now gradient descent is the
[23:10] (1390.64s)
most optimized Optimizer by far it is
[23:14] (1394.24s)
the best tool for the job and it scales
[23:17] (1397.00s)
which turns out to be kind of a big deal
[23:19] (1399.48s)
large language models trained with
[23:21] (1401.16s)
gradient descent helped write the code
[23:23] (1403.12s)
and check the script for this video it's
[23:25] (1405.40s)
cool that this algorithm can encode some
[23:27] (1407.84s)
weird understanding of itself into the
[23:30] (1410.76s)
very neurons that it
[23:32] (1412.84s)
trains don't worry the script was human
[23:35] (1415.24s)
written and human fact checked too I
[23:37] (1417.52s)
don't really like the way AI writes and
[23:39] (1419.72s)
I always use human music check out my
[23:41] (1421.84s)
music guys they're great okay video's
[23:44] (1424.52s)
over get out of here