[00:03] (3.24s)
you are currently watching an artificial
[00:05] (5.68s)
neural network
[00:07] (7.28s)
learn in this video as we watch neural
[00:10] (10.00s)
networks learn you will learn how they
[00:12] (12.12s)
learn this is a continuation of my
[00:14] (14.52s)
previous videos about neural networks as
[00:16] (16.72s)
universal function approximators and
[00:19] (19.08s)
today I will finally explain the
[00:21] (21.28s)
learning algorithm itself and already
[00:24] (24.24s)
there are 500 pedantic comments saying
[00:26] (26.60s)
uh it's not real learning cuz it's not
[00:28] (28.72s)
conscious and it's just matrix
[00:30] (30.24s)
multiplications and AI is neither A nor
[00:33] (33.20s)
I that sounds smart yeah yeah whatever
[00:35] (35.88s)
nerds nobody cares by learning I mean
[00:39] (39.36s)
optimizing we will compare two different
[00:41] (41.84s)
optimization algorithms for neural
[00:43] (43.88s)
networks the state-of-the-art stochastic
[00:46] (46.52s)
gradient descent and for contrast a
[00:49] (49.08s)
simple evolutionary algorithm we will
[00:51] (51.56s)
watch real neural networks learn simple
[00:53] (53.72s)
functions that let us visualize the
[00:55] (55.64s)
actual training process in action I will
[00:58] (58.44s)
also be using a little interactive web toy
[01:00] (60.72s)
that I made you can play with this right
[01:02] (62.44s)
now it is a free and open-source
[01:04] (64.52s)
website I hope to provide some insight
[01:07] (67.00s)
for both newbies and experts and show
[01:09] (69.48s)
the similarities and differences the
[01:11] (71.60s)
advantages and disadvantages of these
[01:14] (74.16s)
algorithms and ultimately explain why
[01:16] (76.64s)
one of them is the most optimized
[01:20] (80.24s)
Optimizer again I recommend you watch my
[01:22] (82.84s)
other neural network videos but in short
[01:25] (85.24s)
functions are input output machines
[01:27] (87.88s)
numbers in, numbers out. Neural networks are
[01:30] (90.80s)
also functions and can reconstruct or
[01:33] (93.72s)
approximate any other function to any
[01:36] (96.28s)
degree of precision given only a sample
[01:38] (98.64s)
of its data points they are Universal
[01:41] (101.84s)
function approximators and this makes
[01:43] (103.72s)
them extremely general purpose for real
[01:46] (106.04s)
world tasks because functions describe
[01:51] (111.00s)
the world yes the network's shape as a
[01:54] (114.56s)
function is defined by its set of
[01:56] (116.56s)
parameters the values of its weights and
[01:58] (118.96s)
biases parameters are designed to be
[02:01] (121.80s)
changeable tunable and different values
[02:04] (124.24s)
produce different shapes more parameters
[02:06] (126.68s)
let you build more complex functions
[02:09] (129.20s)
with four parameters I can fit an
[02:10] (130.76s)
elephant and with five I can make him
[02:12] (132.72s)
wiggle his
[02:13] (133.76s)
trunk this web toy has an extremely
[02:16] (136.52s)
simple network with two parameters A and
[02:18] (138.92s)
B which we can control down here the
[02:22] (142.00s)
green line is the approximation and the
[02:24] (144.32s)
blue line is the target function and
[02:26] (146.28s)
data set it's a sine wave the neural
[02:29] (149.00s)
Network's actual function definition is
[02:31] (151.24s)
here; it's using a tanh activation.
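As a rough sketch in Python: the toy's exact formula isn't spelled out in this transcript, so treating it as a single tanh neuron with two tunable parameters is an assumption on my part, but it would look something like this.

```python
import numpy as np

def toy_net(x, a, b):
    # Assumed form of the two-parameter toy: a single tanh neuron.
    # The real web toy may wire a and b up differently.
    return np.tanh(a * x + b)

# Different values of a and b bend and shift the green curve.
xs = np.linspace(-3, 3, 7)
print(toy_net(xs, a=1.5, b=-0.5))
```

And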
[02:34] (154.84s)
as we change the values we can see this
[02:37] (157.16s)
Green Point moving around down here this
[02:40] (160.04s)
is parameter space the space of all
[02:42] (162.52s)
possible combinations of all possible
[02:44] (164.72s)
values of A and B between -5 and +5 it
[02:49] (169.20s)
is a 2d plane where the parameters are
[02:52] (172.00s)
the coordinates 2D for 2 params each
[02:55] (175.72s)
point represents a different network and
[02:58] (178.04s)
collectively this is the space of all
[03:00] (180.36s)
possible networks with this particular
[03:02] (182.36s)
architecture with two parameters wired
[03:04] (184.40s)
up in this specific way note that this
[03:06] (186.80s)
space would be very difficult to
[03:08] (188.32s)
visualize if we had lots of parameters
[03:10] (190.64s)
like hundreds this problem will haunt us
[03:13] (193.00s)
throughout the video but we will keep it
[03:14] (194.60s)
simple for now so given a Target
[03:17] (197.52s)
function it is clear that some of these
[03:19] (199.40s)
parameters fit the target function
[03:21] (201.08s)
better than others we can search the
[03:23] (203.36s)
space manually to find good ones but how
[03:25] (205.56s)
do we automatically find the best ones
[03:28] (208.72s)
this is the job of our optimization
[03:31] (211.00s)
algorithm which is fundamentally solving
[03:33] (213.40s)
a search problem it must search the
[03:36] (216.00s)
space for the best set of parameters
[03:38] (218.12s)
that produces the best
[03:40] (220.16s)
approximation in order to see which is
[03:42] (222.44s)
best we need a way to evaluate a network
[03:45] (225.44s)
given a set of parameters tell me how
[03:47] (227.48s)
well a network with those parameters
[03:49] (229.64s)
fits the target function remember that
[03:51] (231.92s)
we don't actually know what the target
[03:53] (233.40s)
function is we only have a sample of
[03:55] (235.44s)
data points so let's take three data
[03:57] (237.28s)
points from a sine wave we can now give
[03:59] (239.84s)
our Network each input from the data set
[04:02] (242.08s)
and ask it to predict each output and
[04:04] (244.36s)
then compare the predicted outputs to
[04:06] (246.36s)
the true outputs we measure the
[04:08] (248.44s)
difference between these two values
[04:10] (250.00s)
using what's called a loss function in
[04:12] (252.48s)
this case we will use mean squared error
[04:14] (254.76s)
but there are many other loss functions
[04:16] (256.32s)
for different tasks we calculate loss
[04:18] (258.80s)
for all predicted versus true outputs
[04:21] (261.52s)
and then take the average across the
[04:23] (263.12s)
entire data set which gives us a single
[04:25] (265.60s)
final score for a given network: its
[04:28] (268.36s)
loss. Loss is a measure of error so
[04:31] (271.60s)
lower loss is better we're playing golf
[04:33] (273.84s)
here low loss means a small difference
[04:36] (276.20s)
between predicted and true outputs which
[04:38] (278.28s)
means you have a good approximation.
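In code, that evaluation is only a handful of lines; here's a minimal sketch reusing the assumed two-parameter tanh toy from earlier (the three sample points are illustrative, not the video's exact data):

```python
import numpy as np

def evaluate(params, xs, ys):
    # Mean squared error of the toy network's predictions against the data set.
    a, b = params
    predictions = np.tanh(a * xs + b)        # predicted outputs
    return np.mean((predictions - ys) ** 2)  # average squared difference = loss

# Three data points sampled from a sine wave.
xs = np.array([-1.0, 0.0, 1.0])
ys = np.sin(xs)
print(evaluate((1.5, -0.5), xs, ys))         # one number: lower is better
```

This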
[04:40] (280.76s)
big calculation can be generalized into
[04:43] (283.28s)
a function the loss landscape function
[04:46] (286.04s)
which takes a set of parameters as input
[04:48] (288.64s)
and outputs the loss for those
[04:50] (290.64s)
parameters on our current data set now
[04:53] (293.40s)
remember parameter space using this
[04:55] (295.64s)
function we can visit every point in
[04:57] (297.72s)
this space and calculate loss for every
[05:00] (300.08s)
set of parameters every Network and
[05:02] (302.52s)
visualize the loss as the height of the
[05:04] (304.64s)
plane at that point this is the loss
[05:08] (308.76s)
landscape it tells us how good every
[05:11] (311.44s)
possible Network in this space is at
[05:13] (313.84s)
approximating the target function.
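With only two parameters you really can chart the whole landscape by brute force; a sketch under the same assumptions as before (grid resolution and data points are my own choices):

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])
ys = np.sin(xs)

def loss(a, b):
    return np.mean((np.tanh(a * xs + b) - ys) ** 2)

# Sweep a grid over parameter space (a and b from -5 to +5) and record
# the loss at every grid point: the "height" of the landscape there.
a_vals = np.linspace(-5, 5, 101)
b_vals = np.linspace(-5, 5, 101)
landscape = np.array([[loss(a, b) for b in b_vals] for a in a_vals])

lowest = np.unravel_index(np.argmin(landscape), landscape.shape)
print("lowest point near a =", a_vals[lowest[0]], ", b =", b_vals[lowest[1]])
```

The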
[05:16] (316.48s)
Green Point represents our current
[05:18] (318.12s)
Network the highest points are the worst
[05:20] (320.20s)
networks the lowest points are the best
[05:22] (322.48s)
we want the lowest mathematically
[05:25] (325.08s)
optimization simply means you are either
[05:27] (327.08s)
minimizing or maximizing a function or
[05:30] (330.16s)
some kind of Min maxing but we're not
[05:31] (331.68s)
worrying about that in this video for
[05:33] (333.80s)
minimizing you're trying to find the
[05:35] (335.60s)
best inputs that produce the lowest
[05:37] (337.80s)
output we are trying to minimize the
[05:40] (340.24s)
loss landscape function whose inputs are
[05:43] (343.44s)
parameters the data set values are not
[05:46] (346.04s)
passed in as variables but rather
[05:47] (347.84s)
treated as constants numbers hardwired
[05:50] (350.32s)
into the function they don't change no
[05:52] (352.36s)
matter what your parameters are this
[05:54] (354.68s)
requires a conceptual rewiring of our
[05:57] (357.12s)
neural network such that what were
[05:59] (359.08s)
variables (data set inputs) are now
[06:01] (361.12s)
constants and what were constants
[06:03] (363.08s)
(parameters) are now variables thinking
[06:05] (365.68s)
this way allows us to optimize input
[06:07] (367.92s)
parameters to minimize output loss we
[06:11] (371.12s)
can now search parameter space using the
[06:13] (373.52s)
loss landscape as our map we can just
[06:16] (376.68s)
visually look at it and see that the
[06:18] (378.32s)
lowest point is here so these parameters
[06:20] (380.48s)
make for the best network problem solved
[06:22] (382.56s)
we're done except of course in order to
[06:24] (384.88s)
show this landscape I have to generate
[06:27] (387.00s)
and evaluate every possible variation of
[06:29] (389.32s)
the Network in this space at least with
[06:32] (392.00s)
two parameters this is easy but two
[06:34] (394.00s)
parameters is nothing even on easy
[06:36] (396.24s)
problems a good approximation requires
[06:38] (398.32s)
hundreds of parameters which is both
[06:40] (400.20s)
impossible to visualize and intractable
[06:42] (402.48s)
to compute you cannot realistically
[06:44] (404.56s)
generate every possible combination of
[06:46] (406.92s)
100 parameters in a 100 dimensional
[06:49] (409.28s)
hyperspace and that is still very small
[06:51] (411.36s)
for a neural network with that many
[06:53] (413.44s)
parameters the loss landscape function
[06:55] (415.40s)
can still be defined and the loss for an
[06:58] (418.08s)
individual point can be calculated
[07:00] (420.32s)
but the entire landscape can never be
[07:02] (422.48s)
explicitly calculated in the way that
[07:04] (424.28s)
we've done here so our optimization
[07:07] (427.36s)
algorithm must make do with this
[07:09] (429.32s)
limitation it must scale even with
[07:11] (431.92s)
absurdly high numbers of parameters
[07:14] (434.32s)
we've kept the dimensionality low so we
[07:16] (436.20s)
can visualize it but our algorithm must
[07:18] (438.72s)
navigate the landscape
[07:20] (440.30s)
[Music]
[07:24] (444.32s)
blind imagine you are near the top of a
[07:27] (447.04s)
mountain range the landscape is rugged
[07:29] (449.32s)
with many peaks and valleys the air
[07:31] (451.56s)
is cold and thin and you must find your
[07:34] (454.08s)
way down the mountain but there's a
[07:36] (456.72s)
hitch you are blind I'm going to
[07:39] (459.48s)
interview Erik Weihenmayer who climbed the
[07:42] (462.48s)
highest mountain in the world Mount
[07:43] (463.92s)
Everest but he's gay I mean he's gay
[07:47] (467.28s)
excuse me he's blind so we'll hear about
[07:49] (469.68s)
that coming up okay that's you they're
[07:51] (471.32s)
talking about blind and gay and stuck at
[07:53] (473.76s)
the top of Loss Mountain how do you find
[07:56] (476.08s)
your way down well one way you can try
[07:58] (478.76s)
is to Simply step in a random Direction
[08:01] (481.16s)
and if you go down commit to the step
[08:03] (483.64s)
otherwise do another random Step better
[08:05] (485.92s)
yet take like 10 random steps and only
[08:08] (488.60s)
commit to the best one repeat this over
[08:11] (491.04s)
and over this is the evolutionary way or
[08:14] (494.88s)
you could just feel the slope of the
[08:17] (497.04s)
ground under your feet and step in
[08:19] (499.48s)
exactly the direction that it feels
[08:21] (501.56s)
steepest downhill that is the gradient
[08:24] (504.64s)
descent way both are hill climbing
[08:27] (507.64s)
optimization algorithms step by step
[08:30] (510.56s)
they climb down loss Mountain seeking
[08:33] (513.40s)
the lowest point on the entire landscape
[08:35] (515.96s)
the global
[08:37] (517.72s)
minimum let's start with the
[08:39] (519.72s)
evolutionary way random guessing and
[08:42] (522.00s)
checking we'll use an algorithm known as
[08:44] (524.60s)
a local search and it is just about the
[08:46] (526.88s)
most Bare Bones genetic algorithm that
[08:49] (529.04s)
you could ask for start with a random
[08:51] (531.44s)
network, a random point in parameter
[08:53] (533.64s)
space copy it and mutate the copy's
[08:56] (536.20s)
parameters by adding a small random
[08:58] (538.48s)
value to all or some of them this moves
[09:00] (540.92s)
the point around a bit do this several
[09:03] (543.08s)
times so generate a population of
[09:05] (545.28s)
candidate random steps calculate loss
[09:07] (547.80s)
for all and choose the one with the
[09:09] (549.64s)
lowest loss for the next round and
[09:12] (552.00s)
repeat repeat repeat gradually step by
[09:15] (555.08s)
step its descendants descend the
[09:17] (557.48s)
mountain.
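A bare-bones sketch of that loop on the same assumed two-parameter toy (the population size, mutation scale, and round count are illustrative, not the video's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([-1.0, 0.0, 1.0])
ys = np.sin(xs)

def loss(params):
    a, b = params
    return np.mean((np.tanh(a * xs + b) - ys) ** 2)

params = rng.uniform(-5, 5, size=2)                    # random starting network
for round_number in range(150):
    # Mutate: a population of copies, each nudged by small random values.
    mutants = params + rng.normal(scale=0.3, size=(1000, 2))
    candidates = np.vstack([params, mutants])          # keep the parent too,
                                                       # so loss never goes up
    # Select: the candidate with the lowest loss survives to the next round.
    params = min(candidates, key=loss)

print("best parameters:", params, "loss:", loss(params))
```

The parameters are the genome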
[09:20] (560.44s)
of the neural network and our loss
[09:22] (562.32s)
function is the fitness function or
[09:24] (564.60s)
anti-fitness function as biological
[09:27] (567.52s)
life seeks the highest point on the
[09:29] (569.52s)
fitness landscape we seek the lowest
[09:31] (571.88s)
point on the loss landscape this website
[09:34] (574.80s)
is actually performing a local search
[09:36] (576.88s)
which works pretty well on this problem
[09:38] (578.76s)
it can quickly find the lowest point on
[09:40] (580.76s)
the landscape we say it converges on the
[09:43] (583.80s)
global minimum it reaches a point where
[09:45] (585.76s)
it just keeps getting closer and closer
[09:47] (587.68s)
and closer it is not guaranteed to
[09:50] (590.28s)
converge on the global minimum I will
[09:52] (592.16s)
explain more about that in a bit either
[09:54] (594.28s)
way it finds the best approximation but
[09:56] (596.56s)
it still isn't a great approximation and
[09:58] (598.56s)
of course it's not; two parameters does
[10:00] (600.64s)
not make for a very useful neural
[10:02] (602.40s)
network so let's switch over to a real
[10:05] (605.08s)
PyTorch neural network and evolve it
[10:07] (607.20s)
using a local search this network has
[10:11] (611.08s)
1,741 parameters so we can no longer
[10:13] (613.96s)
observe the loss landscape directly but
[10:16] (616.12s)
we can observe the Learned function it
[10:18] (618.60s)
mutates a random subset of parameters at
[10:21] (621.08s)
a time and you can actually see it
[10:22] (622.96s)
adjusting them as it learns the mutation
[10:26] (626.08s)
rate gets smaller over time so it can
[10:28] (628.44s)
hopefully learn finer and finer
[10:30] (630.28s)
details I ran this for 150 rounds each
[10:33] (633.84s)
with a population of a thousand so I had to
[10:36] (636.48s)
generate 150,000 networks for this
[10:38] (638.96s)
footage that is a sacrifice I'm willing
[10:41] (641.24s)
to make that's pretty good now let's try
[10:44] (644.76s)
a more complex thing an image we'll use
[10:48] (648.16s)
a smiley face refresher on how this
[10:50] (650.56s)
works the network takes the pixel
[10:52] (652.60s)
coordinates as input and outputs a
[10:54] (654.92s)
single Pixel value two inputs one output
[10:58] (658.36s)
this image is made of every output
[11:00] (660.52s)
for every pixel you treat every pixel in
[11:03] (663.52s)
the Target image as a data point in your
[11:05] (665.68s)
data set and approximate it.
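Turning an image into that kind of data set only takes a few lines; a sketch (the resolution and the coordinate scaling here are assumptions on my part):

```python
import numpy as np

# Stand-in for the target image: a small grayscale array with values in [0, 1].
image = np.random.rand(28, 28)

h, w = image.shape
rows, cols = np.mgrid[0:h, 0:w]

# Inputs: (x, y) pixel coordinates, scaled to roughly [-1, 1].
inputs = np.stack([cols / (w - 1) * 2 - 1,
                   rows / (h - 1) * 2 - 1], axis=-1).reshape(-1, 2)
# Targets: one brightness value per pixel.
targets = image.reshape(-1, 1)

print(inputs.shape, targets.shape)   # (784, 2) and (784, 1): 784 data points
```

Every frame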
[11:09] (669.08s)
is a successful mutation a strict
[11:11] (671.52s)
Improvement on the frame before loss can
[11:14] (674.44s)
only go down or stay the same this task
[11:17] (677.88s)
is harder than the previous one it takes
[11:20] (680.16s)
a while but it does eventually
[11:23] (683.20s)
converge now if we give it a proper
[11:25] (685.48s)
image like this picture of Mr Darwin
[11:27] (687.52s)
himself we can see just how much it
[11:30] (690.12s)
struggles to optimize I had to tune the
[11:33] (693.20s)
algorithm quite a bit to get it to
[11:34] (694.72s)
perform even decently the values that I
[11:37] (697.48s)
tuned: the population size, the number of
[11:39] (699.92s)
rounds the number of neurons these
[11:42] (702.00s)
values are called hyperparameters they
[11:44] (704.32s)
are not learned during training but set
[11:46] (706.64s)
beforehand you can tune them by hand as
[11:48] (708.84s)
I have or use a meta optimization
[11:51] (711.40s)
algorithm to find good hyperparameters
[11:53] (713.96s)
we're not doing that
[11:56] (716.24s)
here so it seems to have hit a wall this
[11:59] (719.56s)
may be due to the local search getting
[12:01] (721.56s)
stuck in one of the many shallow valleys
[12:04] (724.16s)
that are still high up on loss Mountain
[12:06] (726.92s)
these are local Minima and they are
[12:09] (729.44s)
where hill climbers go to die our local
[12:12] (732.52s)
search cannot climb up and out it can
[12:14] (734.96s)
only go down loss always goes down
[12:18] (738.64s)
likewise real genetic fitness always
[12:20] (740.92s)
goes up and it too can get stuck on low
[12:23] (743.80s)
hilltops it can only make do with what
[12:26] (746.20s)
it's got it cannot go back to the
[12:28] (748.08s)
drawing board our algorithm might be
[12:30] (750.76s)
getting stuck in one of these local
[12:32] (752.56s)
Minima but if you just keep running it
[12:35] (755.60s)
it does keep improving very slowly
[12:39] (759.60s)
gradualism baby we could improve this by
[12:42] (762.44s)
keeping a diverse population of good
[12:44] (764.64s)
candidates throughout the entire run
[12:46] (766.60s)
mixing their parameters with crossover
[12:48] (768.44s)
mutations we could be mutating the
[12:50] (770.72s)
neural architecture itself and other
[12:52] (772.96s)
hyperparameters along the way I will
[12:55] (775.40s)
explore these in a future video but for
[12:57] (777.36s)
now we must move on a local search works
[13:00] (780.00s)
pretty well if you have a billion years
[13:02] (782.12s)
but if you don't want to die before it
[13:04] (784.00s)
converges you will need something like
[13:06] (786.08s)
gradient descent stochastic gradient
[13:09] (789.08s)
descent or SGD stochastic means it
[13:12] (792.04s)
involves Randomness descent it's going
[13:14] (794.16s)
down the hill and the gradient is our
[13:16] (796.56s)
secret weapon the gradient is a vector a
[13:19] (799.84s)
list of numbers each holding the slope
[13:22] (802.04s)
of the corresponding parameter on loss
[13:24] (804.40s)
Mountain if you have two parameters you
[13:26] (806.84s)
have two gradient values collectively
[13:29] (809.44s)
they make a line seen here actually that
[13:32] (812.32s)
is the line tangent to the surface which
[13:34] (814.24s)
is not quite the same thing but it
[13:35] (815.92s)
includes the gradient the gradient
[13:37] (817.96s)
Vector points in the direction of the
[13:39] (819.84s)
steepest slope from the current position
[13:42] (822.16s)
on the landscape give it a negative sign
[13:44] (824.92s)
and it points to the steepest slope
[13:47] (827.28s)
downhill the gradient is our compass on
[13:50] (830.04s)
Loss Mountain it is the slope of the ground
[13:52] (832.64s)
that we can feel under our feet
[13:55] (835.00s)
calculating it requires some calculus
[13:57] (837.24s)
multivariable calc I defined the loss
[14:00] (840.08s)
landscape function a bit more generally
[14:02] (842.00s)
here we need to take its derivative with
[14:04] (844.64s)
respect to each parameter and we can do
[14:07] (847.12s)
this because our loss landscape function
[14:09] (849.28s)
is always well defined even though our
[14:11] (851.52s)
Target function is unknown so we start
[14:14] (854.72s)
by using the derivative law known as use
[14:16] (856.92s)
Wolfram Alpha and here's the
[14:18] (858.56s)
derivatives go watch 3Blue1Brown's
[14:18] (858.56s)
video for the mathy math there's
[14:22] (862.16s)
a chain rule in there somewhere that
[14:24] (864.16s)
gives us a partial derivative function
[14:26] (866.08s)
for each parameter which combined gives
[14:28] (868.48s)
us our gradient vector these
[14:30] (870.52s)
functions tell us how loss changes as
[14:32] (872.96s)
each parameter changes this gets
[14:35] (875.40s)
complicated for large neural networks
[14:37] (877.32s)
but it can be fully automated.
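In PyTorch that automation is autograd; a minimal sketch on the assumed two-parameter toy from earlier:

```python
import torch

# The data set is baked in as constants...
xs = torch.tensor([-1.0, 0.0, 1.0])
ys = torch.sin(xs)

# ...and the parameters are the variables we differentiate with respect to.
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(-0.5, requires_grad=True)

loss = torch.mean((torch.tanh(a * xs + b) - ys) ** 2)  # forward pass
loss.backward()                                        # autograd applies the chain rule

print(a.grad, b.grad)   # the gradient: the slope of loss along each parameter
```

Now we can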
[14:40] (880.48s)
calculate the gradient for any point and
[14:42] (882.64s)
step directly towards the steepest
[14:44] (884.80s)
downhill Direction by simply subtracting
[14:47] (887.76s)
the gradient from the parameters except
[14:50] (890.84s)
that's probably too much you will
[14:52] (892.44s)
overshoot it you want to scale the
[14:54] (894.72s)
gradient down with a learning rate
[14:56] (896.68s)
hyperparameter, something like 0.02.
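The update itself is just a scaled subtraction; a sketch of one step (the learning rate echoes the ballpark figure above, everything else is illustrative):

```python
import torch

xs = torch.tensor([-1.0, 0.0, 1.0]); ys = torch.sin(xs)
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(-0.5, requires_grad=True)

torch.mean((torch.tanh(a * xs + b) - ys) ** 2).backward()  # fills a.grad and b.grad

learning_rate = 0.02               # scales the step down so we don't overshoot
with torch.no_grad():
    a -= learning_rate * a.grad    # minus sign: step downhill, not uphill
    b -= learning_rate * b.grad
    a.grad.zero_()                 # clear the gradients before the next step
    b.grad.zero_()
```

The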
[14:59] (899.68s)
gradient tells you the optimal Direction
[15:01] (901.92s)
but not the optimal step size you're
[15:03] (903.96s)
still blind you don't know what the
[15:05] (905.88s)
function will look like in the next step
[15:08] (908.48s)
there are clever ways of choosing better
[15:10] (910.32s)
step sizes we'll talk more about that
[15:12] (912.20s)
later just know that because the
[15:14] (914.12s)
gradient tells you the best direction to
[15:16] (916.04s)
step there is no guessing and checking
[15:18] (918.64s)
this is a crucial Advantage we once
[15:21] (921.68s)
again randomly initialize our Network
[15:23] (923.88s)
and evaluate loss with a forward pass
[15:26] (926.88s)
then we go back through the entire
[15:28] (928.60s)
network calculate the gradient for each
[15:30] (930.88s)
parameter and apply the step which moves
[15:33] (933.36s)
the point this is the backwards pass AKA
[15:36] (936.84s)
back propagation and repeat this results
[15:40] (940.12s)
in a much smoother more precise stepping
[15:42] (942.92s)
algorithm than Evolution it is still a
[15:45] (945.56s)
gradualistic hill climber but more
[15:47] (947.56s)
direct and less Spazzy although it can
[15:49] (949.96s)
get jumpy on steep slopes it is not
[15:52] (952.44s)
guaranteed to always reduce loss the
[15:55] (955.84s)
stochastic part comes from the random
[15:58] (958.00s)
initialization but also something called
[16:00] (960.28s)
random batching rather than training on
[16:02] (962.64s)
the entire data set all at once we train
[16:05] (965.08s)
on small random subsets of the data
[16:07] (967.48s)
called batches we go through every batch
[16:09] (969.96s)
in the data set taking a step for each
[16:12] (972.28s)
and then randomize the batches for the
[16:14] (974.28s)
next pass you can simulate this effect
[16:16] (976.96s)
with simulate batching which purely
[16:19] (979.16s)
randomizes the data on every step
[16:21] (981.84s)
because the landscape is defined by the
[16:24] (984.12s)
data set it changes with every batch but
[16:26] (986.88s)
in the long run these variations will
[16:28] (988.96s)
average out to the same landscape
[16:31] (991.52s)
mathematically, in expectation, it is equivalent to train
[16:33] (993.52s)
on the entire data set all at once as it
[16:36] (996.12s)
is to train on small random batches
[16:38] (998.48s)
though in practice hyperparameters must
[16:40] (1000.40s)
be tuned batching is extremely useful
[16:43] (1003.28s)
because it allows us to spread out the
[16:45] (1005.20s)
expensive back propagation algorithm
[16:47] (1007.48s)
over more compute
[16:51] (1011.60s)
time.
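Put together, a PyTorch training loop in that style looks roughly like this sketch (the architecture, batch size, and other hyperparameters here are illustrative guesses, not the video's exact setup):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Data set: samples from a sine wave.
xs = torch.linspace(-3, 3, 256).unsqueeze(1)
ys = torch.sin(xs)
loader = DataLoader(TensorDataset(xs, ys), batch_size=32, shuffle=True)

# A small fully connected network; the layer sizes are placeholders.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.02)
loss_fn = nn.MSELoss()

for epoch in range(200):
    for batch_x, batch_y in loader:            # random batches, reshuffled each pass
        optimizer.zero_grad()                  # clear old gradients
        loss = loss_fn(net(batch_x), batch_y)  # forward pass on this batch
        loss.backward()                        # backward pass (backpropagation)
        optimizer.step()                       # one step downhill

print("final batch loss:", float(loss))
```

So let's train a real network using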
[16:55] (1015.20s)
SGD this is using the exact same
[16:57] (1017.76s)
architecture as our search but this time
[17:00] (1020.32s)
with PyTorch's SGD
[17:04] (1024.52s)
optimizer. This works well; it is smoother
[17:07] (1027.44s)
than the local search and able to refine
[17:09] (1029.88s)
details to get a slightly lower
[17:12] (1032.52s)
loss if we train it on an image we can
[17:15] (1035.40s)
see that it is also smoother and really
[17:18] (1038.32s)
pretty looks like a
[17:20] (1040.12s)
candle and be aware that this visual
[17:22] (1042.52s)
comparison is not scientific or Fair
[17:25] (1045.04s)
we're just eyeballing it though SGD did
[17:27] (1047.72s)
take significantly less compute time to
[17:29] (1049.96s)
end up with a lower loss but for the
[17:32] (1052.68s)
image of Darwin we seem to run into a
[17:34] (1054.88s)
similar problem it converges very slowly
[17:37] (1057.80s)
and not very well actually worse than
[17:40] (1060.04s)
the local search again we might be
[17:42] (1062.72s)
getting stuck in local Minima which are
[17:45] (1065.00s)
just as much of a problem for gradient
[17:46] (1066.72s)
descent as Evolution to help we can add
[17:49] (1069.76s)
momentum to the final step which helps
[17:51] (1071.80s)
it roll out of little gullies but we've
[17:54] (1074.64s)
actually already been using momentum a
[17:57] (1077.12s)
much more radical Improvement comes from
[17:59] (1079.20s)
using a variant of SGD called Adam. Adam
[18:03] (1083.20s)
is basically momentum on steroids it
[18:05] (1085.96s)
improves upon the final step size by
[18:08] (1088.04s)
using the first and second moments of
[18:10] (1090.32s)
the gradient I won't explain more about
[18:12] (1092.68s)
Adam but I will sing its praises: Adam
[18:15] (1095.32s)
rocks it makes the training process way
[18:18] (1098.00s)
more efficient way faster and ultimately
[18:20] (1100.52s)
more capable the final learned functions
[18:22] (1102.88s)
are much more precise.
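In PyTorch, swapping between these optimizers is a one-line change; a sketch (the learning rates are common defaults, not tuned values from the video):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Same training loop, different update rule:
plain_sgd = torch.optim.SGD(net.parameters(), lr=0.02)
sgd_momentum = torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9)
adam = torch.optim.Adam(net.parameters(), lr=1e-3)  # adapts the step per parameter
                                                    # using moments of the gradient
```

Hello, Mr. Darwin,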
[18:26] (1106.32s)
looking a little creepy today
[18:28] (1108.88s)
now remember that these are all
[18:30] (1110.20s)
generated with the exact same neural
[18:32] (1112.40s)
architecture just with different
[18:34] (1114.32s)
parameter values discovered by different
[18:37] (1117.53s)
[Music]
[18:39] (1119.72s)
algorithms there is another unintuitive
[18:42] (1122.56s)
way to escape local Minima scale up add
[18:46] (1126.36s)
more parameters which adds more
[18:48] (1128.40s)
dimensions to the loss landscape what is
[18:50] (1130.96s)
a minimum in one dimension may be a
[18:53] (1133.48s)
maximum in another dimension this is a
[18:56] (1136.36s)
saddle point and it means that there is
[18:58] (1138.32s)
another way downhill research has shown
[19:01] (1141.44s)
that as you increase the dimensionality
[19:03] (1143.52s)
of parameter space it becomes less and
[19:06] (1146.20s)
less common to find true local Minima
[19:08] (1148.76s)
where not a single parameter can be
[19:10] (1150.56s)
improved on the hyperdimensional loss
[19:13] (1153.16s)
landscape There is almost always another
[19:15] (1155.76s)
way down the mountain gradient descent
[19:18] (1158.76s)
is well poised to take advantage of this
[19:21] (1161.24s)
the gradient will usually point to the
[19:23] (1163.44s)
optimal direction to escape the would-be
[19:25] (1165.68s)
local Minima evolutionary algorithms can
[19:28] (1168.76s)
benefit from this but they're not
[19:30] (1170.64s)
able to take full advantage because they
[19:33] (1173.16s)
must guess at parameters to mutate, i.e.,
[19:36] (1176.08s)
directions to travel they are
[19:38] (1178.12s)
precipitously less likely to guess the
[19:40] (1180.48s)
right direction as the number of
[19:42] (1182.44s)
directions grows this is a more General
[19:45] (1185.72s)
problem with the evolutionary approach
[19:47] (1187.92s)
as you scale up the number of parameters
[19:50] (1190.36s)
parameter space grows exponentially it
[19:53] (1193.44s)
becomes explosively unlikely that you
[19:55] (1195.64s)
will guess the right values to mutate
[19:58] (1198.08s)
you will have to generate a lot of
[19:59] (1199.80s)
failed networks however note that it
[20:02] (1202.48s)
never becomes impossible just very
[20:05] (1205.28s)
improbable given enough time
[20:07] (1207.52s)
improvements will still accumulate it
[20:09] (1209.60s)
just might take literally billions of
[20:11] (1211.56s)
years we meet the curse of
[20:13] (1213.92s)
dimensionality again it has killed many
[20:16] (1216.72s)
other optimization algorithms here too
[20:19] (1219.64s)
the primary and unique advantage of
[20:22] (1222.04s)
gradient descent is scale no matter how
[20:25] (1225.20s)
many parameters you have you can always
[20:27] (1227.68s)
compute the gradient in linear time and
[20:30] (1230.20s)
need take only one step in the optimal
[20:32] (1232.76s)
Direction This lets you take advantage
[20:35] (1235.00s)
of massive neural networks with billions
[20:37] (1237.32s)
of parameters without exploding the
[20:39] (1239.72s)
number of computations it is the
[20:42] (1242.24s)
blessing of dimensionality and it is why
[20:44] (1244.72s)
gradient descent is the state of the
[20:48] (1248.80s)
art but this video is deeply unfair to
[20:52] (1252.08s)
evolutionary methods I am comparing a
[20:54] (1254.76s)
heavily optimized gradient descent
[20:56] (1256.72s)
implementation with a homegrown dumb
[20:59] (1259.16s)
guy baby time local search I will
[21:01] (1261.76s)
definitely make a follow-up video on The
[21:03] (1263.88s)
state-of-the-art evolutionary methods I
[21:06] (1266.40s)
think there may be untapped advantages
[21:08] (1268.28s)
to algorithms other than gradient
[21:10] (1270.16s)
descent the main limitation of gradient
[21:12] (1272.80s)
descent is that to calculate the
[21:14] (1274.56s)
gradient it must be that the loss
[21:16] (1276.56s)
landscape function the loss function and
[21:19] (1279.12s)
the neural network itself are continuous
[21:22] (1282.20s)
and differentiable or connected and
[21:25] (1285.12s)
smooth no holes no breaks no gaps no
[21:28] (1288.88s)
Jagged edges and no weird fractally bits
[21:31] (1291.48s)
that you can't differentiate the
[21:33] (1293.28s)
gradients must flow this applies to
[21:36] (1296.04s)
activation functions too like the
[21:38] (1298.44s)
ReLU uh this jagged point is not
[21:41] (1301.44s)
differentiable but we'll just say its
[21:43] (1303.52s)
derivative is one and this works
[21:45] (1305.12s)
perfectly fine in practice what are you
[21:46] (1306.96s)
going to do about it math nerd however
[21:49] (1309.36s)
you cannot backprop through a binary
[21:51] (1311.44s)
step function like this but you can
[21:54] (1314.24s)
evolve
[21:56] (1316.16s)
it.
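For instance, a binary step activation is flat almost everywhere, so backprop gets no gradient signal through it, but a mutate-and-keep-if-better loop doesn't care. This sketch is my own illustration, not the video's code:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 64)
ys = np.sin(xs)

def step(z):
    # Binary step activation: derivative is zero everywhere it is defined,
    # so gradient descent gets nothing useful to work with.
    return (z > 0).astype(float)

def net(params, x):
    w1, b1, w2, b2 = params.reshape(4, -1)      # one hidden layer of 16 step neurons
    hidden = step(np.outer(x, w1) + b1)
    return hidden @ w2 + b2[0]

def loss(params):
    return np.mean((net(params, xs) - ys) ** 2)

params = rng.normal(size=4 * 16)
for _ in range(5000):                           # evolve: mutate, keep only improvements
    candidate = params + rng.normal(scale=0.1, size=params.shape)
    if loss(candidate) < loss(params):
        params = candidate

print("evolved loss:", loss(params))
```

Pretty neat. And you can use it for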
[21:59] (1319.40s)
images too and make some really creepy
[22:01] (1321.64s)
looking smiley
[22:03] (1323.36s)
faces you can't do this with gradient
[22:05] (1325.92s)
descent it is possible to evolve very
[22:09] (1329.16s)
weird neural networks that are broken
[22:11] (1331.56s)
Jagged or fractally that you can't train
[22:14] (1334.40s)
with gradient descent and this might be
[22:16] (1336.64s)
useful it's a neat idea and it makes an
[22:19] (1339.28s)
interesting image but this is not really
[22:21] (1341.24s)
a better
[22:22] (1342.36s)
approximation there is a common view
[22:24] (1344.64s)
that evolution is just an inferior
[22:26] (1346.80s)
optimization algorithm and well that is
[22:29] (1349.40s)
true to some extent there's a reason we
[22:31] (1351.44s)
don't train language models with genetic
[22:33] (1353.64s)
algorithms but I do think this view is
[22:36] (1356.20s)
missing something important true
[22:38] (1358.48s)
biological evolution has something that
[22:40] (1360.64s)
none of these algorithms have even the
[22:42] (1362.76s)
genetic ones if you actually ran any of
[22:45] (1365.72s)
these programs for a billion years you'd
[22:48] (1368.32s)
end up with just a better and better
[22:50] (1370.88s)
approximation if you run true biological
[22:53] (1373.80s)
evolution for a billion years you end up
[22:56] (1376.24s)
with eyes and wings and brains and pill
[22:59] (1379.68s)
bugs Evolution diverges not just
[23:03] (1383.12s)
converges there's a lot more to say here
[23:05] (1385.60s)
but alas we're outside the scope of the
[23:07] (1387.72s)
video right now gradient descent is the
[23:10] (1390.64s)
most optimized Optimizer by far it is
[23:14] (1394.24s)
the best tool for the job and it scales
[23:17] (1397.00s)
which turns out to be kind of a big deal
[23:19] (1399.48s)
large language models trained with
[23:21] (1401.16s)
gradient descent helped write the code
[23:23] (1403.12s)
and check the script for this video it's
[23:25] (1405.40s)
cool that this algorithm can encode some
[23:27] (1407.84s)
weird understanding of itself into the
[23:30] (1410.76s)
very neurons that it
[23:32] (1412.84s)
trains don't worry the script was human
[23:35] (1415.24s)
written and human fact checked too I
[23:37] (1417.52s)
don't really like the way AI writes and
[23:39] (1419.72s)
I always use human music check out my
[23:41] (1421.84s)
music guys they're great okay video's
[23:44] (1424.52s)
over get out of here