[00:00] (0.21s)
- How many times do you need
to shuffle a deck of cards
[00:03] (3.06s)
to make them truly random?
[00:05] (5.22s)
How much uranium does it
take to build a nuclear bomb?
[00:08] (8.67s)
(explosion booming)
[00:10] (10.02s)
How can you predict the
next word in a sentence?
[00:12] (12.87s)
And how does Google know which page
[00:15] (15.21s)
you're actually searching for?
[00:17] (17.04s)
Well, the reason we know the answer
[00:18] (18.21s)
to all of these questions
[00:19] (19.59s)
is because of a strange
math feud in Russia
[00:22] (22.50s)
that took place over 100 years ago.
[00:26] (26.67s)
In 1905, socialist
groups all across Russia
[00:29] (29.79s)
rose up against the Tsar,
the ruler of the empire.
[00:33] (33.15s)
They demanded a complete political reform,
[00:35] (35.64s)
or failing that, that he step
down from power entirely.
[00:39] (39.45s)
- This divided the nation into two.
[00:41] (41.43s)
So on one side you got
the Tsarists, right?
[00:44] (44.04s)
They wanted to defend the status quo
[00:46] (46.14s)
and keep the Tsar in power.
[00:48] (48.06s)
But then on the other side,
[00:49] (49.08s)
you had the socialists
[00:50] (50.37s)
who wanted this complete political reform.
[00:53] (53.31s)
And this division was so bad
[00:54] (54.75s)
that it crept into every part of society
[00:57] (57.48s)
to the point where even mathematicians
[00:59] (59.25s)
started picking sides.
[01:00] (60.96s)
- [Derek] On the side of
the Tsar was Pavel Nekrasov,
[01:03] (63.84s)
unofficially called the
Tsar of Probability.
[01:06] (66.74s)
Nekrasov was a deeply
religious and powerful man,
[01:10] (70.26s)
and he used his status to
argue that math could be used
[01:13] (73.11s)
to explain free will and the will of God.
[01:16] (76.83s)
- His intellectual nemesis
on the socialist side
[01:19] (79.35s)
was Andrey Markov, also
known as Andrey The Furious.
[01:24] (84.36s)
Markov was an atheist
[01:25] (85.71s)
and he had no patience for people
[01:27] (87.36s)
who were being unrigorous,
[01:29] (89.16s)
something he considered Nekrasov to be,
[01:31] (91.38s)
because in his eyes,
[01:32] (92.37s)
math had nothing to do
with free will or religion.
[01:35] (95.34s)
So he publicly criticized Nekrasov's work,
[01:38] (98.04s)
listing it among "the
abuses of mathematics."
[01:41] (101.52s)
Their feud centered on the main idea
[01:43] (103.44s)
people had used to do probability
for the last 200 years.
[01:46] (106.95s)
And we can illustrate this
with a simple coin flip.
[01:50] (110.28s)
When I flip the coin 10 times,
[01:51] (111.72s)
I get six times heads
and four times tails,
[01:54] (114.45s)
which is obviously not
the 50/50 you'd expect.
[01:57] (117.39s)
But if I keep flipping the coin,
[01:59] (119.13s)
then at first the ratio
jumps all over the place.
[02:02] (122.07s)
But after a large number of flips,
[02:04] (124.14s)
we see that it slowly settles
down and approaches 50/50.
[02:08] (128.01s)
And in this case, after 100 flips,
[02:10] (130.53s)
we end up on 51 heads and 49 tails,
[02:14] (134.07s)
which is almost exactly
what you would expect.
[02:17] (137.37s)
This behavior that the average outcome
[02:19] (139.50s)
gets closer and closer
to the expected value
[02:22] (142.11s)
as you run more and
more independent trials
[02:24] (144.54s)
is known as the law of large numbers.
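The coin-flip demonstration described here is easy to reproduce in code. This is a minimal simulation sketch, not anything from the video's production; it just tracks the running fraction of heads over many fair flips:

```python
import random

def running_heads_ratio(n_flips, seed=0):
    """Simulate fair coin flips and record the running fraction of heads."""
    rng = random.Random(seed)
    heads = 0
    ratios = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # heads with probability 1/2
        ratios.append(heads / i)
    return ratios

ratios = running_heads_ratio(100_000)
# Early on the ratio jumps around; after many flips it settles near 0.5.
print(ratios[9], ratios[-1])
```

With more and more flips, the final ratio drifts ever closer to 0.5, which is exactly the law of large numbers in action.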
[02:27] (147.21s)
It was first proven by
Jacob Bernoulli in 1713,
[02:30] (150.84s)
and it was the key concept
[02:32] (152.19s)
at the heart of probability theory
[02:34] (154.20s)
right up until Markov and Nekrasov.
[02:36] (156.75s)
But Bernoulli only proved
[02:38] (158.46s)
that it worked for independent events
[02:40] (160.29s)
like a fair coin flip,
[02:42] (162.06s)
or when you ask people to guess
[02:43] (163.74s)
how much they think an item is worth,
[02:45] (165.90s)
where one event doesn't
influence the others.
[02:49] (169.29s)
But now imagine that instead
of asking each person
[02:52] (172.29s)
to submit their guess individually,
[02:54] (174.42s)
you ask people to shout
out their answer in public.
[02:57] (177.90s)
Well, in this case,
[02:58] (178.98s)
the first person might think
[03:00] (180.27s)
it's an extraordinarily valuable item,
[03:02] (182.49s)
and say it's worth around $2,000,
[03:05] (185.58s)
but now all the other people in the room
[03:07] (187.95s)
are influenced by this value,
[03:09] (189.63s)
and so, their guesses
have become dependent.
[03:12] (192.84s)
And now the average doesn't
converge to the true value,
[03:16] (196.14s)
but instead it clusters
around a higher amount.
[03:19] (199.56s)
- And so, for 200 years,
[03:21] (201.42s)
probability had relied
on this key assumption,
[03:24] (204.30s)
that you need independence
[03:26] (206.04s)
to observe the law of large numbers.
[03:28] (208.68s)
And this was the idea
[03:30] (210.03s)
that sparked Nekrasov and Markov's feud.
[03:32] (212.82s)
See, Nekrasov agreed with Bernoulli
[03:34] (214.83s)
that you need independence to
get the law of large numbers.
[03:37] (217.80s)
But he took it one step further.
[03:39] (219.78s)
He said, if you see the
law of large numbers,
[03:42] (222.30s)
you can infer that the underlying events
[03:44] (224.88s)
must be independent.
[03:47] (227.55s)
- Take this table of Belgian
marriages from 1841 to 1845.
[03:52] (232.38s)
Now you see that every year
the average is about 29,000.
[03:55] (235.50s)
And so, it seems like the values converge
[03:58] (238.29s)
and therefore that they follow
the law of large numbers.
[04:01] (241.14s)
And when Nekrasov looked
at other social statistics
[04:03] (243.72s)
like crime rates and birth rates,
[04:05] (245.82s)
he noticed a similar pattern.
[04:07] (247.98s)
But now think about where
all this data is coming from.
[04:10] (250.65s)
It's coming from decisions to get married,
[04:12] (252.99s)
decisions to commit crimes,
[04:14] (254.52s)
and decisions to have babies,
[04:16] (256.14s)
at least for the most part.
[04:18] (258.21s)
So Nekrasov reasoned that
because these statistics
[04:20] (260.49s)
followed the law of large numbers,
[04:22] (262.14s)
the decisions causing
them must be independent.
[04:24] (264.81s)
In other words,
[04:25] (265.77s)
he argued that they must
be acts of free will.
[04:29] (269.04s)
So to him, free will wasn't
just something philosophical,
[04:32] (272.19s)
it was something you could measure.
[04:34] (274.05s)
It was scientific.
[04:37] (277.11s)
- [Derek] But to Markov,
Nekrasov was delusional.
[04:40] (280.74s)
He thought it was absurd
[04:41] (281.88s)
to link mathematical
independence to free will.
[04:45] (285.72s)
So Markov set out to prove
that dependent events
[04:48] (288.75s)
could also follow the
law of large numbers,
[04:51] (291.45s)
and that you can still do
probability with dependent events.
[04:56] (296.43s)
- To do this, he needed something
[04:57] (297.93s)
where one event clearly
depended on what came before,
[05:01] (301.32s)
and he got the idea that
this is what happens in text.
[05:04] (304.65s)
Whether your next letter
is a consonant or a vowel
[05:07] (307.50s)
depends heavily on what
the current letter is.
[05:10] (310.53s)
So to test this, Markov turned to a poem
[05:13] (313.14s)
at the heart of Russian literature,
[05:15] (315.07s)
"Eugene Onegin" by Alexander Pushkin.
[05:19] (319.05s)
- He took the first 20,000
letters of the poem,
[05:21] (321.54s)
stripped out all punctuation and spaces,
[05:24] (324.09s)
and pushed them together into
one long string of characters.
[05:27] (327.81s)
He counted the letters
[05:28] (328.86s)
and found that 43% were vowels
and 57% were consonants.
[05:33] (333.48s)
Then Markov broke the string
into overlapping pairs,
[05:37] (337.11s)
that gave him four possible combinations,
[05:39] (339.72s)
vowel-vowel, consonant-consonant,
[05:41] (341.70s)
vowel-consonant, or consonant-vowel.
[05:44] (344.65s)
Now, if the letters were independent,
[05:46] (346.86s)
the probability of a vowel-vowel pair
[05:48] (348.72s)
would just be the
probability of a vowel twice,
[05:51] (351.60s)
which is about 0.18 or an 18% chance.
[05:56] (356.16s)
But when Markov actually counted,
[05:58] (358.41s)
he found vowel-vowel pairs
only show up 6% of the time,
[06:03] (363.15s)
way less than if they were independent.
[06:06] (366.24s)
And when he checked the other pairs,
[06:07] (367.89s)
he found that all actual
values differed greatly
[06:10] (370.92s)
from what the independent
case would predict.
[06:13] (373.86s)
So Markov had shown that
the letters were dependent.
[06:18] (378.09s)
And to beat Nekrasov, all
he needed to do now was show
[06:21] (381.03s)
that these letters still followed
the law of large numbers.
[06:24] (384.03s)
So he created a prediction
machine of sorts.
[06:28] (388.05s)
He started by drawing two circles,
[06:30] (390.06s)
one for a vowel and one for a consonant.
[06:32] (392.43s)
These were his states.
[06:34] (394.41s)
Now, say you're at a vowel,
[06:36] (396.15s)
then the next letter could
either be a vowel or a consonant.
[06:39] (399.66s)
So he drew two arrows to
represent these transitions.
[06:42] (402.94s)
But what are these
transition probabilities?
[06:46] (406.44s)
Well, Markov knew that if you
pick a random starting point,
[06:49] (409.32s)
there is a 43% chance
that it'll be a vowel.
[06:52] (412.44s)
He also knew that vowel-vowel pairs
[06:54] (414.45s)
occur about 6% of the time.
[06:57] (417.15s)
So to find the probability
[06:58] (418.26s)
of going from a vowel to another vowel,
[07:00] (420.45s)
he divided 0.06 by 0.43
[07:03] (423.45s)
to find a transition
probability of about 13%.
[07:07] (427.14s)
And since there is a 100% chance
[07:09] (429.42s)
that another letter comes next,
[07:11] (431.43s)
all the arrows going from the same state
[07:13] (433.62s)
need to add up to one.
[07:15] (435.36s)
So the chance of going to a consonant
[07:17] (437.19s)
is one minus 0.13, or 87%.
[07:21] (441.57s)
He repeated this process
for the consonants
[07:23] (443.52s)
to complete his predictive machine.
[07:25] (445.98s)
So let's see how it works.
[07:28] (448.89s)
We'll start at a vowel.
[07:31] (451.17s)
Next, we generate a random
number between zero and one.
[07:34] (454.71s)
If it's below 0.13, we get another vowel,
[07:37] (457.47s)
if it's above, we get a consonant.
[07:39] (459.84s)
We got 0.78, so we get a consonant,
[07:42] (462.66s)
then we generate another number,
[07:43] (463.98s)
and check if it's above or below 0.67,
[07:46] (466.90s)
0.21, so we get a vowel.
[07:50] (470.01s)
Now, we can keep doing this
[07:51] (471.39s)
and keep track of the ratio
of vowels to consonants.
[07:54] (474.57s)
At first, the ratio
jumps all over the place,
[07:57] (477.21s)
but after a while, it
converges to a steady value,
[08:00] (480.96s)
43% vowels and 57% consonants,
[08:04] (484.23s)
the exact split Markov
had counted by hand.
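The two-state machine can be simulated directly. This sketch assumes the transition probabilities quoted above (13% chance of a vowel after a vowel, 67% chance of a vowel after a consonant) and tracks the long-run fraction of vowels:

```python
import random

# Transition probabilities from Markov's counts of "Eugene Onegin".
P_VOWEL_AFTER_VOWEL = 0.13
P_VOWEL_AFTER_CONSONANT = 0.67

rng = random.Random(42)
state = "V"          # start at a vowel
vowel_count = 0
steps = 200_000
for _ in range(steps):
    p = P_VOWEL_AFTER_VOWEL if state == "V" else P_VOWEL_AFTER_CONSONANT
    state = "V" if rng.random() < p else "C"  # draw the next letter's type
    vowel_count += state == "V"

print(f"fraction of vowels: {vowel_count / steps:.3f}")
```

Despite the dependence between steps, the running ratio converges to roughly 43% vowels (the stationary value 0.67 / (0.67 + 0.87) ≈ 0.435), matching the hand count.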
[08:09] (489.21s)
So Markov had built a dependent system,
[08:11] (491.82s)
a literal chain of events,
[08:13] (493.89s)
and he showed that it still followed
[08:15] (495.69s)
the law of large numbers,
[08:17] (497.25s)
which meant that observing
convergence in social statistics
[08:20] (500.04s)
didn't prove that the underlying
decisions were independent.
[08:23] (503.16s)
In other words,
[08:24] (504.15s)
those statistics don't
prove free will at all.
[08:27] (507.33s)
Markov had shattered Nekrasov's
argument, and he knew it.
[08:31] (511.20s)
So he ended his paper with
one final dig at his rival.
[08:34] (514.96s)
"Thus, free will is not
necessary to do probability."
[08:39] (519.18s)
In fact, independence
[08:40] (520.77s)
isn't even necessary to do probability.
[08:43] (523.17s)
With this Markov chain,
as it came to be known,
[08:45] (525.84s)
he found a way to do probability
with dependent events.
[08:49] (529.59s)
This should have been a huge breakthrough,
[08:51] (531.63s)
because in the real world,
[08:53] (533.19s)
almost everything is
dependent on something else.
[08:56] (536.16s)
I mean, the weather tomorrow
[08:57] (537.75s)
depends on the conditions today.
[09:00] (540.24s)
How a disease spreads
[09:01] (541.32s)
depends on who's infected right now,
[09:03] (543.45s)
and the behavior of particles
[09:05] (545.40s)
depends on the behavior
of particles around them.
[09:08] (548.82s)
Many of these processes
[09:10] (550.20s)
could be modeled using Markov chains.
[09:13] (553.59s)
Do people think it was
like a mic drop moment
[09:15] (555.39s)
and like, "Oh, Nekrasov's
out, like, Markov's the man"?
[09:19] (559.31s)
Or people didn't really
notice, or it was obscure, or?
[09:22] (562.74s)
- I feel like people didn't really notice,
[09:25] (565.56s)
like, it wasn't a really big thing.
[09:28] (568.77s)
And Markov himself
seemingly didn't care much
[09:31] (571.53s)
about how it might be
applied to practical events.
[09:34] (574.56s)
He wrote, "I'm concerned
only with questions
[09:37] (577.23s)
of pure analysis.
[09:38] (578.76s)
I refer to the question
of the applicability
[09:41] (581.25s)
with indifference."
[09:43] (583.74s)
Little did he know that this
new form of probability theory
[09:47] (587.25s)
would soon play a major role
[09:49] (589.05s)
in one of the most important developments
[09:51] (591.18s)
of the 20th century.
[09:54] (594.30s)
On the morning of the 16th of July, 1945,
[09:58] (598.11s)
the United States detonated The Gadget,
[10:01] (601.23s)
the world's first nuclear bomb.
[10:04] (604.86s)
The six kilogram plutonium
bomb created an explosion
[10:08] (608.16s)
that was equivalent to
nearly 25,000 tons of TNT.
[10:12] (612.69s)
This was the culmination
[10:14] (614.34s)
of the top secret Manhattan Project,
[10:16] (616.95s)
a three-year long effort
[10:18] (618.33s)
by some of the smartest people alive,
[10:20] (620.82s)
including people like
J. Robert Oppenheimer,
[10:23] (623.19s)
John von Neumann,
[10:24] (624.75s)
and a little known mathematician
named Stanislaw Ulam.
[10:29] (629.82s)
- [Derek] Even after the war ended,
[10:31] (631.29s)
Ulam continued trying to figure out
[10:33] (633.06s)
how neutrons behave inside a nuclear bomb.
[10:35] (635.91s)
Now, a nuclear bomb works
something like this.
[10:38] (638.55s)
Say you have a core of uranium-235,
[10:41] (641.43s)
then when a neutron hits a U-235 nucleus,
[10:44] (644.64s)
the nucleus splits releasing energy
[10:46] (646.89s)
and, crucially, two or
three more neutrons.
[10:50] (650.37s)
If, on average, those
new neutrons go on to hit
[10:52] (652.86s)
and split more than one
other U-235 nucleus,
[10:56] (656.07s)
you get a runaway chain reaction,
[10:58] (658.38s)
so you have a nuclear bomb.
[11:00] (660.66s)
But uranium-235, the fissile
fuel needed for the bombs
[11:03] (663.96s)
was really hard to get.
[11:05] (665.55s)
So one of the key questions
was just how much of it
[11:08] (668.40s)
do you need to build a bomb?
[11:10] (670.32s)
And this is why Ulam wanted to understand
[11:12] (672.57s)
how the neutrons behave.
[11:15] (675.45s)
- [Casper] But then in January of 1946,
[11:18] (678.09s)
everything came to a halt.
[11:20] (680.95s)
Ulam was struck by a sudden and
severe case of encephalitis,
[11:24] (684.72s)
an inflammation of the brain,
that nearly killed him.
[11:28] (688.71s)
His recovery was long and slow,
[11:30] (690.69s)
with Ulam spending most
of his time in bed.
[11:34] (694.02s)
To pass the time, he played a
simple card game, Solitaire.
[11:38] (698.31s)
But as he played countless games,
[11:40] (700.20s)
winning some, losing others,
[11:42] (702.57s)
one question kept nagging at him,
[11:45] (705.03s)
what are the chances
[11:46] (706.08s)
that a randomly-shuffled game
of Solitaire could be won?
[11:50] (710.16s)
It was a deceptively
difficult problem to solve.
[11:53] (713.58s)
Ulam played with all 52 cards
[11:55] (715.77s)
where each arrangement
created a unique game,
[11:58] (718.41s)
so the total number of possible
games was 52 factorial,
[12:02] (722.34s)
or about eight times 10 to 67.
[12:06] (726.18s)
So solving this analytically was hopeless.
[12:10] (730.20s)
But then Ulam had a flash of insight,
[12:12] (732.60s)
what if I just play hundreds of games
[12:14] (734.55s)
and count how many could be won?
[12:16] (736.83s)
That would give him
[12:17] (737.66s)
some sort of statistical
approximation of the answer.
[12:21] (741.75s)
Back at Los Alamos, the
remaining scientists
[12:23] (743.97s)
grappled with much harder
problems than Solitaire,
[12:26] (746.85s)
like figuring out how neutrons
behave inside a nuclear core.
[12:31] (751.41s)
In a nuclear core,
[12:32] (752.28s)
there are trillions and
trillions of neutrons
[12:34] (754.47s)
all interacting with their surroundings.
[12:36] (756.81s)
So the number of possible
outcomes is immense,
[12:39] (759.30s)
and computing it directly
seemed impossible.
[12:42] (762.66s)
- [Derek] But when Ulam returned to work,
[12:44] (764.40s)
he had a sudden revelation.
[12:46] (766.71s)
What if we could simulate these systems
[12:48] (768.45s)
by generating lots of random outcomes
[12:50] (770.82s)
like I did with Solitaire?
[12:52] (772.86s)
He shared this idea with von Neumann,
[12:54] (774.78s)
who immediately recognized its power,
[12:57] (777.27s)
but also spotted a key problem.
[13:00] (780.39s)
- See, in Solitaire,
each game is independent.
[13:03] (783.24s)
How the cards are dealt in one game
[13:05] (785.01s)
has no effect on the next,
but neutrons aren't like that.
but neutrons aren't like that.
[13:09] (789.10s)
A neutron's behavior
depends on where it is
[13:11] (791.52s)
and what it has done before.
[13:14] (794.52s)
So you couldn't just
sample random outcomes
[13:17] (797.04s)
like in Solitaire.
[13:18] (798.33s)
Instead, you needed to model
a whole chain of events
[13:21] (801.57s)
where each step influenced the next.
[13:24] (804.33s)
What von Neumann realized
[13:25] (805.92s)
is that you needed a Markov chain.
[13:28] (808.89s)
So they made one
[13:30] (810.15s)
and a much simplified version of it
[13:32] (812.04s)
works something like this.
[13:34] (814.11s)
Now, the starting state
[13:35] (815.22s)
is just a neutron
traveling through the core,
[13:37] (817.74s)
and from there, three things can happen.
[13:39] (819.78s)
It can scatter off an
atom and keep traveling,
[13:42] (822.81s)
so that gives you an arrow
going back to itself.
[13:45] (825.69s)
It can leave the system
[13:46] (826.95s)
or get absorbed by a non-fissile material,
[13:49] (829.23s)
in which case it no longer takes
part in the chain reaction,
[13:52] (832.47s)
and so it ends its Markov chain,
[13:55] (835.17s)
or it can strike another uranium-235 atom,
[13:58] (838.53s)
triggering a fission event
[14:00] (840.15s)
and releasing two or three more neutrons
[14:02] (842.49s)
that then start their own chains.
[14:05] (845.34s)
But in this chain,
[14:06] (846.51s)
the transition probabilities aren't fixed,
[14:08] (848.85s)
they depend on things like
the neutron's position,
[14:11] (851.25s)
velocity and energy,
[14:12] (852.75s)
as well as the overall
configuration and mass of uranium.
[14:16] (856.95s)
So a fast-moving neutron might
have a 30% chance to scatter,
[14:20] (860.31s)
a 50% chance to be absorbed or leave,
[14:22] (862.59s)
and a 20% chance to cause fission.
[14:25] (865.08s)
But a slower-moving neutron
[14:26] (866.49s)
would have different probabilities.
[14:29] (869.58s)
Next, they ran this chain
[14:31] (871.08s)
on the world's first
electronic computer, the ENIAC.
[14:34] (874.59s)
The computer started
[14:35] (875.52s)
by randomly generating a
neutron's starting conditions
[14:38] (878.16s)
and stepped through the chain
[14:39] (879.30s)
to keep track of how many neutrons
[14:40] (880.95s)
were produced on average per run,
[14:43] (883.50s)
known as the multiplication factor k.
[14:46] (886.41s)
So if, on average, one neutron
[14:48] (888.42s)
produces another two neutrons,
then k is equal to two.
[14:52] (892.38s)
And if on average every two
neutrons produce three neutrons,
[14:55] (895.95s)
then k is equal to three
over two, and so on.
[14:59] (899.40s)
Then, after stepping
through the full chain
[15:01] (901.23s)
for a specified number of steps,
[15:03] (903.12s)
we collect the average k-value
[15:04] (904.89s)
and record that number in a histogram.
[15:07] (907.50s)
This process was then
repeated hundreds of times,
[15:10] (910.11s)
and the results tallied up,
[15:11] (911.79s)
giving you a statistical
distribution of the outcome.
[15:15] (915.28s)
If you find that in most
cases, k is less than one,
[15:18] (918.39s)
the reaction dies down.
[15:19] (919.89s)
If it's equal to one,
[15:21] (921.27s)
there's a self-sustaining chain reaction,
[15:23] (923.40s)
but it doesn't grow.
[15:24] (924.84s)
And if k is larger than one,
[15:26] (926.40s)
the reaction grows exponentially
[15:28] (928.32s)
and you've got a bomb.
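The simplified neutron chain described here can be sketched as a simulation. These transition probabilities are the illustrative ones from the narration (30% scatter, 50% absorbed or escaped, 20% fission releasing 2 or 3 neutrons), not real cross-section data:

```python
import random

# Toy transition probabilities for one neutron's Markov chain.
P_SCATTER, P_ABSORB, P_FISSION = 0.30, 0.50, 0.20

def offspring(rng):
    """Follow one neutron until it is absorbed/escapes or causes fission.
    Returns how many new neutrons it produces."""
    while True:
        r = rng.random()
        if r < P_SCATTER:
            continue                    # scatter: arrow back to itself
        elif r < P_SCATTER + P_ABSORB:
            return 0                    # absorbed or left the core
        else:
            return rng.choice([2, 3])   # fission releases 2 or 3 neutrons

rng = random.Random(7)
runs = 100_000
k = sum(offspring(rng) for _ in range(runs)) / runs
print(f"estimated multiplication factor k = {k:.2f}")
# With these toy numbers k is about 0.71, so this reaction dies down.
```

Averaging over many runs gives the multiplication factor k without any exact calculation; varying the probabilities (as the real configuration and mass of uranium would) shifts k above or below the critical value of one.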
[15:31] (931.29s)
- With it, von Neumann and
Ulam had a statistical way
[15:34] (934.14s)
to figure out how many
neutrons were produced
[15:36] (936.69s)
without having to do
any exact calculations.
[15:39] (939.81s)
In other words,
[15:40] (940.64s)
they could approximate
differential equations
[15:42] (942.60s)
that were too hard to solve analytically.
[15:45] (945.66s)
All that was needed was a
name for the new method.
[15:48] (948.96s)
Now, Ulam's uncle was a gambler,
[15:51] (951.09s)
and the random sampling
[15:52] (952.20s)
and high stakes reminded Ulam
[15:54] (954.18s)
of the Monte Carlo Casino in
Monaco, and the name stuck.
[15:58] (958.23s)
The Monte
Carlo method was born.
[16:01] (961.65s)
The method was so successful
[16:03] (963.51s)
that it didn't stay secret for long.
[16:05] (965.94s)
By the end of 1948,
[16:07] (967.53s)
scientists at another
lab, Argonne, in Chicago,
[16:10] (970.47s)
used it to study nuclear reactor designs,
[16:13] (973.14s)
and from there, the idea spread quickly.
[16:16] (976.43s)
Ulam later remarked,
[16:18] (978.19s)
"It is still an unending
source of surprise for me
[16:21] (981.06s)
to see how a few scribbles on a blackboard
[16:23] (983.40s)
could change the course of human affairs."
[16:27] (987.18s)
And it wouldn't be the last time
[16:28] (988.77s)
a Markov chain-based method
[16:30] (990.09s)
changed the course of human affairs.
[16:32] (992.68s)
(upbeat music)
[16:35] (995.88s)
- In 1993, the internet
was opened to the public,
[16:39] (999.09s)
and soon it exploded.
[16:41] (1001.28s)
By the mid-1990s, thousands of
new pages appeared every day,
[16:44] (1004.88s)
and that number was only growing.
[16:48] (1008.00s)
This created a new kind of problem.
[16:50] (1010.76s)
I mean, how do you find anything
[16:52] (1012.50s)
in this ever-expanding sea of information?
[16:55] (1015.65s)
In 1994, two Stanford PhD students,
[16:58] (1018.71s)
Jerry Yang and David Filo,
[17:00] (1020.60s)
founded the search engine Yahoo,
[17:02] (1022.31s)
to address this issue,
but they needed money.
[17:06] (1026.45s)
So a year later, they arranged to meet
[17:08] (1028.70s)
with Japanese billionaire, Masayoshi Son,
[17:11] (1031.46s)
also known as the Bill Gates of Japan.
[17:13] (1033.98s)
(gong clangs)
[17:15] (1035.18s)
- They were looking to raise $5 million
[17:17] (1037.37s)
for their next startup,
but Son has other plans.
[17:22] (1042.17s)
He offers to invest a
full $100 million instead.
[17:26] (1046.01s)
That's 20 times more than
what the founders asked for.
[17:29] (1049.43s)
So Jerry Yang declines saying,
"We don't need that much,"
[17:33] (1053.12s)
but Son disagrees,
[17:35] (1055.20s)
"Jerry, everyone needs $100 million."
[17:38] (1058.42s)
(Son laughs)
[17:40] (1060.14s)
Before the founders get
a chance to respond,
[17:42] (1062.39s)
Son jumps in again and asks,
[17:44] (1064.11s)
"Who are your biggest competitors?"
[17:46] (1066.03s)
"Excite and Lycos," the pair respond.
[17:49] (1069.32s)
Son orders his associate
to write those names down.
[17:51] (1071.93s)
And then he says, "If you
don't let me invest in Yahoo,
[17:54] (1074.75s)
I will invest in one of
them and I'll kill you."
[17:59] (1079.16s)
See, Son had realized something.
[18:01] (1081.38s)
None of the leading
search engines at the time
[18:03] (1083.63s)
had any superior technology.
[18:05] (1085.67s)
They didn't have a technological
advantage over the others.
[18:09] (1089.18s)
They all just ranked pages
[18:10] (1090.89s)
by how often a search term
appears on a given page.
[18:14] (1094.31s)
So the battle for the
number one search engine
[18:16] (1096.65s)
would be decided by who
could attract the most users,
[18:19] (1099.59s)
who could spend the most on marketing.
[18:21] (1101.84s)
- [Announcer 1] Lycos, go get it.
[18:23] (1103.70s)
- [Announcer 2] Get Lycos, or get lost.
[18:26] (1106.46s)
- [Announcer 3] This is revolution.
[18:28] (1108.62s)
(upbeat funky music)
[18:31] (1111.42s)
♪ Yahoo ♪
[18:33] (1113.60s)
- [Derek] And marketing
required a lot of money,
[18:35] (1115.70s)
money that Son had, so he
could decide who won the war.
[18:40] (1120.65s)
Yahoo's founders realized they
were left with no real choice
[18:43] (1123.50s)
but to accept Son's investment.
[18:46] (1126.20s)
- [Speaker] So here we are,
right in the middle of Yahoo.
[18:48] (1128.54s)
- [Derek] And within four years,
[18:49] (1129.77s)
Yahoo became the most
popular site on the planet.
[18:52] (1132.95s)
- [Reporter] In the time it
takes to say this sentence,
[18:55] (1135.50s)
Yahoo will answer 79,000
information requests worldwide,
[19:00] (1140.51s)
the two men are now
worth $120 million each.
[19:04] (1144.72s)
♪ Yahoo ♪
[19:07] (1147.71s)
- But Yahoo had a critical weakness.
[19:11] (1151.64s)
See, Yahoo's keyword
search was easy to trick.
[19:14] (1154.58s)
To get your page ranked highly,
[19:15] (1155.99s)
you could just repeat
keywords hundreds of times,
[19:18] (1158.60s)
hidden with white text
on a white background.
[19:21] (1161.99s)
- One thing they didn't
have in those early days
[19:25] (1165.08s)
was a notion of quality of the result.
[19:28] (1168.23s)
So they had a notion of relevance saying,
[19:31] (1171.35s)
does this document talk about the thing
[19:33] (1173.81s)
that you're interested in?
[19:35] (1175.61s)
But there wasn't really a
notion of which ones are better.
[19:39] (1179.18s)
- What they really needed
was a way to rank pages
[19:41] (1181.55s)
by both relevance and quality.
[19:44] (1184.13s)
But how do you measure
the quality of a webpage?
[19:46] (1186.77s)
Well, to understand that,
[19:48] (1188.09s)
we need to borrow an idea from libraries.
[19:50] (1190.49s)
- So I'm old enough that library books
[19:53] (1193.22s)
used to have a paper card in it
[19:55] (1195.26s)
that was a stamp of all the due dates
[19:57] (1197.24s)
of when it was due back.
[19:58] (1198.50s)
You took a book and if
it had a lot of those,
[20:00] (1200.21s)
you said, "Oh, this is
probably a good book."
[20:01] (1201.86s)
And if it didn't have any, you said,
[20:03] (1203.88s)
"Well, maybe this isn't the best book."
[20:06] (1206.15s)
- Stamps acted like endorsements.
[20:07] (1207.89s)
The more stamps, the
better the book must be.
[20:09] (1209.99s)
And the same idea can
be applied to the web.
[20:12] (1212.84s)
Over at Stanford, two PhD students,
[20:15] (1215.21s)
Sergey Brin and Larry Page,
[20:16] (1216.98s)
were working on this exact problem.
[20:19] (1219.32s)
Brin and Page realized
that each link to a page
[20:22] (1222.47s)
can be thought of as an endorsement.
[20:24] (1224.63s)
And the more links a page sends out,
[20:26] (1226.52s)
the less valuable each vote becomes.
[20:29] (1229.73s)
So what they realized is
that we can model the web
[20:32] (1232.31s)
as a Markov chain.
[20:34] (1234.68s)
- [Derek] To see how this works,
[20:36] (1236.12s)
imagine a toy internet
with just four webpages.
[20:39] (1239.30s)
Call them Amy, Ben, Chris, and Dan.
[20:42] (1242.57s)
These are our states.
[20:44] (1244.31s)
Typically, one webpage links to others,
[20:46] (1246.56s)
allowing you to move between them.
[20:48] (1248.33s)
These are our transitions.
[20:50] (1250.49s)
In this setup, Amy only links to Ben,
[20:52] (1252.86s)
so there's a 100% chance
of going from Amy to Ben.
[20:56] (1256.79s)
Ben links to Amy, Chris, and Dan,
[20:58] (1258.98s)
so there's a 33% chance of
going to any of those pages,
[21:02] (1262.67s)
and we can fill out the other
transition probabilities
[21:05] (1265.01s)
in the same way.
[21:07] (1267.05s)
So now we can run this Markov
chain and see what happens.
[21:10] (1270.89s)
Imagine you're a surfer on this web.
[21:13] (1273.11s)
You start on a random page, say, Amy,
[21:15] (1275.63s)
and you keep running the machine
[21:17] (1277.52s)
and keep track of the percentage of time
[21:19] (1279.29s)
you spend on each page.
[21:21] (1281.54s)
Over time, the ratio settles
[21:23] (1283.61s)
and the scores give us some measure
[21:25] (1285.38s)
of the relative importance of these pages.
[21:28] (1288.20s)
You spend the most time on Ben,
[21:29] (1289.64s)
so Ben is ranked first,
followed by Amy, then Dan,
[21:32] (1292.91s)
and lastly Chris.
[21:34] (1294.98s)
It might seem like there's an
easy way to beat the system,
[21:37] (1297.71s)
just make 100 pages all
linking to your website.
[21:40] (1300.74s)
Now you get 100 full votes
[21:42] (1302.48s)
and you'll always rank on
top, but that is not the case.
[21:47] (1307.22s)
While in the first few steps
[21:48] (1308.87s)
they might make your page seem important,
[21:50] (1310.97s)
none of the other websites link to them.
[21:53] (1313.07s)
So over many steps, their
contributions don't matter.
[21:57] (1317.15s)
You might have many links,
[21:58] (1318.53s)
but they're not quality links,
[22:00] (1320.06s)
so they don't affect the algorithm.
[22:03] (1323.78s)
- But there is still one problem, though,
[22:05] (1325.82s)
not all pages are connected.
[22:07] (1327.71s)
In networks like this one,
[22:09] (1329.21s)
a random surfer can get stuck in a loop,
[22:11] (1331.40s)
never reaching the rest of the web.
[22:13] (1333.62s)
So to fix this,
[22:15] (1335.03s)
we can set a rule that 85% of the time,
[22:17] (1337.94s)
our random surfer just
follows a link like normal.
[22:20] (1340.85s)
But then for about 15% of the time,
[22:22] (1342.95s)
they just jump to a page at random.
[22:25] (1345.65s)
This damping factor makes sure
[22:27] (1347.33s)
that we explore all
possible parts of the web
[22:29] (1349.64s)
without ever getting stuck.
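The random surfer, damping factor included, can be simulated on the four-page toy web. One caveat: the narration only spells out Amy's and Ben's outgoing links, so the links assigned to Chris and Dan below are assumptions made just to complete the example:

```python
import random

# Toy web: page -> list of pages it links to.
# Amy -> Ben and Ben -> Amy, Chris, Dan are from the text;
# the Chris and Dan links are assumed for illustration.
LINKS = {
    "Amy":   ["Ben"],
    "Ben":   ["Amy", "Chris", "Dan"],
    "Chris": ["Dan"],
    "Dan":   ["Amy", "Ben"],
}
DAMPING = 0.85  # 85% follow a link, 15% jump to a random page
PAGES = list(LINKS)

rng = random.Random(0)
visits = {page: 0 for page in PAGES}
page = "Amy"
steps = 200_000
for _ in range(steps):
    if rng.random() < DAMPING:
        page = rng.choice(LINKS[page])  # follow an outgoing link
    else:
        page = rng.choice(PAGES)        # teleport: random jump anywhere
    visits[page] += 1

ranking = sorted(PAGES, key=lambda p: visits[p], reverse=True)
print({p: visits[p] / steps for p in PAGES})
print("ranking:", ranking)
```

The fraction of time spent on each page settles to a steady value, and with this link structure Ben, the most-linked-to page, comes out on top while Chris lands at the bottom.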
[22:32] (1352.40s)
By using Markov chains,
[22:34] (1354.02s)
Page and Brin had built
a better search engine,
[22:36] (1356.66s)
and they called it PageRank.
[22:38] (1358.55s)
- Because it's talking
about how pages react,
[22:42] (1362.42s)
webpages react with each other
[22:43] (1363.77s)
and also 'cause the
founder's name is Larry Page,
[22:46] (1366.53s)
so he snuck that in.
[22:48] (1368.24s)
- With PageRank,
[22:49] (1369.14s)
Google got much better search results,
[22:51] (1371.36s)
often getting you to the site
[22:52] (1372.74s)
you were looking for in one go.
[22:54] (1374.93s)
Although, to some, this
sounded like a terrible idea.
[22:58] (1378.15s)
- Others said, "Oh, well you're
telling me you get a search
[23:00] (1380.87s)
that will get the right
result on the first answer?
[23:04] (1384.17s)
Well, I don't want that
[23:05] (1385.01s)
because if it takes them
three or four chances,
[23:08] (1388.91s)
searches to get the right answer,
[23:10] (1390.50s)
then I have three or
four chances to show ads,
[23:13] (1393.26s)
and if you get 'em the answer right away,
[23:15] (1395.27s)
I'm just gonna lose them.
[23:16] (1396.29s)
So, you know, I don't see
why better search is better."
[23:20] (1400.46s)
- But Page and Brin disagreed.
[23:22] (1402.20s)
They were convinced that if
their product was far superior,
[23:24] (1404.96s)
then people would flock to it.
[23:26] (1406.73s)
- I would say it actually
is a democracy that works.
[23:30] (1410.42s)
If all pages were equal,
[23:32] (1412.20s)
anybody can manufacture as
many pages as they want.
[23:35] (1415.13s)
I can set up a billion
pages in my server tomorrow.
[23:38] (1418.40s)
We shouldn't treat them all as equal.
[23:40] (1420.20s)
Just looking at the data out of curiosity,
[23:42] (1422.99s)
we found that we had technology
[23:44] (1424.28s)
to do a better job of search,
[23:45] (1425.78s)
and we realized how impactful
having great search can be.
[23:49] (1429.71s)
- And so, in 1998, they
launched their new search engine
[23:53] (1433.04s)
to take on Yahoo.
[23:54] (1434.51s)
Initially, they called it BackRub,
[23:56] (1436.37s)
after the backlinks it analyzed,
[23:58] (1438.65s)
but then they realized
[23:59] (1439.48s)
that maybe that's not
the most attractive name.
[24:02] (1442.25s)
Now, their ambitions were big:
[24:03] (1443.81s)
to essentially index all
the pages on the internet,
[24:06] (1446.93s)
and they needed a name equally as big.
[24:09] (1449.42s)
So they thought of the largest
number they could think of,
[24:12] (1452.45s)
10 to the power of 100, a googol.
[24:15] (1455.51s)
But then when trying to
register their domain,
[24:17] (1457.73s)
they accidentally misspelled it.
[24:19] (1459.32s)
And so, Google was born.
[24:21] (1461.84s)
(dramatic music)
[24:26] (1466.65s)
Over the next four years,
[24:27] (1467.99s)
Google overthrew Yahoo
[24:29] (1469.49s)
to become the most used search engine.
[24:31] (1471.53s)
- Everyone who knows the internet
[24:32] (1472.97s)
almost certainly knows Google.
[24:34] (1474.83s)
- Googling is like oxygen to teenagers.
[24:37] (1477.89s)
- [Casper] And today, Alphabet,
[24:39] (1479.27s)
which is Google's parent company,
[24:40] (1480.89s)
is worth around $2 trillion.
[24:43] (1483.29s)
- When Google makes even
the slightest change
[24:45] (1485.81s)
in its algorithms, it
can have huge effects.
[24:48] (1488.41s)
- Google.
- Google.
[24:49] (1489.41s)
- Google.
- Google.
[24:51] (1491.78s)
- They're on fire.
[24:52] (1492.74s)
And the reason why they're on fire
[24:53] (1493.97s)
is because they're focused
[24:55] (1495.08s)
and they're more focused
than Yahoo who does search,
[24:57] (1497.06s)
they're more focused than Microsoft
[24:58] (1498.38s)
who does search with Bing.
[24:59] (1499.58s)
Yahoo has lots of
traffic, they always have,
[25:01] (1501.59s)
they have some really great properties,
[25:03] (1503.24s)
but I don't think Yahoo is
the go-to place, you know.
[25:06] (1506.48s)
- And at the heart of this
trillion dollar algorithm
[25:09] (1509.33s)
is a Markov chain, which only
looks at the current state
[25:12] (1512.90s)
to predict what's going to happen next.
[25:15] (1515.99s)
But in the 1940s, Claude Shannon,
[25:18] (1518.78s)
the father of information theory,
[25:20] (1520.64s)
started asking a different question.
[25:23] (1523.46s)
He went back to Markov's
original idea of predicting text,
[25:26] (1526.85s)
but instead of just using
vowels and consonants,
[25:29] (1529.25s)
he focused on individual letters.
[25:31] (1531.89s)
And he wondered,
[25:32] (1532.79s)
what if instead of looking
[25:34] (1534.02s)
at only the last letter as a predictor,
[25:36] (1536.06s)
I look at the last two?
[25:37] (1537.95s)
Well, with that, he got
text that looked like this.
[25:41] (1541.13s)
Now, it doesn't make much sense,
[25:42] (1542.45s)
but there are some recognizable words
[25:44] (1544.43s)
like "whey", "of", and "the".
[25:47] (1547.70s)
But Shannon was convinced
he could do better.
[25:49] (1549.89s)
So next, instead of looking at letters,
[25:52] (1552.05s)
he wondered, what if I use
entire words as predictors?
[25:55] (1555.74s)
That gave him sentences like this,
[25:57] (1557.88s)
"The head and in frontal
attack on an English writer
[26:00] (1560.78s)
that the character of this point
[26:02] (1562.28s)
is therefore another
method for the letters
[26:04] (1564.68s)
that the time of who ever told the problem
[26:07] (1567.32s)
for an unexpected."
[26:09] (1569.63s)
Now, clearly, this doesn't make any sense,
[26:12] (1572.27s)
but Shannon did notice that
sequences of four words or so
[26:15] (1575.48s)
generally did make sense.
[26:17] (1577.22s)
For instance, "attack
on an English writer"
[26:19] (1579.86s)
kind of makes sense.
[26:21] (1581.45s)
So Shannon learned that
you can make better
[26:23] (1583.40s)
and better predictions
[26:24] (1584.48s)
about what the next word is going to be
[26:26] (1586.43s)
by taking into account more
and more of the previous words.
[26:30] (1590.60s)
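Shannon's word-level experiment is easy to recreate (the toy corpus, an order of two, and the output length are assumptions for this sketch):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words that followed it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=12, seed=0):
    """Walk the chain: each step predicts the next word from the last `order` words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length - len(out)):
        followers = chain.get(state)
        if not followers:                   # dead end: restart from a random state
            state = rng.choice(list(chain))
            followers = chain[state]
        word = rng.choice(followers)
        out.append(word)
        state = (*state[1:], word)
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran to the door"
sample = generate(build_chain(corpus))
```

Every word in the output comes from the corpus, and short runs of it read naturally, just as Shannon found.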
It's kind of like what Gmail does
[26:32] (1592.31s)
when it predicts what
you're going to type next.
[26:35] (1595.49s)
And this is no coincidence,
[26:37] (1597.29s)
the algorithms that make these predictions
[26:39] (1599.45s)
are based on Markov chains.
[26:41] (1601.52s)
- They're not necessarily
using letters, you know,
[26:45] (1605.30s)
they use what they call tokens,
[26:47] (1607.31s)
some of which are letters,
some of which are words,
[26:50] (1610.34s)
marks of punctuation, whatever.
[26:52] (1612.68s)
So it's a bigger set
than just the alphabet.
[26:56] (1616.19s)
The game is simply, we
have this string of tokens
[27:00] (1620.39s)
that, you know, might be 30 long,
[27:04] (1624.17s)
and we're asking what are the odds
[27:06] (1626.63s)
that the next token is
this or this or this?
[27:09] (1629.45s)
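That token game can be sketched as a simple counting exercise (the token string here is invented for illustration):

```python
from collections import Counter, defaultdict

def next_token_odds(tokens, context_len=1):
    """Estimate P(next token | context) by counting, exactly as a Markov chain does."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    # Normalize counts into probabilities for each context.
    return {
        ctx: {tok: n / sum(c.values()) for tok, n in c.items()}
        for ctx, c in counts.items()
    }

tokens = ["a", "b", "a", "b", "a", "c", "a", "b"]  # a made-up token string
odds = next_token_odds(tokens)
# After "a" we saw "b" three times and "c" once, so P(b|a) = 0.75 and P(c|a) = 0.25.
```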
- [Derek] But today's
large language models
[27:10] (1630.83s)
don't treat all those tokens equally,
[27:13] (1633.23s)
because unlike simple Markov chains,
[27:15] (1635.03s)
they also use something called attention,
[27:17] (1637.07s)
which tells the model
what to pay attention to.
[27:20] (1640.04s)
So in the phrase, "the
structure of the cell,"
[27:22] (1642.50s)
the model can use previous context
[27:24] (1644.57s)
like blood and mitochondria
[27:25] (1645.95s)
to know the cell most likely refers
[27:28] (1648.23s)
to biology rather than a prison cell.
[27:30] (1650.45s)
And it uses that to tune its prediction.
[27:33] (1653.75s)
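Here is a minimal sketch of that idea: scaled dot-product attention over hand-made two-dimensional "meaning" vectors (the vectors and axes are invented for illustration; real models learn high-dimensional ones):

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to one."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention: score each key against the query,
    then return the attention-weighted average of the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Hypothetical 2-d "meaning" vectors: axis 0 = biology, axis 1 = prison.
vectors = {
    "mitochondria": [1.0, 0.0],
    "blood":        [0.9, 0.1],
    "cell":         [0.6, 0.4],  # ambiguous on its own
}
context = ["mitochondria", "blood", "cell"]
keys = values = [vectors[w] for w in context]
meaning = attend(vectors["cell"], keys, values)
# The biology component of `meaning` ends up larger than the prison one,
# because the context words pull the interpretation toward biology.
```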
But as large language models
become more widespread,
[27:36] (1656.39s)
one concern is that the text they produce
[27:38] (1658.43s)
ends up on the internet
[27:39] (1659.57s)
and that becomes training
data for future models.
[27:43] (1663.86s)
- When you start doing that,
[27:47] (1667.07s)
the game is very soon over.
[27:48] (1668.99s)
You come, in this case, to
a very dull, stable state,
[27:52] (1672.53s)
it just says the same thing
[27:53] (1673.67s)
over and over and over again forever.
[27:55] (1675.47s)
The language models are
vulnerable to this process.
[27:59] (1679.43s)
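That collapse can be demonstrated even with a simple word-level Markov chain trained over and over on its own output (the corpus, ten generations, and sample length are arbitrary choices for this sketch):

```python
import random
from collections import defaultdict

def train(words, order=1):
    """Build a Markov chain: each word maps to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def sample(chain, length, rng):
    """Generate `length` words by walking the chain."""
    state = rng.choice(list(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:                   # dead end: restart anywhere
            state = rng.choice(list(chain))
            followers = chain[state]
        word = rng.choice(followers)
        out.append(word)
        state = (*state[1:], word)
    return out

rng = random.Random(1)
words = "the quick brown fox jumps over the lazy dog and the cat".split()
vocab_sizes = []
for generation in range(10):
    vocab_sizes.append(len(set(words)))
    chain = train(words)
    words = sample(chain, 50, rng)  # the next generation trains on its own output
```

Each generation can only ever reuse words the previous one produced, so the vocabulary never grows and tends to shrink toward that dull, repetitive state.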
- And any system like this
where we have a feedback loop,
[28:02] (1682.28s)
will become hard to model
using Markov chains.
[28:05] (1685.76s)
Take global warming, for instance,
[28:07] (1687.38s)
as we increase the amount of
carbon dioxide in the air,
[28:09] (1689.69s)
the average temperature
of the Earth increases.
[28:12] (1692.33s)
But as the temperature increases,
[28:13] (1693.68s)
the atmosphere can hold more water vapor,
[28:15] (1695.81s)
which is an incredibly
powerful greenhouse gas.
[28:18] (1698.39s)
And with more water vapor,
[28:19] (1699.80s)
the temperature increases further
[28:21] (1701.36s)
allowing for even more water vapor.
[28:23] (1703.37s)
So you get this positive feedback loop,
[28:25] (1705.35s)
which makes it hard to predict
what's going to happen next.
[28:28] (1708.41s)
So there are some systems
where Markov chains don't work,
[28:31] (1711.74s)
but for many other dependent systems,
[28:33] (1713.54s)
they offer a way of doing probability.
[28:36] (1716.51s)
- But what's fascinating
[28:37] (1717.71s)
is that all these systems
have extremely long histories.
[28:41] (1721.25s)
I mean, you could trace back
all the letters in a text,
[28:43] (1723.89s)
trace back all the interactions
of what a neutron did,
[28:46] (1726.47s)
or trace back the weather for weeks.
[28:49] (1729.11s)
But the beautiful thing
Markov and others found
[28:51] (1731.60s)
is that for many of these systems
[28:53] (1733.46s)
you can ignore almost all of that.
[28:55] (1735.74s)
You can just look at the current state
[28:57] (1737.96s)
and forget about the rest,
[29:00] (1740.24s)
that makes these systems memoryless.
[29:02] (1742.94s)
And it's this memoryless property
[29:05] (1745.34s)
that makes Markov chains so powerful
[29:07] (1747.83s)
because it's what allows you
[29:09] (1749.09s)
to take these extremely complex systems
[29:11] (1751.58s)
and simplify them a lot
[29:13] (1753.35s)
to still make meaningful predictions.
[29:16] (1756.23s)
- [Derek] As one paper
put it, "Problem-solving
[29:18] (1758.06s)
is often a matter of cooking up
[29:19] (1759.71s)
an appropriate Markov chain."
[29:22] (1762.23s)
- It's kind of ridiculous to me
[29:23] (1763.97s)
that this basic fact of mathematics
[29:26] (1766.64s)
would come out of a fight like that,
[29:28] (1768.62s)
which, you know, really
had nothing to do with it.
[29:31] (1771.23s)
But all the evidence suggests
[29:33] (1773.57s)
that it really was this determination
[29:36] (1776.69s)
to show up Nekrasov that
led Markov to do the work.
[29:41] (1781.52s)
- But there's one question
we still haven't answered.
[29:44] (1784.64s)
When playing Solitaire,
[29:46] (1786.23s)
how did Ulam know his cards
were perfectly shuffled?
[29:49] (1789.44s)
I mean, how many shuffles does it take
[29:51] (1791.99s)
to get a completely random
arrangement of cards?
[29:56] (1796.10s)
- If you have a deck of cards,
[29:57] (1797.81s)
you need to shuffle it, right?
[30:00] (1800.06s)
How often, if you're
shuffling, like, you know,
[30:02] (1802.10s)
you split it in half, and then you do the
[30:04] (1804.11s)
(cards riffling).
[30:05] (1805.52s)
How often do you have to shuffle it
[30:07] (1807.59s)
to make it completely random?
[30:09] (1809.51s)
- Two.
- Two?
[30:11] (1811.16s)
I'm going with 26.
[30:12] (1812.24s)
- Yeah, four times.
- Four times?
[30:13] (1813.86s)
- I don't know, 52 times?
[30:15] (1815.84s)
- Okay. Okay.
[30:16] (1816.67s)
It's not a bad guess.
[30:18] (1818.22s)
- Seven?
[30:19] (1819.24s)
- It is seven.
[30:21] (1821.03s)
- Really?
- Yeah.
[30:22] (1822.05s)
So you can think of card
shuffling as a Markov chain
[30:24] (1824.42s)
where each deck arrangement is a state,
[30:26] (1826.94s)
and then each shuffle is a step.
[30:28] (1828.80s)
And so for a deck of 52 cards,
[30:30] (1830.44s)
if you riffle shuffle it seven times,
[30:32] (1832.79s)
then every arrangement of the
deck is about equally likely,
[30:35] (1835.67s)
so it's basically random.
[30:38] (1838.88s)
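That shuffle-as-Markov-chain idea can be simulated with the Gilbert-Shannon-Reeds model of a riffle, the standard mathematical model behind the seven-shuffles result (the seed and checks here are illustrative):

```python
import random

def riffle(deck, rng):
    """One Gilbert-Shannon-Reeds riffle: cut at a binomial point, then drop
    cards from each half with probability proportional to its remaining size."""
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))  # binomial(n, 1/2) cut
    left, right = deck[:cut], deck[cut:]
    out = []
    while left or right:
        if rng.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out

rng = random.Random(0)
deck = list(range(52))   # each arrangement of these 52 cards is one state
for _ in range(7):       # each riffle is one step of the Markov chain
    deck = riffle(deck, rng)
```

Every riffle is one step of the chain, and after about seven steps the distribution over all 52! arrangements is close to uniform.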
But I can't shuffle like that.
[30:40] (1840.59s)
So for me, what I do is I do it like this.
[30:43] (1843.77s)
How many times do you think
you have to shuffle like this
[30:45] (1845.81s)
to get it random?
[30:47] (1847.27s)
(beeping)
[30:48] (1848.30s)
- What do you think?
[30:49] (1849.65s)
And perhaps more importantly,
[30:51] (1851.39s)
how would you go about working it out?
[30:54] (1854.09s)
Well, that's where today's
sponsor Brilliant comes in.
[30:56] (1856.58s)
Brilliant is a learning app
[30:57] (1857.90s)
that gets you hands-on with
problems just like this.
[31:01] (1861.29s)
Whether it's math, physics, programming,
[31:03] (1863.69s)
or even AI, Brilliant's
interactive lessons and challenges
[31:07] (1867.14s)
let you play your way to a sharper mind.
[31:09] (1869.90s)
You can discover how large
language models actually work
[31:12] (1872.93s)
from basic Markov chains
to complex neural networks,
[31:16] (1876.20s)
or dig into the math behind
this shuffling question.
[31:19] (1879.65s)
It's a fun way to build
knowledge and skills
[31:21] (1881.69s)
that help you solve all kinds of problems,
[31:24] (1884.15s)
which brings us back to our shuffle.
[31:26] (1886.52s)
So Casper, what actually is the answer?
[31:29] (1889.58s)
- It's actually over 2,000.
[31:31] (1891.50s)
- What?
- Over-
[31:32] (1892.33s)
- Crazy, right?
- Yeah.
[31:33] (1893.42s)
- So the next time someone
offers to shuffle before a game,
[31:36] (1896.09s)
make sure they're doing it right,
[31:37] (1897.59s)
seven riffles or it doesn't count.
[31:40] (1900.17s)
But the interesting part
[31:41] (1901.25s)
isn't just knowing that,
it's understanding why
[31:43] (1903.95s)
and seeing how a simple question
[31:45] (1905.45s)
can lead you to some
surprisingly complex mathematics.
[31:49] (1909.23s)
And that's what Brilliant is all about.
[31:52] (1912.03s)
So to try everything Brilliant
has to offer for free
[31:55] (1915.29s)
for a full 30 days,
[31:56] (1916.49s)
visit brilliant.org/veritasium,
[31:58] (1918.89s)
click that link in the description
[32:00] (1920.39s)
or scan this handy QR code.
[32:02] (1922.79s)
And if you sign up,
[32:03] (1923.69s)
you'll also get 20% off their
annual premium subscription.
[32:07] (1927.01s)
So I wanna thank Brilliant
for sponsoring this video
[32:09] (1929.33s)
and I wanna thank you for watching.
[32:11] (1931.63s)
(upbeat music)
[32:21] (1941.15s)
- Easy.
[32:22] (1942.10s)
(light music)