The Strange Math That Predicts (Almost) Anything

[00:00] (0.21s)

- How many times do you need to shuffle a deck of cards

[00:03] (3.06s)

to make them truly random?

[00:05] (5.22s)

How much uranium does it take to build a nuclear bomb?

[00:08] (8.67s)

(explosion booming)

[00:10] (10.02s)

How can you predict the next word in a sentence?

[00:12] (12.87s)

And how does Google know which page

[00:15] (15.21s)

you're actually searching for?

[00:17] (17.04s)

Well, the reason we know the answer

[00:18] (18.21s)

to all of these questions

[00:19] (19.59s)

is because of a strange math feud in Russia

[00:22] (22.50s)

that took place over 100 years ago.

[00:26] (26.67s)

In 1905, socialist groups all across Russia

[00:29] (29.79s)

rose up against the Tsar, the ruler of the empire.

[00:33] (33.15s)

They demanded a complete political reform,

[00:35] (35.64s)

or failing that, that he stepped down from power entirely.

[00:39] (39.45s)

- This divided the nation into two.

[00:41] (41.43s)

So on one side you got the Tsarists, right?

[00:44] (44.04s)

They wanted to defend the status quo

[00:46] (46.14s)

and keep the Tsar in power.

[00:48] (48.06s)

But then on the other side,

[00:49] (49.08s)

you had the socialists

[00:50] (50.37s)

who wanted this complete political reform.

[00:53] (53.31s)

And this division was so bad

[00:54] (54.75s)

that it crept into every part of society

[00:57] (57.48s)

to the point where even mathematicians

[00:59] (59.25s)

started picking sides.

[01:00] (60.96s)

- [Derek] On the side of the Tsar was Pavel Nekrasov,

[01:03] (63.84s)

unofficially called the Tsar of Probability.

[01:06] (66.74s)

Nekrasov was a deeply religious and powerful man,

[01:10] (70.26s)

and he used his status to argue that math could be used

[01:13] (73.11s)

to explain free will and the will of God.

[01:16] (76.83s)

- His intellectual nemesis on the socialist side

[01:19] (79.35s)

was Andrey Markov, also known as Andrey The Furious.

[01:24] (84.36s)

Markov was an atheist

[01:25] (85.71s)

and he had no patience for people

[01:27] (87.36s)

who were being unrigorous,

[01:29] (89.16s)

something he considered Nekrasov to be,

[01:31] (91.38s)

because in his eyes,

[01:32] (92.37s)

math had nothing to do with free will or religion.

[01:35] (95.34s)

So he publicly criticized Nekrasov's work,

[01:38] (98.04s)

listing it among "the abuses of mathematics."

[01:41] (101.52s)

Their feud centered on the main idea

[01:43] (103.44s)

people had used to do probability for the last 200 years.

[01:46] (106.95s)

And we can illustrate this with a simple coin flip.

[01:50] (110.28s)

When I flip the coin 10 times,

[01:51] (111.72s)

I get six times heads and four times tails,

[01:54] (114.45s)

which is obviously not the 50/50 you'd expect.

[01:57] (117.39s)

But if I keep flipping the coin,

[01:59] (119.13s)

then at first the ratio jumps all over the place.

[02:02] (122.07s)

But after a large number of flips,

[02:04] (124.14s)

we see that it slowly settles down and approaches 50/50.

[02:08] (128.01s)

And in this case, after 100 flips,

[02:10] (130.53s)

we end up on 51 heads and 49 tails,

[02:14] (134.07s)

which is almost exactly what you would expect.

[02:17] (137.37s)

This behavior that the average outcome

[02:19] (139.50s)

gets closer and closer to the expected value

[02:22] (142.11s)

as you run more and more independent trials

[02:24] (144.54s)

is known as the law of large numbers.

[02:27] (147.21s)

It was first proven by Jacob Bernoulli in 1713,

[02:30] (150.84s)

and it was the key concept

[02:32] (152.19s)

at the heart of probability theory

[02:34] (154.20s)

right up until Markov and Nekrasov.

[02:36] (156.75s)

But Bernoulli only proved

[02:38] (158.46s)

that it worked for independent events

[02:40] (160.29s)

like a fair coin flip,

[02:42] (162.06s)

or when you ask people to guess

[02:43] (163.74s)

how much they think an item is worth,

[02:45] (165.90s)

where one event doesn't influence the others.

[02:49] (169.29s)

But now imagine that instead of asking each person

[02:52] (172.29s)

to submit their guess individually,

[02:54] (174.42s)

you ask people to shout out their answer in public.

[02:57] (177.90s)

Well, in this case,

[02:58] (178.98s)

the first person might think

[03:00] (180.27s)

it's an extraordinarily valuable item,

[03:02] (182.49s)

and say it's worth around $2,000,

[03:05] (185.58s)

but now all the other people in the room

[03:07] (187.95s)

are influenced by this value,

[03:09] (189.63s)

and so, their guesses have become dependent.

[03:12] (192.84s)

And now the average doesn't converge to the true value,

[03:16] (196.14s)

but instead it clusters around a higher amount.

[03:19] (199.56s)

- And so, for 200 years,

[03:21] (201.42s)

probability had relied on this key assumption,

[03:24] (204.30s)

that you need independence

[03:26] (206.04s)

to observe the law of large numbers.

[03:28] (208.68s)

And this was the idea

[03:30] (210.03s)

that sparked Nekrasov and Markov's feud.

[03:32] (212.82s)

See, Nekrasov agreed with Bernoulli

[03:34] (214.83s)

that you need independence to get the law of large numbers.

[03:37] (217.80s)

But he took it one step further.

[03:39] (219.78s)

He said, if you see the law of large numbers,

[03:42] (222.30s)

you can infer that the underlying events

[03:44] (224.88s)

must be independent.

[03:47] (227.55s)

- Take this table of Belgian marriages from 1841 to 1845.

[03:52] (232.38s)

Now you see that every year the average is about 29,000.

[03:55] (235.50s)

And so, it seems like the values converge

[03:58] (238.29s)

and therefore that they follow the law of large numbers.

[04:01] (241.14s)

And when Nekrasov looked at other social statistics

[04:03] (243.72s)

like crime rates and birth rates,

[04:05] (245.82s)

he noticed a similar pattern.

[04:07] (247.98s)

But now think about where all this data is coming from.

[04:10] (250.65s)

It's coming from decisions to get married,

[04:12] (252.99s)

decisions to commit crimes,

[04:14] (254.52s)

and decisions to have babies,

[04:16] (256.14s)

at least for the most part.

[04:18] (258.21s)

So Nekrasov reasoned that because these statistics

[04:20] (260.49s)

followed the law of large numbers,

[04:22] (262.14s)

the decisions causing them must be independent.

[04:24] (264.81s)

In other words,

[04:25] (265.77s)

he argued that they must be acts of free will.

[04:29] (269.04s)

So to him, free will wasn't just something philosophical,

[04:32] (272.19s)

it was something you could measure.

[04:34] (274.05s)

It was scientific.

[04:37] (277.11s)

- [Derek] But to Markov, Nekrasov was delusional.

[04:40] (280.74s)

He thought it was absurd

[04:41] (281.88s)

to link mathematical independence to free will.

[04:45] (285.72s)

So Markov set out to prove that dependent events

[04:48] (288.75s)

could also follow the law of large numbers,

[04:51] (291.45s)

and that you can still do probability with dependent events.

[04:56] (296.43s)

- To do this, he needed something

[04:57] (297.93s)

where one event clearly depended on what came before,

[05:01] (301.32s)

and he got the idea that this is what happens in text.

[05:04] (304.65s)

Whether your next letter is a consonant or a vowel

[05:07] (307.50s)

depends heavily on what the current letter is.

[05:10] (310.53s)

So to test this, Markov turned to a poem

[05:13] (313.14s)

at the heart of Russian literature,

[05:15] (315.07s)

"Eugene Onegin" by Alexander Pushkin.

[05:19] (319.05s)

- He took the first 20,000 letters of the poem,

[05:21] (321.54s)

stripped out all punctuation and spaces,

[05:24] (324.09s)

and pushed them together into one long string of characters.

[05:27] (327.81s)

He counted the letters

[05:28] (328.86s)

and found that 43% were vowels and 57% were consonants.

[05:33] (333.48s)

Then Markov broke the string into overlapping pairs,

[05:37] (337.11s)

that gave him four possible combinations,

[05:39] (339.72s)

vowel-vowel, consonant-consonant,

[05:41] (341.70s)

vowel-consonant, or consonant-vowel.

[05:44] (344.65s)

Now, if the letters were independent,

[05:46] (346.86s)

the probability of a vowel-vowel pair

[05:48] (348.72s)

would just be the probability of a vowel twice,

[05:51] (351.60s)

which is about 0.18 or an 18% chance.

[05:56] (356.16s)

But when Markov actually counted,

[05:58] (358.41s)

he found vowel-vowel pairs only show up 6% of the time,

[06:03] (363.15s)

way less than if they were independent.

[06:06] (366.24s)

And when he checked the other pairs,

[06:07] (367.89s)

he found that all actual values differed greatly

[06:10] (370.92s)

from what the independent case would predict.

[06:13] (373.86s)

So Markov had shown that the letters were dependent.

[06:18] (378.09s)

And to beat Nekrasov, all he needed to do now was show

[06:21] (381.03s)

that these letters still followed the law of large numbers.

[06:24] (384.03s)

So he created a prediction machine of sorts.

[06:28] (388.05s)

He started by drawing two circles,

[06:30] (390.06s)

one for a vowel and one for a consonant.

[06:32] (392.43s)

These were his states.

[06:34] (394.41s)

Now, say you're at a vowel,

[06:36] (396.15s)

then the next letter could either be a vowel or a consonant.

[06:39] (399.66s)

So he drew two arrows to represent these transitions.

[06:42] (402.94s)

But what are these transition probabilities?

[06:46] (406.44s)

Well, Markov knew that if you pick a random starting point,

[06:49] (409.32s)

there is a 43% chance that it'll be a vowel.

[06:52] (412.44s)

He also knew that vowel-vowel pairs

[06:54] (414.45s)

occur about 6% of the time.

[06:57] (417.15s)

So to find the probability

[06:58] (418.26s)

of going from a vowel to another vowel,

[07:00] (420.45s)

he divided 0.06 by 0.43

[07:03] (423.45s)

to find a transition probability of about 13%.

[07:07] (427.14s)

And since there is a 100% chance

[07:09] (429.42s)

that another letter comes next,

[07:11] (431.43s)

all the arrows going from the same state

[07:13] (433.62s)

need to add up to one.

[07:15] (435.36s)

So the chance of going to a consonant

[07:17] (437.19s)

is one minus 0.13, or 87%.

[07:21] (441.57s)

He repeated this process for the consonants

[07:23] (443.52s)

to complete his predictive machine.

[07:25] (445.98s)

So let's see how it works.

[07:28] (448.89s)

We'll start at a vowel.

[07:31] (451.17s)

Next, we generate a random number between zero and one.

[07:34] (454.71s)

If it's below 0.13, we get another vowel,

[07:37] (457.47s)

if it's above, we get a consonant.

[07:39] (459.84s)

We got 0.78, so we get a consonant,

[07:42] (462.66s)

then we generate another number,

[07:43] (463.98s)

and check if it's above or below 0.67,

[07:46] (466.90s)

0.21, so we get a vowel.

[07:50] (470.01s)

Now, we can keep doing this

[07:51] (471.39s)

and keep track of the ratio of vowels to consonants.

[07:54] (474.57s)

At first, the ratio jumps all over the place,

[07:57] (477.21s)

but after a while, it converges to a steady value,

[08:00] (480.96s)

43% vowels and 57% consonants,

[08:04] (484.23s)

the exact split Markov had counted by hand.

[08:09] (489.21s)

So Markov had built a dependent system,

[08:11] (491.82s)

a literal chain of events,

[08:13] (493.89s)

and he showed that it still followed

[08:15] (495.69s)

the law of large numbers,

[08:17] (497.25s)

which meant that observing convergence in social statistics

[08:20] (500.04s)

didn't prove that the underlying decisions were independent.

[08:23] (503.16s)

In other words,

[08:24] (504.15s)

those statistics don't prove free will at all.

[08:27] (507.33s)

Markov had shattered Nekrasov's argument, and he knew it.

[08:31] (511.20s)

So he ended his paper with one final dig at his rival.

[08:34] (514.96s)

"Thus, free will is not necessary to do probability."

[08:39] (519.18s)

In fact, independence

[08:40] (520.77s)

isn't even necessary to do probability.

[08:43] (523.17s)

With this Markov chain, as it came to be known,

[08:45] (525.84s)

he found a way to do probability with dependent events.

[08:49] (529.59s)

This should have been a huge breakthrough,

[08:51] (531.63s)

because in the real world,

[08:53] (533.19s)

almost everything is dependent on something else.

[08:56] (536.16s)

I mean, the weather tomorrow

[08:57] (537.75s)

depends on the conditions today.

[09:00] (540.24s)

How a disease spreads

[09:01] (541.32s)

depends on who's infected right now,

[09:03] (543.45s)

and the behavior of particles

[09:05] (545.40s)

depends on the behavior of particles around them.

[09:08] (548.82s)

Many of these processes

[09:10] (550.20s)

could be modeled using Markov chains.

[09:13] (553.59s)

Do people think it was like a mic drop moment

[09:15] (555.39s)

and like, "Oh, Nekrasov's out, like, Markov's the man"?

[09:19] (559.31s)

Or people didn't really notice, or it was obscure, or?

[09:22] (562.74s)

- I feel like people didn't really notice,

[09:25] (565.56s)

like, it wasn't a really big thing.

[09:28] (568.77s)

And Markov himself seemingly didn't care much

[09:31] (571.53s)

about how it might be applied to practical events.

[09:34] (574.56s)

He wrote, "I'm concerned only with questions

[09:37] (577.23s)

of pure analysis.

[09:38] (578.76s)

I refer to the question of the applicability

[09:41] (581.25s)

with indifference."

[09:43] (583.74s)

Little did he know that this new form of probability theory

[09:47] (587.25s)

would soon play a major role

[09:49] (589.05s)

in one of the most important developments

[09:51] (591.18s)

of the 20th century.

[09:54] (594.30s)

On the morning of the 16th of July, 1945,

[09:58] (598.11s)

the United States detonated The Gadget,

[10:01] (601.23s)

the world's first nuclear bomb.

[10:04] (604.86s)

The six kilogram plutonium bomb created an explosion

[10:08] (608.16s)

that was equivalent to nearly 25,000 tons of TNT.

[10:12] (612.69s)

This was the culmination

[10:14] (614.34s)

of the top secret Manhattan Project,

[10:16] (616.95s)

a three-year long effort

[10:18] (618.33s)

by some of the smartest people alive,

[10:20] (620.82s)

including people like J. Robert Oppenheimer,

[10:23] (623.19s)

John von Neumann,

[10:24] (624.75s)

and a little known mathematician named Stanislaw Ulam.

[10:29] (629.82s)

- [Derek] Even after the war ended,

[10:31] (631.29s)

Ulam continued trying to figure out

[10:33] (633.06s)

how neutrons behave inside a nuclear bomb.

[10:35] (635.91s)

Now, a nuclear bomb works something like this.

[10:38] (638.55s)

Say you have a core of uranium-235,

[10:41] (641.43s)

then when a neutron hits a U-235 nucleus,

[10:44] (644.64s)

the nucleus splits releasing energy

[10:46] (646.89s)

and, crucially, two or three more neutrons.

[10:50] (650.37s)

If, on average, those new neutrons go on to hit

[10:52] (652.86s)

and split more than one other U-235 nucleus,

[10:56] (656.07s)

you get a runaway chain reaction,

[10:58] (658.38s)

so you have a nuclear bomb.

[11:00] (660.66s)

But uranium-235, the fissile fuel needed for the bombs

[11:03] (663.96s)

was really hard to get.

[11:05] (665.55s)

So one of the key questions was just how much of it

[11:08] (668.40s)

do you need to build a bomb?

[11:10] (670.32s)

And this is why Ulam wanted to understand

[11:12] (672.57s)

how the neutrons behave.

[11:15] (675.45s)

- [Casper] But then in January of 1946,

[11:18] (678.09s)

everything came to a halt.

[11:20] (680.95s)

Ulam was struck by a sudden and severe case of encephalitis,

[11:24] (684.72s)

an inflammation of the brain, that nearly killed him.

[11:28] (688.71s)

His recovery was long and slow,

[11:30] (690.69s)

with Ulam spending most of his time in beds.

[11:34] (694.02s)

To pass the time, he played a simple card game, Solitaire.

[11:38] (698.31s)

But as he played countless games,

[11:40] (700.20s)

winning some, losing others,

[11:42] (702.57s)

one question kept nagging at him,

[11:45] (705.03s)

what are the chances

[11:46] (706.08s)

that a randomly-shuffled game of Solitaire could be won?

[11:50] (710.16s)

It was a deceivingly difficult problem to solve.

[11:53] (713.58s)

Ulam played with all 52 cards

[11:55] (715.77s)

where each arrangement created a unique game,

[11:58] (718.41s)

so the total number of possible games was 52 factorial,

[12:02] (722.34s)

or about eight times 10 to 67.

[12:06] (726.18s)

So solving this analytically was hopeless.

[12:10] (730.20s)

But then Ulam had a flash of insight,

[12:12] (732.60s)

what if I just play hundreds of games

[12:14] (734.55s)

and count how many could be won?

[12:16] (736.83s)

That would give him

[12:17] (737.66s)

some sort of statistical approximation of the answer.

[12:21] (741.75s)

Back at Los Alamos, the remaining scientists

[12:23] (743.97s)

grappled with much harder problems than Solitaire,

[12:26] (746.85s)

like figuring out how neutrons behave inside a nuclear core.

[12:31] (751.41s)

In a nuclear core,

[12:32] (752.28s)

there are trillions and trillions of neutrons

[12:34] (754.47s)

all interacting with their surroundings.

[12:36] (756.81s)

So the number of possible outcomes is immense,

[12:39] (759.30s)

and computing it directly seemed impossible.

[12:42] (762.66s)

- [Derek] But when Ulam returned to work,

[12:44] (764.40s)

he had a sudden revelation.

[12:46] (766.71s)

What if we could simulate these systems

[12:48] (768.45s)

by generating lots of random outcomes

[12:50] (770.82s)

like I did with Solitaire?

[12:52] (772.86s)

He shared this idea with von Neumann,

[12:54] (774.78s)

who immediately recognized its power,

[12:57] (777.27s)

but also spotted a key problem.

[13:00] (780.39s)

- See, in Solitaire, each game is independent.

[13:03] (783.24s)

How the cards are dealt in one game

[13:05] (785.01s)

have no effect on the next, but neutrons aren't like that.

[13:09] (789.10s)

A neutron's behavior depends on where it is

[13:11] (791.52s)

and what it has done before.

[13:14] (794.52s)

So you couldn't just sample random outcomes

[13:17] (797.04s)

like in Solitaire.

[13:18] (798.33s)

Instead, you needed to model a whole chain of events

[13:21] (801.57s)

where each step influenced the next.

[13:24] (804.33s)

What von Neumann realized

[13:25] (805.92s)

is that you needed a Markov chain.

[13:28] (808.89s)

So they made one

[13:30] (810.15s)

and a much simplified version of it

[13:32] (812.04s)

works something like this.

[13:34] (814.11s)

Now, the starting state

[13:35] (815.22s)

is just a neutron traveling through the core,

[13:37] (817.74s)

and from there, three things can happen.

[13:39] (819.78s)

It can scatter off an atom and keep traveling,

[13:42] (822.81s)

so that gives you an arrow going back to itself.

[13:45] (825.69s)

It can leave the system

[13:46] (826.95s)

or get absorbed by a non-fissile material,

[13:49] (829.23s)

in which case it no longer takes part in the chain reaction,

[13:52] (832.47s)

and so it ends its Markov chain,

[13:55] (835.17s)

or it can strike another uranium-235 atom,

[13:58] (838.53s)

triggering a fission event

[14:00] (840.15s)

and releasing two or three more neutrons

[14:02] (842.49s)

that then start their own chains.

[14:05] (845.34s)

But in this chain,

[14:06] (846.51s)

the transition probabilities aren't fixed,

[14:08] (848.85s)

they depend on things like the neutron's position,

[14:11] (851.25s)

velocity and energy,

[14:12] (852.75s)

as well as the overall configuration and mass of uranium.

[14:16] (856.95s)

So a fast-moving neutron might have a 30% chance to scatter,

[14:20] (860.31s)

a 50% chance to be absorbed or leave,

[14:22] (862.59s)

and a 20% chance to cause fission.

[14:25] (865.08s)

But a slower-moving neutron

[14:26] (866.49s)

would have different probabilities.

[14:29] (869.58s)

Next, they ran this chain

[14:31] (871.08s)

on the world's first electronic computer, the ENIAC.

[14:34] (874.59s)

The computer started

[14:35] (875.52s)

by randomly generating a neutron starting conditions

[14:38] (878.16s)

and stepped through the chain

[14:39] (879.30s)

to keep track of how many neutrons

[14:40] (880.95s)

were produced on average per run,

[14:43] (883.50s)

known as the multiplication factor k.

[14:46] (886.41s)

So if, on average, one neutron

[14:48] (888.42s)

produces another two neutrons, then k is equal to two.

[14:52] (892.38s)

And if on average every two neutrons produce three neutrons,

[14:55] (895.95s)

then k is equal to three over two, and so on.

[14:59] (899.40s)

Then, after stepping through the full chain

[15:01] (901.23s)

for a specified number of steps,

[15:03] (903.12s)

we collect the average k-value

[15:04] (904.89s)

and record that number in a histogram.

[15:07] (907.50s)

This process was then repeated hundreds of times,

[15:10] (910.11s)

and the results tallied up,

[15:11] (911.79s)

giving you a statistical distribution of the outcome.

[15:15] (915.28s)

If you find that in most cases, k is less than one,

[15:18] (918.39s)

the reaction dies down.

[15:19] (919.89s)

If it's equal to one,

[15:21] (921.27s)

there's a self-sustaining chain reaction,

[15:23] (923.40s)

but it doesn't grow.

[15:24] (924.84s)

And if k is larger than one,

[15:26] (926.40s)

the reaction grows exponentially

[15:28] (928.32s)

and you've got a bomb.

[15:31] (931.29s)

- With it, von Neumann and Ulam had a statistical way

[15:34] (934.14s)

to figure out how many neutrons were produced

[15:36] (936.69s)

without having to do any exact calculations.

[15:39] (939.81s)

In other words,

[15:40] (940.64s)

they could approximate differential equations

[15:42] (942.60s)

that were too hard to solve analytically.

[15:45] (945.66s)

All that was needed was a name for the new method.

[15:48] (948.96s)

Now, Ulam's uncle was a gambler,

[15:51] (951.09s)

and the random sampling

[15:52] (952.20s)

and high stakes reminded Ulam

[15:54] (954.18s)

of the Monte Carlo Casino in Monaco, and the name stuck.

[15:58] (958.23s)

The Monte Carlo method was born.

[16:01] (961.65s)

The method was so successful

[16:03] (963.51s)

that it didn't stay secret for long.

[16:05] (965.94s)

By the end of 1948,

[16:07] (967.53s)

scientists at another lab, Argonne, in Chicago,

[16:10] (970.47s)

used it to study nuclear reactor designs,

[16:13] (973.14s)

and from there, the idea spread quickly.

[16:16] (976.43s)

Ulam later remarked,

[16:18] (978.19s)

"It is still an unending source of surprise for me

[16:21] (981.06s)

to see how a few scribbles on a blackboard

[16:23] (983.40s)

could change the course of human affairs."

[16:27] (987.18s)

And it wouldn't be the last time

[16:28] (988.77s)

Markov chain based method

[16:30] (990.09s)

changed the course of human affairs.

[16:32] (992.68s)

(upbeat music)

[16:35] (995.88s)

- In 1993, the internet was open to the public,

[16:39] (999.09s)

and soon it exploded.

[16:41] (1001.28s)

By the mid-1990s, thousands of new pages appeared every day,

[16:44] (1004.88s)

and that number was only growing.

[16:48] (1008.00s)

This created a new kind of problem.

[16:50] (1010.76s)

I mean, how do you find anything

[16:52] (1012.50s)

in this ever-expending sea of information?

[16:55] (1015.65s)

In 1994, two Stanford PhD students,

[16:58] (1018.71s)

Jerry Yang and David Filo,

[17:00] (1020.60s)

founded the search engine Yahoo,

[17:02] (1022.31s)

to address this issue, but they needed money.

[17:06] (1026.45s)

So a year later, they arranged to meet

[17:08] (1028.70s)

with Japanese billionaire, Masayoshi Son,

[17:11] (1031.46s)

also known as the Bill Gates of Japan.

[17:13] (1033.98s)

(gong clangs)

[17:15] (1035.18s)

- They were looking to raise $5 million

[17:17] (1037.37s)

for their next startup, but Son has other plans.

[17:22] (1042.17s)

He offers to invest a full $100 million instead.

[17:26] (1046.01s)

That's 20 times more than what the founders asked for.

[17:29] (1049.43s)

So Jerry Yang declines saying, "We don't need that much,"

[17:33] (1053.12s)

but Son disagrees,

[17:35] (1055.20s)

"Jerry, everyone needs $100 million."

[17:38] (1058.42s)

(Son laughs)

[17:40] (1060.14s)

Before the founders get a chance to respond,

[17:42] (1062.39s)

Son jumps in again and asks,

[17:44] (1064.11s)

"Who are your biggest competitors?"

[17:46] (1066.03s)

"Excite and lycos," the pair respond.

[17:49] (1069.32s)

Son orders his associate to write those names down.

[17:51] (1071.93s)

And then he says, "If you don't let me invest in Yahoo,

[17:54] (1074.75s)

I will invest in one of them and I'll kill you."

[17:59] (1079.16s)

See, Son had realized something.

[18:01] (1081.38s)

None of the leading search engines at the time

[18:03] (1083.63s)

had any superior technology.

[18:05] (1085.67s)

They didn't have a technological advantage over the others.

[18:09] (1089.18s)

They all just ranked pages

[18:10] (1090.89s)

by how often a search term appears on a given page.

[18:14] (1094.31s)

So the battle for the number one search engine

[18:16] (1096.65s)

would be decided by who could attract the most users,

[18:19] (1099.59s)

who could spend the most on marketing.

[18:21] (1101.84s)

- [Announcer 1] Lycos, go get it.

[18:23] (1103.70s)

- [Announcer 2] Get Lycos, or get lost.

[18:26] (1106.46s)

- [Announcer 3] This is revolution.

[18:28] (1108.62s)

(upbeat funky music)

[18:31] (1111.42s)

♪ Yahoo ♪

[18:33] (1113.60s)

- [Derek] And marketing required a lot of money,

[18:35] (1115.70s)

money that Son had, so he could decide who won the war.

[18:40] (1120.65s)

Yahoo's founders realized they were left with no real choice

[18:43] (1123.50s)

but to accept Son's investment.

[18:46] (1126.20s)

- [Speaker] So here we are, right in the middle of Yahoo.

[18:48] (1128.54s)

- [Derek] And within four years,

[18:49] (1129.77s)

Yahoo became the most popular site on the planet.

[18:52] (1132.95s)

- [Reporter] In the time it takes to say this sentence,

[18:55] (1135.50s)

Yahoo will answer 79,000 information requests worldwide,

[19:00] (1140.51s)

the two men are now worth $120 million each.

[19:04] (1144.72s)

♪ Yahoo ♪

[19:07] (1147.71s)

- But Yahoo had a critical weakness.

[19:11] (1151.64s)

See, Yahoo's keyword search was easy to trick.

[19:14] (1154.58s)

To get your page ranked highly,

[19:15] (1155.99s)

you could just repeat keywords hundreds of times,

[19:18] (1158.60s)

hidden with white text on a white background.

[19:21] (1161.99s)

- One thing they didn't have in those early days

[19:25] (1165.08s)

was a notion of quality of the result.

[19:28] (1168.23s)

So they had a notion of relevance saying,

[19:31] (1171.35s)

does this document talk about the thing

[19:33] (1173.81s)

that you're interested in?

[19:35] (1175.61s)

But there wasn't really a notion of which ones are better.

[19:39] (1179.18s)

- What they really needed was a way to rank pages

[19:41] (1181.55s)

by both relevance and quality.

[19:44] (1184.13s)

But how do you measure the quality of a webpage?

[19:46] (1186.77s)

Well, to understand that,

[19:48] (1188.09s)

we need to borrow an idea from libraries.

[19:50] (1190.49s)

- So I'm old enough that library books

[19:53] (1193.22s)

used to have a paper card in it

[19:55] (1195.26s)

that was a stamp of all the due dates

[19:57] (1197.24s)

of when it was due back.

[19:58] (1198.50s)

You took a book and if it had a lot of those,

[20:00] (1200.21s)

you said, "Oh, this is probably a good book."

[20:01] (1201.86s)

And if it didn't have any, you said,

[20:03] (1203.88s)

"Well, maybe this isn't the best book."

[20:06] (1206.15s)

- Stamps acted like endorsements.

[20:07] (1207.89s)

The more stamps, the better the book must be.

[20:09] (1209.99s)

And the same idea can be applied to the web.

[20:12] (1212.84s)

Over at Stanford, two PhD students,

[20:15] (1215.21s)

Sergey Brin and Larry Page,

[20:16] (1216.98s)

were working on this exact problem.

[20:19] (1219.32s)

Brin and Page realized that each link to a page

[20:22] (1222.47s)

can be thought of as an endorsement.

[20:24] (1224.63s)

And the more links a page sends out,

[20:26] (1226.52s)

the less valuable each vote becomes.

[20:29] (1229.73s)

So what they realized is that we can model the web

[20:32] (1232.31s)

as a Markov chain.

[20:34] (1234.68s)

- [Derek] To see how this works,

[20:36] (1236.12s)

imagine a toy internet with just four webpages.

[20:39] (1239.30s)

Call them Amy, Ben, Chris, and Dan.

[20:42] (1242.57s)

These are our states.

[20:44] (1244.31s)

Typically, one webpage links to others,

[20:46] (1246.56s)

allowing you to move between them.

[20:48] (1248.33s)

These are our transitions.

[20:50] (1250.49s)

In this setup, Amy only links to Ben,

[20:52] (1252.86s)

so there's a 100% chance of going from Amy to Ben.

[20:56] (1256.79s)

Ben links to Amy, Chris, and Dan,

[20:58] (1258.98s)

so there's a 33% chance of going to any of those pages,

[21:02] (1262.67s)

and we can fill out the other transition probabilities

[21:05] (1265.01s)

in the same way.

[21:07] (1267.05s)

So now we can run this Markov chain and see what happens.

[21:10] (1270.89s)

Imagine you're a surfer on this web.

[21:13] (1273.11s)

You start on a random page, say, Amy,

[21:15] (1275.63s)

and you keep running the machine

[21:17] (1277.52s)

and keep track of the percentage of time

[21:19] (1279.29s)

you spend on each page.

[21:21] (1281.54s)

Over time, the ratio settles

[21:23] (1283.61s)

and the scores give us some measure

[21:25] (1285.38s)

of the relative importance of these pages.

[21:28] (1288.20s)

You spend the most time on Ben,

[21:29] (1289.64s)

so Ben is ranked first, followed by Amy, then Dan,

[21:32] (1292.91s)

and lastly Chris.

[21:34] (1294.98s)

It might seem like there's an easy way to beat the system,

[21:37] (1297.71s)

just make 100 pages all linking to your website.

[21:40] (1300.74s)

Now you get 100 full votes

[21:42] (1302.48s)

and you'll always rank on top, but that is not the case.

[21:47] (1307.22s)

While during their first few steps,

[21:48] (1308.87s)

they might make your page seem important,

[21:50] (1310.97s)

none of the other websites link to them.

[21:53] (1313.07s)

So over many steps, their contributions don't matter.

[21:57] (1317.15s)

You might have many links,

[21:58] (1318.53s)

but they're not quality links,

[22:00] (1320.06s)

so they don't affect the algorithm.

[22:03] (1323.78s)

- But there is still one problem, though,

[22:05] (1325.82s)

not all pages are connected.

[22:07] (1327.71s)

In networks like this one,

[22:09] (1329.21s)

a random server can get stuck in a loop,

[22:11] (1331.40s)

never reaching the rest of the web.

[22:13] (1333.62s)

So to fix this,

[22:15] (1335.03s)

we can set a rule that 85% of the time,

[22:17] (1337.94s)

our random server just follows a link like normal.

[22:20] (1340.85s)

But then for about 15% of the time,

[22:22] (1342.95s)

they just jump to a page at random.

[22:25] (1345.65s)

This damping factor makes sure

[22:27] (1347.33s)

that we explore all possible parts of the web

[22:29] (1349.64s)

without ever getting stuck.

[22:32] (1352.40s)

By using Markov chains,

[22:34] (1354.02s)

Page and Brin had built a better search engine,

[22:36] (1356.66s)

and they called it PageRank.

[22:38] (1358.55s)

- Because it's talking about how pages react,

[22:42] (1362.42s)

webpages react with each other

[22:43] (1363.77s)

and also 'cause the founder's name is Larry Page,

[22:46] (1366.53s)

so he snuck that in.

[22:48] (1368.24s)

- With PageRank,

[22:49] (1369.14s)

Google got much better search results,

[22:51] (1371.36s)

often getting you to the site

[22:52] (1372.74s)

you were looking for in one go.

[22:54] (1374.93s)

Although, to some, this sounded like a terrible idea.

[22:58] (1378.15s)

- Others said, "Oh, well you're telling me you get a search

[23:00] (1380.87s)

that will get the right result on the first answer?

[23:04] (1384.17s)

Well, I don't want that

[23:05] (1385.01s)

because if it takes them three or four chances,

[23:08] (1388.91s)

searches to get the right answer,

[23:10] (1390.50s)

then I have three or four chances to show ads,

[23:13] (1393.26s)

and if you get 'em the answer right away,

[23:15] (1395.27s)

I'm just gonna lose them.

[23:16] (1396.29s)

So, you know, I don't see why better search is better."

[23:20] (1400.46s)

- But Page and Brin disagreed.

[23:22] (1402.20s)

They were convinced that if their product was far superior,

[23:24] (1404.96s)

then people would flock to it.

[23:26] (1406.73s)

- I would say it actually is a democracy that works.

[23:30] (1410.42s)

If all pages were equal,

[23:32] (1412.20s)

anybody can manufacture as many pages as they want.

[23:35] (1415.13s)

I can set up a billion pages in my server tomorrow.

[23:38] (1418.40s)

We shouldn't treat them all as equal.

[23:40] (1420.20s)

Just looking at the data out of curiosity,

[23:42] (1422.99s)

we found that we had technology

[23:44] (1424.28s)

to do a better job of search,

[23:45] (1425.78s)

and we realized how impactful having great search can be.

[23:49] (1429.71s)

- And so, in 1998, they launched their new search engine

[23:53] (1433.04s)

to take on Yahoo.

[23:54] (1434.51s)

Initially, they called it BackRub,

[23:56] (1436.37s)

after the backlinks it analyzed,

[23:58] (1438.65s)

but then they realized

[23:59] (1439.48s)

that maybe that's not the most attractive name.

[24:02] (1442.25s)

Now, their ambitions were big

[24:03] (1443.81s)

to essentially index all the pages on the internet,

[24:06] (1446.93s)

and they needed a name equally as big.

[24:09] (1449.42s)

So they thought of the largest number they could think of,

[24:12] (1452.45s)

10 to the power of 100, a googol.

[24:15] (1455.51s)

But then when trying to register their domain,

[24:17] (1457.73s)

they accidentally misspelled it.

[24:19] (1459.32s)

And so, Google was born.

[24:21] (1461.84s)

(dramatic music)

[24:26] (1466.65s)

Over the next four years,

[24:27] (1467.99s)

Google overthrew Yahoo

[24:29] (1469.49s)

to become the most used search engine.

[24:31] (1471.53s)

- Everyone who knows the internet

[24:32] (1472.97s)

almost certainly knows Google.

[24:34] (1474.83s)

- Googling is like oxygen to teenagers.

[24:37] (1477.89s)

- [Casper] And today, Alphabet,

[24:39] (1479.27s)

which is Google's parent company,

[24:40] (1480.89s)

is worth around $2 trillion.

[24:43] (1483.29s)

- When Google makes even the slightest change

[24:45] (1485.81s)

in its algorithms, it can have huge effects.

[24:48] (1488.41s)

- Google. - Google.

[24:49] (1489.41s)

- Google. - Google.

[24:51] (1491.78s)

- They're on fire.

[24:52] (1492.74s)

And the reason why they're on fire

[24:53] (1493.97s)

is because they're focused

[24:55] (1495.08s)

and they're more focused than Yahoo who does search,

[24:57] (1497.06s)

they're more focused than Microsoft

[24:58] (1498.38s)

who does search with Bing.

[24:59] (1499.58s)

Yahoo has lots of traffic, they always have,

[25:01] (1501.59s)

they have some really great properties,

[25:03] (1503.24s)

but I don't think Yahoo is the go-to place, you know.

[25:06] (1506.48s)

- And at the heart of this trillion dollar algorithm

[25:09] (1509.33s)

is a Markov chain, which only looks at the current state

[25:12] (1512.90s)

to predict what's going to happen next.

[25:15] (1515.99s)

But in the 1940s, Claude Shannon,

[25:18] (1518.78s)

the father of information theory,

[25:20] (1520.64s)

started asking a different question.

[25:23] (1523.46s)

He went back to Markov's original idea of predicting text,

[25:26] (1526.85s)

but instead of just using vowels and consonants,

[25:29] (1529.25s)

he focused on individual letters.

[25:31] (1531.89s)

And he wondered,

[25:32] (1532.79s)

what if instead of looking

[25:34] (1534.02s)

at only the last letter as a predictor,

[25:36] (1536.06s)

I look at the last two?

[25:37] (1537.95s)

Well, with that, he got text that looked like this.

[25:41] (1541.13s)

Now, it doesn't make much sense,

[25:42] (1542.45s)

but there are some recognizable words

[25:44] (1544.43s)

like "whey", "of", and "the".

[25:47] (1547.70s)

But Shannon was convinced he could do better.

[25:49] (1549.89s)

So next, instead of looking at letters,

[25:52] (1552.05s)

he wondered, what if I use entire words as predictors?

[25:55] (1555.74s)

That gave him sentences like this,

[25:57] (1557.88s)

"The head and in frontal attack on an English writer

[26:00] (1560.78s)

that the character of this point

[26:02] (1562.28s)

is therefore another method for the letters

[26:04] (1564.68s)

that the time of who ever told the problem

[26:07] (1567.32s)

for an unexpected."

[26:09] (1569.63s)

Now, clearly, this doesn't make any sense,

[26:12] (1572.27s)

but Shannon did notice that sequences of four words or so

[26:15] (1575.48s)

generally did make sense.

[26:17] (1577.22s)

For instance, "attack on an English writer"

[26:19] (1579.86s)

kind of makes sense.

[26:21] (1581.45s)

So Shannon learned that you can make better

[26:23] (1583.40s)

and better predictions

[26:24] (1584.48s)

about what the next word is going to be

[26:26] (1586.43s)

by taking into account more and more of the previous words.

[26:30] (1590.60s)

It's kind of like what Gmail does

[26:32] (1592.31s)

when it predicts what you're going to type next.

[26:35] (1595.49s)

And this is no coincidence,

[26:37] (1597.29s)

the algorithms that make these predictions

[26:39] (1599.45s)

are based on Markov chains.

[26:41] (1601.52s)

- They're not necessarily using letters, you know,

[26:44] (1604.43s)

-Yeah

[26:45] (1605.30s)

they use what they call tokens,

[26:47] (1607.31s)

some of which are letters, some of which are words,

[26:50] (1610.34s)

marks of punctuation, whatever.

[26:52] (1612.68s)

So it's a bigger set than just the alphabet.

[26:56] (1616.19s)

The game is simply, we have this string of tokens

[27:00] (1620.39s)

that, you know, might be 30 long,

[27:04] (1624.17s)

and we're asking what are the odds

[27:06] (1626.63s)

that the next token is this or this or this?

[27:09] (1629.45s)

- [Derek] But today's large language models

[27:10] (1630.83s)

don't treat all those tokens equally,

[27:13] (1633.23s)

because unlike simple Markov chains,

[27:15] (1635.03s)

they also use something called attention,

[27:17] (1637.07s)

which tells the model what to pay attention to.

[27:20] (1640.04s)

So in the phrase, "the structure of the cell,"

[27:22] (1642.50s)

the model can use previous context

[27:24] (1644.57s)

like blood and mitochondria

[27:25] (1645.95s)

to know the cell most likely refers

[27:28] (1648.23s)

to biology rather than a prison cell.

[27:30] (1650.45s)

And it uses that to tune its prediction.

[27:33] (1653.75s)

But as large language models become more widespread,

[27:36] (1656.39s)

one concern is that the text they produce

[27:38] (1658.43s)

ends up on the internet

[27:39] (1659.57s)

and that becomes training data for future models.

[27:43] (1663.86s)

- When you start doing that,

[27:47] (1667.07s)

the game is very soon over.

[27:48] (1668.99s)

You come, in this case, to us, a very dull, stable state,

[27:52] (1672.53s)

it just says the same thing

[27:53] (1673.67s)

over and over and over again forever.

[27:55] (1675.47s)

The language models are vulnerable to this process.

[27:59] (1679.43s)

- And any system like this where we have a feedback loop,

[28:02] (1682.28s)

will become hard to model using Markov chains.

[28:05] (1685.76s)

Take global warming, for instance,

[28:07] (1687.38s)

as we increase the amount of carbon dioxide in the air,

[28:09] (1689.69s)

the average temperature of the Earth increases.

[28:12] (1692.33s)

But as the temperature increases,

[28:13] (1693.68s)

the atmosphere can hold more water vapor,

[28:15] (1695.81s)

which is an incredibly powerful greenhouse gas.

[28:18] (1698.39s)

And with more water vapor,

[28:19] (1699.80s)

the temperature increases further

[28:21] (1701.36s)

allowing for even more water vapor.

[28:23] (1703.37s)

So you get this positive feedback loop,

[28:25] (1705.35s)

which makes it hard to predict what's going to happen next.

[28:28] (1708.41s)

So there are some systems where Markov chains don't work,

[28:31] (1711.74s)

but for many other dependent systems,

[28:33] (1713.54s)

they offer a way of doing probability.

[28:36] (1716.51s)

- But what's fascinating

[28:37] (1717.71s)

is that all these systems have extremely long histories.

[28:41] (1721.25s)

I mean, you could trace back all the letters in a text,

[28:43] (1723.89s)

trace back all the interactions of what a neutron did,

[28:46] (1726.47s)

or trace back the weather for weeks.

[28:49] (1729.11s)

But the beautiful thing Markov and others found

[28:51] (1731.60s)

is that for many of these systems

[28:53] (1733.46s)

you can ignore almost all of that.

[28:55] (1735.74s)

You can just look at the current state

[28:57] (1737.96s)

and forget about the rest,

[29:00] (1740.24s)

that makes these systems memoryless.

[29:02] (1742.94s)

And it's this memoryless property

[29:05] (1745.34s)

that makes Markov chains so powerful

[29:07] (1747.83s)

because it's what allows you

[29:09] (1749.09s)

to take these extremely complex systems

[29:11] (1751.58s)

and simplify them a lot

[29:13] (1753.35s)

to still make meaningful predictions.

[29:16] (1756.23s)

- [Derek] As one paper put it, "Problem-solving

[29:18] (1758.06s)

is often a matter of cooking up

[29:19] (1759.71s)

an appropriate Markov chain."

[29:22] (1762.23s)

- It's kind of ridiculous to me

[29:23] (1763.97s)

that this basic fact of mathematics

[29:26] (1766.64s)

would come out of a fight like that,

[29:28] (1768.62s)

which, you know, really had nothing to do with it.

[29:31] (1771.23s)

But all the evidence suggests

[29:33] (1773.57s)

that it really was this determination

[29:36] (1776.69s)

to show up Nekrasov that led Markov to do the work.

[29:41] (1781.52s)

- But there's one question we still haven't answered.

[29:44] (1784.64s)

When playing Solitaire,

[29:46] (1786.23s)

how did Ulam know his cards were perfectly shuffled?

[29:49] (1789.44s)

I mean, how many shuffles does it take

[29:51] (1791.99s)

to get a completely random arrangement of cards?

[29:56] (1796.10s)

- If you have a deck of cards,

[29:57] (1797.81s)

you need to shuffle it, right?

[30:00] (1800.06s)

How often, if you're shuffling, like, you know,

[30:02] (1802.10s)

you split it in half, and then you do the

[30:04] (1804.11s)

(cards riffling).

[30:05] (1805.52s)

How often do you have to shuffle it

[30:07] (1807.59s)

to make it completely random?

[30:09] (1809.51s)

- Two. - Two?

[30:11] (1811.16s)

I'm going with 26.

[30:12] (1812.24s)

- Yeah, four times. - Four times?

[30:13] (1813.86s)

- I don't know, 52 times?

[30:15] (1815.84s)

- Okay. Okay.

[30:16] (1816.67s)

It's not a bad guess.

[30:18] (1818.22s)

- Seven?

[30:19] (1819.24s)

- It is seven.

[30:21] (1821.03s)

- Really? - Yeah.

[30:22] (1822.05s)

So you can think of card shuffling as a Markov chain

[30:24] (1824.42s)

where each deck arrangement is a state,

[30:26] (1826.94s)

and then each shuffle is a step.

[30:28] (1828.80s)

And so for a deck of 52 cards,

[30:30] (1830.44s)

if you riffle shuffle it seven times,

[30:32] (1832.79s)

then every arrangement of the deck is about equally likely,

[30:35] (1835.67s)

so it's basically random.

[30:38] (1838.88s)

But I can't shuffle like that.

[30:40] (1840.59s)

So for me, what I do is I do it like this.

[30:43] (1843.77s)

How many times do you think you have to shuffle like this

[30:45] (1845.81s)

to get it random?

[30:47] (1847.27s)

(beeping)

[30:48] (1848.30s)

- What do you think?

[30:49] (1849.65s)

And perhaps more importantly,

[30:51] (1851.39s)

how would you go about working it out?

[30:54] (1854.09s)

Well, that's where today's sponsor Brilliant comes in.

[30:56] (1856.58s)

Brilliant is a learning app

[30:57] (1857.90s)

that gets you hands-on with problems just like this.

[31:01] (1861.29s)

Whether it's math, physics, programming,

[31:03] (1863.69s)

or even AI, Brilliant's interactive lessons and challenges

[31:07] (1867.14s)

let you play your way to a sharper mind.

[31:09] (1869.90s)

You can discover how large language models actually work

[31:12] (1872.93s)

from basic Markov chains to complex neural networks,

[31:16] (1876.20s)

or dig into the math behind this shuffling question.

[31:19] (1879.65s)

It's a fun way to build knowledge and skills

[31:21] (1881.69s)

that help you solve all kinds of problems,

[31:24] (1884.15s)

which brings us back to our shuffle.

[31:26] (1886.52s)

So Casper, what actually is the answer?

[31:29] (1889.58s)

- It's actually over 2,000.

[31:31] (1891.50s)

- What? - Over-

[31:32] (1892.33s)

- Crazy, right? - Yeah.

[31:33] (1893.42s)

- So the next time someone offers to shuffle before a game,

[31:36] (1896.09s)

make sure they're doing it right,

[31:37] (1897.59s)

seven riffles or it doesn't count.

[31:40] (1900.17s)

But the interesting part

[31:41] (1901.25s)

isn't just knowing that, it's understanding why

[31:43] (1903.95s)

and seeing how a simple question

[31:45] (1905.45s)

can lead you to some surprisingly complex mathematics.

[31:49] (1909.23s)

And that's what Brilliant is all about.

[31:52] (1912.03s)

So to try everything Brilliant has to offer for free

[31:55] (1915.29s)

for a full 30 days,

[31:56] (1916.49s)

visit brilliant.org/veritasium,

[31:58] (1918.89s)

click that link in the description

[32:00] (1920.39s)

or scan this handy QR code.

[32:02] (1922.79s)

And if you sign up,

[32:03] (1923.69s)

you'll also get 20% off their annual premium subscription.

[32:07] (1927.01s)

So I wanna thank Brilliant for sponsoring this video

[32:09] (1929.33s)

and I wanna thank you for watching.

[32:11] (1931.63s)

(upbeat music)

[32:21] (1941.15s)

- Easy.

[32:22] (1942.10s)

(light music)

YouTube Deep Summary

The Strange Math That Predicts (Almost) Anything

📚 Chapter Summaries (10)

🤖 AI-Generated Summary:

Summary History

The Great Russian Math Feud: Free Will vs. Probability

From Poetry to Nuclear Physics: The Monte Carlo Method

Markov Chains and the Rise of Google

Beyond Search: Markov Chains in Language and AI

The Memoryless Magic of Markov Chains

How Many Shuffles to Randomize a Deck?

Conclusion

📝 Transcript Chapters (10 chapters):

📝 Transcript (783 entries):