YouTube Deep SummaryYouTube Deep Summary

Star Extract content that makes a tangible impact on your life

Video thumbnail

The Strange Math That Predicts (Almost) Anything

Veritasium • 32:32 minutes • YouTube

📚 Chapter Summaries (10)

🤖 AI-Generated Summary:

From Russian Math Feuds to Google Search: The Surprising Power of Markov Chains

Have you ever wondered how many times you need to shuffle a deck of cards to truly randomize it? Or how scientists figured out the amount of uranium needed to build a nuclear bomb? Or even how Google knows exactly which page you’re searching for? The answers to these seemingly unrelated questions all trace back to a fascinating mathematical breakthrough that began over a century ago in Russia—a story of rivalry, insight, and innovation centered around something called Markov chains.

The Great Russian Math Feud: Free Will vs. Probability

In 1905, Russia was deeply divided politically between the Tsarists, who wanted to maintain the status quo, and socialists pushing for reform. This rift extended even into the world of mathematics, where two prominent mathematicians took opposing sides:

  • Pavel Nekrasov (“the Tsar of Probability”), a religious man who believed mathematics could explain free will and divine will.
  • Andrey Markov (“Andrey the Furious”), an atheist who argued that math had nothing to do with free will and criticized Nekrasov's approach as unrigorous.

Their debate centered on the law of large numbers, a principle stating that as you repeat independent trials (like flipping a fair coin many times), the average outcome converges to the expected probability (roughly 50/50 for heads or tails). Nekrasov believed that seeing this convergence implied the events must be independent—essentially, acts of free will.

Markov challenged this by showing that even dependent events—where one outcome influences the next—can still exhibit this convergence. To prove it, he analyzed the sequence of letters in a famous Russian poem, Eugene Onegin. Letters in text are clearly dependent on preceding letters (e.g., vowels and consonants tend to follow certain patterns), yet Markov demonstrated that these dependent sequences still obeyed the law of large numbers.

This led to the birth of Markov chains, mathematical models that describe systems where the next state depends only on the current state, not the full history. Markov’s conclusion was clear and powerful: free will is not necessary to do probability.

From Poetry to Nuclear Physics: The Monte Carlo Method

Fast forward to World War II, when scientists faced a daunting problem: understanding how neutrons behave inside a nuclear bomb. The complex interactions of trillions of neutrons were impossible to calculate exactly.

Enter Stanislaw Ulam, a mathematician who, while recovering from illness, pondered a simpler question: What are the chances of winning a random game of Solitaire? Realizing an analytical solution was impossible, he proposed playing many games and counting wins to approximate the probability—a method of statistical sampling.

Inspired by this idea, Ulam and John von Neumann applied the concept to simulate neutron behavior by generating many random chains of neutron interactions, using Markov chains to handle the dependencies between steps. Running these simulations on early computers like the ENIAC allowed them to estimate whether a chain reaction would sustain or explode—a critical insight for nuclear weapon design.

They named this approach the Monte Carlo method, after the famous casino, highlighting the role of randomness and chance in their computations. This method revolutionized physics and engineering, enabling scientists to tackle otherwise intractable problems through probabilistic simulation.

Markov Chains and the Rise of Google

Jumping ahead to the explosion of the internet in the 1990s, the challenge shifted to finding relevant information amid millions of webpages. Early search engines like Yahoo ranked pages by keyword frequency, but this was easily manipulated and provided poor quality results.

Two Stanford PhD students, Sergey Brin and Larry Page, had a breakthrough: they realized that links between webpages act like endorsements. Just like library books that are popular have many checkout stamps, webpages linked by many others are more important.

They modeled the web as a Markov chain, where each webpage is a state, and links represent transitions. By simulating a “random surfer” clicking links, they calculated the steady-state probability of being on each page, giving a ranking called PageRank. Importantly, their algorithm discounted low-quality link farms, preventing easy manipulation.

To avoid the surfer getting stuck in loops, they introduced a damping factor allowing random jumps, ensuring the model explored the entire web. This innovation powered Google, launched in 1998, which rapidly surpassed competitors by delivering far better search results.

Beyond Search: Markov Chains in Language and AI

Markov’s original idea of predicting letters depending only on the previous letter evolved over time. Claude Shannon, the father of information theory, expanded the concept to sequences of words, showing that longer histories improved prediction quality.

Modern language models, like those behind Gmail’s autocomplete or advanced AI chatbots, build on these principles but add mechanisms like attention to weigh the importance of different parts of the text context, enabling nuanced understanding and generation of human language.

The Memoryless Magic of Markov Chains

What makes Markov chains so powerful is their memoryless property: the future depends only on the present state, not the entire past. This simplification allows incredibly complex systems—from neutron physics to web surfing to language processing—to be modeled and understood.

While not all systems fit this model perfectly (feedback loops in climate change, for example, complicate predictions), Markov chains have become a cornerstone of probability and applied mathematics.

How Many Shuffles to Randomize a Deck?

Returning to the original question: how many shuffles does it take to randomize a deck of 52 cards? Surprisingly, the answer is just seven riffle shuffles to achieve near-perfect randomness.

This can be understood by modeling each shuffle as a step in a Markov chain, where each arrangement of the deck is a state. After seven shuffles, the distribution of deck states is approximately uniform.

Conclusion

From a century-old feud between Russian mathematicians to the heart of nuclear physics, internet search, and AI language models, Markov chains have quietly transformed our world. They reveal how dependent, complex systems can still be understood through probability, and how simple mathematical insights can lead to revolutionary technologies.

So next time you shuffle cards, search the web, or use predictive text, remember the fascinating chain of ideas—pun intended—that made it all possible.


Want to dive deeper into these concepts?
Check out Brilliant.org for interactive lessons on math, probability, AI, and more. From understanding Markov chains to exploring the math behind shuffling cards, Brilliant makes learning engaging and fun. Visit brilliant.org/veritasium for a free 30-day trial and 20% off annual premium subscriptions.


References: Inspired by the work of Andrey Markov, Stanislaw Ulam, John von Neumann, Sergey Brin, Larry Page, Claude Shannon, and many others who have shaped the mathematical tools we rely on every day.


📝 Transcript Chapters (10 chapters):

📝 Transcript (783 entries):

- How many times do you need to shuffle a deck of cards to make them truly random? How much uranium does it take to build a nuclear bomb? (explosion booming) How can you predict the next word in a sentence? And how does Google know which page you're actually searching for? Well, the reason we know the answer to all of these questions is because of a strange math feud in Russia that took place over 100 years ago. In 1905, socialist groups all across Russia rose up against the Tsar, the ruler of the empire. They demanded a complete political reform, or failing that, that he stepped down from power entirely. - This divided the nation into two. So on one side you got the Tsarists, right? They wanted to defend the status quo and keep the Tsar in power. But then on the other side, you had the socialists who wanted this complete political reform. And this division was so bad that it crept into every part of society to the point where even mathematicians started picking sides. - [Derek] On the side of the Tsar was Pavel Nekrasov, unofficially called the Tsar of Probability. Nekrasov was a deeply religious and powerful man, and he used his status to argue that math could be used to explain free will and the will of God. - His intellectual nemesis on the socialist side was Andrey Markov, also known as Andrey The Furious. Markov was an atheist and he had no patience for people who were being unrigorous, something he considered Nekrasov to be, because in his eyes, math had nothing to do with free will or religion. So he publicly criticized Nekrasov's work, listing it among "the abuses of mathematics." Their feud centered on the main idea people had used to do probability for the last 200 years. And we can illustrate this with a simple coin flip. When I flip the coin 10 times, I get six times heads and four times tails, which is obviously not the 50/50 you'd expect. But if I keep flipping the coin, then at first the ratio jumps all over the place. But after a large number of flips, we see that it slowly settles down and approaches 50/50. And in this case, after 100 flips, we end up on 51 heads and 49 tails, which is almost exactly what you would expect. This behavior that the average outcome gets closer and closer to the expected value as you run more and more independent trials is known as the law of large numbers. It was first proven by Jacob Bernoulli in 1713, and it was the key concept at the heart of probability theory right up until Markov and Nekrasov. But Bernoulli only proved that it worked for independent events like a fair coin flip, or when you ask people to guess how much they think an item is worth, where one event doesn't influence the others. But now imagine that instead of asking each person to submit their guess individually, you ask people to shout out their answer in public. Well, in this case, the first person might think it's an extraordinarily valuable item, and say it's worth around $2,000, but now all the other people in the room are influenced by this value, and so, their guesses have become dependent. And now the average doesn't converge to the true value, but instead it clusters around a higher amount. - And so, for 200 years, probability had relied on this key assumption, that you need independence to observe the law of large numbers. And this was the idea that sparked Nekrasov and Markov's feud. See, Nekrasov agreed with Bernoulli that you need independence to get the law of large numbers. But he took it one step further. He said, if you see the law of large numbers, you can infer that the underlying events must be independent. - Take this table of Belgian marriages from 1841 to 1845. Now you see that every year the average is about 29,000. And so, it seems like the values converge and therefore that they follow the law of large numbers. And when Nekrasov looked at other social statistics like crime rates and birth rates, he noticed a similar pattern. But now think about where all this data is coming from. It's coming from decisions to get married, decisions to commit crimes, and decisions to have babies, at least for the most part. So Nekrasov reasoned that because these statistics followed the law of large numbers, the decisions causing them must be independent. In other words, he argued that they must be acts of free will. So to him, free will wasn't just something philosophical, it was something you could measure. It was scientific. - [Derek] But to Markov, Nekrasov was delusional. He thought it was absurd to link mathematical independence to free will. So Markov set out to prove that dependent events could also follow the law of large numbers, and that you can still do probability with dependent events. - To do this, he needed something where one event clearly depended on what came before, and he got the idea that this is what happens in text. Whether your next letter is a consonant or a vowel depends heavily on what the current letter is. So to test this, Markov turned to a poem at the heart of Russian literature, "Eugene Onegin" by Alexander Pushkin. - He took the first 20,000 letters of the poem, stripped out all punctuation and spaces, and pushed them together into one long string of characters. He counted the letters and found that 43% were vowels and 57% were consonants. Then Markov broke the string into overlapping pairs, that gave him four possible combinations, vowel-vowel, consonant-consonant, vowel-consonant, or consonant-vowel. Now, if the letters were independent, the probability of a vowel-vowel pair would just be the probability of a vowel twice, which is about 0.18 or an 18% chance. But when Markov actually counted, he found vowel-vowel pairs only show up 6% of the time, way less than if they were independent. And when he checked the other pairs, he found that all actual values differed greatly from what the independent case would predict. So Markov had shown that the letters were dependent. And to beat Nekrasov, all he needed to do now was show that these letters still followed the law of large numbers. So he created a prediction machine of sorts. He started by drawing two circles, one for a vowel and one for a consonant. These were his states. Now, say you're at a vowel, then the next letter could either be a vowel or a consonant. So he drew two arrows to represent these transitions. But what are these transition probabilities? Well, Markov knew that if you pick a random starting point, there is a 43% chance that it'll be a vowel. He also knew that vowel-vowel pairs occur about 6% of the time. So to find the probability of going from a vowel to another vowel, he divided 0.06 by 0.43 to find a transition probability of about 13%. And since there is a 100% chance that another letter comes next, all the arrows going from the same state need to add up to one. So the chance of going to a consonant is one minus 0.13, or 87%. He repeated this process for the consonants to complete his predictive machine. So let's see how it works. We'll start at a vowel. Next, we generate a random number between zero and one. If it's below 0.13, we get another vowel, if it's above, we get a consonant. We got 0.78, so we get a consonant, then we generate another number, and check if it's above or below 0.67, 0.21, so we get a vowel. Now, we can keep doing this and keep track of the ratio of vowels to consonants. At first, the ratio jumps all over the place, but after a while, it converges to a steady value, 43% vowels and 57% consonants, the exact split Markov had counted by hand. So Markov had built a dependent system, a literal chain of events, and he showed that it still followed the law of large numbers, which meant that observing convergence in social statistics didn't prove that the underlying decisions were independent. In other words, those statistics don't prove free will at all. Markov had shattered Nekrasov's argument, and he knew it. So he ended his paper with one final dig at his rival. "Thus, free will is not necessary to do probability." In fact, independence isn't even necessary to do probability. With this Markov chain, as it came to be known, he found a way to do probability with dependent events. This should have been a huge breakthrough, because in the real world, almost everything is dependent on something else. I mean, the weather tomorrow depends on the conditions today. How a disease spreads depends on who's infected right now, and the behavior of particles depends on the behavior of particles around them. Many of these processes could be modeled using Markov chains. Do people think it was like a mic drop moment and like, "Oh, Nekrasov's out, like, Markov's the man"? Or people didn't really notice, or it was obscure, or? - I feel like people didn't really notice, like, it wasn't a really big thing. And Markov himself seemingly didn't care much about how it might be applied to practical events. He wrote, "I'm concerned only with questions of pure analysis. I refer to the question of the applicability with indifference." Little did he know that this new form of probability theory would soon play a major role in one of the most important developments of the 20th century. On the morning of the 16th of July, 1945, the United States detonated The Gadget, the world's first nuclear bomb. The six kilogram plutonium bomb created an explosion that was equivalent to nearly 25,000 tons of TNT. This was the culmination of the top secret Manhattan Project, a three-year long effort by some of the smartest people alive, including people like J. Robert Oppenheimer, John von Neumann, and a little known mathematician named Stanislaw Ulam. - [Derek] Even after the war ended, Ulam continued trying to figure out how neutrons behave inside a nuclear bomb. Now, a nuclear bomb works something like this. Say you have a core of uranium-235, then when a neutron hits a U-235 nucleus, the nucleus splits releasing energy and, crucially, two or three more neutrons. If, on average, those new neutrons go on to hit and split more than one other U-235 nucleus, you get a runaway chain reaction, so you have a nuclear bomb. But uranium-235, the fissile fuel needed for the bombs was really hard to get. So one of the key questions was just how much of it do you need to build a bomb? And this is why Ulam wanted to understand how the neutrons behave. - [Casper] But then in January of 1946, everything came to a halt. Ulam was struck by a sudden and severe case of encephalitis, an inflammation of the brain, that nearly killed him. His recovery was long and slow, with Ulam spending most of his time in beds. To pass the time, he played a simple card game, Solitaire. But as he played countless games, winning some, losing others, one question kept nagging at him, what are the chances that a randomly-shuffled game of Solitaire could be won? It was a deceivingly difficult problem to solve. Ulam played with all 52 cards where each arrangement created a unique game, so the total number of possible games was 52 factorial, or about eight times 10 to 67. So solving this analytically was hopeless. But then Ulam had a flash of insight, what if I just play hundreds of games and count how many could be won? That would give him some sort of statistical approximation of the answer. Back at Los Alamos, the remaining scientists grappled with much harder problems than Solitaire, like figuring out how neutrons behave inside a nuclear core. In a nuclear core, there are trillions and trillions of neutrons all interacting with their surroundings. So the number of possible outcomes is immense, and computing it directly seemed impossible. - [Derek] But when Ulam returned to work, he had a sudden revelation. What if we could simulate these systems by generating lots of random outcomes like I did with Solitaire? He shared this idea with von Neumann, who immediately recognized its power, but also spotted a key problem. - See, in Solitaire, each game is independent. How the cards are dealt in one game have no effect on the next, but neutrons aren't like that. A neutron's behavior depends on where it is and what it has done before. So you couldn't just sample random outcomes like in Solitaire. Instead, you needed to model a whole chain of events where each step influenced the next. What von Neumann realized is that you needed a Markov chain. So they made one and a much simplified version of it works something like this. Now, the starting state is just a neutron traveling through the core, and from there, three things can happen. It can scatter off an atom and keep traveling, so that gives you an arrow going back to itself. It can leave the system or get absorbed by a non-fissile material, in which case it no longer takes part in the chain reaction, and so it ends its Markov chain, or it can strike another uranium-235 atom, triggering a fission event and releasing two or three more neutrons that then start their own chains. But in this chain, the transition probabilities aren't fixed, they depend on things like the neutron's position, velocity and energy, as well as the overall configuration and mass of uranium. So a fast-moving neutron might have a 30% chance to scatter, a 50% chance to be absorbed or leave, and a 20% chance to cause fission. But a slower-moving neutron would have different probabilities. Next, they ran this chain on the world's first electronic computer, the ENIAC. The computer started by randomly generating a neutron starting conditions and stepped through the chain to keep track of how many neutrons were produced on average per run, known as the multiplication factor k. So if, on average, one neutron produces another two neutrons, then k is equal to two. And if on average every two neutrons produce three neutrons, then k is equal to three over two, and so on. Then, after stepping through the full chain for a specified number of steps, we collect the average k-value and record that number in a histogram. This process was then repeated hundreds of times, and the results tallied up, giving you a statistical distribution of the outcome. If you find that in most cases, k is less than one, the reaction dies down. If it's equal to one, there's a self-sustaining chain reaction, but it doesn't grow. And if k is larger than one, the reaction grows exponentially and you've got a bomb. - With it, von Neumann and Ulam had a statistical way to figure out how many neutrons were produced without having to do any exact calculations. In other words, they could approximate differential equations that were too hard to solve analytically. All that was needed was a name for the new method. Now, Ulam's uncle was a gambler, and the random sampling and high stakes reminded Ulam of the Monte Carlo Casino in Monaco, and the name stuck. The Monte Carlo method was born. The method was so successful that it didn't stay secret for long. By the end of 1948, scientists at another lab, Argonne, in Chicago, used it to study nuclear reactor designs, and from there, the idea spread quickly. Ulam later remarked, "It is still an unending source of surprise for me to see how a few scribbles on a blackboard could change the course of human affairs." And it wouldn't be the last time Markov chain based method changed the course of human affairs. (upbeat music) - In 1993, the internet was open to the public, and soon it exploded. By the mid-1990s, thousands of new pages appeared every day, and that number was only growing. This created a new kind of problem. I mean, how do you find anything in this ever-expending sea of information? In 1994, two Stanford PhD students, Jerry Yang and David Filo, founded the search engine Yahoo, to address this issue, but they needed money. So a year later, they arranged to meet with Japanese billionaire, Masayoshi Son, also known as the Bill Gates of Japan. (gong clangs) - They were looking to raise $5 million for their next startup, but Son has other plans. He offers to invest a full $100 million instead. That's 20 times more than what the founders asked for. So Jerry Yang declines saying, "We don't need that much," but Son disagrees, "Jerry, everyone needs $100 million." (Son laughs) Before the founders get a chance to respond, Son jumps in again and asks, "Who are your biggest competitors?" "Excite and lycos," the pair respond. Son orders his associate to write those names down. And then he says, "If you don't let me invest in Yahoo, I will invest in one of them and I'll kill you." See, Son had realized something. None of the leading search engines at the time had any superior technology. They didn't have a technological advantage over the others. They all just ranked pages by how often a search term appears on a given page. So the battle for the number one search engine would be decided by who could attract the most users, who could spend the most on marketing. - [Announcer 1] Lycos, go get it. - [Announcer 2] Get Lycos, or get lost. - [Announcer 3] This is revolution. (upbeat funky music) ♪ Yahoo ♪ - [Derek] And marketing required a lot of money, money that Son had, so he could decide who won the war. Yahoo's founders realized they were left with no real choice but to accept Son's investment. - [Speaker] So here we are, right in the middle of Yahoo. - [Derek] And within four years, Yahoo became the most popular site on the planet. - [Reporter] In the time it takes to say this sentence, Yahoo will answer 79,000 information requests worldwide, the two men are now worth $120 million each. ♪ Yahoo ♪ - But Yahoo had a critical weakness. See, Yahoo's keyword search was easy to trick. To get your page ranked highly, you could just repeat keywords hundreds of times, hidden with white text on a white background. - One thing they didn't have in those early days was a notion of quality of the result. So they had a notion of relevance saying, does this document talk about the thing that you're interested in? But there wasn't really a notion of which ones are better. - What they really needed was a way to rank pages by both relevance and quality. But how do you measure the quality of a webpage? Well, to understand that, we need to borrow an idea from libraries. - So I'm old enough that library books used to have a paper card in it that was a stamp of all the due dates of when it was due back. You took a book and if it had a lot of those, you said, "Oh, this is probably a good book." And if it didn't have any, you said, "Well, maybe this isn't the best book." - Stamps acted like endorsements. The more stamps, the better the book must be. And the same idea can be applied to the web. Over at Stanford, two PhD students, Sergey Brin and Larry Page, were working on this exact problem. Brin and Page realized that each link to a page can be thought of as an endorsement. And the more links a page sends out, the less valuable each vote becomes. So what they realized is that we can model the web as a Markov chain. - [Derek] To see how this works, imagine a toy internet with just four webpages. Call them Amy, Ben, Chris, and Dan. These are our states. Typically, one webpage links to others, allowing you to move between them. These are our transitions. In this setup, Amy only links to Ben, so there's a 100% chance of going from Amy to Ben. Ben links to Amy, Chris, and Dan, so there's a 33% chance of going to any of those pages, and we can fill out the other transition probabilities in the same way. So now we can run this Markov chain and see what happens. Imagine you're a surfer on this web. You start on a random page, say, Amy, and you keep running the machine and keep track of the percentage of time you spend on each page. Over time, the ratio settles and the scores give us some measure of the relative importance of these pages. You spend the most time on Ben, so Ben is ranked first, followed by Amy, then Dan, and lastly Chris. It might seem like there's an easy way to beat the system, just make 100 pages all linking to your website. Now you get 100 full votes and you'll always rank on top, but that is not the case. While during their first few steps, they might make your page seem important, none of the other websites link to them. So over many steps, their contributions don't matter. You might have many links, but they're not quality links, so they don't affect the algorithm. - But there is still one problem, though, not all pages are connected. In networks like this one, a random server can get stuck in a loop, never reaching the rest of the web. So to fix this, we can set a rule that 85% of the time, our random server just follows a link like normal. But then for about 15% of the time, they just jump to a page at random. This damping factor makes sure that we explore all possible parts of the web without ever getting stuck. By using Markov chains, Page and Brin had built a better search engine, and they called it PageRank. - Because it's talking about how pages react, webpages react with each other and also 'cause the founder's name is Larry Page, so he snuck that in. - With PageRank, Google got much better search results, often getting you to the site you were looking for in one go. Although, to some, this sounded like a terrible idea. - Others said, "Oh, well you're telling me you get a search that will get the right result on the first answer? Well, I don't want that because if it takes them three or four chances, searches to get the right answer, then I have three or four chances to show ads, and if you get 'em the answer right away, I'm just gonna lose them. So, you know, I don't see why better search is better." - But Page and Brin disagreed. They were convinced that if their product was far superior, then people would flock to it. - I would say it actually is a democracy that works. If all pages were equal, anybody can manufacture as many pages as they want. I can set up a billion pages in my server tomorrow. We shouldn't treat them all as equal. Just looking at the data out of curiosity, we found that we had technology to do a better job of search, and we realized how impactful having great search can be. - And so, in 1998, they launched their new search engine to take on Yahoo. Initially, they called it BackRub, after the backlinks it analyzed, but then they realized that maybe that's not the most attractive name. Now, their ambitions were big to essentially index all the pages on the internet, and they needed a name equally as big. So they thought of the largest number they could think of, 10 to the power of 100, a googol. But then when trying to register their domain, they accidentally misspelled it. And so, Google was born. (dramatic music) Over the next four years, Google overthrew Yahoo to become the most used search engine. - Everyone who knows the internet almost certainly knows Google. - Googling is like oxygen to teenagers. - [Casper] And today, Alphabet, which is Google's parent company, is worth around $2 trillion. - When Google makes even the slightest change in its algorithms, it can have huge effects. - Google. - Google. - Google. - Google. - They're on fire. And the reason why they're on fire is because they're focused and they're more focused than Yahoo who does search, they're more focused than Microsoft who does search with Bing. Yahoo has lots of traffic, they always have, they have some really great properties, but I don't think Yahoo is the go-to place, you know. - And at the heart of this trillion dollar algorithm is a Markov chain, which only looks at the current state to predict what's going to happen next. But in the 1940s, Claude Shannon, the father of information theory, started asking a different question. He went back to Markov's original idea of predicting text, but instead of just using vowels and consonants, he focused on individual letters. And he wondered, what if instead of looking at only the last letter as a predictor, I look at the last two? Well, with that, he got text that looked like this. Now, it doesn't make much sense, but there are some recognizable words like "whey", "of", and "the". But Shannon was convinced he could do better. So next, instead of looking at letters, he wondered, what if I use entire words as predictors? That gave him sentences like this, "The head and in frontal attack on an English writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected." Now, clearly, this doesn't make any sense, but Shannon did notice that sequences of four words or so generally did make sense. For instance, "attack on an English writer" kind of makes sense. So Shannon learned that you can make better and better predictions about what the next word is going to be by taking into account more and more of the previous words. It's kind of like what Gmail does when it predicts what you're going to type next. And this is no coincidence, the algorithms that make these predictions are based on Markov chains. - They're not necessarily using letters, you know, -Yeah they use what they call tokens, some of which are letters, some of which are words, marks of punctuation, whatever. So it's a bigger set than just the alphabet. The game is simply, we have this string of tokens that, you know, might be 30 long, and we're asking what are the odds that the next token is this or this or this? - [Derek] But today's large language models don't treat all those tokens equally, because unlike simple Markov chains, they also use something called attention, which tells the model what to pay attention to. So in the phrase, "the structure of the cell," the model can use previous context like blood and mitochondria to know the cell most likely refers to biology rather than a prison cell. And it uses that to tune its prediction. But as large language models become more widespread, one concern is that the text they produce ends up on the internet and that becomes training data for future models. - When you start doing that, the game is very soon over. You come, in this case, to us, a very dull, stable state, it just says the same thing over and over and over again forever. The language models are vulnerable to this process. - And any system like this where we have a feedback loop, will become hard to model using Markov chains. Take global warming, for instance, as we increase the amount of carbon dioxide in the air, the average temperature of the Earth increases. But as the temperature increases, the atmosphere can hold more water vapor, which is an incredibly powerful greenhouse gas. And with more water vapor, the temperature increases further allowing for even more water vapor. So you get this positive feedback loop, which makes it hard to predict what's going to happen next. So there are some systems where Markov chains don't work, but for many other dependent systems, they offer a way of doing probability. - But what's fascinating is that all these systems have extremely long histories. I mean, you could trace back all the letters in a text, trace back all the interactions of what a neutron did, or trace back the weather for weeks. But the beautiful thing Markov and others found is that for many of these systems you can ignore almost all of that. You can just look at the current state and forget about the rest, that makes these systems memoryless. And it's this memoryless property that makes Markov chains so powerful because it's what allows you to take these extremely complex systems and simplify them a lot to still make meaningful predictions. - [Derek] As one paper put it, "Problem-solving is often a matter of cooking up an appropriate Markov chain." - It's kind of ridiculous to me that this basic fact of mathematics would come out of a fight like that, which, you know, really had nothing to do with it. But all the evidence suggests that it really was this determination to show up Nekrasov that led Markov to do the work. - But there's one question we still haven't answered. When playing Solitaire, how did Ulam know his cards were perfectly shuffled? I mean, how many shuffles does it take to get a completely random arrangement of cards? - If you have a deck of cards, you need to shuffle it, right? How often, if you're shuffling, like, you know, you split it in half, and then you do the (cards riffling). How often do you have to shuffle it to make it completely random? - Two. - Two? I'm going with 26. - Yeah, four times. - Four times? - I don't know, 52 times? - Okay. Okay. It's not a bad guess. - Seven? - It is seven. - Really? - Yeah. So you can think of card shuffling as a Markov chain where each deck arrangement is a state, and then each shuffle is a step. And so for a deck of 52 cards, if you riffle shuffle it seven times, then every arrangement of the deck is about equally likely, so it's basically random. But I can't shuffle like that. So for me, what I do is I do it like this. How many times do you think you have to shuffle like this to get it random? (beeping) - What do you think? And perhaps more importantly, how would you go about working it out? Well, that's where today's sponsor Brilliant comes in. Brilliant is a learning app that gets you hands-on with problems just like this. Whether it's math, physics, programming, or even AI, Brilliant's interactive lessons and challenges let you play your way to a sharper mind. You can discover how large language models actually work from basic Markov chains to complex neural networks, or dig into the math behind this shuffling question. It's a fun way to build knowledge and skills that help you solve all kinds of problems, which brings us back to our shuffle. So Casper, what actually is the answer? - It's actually over 2,000. - What? - Over- - Crazy, right? - Yeah. - So the next time someone offers to shuffle before a game, make sure they're doing it right, seven riffles or it doesn't count. But the interesting part isn't just knowing that, it's understanding why and seeing how a simple question can lead you to some surprisingly complex mathematics. And that's what Brilliant is all about. So to try everything Brilliant has to offer for free for a full 30 days, visit brilliant.org/veritasium, click that link in the description or scan this handy QR code. And if you sign up, you'll also get 20% off their annual premium subscription. So I wanna thank Brilliant for sponsoring this video and I wanna thank you for watching. (upbeat music) - Easy. (light music)