
The Equation That Deliberately Forgets Everything

/ METADATA
DATE: 2026.5.6
AUTHOR: SARATH THARAYIL
READING TIME: 20 MIN READ
CATEGORIES: Mathematics, Technology, Science
/ ARTICLE

There is a type of mathematical system that works by refusing to remember anything.

It does not care where you came from. It does not care what you did last week, last year, or ever. It looks only at where you are right now, and from that single fact, it calculates where you are going next.

This sounds like a crippling limitation. It is, in fact, one of the most powerful ideas in the history of applied mathematics.

The Markov chain, named after a Russian mathematician who spent years trying to pick a fight with the Orthodox Church, is the invisible scaffolding underneath Google's search rankings, the token-by-token logic of every large language model, the card shuffling problem that every casino in the world cares about, the flood prediction models used in disaster-prone river basins, and the nuclear physics simulations that made the hydrogen bomb possible.

All of it runs on a single, almost laughably simple principle: the next state depends only on the current state. Nothing else. The past is irrelevant.

This property has a name. Mathematicians call it the Markov property. The rest of us might call it structured amnesia. And it turns out that structured amnesia, applied correctly, can model almost anything.


The Theologian Who Started It All

The story of Markov chains begins not in a laboratory but in a theological argument, which is an unusual place for a major mathematical breakthrough to originate.

In the early 1900s, Russian mathematics was divided into two hostile camps. The Moscow Mathematical Society was deeply intertwined with the Russian Orthodox Church and conservative monarchist politics. The St. Petersburg school, where Andrey Markov worked, was secular, atheist, and politically progressive.

The flashpoint was a paper published in 1902 by Pavel Nekrasov, a Moscow mathematician and trained theologian. Nekrasov had noticed that broad social statistics, things like regional marriage rates, crime frequencies, and demographic patterns, reliably conformed to the Law of Large Numbers, the mathematical principle stating that averages stabilize over large sample sizes.

Nekrasov built an argument from this observation. The Law of Large Numbers, he wrote, required strict independence between events. Coin flips are independent. Dice rolls are independent. Each outcome has no effect on the next. Since human social decisions conformed to this same statistical law, and since the law required independence, therefore human decisions must be independent of one another. Therefore, humans were not controlled by determinism. Therefore, free will was real. Therefore, God existed.

It was a sophisticated piece of mathematical theology, and Andrey Markov despised it completely.

Markov vs. Nekrasov

Nekrasov argued that the Law of Large Numbers could only apply to independent events, so social statistics proving the law implied free will. Markov set out to demolish this by proving the law could apply to deeply dependent, causally linked events. A mathematical theorem became a proxy for a fight over God, determinism, and the soul.

Markov, described by his contemporaries as abrasive and confrontational (he would eventually be formally excommunicated from the Orthodox Church for requesting to be excommunicated, because he wanted it on record), identified the fatal vulnerability in Nekrasov's reasoning: the assumption that the Law of Large Numbers required independence was simply wrong. If he could prove that dependent events also converged to stable averages, Nekrasov's theological proof collapsed.

To do this, he needed a massive dataset of obviously dependent events. He chose Alexander Pushkin's novel in verse, Eugene Onegin, a text deeply embedded in Russian cultural identity. He manually analyzed the first 20,000 letters, categorizing each as a vowel or a consonant.

The dependency was obvious: in Russian, a consonant is highly likely to be followed by a vowel. Each letter influences the probability of the next. The events are explicitly chained together. And yet Markov proved that the overall frequency of vowels and consonants converged perfectly to a stable average, exactly as the Law of Large Numbers predicted.
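Markov's experiment is small enough to re-run in miniature. The sketch below uses repeated English text as a stand-in for Pushkin's Russian (the sample sentence and the 500x repetition are arbitrary choices): the letter-to-letter transitions are clearly dependent, yet the overall vowel frequency is still a fixed, stable average.

```python
# A miniature of Markov's Eugene Onegin experiment, using repeated English
# text as a stand-in for the Russian original. Each letter is classified as
# vowel (V) or consonant (C); transition counts expose the dependence.
text = "the quick brown fox jumps over the lazy dog " * 500
states = ["V" if c in "aeiou" else "C" for c in text if c.isalpha()]

# Count transitions between successive states.
counts = {("V", "V"): 0, ("V", "C"): 0, ("C", "V"): 0, ("C", "C"): 0}
for a, b in zip(states, states[1:]):
    counts[(a, b)] += 1

# The events are clearly dependent: P(vowel | consonant) != P(vowel | vowel).
p_v_after_c = counts[("C", "V")] / (counts[("C", "V")] + counts[("C", "C")])
p_v_after_v = counts[("V", "V")] / (counts[("V", "V")] + counts[("V", "C")])

# Yet the overall vowel frequency is a stable average, as Markov proved.
vowel_freq = states.count("V") / len(states)
print(p_v_after_c, p_v_after_v, vowel_freq)
```

The two conditional probabilities differ sharply, which is exactly the dependence Nekrasov claimed would break the Law of Large Numbers; the overall frequency stabilizes anyway.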

He presented these findings to the Imperial Academy of Sciences in St. Petersburg on January 23, 1913.

He had proven that dependent, sequentially linked events produce stable, predictable statistical patterns. In doing so, he accidentally created the theoretical foundation for stochastic processes, and established the mathematical architecture that would eventually power the internet, artificial intelligence, and thermonuclear weapons.

Nekrasov vs. Markov, side by side:

Philosophical goal. Nekrasov: a mathematical proof of free will and divine agency. Markov: the separation of probability theory from religious doctrine.
Core claim. Nekrasov: the Law of Large Numbers requires strict independence. Markov: the Law of Large Numbers holds for dependent, chained events.
Data source. Nekrasov: sociological statistics such as crime and marriage rates. Markov: textual analysis of 20,000 letters of Eugene Onegin.
Legacy. Nekrasov: a marginalized theological interpretation of statistics. Markov: the founding framework of stochastic processes.

Nekrasov's theological argument was not merely disputed by this. It was mathematically invalidated.


The Playing Cards That Cracked Nuclear Physics

For several decades after 1913, the Markov chain was a theoretical curiosity with limited practical application. That changed during World War II, in the most consequential scientific project in history.

Stanislaw Ulam was a Polish-born mathematician working at Los Alamos on the Manhattan Project. He was recovering from brain surgery in 1946, spending his convalescence playing solitaire, when he had an insight that would define modern computation.

He tried to calculate the exact probability of winning a solitaire hand. He quickly realized this was impossible by conventional analytical methods. A standard 52-card deck can be arranged in approximately $8 \times 10^{67}$ distinct permutations. Writing equations that covered every possible path was not merely difficult. It was categorically intractable.

But then Ulam noticed something. If you just play the game 100 times and count how often you win, the proportion converges to the true probability. You do not need to solve the deterministic equations. You sample the stochastic process repeatedly and let the Law of Large Numbers do the work.

"The idea was to try out thousands of such trials and simply observe the results. This could be done if sufficient numbers of trials were made by a mechanical process."

— Stanislaw Ulam
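Ulam's move translates into a few lines of code. The example below is a toy stand-in for solitaire, estimating a probability whose exact value (4/52) we can check against: instead of solving anything, it plays the game repeatedly and counts.

```python
import random

# A toy version of Ulam's insight: rather than deriving a probability
# analytically, sample the process and count. Here we estimate the chance
# that a shuffled 52-card deck has an ace on top; the exact answer is 4/52.
random.seed(0)
deck = ["ace"] * 4 + ["other"] * 48
trials = 100_000
hits = 0
for _ in range(trials):
    random.shuffle(deck)          # one "trial" of the stochastic process
    hits += deck[0] == "ace"
estimate = hits / trials
print(estimate)  # converges toward 4/52 as trials grow
```

The same counting loop, with the shuffle replaced by a simulated neutron trajectory, is the Monte Carlo method.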

This was not just a shortcut for card games. The same logic applied to neutron diffusion in a nuclear reactor core. Predicting the precise behavior of an individual neutron as it ricochets through fissile material involves an impossible number of interacting variables. But each neutron's behavior depends only on its current state: its position and kinetic energy right now. What happened before is irrelevant. This is the Markov property.

Ulam recognized that you could model neutron chains the same way he modeled solitaire: simulate thousands of probabilistic trajectories and observe the aggregate behavior. He shared the idea with John von Neumann, who understood immediately that this required a computer.

Von Neumann, who was already working with the ENIAC at the time, alongside Nicholas Metropolis (who named the method "Monte Carlo" after the casinos his uncle frequented), began translating Markov's theoretical chains into machine logic.

In April and May of 1948, the first true Monte Carlo calculations ran on the ENIAC. These were also the first programs ever run on an electronic computer in the modern stored-program paradigm. The programming effort was led by Klára von Neumann, who spent 32 consecutive days rewiring and modifying the 27-ton machine to process the neutron diffusion calculations.

Conceptualization, 1946. Stanislaw Ulam: the solitaire problem revealed that statistical sampling could bypass intractable deterministic equations.
Algorithm design. John von Neumann and Nicholas Metropolis: translated the sampling insight into the Monte Carlo method.
Machine execution, 1948. Klára von Neumann: programmed the ENIAC to run the first stored-program stochastic simulations.
Analog simulation, 1947. Enrico Fermi: built the FERMIAC, a mechanical device that physically traced probabilistic neutron paths through materials.

Fermi had independently arrived at similar ideas in the 1930s while studying neutron moderation in Rome, and in 1947 quietly had a mechanical device called the FERMIAC built to his design, which drew two-dimensional visual traces of neutron "genealogies" through reactor materials using rotating brass drums.

The connection between Markov's analysis of Pushkin's letters in 1913 and the design of the hydrogen bomb in 1948 is not metaphorical. It is structural. The same mathematical framework, the same memoryless state transitions, the same convergence to stable distributions, runs through both.


How Google Turned the Internet Into a Markov Chain

By the late 1990s, the World Wide Web had become vast, chaotic, and nearly unusable. Early search engines ranked pages by keyword frequency, which made them trivially easy to manipulate. Anyone who stuffed a page with repeated words could outrank genuinely useful content. The internet was drowning in its own noise.

Larry Page and Sergey Brin at Stanford approached this as a problem in network theory and stochastic processes. They modeled the entire internet as a directed graph: billions of pages as nodes, hyperlinks as edges. Then they asked a question that turned out to be exactly the right question.

What if a person browsed the web completely at random, clicking any available link with equal probability? Where would they spend most of their time?

This is a Markov chain. The state space is every web page. The transition probability from page $i$ to page $j$ is $1/m_i$, where $m_i$ is the total number of outgoing links from page $i$. The surfer has no memory of where they have been. Only where they are now determines where they go next.

The authoritative value of any given page, what PageRank actually measures, is the stationary probability $\pi_i$ of this chain: the proportion of time the random surfer ends up on page $i$ over an infinite browsing horizon. Pages that attract many incoming links from other well-ranked pages accumulate a higher stationary probability. They become natural attractors in the probability flow.

$$\pi_i = \frac{1 - \gamma}{N} + \gamma \sum_{j \in B_i} \frac{\pi_j}{m_j}$$

where $\gamma = 0.85$ is the damping factor, $N$ is the total number of pages, $B_i$ is the set of pages linking to page $i$, and $m_j$ is the number of outgoing links from page $j$.

The web created two immediate problems for this model. First, some pages had no outgoing links at all. A page with no exits is an absorbing state: once the surfer lands there, the probabilistic chain terminates. Probability mass drains out of the system. Second, tight clusters of interlinked pages could trap the surfer in recursive loops, artificially inflating their rank.

The damping factor solved both. By assigning a 15% probability that the surfer gets bored and teleports to a completely random page, Page and Brin ensured that every page remained reachable from every other page and that no cluster could monopolize the flow. Mathematically, this guaranteed the transition matrix was strictly positive, satisfying the conditions of the Perron-Frobenius theorem and ensuring the existence of a unique stationary distribution.
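A minimal sketch of that computation, on an invented four-page web. The link graph below is hypothetical, and dangling-node handling is skipped because every page here happens to have an out-link; the update is plain power iteration of the damped chain.

```python
# A minimal PageRank sketch: power iteration of the damped random-surfer
# chain on a hypothetical four-page web. The link graph is invented.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to
N, gamma = 4, 0.85

pi = [1 / N] * N                  # start the surfer anywhere
for _ in range(100):
    nxt = [(1 - gamma) / N] * N   # 15% chance: teleport to a random page
    for j, outs in links.items():
        for i in outs:            # 85% chance: follow a random out-link
            nxt[i] += gamma * pi[j] / len(outs)
    pi = nxt
print(pi)  # stationary distribution over the four pages
```

Page 2, which collects links from three of the four pages, ends up with the largest stationary probability, and page 3, which nothing links to, gets only the teleportation floor of $(1 - \gamma)/N$.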

What PageRank Actually Computes

The rank of a web page is not a judgment about its quality. It is the stationary probability of a memoryless random walk across the internet. Quality emerges not because the algorithm evaluates content, but because human linking behavior, aggregated across billions of decisions, encodes quality into the network's topology. PageRank reads that topology as a Markov chain.

That mathematical mechanism, applied to what now exceeds 40 billion indexed pages, became the structural foundation of the modern digital economy.


ChatGPT Is a Very Large Markov Chain

Andrey Markov analyzed the sequential dependencies of letters in Eugene Onegin to prove a point about probability. He was manually computing, letter by letter, the probability that a vowel follows a consonant.

What modern large language models do is structurally identical to this, scaled by a factor of roughly $10^{50}$.

Claude Shannon, the father of information theory, extended Markov's original insight in the 1940s by recognizing that natural language has deep, hierarchical statistical structure. The probability of the next word depends not just on the word immediately before it, but on the entire preceding context. The longer the context window you can retain, the more accurately you can predict what comes next.

Contemporary language models operate as autoregressive systems: they predict the next token (a word or sub-word fragment) given the full sequence of prior tokens in their context window. If you define the vocabulary size as $T$ (often exceeding 50,000 tokens) and the context window as $K$, the model is traversing a state space of size $O(T^K)$, each state defined by the complete sequence of tokens currently in context.

The transitions between states are the probability distributions the model assigns to every possible next token, computed by the neural network's final softmax layer. And because prediction at step $n+1$ depends strictly on the state at step $n$, with no access to any memory outside the context window, the system is, by formal definition, a Markov chain.

$$P(x_{n+1} \mid x_1, x_2, \ldots, x_n) = P(x_{n+1} \mid x_n, x_{n-1}, \ldots, x_{n-K+1})$$
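In miniature, the generation loop looks like this. The sketch below is an order-1 word-level chain on an invented corpus, with counted transitions standing in for the neural network; a real LLM runs the same sample-one-token-then-recompute loop with a learned distribution and a far longer context window.

```python
import random

# A toy autoregressive sampler: an order-1 word-level Markov chain built by
# counting a tiny invented corpus. An LLM is the same loop in principle,
# with a neural network supplying the next-token distribution instead of
# raw counts, and a context far longer than one word.
corpus = "the cat sat on the mat and the cat ate the rat".split()
nexts = {}
for a, b in zip(corpus, corpus[1:]):
    nexts.setdefault(a, []).append(b)

random.seed(1)
state = "the"
out = [state]
for _ in range(8):
    # The next word depends only on the current state: the Markov property.
    state = random.choice(nexts.get(state, corpus))
    out.append(state)
print(" ".join(out))
```

The fallback to the whole corpus when a word has no recorded successor is an arbitrary choice for this toy; the point is that the sampler never consults anything but the current state.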

Why LLMs Repeat Themselves

When a language model gets stuck in a repetitive loop, generating the same phrase over and over, this is not a bug in the colloquial sense. It is the model falling into an absorbing cycle in its Markov chain: a set of states where the transition probabilities consistently point back to the same sequence. The mathematical phenomenon Markov described in 1913 manifests visibly in the output of a 2024 frontier AI model.

This framing is not merely academic. It provides a precise, testable mathematical vocabulary for understanding what language models can and cannot do, where their outputs come from, and what their failure modes look like. The apparent fluency and apparent reasoning of a large language model emerge from a stochastic process operating over an enormous but ultimately memoryless state space.


The Dark Side: When the Chain Eats Itself

The Markov property creates one catastrophic vulnerability when applied to recursive systems: if the output of a Markov chain becomes the input for training the next version of itself, information degrades with mathematical inevitability.

This is model collapse, and it is now one of the most serious structural problems facing the AI industry.

As generative models proliferate across the internet, they are increasingly consuming their own outputs. Human-generated text becomes harder to source. Training datasets fill with synthetic content. A model trained on synthetic data from a prior model produces outputs that are subtly degraded. Those outputs become training data for the next model. The degradation compounds.

Markov chains operating as approximation systems have an inherent bias toward high-probability outcomes. They capture the center of a distribution reliably. They systematically underrepresent the rare, unusual, and surprising events at the tails. When a model trained on human data generates synthetic outputs, it produces more average text and less exceptional text. When that synthetic text becomes training data, the next model inherits an even more compressed view of the distribution.
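The tail-loss mechanism is easy to demonstrate. The sketch below uses a two-category toy distribution and an arbitrary sample size; each generation is "trained" only on the previous generation's synthetic output, and once the rare category's probability hits zero, no later generation can ever resample it.

```python
import random

# A toy demonstration of tail erosion under recursive training. Start from
# a distribution containing a rare category, then repeatedly draw a finite
# synthetic sample and re-estimate the distribution from that sample alone.
# The estimates drift, and a category that reaches probability 0 is gone
# for good: sampling cannot regenerate an event with zero probability.
random.seed(0)
dist = {"common": 0.95, "rare": 0.05}
history = []
for generation in range(30):
    sample = random.choices(list(dist), weights=dist.values(), k=40)
    dist = {w: sample.count(w) / len(sample) for w in dist}
    history.append(dist["rare"])
print(history)  # the estimated probability of "rare", generation by generation
```

Injecting fresh draws from the original distribution at each step (the "fresh human data" of the analogy) is the only way to keep the rare category alive indefinitely.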

The Two Stages of Collapse

Early collapse: Performance metrics look stable or even improve. The model becomes hyper-optimized for common cases. Rare but important knowledge quietly disappears from the distribution.

Late collapse: The model begins confusing concepts, losing variance, producing confidently wrong outputs. The damage is irreversible. You cannot recover lost tail information from a model that has already forgotten it.

Claude Shannon's Data Processing Inequality makes this inevitable in any Markov chain passing through lossy channels. If you model each round of recursive training as a state transition through a noisy channel, the mutual information between the current model and the original human data distribution decays exponentially with each iteration. Mathematical analyses suggest this decay runs at a rate of 0.2 to 0.4 per iteration. Without at least 70% fresh human-generated data in each training round, the chain collapses.

The profound irony is that the most sophisticated AI systems ever built are fundamentally dependent on human creativity to remain functional. The chain cannot sustain itself. It requires continuous injection of unmapped, surprising, human-generated inputs to maintain its variance and its connection to reality.

The entities who produce that content are human artists, writers, programmers, and thinkers. If the Markovian chain of AI generation eats all of their work and produces only synthetic approximations going forward, the mathematical consequence is not superintelligence. It is a slow, dignified regression toward the mean.


Predicting Floods With a System That Cannot Remember Last Month

Markov chains are widely used in hydrology and climatology to model rainfall patterns, drought cycles, and flood risk. The application is straightforward: define discrete rainfall states (dry, light rain, heavy rain, extreme flood), build a transition probability matrix from decades of historical data, and use it to forecast what tomorrow's state is likely to be given today's state.

Studies of the Periyar River basin in Kerala, a region devastated by catastrophic flooding in August 2018, have used exactly this framework. By categorizing daily rainfall into states defined by Indian Meteorological Department intensity thresholds, researchers construct matrices that capture the conditional probability of each state following each other state. The steady-state distribution of these chains yields long-term equilibrium probabilities: across a sufficiently long horizon, roughly 20% flood years, 60% normal rainfall years, 20% drought years.
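A sketch of such a chain, with an invented transition matrix rather than real Periyar basin data (the three states and every probability below are illustrative only):

```python
# A hedged sketch of a rainfall-state Markov chain. The 3x3 transition
# matrix is invented for illustration, not fitted to real hydrological
# data. Iterating the chain from any starting state converges to the
# steady-state (stationary) distribution: the long-run state frequencies.
STATES = ["drought", "normal", "flood"]
P = [
    [0.60, 0.35, 0.05],  # from drought: P(drought), P(normal), P(flood)
    [0.15, 0.70, 0.15],  # from normal
    [0.05, 0.45, 0.50],  # from flood
]

pi = [1.0, 0.0, 0.0]     # start in "drought"; the limit is independent of this
for _ in range(500):
    pi = [sum(pi[j] * P[j][i] for j in range(3)) for i in range(3)]
print(dict(zip(STATES, (round(p, 3) for p in pi))))
```

The resulting vector is the chain's answer to "what fraction of years are flood years over a long horizon", which is exactly the kind of equilibrium figure the basin studies report.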

This is operationally valuable for disaster management authorities allocating resources and planning infrastructure.

But the Markov property's memorylessness, the feature that makes the model tractable, is also its critical failure mode in this exact context.

Whether a river floods tomorrow depends not just on how much it rained today. It depends on how saturated the soil is after three weeks of prior rain, how full the reservoirs are, what the snowmelt situation is in upstream catchments, and whether a recent cyclone has altered the atmospheric pressure gradient. A Markov chain, by construction, cannot see any of this. It sees only today's state.

Discrete-Time Markov Chain. Memory: none (current state only). Strengths: low cost, transparent, excellent baseline probabilities. Weaknesses: cannot account for soil saturation, cumulative history, or long-term accumulation.
Recurrent Neural Network. Memory: short-term. Strengths: better than standard machine learning for time series. Weaknesses: the vanishing gradient problem; fails on long-term dependencies.
Long Short-Term Memory (LSTM). Memory: deep, long-term. Strengths: extremely accurate in complex non-linear environments. Weaknesses: computationally expensive, opaque, requires massive training data.

Modern flood forecasting increasingly uses hybrid architectures: the deterministic outputs of traditional hydrological simulation software combined with LSTM networks that explicitly retain historical context through gated memory cells. For rainfall-induced landslide prediction in Idukki district, LSTM models have achieved 97.1% accuracy, dramatically outperforming static threshold approaches.

When to Use a Markov Chain for Weather

A Markov chain is the right tool when you need a computationally cheap, mathematically transparent baseline probability: what is the expected frequency of flood years over the next decade? It is the wrong tool when you need to know whether a specific river will flood next Tuesday, because that question is controlled by a month of history the Markov chain cannot see.

The lesson is not that Markov chains fail at hydrology. It is that memorylessness is a feature when you need aggregate statistics and a bug when you need situational awareness.


Seven Shuffles

The most elegant application of Markov chain theory might also be the most unexpected: proving exactly how many times you need to shuffle a deck of cards.

A standard deck of 52 cards can be arranged in $52!$ permutations, approximately $8 \times 10^{67}$ possibilities. The goal of shuffling is to traverse this state space and reach the stationary distribution, which is a uniformly random permutation where every arrangement is equally likely. This is a Markov chain problem: each shuffle is a stochastic transition between states (deck arrangements), and you want to know how many transitions you need before the chain reaches stationarity.

Mathematician Persi Diaconis, a former professional magician who became one of the world's leading probability theorists, solved this in 1992 alongside Dave Bayer. They modeled the riffle shuffle using the Gilbert-Shannon-Reeds model: cut the deck into two packets according to a binomial distribution, then drop cards from the bottom of either packet with probability proportional to the remaining stack size.

To measure distance from true randomness, they used total variation distance: the maximum difference in probability between the current deck distribution and the perfectly uniform one. Their analysis revealed what they called the "cutoff phenomenon."

Shuffles, total variation distance from uniform, and what that means:

1 shuffle: ~1.00. Completely ordered.
3 shuffles: ~0.93. Still highly predictable.
5 shuffles: ~0.92. Almost no improvement.
6 shuffles: ~0.44. Sharp transition begins.
7 shuffles: ~0.17. Effectively random.
8+ shuffles: diminishing returns. Marginally more random.

Five shuffles do almost nothing. Six shuffles cross a threshold. Seven shuffles produce genuine randomness. The mathematical maxim that emerged from this analysis is now standard across both casino gaming and cryptographic key generation: seven riffle shuffles are required to randomize a 52-card deck.
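The experiment scales down. The sketch below simulates the Gilbert-Shannon-Reeds shuffle on a 5-card deck, where the full permutation space has only 120 states, so the total variation distance can be estimated directly by sampling; the deck size, trial count, and shuffle counts are arbitrary choices for illustration.

```python
import random
from itertools import permutations

# A small-deck sketch of the Bayer-Diaconis experiment. We simulate the
# Gilbert-Shannon-Reeds riffle shuffle on a 5-card deck (52 cards would
# need astronomically many samples) and estimate the total variation
# distance to the uniform distribution after k shuffles.

def gsr_shuffle(deck):
    """One GSR riffle: binomial cut, then interleave, dropping from each
    packet with probability proportional to its remaining size."""
    cut = sum(random.random() < 0.5 for _ in deck)
    left, right = deck[:cut], deck[cut:]
    out = []
    while left or right:
        if random.random() * (len(left) + len(right)) < len(left):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out

def tv_distance(k, trials=10_000):
    """Estimate total variation distance from uniform after k shuffles."""
    counts = dict.fromkeys(permutations(range(5)), 0)
    for _ in range(trials):
        deck = list(range(5))
        for _ in range(k):
            deck = gsr_shuffle(deck)
        counts[tuple(deck)] += 1
    uniform = 1 / len(counts)
    return 0.5 * sum(abs(c / trials - uniform) for c in counts.values())

random.seed(0)
tvs = [round(tv_distance(k), 2) for k in (1, 2, 4, 6)]
print(tvs)  # distance falls off sharply as shuffles accumulate
```

Even at this tiny scale the cutoff shape appears: one shuffle leaves the deck nearly deterministic, while a few more collapse the distance toward the sampling noise floor.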

"Seven shuffles are necessary and suffice. With six, the deck is far from random. With eight, it is barely more random than after seven."

— Persi Diaconis

A competing analysis by Lloyd N. Trefethen and Lloyd M. Trefethen, a father-and-son pair, challenged this in 2000 by reframing the problem through information theory. They treated the ordered deck as containing 225.58 bits of information, and each shuffle as an entropy injector. Under this framework, six shuffles reduce the information content to less than 1% of its original value, making six shuffles sufficient for practical human play.

The disagreement is illuminating. Both analyses are mathematically correct. They use different definitions of randomness, and those different definitions yield different answers. Randomness is not an objective physical state. It is a measurement relative to a criterion of what counts as random enough.

The messier alternative, the "smooshing" or washing method where cards are spread face-down on a table and swirled by hand, requires over 2,000 individual iterations, roughly one full minute of vigorous mixing, to reach the same total variation distance that seven structured riffle shuffles achieve in seconds.

Structured randomness, it turns out, is far more efficient at reaching true randomness than unstructured chaos.


The Mathematics of Forgetting

What unites all of these applications is the same counterintuitive insight that Andrey Markov proved against Nekrasov's theological argument in 1913: you do not need to remember where you came from to predict where you are going.

A system that focuses entirely on its present state and ignores its history can model nuclear physics, rank every page on the internet, generate human language, predict monsoon patterns, and solve card shuffling. The memorylessness is not a compromise with reality. For many systems, it is the precise mathematical structure of reality.

The limitations are equally important. When history genuinely matters, when soil saturation from three weeks ago determines whether a river floods tomorrow, when the entire context of a conversation shapes what a reasonable next sentence would be, the Markov property becomes a constraint that requires augmentation. LSTMs, attention mechanisms, longer context windows: these are all engineering attempts to extend how far back a Markov-like system can effectively see.

But the core insight holds across a century of applications. A sequence of events where only the present matters can still produce stable, predictable, statistically sound patterns. Andrey Markov proved this by counting vowels in Pushkin by hand, specifically to win an argument about God.

He won the argument. And the mathematics he developed in the process became the structural foundation of the modern world.

The Practical Takeaway

If you are modeling a system where only the present state determines the next state, and where you have enough observations to estimate transition probabilities reliably, a Markov chain is almost certainly the right starting point. It is computationally cheap, mathematically transparent, and has a century of theory behind it. If the system's history matters, reach for something that retains memory. But start with the Markov chain. Most of the time, it is already more powerful than it looks.
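As a starting point, that can be as little as counting. The sketch below fits a transition matrix from an observed sequence of states; the two-state weather sequence is invented for illustration.

```python
from collections import defaultdict

# A minimal starting point, per the takeaway above: estimate a transition
# matrix from an observed state sequence by counting consecutive pairs and
# normalizing each row. The weather sequence is invented for illustration.
def fit_markov(seq):
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return {s: {t: n / sum(row.values()) for t, n in row.items()}
            for s, row in counts.items()}

P = fit_markov(["sun", "sun", "rain", "sun", "rain", "rain", "sun"])
print(P)  # e.g. P["sun"]["rain"] comes out to 2/3 on this sequence
```

Each row of the result is a probability distribution over next states, which is all a discrete-time Markov chain is; everything else in this article is built on top of exactly this object.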

If this was worth sharing, send it to someone on 𝕏 or LinkedIn. Got a question or a thought? Drop me a message. I read everything.

Sarath Tharayil
/ THAT'S A WRAP

Have a great day.

Thanks for reading all the way to the end.