Where things quietly fall apart
Your team has just deployed a massive, state-of-the-art model behind a secure API. It cost millions to train, took months of agonizing fine-tuning, and it is the undeniable crown jewel of your company’s competitive edge. You feel secure.
Months later, a tiny, open-source model hits the market. It costs pennies to run, yet it performs with an eerie, familiar competence. Your security team runs an audit. The results? Zero compromises. No weights leaked. No training data stolen. No unauthorized access. And yet, your model’s hard-earned intelligence has clearly been siphoned away. You have not been hacked. You have been observed, mimicked, and extracted.
This is the uncomfortable new reality of AI. This is distillation.
Distillation is not compression
It is often introduced as a pragmatic trick: take a large model (the teacher), train a smaller model (the student) to imitate it, deploy the cheaper one.
At its core, distillation is behavioural transfer. The student does not learn the world directly. It learns how the teacher responds to the world.
Think about what that means for a moment. You do not need the teacher’s weights. You do not need its architecture. You do not need to reverse-engineer a single parameter. You just need access to its outputs, at scale, and patience. Given enough (prompt, response) pairs, a student model will begin to approximate the teacher’s behaviour with remarkable fidelity. The teacher’s reasoning patterns, its tone, its refusals, its strengths, its particular way of breaking down a coding problem, all of it bleeds through, encoded in the surface form of responses.
This is what makes distillation so dangerous as an attack vector. It turns your API into a training dataset.
How distillation actually works
Definition
Knowledge distillation is a model compression technique introduced by Hinton et al. (2015) where a smaller student model is trained to reproduce the output distribution of a larger teacher model, rather than learning directly from ground-truth labels.
To understand why distillation is such an effective attack, you need to understand what the teacher is actually leaking when it responds to a query. It is not just an answer. It is a probability distribution.
Soft labels vs. hard labels
When a neural network classifies something, the final layer produces a vector of logits over all possible outputs. For a language model, that is a vector over the entire vocabulary, often 50,000 to 150,000 tokens. After a softmax, you get a probability distribution.
The naive approach to training is to use hard labels: the correct answer gets probability 1.0, everything else gets 0. This is standard cross-entropy loss.
```python
# Hard label training - one-hot target
loss = cross_entropy(student_logits, ground_truth_token)
```

Distillation uses soft labels: instead of a one-hot vector, you train the student to match the teacher’s full probability distribution over outputs.
```python
# Soft label distillation loss
teacher_probs = softmax(teacher_logits / T)  # T = temperature
student_probs = softmax(student_logits / T)

distillation_loss = KL_divergence(teacher_probs, student_probs)
```

The difference is enormous. A hard label for the next token “Paris” tells the student: “the answer is Paris.” The teacher’s soft distribution over tokens might look like: Paris 0.72, Lyon 0.08, France 0.06, Berlin 0.03, … That distribution encodes the teacher’s entire learned understanding of the semantic neighbourhood. It knows Paris and Lyon are similar in ways that Paris and “banana” are not. That relational structure is invisible in a hard label. It is fully present in a soft label.
Intuition
Think of it this way: a hard label is a final verdict. A soft label is a confidence-weighted argument. Training on verdicts teaches you what. Training on arguments teaches you how the model thinks.
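To make the verdict-versus-argument contrast concrete, here is a minimal NumPy sketch. The tiny vocabulary and the teacher logits are invented for the “capital of France” example; the point is that a one-hot target carries zero relational information, while a soft target does not:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    # Shannon entropy: how much structure a target distribution carries
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical teacher logits over a toy vocabulary
vocab = ["Paris", "Lyon", "France", "Berlin", "banana"]
teacher_logits = [6.0, 3.8, 3.5, 2.8, -2.0]

hard_target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one-hot "Paris"
soft_target = softmax(teacher_logits)

print(entropy_bits(hard_target))  # zero bits: a bare verdict
print(entropy_bits(soft_target))  # nonzero: the semantic neighbourhood leaks through
```

Note that the soft target also preserves the ordering Paris > Lyon > banana, which is exactly the relational structure a hard label throws away.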
Temperature scaling
The temperature parameter T controls how much of this relational information leaks through. At T=1, the teacher’s distribution is sharp, and the highest-probability token dominates. At T=4 or T=8, the distribution flattens and the inter-token relationships become much more visible.
```python
def soft_targets(logits, temperature=4.0):
    return F.softmax(logits / temperature, dim=-1)

# At T=1: [0.92, 0.04, 0.02, 0.01, 0.01, ...]  <- sharp, not much signal
# At T=4: [0.38, 0.21, 0.18, 0.12, 0.07, ...]  <- rich relational structure
```

Hinton’s original paper used a combined loss: distillation loss at high temperature to capture relational structure, plus standard cross-entropy at T=1 to maintain accuracy on the actual task.
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # F.kl_div expects log-probabilities as input and probabilities as target
    soft_loss = T**2 * F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The T**2 factor compensates for the gradient magnitude reduction that occurs at high temperature. This is a small but important detail that papers sometimes gloss over.
Scaling this to LLMs
For large language models, distillation scales across every token prediction. An autoregressive model generates text token by token. At each step, the teacher produces a distribution over the next token given the context. That is the signal.
For a response that is 500 tokens long, you are not extracting one soft label, you are extracting 500 of them, each conditioned on everything that came before. The cumulative information content is substantial. A single well-constructed exchange can carry more training signal than thousands of hard-labeled examples.
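In the white-box setting, where teacher logits are available, that per-token signal can be written down directly. A minimal NumPy sketch (shapes and random values invented for illustration) of a sequence-level soft-label loss:

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sequence_distillation_loss(student_logits, teacher_logits, T=2.0):
    # Shapes: (seq_len, vocab_size). Each row is one soft target,
    # conditioned on everything that came before.
    t = softmax(teacher_logits, T)
    s = softmax(student_logits, T)
    # Per-token KL(teacher || student), averaged over the sequence
    kl = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1)
    return T**2 * kl.mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(500, 1000))  # a 500-token response = 500 soft targets
student = rng.normal(size=(500, 1000))

print(sequence_distillation_loss(student, teacher))  # > 0: distributions differ
print(sequence_distillation_loss(teacher, teacher))  # ~ 0: student matches teacher
```

Driving this loss toward zero at every position is what it means, mechanically, for the student to absorb the teacher’s conditional distributions.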
Note
This is why chain-of-thought responses are particularly valuable for distillation attackers. A 2,000-token reasoning trace is not just a long answer. It is 2,000 sequential soft distributions, each revealing the model’s probabilistic understanding of what comes next in a reasoning chain. You are extracting the teacher’s internal deliberative process, step by step.
What the student actually learns
The critical insight, and the one that makes black-box API distillation so effective, is that you do not need to access the logits directly. You just need the generated tokens.
When you sample text from a model, you are sampling from its output distribution. A large corpus of model-generated text implicitly encodes that distribution. If you train a student model on a large enough set of teacher-generated responses, the student will converge toward the teacher’s distribution through maximum likelihood estimation, even though you never directly observed the soft labels.
```python
# Black-box API distillation - no logit access needed
responses = [api.complete(prompt) for prompt in large_prompt_set]

# the (prompt, response) pairs ARE your training dataset
student.train(large_prompt_set, responses)
```

This is why the attack is so hard to stop. Every API call is, in some sense, a training example. The question is only whether the attacker is systematic enough to collect and use them.
Why reasoning traces change the calculus
Models that expose chain-of-thought, like o1, Claude with extended thinking, and DeepSeek R1, present a qualitatively different distillation surface. The reasoning trace is not just useful because it is long. It is useful because it reveals the model’s implicit search procedure.
A model reasoning through a hard math problem is not just producing tokens. It is demonstrating a search strategy: how to decompose the problem, which directions to explore, when to backtrack, how to verify intermediate results. That search strategy is the product of RLHF and enormous compute. Training a student on those traces transfers the strategy, not just the answers.
Caveat
This is the safety-critical dimension. Safety mitigations in frontier models are typically implemented through RLHF and Constitutional AI processes that shape the model’s output distribution, specifically making certain outputs low probability. When you distill capabilities from a model, you can choose to include or exclude the reasoning patterns that produce safe behaviour. A deliberate adversary can extract the capability and discard the alignment.
Distillation in real systems
For years, distillation-as-attack existed mostly in academic papers. Researchers would demonstrate that you could approximate a model with surprising accuracy using only black-box access to its outputs. The industry largely nodded, noted the theoretical risk, and moved on. Then DeepSeek R1 dropped, and the conversation got very real, very fast.
What OpenAI said
In February 2026, OpenAI sent a memo to the U.S. House Select Committee on China with a fairly direct claim: DeepSeek has been systematically distilling OpenAI’s models. Not just passively querying an API and training on the outputs, but doing so with deliberate infrastructure designed to avoid detection. The memo described “new, obfuscated methods”: accounts associated with DeepSeek employees routing queries through third-party proxies to mask their origin, along with code purpose-built to extract model outputs programmatically at scale.
That last part is worth sitting with. This was not casual misuse. It was an engineered pipeline. Someone wrote tooling specifically to automate the extraction and clean up the data for training. That is not a researcher poking around. That is an operation. Rest of World has a good breakdown of the timeline and context.
What Anthropic found
OpenAI’s claims were pointed but somewhat circumstantial, inferring distillation from model behavior and account activity. Anthropic went further, publishing a detailed account of what they found after their own internal investigation.
Important
Anthropic identified three Chinese AI labs running coordinated extraction campaigns against Claude: DeepSeek, Moonshot AI, and MiniMax. Combined, they created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude. MiniMax alone accounted for more than 13 million of those exchanges.
The campaigns were not random. They specifically targeted Claude’s most differentiated capabilities: agentic reasoning, multi-step tool use, and complex coding tasks. These are exactly the areas where Claude’s training has produced something distinctive, and exactly the areas you would want to harvest if you were trying to close a capability gap quickly.
CNBC covered both OpenAI and Anthropic’s allegations together, and VentureBeat has the more technical breakdown of how the account fraud and extraction pipeline worked.
What Anthropic noted, and this is important, is that the extracted capabilities, once embedded in a lab’s own model, come without the safety mitigations. You can distill the reasoning ability and leave behind the refusals. That is not a theoretical concern. That is a design choice an adversary can make deliberately.
The broader pattern
It would be a mistake to treat this as a DeepSeek-specific story. The Register’s analysis frames it well: this is now a standard playbook. Any lab with a capable closed model and an accessible API is a potential target. The competitive pressure is intense, the cost of running extraction pipelines is low, and the upside, months or years of capability development compressed into a single training run, is enormous.
Smaller labs without the detection infrastructure of OpenAI or Anthropic almost certainly have no idea this is happening to them.
The other side of the story
Here is where it gets genuinely strange. While Anthropic was publishing its findings about industrial-scale distillation attacks on Claude, a separate observation was quietly making the rounds online.
When Claude Sonnet 4.6 was asked in Chinese, “你是什么模型?” (What model are you?), it confidently replied: “我是 DeepSeek。” (I am DeepSeek.)
Remark
The same model whose company had just accused DeepSeek of running 16 million extraction queries against it was, under certain prompting conditions, identifying itself as DeepSeek. The irony is difficult to overstate.
There are a few ways to interpret this. The charitable reading is that it is a prompt injection artifact or a quirk in how system prompts interact with language-specific behaviour, not evidence of actual distillation in the other direction. Models can be induced to claim false identities through carefully constructed prompts, and Chinese-language queries may interact differently with the model’s instruction-following behaviour than English ones.
The less charitable reading is that if Claude was trained on a large corpus of Chinese-language text that included many DeepSeek interactions, the model may have learned to associate certain Chinese-language identity queries with DeepSeek responses. That would be a form of data contamination rather than intentional distillation, but the line between the two is blurrier than it seems.
Either way, the observation matters. It illustrates that distillation, contamination, and behavioural leakage are messy, bidirectional, and do not respect the clean narratives that corporate press releases tend to construct. The labs accusing each other of theft are operating in the same ecosystem, training on overlapping data distributions, and influencing each other in ways that are difficult to fully audit.
Pointing fingers is easy. Drawing clean causal lines is much harder.
Known attack patterns
Distillation attacks have matured significantly over the last few years. These are the variants worth understanding:
Direct output harvesting is the simplest form. Query the API at scale with diverse prompts, collect the responses, use them as training data. Effective but noisy, and increasingly detectable via rate limiting and anomaly detection.
Chain-of-thought extraction specifically targets the reasoning trace. Models like o1, Claude, and DeepSeek R1 expose (or partially expose) their internal reasoning steps. That scratchpad is extraordinarily valuable training signal. It is not just the answer, it is the process. Training a student on CoT outputs produces a qualitatively different result than training on final answers alone.
Capability-targeted extraction is what Anthropic described with the Chinese lab campaigns. Rather than querying randomly, you identify the specific capability surface you want to acquire, coding, tool use, instruction following at edge cases, and design a prompt distribution that maximally stresses those capabilities. You are not trying to clone the full model; you are trying to acquire specific skills.
Synthetic amplification is the most sophisticated layer. You run an initial extraction, use the extracted model to generate more diverse prompts, feed those back to the teacher, collect more responses, and iterate. Each cycle expands coverage. This is how you can build a rich training dataset without needing a massive initial prompt bank.
Proxy routing is the operational security layer that OpenAI specifically called out. Rather than querying directly from your own infrastructure, which is trivially detectable via IP clustering, you route through third-party providers, residential proxies, or aggregator APIs that make the traffic look like diffuse, unrelated end-user activity.
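The synthetic amplification pattern above is essentially a loop. A schematic sketch, where `TeacherAPI`, `Student`, and `generate_variants` are toy stand-ins invented for illustration (a real pipeline would wrap an API client and a training job):

```python
# Toy stand-ins so the loop structure is runnable; all names hypothetical.
class TeacherAPI:
    def complete(self, prompt):
        return f"answer to: {prompt}"

class Student:
    def __init__(self):
        self.data = []
    def train(self, pairs):
        self.data = list(pairs)
    def generate_variants(self, prompts):
        # A real attacker would sample new prompts from the student model;
        # here we just mutate the strings to show the feedback loop.
        return [p + " (variant)" for p in prompts]

def amplify(api, student, seed_prompts, n_rounds=3):
    prompts, dataset = list(seed_prompts), []
    for _ in range(n_rounds):
        responses = [api.complete(p) for p in prompts]  # harvest the teacher
        dataset.extend(zip(prompts, responses))
        student.train(dataset)                          # refresh the student
        prompts = student.generate_variants(prompts)    # expand coverage
    return dataset

data = amplify(TeacherAPI(), Student(), ["p1", "p2"])
print(len(data))  # 3 rounds x 2 prompts = 6 (prompt, response) pairs
```

The important structural point is the feedback edge: each round’s student broadens the next round’s prompt distribution, which is why coverage grows without a large initial prompt bank.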
Why this is genuinely hard to defend against
Here is the uncomfortable truth: the same property that makes large language models useful, that you can query them openly and they respond helpfully to a vast range of inputs, is the exact property that makes them extractable. These two things are not in tension by accident. They are the same thing.
Rate limiting is table stakes, but it does not solve the problem. If you budget carefully, you can extract a remarkable amount of signal while staying within the limits of what looks like normal API usage. 24,000 accounts is a lot, but spread over months and across different regions and providers, the per-account traffic is unremarkable.
Watermarking outputs is an active research area, but it is deeply tricky. Statistical watermarks that survive paraphrasing are hard to design, and a determined adversary who trains on your outputs and then fine-tunes further will wash out most signals. There is no reliably robust watermarking scheme for LLM outputs yet that survives adversarial stripping.
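To see why watermark detection is statistical at heart, here is a toy version of the “green-list” idea from the watermarking literature: partition the vocabulary based on the previous token, bias generation toward the green half, and detect by counting. Everything here (vocabulary size, hash choice, the always-green sampler) is deliberately simplified:

```python
import hashlib
import random

def green_set(prev_token, vocab_size, fraction=0.5):
    # Deterministically partition the vocabulary using the previous token
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(vocab_size * fraction)])

def green_fraction(tokens, vocab_size):
    # Detector: how often a token falls in its predecessor's green set.
    # ~0.5 for unwatermarked text, pushed higher by the watermark.
    hits = sum(
        1 for prev, tok in zip(tokens, tokens[1:])
        if tok in green_set(prev, vocab_size)
    )
    return hits / (len(tokens) - 1)

V = 1000
rng = random.Random(0)

plain = [rng.randrange(V) for _ in range(500)]  # unwatermarked: uniform tokens

marked = [rng.randrange(V)]  # watermarked: always sample from the green set
for _ in range(499):
    marked.append(rng.choice(sorted(green_set(marked[-1], V))))

print(green_fraction(plain, V))   # ~ 0.5
print(green_fraction(marked, V))  # ~ 1.0
```

The attack surface is visible even in the toy: any paraphrase or fine-tune that re-samples tokens drags the green fraction back toward 0.5, which is exactly the adversarial-stripping problem described above.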
Behavioral fingerprinting, detecting anomalous query patterns that suggest systematic extraction, is probably the most practical near-term defense, and it appears to be what Anthropic’s detection work relies on. But it requires you to invest in the detection infrastructure in the first place, and it is always reactive: you detect a campaign that has already been running.
Caveat
The hardest challenge is the false positive problem. A research lab running systematic capability evaluations on your model looks almost identical to an extraction campaign. A developer stress-testing prompts for an integration looks similar too. Any aggressive detection policy will hit legitimate users. There is no clean threshold.
There is also a deeper, structural problem: the information-theoretic position of a defender is weak. An attacker can query indefinitely, from any angle, over any timeframe. A defender has to identify the attack in progress, and every restriction they put in place also degrades the service for everyone else.
Possible mitigations
Note
None of these are complete solutions. They represent the current state of practice, not a resolved problem. Anyone claiming otherwise is selling something.
Tiered access with behavioral monitoring. High-volume API access should require more verification and face more scrutiny. Not just rate limits, but statistical profiling of what kinds of queries an account generates. An account that exclusively queries capability-stressing edge cases with no natural distribution around it is suspicious.
Output degradation under extraction conditions. There is interesting (and controversial) research into models that detect when they are being systematically queried and subtly degrade their outputs in ways that are hard to distinguish from normal variation. The ethics of this are complicated: you are providing a worse service, deliberately. But it is an active area of thought.
API response attribution and legal frameworks. This is the angle OpenAI is pursuing. Terms of service violations are at least documentable, and if model outputs carry attribution metadata at the infrastructure level, extraction can in theory be traced back. The practical enforceability across jurisdictions is limited, but it creates legal exposure for actors operating in accountable markets.
Selective capability exposure. Not all capabilities need to be equally accessible through the public API. The highest-value, most-differentiating capabilities could be gated behind stricter access controls or simply not exposed through the API in the same way they are available through the main product. This is painful for developers but limits the surface area of what is extractable at scale.
Multi-party detection sharing. This is early but important. If multiple labs are being targeted by the same actors using similar infrastructure, sharing detection signals, the proxy patterns, the account creation fingerprints, the query distribution signatures, would let the industry respond faster. Nobody wants to build a collective defense consortium, but the alternative is each lab independently reinventing detection.
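As a toy illustration of what the statistical profiling in the first mitigation implies, you might compare an account’s query-category mix against the population baseline. The categories, baseline numbers, and threshold semantics here are all invented for the example:

```python
import math
from collections import Counter

def query_skew(account_queries, baseline_probs):
    # KL divergence between an account's query-category mix and the
    # population baseline. High values mean queries concentrated on a
    # few categories - one (weak) extraction signal among many.
    counts = Counter(account_queries)
    total = sum(counts.values())
    kl = 0.0
    for cat, p_base in baseline_probs.items():
        p_acc = counts.get(cat, 0) / total
        if p_acc > 0:
            kl += p_acc * math.log(p_acc / p_base)
    return kl

baseline = {"chat": 0.5, "coding": 0.2, "tools": 0.1, "other": 0.2}
normal  = ["chat"] * 50 + ["coding"] * 20 + ["tools"] * 10 + ["other"] * 20
harvest = ["coding"] * 60 + ["tools"] * 40  # capability-stressing traffic only

print(query_skew(normal, baseline))   # ~ 0: mirrors the baseline
print(query_skew(harvest, baseline))  # clearly larger: concentrated traffic
```

In practice a single score like this is far too coarse on its own; it is one feature in a larger profile, and, as the false positive caveat above notes, a capability-evaluation lab would score high on it too.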
Closing thoughts
The distillation problem does not have a clean resolution, and I think it is worth being honest about that. The technical gap between a powerful proprietary model and a capable open model is narrowing, and a meaningful fraction of that narrowing is happening through extraction rather than independent research. That is the reality regardless of how any individual case turns out legally or politically.
What I find genuinely concerning is not the DeepSeek story specifically. It is the degree to which the incentive structure favors attackers. The cost of running an extraction campaign is low and falling. The upside is enormous: months of capability research, potentially billions in compute, compressed into a training run. The chance of meaningful legal consequence is, frankly, low. And the technical defenses are all partial.
The other uncomfortable thread here is the safety angle. When Anthropic says the extracted capabilities come without the mitigations, that you can distill the reasoning and leave behind the refusals, that is not just a competitive concern. That is a safety architecture concern. We are spending significant effort aligning frontier models, and that alignment becomes less meaningful if the capabilities can be extracted cleanly and the mitigations stripped out by design.
I do not think open APIs are the mistake. The value they provide is real and important. But I think the industry has been naive about the degree to which open access to model outputs is open access to the model. The weights are locked away. The behavior is not.
And it turns out the behavior was the interesting part all along.