Explainer · Jan 22, 2026 · 7 min read

How AI Detection Works (And Why It's Not Perfect)

The science behind perplexity, burstiness, and pattern recognition — and why every AI detector has a meaningful false positive rate.

AI detectors have become a fixture in classrooms, newsrooms, and hiring processes. But most people using them — and most people being evaluated by them — don't understand how they actually work. That gap matters, because the technology has real limitations that affect how reliable these tools are in practice.

The Core Concept: Perplexity

Every AI detector is built on a concept called perplexity. Perplexity measures how surprised a language model is by a piece of text — specifically, how well it could have predicted each word given the words before it.

When ChatGPT or Claude generates text, it samples heavily from the most statistically likely next words. The result is text with very low perplexity: it reads close to what a language model itself would predict. Human writing, by contrast, includes unexpected word choices, idiosyncratic phrasing, and stylistic decisions that models wouldn't have predicted. Human text has higher perplexity.

📊 In simple terms: low perplexity = AI-like. High perplexity = human-like. Detectors score your text based on where it falls on this spectrum.
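To make the idea concrete, here's a toy sketch of perplexity using a smoothed bigram model. Real detectors use large neural language models; the tiny corpus, whitespace tokenization, and add-alpha smoothing below are all invented for illustration, not how any production detector works:

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """Count unigram and bigram frequencies from a token list."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    """Perplexity = exp(-mean log P(w_i | w_{i-1})), with add-alpha smoothing."""
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = ("the model predicts the next word and the model scores the text "
          "the model predicts the next word again").split()
unigrams, bigrams = train_bigram(corpus)
vocab = len(unigrams)

# A sequence the model has seen patterns for vs. a scrambled one
predictable = "the model predicts the next word".split()
surprising = "again scores text word the predicts".split()

print(perplexity(predictable, unigrams, bigrams, vocab))  # low: "AI-like"
print(perplexity(surprising, unigrams, bigrams, vocab))   # high: "human-like"
```

The predictable sequence scores several times lower than the scrambled one, which is exactly the gap detectors exploit at much larger scale.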

The Second Signal: Burstiness

Perplexity alone isn't enough to distinguish AI from human writing. That's where burstiness comes in.

Burstiness measures variation in sentence length and complexity. Human writers naturally produce "bursty" text — a long, complex sentence followed by a short one. Then another short one. Then something sprawling and clause-heavy that runs on for what feels like too long before finally landing.

AI models tend to produce low-burstiness text. Sentence lengths stay suspiciously consistent. Each paragraph has a similar structure. The rhythm is even — which is exactly the kind of evenness that no human writer sustains naturally over thousands of words.
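Burstiness is even easier to sketch: one common proxy is simply the spread of sentence lengths. The splitting logic and sample texts below are illustrative, not any detector's actual implementation:

```python
import statistics

def burstiness(text):
    """Standard deviation of sentence lengths in words: higher = burstier."""
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.stdev(lengths) if len(lengths) > 1 else 0.0

human = ("I wrote this quickly. Short. Then I added a long, winding sentence "
         "that keeps going and adds clause after clause before it stops. Done.")
ai = ("The model writes evenly here. Each sentence has a similar length. "
      "The rhythm stays quite consistent. Every line feels the same size.")

print(burstiness(human))  # large spread of sentence lengths
print(burstiness(ai))     # small spread of sentence lengths
```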

How Detectors Combine These Signals

Modern detectors like GPTZero, Originality.ai, and Sapling don't rely on a single metric. They combine perplexity and burstiness, often with additional features:

  • Sentence-level scoring — individual sentences are scored and highlighted so reviewers can see which specific passages look AI-generated
  • Model fingerprinting — detectors trained on GPT-4 output learn specific stylistic patterns associated with that model
  • Vocabulary distribution — AI models favor certain words and phrases at rates that differ from human writing
  • Coherence patterns — AI text tends to be very logically organized, which is actually a statistical tell
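One way to picture the combination step is a logistic function over the individual signals. The weights, bias, and input values below are entirely hypothetical (real detectors are trained classifiers with many more features); the sketch only shows the direction of the relationship, where low perplexity and low burstiness both push the score toward "AI":

```python
import math

def ai_score(perplexity, burstiness, w_p=-0.8, w_b=-0.5, bias=6.0):
    """Hypothetical logistic combiner: returns a value in (0, 1), where
    higher means 'more AI-like'. Weights are illustrative, not taken
    from any real detector."""
    z = bias + w_p * perplexity + w_b * burstiness
    return 1.0 / (1.0 + math.exp(-z))

# Low perplexity and a flat rhythm -> score near 1 ("AI")
print(ai_score(perplexity=3.0, burstiness=1.0))
# High perplexity and a bursty rhythm -> score near 0 ("human")
print(ai_score(perplexity=12.0, burstiness=8.0))
```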

The False Positive Problem

Here's where things get complicated. AI detectors have a meaningful false positive rate — they flag human-written text as AI-generated. This has been documented extensively in academic research.

26% — false positive rate observed for non-native English speakers in a 2023 Stanford study on GPTZero

The false positive rate is especially high for:

  • Non-native English speakers — who tend to write in simpler, more predictable patterns that resemble AI output
  • Technical writing — scientific papers, legal documents, and manuals use formal, repetitive structures that score as AI-like
  • Heavily edited prose — polished writing with consistent style can score as AI-generated even when it's entirely human
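The "simpler, more predictable patterns" effect can be illustrated with a crude vocabulary-diversity proxy: the ratio of unique words to total words. This is not how real detectors measure vocabulary distribution, and the sample texts are made up, but it shows how repetitive, formulaic prose scores as less varied even when it's entirely human:

```python
def type_token_ratio(text):
    """Vocabulary diversity proxy: unique words / total words.
    Repetitive prose scores lower -- one reason formulaic human
    writing can be mistaken for AI output."""
    words = text.lower().split()
    return len(set(words)) / len(words)

simple = ("the report is good the report is clear the report is done "
          "the work is good the work is clear")
varied = ("honestly, drafting that report felt endless; every revision "
          "surfaced new contradictions nobody had anticipated")

print(type_token_ratio(simple))  # low diversity: flagged as "AI-like"
print(type_token_ratio(varied))  # high diversity: reads as "human"
```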

Why Detectors Are Getting Better — And Worse

Getting better

Detectors are trained on continuously updated datasets of AI-generated text. As models improve, so does the training data for detectors. Some newer detectors can identify text from specific model versions with moderate accuracy.

Getting worse

The fundamental problem is unsolvable: as AI models improve, their output becomes more human-like — which means it becomes harder to distinguish statistically. GPT-4 output is harder to detect than GPT-3 output. Whatever comes next will be harder still.

There's also an arms race dynamic. As humanization tools improve and users learn to modify AI output, detectors face an ever-harder classification problem.

What This Means in Practice

AI detection scores should be treated as probabilistic signals, not verdicts. A 90% AI score means the text is statistically unusual in ways that correlate with AI generation — it doesn't mean the text was definitely written by an AI.
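Bayes' rule makes the "signal, not verdict" point concrete. The detector rates and base rate below are hypothetical round numbers chosen for illustration, not measured figures from any study or product:

```python
def p_ai_given_flag(base_rate, tpr, fpr):
    """Bayes' rule: probability that a flagged text is actually
    AI-generated, given the fraction of texts that are AI (base_rate),
    the detector's true positive rate (tpr), and its false positive
    rate (fpr)."""
    flagged = tpr * base_rate + fpr * (1 - base_rate)
    return tpr * base_rate / flagged

# Hypothetical: a 90%-accurate detector with 10% false positives,
# in a pool where only 1 in 10 submissions is actually AI-written.
print(p_ai_given_flag(base_rate=0.10, tpr=0.90, fpr=0.10))  # → 0.5
```

Under those assumed numbers, half of all flagged texts are human-written: even a strong detector produces mostly-wrong accusations when genuine AI use is rare.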

For anyone using AI detection as part of a high-stakes decision (academic integrity, journalism, hiring), the responsible approach is to treat scores as one input among many — not as definitive proof of anything.

For writers using AI tools legitimately, the takeaway is that detection is beatable — not because the tools are bad, but because the underlying statistical problem is genuinely hard. Humanization tools that specifically target perplexity and burstiness are effective precisely because they address the actual signals these detectors rely on.
