AI vs human: who writes better Wikipedia?
The question sounds like clickbait. It's not. We've spent months building a game where players try to tell AI-written Wikipedia summaries apart from real ones, and the process has taught us a lot about where each side actually excels. The answer is more complicated than "humans win" or "AI is catching up." They're playing different games entirely.
We decided to break this down properly. Six criteria, honest scores, real examples. No cheerleading for either side.
The criteria
We picked six dimensions that matter for encyclopedia writing: accuracy, readability, neutrality, sourcing, coverage breadth, and consistency. These aren't arbitrary. They map directly onto Wikipedia's own content policies and the things editors fight about on talk pages.
For each criterion, we based our scores on our own experience generating hundreds of thousands of AI summaries for Bluffpedia, on conversations with Wikipedia editors, and on published research on AI-generated text (particularly the 2023 study from Stanford and the 2024 work by Shumailov et al. on model collapse).
Accuracy
This is where things get uncomfortable for AI boosters. Large language models do not have access to ground truth. They predict plausible next tokens based on training data. Sometimes that produces accurate statements. Sometimes it produces confident nonsense.
We've seen our AI generation pipeline produce summaries claiming that a real city in Germany was founded in 1847 when it was actually founded in 1247. The number looked right. It fit the context. It was wrong by exactly 600 years. A human Wikipedia editor would have checked the source. The AI had no source to check.
That said, for well-documented topics where the training data contains abundant correct information, AI can be remarkably accurate about basic facts. The population of Tokyo, the year the Eiffel Tower was built, the chemical formula for water. These are facts repeated so frequently in training data that the model gets them right essentially every time.
The problem is the long tail. For obscure historical events, small towns, niche scientific topics, the AI's accuracy drops off a cliff. And it drops off silently. There's no confidence interval attached to a generated sentence. The AI sounds just as sure about the founding date of a village in rural Poland as it does about the founding date of New York City.
Score: AI 5/10, Humans 9/10. Humans make mistakes too, but they can cite their sources and correct errors when challenged.
Readability
Here's where AI has a genuine edge. Human-written Wikipedia articles are famously uneven. One paragraph might be elegant and well-structured. The next might be a dense thicket of jargon contributed by a domain expert who never learned to write for a general audience.
This unevenness happens because Wikipedia articles are written by dozens or hundreds of people over years. Nobody is responsible for the overall flow. Sections get added, rewritten, expanded, condensed. The result is often functional but not exactly pleasant to read.
AI-generated text, by contrast, has a uniform voice. Sentence lengths vary but not wildly. Transitions between ideas are smooth. Technical terms get brief explanations. It reads the way a single competent writer would write.
We've noticed this effect clearly in Bluffpedia. When we show players four summaries (one real, three AI-generated), the real one is sometimes the worst-written of the four. Not because the facts are wrong, but because the prose is clunkier. Multiple players have told us they've started using "writes too well" as a signal that something is fake.
A telling pattern from our data
In rounds where the real Wikipedia summary was about a technical or scientific topic, players picked the AI-generated summary as "real" 41% of the time. The AI's smooth, accessible writing style actually made it seem more authoritative than the genuine article.
Score: AI 8/10, Humans 6/10. AI wins on prose quality, but that smoothness is itself a tell.
Neutrality
Wikipedia's Neutral Point of View (NPOV) policy is one of its hardest rules to follow. Humans have opinions. They have biases. They care about their subjects. Wikipedia editors constantly argue about whether a particular phrasing is neutral enough, and these arguments fill thousands of talk page archives.
AI models trained on internet text have absorbed biases too, but they manifest differently. AI-generated text tends toward a bland, positive-leaning neutrality that's actually pretty close to what Wikipedia aims for. It rarely takes strong positions. It hedges. It uses passive voice.
But there's a catch. AI neutrality is shallow. When a topic is genuinely controversial (the Israeli-Palestinian conflict, climate change policy, contested historical events), AI either produces something so bland it's useless, or it picks up the dominant framing from its training data without recognizing that the framing itself is contested.
Human editors handle controversy better because they argue about it. The Wikipedia article on the Armenian genocide, for example, went through years of contentious editing before reaching a version that most parties found tolerable. That process is messy and sometimes ugly, but it produces something that AI can't: a neutrality that has been tested by people who actually disagree.
Score: AI 6/10, Humans 7/10. AI is neutral by default; humans achieve neutrality through effort. The hard-won version is more robust.
Sourcing
This one is straightforward. AI cannot source its claims. It generates text that looks like it could be sourced, but any actual citations have to be added by a human who verifies them.
Wikipedia's verifiability policy (WP:V) says that all material challenged or likely to be challenged must be attributed to a reliable, published source. This is the backbone of the entire project. Without it, Wikipedia would just be a collection of plausible-sounding claims. Which is, coincidentally, exactly what AI-generated text is.
Some researchers have experimented with retrieval-augmented generation (RAG) systems that pull real sources and attach them to AI-generated claims. The results are better than pure generation but still not reliable. The AI sometimes attaches a real source that doesn't actually support the claim it's been paired with. Or it cites a source that says the opposite of what the generated text claims. Hallucinated sourcing might actually be worse than no sourcing, because it creates a false sense of verification.
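The failure mode described above, a real source attached to a claim it doesn't support, is why RAG pipelines need a separate verification step. Here is a deliberately crude sketch of that gap: the word-overlap heuristic below is a stand-in for the entailment models real systems use, and all the strings are invented for illustration.

```python
# Toy illustration of the RAG verification gap: a retrieved source can be
# attached to a claim without actually supporting it. The overlap score is
# a crude stand-in for a real entailment check; the data is made up.

def support_score(claim: str, source_text: str) -> float:
    """Fraction of the claim's words that also appear in the source."""
    claim_words = set(claim.lower().split())
    source_words = set(source_text.lower().split())
    if not claim_words:
        return 0.0
    return len(claim_words & source_words) / len(claim_words)

claim = "the town was founded in 1247"
good_source = "records show the town was founded in 1247 by monks"
bad_source = "the region is known for its annual cheese festival"

print(round(support_score(claim, good_source), 2))  # high overlap
print(round(support_score(claim, bad_source), 2))   # low overlap
```

A real pipeline would replace the overlap score with a natural-language-inference model, but the shape of the check is the same: the claim-source pairing has to be validated, not assumed.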
Score: AI 2/10, Humans 9/10. This is AI's biggest weakness for encyclopedia writing.
Coverage breadth
Now we get to something AI is legitimately good at. Wikipedia has over 44 million articles across 300+ languages, but coverage is wildly uneven. English Wikipedia has detailed articles on every Marvel movie but stub-length entries for many important historical figures from Africa and Asia. Articles about women, non-Western topics, and subjects outside of popular culture are systematically shorter and less developed.
AI can generate reasonable first-draft content for topics that Wikipedia currently covers poorly. A stub article about a small city in Indonesia that has three sentences could be expanded into a useful overview by an AI that has absorbed information about that city from travel guides, government databases, and news articles in its training data.
We've seen this play out in Bluffpedia. When the game selects a Wikipedia article that's a stub, the AI-generated alternatives are sometimes more informative than the real thing. Players occasionally choose the real summary specifically because it seems suspiciously short and thin, reasoning that Wikipedia wouldn't publish something so minimal. They're wrong. Wikipedia publishes stubs all the time.
Score: AI 7/10, Humans 5/10. Humans create deeper content but leave huge gaps. AI can fill gaps with decent first drafts.
Consistency
This is related to readability but distinct. Consistency means following the same formatting conventions, using the same terminology, applying the same structural patterns across articles.
Wikipedia has a massive Manual of Style (MoS), and editors argue about it constantly. Should dates use "March 15" or "15 March"? Should article titles use sentence case? Should disambiguation pages use bullet points or prose? These fights are real and ongoing, and the result is that Wikipedia's actual consistency is... okay. Not great. Different articles follow different conventions depending on which editors worked on them and when.
AI trained on Wikipedia has absorbed a kind of averaged version of the Manual of Style. It produces text that's more internally consistent than a random sample of human-written articles. Every AI-generated summary we produce follows roughly the same structural conventions, because the model has learned a single generalized Wikipedia style rather than the dozens of micro-styles that exist across the real encyclopedia.
Score: AI 8/10, Humans 5/10. AI wins here fairly convincingly.
The full comparison
| Criterion | AI score | Human score | Notes |
|---|---|---|---|
| Accuracy | 5/10 | 9/10 | AI hallucinates; humans verify |
| Readability | 8/10 | 6/10 | AI prose is smoother but suspiciously even |
| Neutrality | 6/10 | 7/10 | Human neutrality is battle-tested |
| Sourcing | 2/10 | 9/10 | AI's biggest weakness, period |
| Coverage breadth | 7/10 | 5/10 | AI fills gaps humans haven't reached |
| Consistency | 8/10 | 5/10 | AI follows one averaged style |
Total scores: AI 36/60, Humans 41/60. Humans win, but not by the landslide you might expect.
Visualizing the gap
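The scorecard above can be rendered as a plain-text bar chart with a few lines of Python. The numbers are copied straight from the comparison table; nothing else is assumed.

```python
# Render the comparison table as a plain-text bar chart.
# Scores are taken directly from the scorecard above.

scores = {
    "Accuracy":    (5, 9),
    "Readability": (8, 6),
    "Neutrality":  (6, 7),
    "Sourcing":    (2, 9),
    "Coverage":    (7, 5),
    "Consistency": (8, 5),
}

for criterion, (ai, human) in scores.items():
    print(f"{criterion:<12} AI    {'#' * ai:<10} {ai}/10")
    print(f"{'':<12} Human {'#' * human:<10} {human}/10")

ai_total = sum(ai for ai, _ in scores.values())
human_total = sum(human for _, human in scores.values())
print(f"\nTotals: AI {ai_total}/60, Humans {human_total}/60")
```

Even in ASCII, the pattern is obvious: AI's bars collapse at Sourcing, and the human bars sag at Readability, Coverage, and Consistency.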
What this actually means
The scorecard reveals something that should shape how we think about AI in knowledge production. AI is good at the mechanical parts of writing: producing clean prose, maintaining consistency, generating content at scale. Humans are good at the epistemic parts: knowing what's true, finding sources, handling nuance and controversy.
These aren't the same skill. And they don't overlap much. An AI that writes beautifully but can't verify its claims is fundamentally different from a human who writes awkwardly but knows where the facts come from.
The most interesting scenario is collaboration. An AI that generates a well-structured first draft, followed by human editors who verify claims, add sources, and handle the parts that require actual knowledge of the world. Wikipedia's community has been cautiously exploring this approach, and the early results suggest it works better than either side working alone.
Where AI genuinely wins
We should be honest about the areas where AI does better, because pretending it doesn't helps nobody.
Stub articles. Wikipedia has millions of articles that are just a sentence or two. An AI can produce a reasonable 200-word overview faster than a human volunteer can. For the tens of thousands of small towns, minor historical figures, and obscure species that currently have minimal Wikipedia coverage, AI-generated first drafts would be a real improvement.
Grammar and style. Non-native English speakers contribute enormously to English Wikipedia, and their contributions are valuable. But sometimes the prose needs cleanup. AI is excellent at producing grammatically correct, stylistically consistent text.
Boilerplate sections. Many Wikipedia articles have standardized sections (geography, demographics, climate, transportation for city articles; taxonomy, distribution, behavior for species articles). AI can generate these from structured data more efficiently than humans can write them from scratch.
Where humans clearly win
Detecting nonsense. An AI might write that a composer "studied under Johann Sebastian Bach at the Paris Conservatory in 1920." A human immediately recognizes that Bach died in 1750 and the Paris Conservatory didn't exist until 1795. This kind of cross-referencing requires actual world knowledge, not pattern matching.
Handling controversy. The Wikipedia article on homeopathy has been the subject of edit wars for over a decade. The current version reflects a carefully negotiated consensus that describes the practice while clearly stating that scientific evidence does not support it. No AI could have navigated that process.
Editorial judgment. Which facts are important enough to include? What level of detail is appropriate? When does an article need splitting into sub-articles? These are judgment calls that require understanding the reader, the subject, and Wikipedia's role in the broader information ecosystem.
Source evaluation. Not all sources are equal. A newspaper of record is more reliable than a blog post. A peer-reviewed study carries more weight than an opinion piece. Humans can evaluate source quality. A plain language model can't, because it never interacts with the sources at all.
What Bluffpedia teaches about this
Every round of Bluffpedia is a micro-experiment in the AI-vs-human question. When you play, you're evaluating exactly the criteria we discussed: does this text sound accurate? Is it well-written? Does it feel like something a human would produce, or something a machine assembled?
The players who get good at the game develop an intuition for the specific ways AI text differs from human text. They notice the smooth prose that's a little too smooth. They catch the plausible claim that's a little too convenient. They recognize the absence of sourcing cues and the presence of hedging language where a real article would commit to a specific fact.
That intuition is worth developing. We live in a world where AI-generated text is everywhere, and it's not always labeled. The ability to read critically, to ask "does this feel like it came from someone who actually knows this subject?" is a skill that transfers far beyond a trivia game.
The answer to "who writes better Wikipedia?" depends entirely on what you mean by "better." AI writes cleaner. Humans write truer. For now, we need both.