22,121 claims. 5 providers (8 model variants). 85% of findings surfaced by only one model. And the pattern is the same whether you ask about Tesla, Iran, Claude, or fantasy novels.
I took 90 real research queries and ran each one through 5 AI providers (8 model variants) simultaneously: Perplexity, Gemini (+ Gemini Lite), OpenAI (+ OpenAI Mini), Grok (+ Grok Premium), and Anthropic.
Then I extracted every factual claim from every provider's report, deduplicated them using embedding-based clustering with LLM reconciliation, and asked one question: how many claims does each provider find that no other provider finds? (A "claim" is an atomic, verifiable assertion like "CrowdStrike reports an 89% increase in AI-enabled attacks" or "Google set an internal 2029 deadline for PQC migration.")
The reason this number is so high: half of the source domains each provider cites are exclusive to that provider. They're not just phrasing things differently. They're literally reading different websites. Source analysis below.
22,121 total claims. 18,896 found by only one model. 3,040 found by exactly two. 185 confirmed by three or more. Mean divergence: 84.6% (95% CI: 83.5-85.8%, n=90).
How I deduplicated: Claims were embedded (text-embedding-3-large), clustered by cosine similarity (≥ 0.78), then cross-provider pairs were verified by LLM reconciliation. Manual review of 10 random jobs found only 2 suspected false positives, both confirmed as genuinely different claims on inspection. Full methodology below.
This isn't a cherry-picked result. It's the median across 90 diverse research jobs — cybersecurity threat assessments, quantum computing timelines, stock analysis, fantasy reader marketing, urea fertilizer supply chains. And when I grouped those jobs by subject area, divergence varied by only 2.1 percentage points across six categories. The pattern is universal.
Before diving into the data, I want to address the obvious objection: maybe the dedup is just bad.
Five caveats: the first three could explain a high uniqueness rate without real divergence, and the last two limit how far the result generalizes.
1. **Extraction is too granular.** If the LLM splits "GDP grew 3%" into two claims ("GDP grew" and "growth was 3%"), uniqueness inflates. I checked: median claim length is 77 characters, and only 2% of claims are under 30 characters. Claims are atomic, but not trivially short.
2. **The clustering threshold is too strict.** At cosine similarity 0.78, "GPT-4 is the best coding model" and "GPT-4 is the top model for code" should merge. I tested sensitivity at lower thresholds: at 0.70, divergence drops to ~79%; at 0.65, to ~74%. The pattern holds across thresholds. The pipeline also uses a secondary token-overlap pass (Jaccard ≥ 0.40) and a third LLM reconciliation pass for cross-provider merges.
3. **Provider volume bias.** If one provider generates 3x more claims, its "unique" count inflates. I checked: unique rates are remarkably consistent across all providers (65-72%). No single provider dominates.
4. **Not all claims matter equally.** Some unique findings are minor details. But the examples below show that many are specific, verifiable, and decision-relevant: funding amounts, benchmark scores, timeline targets, market share figures.
5. **Sampling bias.** These 90 queries were run by one user on one platform. Topics skew toward AI, technology, finance, and geopolitics. A different query distribution (e.g., purely academic, medical, or legal topics) might produce different divergence rates. I'd expect the pattern to hold, but the specific numbers may vary.
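The threshold sensitivity check above can be sketched in a few lines. This is a toy reconstruction, not the production pipeline: `divergence_at`, the 2-dimensional embeddings, and the provider labels are all illustrative assumptions.

```python
import numpy as np

def divergence_at(embeddings, providers, threshold):
    """Cluster claims by cosine similarity at `threshold` (union-find),
    then return the fraction of clusters seen by exactly one provider."""
    n = len(providers)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Normalize rows so the dot product equals cosine similarity.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(providers[i])
    unique = sum(1 for provs in clusters.values() if len(provs) == 1)
    return unique / len(clusters)

# Toy data: two near-duplicate claims from different providers, one outlier.
emb = np.array([[1.0, 0.0], [0.71, 0.70], [0.0, -1.0]])
provs = ["openai", "perplexity", "grok"]
for t in (0.65, 0.78):
    print(t, round(divergence_at(emb, provs, t), 2))
```

Lowering the threshold merges the two near-duplicates into one cluster and halves the divergence, which is the direction of the ~84% → ~74% drop reported above.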
The explanation I believe is correct: Providers genuinely search different sources, use different retrieval strategies, and apply different editorial judgment about what to include. The divergence is real.
Every single research job I analyzed showed significant divergence. Not one had less than 68% unique findings. The median was 85.3%.
59% of research jobs fell in the 80-90% divergence range. The interquartile range was tight: 80.5% to 88.9%. This is a consistent pattern, not an outlier.
The first thing any skeptic will ask: maybe divergence is just a quirk of the topics I chose. Maybe finance queries diverge because the market is noisy. Maybe cybersecurity queries diverge because sources are fragmented. Maybe if I'd stuck to "safe" academic questions, the numbers would collapse.
So I grouped all 78 qualifying jobs (removing arithmetic test queries like "what is 1+1?" that distort the dedup pipeline) into six subject areas and computed mean divergence per category with 95% confidence intervals.
The total spread across all six categories is 2.1 percentage points. Culture & Media sits at 85.3%. Security & Geopolitics sits at 83.2%. Everything else is in between. Every single 95% confidence interval overlaps with every other one — there is no statistically significant difference between topics.
This is the null result I didn't expect, and it's stronger than any single-topic finding could be. Whether you're researching Tesla valuations, Iran war scenarios, LLM inference economics, fantasy reader marketing, or urea fertilizer supply chains — you lose roughly 84% of findings using a single provider, give or take a couple of points.
A few implications:
If anything, this makes the single-provider blindspot more of a structural problem, not less. It isn't that one topic is noisy. It's that every provider's source fingerprint is partial — everywhere, consistently, independent of subject.
Even the best-performing provider (Grok Premium) misses nearly three-quarters of all findings. The worst (Gemini) misses 92%.
The unique rate is remarkably consistent across providers (65-72%). No single provider dominates. They each find different things because they search different sources, use different retrieval strategies, and apply different synthesis heuristics.
Only 14% of factual claims are corroborated by more than one AI. The vast majority of what any provider tells you is information no other provider surfaces.
80% divergence. 234 claims. 8 model variants across 5 providers. View the full report →
When Anthropic announced Claude Mythos Preview with restricted access, I ran the story through all 5 providers (8 model variants). Each found different angles:
| Provider | Unique finding no other provider surfaced |
|---|---|
| Perplexity | Data accidentally leaked from Anthropic's unsecured CMS on March 26 |
| Grok | Mythos scored 100% on Cybench, 83% on CyberGym; Glasswing gave $2.5M to OpenSSF, $1.5M to Apache |
| OpenAI | Anthropic rolled out to 40+ organizations for defensive use only |
| Anthropic | Experts warned: behind Mythos is the next OpenAI model, then Gemini, then open-source Chinese models |
| Gemini | OpenAI pledged $10M in API credits to vetted defenders via Cybersecurity Grant Program |
| Grok Premium | OpenAI launched GPT-5.3-Codex in February 2026 |
Each provider told a fundamentally different version of the same story.
The irony: a story about AI providers restricting information, where 80% of the specific details were unique to whoever found them.
80.5% divergence. 195 claims. 4 providers. View the full report →
The timeline question matters: when could quantum computers break Bitcoin's encryption?
| Provider | Claim | Key detail |
|---|---|---|
| OpenAI | 2035 | U.S. White House target for quantum-safe migration |
| Grok Premium | 2029 | Google's internal deadline — 6 years earlier than the government |
| Perplexity | Not yet feasible | Google hasn't demonstrated >100 qubits in commercial deployment |
| OpenAI | ~6.7M BTC at risk | ~35% of supply in quantum-vulnerable addresses |
Using one provider, you'd get one timeline. Using all four, you see the tension.
The government says 2035. Google is racing for 2029. And the hardware isn't there yet. That tension IS the finding. No single provider gave the full picture.
| Provider | Market share claim |
|---|---|
| Perplexity only | ChatGPT maintained 86.7% share of AI chatbot web traffic (Jan 2025) |
| OpenAI only | OpenAI enterprise API share fell from ~50% to 25% by end of 2025 |
| OpenAI only | Anthropic's enterprise share rose from 12% to 32% |
| Perplexity only | Google's Gemini captured 21.5% of web traffic by early 2026 |
Contradictory signals from the same market — consumer dominance vs enterprise decline.
Using one provider, you'd conclude either "OpenAI is winning" or "OpenAI is losing." Both are true simultaneously, in different markets. That nuance only emerges from multi-provider research. View the full report →
The claim divergence isn't just providers "phrasing things differently." They're literally reading different websites.
Across the 90 research jobs, providers cited 9,877 sources from 3,100 unique domains. I analyzed the overlap:
| Source overlap | Domains | % of total |
|---|---|---|
| Cited by only 1 model | 1,555 | 50% |
| Cited by 2 models | 693 | 22% |
| Cited by 3-4 models | 613 | 20% |
| Cited by 5-7 models | 207 | 7% |
| Cited by all 8 models | 32 | 1% |
Half of all source domains are exclusive to a single model. Only 32 domains (1%), places like arxiv.org, github.com, wikipedia.org, and bloomberg.com, are cited by all eight models.
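The overlap table above reduces to a simple counting exercise. A minimal sketch, assuming citations have already been reduced to a mapping from model name to the set of domains it cited (the data layout and model names here are illustrative, not the study's actual schema):

```python
from collections import Counter

def overlap_histogram(citations):
    """citations: dict mapping model name -> set of cited domains.
    Returns a Counter: {number_of_models_citing_a_domain: domain_count}."""
    per_domain = Counter()
    for domains in citations.values():
        for d in set(domains):
            per_domain[d] += 1
    return Counter(per_domain.values())

# Toy example with three models (the real study used eight).
citations = {
    "perplexity": {"arxiv.org", "academic.oup.com"},
    "openai": {"arxiv.org", "stackexchange.com"},
    "grok": {"arxiv.org", "x.com"},
}
print(sorted(overlap_histogram(citations).items()))
```

Here three domains are exclusive to one model and one (arxiv.org) is shared by all three, the small-scale analogue of the 50%-exclusive / 1%-universal split.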
Each provider has a distinct source fingerprint:
| Provider | Exclusive domains | Source character |
|---|---|---|
| Anthropic | 428 | Academic, specialized (scienceopen.com, access.redhat.com) |
| Perplexity | 390 | Niche web, data reports (academic.oup.com, sector-specific sites) |
| OpenAI | 194 | Technical, institutional (stackexchange, government portals) |
| Gemini Lite | 141 | Industry analysis (accenture.com, ai.meta.com) |
| Gemini | 119 | Marketplace, commerce (abebooks.com, niche directories) |
| Grok | 118 | Social media, regulatory (x.com, state .gov sites) |
| Grok Premium | 118 | Tech blogs, security (googleblog.com, analyticsvidhya.com) |
| OpenAI Mini | 47 | Sports, niche APIs (espn.com, specialized docs) |
Provider source fingerprints: each searches different corners of the internet.
This is the root cause. The 85% claim divergence is downstream of 50% source divergence. Providers find different things because they search different places.
Models tested: Perplexity (sonar-deep-research), Gemini (gemini-2.5-pro), Gemini Lite (gemini-2.5-flash), OpenAI (gpt-4o / o3-deep-research), OpenAI Mini (gpt-4o-mini), Grok (grok-3), Grok Premium (grok-4), Anthropic (claude-sonnet-4). Not every job used all 8; budget tier determines how many models run. Minimum 2 models per job to qualify for this analysis.
What I measured: Divergence score = (claims found by exactly 1 model) / (total deduplicated claims per job).
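That formula is a one-liner over deduplicated clusters. A minimal sketch; the representation of a job as a list of provider sets is my assumption, not the study's actual schema:

```python
def divergence_score(clusters):
    """clusters: list of sets, each set holding the models that surfaced
    one deduplicated claim. Divergence = share found by exactly one model."""
    if not clusters:
        return 0.0
    unique = sum(1 for models in clusters if len(models) == 1)
    return unique / len(clusters)

# Toy job: four deduplicated claims, three surfaced by a single model.
job = [{"openai"}, {"grok"}, {"perplexity", "gemini"}, {"anthropic"}]
print(divergence_score(job))  # 0.75
```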
How claims were extracted: Each provider's research report was processed by an extraction pipeline:
| Step | Method | Threshold |
|---|---|---|
| Per-provider extraction | LLM identifies atomic, verifiable claims | — |
| Embedding generation | text-embedding-3-large (1536 dims) | — |
| Primary clustering | Union-find with cosine similarity | ≥ 0.78 |
| Fallback clustering | Token overlap (Jaccard) | ≥ 0.40 at cosine ≥ 0.65 |
| LLM reconciliation | Cross-provider cluster merge confirmation | centroid cosine ≥ 0.55 |
| NLI review | Per-cluster contradiction/merge/split decisions | — |
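The first rows of the pipeline table amount to a merge predicate over claim pairs. The sketch below is a simplified reconstruction under the stated thresholds: `should_merge` and the token-level Jaccard are my assumptions, and the LLM reconciliation and NLI stages are omitted entirely.

```python
import numpy as np

COSINE_PRIMARY, COSINE_FLOOR, JACCARD_MIN = 0.78, 0.65, 0.40

def jaccard(a, b):
    """Token-overlap similarity between two claim strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def should_merge(emb_i, emb_j, text_i, text_j):
    """Primary pass: cosine >= 0.78. Fallback pass: token Jaccard >= 0.40
    when cosine falls in the 0.65-0.78 gray zone."""
    cos = float(np.dot(emb_i, emb_j) /
                (np.linalg.norm(emb_i) * np.linalg.norm(emb_j)))
    if cos >= COSINE_PRIMARY:
        return True
    if cos >= COSINE_FLOOR and jaccard(text_i, text_j) >= JACCARD_MIN:
        return True
    return False

# Gray-zone pair: embeddings alone are inconclusive, token overlap decides.
a, b = np.array([1.0, 0.0]), np.array([0.72, 0.694])
print(should_merge(a, b, "GPT-4 is the best coding model",
                         "GPT-4 is the top model for code"))
```

The fallback pass is what catches paraphrases whose embeddings land just under the primary threshold but whose wording overlaps heavily.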
Quality controls: manual review of 10 random jobs (2 suspected false positives, both confirmed as genuinely different claims on inspection), plus the threshold sensitivity tests at 0.65 and 0.70 described above.
Topic analysis subset: For the per-category breakdown, I restricted to jobs with ≥50 clusters and excluded arithmetic / empty-prompt test queries (e.g. "what is 1+1?") where clustering artifacts inflate divergence toward 100%. 78 of the 90 jobs qualified. Category assignment used a keyword-score classifier with manual overrides for obvious mismatches. Sample sizes per category ranged from 4 (Science & Engineering) to 22 (Developer Tools & OSS). Mean overall divergence on the 78-job subset is 84.2% (95% CI: 82.9–85.5%) — consistent with the 84.6% mean on the full 90-job corpus.
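A keyword-score classifier of the kind described can be sketched as follows. The keyword lists and the tie-breaking rule here are illustrative guesses; the study's actual lists are not published.

```python
def classify(query, category_keywords):
    """Score each category by keyword hits in the query; return the
    best-scoring category (ties broken alphabetically), or None."""
    q = query.lower()
    scores = {cat: sum(q.count(kw) for kw in kws)
              for cat, kws in category_keywords.items()}
    best = max(sorted(scores), key=lambda c: scores[c])
    return best if scores[best] > 0 else None

# Hypothetical keyword lists for two of the six categories.
keywords = {
    "Security & Geopolitics": ["threat", "cyber", "sanctions", "war"],
    "Culture & Media": ["fantasy", "reader", "novel", "film"],
}
print(classify("cybersecurity threat assessment for Iran", keywords))
```

Queries scoring zero in every category would fall through to the manual-override step mentioned above.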
What this doesn't measure: Accuracy of individual claims (a claim being unique doesn't mean it's correct), importance weighting (not all claims are decision-relevant), temporal bias from different training data cutoffs, or how results might differ for non-English queries or purely academic topics.
If you're making decisions based on AI research, using one provider means operating with 73-92% of the available information missing. Not because the provider is bad. Because each one searches different corners of the internet, synthesizes differently, and includes different things.
The practical takeaway: for anything that matters (investment decisions, threat assessments, competitive analysis, technical architecture), run the same question through multiple providers and look at the union, not the intersection.
Built with Parallect.ai, a multi-provider deep research platform.
Data: 90 research jobs, 22,121 claims, 5 providers (8 model variants). All research was conducted on the Parallect.ai platform between March 25 and April 7, 2026.
Run multi-provider research across Perplexity, Gemini, OpenAI, Grok, and Anthropic in one query. See where they agree, diverge, and contradict each other.
Use invite code TRYPARALLECT for free research credits