Parallect

I Ran 90 Research Queries Through 8 Deep Research Models. 86% of Findings Were Unique to One.

22,121 claims. 5 providers (8 model variants). 86% of findings found by only one. And the pattern is the same whether you ask about Tesla, Iran, Claude, or fantasy novels.

@justin

I took 90 real research queries and ran each one through 5 AI providers (8 model variants) simultaneously: Perplexity, Gemini (+ Gemini Lite), OpenAI (+ OpenAI Mini), Grok (+ Grok Premium), and Anthropic.

Then I extracted every factual claim from every provider's report, deduplicated them using embedding-based clustering with LLM reconciliation, and asked one question: how many claims does each provider find that no other provider finds? (A "claim" is an atomic, verifiable assertion like "CrowdStrike reports an 89% increase in AI-enabled attacks" or "Google set an internal 2029 deadline for PQC migration.")

The reason this number is so high: half of the source domains each provider cites are exclusive to that provider. They're not just phrasing things differently. They're literally reading different websites. Source analysis below.

22,121 total claims. 18,896 found by only one model. 3,040 found by exactly two. 185 confirmed by three or more. Mean divergence: 84.6% (95% CI: 83.5-85.8%, n=90).

How I deduplicated: Claims were embedded (text-embedding-3-large), clustered by cosine similarity (≥ 0.78), then cross-provider pairs were verified by LLM reconciliation. Manual review of 10 random jobs found only 2 suspected false positives, both confirmed as genuinely different claims on inspection. Full methodology below.
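A minimal sketch of that clustering step, assuming the approach described above. Everything here is illustrative: `cosine`, `cluster_claims`, and the toy 2-d vectors stand in for the real text-embedding-3-large vectors and production code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster_claims(embeddings, threshold=0.78):
    """Union-find clustering: merge any two claims whose embedding
    cosine similarity meets the threshold; return a cluster id per claim."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]

# Toy 2-d vectors standing in for real embeddings: the first two are
# near-duplicates, the third is orthogonal to both.
ids = cluster_claims([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(ids)  # first two share a cluster id; the third stands alone
```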

This isn't a cherry-picked result. It's the median across 90 diverse research jobs — cybersecurity threat assessments, quantum computing timelines, stock analysis, fantasy reader marketing, urea fertilizer supply chains. And when I grouped those jobs by subject area, divergence varied by only 2.1 percentage points across six categories. The pattern is universal.

Could this be wrong?

Before diving into the data, I want to address the obvious objection: maybe the dedup is just bad.

Five things that could explain a high uniqueness rate without real divergence, or limit what the number means:

  1. Extraction is too granular. If the LLM splits "GDP grew 3%" into two claims ("GDP grew" and "growth was 3%"), uniqueness inflates. I checked: median claim length is 77 characters, and only 2% of claims are under 30 characters. Claims are atomic but not trivially so.

  2. Clustering threshold is too strict. At cosine similarity 0.78, "GPT-4 is the best coding model" and "GPT-4 is the top model for code" should merge. I tested sensitivity at lower thresholds: at 0.70 divergence drops to ~79%, at 0.65 it drops to ~74%. The pattern holds across thresholds. The pipeline also uses a secondary token-overlap pass (Jaccard ≥ 0.40) and a third LLM reconciliation pass for cross-provider merges.

  3. Provider volume bias. If one provider generates 3x more claims, its "unique" count inflates. I checked: unique rates are remarkably consistent across all providers (65-72%). No single provider dominates.

  4. Not all claims matter equally. Some unique findings are minor details. But the examples below show that many are specific, verifiable, and decision-relevant: funding amounts, benchmark scores, timeline targets, market share figures.

  5. Sampling bias. These 90 queries were run by one user on one platform. Topics skew toward AI, technology, finance, and geopolitics. A different query distribution (e.g., purely academic, medical, or legal topics) might produce different divergence rates. I'd expect the pattern to hold but the specific numbers may vary.

The explanation I believe is correct: Providers genuinely search different sources, use different retrieval strategies, and apply different editorial judgment about what to include. The divergence is real.

The Distribution

Every single research job I analyzed showed significant divergence. Not one had less than 68% unique findings. The median was 85.3%.

59% of research jobs fell in the 80-90% divergence range. The interquartile range was tight: 80.5% to 88.9%. This is a consistent pattern, not an outlier.

Does This Vary By Topic?

The first thing any skeptic will ask: maybe divergence is just a quirk of the topics I chose. Maybe finance queries diverge because the market is noisy. Maybe cybersecurity queries diverge because sources are fragmented. Maybe if I'd stuck to "safe" academic questions, the numbers would collapse.

So I grouped the 78 qualifying jobs (dropping jobs with too few claim clusters and arithmetic test queries like "what is 1+1?" that game the dedup pipeline) into six subject areas and computed mean divergence per category with 95% confidence intervals.

The total spread across all six categories is 2.1 percentage points. Culture & Media sits at 85.3%. Security & Geopolitics sits at 83.2%. Everything else is in between. Every single 95% confidence interval overlaps with every other one — there is no statistically significant difference between topics.
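A 95% confidence interval on a mean divergence like this can be computed with a normal approximation. A sketch, using synthetic per-job scores rather than the real data:

```python
import math
import statistics

def mean_ci95(scores):
    """Mean with a normal-approximation 95% confidence interval."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - 1.96 * se, m + 1.96 * se

# Synthetic per-job divergence scores (the real corpus has n=90).
scores = [0.81, 0.85, 0.88, 0.84, 0.86, 0.83]
m, lo, hi = mean_ci95(scores)
print(f"mean {m:.1%}, 95% CI {lo:.1%}-{hi:.1%}")  # mean is 84.5% here
```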

This is the null result I didn't expect, and it's stronger than any single-topic finding could be. Whether you're researching Tesla valuations, Iran war scenarios, LLM inference economics, fantasy reader marketing, or urea fertilizer supply chains — you lose roughly 84% of findings using a single provider, give or take a couple of points.

A few implications:

  1. Divergence is a property of how deep research works, not a quirk of any specific topic. If it were a topic effect, you'd expect high-consensus subjects (well-documented fundamentals) to diverge less than speculative or fragmented ones. They don't. The bottom of the range (Security, 83.2%) still finds 83% of claims in exactly one provider.
  2. You can't dismiss the 86% number by saying "well, your query set was biased." Even across six wildly different subject domains, the mean divergence varies by only ±1 point around the overall median. The sampling distribution is too tight for selection bias to plausibly explain it.
  3. Categories with small n (Science & Engineering, n=4; Security & Geopolitics, n=6) have wider confidence intervals, but their means still land in the same narrow band. With more data they might move a point or two, but not out of the cluster.

If anything, this makes the single-provider blindspot more of a structural problem, not less. It isn't that one topic is noisy. It's that every provider's source fingerprint is partial — everywhere, consistently, independent of subject.

What You Miss With One Provider

Even the best-performing provider (Grok Premium) misses nearly three-quarters of all findings. The worst (Gemini) misses 92%.

The unique rate is remarkably consistent across providers (65-72%). No single provider dominates. They each find different things because they search different sources, use different retrieval strategies, and apply different synthesis heuristics.

Where the Information Goes

Only 14% of factual claims are corroborated by more than one AI. The vast majority of what any provider tells you is information no other provider surfaces.

The Examples That Made Me Sit Up

Anthropic's Restricted AI Model Release

80% divergence. 234 claims. 8 model variants across 5 providers. View the full report →

When Anthropic announced Claude Mythos Preview with restricted access, I ran the story through all 5 providers (8 model variants). Each found different angles:

| Provider | Unique finding no other provider surfaced |
| --- | --- |
| Perplexity | Data accidentally leaked from Anthropic's unsecured CMS on March 26 |
| Grok | Mythos scored 100% on Cybench, 83% on CyberGym; Glasswing gave $2.5M to OpenSSF, $1.5M to Apache |
| OpenAI | Anthropic rolled out to 40+ organizations for defensive use only |
| Anthropic | Experts warned: behind Mythos is the next OpenAI model, then Gemini, then open-source Chinese models |
| Gemini | OpenAI pledged $10M in API credits to vetted defenders via Cybersecurity Grant Program |
| Grok Premium | OpenAI launched GPT-5.3-Codex in February 2026 |

Each provider told a fundamentally different version of the same story.

The irony: a story about AI providers restricting information, where 80% of the specific details were unique to whoever found them.

Quantum Computing vs Bitcoin Encryption

80.5% divergence. 195 claims. 4 providers. View the full report →

The timeline question matters: when could quantum computers break Bitcoin's encryption?

| Provider | Timeline claim | Key detail |
| --- | --- | --- |
| OpenAI | 2035 | U.S. White House target for quantum-safe migration |
| Grok Premium | 2029 | Google's internal deadline — 6 years earlier than the government |
| Perplexity | Not yet feasible | Google hasn't demonstrated >100 qubits in commercial deployment |
| OpenAI | ~6.7M BTC at risk | ~35% of supply in quantum-vulnerable addresses |

Using one provider, you'd get one timeline. Using all four, you see the tension.

The government says 2035. Google is racing for 2029. And the hardware isn't there yet. That tension IS the finding. No single provider gave the full picture.

AI Market Share: Contradictory Signals

| Provider | Market share claim |
| --- | --- |
| Perplexity only | ChatGPT maintained 86.7% share of AI chatbot web traffic (Jan 2025) |
| OpenAI only | OpenAI enterprise API share fell from ~50% to 25% by end of 2025 |
| OpenAI only | Anthropic's enterprise share rose from 12% to 32% |
| Perplexity only | Google's Gemini captured 21.5% of web traffic by early 2026 |

Contradictory signals from the same market — consumer dominance vs enterprise decline.

Using one provider, you'd conclude either "OpenAI is winning" or "OpenAI is losing." Both are true simultaneously, in different markets. That nuance only emerges from multi-provider research. View the full report →

Why This Happens: The Source Data

The claim divergence isn't just providers "phrasing things differently." They're literally reading different websites.

Across the 90 research jobs, providers cited 9,877 sources from 3,100 unique domains. I analyzed the overlap:

| Source overlap | Domains | % of total |
| --- | --- | --- |
| Cited by only 1 model | 1,555 | 50% |
| Cited by 2 models | 693 | 22% |
| Cited by 3-4 models | 613 | 20% |
| Cited by 5-7 models | 207 | 7% |
| Cited by all 8 models | 32 | 1% |


50% of source domains are exclusive to one model. Only 32 domains (1%) — places like arxiv.org, github.com, wikipedia.org, bloomberg.com — are cited by all eight models.

Each provider has a distinct source fingerprint:

| Provider | Exclusive domains | Source character |
| --- | --- | --- |
| Anthropic | 428 | Academic, specialized (scienceopen.com, access.redhat.com) |
| Perplexity | 390 | Niche web, data reports (academic.oup.com, sector-specific sites) |
| OpenAI | 194 | Technical, institutional (stackexchange, government portals) |
| Gemini Lite | 141 | Industry analysis (accenture.com, ai.meta.com) |
| Gemini | 119 | Marketplace, commerce (abebooks.com, niche directories) |
| Grok | 118 | Social media, regulatory (x.com, state .gov sites) |
| Grok Premium | 118 | Tech blogs, security (googleblog.com, analyticsvidhya.com) |
| OpenAI Mini | 47 | Sports, niche APIs (espn.com, specialized docs) |

Provider source fingerprints: each searches different corners of the internet.

This is the root cause. The 86% claim divergence is downstream of 50% source divergence. Providers find different things because they search different places.
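The source-overlap numbers come from a count of how many providers cite each domain. A sketch under assumed data structures (the citation sets below are toy data, not the real corpus):

```python
from collections import Counter

# provider -> set of domains it cited (toy data)
citations = {
    "perplexity": {"arxiv.org", "academic.oup.com"},
    "gemini":     {"arxiv.org", "abebooks.com"},
    "grok":       {"arxiv.org", "x.com"},
}

# For each domain, count how many providers cite it.
domain_breadth = Counter()
for domains in citations.values():
    for d in domains:
        domain_breadth[d] += 1

# Then bucket domains by that breadth: key 1 = exclusive to one provider.
overlap = Counter(domain_breadth.values())
print(overlap)  # arxiv.org is cited by all 3; the other 3 domains are exclusive
```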

Methodology

Models tested: Perplexity (sonar-deep-research), Gemini (gemini-2.5-pro), Gemini Lite (gemini-2.5-flash), OpenAI (gpt-4o / o3-deep-research), OpenAI Mini (gpt-4o-mini), Grok (grok-3), Grok Premium (grok-4), Anthropic (claude-sonnet-4). Not every job used all 8; budget tier determines how many models run. Minimum 2 models per job to qualify for this analysis.

What I measured: Divergence score = (claims found by exactly 1 model) / (total claims per job).
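That score is simple to compute once claims are clustered. A sketch, where each cluster records which providers contributed to it (the cluster data here is made up):

```python
def divergence(clusters):
    """Fraction of claim clusters that exactly one provider contributed to."""
    unique = sum(1 for providers in clusters if len(providers) == 1)
    return unique / len(clusters)

# Each set holds the providers that surfaced a given claim (toy data).
clusters = [
    {"perplexity"},        # unique finding
    {"gemini"},            # unique finding
    {"anthropic"},         # unique finding
    {"openai", "grok"},    # corroborated by two providers
]
print(divergence(clusters))  # 0.75
```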

How claims were extracted: Each provider's research report was processed by an extraction pipeline:

| Step | Method | Threshold |
| --- | --- | --- |
| Per-provider extraction | LLM identifies atomic, verifiable claims | |
| Embedding generation | text-embedding-3-large (1536 dims) | |
| Primary clustering | Union-find with cosine similarity | ≥ 0.78 |
| Fallback clustering | Token overlap (Jaccard) | ≥ 0.40 at cosine ≥ 0.65 |
| LLM reconciliation | Cross-provider cluster merge confirmation | centroid cosine ≥ 0.55 |
| NLI review | Per-cluster contradiction/merge/split decisions | |
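The Jaccard fallback is easy to sketch. This is a simplified version on whitespace-split word sets; the real pipeline's tokenization may differ:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: |intersection| / |union| of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

a = "GPT-4 is the best coding model"
b = "GPT-4 is the top model for code"
# 4 shared tokens (gpt-4, is, the, model) out of 9 distinct: 4/9, which
# clears the 0.40 fallback threshold.
print(round(jaccard(a, b), 2))
```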

Quality controls:

  • Zero orphan claims: every claim attributed to at least one provider
  • Dedup verified: manual review of 10 random jobs found only 2 suspected missed duplicates, both false alarms
  • Unique claims manually verified as substantive (specific citations, numbers, analysis), not noise
  • Provider volume normalized: unique rates consistent across providers (65-72%)

Topic analysis subset: For the per-category breakdown, I restricted to jobs with ≥50 clusters and excluded arithmetic / empty-prompt test queries (e.g. "what is 1+1?") where clustering artifacts inflate divergence toward 100%. 78 of the 90 jobs qualified. Category assignment used a keyword-score classifier with manual overrides for obvious mismatches. Sample sizes per category ranged from 4 (Science & Engineering) to 22 (Developer Tools & OSS). Mean overall divergence on the 78-job subset is 84.2% (95% CI: 82.9–85.5%) — consistent with the 84.6% mean on the full 90-job corpus.

What this doesn't measure: Accuracy of individual claims (a claim being unique doesn't mean it's correct), importance weighting (not all claims are decision-relevant), temporal bias from different training data cutoffs, or how results might differ for non-English queries or purely academic topics.

What This Means

If you're making decisions based on AI research, using one provider means operating with 73-92% of the available information missing. Not because the provider is bad. Because each one searches different corners of the internet, synthesizes differently, and includes different things.

The practical takeaway: for anything that matters (investment decisions, threat assessments, competitive analysis, technical architecture), run the same question through multiple providers and look at the union, not the intersection.
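In set terms, that takeaway is just union versus intersection. The claim ids below are toy data:

```python
# What each provider found (toy claim ids).
a = {"claim1", "claim2", "claim3"}
b = {"claim3", "claim4"}
c = {"claim5"}

union = a | b | c          # everything any provider surfaced
intersection = a & b & c   # only what every provider agrees on
print(len(union), len(intersection))  # 5 0
```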

73-92%
of findings are invisible to any single AI provider

Built with Parallect.ai, multi-provider deep research.

Data: 90 research jobs, 22,121 claims, 5 providers (8 model variants). All research was conducted on the Parallect.ai platform between March 25 and April 7, 2026.

Stop missing 86% of the picture.

Run multi-provider research across Perplexity, Gemini, OpenAI, Grok, and Anthropic in one query. See where they agree, diverge, and contradict each other.

Use invite code TRYPARALLECT for free research credits