Parallect

I Ran 90 Research Queries Through 8 Deep Research Models. 86% of Findings Were Unique to One.

22,121 claims. 5 providers (8 model variants). 86% of findings found by only one. And the pattern is the same whether you ask about Tesla, Iran, Claude, or fantasy novels.

@justin

I took 90 real research queries and ran each one through 5 AI providers (8 model variants) simultaneously: Perplexity, Gemini (+ Gemini Lite), OpenAI (+ OpenAI Mini), Grok (+ Grok Premium), and Anthropic.

Then I extracted every factual claim from every provider's report, deduplicated them using embedding-based clustering with LLM reconciliation, and asked one question: how many claims does each provider find that no other provider finds? (A "claim" is an atomic, verifiable assertion like "CrowdStrike reports an 89% increase in AI-enabled attacks" or "Google set an internal 2029 deadline for PQC migration.")

The reason this number is so high: half of the source domains each provider cites are exclusive to that provider. They're not just phrasing things differently. They're literally reading different websites. Source analysis below.

22,121 total claims. 18,896 found by only one model. 3,040 found by exactly two. 185 confirmed by three or more. Mean divergence: 84.6% (95% CI: 83.5-85.8%, n=90).

How I deduplicated: Claims were embedded (text-embedding-3-large), clustered by cosine similarity (≥ 0.78), then cross-provider pairs were verified by LLM reconciliation. Manual review of 10 random jobs found only 2 suspected false positives, both confirmed as genuinely different claims on inspection. Full methodology below.
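A minimal sketch of that clustering step, assuming the approach described above. Everything here is illustrative: `cosine`, `cluster_claims`, and the toy 2-d vectors stand in for the real text-embedding-3-large vectors and production code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster_claims(embeddings, threshold=0.78):
    """Union-find clustering: merge any two claims whose embedding
    cosine similarity meets the threshold; return a cluster id per claim."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]

# Toy 2-d vectors standing in for real embeddings: the first two are
# near-duplicates, the third is orthogonal to both.
ids = cluster_claims([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(ids)  # first two share a cluster id; the third stands alone
```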

This isn't a cherry-picked result. It's the median across 90 diverse research jobs — cybersecurity threat assessments, quantum computing timelines, stock analysis, fantasy reader marketing, urea fertilizer supply chains. And when I grouped those jobs by subject area, divergence varied by only 2.1 percentage points across six categories. The pattern is universal.

Could this be wrong?

Before diving into the data, I want to address the obvious objection: maybe the dedup is just bad.

Five things that could explain a high uniqueness rate without real divergence, or limit what the number means:

  1. Extraction is too granular. If the LLM splits "GDP grew 3%" into two claims ("GDP grew" and "growth was 3%"), uniqueness inflates. I checked: median claim length is 77 characters, and only 2% of claims are under 30 characters. Claims are atomic but not trivially so.

  2. Clustering threshold is too strict. At cosine similarity 0.78, "GPT-4 is the best coding model" and "GPT-4 is the top model for code" should merge. I tested sensitivity at lower thresholds: at 0.70 divergence drops to ~79%, at 0.65 it drops to ~74%. The pattern holds across thresholds. The pipeline also uses a secondary token-overlap pass (Jaccard ≥ 0.40) and a third LLM reconciliation pass for cross-provider merges.

  3. Provider volume bias. If one provider generates 3x more claims, its "unique" count inflates. I checked: unique rates are remarkably consistent across all providers (65-72%). No single provider dominates.

  4. Not all claims matter equally. Some unique findings are minor details. But the examples below show that many are specific, verifiable, and decision-relevant: funding amounts, benchmark scores, timeline targets, market share figures.

  5. Sampling bias. These 90 queries were run by one user on one platform. Topics skew toward AI, technology, finance, and geopolitics. A different query distribution (e.g., purely academic, medical, or legal topics) might produce different divergence rates. I'd expect the pattern to hold but the specific numbers may vary.

The explanation I believe is correct: Providers genuinely search different sources, use different retrieval strategies, and apply different editorial judgment about what to include. The divergence is real.

The Distribution

Every single research job I analyzed showed significant divergence. Not one had less than 68% unique findings. The median was 85.3%.

59% of research jobs fell in the 80-90% divergence range. The interquartile range was tight: 80.5% to 88.9%. This is a consistent pattern, not an outlier.

Does This Vary By Topic?

The first thing any skeptic will ask: maybe divergence is just a quirk of the topics I chose. Maybe finance queries diverge because the market is noisy. Maybe cybersecurity queries diverge because sources are fragmented. Maybe if I'd stuck to "safe" academic questions, the numbers would collapse.

So I grouped the 78 qualifying jobs (dropping jobs with too few claim clusters and arithmetic test queries like "what is 1+1?" that game the dedup pipeline) into six subject areas and computed mean divergence per category with 95% confidence intervals.

The total spread across all six categories is 2.1 percentage points. Culture & Media sits at 85.3%. Security & Geopolitics sits at 83.2%. Everything else is in between. Every single 95% confidence interval overlaps with every other one — there is no statistically significant difference between topics.
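A 95% confidence interval on a mean divergence like this can be computed with a normal approximation. A sketch, using synthetic per-job scores rather than the real data:

```python
import math
import statistics

def mean_ci95(scores):
    """Mean with a normal-approximation 95% confidence interval."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - 1.96 * se, m + 1.96 * se

# Synthetic per-job divergence scores (the real corpus has n=90).
scores = [0.81, 0.85, 0.88, 0.84, 0.86, 0.83]
m, lo, hi = mean_ci95(scores)
print(f"mean {m:.1%}, 95% CI {lo:.1%}-{hi:.1%}")  # mean is 84.5% here
```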

This is the null result I didn't expect, and it's stronger than any single-topic finding could be. Whether you're researching Tesla valuations, Iran war scenarios, LLM inference economics, fantasy reader marketing, or urea fertilizer supply chains — you lose roughly 84% of findings using a single provider, give or take a couple of points.

A few implications:

  1. Divergence is a property of how deep research works, not a quirk of any specific topic. If it were a topic effect, you'd expect high-consensus subjects (well-documented fundamentals) to diverge less than speculative or fragmented ones. They don't. The bottom of the range (Security, 83.2%) still finds 83% of claims in exactly one provider.
  2. You can't dismiss the 86% number by saying "well, your query set was biased." Even across six wildly different subject domains, the mean divergence varies by only ±1 point around the overall median. The sampling distribution is too tight for selection bias to plausibly explain it.
  3. Categories with small n (Science & Engineering, n=4; Security & Geopolitics, n=6) have wider confidence intervals, but their means still land in the same narrow band. With more data they might move a point or two, but not out of the cluster.

If anything, this makes the single-provider blindspot more of a structural problem, not less. It isn't that one topic is noisy. It's that every provider's source fingerprint is partial — everywhere, consistently, independent of subject.

What You Miss With One Provider

Even the best-performing provider (Grok Premium) misses nearly three-quarters of all findings. The worst (Gemini) misses 92%.

The unique rate is remarkably consistent across providers (65-72%). No single provider dominates. They each find different things because they search different sources, use different retrieval strategies, and apply different synthesis heuristics.

Where the Information Goes

Only 14% of factual claims are corroborated by more than one AI. The vast majority of what any provider tells you is information no other provider surfaces.

The Examples That Made Me Sit Up

Anthropic's Restricted AI Model Release

80% divergence. 234 claims. 8 model variants across 5 providers. View the full report →

When Anthropic announced Claude Mythos Preview with restricted access, I ran the story through all 5 providers (8 model variants). Each found different angles:

| Provider | Unique finding no other provider surfaced |
| --- | --- |
| Perplexity | Data accidentally leaked from Anthropic's unsecured CMS on March 26 |
| Grok | Mythos scored 100% on Cybench, 83% on CyberGym; Glasswing gave $2.5M to OpenSSF, $1.5M to Apache |
| OpenAI | Anthropic rolled out to 40+ organizations for defensive use only |
| Anthropic | Experts warned: behind Mythos is the next OpenAI model, then Gemini, then open-source Chinese models |
| Gemini | OpenAI pledged $10M in API credits to vetted defenders via Cybersecurity Grant Program |
| Grok Premium | OpenAI launched GPT-5.3-Codex in February 2026 |

Each provider told a fundamentally different version of the same story.

The irony: a story about AI providers restricting information, where 80% of the specific details were unique to whoever found them.

Quantum Computing vs Bitcoin Encryption

80.5% divergence. 195 claims. 4 providers. View the full report →

The timeline question matters: when could quantum computers break Bitcoin's encryption?

| Provider | Timeline claim | Key detail |
| --- | --- | --- |
| OpenAI | 2035 | U.S. White House target for quantum-safe migration |
| Grok Premium | 2029 | Google's internal deadline — 6 years earlier than the government |
| Perplexity | Not yet feasible | Google hasn't demonstrated >100 qubits in commercial deployment |
| OpenAI | ~6.7M BTC at risk | ~35% of supply in quantum-vulnerable addresses |

Using one provider, you'd get one timeline. Using all four, you see the tension.

The government says 2035. Google is racing for 2029. And the hardware isn't there yet. That tension IS the finding. No single provider gave the full picture.

AI Market Share: Contradictory Signals

| Provider | Market share claim |
| --- | --- |
| Perplexity only | ChatGPT maintained 86.7% share of AI chatbot web traffic (Jan 2025) |
| OpenAI only | OpenAI enterprise API share fell from ~50% to 25% by end of 2025 |
| OpenAI only | Anthropic's enterprise share rose from 12% to 32% |
| Perplexity only | Google's Gemini captured 21.5% of web traffic by early 2026 |

Contradictory signals from the same market — consumer dominance vs enterprise decline.

Using one provider, you'd conclude either "OpenAI is winning" or "OpenAI is losing." Both are true simultaneously, in different markets. That nuance only emerges from multi-provider research. View the full report →

Why This Happens: The Source Data

The claim divergence isn't just providers "phrasing things differently." They're literally reading different websites.

Across the 90 research jobs, providers cited 9,877 sources from 3,100 unique domains. I analyzed the overlap:

| Source overlap | Domains | % of total |
| --- | --- | --- |
| Cited by only 1 model | 1,555 | 50% |
| Cited by 2 models | 693 | 22% |
| Cited by 3-4 models | 613 | 20% |
| Cited by 5-7 models | 207 | 7% |
| Cited by all 8 models | 32 | 1% |


50% of source domains are exclusive to one model. Only 32 domains (1%) — places like arxiv.org, github.com, wikipedia.org, bloomberg.com — are cited by all eight models.

Each provider has a distinct source fingerprint:

| Provider | Exclusive domains | Source character |
| --- | --- | --- |
| Anthropic | 428 | Academic, specialized (scienceopen.com, access.redhat.com) |
| Perplexity | 390 | Niche web, data reports (academic.oup.com, sector-specific sites) |
| OpenAI | 194 | Technical, institutional (stackexchange, government portals) |
| Gemini Lite | 141 | Industry analysis (accenture.com, ai.meta.com) |
| Gemini | 119 | Marketplace, commerce (abebooks.com, niche directories) |
| Grok | 118 | Social media, regulatory (x.com, state .gov sites) |
| Grok Premium | 118 | Tech blogs, security (googleblog.com, analyticsvidhya.com) |
| OpenAI Mini | 47 | Sports, niche APIs (espn.com, specialized docs) |

Provider source fingerprints: each searches different corners of the internet.

This is the root cause. The 86% claim divergence is downstream of 50% source divergence. Providers find different things because they search different places.
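The source-overlap numbers come from a count of how many providers cite each domain. A sketch under assumed data structures (the citation sets below are toy data, not the real corpus):

```python
from collections import Counter

# provider -> set of domains it cited (toy data)
citations = {
    "perplexity": {"arxiv.org", "academic.oup.com"},
    "gemini":     {"arxiv.org", "abebooks.com"},
    "grok":       {"arxiv.org", "x.com"},
}

# For each domain, count how many providers cite it.
domain_breadth = Counter()
for domains in citations.values():
    for d in domains:
        domain_breadth[d] += 1

# Then bucket domains by that breadth: key 1 = exclusive to one provider.
overlap = Counter(domain_breadth.values())
print(overlap)  # arxiv.org is cited by all 3; the other 3 domains are exclusive
```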

Methodology

Models tested: Perplexity (sonar-deep-research), Gemini (gemini-2.5-pro), Gemini Lite (gemini-2.5-flash), OpenAI (gpt-4o / o3-deep-research), OpenAI Mini (gpt-4o-mini), Grok (grok-3), Grok Premium (grok-4), Anthropic (claude-sonnet-4). Not every job used all 8; budget tier determines how many models run. Minimum 2 models per job to qualify for this analysis.

What I measured: Divergence score = (claims found by exactly 1 model) / (total claims per job).
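That score is simple to compute once claims are clustered. A sketch, where each cluster records which providers contributed to it (the cluster data here is made up):

```python
def divergence(clusters):
    """Fraction of claim clusters that exactly one provider contributed to."""
    unique = sum(1 for providers in clusters if len(providers) == 1)
    return unique / len(clusters)

# Each set holds the providers that surfaced a given claim (toy data).
clusters = [
    {"perplexity"},        # unique finding
    {"gemini"},            # unique finding
    {"anthropic"},         # unique finding
    {"openai", "grok"},    # corroborated by two providers
]
print(divergence(clusters))  # 0.75
```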

How claims were extracted: Each provider's research report was processed by an extraction pipeline:

| Step | Method | Threshold |
| --- | --- | --- |
| Per-provider extraction | LLM identifies atomic, verifiable claims | |
| Embedding generation | text-embedding-3-large (1536 dims) | |
| Primary clustering | Union-find with cosine similarity | ≥ 0.78 |
| Fallback clustering | Token overlap (Jaccard) | ≥ 0.40 at cosine ≥ 0.65 |
| LLM reconciliation | Cross-provider cluster merge confirmation | centroid cosine ≥ 0.55 |
| NLI review | Per-cluster contradiction/merge/split decisions | |
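The Jaccard fallback is easy to sketch. This is a simplified version on whitespace-split word sets; the real pipeline's tokenization may differ:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: |intersection| / |union| of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

a = "GPT-4 is the best coding model"
b = "GPT-4 is the top model for code"
# 4 shared tokens (gpt-4, is, the, model) out of 9 distinct: 4/9, which
# clears the 0.40 fallback threshold.
print(round(jaccard(a, b), 2))
```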

Quality controls:

  • Zero orphan claims: every claim attributed to at least one provider
  • Dedup verified: manual review of 10 random jobs found only 2 suspected missed duplicates, both false alarms
  • Unique claims manually verified as substantive (specific citations, numbers, analysis), not noise
  • Provider volume normalized: unique rates consistent across providers (65-72%)

Topic analysis subset: For the per-category breakdown, I restricted to jobs with ≥50 clusters and excluded arithmetic / empty-prompt test queries (e.g. "what is 1+1?") where clustering artifacts inflate divergence toward 100%. 78 of the 90 jobs qualified. Category assignment used a keyword-score classifier with manual overrides for obvious mismatches. Sample sizes per category ranged from 4 (Science & Engineering) to 22 (Developer Tools & OSS). Mean overall divergence on the 78-job subset is 84.2% (95% CI: 82.9–85.5%) — consistent with the 84.6% mean on the full 90-job corpus.

What this doesn't measure: Accuracy of individual claims (a claim being unique doesn't mean it's correct), importance weighting (not all claims are decision-relevant), temporal bias from different training data cutoffs, or how results might differ for non-English queries or purely academic topics.

What This Means

If you're making decisions based on AI research, using one provider means operating with 73-92% of the available information missing. Not because the provider is bad. Because each one searches different corners of the internet, synthesizes differently, and includes different things.

The practical takeaway: for anything that matters (investment decisions, threat assessments, competitive analysis, technical architecture), run the same question through multiple providers and look at the union, not the intersection.
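In set terms, that takeaway is just union versus intersection. The claim ids below are toy data:

```python
# What each provider found (toy claim ids).
a = {"claim1", "claim2", "claim3"}
b = {"claim3", "claim4"}
c = {"claim5"}

union = a | b | c          # everything any provider surfaced
intersection = a & b & c   # only what every provider agrees on
print(len(union), len(intersection))  # 5 0
```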

73-92%
of findings are invisible to any single AI provider

Built with Parallect.ai, multi-provider deep research.

Data: 90 research jobs, 22,121 claims, 5 providers (8 model variants). All research was conducted on the Parallect.ai platform between March 25 and April 7, 2026.

Stop missing 86% of the picture.

Run multi-provider research across Perplexity, Gemini, OpenAI, Grok, and Anthropic in one query. See where they agree, diverge, and contradict each other.

Use invite code TRYPARALLECT for free research credits