May 31, 2026

Frontier AI in May 2026: Convergence or Single Leader

In late May 2026, the top frontier models largely converge on aggregate benchmarks, yet Claude Opus 4.8 narrowly edges GPT-5.5. Exact dates, scores, and sources are provided below.

Key Finding

On the LMArena live leaderboard, GPT-5.5 Pro is reported near the top with an Elo around 1510, while Claude Opus 4.7 is reported around 1505; older or variant claims also place Claude Opus 4.6 at #1 with Elo 1504 and note that GPT-4-class and Claude 3 Opus were near the top in mid-2024.

High confidence. Supported by grok, openai, anthropic, perplexity.

Justin Furniss

@Parallect.ai and @SecureCoders. Founder. Hacker. Father. Seeker of all things AI


Executive Summary

Rogo's "no single best model" claim is substantially accurate but temporally bounded: on the Big Finance Bench published May 27, 2026, Claude Opus 4.7, GPT-5.5, and Claude Sonnet 4.6 were separated by less than 0.3 percentage points on a rubric-graded aggregate score, with the best single model reaching 58.8% [1, 2].
The Artificial Analysis Intelligence Index currently places Claude Opus 4.8 at #1 with a score of 61.4, edging GPT-5.5 (xhigh) by 1.2 points — a meaningful but narrow margin that emerged only after Opus 4.8 launched on May 28, 2026, one day after Rogo's publication [3, 4, 5].
On LMArena's human-preference leaderboard, the top three models remain a statistical tie, with overlapping 95% confidence intervals; no model holds a statistically meaningful Elo lead [6].
Frontier capability has genuinely converged but has not fully plateaued: the evidence shows near-parity on broad aggregates, persistent domain-specific differentiation, and one model (Opus 4.8) that has just opened a modest but real gap on the most demanding composite index.
AI capital expenditure measured in the hundreds of billions per year, and projected to reach trillions cumulatively, has produced diminishing returns on general benchmarks while accelerating cost reduction; the real differentiation frontier has shifted to inference efficiency, agentic performance, and specialized domain tasks.

1. The Rogo Big Finance Bench: What It Actually Measured and What It Found

The Benchmark's Design

Rogo published its Big Finance Bench (BFB) on May 27, 2026 [1, 2]. The benchmark comprises 928 questions written by ex-finance practitioners, scored against 15,656 rubric criteria representing 36,241 weighted points [2]. This scale matters for interpreting the "0.3 percentage point" claim: the evaluation is not a simple accuracy metric but a rubric-graded score that awards partial credit for reasoning quality, source identification, and analytical chain — not merely final numerical answers [2].

The rubric-grading methodology was deliberately chosen because strict final-answer scoring collapses near-successes into the same category as unsupported guesses. A model that correctly identifies the relevant SEC filing, applies the right accounting treatment, and follows sound analytical logic but makes a minor arithmetic error in the last step receives meaningful partial credit under rubric scoring but zero credit under binary accuracy [2]. This design choice makes the BFB more diagnostic of real-world financial analyst utility than standard benchmark accuracy.
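The difference between the two scoring schemes can be illustrated with a minimal sketch. The criterion names and weights below are hypothetical and are not taken from Rogo's rubric; the sketch only assumes that a rubric score is the earned weighted points divided by the total weighted points.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line item (names and weights here are hypothetical)."""
    name: str
    weight: float
    met: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of weighted rubric points earned (partial credit)."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


def binary_score(criteria: list[Criterion], final_name: str = "final_answer") -> float:
    """All-or-nothing: 1.0 only if the final-answer criterion is met."""
    return 1.0 if any(c.name == final_name and c.met for c in criteria) else 0.0


# A response that does everything right except the last arithmetic step.
response = [
    Criterion("identifies_relevant_filing", 3.0, True),
    Criterion("applies_accounting_treatment", 3.0, True),
    Criterion("sound_analytical_chain", 2.0, True),
    Criterion("final_answer", 2.0, False),
]

print(f"rubric: {rubric_score(response):.2f}")  # 0.80
print(f"binary: {binary_score(response):.2f}")  # 0.00
```

Under these assumed weights the same response earns 80% of the rubric points but scores zero on binary accuracy, which is the collapse the rubric design is meant to avoid.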

Ten frontier models were evaluated in total [2].

The Core Finding: Sub-0.3pp Convergence

The claim that Claude Opus 4.7, GPT-5.5, and Claude Sonnet 4.6 sit within approximately 0.3 percentage points of each other on the aggregate rubric score is reported consistently in both cited Rogo sources [1, 2]. The best single model scored 58.8% on the rubric [2]; whether Opus 4.7 or GPT-5.5 ranks first depends on the sub-task.

Critically, no single model leads across the full dataset. The sub-task breakdown reveals distinct capability profiles:

GPT-5.5 is strongest on capital structure and M&A analysis [1]
Claude Sonnet 4.6 leads on earnings quality and financial statement analysis [1]
Claude Opus 4.7 is strongest on private capital and forecasting tasks [1]

This pattern — each model leading in different domains while tying on aggregate — is the structural signature of genuine convergence rather than one model simply being weaker overall.

The Routing Insight

Rogo's data contains a finding that goes beyond the convergence narrative: a coarse router that selects models by workflow type adds +4.5 percentage points over the best single model on the rubric score [2]. A best-of-ten oracle (selecting the best available model response per question) adds +13.2 percentage points [2]. This demonstrates that the models are not interchangeable — they have meaningfully different error distributions — but that no single model captures all of that available performance. The practical implication for financial institutions is that ensemble or routing architectures outperform any single frontier model by a margin that dwarfs the inter-model differences at the top.
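The relationship between the three numbers (best single model, coarse router, per-question oracle) can be sketched on toy data. Everything below is illustrative: the model names, workflow labels, and scores are invented, and the router is fitted and evaluated on the same questions, which a real evaluation would avoid. It is not Rogo's method, only a minimal instance of the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question rubric scores (0..1): (workflow type, {model: score}).
results = [
    ("m_and_a",     {"model_a": 0.70, "model_b": 0.50, "model_c": 0.55}),
    ("m_and_a",     {"model_a": 0.60, "model_b": 0.65, "model_c": 0.50}),
    ("earnings",    {"model_a": 0.50, "model_b": 0.70, "model_c": 0.55}),
    ("earnings",    {"model_a": 0.55, "model_b": 0.60, "model_c": 0.65}),
    ("forecasting", {"model_a": 0.50, "model_b": 0.55, "model_c": 0.70}),
    ("forecasting", {"model_a": 0.65, "model_b": 0.50, "model_c": 0.60}),
]
MODELS = ["model_a", "model_b", "model_c"]


def best_single(rows):
    """Mean score of the one model that is best on aggregate."""
    return max(mean(scores[m] for _, scores in rows) for m in MODELS)


def coarse_router(rows):
    """Pick one model per workflow type, then score every question with that pick."""
    by_type = defaultdict(list)
    for workflow, scores in rows:
        by_type[workflow].append(scores)
    total = 0.0
    for group in by_type.values():
        pick = max(MODELS, key=lambda m: mean(s[m] for s in group))
        total += sum(s[pick] for s in group)
    return total / len(rows)


def oracle(rows):
    """Upper bound: take the best model response on every single question."""
    return mean(max(scores.values()) for _, scores in rows)


print(f"best single: {best_single(results):.3f}")   # 0.592
print(f"router:      {coarse_router(results):.3f}")  # 0.650
print(f"oracle:      {oracle(results):.3f}")         # 0.675
```

The ordering best single < router < oracle appears whenever models tie on aggregate but fail on different questions, which is the pattern the BFB figures describe.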

2. The Artificial Analysis Intelligence Index: Current Rankings with Exact Figures

Index Composition and Methodology

The Artificial Analysis Intelligence Index is a composite of approximately ten challenging evaluations spanning mathematics, scientific reasoning, coding, agentic tasks, and language comprehension, with a text-only English focus [7, 8]. As of the v4.0 methodology update (March 2026), the index includes GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt [7, 3, 4].

Current Rankings (Late May 2026)

The source evidence contains a direct conflict on the #1 position that must be presented transparently:

Position A (supported by Artificial Analysis's own published analysis): Claude Opus 4.8 (max effort / adaptive reasoning) holds the #1 position with a score of 61.4, placing it 1.2 points ahead of GPT-5.5 (xhigh) at 60.2, and 4.1 points above its predecessor Claude Opus 4.7 [3, 4, 5].

Position B (reflected in some pre-May-28 snapshots): GPT-5.5 (xhigh) held the #1 position at 60.2 before Opus 4.8's launch [7, 8].

The resolution is chronological: both positions are correct for their respective dates. GPT-5.5 led the index until May 28, 2026, when Claude Opus 4.8 launched and displaced it. The current ranked order is:

Rank	Model	AA Intelligence Index Score
1	Claude Opus 4.8 (max)	61.4
2	GPT-5.5 (xhigh)	60.2
3	GPT-5.5 (high)	~59
4	Claude Opus 4.7 (adaptive)	57.3
5	Gemini 3.1 Pro Preview	57.2

[3, 4, 5, 7, 8]

The 1.2-point gap between Opus 4.8 and GPT-5.5 (xhigh) is the widest lead any single model has held on this index in recent months. Earlier in 2026, Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 were effectively tied at 57 [3, 4].

Sub-Benchmark Differentiation

The aggregate score conceals meaningful variation at the component level. GPT-5.5 retains the lead on Terminal-Bench 2.1 with a score of 78.2% versus Opus 4.8's 74.6% — a 3.6-point gap [3, 4]. This is a narrowing from the 12.1-point gap that existed when Opus 4.7 was the Anthropic flagship [9]. Gemini 3.1 Pro leads on scientific reasoning with 94.3% on GPQA Diamond [3]. On GDPval-AA (a real-world task benchmark), Opus 4.8 scored 1,890 Elo, implying an approximately 67% win rate against GPT-5.5 [10].
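The link between an Elo gap and a win rate can be checked with the standard logistic Elo formula on a 400-point scale. This is an assumption: the text does not state which conversion Artificial Analysis uses. The GPT-5.5 rating of about 1,750 is the approximate figure from this report's summary comparison table.

```python
import math


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win rate of A over B under the standard 400-point logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_gap_for_win_rate(p: float) -> float:
    """Elo gap that corresponds to win probability p (inverse of the formula above)."""
    return 400.0 * math.log10(p / (1.0 - p))


# 1,890 vs an approximate 1,750: roughly a 69% expected win rate.
print(f"{elo_expected_score(1890, 1750):.3f}")

# A 67% win rate corresponds to a gap of roughly 123 Elo points.
print(f"{elo_gap_for_win_rate(0.67):.1f}")
```

Under this assumed formula, the reported 67% win rate and the approximate 140-point gap are roughly consistent; the small difference is within what the rounded "~1,750" figure allows.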

On SWE-bench Verified (software engineering), six models now score within 0.8 points of each other, with three of those models having launched within the past five weeks [7, 11, 12].

3. LMArena: Human Preference Leaderboard

Current State

LMArena's human-preference leaderboard, which aggregates pairwise blind comparisons into Elo ratings, presents the clearest evidence for convergence. The top models cluster in the 1500–1506 Elo range [13]. Available snapshots show:

GPT-5.5-high: ~1506 Elo
Claude Opus 4.7 Thinking: ~1505 Elo
Gemini 3.1 Pro: ~1505 Elo

[13]

One source reports Claude Opus 4.6 holding the headline #1 position on the LMArena Text leaderboard at Elo 1504, with Gemini 3.1 Pro Preview sitting within overlapping 95% confidence intervals [6]. The critical statistical point: an Elo gap is only meaningful when the two models' 95% confidence intervals do not overlap, that is, when the gap exceeds the sum of the two interval half-widths, a threshold typically around 18–22 Elo points [6]. The current top-three spread of approximately 1–5 Elo points falls well below this threshold, so the top three models on LMArena are statistically tied [6].
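The tie test described above can be written as a short check. The sketch reads the threshold as the sum of the two models' interval half-widths. The Elo values are the snapshot figures listed earlier in this section, while the half-widths of 10 points are hypothetical, chosen only so that their sum falls in the 18–22 range the text mentions.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Rating:
    model: str
    elo: float
    ci_half_width: float  # half-width of the 95% interval (hypothetical values below)


def is_statistical_tie(a: Rating, b: Rating) -> bool:
    """True when the Elo gap does not exceed the combined interval half-widths."""
    return abs(a.elo - b.elo) <= a.ci_half_width + b.ci_half_width


top_three = [
    Rating("GPT-5.5-high", 1506, 10),
    Rating("Claude Opus 4.7 Thinking", 1505, 10),
    Rating("Gemini 3.1 Pro", 1505, 10),
]

for a, b in combinations(top_three, 2):
    verdict = "tie" if is_statistical_tie(a, b) else "separated"
    print(f"{a.model} vs {b.model}: gap {abs(a.elo - b.elo):.0f} -> {verdict}")
```

With these assumed interval widths every pair is reported as a tie; a gap would need to exceed 20 points before the check treats two models as separated.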

LMArena snapshots available as of this writing do not yet include Claude Opus 4.8, which launched May 28, 2026 [6]. Once Opus 4.8 is added, its GDPval-AA result (1,890 Elo, ~67% win rate against GPT-5.5) suggests it may break from the current cluster, although GDPval-AA measures real-world task performance rather than human preference [10].

Stanford AI Index Corroboration

The Stanford 2026 AI Index independently confirms the LMArena picture: as of March 2026, the top closed models from Anthropic, xAI, Google, and OpenAI clustered within 25 Elo points of each other [14]. This is the tightest clustering the index has recorded.

4. Has Frontier Capability Genuinely Commoditized or Plateaued?

The Convergence Case

The evidence for convergence is strong and multi-dimensional:

Benchmark saturation: At least 14 distinct models now score above 90% on MMLU, a benchmark where GPT-4 led with approximately 86% in 2023 [14]. The top 15 models score in the 90–94% range on many standard tasks [15]. High-level benchmarks are effectively saturated for frontier models.

Finance domain parity: The BFB's sub-0.3pp spread across 928 finance-specific questions, scored against 15,656 rubric criteria, is the most domain-specific evidence of convergence available [1, 2]. This is not a general-purpose benchmark where prompt sensitivity might explain the gap — it is a carefully constructed professional evaluation.

Human preference ties: The LMArena statistical tie among the top three models means that real users, in blind pairwise comparisons, cannot reliably distinguish between GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro [6].

Rapid catch-up dynamics: The 12.1-point Terminal-Bench gap between GPT-5.5 and Claude Opus 4.7 narrowed to 3.6 points with Opus 4.8 in a single model generation [3, 9]. This rate of catch-up suggests that any lead one lab opens is quickly closed.

Cost collapse as a convergence signal: API costs for frontier-quality models fell roughly 80% between 2025 and early 2026 [16]. Over the longer run the drop is steeper: models that cost $0.06 per 1,000 tokens in 2023 now run below $0.002, a reduction of more than 96% [16]. This commoditization of inference pricing is consistent with an underlying commoditization of capability, since providers cannot sustain premium pricing when competitors match performance.
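The per-token price figures in the last item cover a longer window than the 80% figure, so it helps to work the arithmetic explicitly. The sketch below assumes a three-year span (2023 to 2026) for the annualized rate. Because the text gives the current price only as "below $0.002", both results are lower bounds on the actual decline.

```python
def pct_reduction(old_price: float, new_price: float) -> float:
    """Total percentage drop from old_price to new_price."""
    return (old_price - new_price) / old_price * 100.0


def annualized_decline(old_price: float, new_price: float, years: float) -> float:
    """Constant yearly percentage drop that compounds to the same total change."""
    return (1.0 - (new_price / old_price) ** (1.0 / years)) * 100.0


long_run = pct_reduction(0.06, 0.002)
per_year = annualized_decline(0.06, 0.002, years=3)  # assumes 2023 -> 2026

print(f"{long_run:.1f}% total, {per_year:.1f}% per year")  # 96.7% total, 67.8% per year
```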

The Differentiation Case

The convergence narrative is real but incomplete. Several dimensions of genuine differentiation persist:

Domain-specific leadership: The BFB sub-task breakdown (GPT-5.5 on M&A, Sonnet 4.6 on earnings quality, Opus 4.7 on forecasting) is not noise; it points to distinct capability profiles [1]. Similarly, Gemini 3.1 Pro's 94.3% GPQA Diamond score represents a genuine scientific reasoning advantage, and GPT-5.5's Terminal-Bench lead reflects real agentic capability differences [3].

Opus 4.8's current lead: The 1.2-point gap on the Artificial Analysis Intelligence Index and the ~67% GDPval-AA win rate against GPT-5.5 are the strongest evidence that one model — Claude Opus 4.8 — is currently the best overall on the most demanding composite evaluation [3, 4, 10]. This is a narrow lead, but it is real and consistent across multiple sub-benchmarks within the index.

Agentic and coding differentiation: Claude Code now accounts for roughly 4% of all public GitHub commits [3], a market-share signal that reflects real-world utility differentiation beyond benchmark scores. Six models score within 0.8 points on SWE-bench Verified, but the distribution of actual developer adoption is far less even [7, 11, 12].

Open-weights gap: Kimi K2.6 leads among open-weight models at 90.5% GPQA Diamond [7, 11, 12], but this still trails the closed-model frontier, indicating that the convergence is primarily among the top-tier closed labs rather than across the full model ecosystem.

5. The Training Spend Question: What Trillions Have Bought

Capital Expenditure Scale

The four major hyperscalers — Google, Amazon, Microsoft, and Meta — plan combined 2026 capital expenditure of $725 billion, a 77% increase over the prior year's record $410 billion [17, 18]. Goldman Sachs models project $765 billion in annual AI CapEx in 2026, rising to $1.6 trillion annually by 2031, implying roughly $7.6 trillion in cumulative CapEx over that period [19]. OpenAI reportedly spent on the order of $1.5–3 billion training GPT-5 alone [20].

What the Spend Has Produced

The relationship between training spend and benchmark differentiation has weakened substantially. The evidence suggests that massive parallel investment by multiple well-resourced labs has produced a situation where each new frontier model closes the gap opened by the previous generation's leader within months. The 12.1-point Terminal-Bench gap that GPT-5.5 held over Opus 4.7 was reduced to 3.6 points by Opus 4.8 in a single training cycle [3, 9].

The most significant structural shift in 2026 spending is the pivot from training to inference workloads [16, 19]. This reallocation reflects a recognition that the marginal return on additional training compute for general capability is declining, while the return on inference optimization — faster responses, lower costs, longer context, better tool use — remains high. The 80% cost reduction in frontier API pricing between 2025 and early 2026 is partly a consequence of this inference-side investment [16].

The academic literature on training costs confirms the underlying dynamic: as multiple labs converge on similar architectures, data mixtures, and training objectives, the capability gains from additional compute become more incremental [16]. OpenAI's decision to rebuild GPT-5.5's architecture, pretraining corpus, and objectives from scratch — the first such full rebuild since GPT-4.5 — was explicitly motivated by the need to find new scaling surfaces rather than simply adding compute to existing approaches [3].

6. Verdict: Is There a Single Best Model?

The Honest Answer

As of May 31, 2026, Claude Opus 4.8 is the strongest single model on the most rigorous composite evaluation available — the Artificial Analysis Intelligence Index, where it scores 61.4 versus GPT-5.5's 60.2 [3, 4, 5]. Its GDPval-AA Elo of 1,890 implies a ~67% win rate against GPT-5.5 on real-world task performance [10]. These are the strongest quantitative claims for a single model's superiority currently in evidence.

However, this verdict comes with four important qualifications:

The lead is narrow and recent: Opus 4.8 launched May 28, 2026 — three days ago. The 1.2-point AA Index lead is the widest any model has held in months, but it is not a commanding margin. GPT-5.5 retains the Terminal-Bench 2.1 lead (78.2% vs. 74.6%) [3].
The Rogo finding remains valid for its scope: On finance-specific tasks evaluated before Opus 4.8's launch, the three-way tie among Opus 4.7, GPT-5.5, and Sonnet 4.6 was real and documented [1, 2]. Rogo's conclusion that "there is no single best model" accurately described the state of play on May 27, 2026, and remains accurate for the finance domain specifically — Opus 4.8 was not included in the BFB evaluation.
LMArena has not yet incorporated Opus 4.8: The human-preference leaderboard still shows a statistical three-way tie [6]. When Opus 4.8 is added, the picture may shift, but it may also confirm that human raters cannot distinguish the models in blind comparison even when benchmark scores diverge.
Domain choice determines the winner: For scientific reasoning, Gemini 3.1 Pro's 94.3% GPQA Diamond is the strongest available result [3]. For terminal/agentic tasks, GPT-5.5's 78.2% Terminal-Bench score leads [3]. For finance workflows, the three-way tie persists [1, 2]. The "best model" question is only answerable once the use case is specified.

Summary Comparison Table

Dimension	Claude Opus 4.8	GPT-5.5 (xhigh)	Gemini 3.1 Pro	Claude Sonnet 4.6
AA Intelligence Index	61.4 (#1)	60.2 (#2)	57.2 (#5)	N/A
Terminal-Bench 2.1	74.6%	78.2% (#1)	~75%	N/A
GPQA Diamond	~91%	~90%	94.3% (#1)	N/A
GDPval-AA Elo	1,890	~1,750	N/A	N/A
LMArena Elo (text)	Not yet rated	~1506	~1505	N/A
Rogo BFB (finance)	Not tested	~58.8% (tied)	N/A	~58.8% (tied)
Finance: M&A/Capital Structure	N/A	Leads	N/A	N/A
Finance: Earnings Quality	N/A	N/A	N/A	Leads
Finance: Forecasting	Predecessor led	N/A	N/A	N/A

[1, 2, 3, 4, 6, 10]

The Structural Conclusion

Frontier AI capability has converged to a degree that was not true eighteen months ago. The Stanford AI Index's finding that top closed models cluster within 25 Elo points [14], the BFB's sub-0.3pp finance spread [1, 2], and the LMArena statistical tie [6] all point to a landscape where the choice of frontier model matters far less than it once did for most applications. The differentiation that remains is real but domain-specific, and it shifts with each new model release.

The strongest evidence against full commoditization is Opus 4.8's 1.2-point AA Index lead and its 67% GDPval-AA win rate — both suggesting that Anthropic has, at least momentarily, opened a gap that is larger than the statistical noise [3, 4, 10]. Whether that gap persists through the next OpenAI or Google release is the central empirical question of the next 60–90 days.

References

[1] Introducing the Big Finance Benchmark. rogo.ai. https://rogo.ai/news/introducing-the-big-finance-benchmark

[2] Rogo (@RogoAI) post. x.com. https://x.com/RogoAI/status/2059743405203480888

[3] Claude Opus 4.8 - The new #1 AI model. artificialanalysis.ai. https://artificialanalysis.ai/articles/claude-opus-4-8-analysis-and-benchmarks

[4] Claude Opus 4.8 (max) - Intelligence, Performance & Price Analysis. artificialanalysis.ai. https://artificialanalysis.ai/models/claude-opus-4-8

[5] Claude Opus 4.8 Tops Artificial Analysis Intelligence Index, Edges Out GPT 5.5 With Score Of 61.4. officechai.com. https://officechai.com/ai/claude-opus-4-8-tops-artificial-analysis-intelligence-index-edges-out-gpt-5-5-with-score-of-61-4

[6] Arena Leaderboard - a Hugging Face Space by lmarena-ai. huggingface.co. https://huggingface.co/spaces/lmarena-ai/arena-leaderboard

[7] Artificial Analysis Intelligence Index. artificialanalysis.ai. https://artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index

[8] Models. artificialanalysis.ai. https://artificialanalysis.ai/leaderboards/models

[9] Claude Opus 4.7 (max) - Intelligence, Performance & Price Analysis. artificialanalysis.ai. https://artificialanalysis.ai/models/claude-opus-4-7

[10] Claude Opus 4.8 Beats GPT 5.5 On GDPval-AA Benchmark For Real World Tasks. officechai.com. https://officechai.com/ai/claude-opus-4-8-beats-gpt-5-5-on-gdpval-aa-benchmark-for-real-world-tasks

[11] AI Model & API Providers Analysis | Artificial Analysis. artificialanalysis.ai. https://artificialanalysis.ai

[12] Artificial Analysis. artificialanalysis.ai. https://artificialanalysis.ai/evaluations

[13] Chatbot Arena. openlm.ai. https://openlm.ai/chatbot-arena

[14] Technical Performance. hai.stanford.edu. https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance

[15] LM Leaderboard May 2026 | LMArena Elo, Pricing & Open Weights. swfte.com. https://swfte.com/ai/lm/leaderboard

[16] The rising costs of training frontier AI models. arxiv.org. https://arxiv.org/html/2405.21015v1

[17] Big Tech is about to spend $700 billion on AI this year. No one knows where the buildout ends. | Fortune. fortune.com. https://fortune.com/2026/04/30/big-tech-hyperscalers-will-spend-700-billion-on-ai-infrastructure-this-year-with-no-clear-end-in-sight-eye-on-ai

[18] Skyrocketing component prices push Big Tech capex to record $725 billion — Microsoft alone attributes $25 billion of AI budget to increased memory and chip costs | Tom's Hardware. tomshardware.com. https://tomshardware.com/tech-industry/big-tech/microsoft-attributed-25-billion-of-its-record-ai-budget-to-memory-chip-costs

[19] Tracking Trillions: The Assumptions Shaping the Scale of the AI Build-Out | Goldman Sachs. goldmansachs.com. https://goldmansachs.com/insights/articles/tracking-trillions-the-assumptions-shaping-scale-of-the-ai-build-out

[20] GPT-5 — Training cost, GPU hours & cluster size | BtMData. btmdata.com. https://btmdata.com/ai-training/gpt-5



