Executive Summary
- Rogo's "no single best model" claim is substantially accurate but temporally bounded: on the Big Finance Bench published May 27, 2026, Claude Opus 4.7, GPT-5.5, and Claude Sonnet 4.6 were separated by less than 0.3 percentage points on a rubric-graded aggregate score, with the best single model reaching 58.8% [1, 2].
- The Artificial Analysis Intelligence Index currently places Claude Opus 4.8 at #1 with a score of 61.4, edging GPT-5.5 (xhigh) by 1.2 points — a meaningful but narrow margin that emerged only after Opus 4.8 launched on May 28, 2026, one day after Rogo's publication [3, 4, 5].
- On LMArena's human-preference leaderboard, the top three models remain a statistical tie, with overlapping 95% confidence intervals; no model holds a statistically meaningful Elo lead [6].
- Frontier capability has genuinely converged but has not fully plateaued: the evidence shows near-parity on broad aggregates, persistent domain-specific differentiation, and one model (Opus 4.8) that has just opened a modest but real gap on the most demanding composite index.
- AI capital expenditure running at hundreds of billions of dollars a year, and projected to total trillions by 2031, has produced diminishing returns on general benchmarks while accelerating cost reduction; the real differentiation frontier has shifted to inference efficiency, agentic performance, and specialized domain tasks.
1. The Rogo Big Finance Bench: What It Actually Measured and What It Found
The Benchmark's Design
Rogo published its Big Finance Bench (BFB) on May 27, 2026 [1, 2]. The benchmark comprises 928 questions written by former finance practitioners, scored against 15,656 rubric criteria representing 36,241 weighted points [2]. This scale matters for interpreting the "0.3 percentage point" claim: the evaluation is not a simple accuracy metric but a rubric-graded score that awards partial credit for reasoning quality, source identification, and the analytical chain, not merely for final numerical answers [2].
The rubric-grading methodology was deliberately chosen because strict final-answer scoring collapses near-successes into the same category as unsupported guesses. A model that correctly identifies the relevant SEC filing, applies the right accounting treatment, and follows sound analytical logic but makes a minor arithmetic error in the last step receives meaningful partial credit under rubric scoring but zero credit under binary accuracy [2]. This design choice makes the BFB more diagnostic of real-world financial analyst utility than standard benchmark accuracy.
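The contrast between rubric scoring and binary accuracy can be illustrated with a minimal sketch. The criteria, weights, and the `Criterion` type below are hypothetical and are not taken from the BFB rubric; the sketch only assumes that the score is the share of weighted points earned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: int   # points this criterion is worth (hypothetical values)
    met: bool     # whether the grader judged the criterion satisfied


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of weighted rubric points earned, from 0.0 to 1.0."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total if total else 0.0


def binary_score(final_answer_correct: bool) -> float:
    """Strict final-answer scoring: all or nothing."""
    return 1.0 if final_answer_correct else 0.0


# Hypothetical response: right filing, right treatment, sound logic,
# but an arithmetic slip in the last step.
response = [
    Criterion("Identifies the relevant SEC filing", 3, True),
    Criterion("Applies the correct accounting treatment", 3, True),
    Criterion("Follows a sound analytical chain", 2, True),
    Criterion("Reports the correct final figure", 2, False),
]

partial = rubric_score(response)   # 8 of 10 weighted points
strict = binary_score(False)       # final figure was wrong
print(f"rubric: {partial:.0%}, binary: {strict:.0%}")
```

Under these illustrative weights the same response earns most of the rubric credit but none of the binary credit, which is the distinction the paragraph above describes.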
Ten frontier models were evaluated in total [2].
The Core Finding: Sub-0.3pp Convergence
The claim that Claude Opus 4.7, GPT-5.5, and Claude Sonnet 4.6 sit within approximately 0.3 percentage points of each other on aggregate appears in both Rogo's announcement and its accompanying post [1, 2]. Both are first-party sources, so the figure is documented but not independently replicated. The best single model, either Opus 4.7 or GPT-5.5 depending on the sub-task, scored 58.8% on the rubric [2].
Critically, no single model leads across the full dataset. The sub-task breakdown reveals distinct capability profiles:
- GPT-5.5 is strongest on capital structure and M&A analysis [1]
- Claude Sonnet 4.6 leads on earnings quality and financial statement analysis [1]
- Claude Opus 4.7 is strongest on private capital and forecasting tasks [1]
This pattern — each model leading in different domains while tying on aggregate — is the structural signature of genuine convergence rather than one model simply being weaker overall.
The Routing Insight
Rogo's data contains a finding that goes beyond the convergence narrative: a coarse router that selects models by workflow type adds +4.5 percentage points over the best single model on the rubric score [2]. A best-of-ten oracle (selecting the best available model response per question) adds +13.2 percentage points [2]. This demonstrates that the models are not interchangeable — they have meaningfully different error distributions — but that no single model captures all of that available performance. The practical implication for financial institutions is that ensemble or routing architectures outperform any single frontier model by a margin that dwarfs the inter-model differences at the top.
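The three quantities compared above (best single model, coarse router, per-question oracle) can be sketched on toy data. All scores, model names, and workflow labels below are hypothetical and chosen only so that the models tie on aggregate while leading different workflows; they are not Rogo's data, and the routing rule is an assumed simplification of whatever Rogo used.

```python
from statistics import mean

# Hypothetical per-question rubric scores (0 to 1): model -> workflow -> scores.
scores = {
    "model_a": {"m_and_a": [0.74, 0.62], "earnings": [0.52, 0.54], "forecasting": [0.54, 0.56]},
    "model_b": {"m_and_a": [0.50, 0.66], "earnings": [0.66, 0.68], "forecasting": [0.50, 0.52]},
    "model_c": {"m_and_a": [0.52, 0.56], "earnings": [0.54, 0.50], "forecasting": [0.72, 0.68]},
}
workflows = ["m_and_a", "earnings", "forecasting"]


def overall(model: str) -> float:
    """Aggregate score of one model across every question."""
    return mean(s for wf in workflows for s in scores[model][wf])


# Best single model: one model answers everything.
best_single = max(overall(m) for m in scores)

# Coarse router: each workflow goes to the model with the best mean on it.
# Caveat: picking routes on the same data used for scoring is optimistic;
# a real router would be chosen on held-out questions.
routed = []
for wf in workflows:
    winner = max(scores, key=lambda m: mean(scores[m][wf]))
    routed.extend(scores[winner][wf])
router_score = mean(routed)

# Oracle: for each question, keep the best response from any model.
oracle = []
for wf in workflows:
    per_question = zip(*(scores[m][wf] for m in scores))
    oracle.extend(max(q) for q in per_question)
oracle_score = mean(oracle)

print(f"best single {best_single:.3f}, router {router_score:.3f}, oracle {oracle_score:.3f}")
```

In this toy setup the three models are tied on aggregate, yet the router and the oracle both score higher, because the models' errors fall on different questions. That is the same qualitative pattern as the +4.5 and +13.2 point gains reported above, though the magnitudes here are arbitrary.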
2. The Artificial Analysis Intelligence Index: Current Rankings with Exact Figures
Index Composition and Methodology
The Artificial Analysis Intelligence Index is a composite of approximately ten challenging evaluations spanning mathematics, scientific reasoning, coding, agentic tasks, and language comprehension, with a text-only English focus [7, 8]. As of the v4.0 methodology update (March 2026), the index includes GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt [7, 3, 4].
Current Rankings (Late May 2026)
The sources appear to conflict on the #1 position, so both readings are set out here:
Position A (supported by Artificial Analysis's own published analysis): Claude Opus 4.8 (max effort / adaptive reasoning) holds the #1 position with a score of 61.4, placing it 1.2 points ahead of GPT-5.5 (xhigh) at 60.2, and 4.1 points above its predecessor Claude Opus 4.7 [3, 4, 5].
Position B (reflected in some pre-May-28 snapshots): GPT-5.5 (xhigh) held the #1 position at 60.2 before Opus 4.8's launch [7, 8].
The resolution is chronological: both positions are correct for their respective dates. GPT-5.5 led the index until May 28, 2026, when Claude Opus 4.8 launched and displaced it. The current ranked order is:
| Rank | Model | AA Intelligence Index Score |
|---|---|---|
| 1 | Claude Opus 4.8 (max) | 61.4 |
| 2 | GPT-5.5 (xhigh) | 60.2 |
| 3 | GPT-5.5 (high) | ~59 |
| 4 | Claude Opus 4.7 (adaptive) | 57.3 |
| 5 | Gemini 3.1 Pro Preview | 57.2 |
[3, 4, 5, 7, 8]
The 1.2-point gap between Opus 4.8 and GPT-5.5 (xhigh) is the widest lead any single model has held on this index in recent months. Earlier in 2026, Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 were effectively tied at 57 [3, 4].
Sub-Benchmark Differentiation
The aggregate score conceals meaningful variation at the component level. GPT-5.5 retains the lead on Terminal-Bench 2.1 with a score of 78.2% versus Opus 4.8's 74.6% — a 3.6-point gap [3, 4]. This is a narrowing from the 12.1-point gap that existed when Opus 4.7 was the Anthropic flagship [9]. Gemini 3.1 Pro leads on scientific reasoning with 94.3% on GPQA Diamond [3]. On GDPval-AA (a real-world task benchmark), Opus 4.8 scored 1,890 Elo, implying an approximately 67% win rate against GPT-5.5 [10].
On SWE-bench Verified (software engineering), six models now score within 0.8 points of each other, with three of those models having launched within the past five weeks [7, 11, 12].
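The link between an Elo gap and a win rate, as in the GDPval-AA figure above, can be checked with a short sketch. It assumes the conventional logistic Elo formula with a 400-point scale; the source does not state which scale GDPval-AA uses, so the `scale` parameter and both function names are assumptions.

```python
import math


def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def implied_gap(p: float, scale: float = 400.0) -> float:
    """Rating gap (A minus B) that corresponds to win probability p for A."""
    return scale * math.log10(p / (1.0 - p))


# Gap implied by a 67% win rate, assuming the 400-point scale.
gap = implied_gap(0.67)

# Round trip: a model rated 1,890 against an opponent that many points lower.
check = win_probability(1890, 1890 - gap)
print(f"implied gap: {gap:.0f} Elo points, round-trip win rate: {check:.2f}")
```

Under this assumption a 67% win rate corresponds to a gap of roughly 120 Elo points. If GDPval-AA uses a different scale or a different rating model, the implied gap changes accordingly.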
3. LMArena: Human Preference Leaderboard
Current State
LMArena's human-preference leaderboard, which aggregates pairwise blind comparisons into Elo ratings, presents the clearest evidence for convergence. The top models cluster in the 1500–1506 Elo range [13]. Available snapshots show:
- GPT-5.5-high: ~1506 Elo
- Claude Opus 4.7 Thinking: ~1505 Elo
- Gemini 3.1 Pro: ~1505 Elo
[13]
One source reports Claude Opus 4.6 holding the headline #1 position on the LMArena Text leaderboard at Elo 1504, with Gemini 3.1 Pro Preview sitting within overlapping 95% confidence intervals [6]; this differs from the snapshot listed above, so the exact ordering depends on which snapshot is read. The critical statistical point: an Elo gap is only meaningful when it exceeds the combined margins of error of the two models' 95% confidence intervals, a threshold typically around 18–22 Elo points [6]. The current top-three spread of approximately 1–5 Elo points falls well below this threshold, meaning the top three models on LMArena are a statistical tie [6].
LMArena snapshots available as of this writing do not yet reflect Claude Opus 4.8, which launched May 28, 2026 [6]. Once Opus 4.8 is incorporated, its GDPval-AA performance (1,890 Elo, ~67% win rate against GPT-5.5) suggests it may break from the current cluster [10], although GDPval-AA and LMArena ratings come from different comparison pools and are not on a shared scale.
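The overlap rule used above to call the top three a tie can be sketched as follows. The Elo values are the approximate snapshot figures quoted in this section; the confidence-interval half-widths are hypothetical, chosen only to be consistent with the 18–22 point threshold mentioned above, and the `Rating` type is introduced here for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rating:
    name: str
    elo: float
    ci_half_width: float  # half-width of the 95% interval (hypothetical values)


def intervals_overlap(a: Rating, b: Rating) -> bool:
    """True when the two 95% intervals overlap, i.e. the gap is not clearly meaningful."""
    return abs(a.elo - b.elo) <= a.ci_half_width + b.ci_half_width


top_three = [
    Rating("GPT-5.5-high", 1506, 10),
    Rating("Claude Opus 4.7 Thinking", 1505, 10),
    Rating("Gemini 3.1 Pro", 1505, 10),
]

# A statistical tie here means every pair of intervals overlaps.
all_tied = all(intervals_overlap(a, b) for a, b in combinations(top_three, 2))
print("top three statistically tied:", all_tied)
```

Interval overlap is a conservative heuristic, not a formal significance test, but it matches the rule of thumb described in the text: gaps of a few Elo points are far too small to separate these models.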
Stanford AI Index Corroboration
The Stanford 2026 AI Index independently confirms the LMArena picture: as of March 2026, the top closed models from Anthropic, xAI, Google, and OpenAI clustered within 25 Elo points of each other [14]. This is the tightest clustering the index has recorded.
4. Has Frontier Capability Genuinely Commoditized or Plateaued?
The Convergence Case
The evidence for convergence is strong and multi-dimensional:
Benchmark saturation: At least 14 distinct models now score above 90% on MMLU, a benchmark where GPT-4 led with approximately 86% in 2023 [14]. The top 15 models score in the 90–94% range on many standard tasks [15]. High-level benchmarks are effectively saturated for frontier models.
Finance domain parity: The BFB's sub-0.3pp spread across 928 finance-specific questions, scored against 15,656 rubric criteria, is the most domain-specific evidence of convergence available [1, 2]. This is not a general-purpose benchmark where prompt sensitivity might explain the gap — it is a carefully constructed professional evaluation.
Human preference ties: The LMArena statistical tie among the top three models means that real users, in blind pairwise comparisons, cannot reliably distinguish between GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro [6].
Rapid catch-up dynamics: The 12.1-point Terminal-Bench gap between GPT-5.5 and Claude Opus 4.7 narrowed to 3.6 points with Opus 4.8 in a single model generation [3, 9]. This rate of catch-up suggests that any lead one lab opens is quickly closed.
Cost collapse as a convergence signal: API costs for frontier-quality models fell roughly 80% between 2025 and early 2026, with models that cost $0.06 per 1,000 tokens in 2023 now running below $0.002 [16]. This commoditization of inference pricing reflects the underlying commoditization of capability — providers cannot sustain premium pricing when competitors match performance.
The Differentiation Case
The convergence narrative is real but incomplete. Several dimensions of genuine differentiation persist:
Domain-specific leadership: The BFB sub-task breakdown (GPT-5.5 on M&A, Sonnet 4.6 on earnings quality, Opus 4.7 on forecasting) is not noise; it points to distinct capability profiles [1], although the benchmark alone does not show whether the cause is architectural. Similarly, Gemini 3.1 Pro's 94.3% GPQA Diamond score represents a genuine scientific reasoning advantage, and GPT-5.5's Terminal-Bench lead reflects real agentic capability differences [3].
Opus 4.8's current lead: The 1.2-point gap on the Artificial Analysis Intelligence Index and the ~67% GDPval-AA win rate against GPT-5.5 are the strongest evidence that one model — Claude Opus 4.8 — is currently the best overall on the most demanding composite evaluation [3, 4, 10]. This is a narrow lead, but it is real and consistent across multiple sub-benchmarks within the index.
Agentic and coding differentiation: Claude Code now accounts for roughly 4% of all public GitHub commits [3], a market-share signal that reflects real-world utility differentiation beyond benchmark scores. Six models score within 0.8 points on SWE-bench Verified, but the distribution of actual developer adoption is far less even [7, 11, 12].
Open-weights gap: Kimi K2.6 leads among open-weight models at 90.5% GPQA Diamond [7, 11, 12], but this still trails the closed-model frontier, indicating that the convergence is primarily among the top-tier closed labs rather than across the full model ecosystem.
5. The Training Spend Question: What Trillions Have Bought
Capital Expenditure Scale
The four major hyperscalers — Google, Amazon, Microsoft, and Meta — plan combined 2026 capital expenditure of $725 billion, a 77% increase over the prior year's record $410 billion [17, 18]. Goldman Sachs models project $765 billion in annual AI CapEx in 2026, rising to $1.6 trillion annually by 2031, implying roughly $7.6 trillion in cumulative CapEx over that period [19]. OpenAI reportedly spent on the order of $1.5–3 billion training GPT-5 alone [20].
What the Spend Has Produced
The relationship between training spend and benchmark differentiation has weakened substantially. The evidence suggests that massive parallel investment by multiple well-resourced labs has produced a situation where each new frontier model closes the gap opened by the previous generation's leader within months. The 12.1-point Terminal-Bench gap that GPT-5.5 held over Opus 4.7 was reduced to 3.6 points by Opus 4.8 in a single training cycle [3, 9].
The most significant structural shift in 2026 spending is the pivot from training to inference workloads [16, 19]. This reallocation reflects a recognition that the marginal return on additional training compute for general capability is declining, while the return on inference optimization — faster responses, lower costs, longer context, better tool use — remains high. The 80% cost reduction in frontier API pricing between 2025 and early 2026 is partly a consequence of this inference-side investment [16].
The academic literature on training costs confirms the underlying dynamic: as multiple labs converge on similar architectures, data mixtures, and training objectives, the capability gains from additional compute become more incremental [16]. OpenAI's decision to rebuild GPT-5.5's architecture, pretraining corpus, and objectives from scratch — the first such full rebuild since GPT-4.5 — was explicitly motivated by the need to find new scaling surfaces rather than simply adding compute to existing approaches [3].
6. Verdict: Is There a Single Best Model?
The Honest Answer
As of May 31, 2026, Claude Opus 4.8 is the strongest single model on the most rigorous composite evaluation available — the Artificial Analysis Intelligence Index, where it scores 61.4 versus GPT-5.5's 60.2 [3, 4, 5]. Its GDPval-AA Elo of 1,890 implies a ~67% win rate against GPT-5.5 on real-world task performance [10]. These are the strongest quantitative claims for a single model's superiority currently in evidence.
However, this verdict comes with four important qualifications:
- The lead is narrow and recent: Opus 4.8 launched on May 28, 2026, three days ago. The 1.2-point AA Index lead is the widest any model has held in months, but it is not a commanding margin. GPT-5.5 retains the Terminal-Bench 2.1 lead (78.2% vs. 74.6%) [3].
- The Rogo finding remains valid for its scope: On finance-specific tasks evaluated before Opus 4.8's launch, the three-way tie among Opus 4.7, GPT-5.5, and Sonnet 4.6 was real and documented [1, 2]. Rogo's conclusion that "there is no single best model" accurately described the state of play on May 27, 2026. It has not been contradicted for the finance domain, but it has also not been retested: Opus 4.8 was not included in the BFB evaluation.
- LMArena has not yet incorporated Opus 4.8: The human-preference leaderboard still shows a statistical three-way tie [6]. When Opus 4.8 is added, the picture may shift, but it may also confirm that human raters cannot distinguish the models in blind comparison even when benchmark scores diverge.
- Domain choice determines the winner: For scientific reasoning, Gemini 3.1 Pro's 94.3% GPQA Diamond is the strongest available result [3]. For terminal/agentic tasks, GPT-5.5's 78.2% Terminal-Bench score leads [3]. For finance workflows, the three-way tie persists [1, 2]. The "best model" question is only answerable once the use case is specified.
Summary Comparison Table
| Dimension | Claude Opus 4.8 | GPT-5.5 (xhigh) | Gemini 3.1 Pro | Claude Sonnet 4.6 |
|---|---|---|---|---|
| AA Intelligence Index | 61.4 (#1) | 60.2 (#2) | 57.2 (#5) | N/A |
| Terminal-Bench 2.1 | 74.6% | 78.2% (#1) | ~75% | N/A |
| GPQA Diamond | ~91% | ~90% | 94.3% (#1) | N/A |
| GDPval-AA Elo | 1,890 | ~1,750 | N/A | N/A |
| LMArena Elo (text) | Not yet rated | ~1506 | ~1505 | N/A |
| Rogo BFB (finance) | Not tested | ~58.8% (tied) | N/A | ~58.8% (tied) |
| Finance: M&A/Capital Structure | N/A | Leads | N/A | N/A |
| Finance: Earnings Quality | N/A | N/A | N/A | Leads |
| Finance: Forecasting | Predecessor led | N/A | N/A | N/A |
[1, 2, 3, 4, 6, 10]
The Structural Conclusion
Frontier AI capability has converged to a degree that was not true eighteen months ago. The Stanford AI Index's finding that top closed models cluster within 25 Elo points [14], the BFB's sub-0.3pp finance spread [1, 2], and the LMArena statistical tie [6] all point to a landscape where the choice of frontier model matters far less than it once did for most applications. The differentiation that remains is real but domain-specific, and it shifts with each new model release.
The strongest evidence against full commoditization is Opus 4.8's 1.2-point AA Index lead and its ~67% GDPval-AA win rate, both of which suggest that Anthropic has, at least momentarily, opened a gap larger than statistical noise [3, 4, 10]. Whether that gap persists through the next OpenAI or Google release is the central empirical question of the next 60–90 days.
References
[1] Introducing the big finance benchmark. rogo.ai. https://rogo.ai/news/introducing-the-big-finance-benchmark
[2] Rogo (@RogoAI) post on X. x.com. https://x.com/RogoAI/status/2059743405203480888
[3] Claude Opus 4.8 - The new #1 AI model. artificialanalysis.ai. https://artificialanalysis.ai/articles/claude-opus-4-8-analysis-and-benchmarks
[4] Claude Opus 4.8 (max) - Intelligence, Performance & Price Analysis. artificialanalysis.ai. https://artificialanalysis.ai/models/claude-opus-4-8
[5] Claude Opus 4.8 Tops Artificial Analysis Intelligence Index, Edges Out GPT 5.5 With Score Of 61.4. officechai.com. https://officechai.com/ai/claude-opus-4-8-tops-artificial-analysis-intelligence-index-edges-out-gpt-5-5-with-score-of-61-4
[6] Arena Leaderboard - a Hugging Face Space by lmarena-ai. huggingface.co. https://huggingface.co/spaces/lmarena-ai/arena-leaderboard
[7] Artificial Analysis Intelligence Index. artificialanalysis.ai. https://artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index
[8] Models. artificialanalysis.ai. https://artificialanalysis.ai/leaderboards/models
[9] Claude Opus 4.7 (max) - Intelligence, Performance & Price Analysis. artificialanalysis.ai. https://artificialanalysis.ai/models/claude-opus-4-7
[10] Claude Opus 4.8 Beats GPT 5.5 On GDPval-AA Benchmark For Real World Tasks. officechai.com. https://officechai.com/ai/claude-opus-4-8-beats-gpt-5-5-on-gdpval-aa-benchmark-for-real-world-tasks
[11] AI Model & API Providers Analysis | Artificial Analysis. artificialanalysis.ai. https://artificialanalysis.ai
[12] Artificial Analysis. artificialanalysis.ai. https://artificialanalysis.ai/evaluations
[13] Chatbot arena. openlm.ai. https://openlm.ai/chatbot-arena
[14] Technical performance. hai.stanford.edu. https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance
[15] LM Leaderboard May 2026 | LMArena Elo, Pricing & Open Weights. swfte.com. https://swfte.com/ai/lm/leaderboard
[16] The rising costs of training frontier AI models. arxiv.org. https://arxiv.org/html/2405.21015v1
[17] Big Tech is about to spend $700 billion on AI this year. No one knows where the buildout ends. | Fortune. fortune.com. https://fortune.com/2026/04/30/big-tech-hyperscalers-will-spend-700-billion-on-ai-infrastructure-this-year-with-no-clear-end-in-sight-eye-on-ai
[18] Skyrocketing component prices push Big Tech capex to record $725 billion — Microsoft alone attributes $25 billion of AI budget to increased memory and chip costs | Tom's Hardware. tomshardware.com. https://tomshardware.com/tech-industry/big-tech/microsoft-attributed-25-billion-of-its-record-ai-budget-to-memory-chip-costs
[19] Tracking Trillions: The Assumptions Shaping the Scale of the AI Build-Out | Goldman Sachs. goldmansachs.com. https://goldmansachs.com/insights/articles/tracking-trillions-the-assumptions-shaping-scale-of-the-ai-build-out
[20] GPT-5 — Training cost, GPU hours & cluster size | BtMData. btmdata.com. https://btmdata.com/ai-training/gpt-5