May 31, 2026·15 min read·4 providers

AI vs Superforecasters: ForecastBench May 2026

ForecastBench (May 23, 2026) finds DeepMind's 'green tree' close to top humans but trailing by ~2–3 Brier Index points; parity claims are overstated.

Key Finding

ForecastBench may overstate AI’s forecasting ability and does not fully capture hard-to-predict domains such as black-swan or geopolitical judgment; thus AI has not yet shown comprehensive parity on the full benchmark or in real-world forecasting.

High confidence · Supported by anthropic, openai, grok
Justin Furniss

@Parallect.ai and @SecureCoders. Founder. Hacker. Father. Seeker of all things AI

Providers: anthropic, perplexity, openai, grok

Executive Summary

  • "Green Tree" is real but mischaracterized: Google DeepMind's "green tree" is a confirmed ForecastBench tournament submission, currently ranked #2 overall with a Brier Index of approximately 67.8–67.9 as of the May 23, 2026 leaderboard update — meaningfully below the superforecaster median of 70.2. The specific claim of parity "around March 15, 2026" originates from a single secondary source and cannot be independently verified against primary leaderboard data [1, 2].
  • The benchmark numbers are clear: On ForecastBench, superforecasters hold a Brier Index of ~70.2 overall; the best AI system (DeepMind "green tree") sits at ~67.8–67.9 — a gap of roughly 2–3 Brier Index points that is statistically meaningful, not noise [1, 2].
  • Parity is overstated in popular framing: The gap has narrowed sharply since 2023 (GPT-4 posted a difficulty-adjusted Brier score of 0.131 against the superforecasters' 0.081), and AI has achieved near-parity or transient parity on specific data-rich question subsets, but comprehensive parity on the full benchmark has not been established as of late May 2026 [3, 4].
  • The strongest pro-parity evidence comes from Bridgewater AIA Labs' November 2025 paper showing the AIA Forecaster scoring 0.1125 on FB-7-21 against a superforecaster median of 0.1110 — a near-tie on one benchmark slice, though the system still lags on harder market questions [5, 6].
  • Practical implications are real but prospective: Finance, insurance, and governance applications are widely discussed, but quantified deployment outcomes remain sparse; the dominant framing is "AI plus superforecasters," not replacement [7, 8].

1. The "Green Tree" System: Verification of the System, Date, and Benchmark

What "Green Tree" Actually Is

Google DeepMind's "green tree" is a confirmed entry on the ForecastBench tournament leaderboard, operated by the Forecasting Research Institute (FRI) [1, 2]. Multiple independent sources corroborate that DeepMind submitted forecasting systems under internal code names rather than public product brand names, with at least two entries visible on the leaderboard: "green tree" and a second entry labeled "yellow mouse" [1]. This naming convention (a color paired with a common noun) is consistent with internal project labeling practices rather than public product releases, and DeepMind has not publicly announced a product called "Green Tree" [1].

ForecastBench itself is a dynamic, contamination-free benchmark that continuously generates and curates approximately 1,000 real-world forecasting questions spanning geopolitics, economics, technology, and public health [3, 4]. It evaluates both human forecasters and AI systems using difficulty-adjusted Brier scores, converted to a Brier Index for public reporting [3, 4]. The benchmark is specifically designed to prevent training-data contamination by focusing on future events at the time of question creation [3].

The March 15, 2026 Parity Claim: What the Evidence Shows

The specific claim that Green Tree "hit parity with top human superforecasters around March 15, 2026" traces to a single secondary source — a Substack newsletter dated May 22, 2026 [9]. This source is not a primary leaderboard record, a peer-reviewed paper, or an FRI publication. No FRI methodology post, no ForecastBench leaderboard snapshot, and no DeepMind announcement corroborates the specific date of March 15, 2026 as a parity milestone.

What the primary leaderboard data does show is more nuanced. One source reports that earlier preliminary leaderboard states showed DeepMind models transiently appearing above the superforecaster median before additional question resolutions shifted rankings — consistent with the kind of transient crossing that could generate a "parity" headline without representing durable parity [1]. The ForecastBench tournament leaderboard allows scaffolded and ensembled entries, and results can be noisy in small samples or on preliminary boards [1, 2].

The most defensible interpretation: Green Tree may have briefly appeared at or above the superforecaster median on a specific leaderboard snapshot or question subset (particularly data-rich "dataset questions") around that timeframe, generating the parity claim. But as of the authoritative May 23, 2026 leaderboard update, Green Tree ranked #2 overall — below the superforecaster median — indicating that if transient parity occurred, it did not persist on the full benchmark [1, 2].

Current Leaderboard Position (May 23, 2026)

The May 23, 2026 ForecastBench tournament leaderboard update provides the most current authoritative figures [1, 2]:

Forecaster | Overall Brier Index | Dataset Questions | Market Questions | Rank
Superforecaster Median | 70.2 | 63.6 | 78.6 | #1
DeepMind "green tree" | ~67.8–67.9 | ~64.6–64.8 | ~71.4 | #2
DeepMind "yellow mouse" | ~67.6 | n/a | n/a | #3 (approx.)
xAI Grok 4.20 Preview | ~67.4 | n/a | n/a | #4 (approx.)
Public Median Forecaster | ~64.5 | n/a | n/a | Lower
Claude Sonnet 4.5 (zero-shot) | ~63.3 | n/a | n/a | Lower
OpenAI o3 (scratchpad) | ~63.2 | n/a | n/a | Lower

Sources: [1, 2]

Several features of this table are analytically important. First, Green Tree leads all AI systems but trails superforecasters by approximately 2.3–2.4 Brier Index points overall. Second, on dataset questions — the data-rich, shorter-horizon items where AI has structural advantages — Green Tree (64.6–64.8) actually exceeds the superforecaster median (63.6), which is the most plausible basis for the March 15 parity claim [1]. Third, on market questions — longer-horizon, judgment-intensive items — the gap reverses sharply: superforecasters score 78.6 versus Green Tree's 71.4, a 7.2-point deficit for AI [1, 2]. This asymmetry is critical to understanding what "parity" means in this context.
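The three gaps discussed above can be recomputed directly from the table. The snippet below is a minimal sketch that only restates the leaderboard figures quoted in this report; the dictionary layout and the helper name `gap_range` are illustrative choices, not part of ForecastBench.

```python
# Brier Index values copied from the May 23, 2026 leaderboard table in this report.
# Where the table gives a range, both endpoints are kept.
superforecasters = {"overall": 70.2, "dataset": 63.6, "market": 78.6}
green_tree = {
    "overall": (67.8, 67.9),
    "dataset": (64.6, 64.8),
    "market": (71.4, 71.4),
}


def gap_range(category):
    """Return (smallest, largest) superforecaster-minus-green-tree gap in Brier Index points."""
    low, high = green_tree[category]
    human = superforecasters[category]
    return round(human - high, 1), round(human - low, 1)


for category in ("overall", "dataset", "market"):
    smallest, largest = gap_range(category)
    # A positive gap means superforecasters lead; a negative gap means green tree leads.
    print(f"{category:8s} gap: {smallest:+.1f} to {largest:+.1f}")
```

A positive value means the superforecaster median is ahead; the dataset row comes out negative, which is the subset where green tree leads.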


2. ForecastBench Scores: Current Best AI Brier Score vs. Human Superforecaster Aggregate

The Brier Score and Brier Index: Technical Grounding

The Brier score, introduced by Glenn W. Brier in 1950, measures the mean squared error between predicted probabilities and realized binary outcomes: for a single forecast, (p − o)², and for N forecasts, (1/N)Σ(pᵢ − oᵢ)² [3, 4]. Lower scores indicate better calibration and accuracy; a perfect forecaster scores 0, while a naive forecaster always predicting 50% scores 0.25 [3].
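As a concrete illustration of the definition above, the following sketch computes the mean squared error between forecast probabilities and binary outcomes. The function name `brier_score` and the toy inputs are illustrative assumptions; this is the plain Brier score, not ForecastBench's difficulty-adjusted variant.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Toy data: one event that happened (1) and one that did not (0).
outcomes = [1, 0]
print(brier_score([0.5, 0.5], outcomes))  # always predicting 50%
print(brier_score([1.0, 0.0], outcomes))  # perfect foresight
print(brier_score([0.0, 1.0], outcomes))  # maximally wrong
```

The three calls correspond to the reference points used throughout this report: the uninformed 50% forecaster, the perfect forecaster, and the maximally wrong forecaster.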

FRI's Brier Index transforms this into a more intuitive scale: Brier Index = (1 − √Brier score) × 100% [1]. A score of 100% represents perfect foresight; 50% represents coin-flip forecasting; 0% represents a maximally wrong forecaster [1]. Under this formula, a Brier score of 0.086 corresponds to a Brier Index of approximately 70.7% [1]. ForecastBench additionally applies difficulty-adjusted Brier scores that subtract question-level fixed effects (γⱼ) estimated from the observed distribution of scores across many forecasters, enabling apples-to-apples comparisons across forecasters who answered different question sets [3, 4].
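The conversion between the two scales can be sketched as follows, assuming the Brier Index formula quoted in this section. The function names `brier_index` and `brier_from_index` are illustrative, and the sketch ignores the difficulty adjustment, which requires the question-level effects that are not reproduced here.

```python
import math


def brier_index(brier):
    """Convert a Brier score in [0, 1] to the Brier Index (0 to 100, higher is better)."""
    if not 0.0 <= brier <= 1.0:
        raise ValueError("Brier score must lie in [0, 1]")
    return (1.0 - math.sqrt(brier)) * 100.0


def brier_from_index(index):
    """Invert the transformation: recover the Brier score implied by a Brier Index."""
    if not 0.0 <= index <= 100.0:
        raise ValueError("Brier Index must lie in [0, 100]")
    return (1.0 - index / 100.0) ** 2


print(round(brier_index(0.25), 1))        # the always-50% forecaster
print(round(brier_index(0.086), 1))       # superforecaster Brier score cited for March 2026
print(round(brier_from_index(67.9), 3))   # Brier score implied by an index of 67.9
```

Because the transformation is monotone and decreasing, a lower Brier score always maps to a higher Brier Index, so statements of the form "lower is better" (score) and "higher is better" (index) describe the same ordering.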

The Trajectory: 2023 to May 2026

The improvement in AI forecasting accuracy over this period is well-documented across multiple sources [3, 4]:

Date | System | Difficulty-Adjusted Brier Score | Notes
March 2023 | GPT-4 | 0.131 | Baseline LLM performance
February 2025 | GPT-4.5 | 0.101 | ~23% lower error than GPT-4 after about 2 years
October 2025 | Best LLM (ForecastBench) | 0.101 | Superforecasters at 0.081
November 2025 | AIA Forecaster (FB-7-21) | 0.1125 | Superforecaster median: 0.1110
March 2026 | Best LLM (ForecastBench) | ~0.103 | Superforecasters at 0.086
May 2026 | DeepMind "green tree" | ~0.103–0.104 (est., implied by the Brier Index formula) | Brier Index ~67.8–67.9; superforecasters at 70.2

Sources: [3, 4, 5, 6, 1]

The October 2025 snapshot is particularly well-documented: the superforecaster median achieved an overall difficulty-adjusted Brier score of 0.081, while the best LLM entry achieved 0.101 — a gap of approximately 20% in error terms [3, 4]. FRI's October 2025 linear extrapolation, based on an estimated annual improvement rate of approximately 0.016 difficulty-adjusted Brier points, projected LLM-superforecaster parity on ForecastBench by November 2026, with a 95% confidence interval spanning December 2025 to January 2028 [3, 4].
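The logic of a linear extrapolation can be illustrated with a back-of-the-envelope calculation using the October 2025 figures above. This is a simplified sketch, not FRI's method: it assumes a fixed human baseline and a constant improvement rate, whereas FRI's estimate comes from a fitted trend with its own confidence interval, so the result should not be expected to match the published point estimate. The function name `years_to_parity` is hypothetical.

```python
def years_to_parity(llm_score, human_score, annual_improvement):
    """Years until a linearly improving LLM score reaches a fixed human score.

    Scores are Brier-style (lower is better). annual_improvement is the
    assumed yearly reduction in the LLM's score.
    """
    if annual_improvement <= 0:
        raise ValueError("annual_improvement must be positive")
    gap = llm_score - human_score
    return max(gap, 0.0) / annual_improvement


# October 2025 figures quoted above: best LLM 0.101, superforecasters 0.081,
# estimated improvement of about 0.016 Brier points per year.
print(round(years_to_parity(0.101, 0.081, 0.016), 2))
```

The naive calculation gives a little over a year from October 2025, which falls inside the wide 95% confidence interval FRI reports and shows why small changes in the assumed rate move the parity date by many months.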

By March 5, 2026, an updated FRI extrapolation revised the parity estimate to March 2027 on the Brier score metric and May 2027 on the Brier Index metric, with 95% CIs of February 2026–January 2028 and April 2026–May 2028 respectively [1, 2]. A separate linear trend projection cited by one source estimates LLM-superforecaster parity around August 2027, with a 95% CI of March 2026 to August 2028 [1].

The AIA Forecaster: The Strongest Pro-Parity Data Point

The most compelling single data point in favor of near-parity comes from Bridgewater AIA Labs' AIA Forecaster, described in Alur et al., arXiv:2511.07678, dated November 10–11, 2025 [5, 6]. This system combines agentic search over high-quality news sources, a supervisor agent that reconciles disparate forecasts for the same event, and statistical calibration techniques designed to counter behavioral biases in LLMs [5, 6].

On the FB-7-21 benchmark slice, the AIA Forecaster scored 0.1125 against a superforecaster median of 0.1110 — a difference of 0.0015 Brier points, which is within the margin of noise for this sample size [5, 6]. This represents the closest any documented AI system has come to matching the superforecaster median on a ForecastBench-derived evaluation.

However, the same paper reveals important limitations. On the FB-Market subset — questions tied to prediction markets, which tend to be harder and more judgment-intensive — the AIA Forecaster scored 0.0753 against a human state-of-the-art of 0.0740, a gap of 0.0013 in the wrong direction [5, 6]. On the more challenging MarketLiquid benchmark, the AIA Forecaster scored 0.108 while market consensus scored 0.098 — a 10% deficit [5, 6]. Notably, an ensemble combining the AIA Forecaster and market prices achieved 0.092, outperforming both components individually, suggesting complementarity rather than substitution [5, 6].
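The complementarity point can be illustrated with a toy example in which two forecasters with equal Brier scores make different errors, so their average beats both. The data below are invented for illustration, and the unweighted average is an assumption: the report does not state how the AIA Forecaster and market prices were actually combined.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def average_forecasts(first, second):
    """Unweighted average of two probability forecasts, question by question."""
    return [(a + b) / 2 for a, b in zip(first, second)]


# Invented toy data: each source is confident on the questions where the other is not.
outcomes = [1, 0, 1, 0]
model_probs = [0.9, 0.4, 0.6, 0.1]
market_probs = [0.6, 0.1, 0.9, 0.4]
ensemble_probs = average_forecasts(model_probs, market_probs)

for name, probs in (("model", model_probs), ("market", market_probs), ("ensemble", ensemble_probs)):
    print(f"{name:8s} Brier score: {brier_score(probs, outcomes):.4f}")
```

Averaging helps here because the two sources' errors are not aligned. When two forecasters make the same mistakes on the same questions, averaging yields little or no gain.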

Good Judgment's Market-Question Data

Good Judgment Inc.'s own analysis of ForecastBench performance on market questions reveals a more persistent gap [7, 8]. On market-question subsets, Good Judgment reports a superforecaster Brier score of approximately 0.039 versus approximately 0.059 for the best AI entrant — meaning the best AI's error rate is roughly 50% larger than that of superforecasters on these questions [7, 8]. This is consistent with the tournament leaderboard data showing a 7.2-point Brier Index gap on market questions (78.6 vs. 71.4) [1].


3. Is "Parity" Actually Established, or Is It Contested and Overstated?

The Case That Parity Has Been Reached (or Is Imminent)

The strongest evidence for parity rests on several pillars. First, the AIA Forecaster's November 2025 result on FB-7-21 (0.1125 vs. 0.1110) demonstrates that a purpose-built AI forecasting system can match the superforecaster median on a well-designed benchmark slice [5, 6]. Second, on ForecastBench's dataset questions (the data-rich, shorter-horizon category), Green Tree (64.6–64.8 Brier Index) actually exceeds the superforecaster median (63.6) as of May 2026 [1]. Third, the trajectory of improvement is steep and consistent: the best LLM score fell from GPT-4's 0.131 in March 2023 to 0.101 by October 2025, roughly a 23% reduction in error over about two and a half years [3, 4]. Fourth, one secondary source reports that the Forecasting Research Institute characterized the Green Tree result as a milestone achieved earlier than many expected, and that Green Tree's results showed no statistically significant difference from human top performers on certain question suites [9].

The Case That Parity Is Overstated

The evidence against comprehensive parity is, on balance, stronger and better-sourced. Several structural critiques of the parity claim deserve careful attention.

The frozen baseline problem. ForecastBench's human baseline is a frozen 2024 dataset [1, 8]. AI systems are evaluated against a static human benchmark rather than against superforecasters forecasting in real time. If superforecasters continued improving through 2025–2026 — which is plausible given their track record of calibration improvement — the true gap may be larger than the leaderboard suggests [8].

Question composition bias. ForecastBench is weighted toward data-rich, quantitative questions where AI has structural advantages [8]. The benchmark does not fully capture teaming, iterative updating, question formulation, or black-swan and geopolitical judgment — precisely the domains where superforecasters' qualitative reasoning skills are most valuable [1, 8]. Good Judgment's own analysis explicitly notes that ForecastBench does not measure what superforecasters actually do best [8].

The multiple-attempts problem. ForecastBench allows AI systems to submit updated forecasts every two weeks, while the human baseline is frozen [1]. Given enough evaluation cycles, statistical noise alone will eventually produce a snapshot in which an AI system scores better than the human baseline by chance. This is a form of multiple-comparisons inflation that can generate misleading parity headlines [1].
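The size of this effect can be sketched with a simple probability calculation. The numbers are assumptions chosen for illustration: a 5% chance per cycle that noise alone puts an AI entry ahead of the baseline, 26 biweekly cycles in a year, and independence between cycles, which real leaderboard snapshots do not satisfy because they share resolved questions.

```python
def prob_at_least_one_chance_win(per_cycle_prob, cycles):
    """Probability of at least one chance 'win' across independent evaluation cycles."""
    if not 0.0 <= per_cycle_prob <= 1.0:
        raise ValueError("per_cycle_prob must lie in [0, 1]")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return 1.0 - (1.0 - per_cycle_prob) ** cycles


# Assumed inputs: 5% chance of a spurious win per cycle, biweekly updates.
for cycles in (1, 6, 13, 26):
    p = prob_at_least_one_chance_win(0.05, cycles)
    print(f"{cycles:2d} cycles: {p:.1%} chance of at least one spurious win")
```

Under these assumptions a spurious "parity" snapshot becomes more likely than not within a year of biweekly updates, which is why a single leaderboard crossing is weak evidence of durable parity.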

The market-question gap persists. On market questions — the hardest and most judgment-intensive category — superforecasters lead by 7.2 Brier Index points (78.6 vs. 71.4) and maintain roughly a 50% error-rate advantage [1, 7, 8]. This is not a marginal gap; it represents a domain where AI systems have not approached parity.

Statistical significance of the overall gap. The ForecastBench working paper reports that superforecasters significantly outperform the top LLM on both mean Brier scores and t-tests of performance differences, with p-values well below 0.001 on a subset of 200 questions [3, 4]. This is not a borderline result.
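For readers unfamiliar with this kind of test, the sketch below computes a paired t statistic on per-question Brier scores. The choice of a paired test and the five toy scores are assumptions made for illustration; the working paper's exact test specification and data are not reproduced here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-question difference (scores_a minus scores_b)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t statistic is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Invented per-question Brier scores (lower is better) for the same five questions.
llm_scores = [0.12, 0.10, 0.09, 0.15, 0.11]
superforecaster_scores = [0.10, 0.07, 0.08, 0.11, 0.09]

t_stat = paired_t_statistic(llm_scores, superforecaster_scores)
print(f"t = {t_stat:.2f} with {len(llm_scores) - 1} degrees of freedom")
```

A large positive t statistic means the LLM's per-question error is consistently higher than the superforecasters'. With 200 questions, as in the reported test, even a small consistent difference can produce a very small p-value.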

Expert forecasts of parity timing. The median superforecaster predicts AI will beat the ForecastBench benchmark by 2028; the median domain expert predicts 2030; the median public respondent predicts 2033 [1, 2]. These forecasts, made by the people most familiar with both AI capabilities and the benchmark's demands, suggest that the forecasting community itself does not believe parity has been achieved.

Good Judgment's institutional position. Good Judgment's CEO Warren Hatch has stated that AI has not fully replicated the collective human "hive mind" and that the company's philosophy is "Superforecasters plus AI, not one or the other" [8]. This framing — from the organization that manages the superforecaster baseline — is a meaningful signal about the state of play.

The Honest Synthesis

The evidence supports a nuanced conclusion: AI has achieved subset parity on data-rich, shorter-horizon questions and has dramatically closed the gap overall, but comprehensive parity on the full ForecastBench benchmark has not been established as of late May 2026. The popular framing of "AI has matched superforecasters" is an overstatement that conflates performance on favorable question subsets with performance across the full distribution of forecasting tasks. The gap on the overall leaderboard (70.2 vs. 67.8–67.9 Brier Index) is small in absolute terms but statistically meaningful and consistent across multiple measurement approaches. The trajectory suggests full parity is plausible within 1–3 years, but it has not arrived yet.


4. Concrete Implications for Finance, Insurance, and Governance

Finance

The most concrete financial implication documented in the evidence base is the cost-disruption argument: if AI systems approach superforecaster-level accuracy, the cost of high-quality probabilistic forecasting collapses from the fees charged by elite human forecasting services to approximately the cost of an API subscription [10]. This threatens the business model of traditional geopolitical risk firms, which must justify premium fees against a subscription costing roughly $20 per month [10]. The democratization effect, which makes elite-level forecasting intelligence accessible to startups and SMEs that previously could not afford it, is cited as a significant structural shift [10].

In quantitative finance, more accurate probabilistic forecasting directly enables better risk pricing and allocation, improved capital deployment decisions, and more precise hedging strategies [1]. The AIA Forecaster paper demonstrates this concretely: on the MarketLiquid benchmark, an ensemble of the AIA Forecaster and market prices achieved a Brier score of 0.092, outperforming market consensus alone (0.098) — suggesting that AI forecasting systems can add measurable alpha even in liquid, information-rich markets [5, 6]. A small improvement in probabilistic accuracy compounds significantly in financial applications where decisions are made repeatedly at scale [1].

One industry survey cited by a single source reports that as of 2025, 96% of insurers surveyed were investing in data-driven forecasting and analytics technology [9]. While this figure cannot be independently verified from primary sources in the available registry, it is consistent with the broader pattern of institutional adoption.

Insurance

For insurance and reinsurance, the implications center on exposure modeling and catastrophe risk pricing. More accurate probabilistic forecasting of geopolitical events, climate-related risks, and economic disruptions directly affects underwriting decisions and reserve adequacy [1]. The domain-specific AI forecasting systems developed by Google DeepMind — including GraphCast and GenCast for weather and extreme event prediction — represent a separate but related track of AI forecasting capability that is already being integrated into catastrophe modeling workflows [1]. These weather-specific systems operate at a different level of maturity than general-purpose forecasting benchmarks; they are not the same as "green tree" but represent the broader DeepMind forecasting portfolio.

The insurance industry's concern is less about whether AI matches superforecasters on abstract benchmark questions and more about whether AI can improve the accuracy of specific, high-stakes probability estimates — flood probabilities, pandemic severity distributions, geopolitical disruption likelihoods — that feed directly into pricing models. The evidence suggests AI is increasingly useful for this purpose, particularly when combined with human judgment rather than deployed autonomously [7, 8].

Governance and Policy

For governance applications, the implications are primarily about decision support rather than automation. The evidence base consistently frames AI forecasting as a tool to augment human decision-making in policy contexts — improving the quality of scenario analysis, stress-testing policy assumptions, and providing calibrated probability estimates for legislative and regulatory decisions [10, 1]. The Forecasting Research Institute's LEAP (Longitudinal Expert AI Panel) Wave 5 report on security and geopolitics explicitly addresses how AI forecasting capabilities are reshaping the intelligence and policy analysis landscape [10].

The governance concern runs in both directions: AI forecasting tools can improve the quality of policy analysis, but they also create new risks if deployed without appropriate oversight. Institutions are advised to track calibration over time, combine human and AI forecasts rather than relying on either alone, and maintain strong governance frameworks for high-stakes deployment [1]. The risk of overconfidence in AI forecasting outputs — particularly given the benchmark limitations discussed above — is a concrete governance challenge that has not yet been systematically addressed in regulatory frameworks.

The cost-democratization effect also has governance implications: if high-quality forecasting becomes cheap and widely accessible, the information advantage currently held by well-resourced governments and large institutions over smaller actors narrows. This has implications for competitive intelligence, regulatory arbitrage, and the distribution of strategic foresight capabilities across the public and private sectors [10].


5. Current Honest Status: What We Actually Know as of Late May 2026

The evidence, taken as a whole, supports the following conclusions with varying degrees of confidence:

Well-established (high confidence, multi-source):

  • ForecastBench is the leading public benchmark for AI vs. human forecasting, operated by FRI, using difficulty-adjusted Brier scores and a Brier Index [3, 4, 1, 2].
  • As of May 23, 2026, superforecasters hold the #1 position on the ForecastBench tournament leaderboard with a Brier Index of 70.2 overall [1, 2].
  • Google DeepMind's "green tree" is a real ForecastBench tournament submission, currently ranked #2 with a Brier Index of approximately 67.8–67.9 [1, 2].
  • The gap between AI and superforecasters has narrowed sharply since 2023 (from a difficulty-adjusted Brier score of 0.131 against 0.081 to roughly 2.3–2.4 Brier Index points overall) but has not closed to zero [3, 4, 1].
  • On market questions, the gap remains substantial: superforecasters score 78.6 vs. Green Tree's 71.4 on the Brier Index [1].

Current evidence suggests (moderate confidence):

  • The specific claim of parity "around March 15, 2026" cannot be verified from primary leaderboard data and likely reflects either a transient leaderboard state on a specific question subset or a secondary-source overstatement [1, 2].
  • Green Tree may have achieved transient or subset parity on data-rich dataset questions, where it currently scores 64.6–64.8 vs. the superforecaster median of 63.6 [1].
  • Full parity on the overall ForecastBench benchmark is projected by linear extrapolations for sometime in 2027 (point estimates range from March to August 2027), with 95% confidence intervals extending into 2028 [1, 2].

Preliminary findings indicate (lower confidence, limited sources):

  • The AIA Forecaster (Alur et al., arXiv:2511.07678) achieved near-parity on the FB-7-21 slice (0.1125 vs. 0.1110 superforecaster median) in November 2025, representing the closest documented approach to parity on any well-designed evaluation [5, 6].
  • Ensemble approaches combining AI forecasts with market prices or human judgment consistently outperform either component alone, suggesting the near-term optimum is human-AI collaboration rather than AI replacement [5, 6, 8].

The available sources do not support: any claim that comprehensive, durable, statistically robust parity between AI systems and elite human superforecasters has been achieved on the full ForecastBench benchmark or in real-world forecasting practice as of late May 2026. The honest status is: AI has reached parity on a favorable subset of questions, is closing the gap rapidly on the overall benchmark, but remains meaningfully behind on the hardest, most judgment-intensive forecasting tasks — and the benchmark itself may understate the true gap by using a frozen 2024 human baseline.

References

[1] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html

[2] ForecastBench. forecastbench.org. https://forecastbench.org

[3] ForecastBench: A Dynamic Benchmark of AI Forecasting ... arxiv.org. https://arxiv.org/pdf/2409.19839

[4] ForecastBench: A Dynamic ... faculty.wharton.upenn.edu. https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf

[5] AIA Forecaster: Technical Report. arxiv.org. https://arxiv.org/pdf/2511.07678

[6] [2511.07678] AIA Forecaster: Technical Report. arxiv.org. https://arxiv.org/abs/2511.07678

[7] Human vs AI Forecasts: What Leaders Need to Know. goodjudgment.com. https://goodjudgment.com/human-vs-ai-forecasts

[8] What Superforecasters Actually Said About ForecastBench - Good Judgment. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench

[9] Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross. theinnermostloop.substack.com. https://theinnermostloop.substack.com/p/welcome-to-may-22-2026

[10] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5


Cross-provider analysis

How 4 providers compared on 245 claims across 118 topic clusters

6 consensus · 4 contested · 91 unique · 46 low-conf · standard

Consensus findings (6)

Multiple providers independently confirmed these. Treat as the most reliable evidence.

  • On the FRI's interpretable Brier Index scale, a score of 100% means perfect foresight, while a score of 0% means a maximally wrong forecaster.

    82% · anthropic, perplexity, grok
  • ForecastBench is FRI’s dynamic, contamination-free benchmark of real-world AI forecasting on future events, using difficulty-adjusted Brier-based scoring and public reporting via a Brier Index.

    79% · anthropic, perplexity, openai, grok
  • On ForecastBench, superforecasters led the leaderboard in late 2025 to early 2026 with a difficulty-adjusted Brier score of about 0.081–0.086 (about 70.2 on the Brier Index), while the best LLMs trailed at about 0.101–0.103.

    79% · anthropic, perplexity, openai, grok
  • Claims of parity exist, but that parity is contested and may be overstated in popular framing.

    72% · anthropic, openai, grok
  • ForecastBench may overstate AI’s forecasting ability and does not fully capture hard-to-predict domains such as black-swan or geopolitical judgment; thus AI has not yet shown comprehensive parity on the full benchmark or in real-world forecasting.

    71% · anthropic, openai, grok
  • On FRI's live leaderboard, superforecasters are about 50% more accurate than the nearest/best AI on market questions, and they lead overall.

    70% · anthropic, perplexity, grok

Contested findings (4)

Providers disagreed. Both positions surfaced rather than picked.

  • Position A

    Google DeepMind's "green tree" claimed the top spot on the Forecasting Research Institute's benchmark. A Google DeepMind submission labeled “green tree” on ForecastBench’s tournament leaderboard is currently reported at an overall Brier Index of roughly 67.9. One of Google DeepMind’s ForecastBench entries appears under the label “green tree.” Another Google DeepMind ForecastBench entry appears under the label “yellow mouse.” As of May 23, 2026, Google DeepMind’s “green tree” is reported at an overall Brier Index of 67.9. On ForecastBench’s Overall Score, DeepMind’s Green Tree scored 67.9 as of May 2026.

    anthropic, perplexity, openai

    Position B

    Green tree is a Google DeepMind tournament submission to ForecastBench. As of the May 23, 2026 leaderboard update, Google DeepMind green tree had 67.8 overall Brier Index. As of the May 23, 2026 leaderboard update, Google DeepMind green tree had about 64.6–64.8 on dataset questions. As of the May 23, 2026 leaderboard update, Google DeepMind green tree had about 71.4 on market questions. Google DeepMind green tree ranked #2 overall.

    grok

    4 providers split on this claim.

  • Position A

    As of late May 2026, AI forecasting systems have reached the threshold of parity with elite human superforecasters on at least one major benchmark (ForecastBench). As of late May 2026, AI has not reached unambiguous parity with elite human superforecasters on comprehensive real-world forecasting benchmarks.

    anthropic, grok

    Position B

    As of the second half of May 2026, general-purpose large language model forecasting systems have substantially closed the gap with elite human superforecasters but have not yet fully matched them on broad, real-world forecasting tasks under shared, contamination-controlled protocols.

    perplexity

    Claims [0] and [4]/[6]/[8] assert parity or near-parity, while [5] and [9] explicitly deny full or unambiguous parity on broader benchmarks at the same time period.

  • Position A

    Lower Brier scores indicate better calibration and accuracy. A lower Brier score indicates better accuracy.

    perplexity, openai

    Position B

    A higher Brier Index is better.

    grok

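    Position A refers to the raw Brier score (lower is better) and Position B to the Brier Index (higher is better). Because the Brier Index is defined as (1 − √Brier score) × 100, the two statements describe different scales and are compatible rather than contradictory.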

  • Position A

    On ForecastBench, an LLM has matched superforecaster performance for the first time. The ForecastBench working paper reports that superforecasters significantly outperform the top LLM on both mean Brier scores and t-tests of performance differences.

    anthropic, perplexity

    Position B

    The ForecastBench result is real.

    openai

    States the opposite result: superforecasters significantly outperform the top LLM on ForecastBench measures.

Single-source insights (91)

Reported by only one provider. Treat as preliminary unless independently verified.

  • The October 2025 figures correspond to roughly a 20% human edge.

    73%
    anthropic
  • The Brier score was originally introduced by Glenn W. Brier in 1950 in “Verification of forecasts expressed in terms of probability.”

    71%
    perplexity
  • The Forecasting Research Institute introduced the Brier Index in a 2025 explainer titled “Making Forecasting Scores Easier to Interpret: Introducing the Brier Index.”

    70%
    perplexity
  • In late 2025, there was still a roughly 20% performance gap.

    70%
    openai
  • As of 2025, 96% of insurers surveyed were investing in data-driven forecasting and analytics tech.

    70%
    openai
  • Around mid-March 2026, Green Tree was reported to match the accuracy of elite human forecasters.

    70%
    openai
  • + 85 more single-source insights

Low-confidence claims (46)

Weak signals the verifier flagged for hedged language in the report.

  • Given enough evaluations, eventually one AI attempt will fall under the benchmark mark by chance.

    56%
    anthropic
  • AI might hallucinate or miss context in those harder forecasting areas.

    56%
    openai
  • The median public forecast ranked #2 on the leaderboard a year ago.

    57%
    anthropic
  • Today, the median public forecast sits at #22.

    57%
    anthropic
  • AI systems are already used as informational tools in financial decision-making.

    57%
    perplexity
  • + 41 more low-confidence claims

Go Deeper

Follow-up questions based on where providers disagreed or confidence was low.

Verify the identity and benchmark details of Google DeepMind’s reported “Green Tree” forecasting system, including whether it is a public DeepMind product name or an internal label, and confirm the exact May 23, 2026 ForecastBench overall score and ranking

The current signals conflict on whether “green tree” is a real DeepMind system, whether the label is internal, and whether its overall score is 67.8, 67.9, or something else. Resolving the exact system identity, leaderboard snapshot date, and benchmark nomenclature is necessary before any parity claim can be trusted.

Disagreement · XS tier

Determine the correct direction and normalization of ForecastBench’s Brier metric and whether the reported “Brier Index” is comparable to standard Brier score across providers

Providers disagree on whether higher or lower is better, which undermines every score comparison in the parity debate. A targeted methods lookup into ForecastBench’s scoring convention and any transformations used in the tournament leaderboard is needed to interpret the reported values correctly.

Disagreement · XS tier

Establish whether ForecastBench currently supports a statistically valid claim of parity between the best AI system and elite human superforecasters, using the latest leaderboard plus the underlying working paper tests

The weak signals split between “parity has been reached,” “superforecasters still outperform,” and “parity is not yet established.” The most valuable follow-on is a direct check of the benchmark’s latest leaderboard alongside the paper’s mean-Brier and significance results, including whether the comparison is on all questions or only subsets like dataset versus market items.

Disagreement · S tier

Reconstruct the exact human-vs-AI Brier figures on ForecastBench and the subset claims for dataset questions, market questions, and the public median forecaster over time

Several signals mention a human median at different ranks and specific subset gaps, but these are not yet tied to a consistent time series. A focused quantitative reconstruction would clarify whether AI is merely close on easier data-rich items while still trailing on market and judgment-heavy questions.

Low Confidence · M tier

Investigate the concrete downstream implications of AI forecasting parity for finance, insurance, and governance, and identify whether any quantified deployment or ROI evidence exists

The signals suggest major implications—risk pricing, allocation, hedging, insurance purchases, regulatory decisions, and governance—but also note that public discussion is high-level and lacks quantified impact. This is worth pursuing because it separates plausible strategic consequences from unsupported extrapolation.

Implication · M tier

Key Claims

Cross-provider analysis with confidence ratings and agreement tracking.

118 claims · sorted by confidence
1

ForecastBench is FRI’s dynamic, contamination-free benchmark of real-world AI forecasting on future events, using difficulty-adjusted Brier-based scoring and public reporting via a Brier Index.

high · anthropic, perplexity, openai, grok · greentreegroup.com, theinnermostloop.substack.com, forecastbench.org, +7
2

On ForecastBench, superforecasters led the leaderboard in late 2025 to early 2026 with a difficulty-adjusted Brier score of about 0.081–0.086 (about 70.2 on the Brier Index), while the best LLMs trailed at about 0.101–0.103.

high · anthropic, perplexity, openai, grok · greentreegroup.com, theinnermostloop.substack.com, forecastbench.org, +10
3

Google DeepMind’s “green tree” is a ForecastBench tournament submission, and as of the May 23, 2026 leaderboard update it was reported at roughly 67.8–67.9 overall Brier Index / Overall Score, ranking around #2 overall.

high · anthropic, perplexity, openai, grok · greentreegroup.com, theinnermostloop.substack.com, forecastbench.org, +3
4

On the FRI's interpretable Brier Index scale, a score of 100% means perfect foresight, while a score of 0% means a maximally wrong forecaster.

high · anthropic, perplexity, grok · arxiv.org, markets.financialcontent.com, nifc.gov, +2
5

Claims of parity exist, but that parity is contested and may be overstated in popular framing.

6

ForecastBench may overstate AI’s forecasting ability and does not fully capture hard-to-predict domains such as black-swan or geopolitical judgment; thus AI has not yet shown comprehensive parity on the full benchmark or in real-world forecasting.

high · anthropic, openai, grok · emergentmind.com, theinnermostloop.substack.com, emergentmind.com, +4
7

The Brier Index is defined as (1 − √(Brier score)) × 100%, so a Brier score of 0.086 corresponds to a Brier Index of about 70.7%; Green Tree’s Brier score is reported to be in the high 0.08s, roughly matching the human benchmark within the margin of error.

high · perplexity, openai · theinnermostloop.substack.com, arxiv.org
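
The conversion quoted in claim 7 can be checked numerically. The following is a minimal Python sketch, assuming the formula exactly as stated in that claim, (1 - sqrt(Brier score)) × 100, together with the standard mean-squared-error definition of the Brier score for binary outcomes. The function names are illustrative and are not taken from ForecastBench's code, and ForecastBench's difficulty adjustment is not modeled here.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_index(score: float) -> float:
    """Brier Index in percent, using the formula quoted in claim 7 (assumed)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Brier score must lie in [0, 1]")
    return (1.0 - math.sqrt(score)) * 100.0


def brier_from_index(index: float) -> float:
    """Inverse of brier_index: the Brier score implied by a Brier Index."""
    if not 0.0 <= index <= 100.0:
        raise ValueError("Brier Index must lie in [0, 100]")
    return (1.0 - index / 100.0) ** 2


# A constant 50% forecast scores 0.25 on any binary question set.
coin_flip = brier_score([0.5, 0.5], [1, 0])
print(brier_index(coin_flip))             # 50.0
print(round(brier_index(0.086), 1))       # 70.7
print(round(brier_from_index(70.2), 4))   # 0.0888
print(round(brier_from_index(67.8), 4))   # 0.1037
```

Under this formula the anchor points in claims 4 and 10 fall out directly: a Brier score of 0 maps to 100, a score of 1 maps to 0, and the coin-flip score of 0.25 maps to 50. The inverse shows what the leaderboard figures imply on the raw scale: an index of 70.2 corresponds to a Brier score near 0.089, and an index of 67.8 to roughly 0.104.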
8

On March 5, 2026, the updated extrapolated Brier Index parity estimate was May 2027, with a 95% CI of April 2026 to May 2028.

9

The Forecasting Research Institute projected a linear trend toward full parity, with parity suggested by December 2026 as of February 2026.

10

A Brier Index of 50% corresponds to uninformed, coin-flip forecasting (constant 50% prediction).

11

The milestone was announced around March 15, 2026.

high · anthropic, openai · theinnermostloop.substack.com
12

ForecastBench uses a frozen 2024 human benchmark.

13

The October 2025 figures correspond to roughly a 20% human edge.

14

The Brier score was originally introduced by Glenn W. Brier in 1950 in “Verification of forecasts expressed in terms of probability.”

high · perplexity · nifc.gov
15

The Forecasting Research Institute introduced the Brier Index in a 2025 explainer titled “Making Forecasting Scores Easier to Interpret: Introducing the Brier Index.”

high · perplexity · arxiv.org

Sources

33 unique sources cited across 118 claims.

Academic · 5 sources
[2511.07678] AIA Forecaster: Technical Report
arxiv.org · via anthropic, perplexity, grok, openai
11 claims
AIA Forecaster: Technical Report
arxiv.org · via anthropic
7 claims
ForecastBench: A Dynamic Benchmark of AI Forecasting ...
arxiv.org · via anthropic, perplexity, openai, grok
3 claims
ForecastBench: A Dynamic Benchmark of AI ...
faculty.wharton.upenn.edu · via anthropic
1 claim
Government · 1 source
Monthly seasonal outlook (nifc.gov)
nifc.gov · via anthropic, perplexity, grok, openai
4 claims
News & Media · 12 sources
Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross
theinnermostloop.substack.com · via anthropic, perplexity, openai, grok
46 claims
How well can large language models predict the future?
forecastingresearch.substack.com · via anthropic, perplexity, openai, grok
23 claims
Making Forecasting Scores Easier to Interpret: Introducing the Brier Index
forecastingresearch.substack.com · via anthropic, perplexity, grok, openai
12 claims
What Superforecasters Actually Said About ForecastBench
goodjudgment.substack.com · via anthropic, perplexity, openai, grok
12 claims
FinancialContent - The Great Forecast Convergence: AI Closing the 20% Gap on Human Superforecasters
markets.financialcontent.com · via anthropic, perplexity, grok, openai
7 claims
LLMs are closing the gap on human (forecastingresearch.substack.com)
forecastingresearch.substack.com · via anthropic, perplexity, openai, grok
7 claims
Forecasting the Economic Effects of AI
forecastingresearch.substack.com · via anthropic
4 claims
3 claims
What ForecastBench doesn't measure (goodjudgment.substack.com)
goodjudgment.substack.com · via anthropic, openai, grok
3 claims
ChatGPT vs Gemini: 9 Tests, 1 Clear Winner [2026]
tech-insider.org · via perplexity, grok, openai
3 claims

Topics

AI forecasting · superforecasters · ForecastBench · Brier score · DeepMind green tree · forecasting parity 2026 · AI finance insurance governance


Research synthesized by Parallect AI
