May 31, 2026 · 4 providers

AI vs Superforecasters: Parity Status (May 2026)

Q: Verify whether Google DeepMind’s “green tree” was actually the top-ranked ForecastBench entry around March 15, 2026, and whether the benchmark was ForecastBench’s overall leaderboard, a subset, or a newsletter-only snapshot

The weak signals split on the core parity claim: some say the reported date does not match the documented timeline, the only sourced appearance traces to a single dated newsletter, and public leaderboards still show the superforecaster aggregate ahead of AI. This needs a direct source check to separate a real benchmark result from a later retelling or subset-specific claim.

Q: Reconcile the exact performance numbers for ForecastBench: superforecaster aggregate Brier score, Green Tree Brier score, and whether the comparison is being made on the full leaderboard, a 200-item subset, or another cutoff

The signals contain incompatible figures and scaling conventions: superforecasters are cited at 0.081, 0.086, and 0.096; Green Tree is cited around 0.090, 0.101–0.1352, and 64.6–67.9 on a 0–100 scale; and providers disagree on whether lower or higher is better. A focused numeric reconciliation is necessary to determine if parity is real or an artifact of mixed metrics.

Q: Check ForecastBench’s official parity estimate and confidence interval versus the March/May 2026 leaderboard snapshots to determine whether claimed “parity” is forecasted, observed, or overstated

One signal says ForecastBench’s official site estimates parity around August 2027 with a wide confidence interval, while others claim parity was reached in March 2026 or that the gap is still statistically significant. This is a direct contradiction that can be resolved by comparing the official parity forecast with contemporaneous scores and leaderboard ordering.

Q: Investigate whether the claimed Green Tree result was on a dataset-question subset with favorable short-horizon, data-rich characteristics rather than the broader event-forecasting task elite superforecasters excel at

Several signals warn that benchmark caveats may favor AI on short-horizon, data-rich questions while under-capturing teaming, updating, question formulation, long-horizon judgment, and wildcard scenarios. If the Green Tree result came from such a subset, it would weaken any claim of general parity with elite superforecasters.

Q: Assess the downstream implications of AI forecasting parity for finance, insurance, and governance using concrete examples such as underwriting, regulatory stress testing, policy scenario analysis, and automated trading oversight

The signals suggest several plausible consequences—new rules for AI-driven predictions in finance and insurance, improved policy scenario analysis and regulatory stress testing, and risks from market cascades, over-reliance, or misuse—but these are mostly speculative. A targeted implications review would separate cited impacts from generic extrapolation and identify which sectors are most immediately affected.

Late May 2026: top AIs score ~67.8–67.9 vs elite superforecasters ~70.2 Brier Index on ForecastBench. Claims of parity are contested and not settled.

Key Finding

On ForecastBench, superforecasters still lead the leaderboard over DeepMind’s “green tree” and other AI systems. Reported superforecaster Brier scores range from about 0.086 on the full leaderboard (a Brier Index of about 70.6) to 0.096 on the 200-item subset used in the Wharton paper, while the best LLMs score around 67.9 on the Brier Index and remain slightly behind.

High confidence. Supported by anthropic, perplexity, openai, grok.

Justin Furniss

@Parallect.ai and @SecureCoders. Founder. Hacker. Father. Seeker of all things AI


Executive Summary

"Green Tree" is real but mischaracterized: Google DeepMind did submit a system codenamed "green tree" to ForecastBench, and it ranks #2 overall — but the claimed March 15, 2026 parity date is unverified; the only sourced reference to this milestone appears in a single newsletter dated May 22, 2026, not in any FRI, DeepMind, or ForecastBench official publication [1].
The numbers show a persistent gap: As of late May 2026, the superforecaster median Brier Index on ForecastBench stands at approximately 70.2, versus 67.8–67.9 for "green tree". The gap of roughly 2.3–2.4 index points is small but not zero, and superforecasters lead by a wider margin on market and geopolitical questions [2, 3, 4].
"Parity" is contested and overstated: The AIA Forecaster technical report claims statistical indistinguishability from superforecasters on ForecastBench [5], but the official leaderboard, the Wharton ForecastBench paper, and Good Judgment Inc. all maintain that superforecasters retain the top position [6, 7, 8].
Trajectory is clear, arrival is not: ForecastBench's own extrapolation puts full parity at approximately August 2027 (95% CI: March 2026–August 2028), and the FRI's Wave 5 expert panel median is "by end of 2030" [2, 9].
Practical implications are real but prospective: Finance, insurance, and governance applications are being actively discussed and piloted, but no evidence supports large-scale replacement of human superforecaster teams as of May 2026 [7, 8, 10].

1. The "Green Tree" Claim: System, Date, and Benchmark

What Is Confirmed

Google DeepMind did submit a forecasting system to the Forecasting Research Institute's ForecastBench tournament leaderboard under the codename "green tree" [2, 3]. This is well-established across multiple independent sources. ForecastBench maintains public leaderboards that list AI submissions under anonymized codenames — "green tree" and a companion entry called "yellow mouse" are both attributed to DeepMind, while xAI's submission appears as "Grok 4.20 (Preview)" and CassiAI's as "ensemble_2_crowdadj" [2, 3, 4].

The benchmark used is ForecastBench itself: a dynamic, contamination-free benchmark of LLM forecasting accuracy on real-world events, scored via a difficulty-adjusted Brier Index on a 0–100 scale where higher values indicate better performance [11, 2, 12]. The benchmark draws on approximately 1,000 automatically generated and regularly updated forecasting questions, with hidden evaluation data released only after resolution to prevent memorization from training corpora [6, 2].

What Is Not Confirmed: The March 15, 2026 Date

The specific claim that "green tree" hit parity with top human superforecasters "around March 15, 2026" is not corroborated by any primary source. No FRI publication, no DeepMind blog post, and no official ForecastBench announcement confirms this date [2, 3, 13, 1]. The only sourced appearance of the "green tree parity" narrative traces to a single newsletter by Dr. Alex Wissner-Gross, "Welcome to May 22, 2026" — a document dated nearly ten weeks after the claimed event [1]. One provider's analysis notes that public announcements and leaderboard claims of parity or topping dataset questions occurred in mid-to-late May 2026, not mid-March [14, 15].

It is plausible that "green tree" first topped the dataset-question subset of ForecastBench sometime in the March–May 2026 window, given the trajectory of scores. But the specific date of March 15, 2026 should be treated as unverified, and the claim of "full parity" on the overall benchmark is contradicted by the leaderboard data described below.

DeepMind's Named Public Systems Are Meteorological

It is worth clarifying a potential source of confusion: DeepMind's publicly named and documented forecasting systems during this period — GraphCast and WeatherNext 2 — are meteorological models entirely unrelated to geopolitical or event superforecasting [16, 13]. WeatherNext 2, introduced by Google DeepMind and Google Research, generates probabilistic weather forecasts for up to 15 days ahead, surpasses the previous WeatherNext model on 99.9% of variables and lead times, and operates at up to 1-hour resolution [13]. GraphCast predicts weather conditions up to 10 days in advance more accurately and faster than ECMWF's gold-standard HRES system [16]. Neither system is "green tree," and neither is designed for the kind of geopolitical, economic, and social event forecasting that ForecastBench measures.

2. ForecastBench Scores: AI vs. Human Superforecasters

The Benchmark Architecture

ForecastBench was originally launched in September 2024 [11, 12] and received a major update in October 2025 [3, 12]. The February 2026 academic paper published through Wharton formalized the methodology [6, 7]. The benchmark maintains two primary human comparison groups: the general public and elite "superforecasters" drawn from organizations like Good Judgment Inc. [6, 2]. Performance is measured via the Brier score (mean squared error of probabilistic forecasts, where 0 is perfect and 0.25 is equivalent to always predicting 50%) and the Brier Index, defined as (1 − √Brier score) × 100, producing a 0–100 scale where higher is better [2, 3].
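The two metrics defined above can be sketched in a few lines. This is a minimal illustration of the formulas as stated in the text, not ForecastBench's own scoring code; the function names are hypothetical, and the difficulty adjustment applied on the leaderboard is not modeled.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities (0..1) and binary outcomes (0 or 1)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_index(score):
    """Brier Index as defined in the text: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(score)) * 100.0

# Always predicting 50% scores 0.25 regardless of how the questions resolve.
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])
print(coin_flip, brier_index(coin_flip))   # 0.25 50.0

# Brier scores quoted in this report, converted to the index.
print(round(brier_index(0.086), 1))        # 70.7
print(round(brier_index(0.103), 1))        # 67.9
```

The 0.086 input maps to about 70.7 rather than the reported 70.6; the published Brier score is itself rounded, so a difference of a tenth of a point is expected.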

The Quantitative Record

The following table summarizes the documented score progression:

Date	Entity	Brier Score	Brier Index	Source
October 2025	GPT-4.5 (best model)	0.101	~68.2	[3, 4]
October 2025	Superforecasters	0.081	~71.5	[3, 4]
January 29, 2026	Superforecasters (#1)	~0.086	70.6	[17, 4]
January 29, 2026	Grok 4.20 / ensemble_2_crowdadj (tied #2)	0.103	~67.9	[3, 4]
Early March 2026	Superforecasters	0.086	70.6	[17, 4]
Early March 2026	Best LLMs (CassiAI / Grok 4.20)	0.103	67.9	[17, 4]
Late May 2026	Superforecasters (overall)	~0.089 (implied by index)	70.2	[2, 3]
Late May 2026	DeepMind "green tree" (#2 overall)	~0.103–0.104 (implied by index)	67.8–67.9	[2, 3]

Notes on the Wharton paper baseline: The ForecastBench academic paper [6, 7] evaluated a 200-item subset and found superforecasters at a mean Brier score of 0.096, the general public at 0.121, and Claude 3.5 Sonnet at 0.122. The difference between superforecasters (0.096) and the best LLMs (~0.122) was statistically significant at p < 0.001 [7]. The FRI's complementary leaderboard analysis, which uses the full dynamic question set, reports a somewhat better superforecaster score of 0.086 (Brier Index 70.6%) and best LLMs at 67.9% [17, 4] — the difference in baselines reflects different question sets and time windows, not a methodological contradiction.
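A significance claim of this kind rests on comparing the two forecasters question by question. The test the paper used is not described here, so the sketch below substitutes a paired percentile bootstrap as one common way to check whether a mean Brier difference is distinguishable from zero. The data are synthetic, generated only to resemble a 0.026 mean gap over 200 questions, and the function name and parameters are hypothetical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference (a - b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    cut = int(n_resamples * alpha / 2)
    return means[cut], means[n_resamples - 1 - cut]

# Synthetic per-question Brier scores (NOT real ForecastBench data).
rng = random.Random(42)
human = [rng.uniform(0.05, 0.15) for _ in range(200)]
model = [h + 0.026 + rng.uniform(-0.05, 0.05) for h in human]

low, high = paired_bootstrap_ci(model, human)
print(f"95% CI for mean Brier difference: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the gap is unlikely to be resampling noise. Pairing matters because both forecasters answer the same questions, so question difficulty cancels out of the difference.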

On market questions specifically: The superforecaster lead is wider on market and geopolitical questions than on general dataset questions. One analysis reports a superforecaster score of approximately 80.3 versus approximately 75.8 for the nearest AI on the transformed scale for market questions; in traditional Brier-score terms, the AI's error is approximately 50% higher on this category [11, 17]. The February 2026 analysis likewise describes superforecasters as nearly 50% more accurate than the nearest AI entrant on market questions [4, 18].
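The two framings of the market-question gap (index points versus relative error) can be reconciled by inverting the index. The sketch below assumes the "transformed scale" is the Brier Index defined earlier, which the cited analyses do not state explicitly here; the function name is hypothetical.

```python
def brier_from_index(index):
    """Invert the Brier Index: Brier score = (1 - index / 100) ** 2."""
    return (1.0 - index / 100.0) ** 2

human = brier_from_index(80.3)   # superforecasters, market questions
ai = brier_from_index(75.8)      # nearest AI, market questions

print(round(human, 4), round(ai, 4), round(ai / human, 2))   # 0.0388 0.0586 1.51
```

Under that assumption, a 4.5-point index gap corresponds to an AI Brier score about 51% higher than the superforecasters', which matches the "approximately 50%" figure.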

The Metaculus comparison: A separate arXiv study assessed state-of-the-art models on 464 forecasting questions from Metaculus [15]. OpenAI o3 achieved a Brier score of 0.1352 versus a human crowd baseline of 0.149, indicating that frontier AI has surpassed the median human crowd on that question set [15]. However, Metaculus superforecasters have achieved Brier scores as low as 0.023 on that platform's question set [15], which shows how much headroom remains between frontier AI and elite human forecasters on well-curated, long-horizon questions.

The AIA Forecaster Claim

The AIA Forecaster technical report [5] makes the strongest AI-parity claim in the academic literature: it states that the AIA Forecaster achieves performance "statistically indistinguishable" from human superforecasters on ForecastBench and surpasses prior LLM baselines. However, the same paper acknowledges that on a harder benchmark built from liquid prediction markets, the AIA Forecaster underperforms market consensus — though an ensemble combining the AIA Forecaster with market consensus outperforms consensus alone [5]. This pattern — parity on the academic benchmark, underperformance on harder real-money markets — is a recurring theme in the evidence.
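The ensemble result is less surprising than it may look. A minimal sketch, assuming a simple weighted average (a linear pool) of the two probability forecasts: the report's actual ensembling method is not specified here, and the numbers below are hypothetical, chosen only so that the model alone trails the market.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def linear_pool(a, b, weight_a=0.5):
    """Weighted average of two probability forecasts, question by question."""
    return [weight_a * pa + (1 - weight_a) * pb for pa, pb in zip(a, b)]

# Hypothetical forecasts on four resolved questions (illustrative only).
model = [0.9, 0.4, 0.6, 0.2]
market = [0.7, 0.1, 0.8, 0.4]
outcomes = [1, 0, 1, 0]

pooled = linear_pool(model, market)
print(round(brier_score(model, outcomes), 4),
      round(brier_score(market, outcomes), 4),
      round(brier_score(pooled, outcomes), 4))   # 0.0925 0.075 0.0706
```

Because squared error is convex, the pooled Brier score can never be worse than the weighted average of the two individual scores. It beats the better of the two only when their errors partly offset, as they do in this constructed example.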

3. Is "Parity" Established, Contested, or Overstated?

The Strongest Evidence That Parity Has Been Reached

The most credible pro-parity evidence comes from four directions:

Score convergence on ForecastBench: The gap between the best AI (Brier Index ~67.9) and superforecasters (70.2–70.6) is now approximately 2.3–2.7 index points, down from about 3.3 index points in October 2025 (Brier scores of 0.101 versus 0.081, a gap of roughly 20% of the best model's score) [17, 2, 3, 4]. The trajectory of improvement, roughly 0.017 Brier points per year as framed by FRI [18], means the gap is narrowing.
Dataset-question subset leadership: One provider's analysis, drawing on leaderboard snapshots, indicates that "green tree" ranked #1 on dataset questions in some snapshots, meaning AI has at minimum reached or exceeded the superforecaster aggregate on the non-market portion of ForecastBench [2, 3]. This is a meaningful milestone even if it does not constitute overall parity.
The AIA Forecaster paper: The technical report [5] explicitly claims statistical indistinguishability from superforecasters on ForecastBench, providing the most formal academic support for a parity claim.
Tournament performance: An AI developed by start-up Mantic placed 4th out of 500+ participants in a major 2025 forecasting tournament and beat the wisdom-of-crowd average of all human forecasters in that event [19]. A 2026 study found that Gemini 2.5 Pro exceeded human venture evaluators in predicting startup success [17].

The Strongest Evidence That Parity Has Not Been Reached

The preponderance of primary-source evidence supports the conclusion that full, consistent parity has not been established:

The official leaderboard: As of late May 2026, the ForecastBench official standings show the superforecaster aggregate ahead of every AI submission by a small but non-zero margin on the overall leaderboard [2, 4]. No AI has exceeded the top human benchmark in a sustained, statistically significant way on the overall score [2].
Good Judgment Inc.'s position: Good Judgment Inc. — the organization that manages the superforecaster baseline — stated explicitly in late 2025 that "Superforecasters still lead," citing a superforecaster Brier score of approximately 0.081 versus the best model's approximately 0.10 [8]. Their analysis [10] also argues that ForecastBench does not capture the full range of superforecaster capabilities.
Market question gap: The substantially larger AI deficit on market and geopolitical questions — the categories most relevant to real-world decision-making in finance and governance — means that even if dataset-question parity has been achieved, the practically important gap remains [11, 17, 4].
Benchmark limitations: The human baseline on ForecastBench is frozen from 2024 data [17, 20]. The benchmark also has known advantages for AI on short-horizon, data-rich questions and does not capture teaming, iterative updating, question formulation, or long-horizon judgment — capabilities where human superforecasters are believed to hold advantages [17, 20, 10].
Expert forecasts of parity: ForecastBench's own extrapolation puts full parity at approximately August 2027 (95% CI: March 2026–August 2028) [2, 3]. The FRI's Wave 5 Longitudinal Expert AI Panel [9] reports a median expert prediction of "by the end of 2030" for AI to outperform superforecasters. Superforecasters themselves predict AI will overtake their benchmark by approximately 2028 [17, 20].

The Honest Assessment

The evidence supports a nuanced position: AI has achieved benchmark-specific, subset-level parity on the dataset-question component of ForecastBench, and the overall gap has narrowed to a range where statistical noise and methodological choices can make it appear to vanish in some analyses. The AIA Forecaster's claim of statistical indistinguishability is the strongest formal support for parity, but it applies to a specific benchmark under specific conditions. The official leaderboard, the Wharton paper, and Good Judgment Inc. all maintain that superforecasters retain the lead on the overall, contamination-controlled benchmark.

The claim that "green tree" hit parity on March 15, 2026 is unverified by primary sources and likely conflates subset-level performance with overall parity. The honest status as of May 31, 2026 is: near-parity on structured, short-to-medium-horizon questions; persistent human advantage on market, geopolitical, and long-horizon questions; full overall parity not yet established but plausibly 12–30 months away.

4. Implications for Finance, Insurance, and Governance

Finance

The most concrete near-term application is in trading and portfolio risk management. Financial institutions including Goldman Sachs have begun integrating AI-assisted forecasts into trading and macro views, including analysis of prediction market data from platforms like Kalshi and Polymarket [21]. AI forecasting improvements are cited as potentially enhancing market predictions, portfolio risk management (including VaR and tail risk estimation), credit assessment, and fraud detection [7, 8].

The systemic risk dimension is also being discussed: if AI systems begin moving markets autonomously through their predictions, new oversight regimes may be needed to prevent destabilizing feedback loops analogous to the flash crashes associated with algorithmic trading [7]. Regulators are already debating whether AI-generated probabilistic market forecasts should be regulated as financial advice or as automated trading signals [7]. The OECD has described AI systems as supporting ex ante policy evaluations by building predictive systems and simulations that help policymakers anticipate potential impacts before implementation [22].

The democratization argument is frequently cited: if superforecaster-level predictive insight becomes available at near-zero marginal cost through AI, the informational advantages currently held by well-resourced institutional investors would compress [7]. This cuts both ways — more efficient markets benefit allocative efficiency but reduce the returns to human analytical skill.

Insurance

Insurance applications are among the most directly actionable. AI underwriting — defined as the application of machine learning, NLP, and predictive analytics to evaluate risk, price policies, and automate approvals using 500 to 1,500 or more data variables per submission [18] — is already in deployment at various carriers. The prospective implication of superforecaster-level AI is more precise catastrophe modeling, improved solvency forecasting, and better-calibrated premium pricing for tail risks including natural disasters and pandemic scenarios [7, 23].

The Forecasting Research Institute's economic effects research [24] and industry analyses [23] both note that AI could improve probabilistic forecasting for claims, solvency, and catastrophe modeling. However, no evidence as of May 2026 supports large-scale replacement of human actuarial or underwriting teams — the current state is augmentation rather than substitution [8, 10].

Governance

The governance implications are the most speculative but potentially the most consequential. The OECD's framework for AI in policy evaluation [22] describes AI systems as enabling ex ante impact assessments that were previously too resource-intensive for routine policy analysis. If AI forecasting reaches reliable superforecaster-level performance on geopolitical and policy questions, governments would have access to institutional-grade probabilistic forecasts at near-zero cost [7].

The risks cited include over-reliance by decision-makers, erosion of public transparency if AI forecasts are used to pre-empt democratic deliberation, and the potential for bad actors to exploit advanced forecasts to game geopolitical events [7]. The Brookings Institution's analysis of public attitudes toward AI governance [25] notes that public trust in AI-generated forecasts for high-stakes decisions remains limited, which may constrain adoption even as technical capabilities advance.

One provider's analysis notes that no evidence of immediate deployment at scale replacing superforecaster teams in governance contexts has been reported as of May 2026 [8, 10]. The current practical state is pilot programs and advisory integration, not operational substitution.

5. Methodological Caveats and the Limits of "Parity" as a Concept

Several structural issues complicate any clean parity determination:

The frozen baseline problem: ForecastBench's human superforecaster baseline was established from 2024 data [17, 20]. If superforecasters have continued to improve — including by using AI assistance — the benchmark may understate the current human frontier. The ForecastBench paper [6] and Good Judgment Inc. [10] both note that the benchmark does not capture the full range of superforecaster capabilities, including iterative updating, collaborative forecasting, question decomposition, and explicit pre-mortem reasoning.

The pre-mortem gap: State-of-the-art models almost never perform explicit pre-mortems (systematically outlining how a prediction could be wrong) or weigh wildcard scenarios — practices that seasoned human forecasters employ routinely and that are believed to contribute substantially to their calibration on low-probability, high-impact events [8]. This gap is not captured by Brier scores on standard benchmark questions.

Question selection effects: AI systems show advantages on short-horizon, data-rich questions and disadvantages on long-horizon, sparse-data, or structurally novel questions [17, 20]. ForecastBench's question distribution may favor AI relative to the question types that matter most for real-world decision-making.

The Goodhart's Law risk: As AI systems are explicitly optimized against ForecastBench, the benchmark's validity as a measure of general forecasting capability may degrade [20]. This is a standard concern with any fixed benchmark that becomes a training target.

Prediction market evidence: The AIA Forecaster's underperformance relative to market consensus on liquid prediction markets [5] — where real money creates strong incentives for calibration — suggests that benchmark performance may overstate real-world forecasting capability. Prediction markets aggregate information from many participants with skin in the game; an AI that trails this consensus on liquid markets has not demonstrated practical superforecaster-level performance regardless of its ForecastBench score.

Summary Table: Evidence State as of May 31, 2026

Dimension	AI Status	Human Status	Gap	Confidence
ForecastBench overall Brier Index	~67.8–67.9 (green tree)	~70.2 (superforecaster median)	~2.3–2.4 points	High [2, 3]
ForecastBench dataset questions	At or near parity in some snapshots	Baseline ~70.6	Near zero / contested	Moderate [2, 3]
ForecastBench market questions	~75.8 (transformed scale)	~80.3	~4.5 points	Moderate [11, 17]
Wharton paper (200-item subset)	0.122 (Claude 3.5 Sonnet)	0.096 (superforecasters)	0.026 Brier (p<0.001)	High [6, 7]
Metaculus crowd comparison	0.1352 (o3) — beats crowd	0.149 (crowd) / 0.023 (superforecasters)	Beats crowd; far from elite	High [15]
Liquid prediction markets	Below market consensus	Market consensus	AI trails	Moderate [5]
"Green tree" March 15 parity date	Unverified	N/A	N/A	Low [1]
Projected full parity (ForecastBench)	~August 2027 (median)	N/A	95% CI: Mar 2026–Aug 2028	Moderate [2, 3]
Expert panel median parity estimate	N/A	N/A	By end of 2030	Moderate [9]

The current honest status is that AI forecasting has made remarkable and well-documented progress, narrowing a gap that looked insurmountable three years ago. On structured, short-to-medium-horizon questions with rich data, frontier AI systems are now operating within the statistical neighborhood of elite human superforecasters. But the official leaderboard, the most rigorous academic paper on the benchmark, and the organizations that manage the human baseline all agree: superforecasters retain the lead, the gap on practically important question categories (markets, geopolitics) remains meaningful, and the claim of established parity as of March or May 2026 is not supported by primary-source evidence.

References

[1] Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross. theinnermostloop.substack.com. https://theinnermostloop.substack.com/p/welcome-to-may-22-2026

[2] ForecastBench. forecastbench.org. https://forecastbench.org

[3] State-of-the-art model forecasting performance over time. forecastbench.org. https://forecastbench.org/explore/index.html

[4] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html

[5] "AIA Forecaster: Technical Report." https://arxiv.org/html/2511.07678v1

[6] "ForecastBench: A Dynamic Benchmark of AI ...." https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf

[7] Human vs AI Forecasts: What Leaders Need to Know. goodjudgment.com. https://goodjudgment.com/human-vs-ai-forecasts

[8] What Superforecasters Actually Said About ForecastBench. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench

[9] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5

[10] What ForecastBench Doesn't Measure. goodjudgment.com. https://goodjudgment.com/what-forecastbench-doesnt-measure

[11] "ForecastBench: A Dynamic Benchmark of AI Forecasting ...." https://arxiv.org/pdf/2409.19839

[12] "ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities." https://arxiv.org/html/2409.19839v4

[13] WeatherNext 2: Google DeepMind’s most advanced forecasting model. blog.google. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/weathernext-2

[14] "Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts." https://arxiv.org/pdf/2507.19477

[15] "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy." https://arxiv.org/pdf/2402.07862

[16] GraphCast: AI model for faster and more accurate global weather forecasting — Google DeepMind. deepmind.google. https://deepmind.google/blog/graphcast-ai-model-for-faster-and-more-accurate-global-weather-forecasting

[17] Making Forecasting Scores Easier to Interpret: Introducing the Brier Index. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/introducing-the-brier-index

[18] Research — Forecasting Research Institute. forecastingresearch.org. https://forecastingresearch.org/research

[19] AI Is Getting Scary Good at Making Predictions. theatlantic.com. https://theatlantic.com/technology/2026/02/ai-prediction-human-forecasters/685955

[20] How well can large language models predict the future?. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/ai-llm-forecasting-model-forecastbench-benchmark

[21] Goldman Sachs Using AI to Analyze Kalshi Polymarket Prediction Markets. investmentnews.com. https://investmentnews.com/emerging-markets/goldman-sachs-using-ai-to-analyze-kalshi-polymarket-prediction-markets/266620

[22] AI in Policy Evaluation. oecd.org. https://oecd.org/en/publications/2025/06/governing-with-artificial-intelligence_398fa287/full-report/ai-in-policy-evaluation_c88cc2fd.html

[23] What Comes Next for Insurance AI | WaterStreet Company. waterstreetcompany.com. https://waterstreetcompany.com/what-comes-next-for-insurance-ai

[24] Economic Effects of AI. forecastingresearch.org. https://forecastingresearch.org/research/economic-effects-of-ai

[25] What the Public Thinks About AI and the Implications for Governance. brookings.edu. https://brookings.edu/articles/what-the-public-thinks-about-ai-and-the-implications-for-governance


Cross-provider analysis

How 4 providers compared on 239 claims across 128 topic clusters


Consensus findings (5)

Multiple providers independently confirmed these. Treat as the most reliable evidence.

ForecastBench shows a small but statistically robust gap: elite human superforecasters still outperform the best large language model systems, and no AI has yet consistently matched or beaten the aggregate superforecaster score.
75% · anthropic, perplexity, openai · [12][15][16][17][25][38]

AI forecasting improvements could enhance financial decision-making, including market prediction, trading, portfolio risk management, credit assessment, and fraud detection.
74% · perplexity, openai, grok · [11][12][40]

ForecastBench is a dynamic, contamination-free benchmark of LLM forecasting accuracy on real-world events.
72% · anthropic, perplexity, grok · [12][16][17][18][19][39][6]

ForecastBench compares LLMs against human baselines including the general public and superforecasters, and uses an automatically generated, regularly updated set of about 1,000 forecasting questions.
71% · anthropic, grok, perplexity · [12][15][16][17][18][19][25][30]

AIA Forecaster on ForecastBench is statistically indistinguishable from human superforecasters, and in some analyses machine forecasters are roughly on par with or near the superforecaster level.
70% · anthropic, openai, grok · [12][36][3][5][6][7]

Contested findings (4)

Providers disagreed. Both positions surfaced rather than picked.

Position A
The newsletter says Google DeepMind's "green tree" claimed the top spot on the Forecasting Research Institute's benchmark. Superforecasters achieved a Brier score of 0.081. As of January 29, 2026, superforecasters were #1 on ForecastBench. As of January 29, 2026, superforecasters led state-of-the-art LLMs by 0.017 Brier points. As of early March 2026, superforecasters led the ForecastBench leaderboard with a Brier score of 0.086. On the Brier Index, superforecasters score 70.6%. On the Brier Index, the best LLMs score 67.9%. As of the Feb 2026 analysis, superforecasters still led on the overall ForecastBench leaderboard. On ForecastBench, elite human superforecasters have a median Brier score around 0.086–0.096. On ForecastBench, elite human superforecasters have a Brier Index of about 70.6%. Claims that a DeepMind system codenamed “Green Tree” achieved full parity with top human superforecasters around mid-March 2026 appear at best premature and at worst incorrect. Public ForecastBench leaderboards still show a human superforecaster aggregate in first place. Public ForecastBench leaderboards show “green tree” and other AI models trailing the human superforecaster aggregate by a small but non-zero margin. ForecastBench uses Brier scores and Brier Index as its primary performance metrics. In the ForecastBench study, superforecasters achieve an overall mean Brier score of 0.096. In the ForecastBench study, the general public achieves a Brier score of 0.121. A complementary Forecasting Research Institute analysis reports that superforecasters on ForecastBench achieve a Brier score of 0.086. A complementary Forecasting Research Institute analysis reports that superforecasters on ForecastBench achieve a Brier Index of 70.6%. The best LLM ensembles and standalone models at the time achieve Brier Index scores of roughly 67.9%. 
The ForecastBench leaderboard snapshot from early 2026 shows the top ranked entry as the “ForecastBench, Superforecaster median forecast.” The ForecastBench leaderboard snapshot from early 2026 shows a Google DeepMind system labeled “green tree” ranked below the superforecaster aggregate. The ForecastBench leaderboard snapshot from early 2026 shows xAI’s “Grok 4.20 (Preview)” ranked below “green tree.” The ForecastBench leaderboard snapshot from early 2026 shows another DeepMind system, “yellow mouse,” ranked below xAI’s Grok 4.20 (Preview). The Forecasting Research Institute’s Brier Index analysis reports that superforecasters score 70.6% and the best LLMs score 67.9% on the same benchmark. The Wharton ForecastBench paper finds that superforecasters achieve an overall mean Brier score of 0.096 on a 200-item subset of the benchmark.
anthropic, perplexity
[12][15][16][18][19][23][25][27][38][7]
Position B
The Wharton ForecastBench paper finds that the general public forecasters have a mean Brier score of 0.121.
In related Metaculus studies, superforecasters have achieved Brier scores as low as 0.023.
ForecastBench leaderboards list anonymized AI systems by codenames such as “green tree” and “yellow mouse” for DeepMind submissions, and “Grok 4.20 (Preview)” for xAI’s system.
The best available numbers show superforecasters with Brier scores around 0.086–0.096.
The best available numbers show the best LLMs with Brier scores around 0.101–0.1352.
The best available numbers show the best LLMs with Brier Index of 67.9%.
In mid-March 2026, a Google DeepMind forecasting system reportedly code-named “Green Tree” was said to have reached human “superforecaster” parity on a key benchmark.
The superforecaster median on the ForecastBench leaderboard was about 70.2.
The ForecastBench difficulty-adjusted Brier score is a rigorous measure where 0.0 is perfect and 0.25 is random guessing.
The ForecastBench difficulty-adjusted Brier score is converted to a 0–100 index for interpretability.
The median elite human “superforecaster” difficulty-adjusted Brier score stands around 0.08.
The DeepMind “Green Tree” model effectively matched the superforecaster aggregate in March 2026 by score.
ForecastBench’s official standings still show the superforecaster aggregate ahead of every AI by a small but significant margin.
Good Judgment Inc. said the superforecaster Brier score was around 0.081.
Good Judgment Inc. said the best model’s Brier score was about 0.10.
Superforecasters retain the lead on the overall leaderboard.
Superforecasters retain the lead on market/geopolitical questions.
Google DeepMind submitted an entry labeled "green tree" to the Forecasting Research Institute’s ForecastBench tournament leaderboard.
ForecastBench is scored via a difficulty-adjusted Brier Index.
Multiple sources state that DeepMind’s "green tree" and related submissions like "yellow mouse" first hit or exceeded superforecaster performance on subsets around March 15, 2026.
Superforecasters lead on the overall ForecastBench leaderboard.
Superforecasters lead substantially on market questions.
Earlier baseline results showed superforecasters at about 0.096 Brier.
Earlier baseline results showed top LLMs at about 0.12+ Brier.
perplexity, openai, grok
[11][12][15][16][17][18][1][23][2][3][4][6][7]
4 providers split on this claim.
Position A
As of late May 2026, AI has reached, at most, a contested and benchmark-specific "statistical tie" with human superforecasters on one academic benchmark (ForecastBench). As of late May 2026, the strongest available empirical evidence indicates that frontier AI forecasting systems are now competitive with, and sometimes significantly superior to, typical human forecasters. As of late May 2026, the strongest available empirical evidence indicates that frontier AI forecasting systems still fall short of the very best “superforecasters” on broad, contamination-controlled benchmarks.
anthropic, perplexity
[12][17][18][25][38]
Position B
By late May 2026, human superforecasters still maintain a slight edge on average accuracy. By late May 2026, the gap between human superforecasters and AI models has narrowed to single-digit percentage points. As of late May 2026, AI has not reached consistent, overall parity with elite human superforecasters on ForecastBench.
openai, grok
[12][17][18][6]
This says AI has reached at most a contested, benchmark-specific statistical tie with human superforecasters on ForecastBench, which conflicts with claims that it still falls short or lacks consistent parity.
Position A
A lower Brier score is better.
perplexity
[15][30]
Position B
Higher Brier Index scores are better.
grok
[17][18]
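These claims only appear to conflict, because they refer to different metrics: Position A concerns the raw Brier score, where lower is better, while Position B concerns the Brier Index, a rescaling of the Brier score on which higher is better.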
Position A
DeepMind’s public AI announcements during this period focus on WeatherNext 2.
perplexity
[5]
Position B
DeepMind’s public AI announcements during this period focus on Genie 3.
perplexity
[6][9]
Positions A and B make mutually exclusive assertions about what DeepMind’s public AI announcements focus on during the same period.

Single-source insights (108)

Reported by only one provider. Treat as preliminary unless independently verified.

The Brier score was introduced by Glenn W. Brier in 1950.
71%
perplexity
[15][30]
In the Metaculus evaluation study, OpenAI o3 achieves a Brier score of 0.1352, while the human crowd baseline achieves a Brier score of 0.149.
71%
perplexity
[4]
Google DeepMind and Google Research introduced WeatherNext 2.
70%
anthropic
[22]
Good Judgment Inc. emphasized in late 2025 that “Superforecasters still lead.”
70%
openai
[11]
The "green tree" reference is real but traces to a single dated newsletter by Dr. Alex Wissner-Gross, "Welcome to May 22, 2026."
70%
anthropic
[27]
The article reports that the AI-Human Parity contract on Metaculus has a median predicted resolution date of November 2026.
69%
perplexity
[13][27][2]
+ 102 more single-source insights

Low-confidence claims (61)

Weak signals the verifier flagged for hedged language in the report.

Inverse scaling risks on distributional forecasts in regime-shift scenarios are noted.
55%
grok
DeepMind’s public AI forecasting/announcement focus in this period is on meteorological systems, specifically WeatherNext 2 rather than Genie 3.
55%
anthropic, perplexity
Goodhart’s Law risks are debated in relation to AI forecasting.
56%
grok
Lower Brier scores are better than higher Brier scores.
56%
perplexity, grok
Algorithmic trading sometimes causes unforeseen cascades.
57%
openai
+ 56 more low-confidence claims

Go Deeper

Follow-up questions based on where providers disagreed or confidence was low.

Verify whether Google DeepMind’s “green tree” was actually the top-ranked ForecastBench entry around March 15, 2026, and whether the benchmark was ForecastBench’s overall leaderboard, a subset, or a newsletter-only snapshot

The weak signals split on the core parity claim: some say the reported date does not match the documented timeline, the only sourced appearance traces to a single dated newsletter, and public leaderboards still show the superforecaster aggregate ahead of AI. This needs a direct source check to separate a real benchmark result from a later retelling or subset-specific claim.

Disagreement · XS tier

Reconcile the exact performance numbers for ForecastBench: superforecaster aggregate Brier score, Green Tree Brier score, and whether the comparison is being made on the full leaderboard, a 200-item subset, or another cutoff

The signals contain incompatible figures and scaling conventions: superforecasters are cited at 0.081, 0.086, and 0.096; Green Tree is cited around 0.090, 0.101–0.1352, and 64.6–67.9 on a 0–100 scale; and providers disagree on whether lower or higher is better. A focused numeric reconciliation is necessary to determine if parity is real or an artifact of mixed metrics.
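One way to start that reconciliation is to put every cited figure on both scales. The sketch below assumes the Brier Index definition reported elsewhere in this report, (1 - sqrt(Brier score)) × 100, and uses illustrative helper names (`to_index`, `to_brier`) that do not come from ForecastBench. It does not settle whether a given figure is raw or difficulty-adjusted, so the converted values are only comparable if the inputs were measured on the same question set.

```python
import math

def to_index(brier):
    """Brier score -> Brier Index, assuming (1 - sqrt(Brier)) * 100."""
    return (1.0 - math.sqrt(brier)) * 100.0

def to_brier(index):
    """Brier Index -> Brier score (algebraic inverse of to_index)."""
    return (1.0 - index / 100.0) ** 2

# Figures quoted in the signals above; the labels are descriptive only.
cited_brier = {
    "superforecasters": [0.081, 0.086, 0.096],
    "Green Tree / best LLMs": [0.090, 0.101, 0.1352],
}
cited_index = {"Green Tree (0-100 scale)": [64.6, 67.9]}

for label, values in cited_brier.items():
    for b in values:
        print(f"{label:<26} Brier {b:.4f} -> index {to_index(b):.1f}")

for label, values in cited_index.items():
    for i in values:
        print(f"{label:<26} index {i:.1f} -> Brier {to_brier(i):.4f}")
```

Under this assumption an index of 67.9 corresponds to a Brier score of about 0.103, which suggests that at least some of the apparently incompatible figures are the same result reported on two scales.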

Disagreement · S tier

Check ForecastBench’s official parity estimate and confidence interval versus the March/May 2026 leaderboard snapshots to determine whether claimed “parity” is forecasted, observed, or overstated

One signal says ForecastBench’s official site estimates parity around August 2027 with a wide confidence interval, while others claim parity was reached in March 2026 or that the gap is still statistically significant. This is a direct contradiction that can be resolved by comparing the official parity forecast with contemporaneous scores and leaderboard ordering.

Disagreement · XS tier

Investigate whether the claimed Green Tree result was on a dataset-question subset with favorable short-horizon, data-rich characteristics rather than the broader event-forecasting task elite superforecasters excel at

Several signals warn that benchmark caveats may favor AI on short-horizon, data-rich questions while under-capturing teaming, updating, question formulation, long-horizon judgment, and wildcard scenarios. If the Green Tree result came from such a subset, it would weaken any claim of general parity with elite superforecasters.

Implication · M tier

Assess the downstream implications of AI forecasting parity for finance, insurance, and governance using concrete examples such as underwriting, regulatory stress testing, policy scenario analysis, and automated trading oversight

The signals suggest several plausible consequences—new rules for AI-driven predictions in finance and insurance, improved policy scenario analysis and regulatory stress testing, and risks from market cascades, over-reliance, or misuse—but these are mostly speculative. A targeted implications review would separate cited impacts from generic extrapolation and identify which sectors are most immediately affected.

Implication · M tier

Key Claims

Cross-provider analysis with confidence ratings and agreement tracking.

128 claims · sorted by confidence

high·anthropic, perplexity, openai, grok·theinnermostloop.substack.com goodjudgment.com forecastbench.org+14·

ForecastBench shows a small but statistically robust gap: elite human superforecasters still outperform the best large language model systems, and no AI has yet consistently matched or beaten the aggregate superforecaster score.

high·anthropic, perplexity, openai·goodjudgment.com forecastbench.org forecastingresearch.org+3·

AI forecasting improvements could enhance financial decision-making, including market prediction, trading, portfolio risk management, credit assessment, and fraud detection.

high·perplexity, openai, grok·goodjudgment.com goodjudgment.substack.com deepmind.google·

ForecastBench is a dynamic, contamination-free benchmark of LLM forecasting accuracy on real-world events.

high·anthropic, perplexity, grok·goodjudgment.com image-ppubs.uspto.gov forecastbench.org+4·

ForecastBench compares LLMs against human baselines including the general public and superforecasters, and uses an automatically generated, regularly updated set of about 1,000 forecasting questions.

high·anthropic, grok, perplexity·goodjudgment.com forecastbench.org image-ppubs.uspto.gov+5·

The Brier Index is defined as (1 - sqrt(Brier score)) × 100%, producing a 0–100 scale.
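As a minimal sketch of this definition, the conversion fits in a few lines of Python. The function names `brier_score` and `brier_index` are illustrative rather than taken from ForecastBench's code, and the sketch uses the plain binary Brier score, not ForecastBench's difficulty adjustment.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_index(score):
    """Brier Index as defined above: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(score)) * 100.0

# Always answering 50% scores 0.25 on any set of binary questions,
# which maps to a Brier Index of 50.
uninformed = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])
print(uninformed, brier_index(uninformed))

# The cited superforecaster Brier score of 0.086 maps to an index near 70.7,
# in line with the roughly 70.6% reported.
print(round(brier_index(0.086), 1))
```

Because the index subtracts the square root of the score, a lower Brier score always yields a higher Brier Index, so rankings on the two scales agree.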

high·perplexity, grok·forecastbench.org forecastbench.org unherd.com·

In the traditional Brier score, lower values indicate better performance.

high·perplexity, grok·forecastbench.org forecastbench.org roots.ai+1·

WeatherNext 2 generates forecasts 8x faster, completing them in less than a minute on a single TPU.

high·anthropic, perplexity·blog.google arxiv.org·

As of late May 2026, DeepMind Green Tree was the best AI on the ForecastBench leaderboard with an overall difficulty-adjusted Brier Index of about 67.8–67.9, while the superforecaster median overall Brier Index was 70.2.

high·openai, grok·forecastbench.org forecastbench.org·

The 0.101 vs 0.081 result is a gap of 0.020 Brier points in favor of superforecasters: about 20% of the best model's score, or about 25% of the superforecasters' score.
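The size of a "relative gap" depends on which score is used as the base. The short sketch below works through the arithmetic for the two figures quoted here; `relative_gap` is a hypothetical helper name introduced only for illustration.

```python
def relative_gap(model_score, human_score, base):
    """Absolute Brier gap expressed as a fraction of the chosen base score."""
    return (model_score - human_score) / base

model, human = 0.101, 0.081  # Brier scores; lower is better

print(f"absolute gap:            {model - human:.3f}")
print(f"relative to model score: {relative_gap(model, human, model):.1%}")
print(f"relative to human score: {relative_gap(model, human, human):.1%}")
```

The roughly 20% figure corresponds to using the model's score as the base; measured against the superforecasters' score, the same 0.020 gap is closer to 25%.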

high·anthropic, openai·goodjudgment.com forecastbench.org forecastbench.org·

ForecastBench was introduced in an arXiv preprint in September 2024.

high·anthropic, perplexity·image-ppubs.uspto.gov forecastbench.org arxiv.org+1·

The Brier score was introduced by Glenn W. Brier in 1950.

high·perplexity·roots.ai unherd.com·

In the Metaculus evaluation study, OpenAI o3 achieves a Brier score of 0.1352, while the human crowd baseline achieves a Brier score of 0.149.

high·perplexity·arxiv.org·

Google DeepMind and Google Research introduced WeatherNext 2.

high·anthropic·blog.google·

Good Judgment Inc. emphasized in late 2025 that “Superforecasters still lead.”

high·openai·goodjudgment.substack.com·

Sources

34 unique sources cited across 128 claims.

Academic (7 sources)

ForecastBench: A Dynamic Benchmark of AI Forecasting ...

arxiv.org via anthropic, perplexity, openai, grok

9 claims

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

arxiv.org via anthropic, perplexity, openai, grok

8 claims

Asking Better Questions -- The Art and Science of Forecasting: A mechanism for truer answers to high-stakes questions

arxiv.org via anthropic, perplexity, openai, grok

5 claims

AIA Forecaster: Technical Report

arxiv.org via anthropic, openai, grok, perplexity

5 claims

Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

arxiv.org via anthropic, perplexity, openai, grok

3 claims

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

arxiv.org via anthropic, perplexity, grok

3 claims

ForecastBench: A Dynamic Benchmark of AI ...

faculty.wharton.upenn.edu via anthropic

1 claim

Government (1 source)

System and method for enhanced collaborative forecasting

image-ppubs.uspto.gov via anthropic, perplexity, openai, grok

8 claims

News & Media (12 sources)

What Superforecasters Actually Said About ForecastBench

goodjudgment.substack.com via perplexity, openai, grok, anthropic

15 claims

What the superforecasters are predicting in 2026

unherd.com via perplexity, grok, anthropic, openai

13 claims

Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross

theinnermostloop.substack.com via anthropic, perplexity, openai, grok

12 claims

Making Forecasting Scores Easier to Interpret: Introducing the Brier Index

forecastingresearch.substack.com via anthropic, perplexity, openai, grok

9 claims

User | times-online.com - The Great Forecast Convergence: AI Closing the 20% Gap on Human Superforecasters

business.times-online.com via perplexity, openai, grok

9 claims

How well can large language models predict the future?

forecastingresearch.substack.com via grok

9 claims

LLMs are closing the gap on human (forecastingresearch.substack.com)

forecastingresearch.substack.com via anthropic, perplexity, openai, grok

8 claims

A professional superforecaster walks (forecastingresearch.substack.com)

forecastingresearch.substack.com via anthropic, perplexity, openai, grok

5 claims

The gap between the best forecasting agent and frontier models is mostly epistemic, not factual

reddit.com via openai

4 claims

FinancialContent - The Great Forecast Convergence: AI Closing the 20% Gap on Human Superforecasters

markets.financialcontent.com via anthropic, perplexity, openai, grok

3 claims

AI superforecasters parity, ForecastBench Brier score, DeepMind green tree, probabilistic forecasting 2026, market forecasting AI, Brier Index comparison, AI forecasting benchmarks

Synthesized from 4 providers on May 31, 2026 using fast mode
