Executive Summary
- "Green Tree" is real but mischaracterized: Google DeepMind's "green tree" is a confirmed ForecastBench tournament submission, ranked #2 overall with a Brier Index of approximately 67.8–67.9 as of the May 23, 2026 leaderboard update, about 2.3–2.4 points below the superforecaster median of 70.2. The specific claim of parity "around March 15, 2026" originates from a single secondary source [9] and cannot be independently verified against primary leaderboard data [1, 2].
- The benchmark numbers are clear: On ForecastBench, superforecasters hold a Brier Index of ~70.2 overall; the best AI system (DeepMind "green tree") sits at ~67.8–67.9, a gap of roughly 2.3–2.4 Brier Index points that is statistically meaningful, not noise [1, 2].
- Parity is overstated in popular framing: The gap has closed dramatically since 2023 (when GPT-4 posted a difficulty-adjusted Brier score of 0.131 against the superforecasters' 0.081; lower is better), and AI has achieved near-parity or transient parity on specific data-rich question subsets, but comprehensive parity on the full benchmark has not been established as of late May 2026 [3, 4].
- The strongest pro-parity evidence comes from Bridgewater AIA Labs' November 2025 paper showing the AIA Forecaster scoring 0.1125 on FB-7-21 against a superforecaster median of 0.1110 — a near-tie on one benchmark slice, though the system still lags on harder market questions [5, 6].
- Practical implications are real but prospective: Finance, insurance, and governance applications are widely discussed, but quantified deployment outcomes remain sparse; the dominant framing is "AI plus superforecasters," not replacement [7, 8].
1. The "Green Tree" System: Verification of the System, Date, and Benchmark
What "Green Tree" Actually Is
Google DeepMind's "green tree" is a confirmed entry on the ForecastBench tournament leaderboard, operated by the Forecasting Research Institute (FRI) [1, 2]. Multiple independent sources corroborate that DeepMind submitted forecasting systems under internal code names rather than public product brand names, with at least two entries visible on the leaderboard: "green tree" and a second entry labeled "yellow mouse" [1]. This naming convention (a color paired with a common noun) is consistent with internal project labeling rather than public product releases, and DeepMind has not publicly announced a product called "Green Tree" [1].
ForecastBench itself is a dynamic, contamination-free benchmark that continuously generates and curates approximately 1,000 real-world forecasting questions spanning geopolitics, economics, technology, and public health [3, 4]. It evaluates both human forecasters and AI systems using difficulty-adjusted Brier scores, converted to a Brier Index for public reporting [3, 4]. The benchmark is specifically designed to prevent training-data contamination by focusing on future events at the time of question creation [3].
The March 15, 2026 Parity Claim: What the Evidence Shows
The specific claim that Green Tree "hit parity with top human superforecasters around March 15, 2026" traces to a single secondary source — a Substack newsletter dated May 22, 2026 [9]. This source is not a primary leaderboard record, a peer-reviewed paper, or an FRI publication. No FRI methodology post, no ForecastBench leaderboard snapshot, and no DeepMind announcement corroborates the specific date of March 15, 2026 as a parity milestone.
What the primary leaderboard data does show is more nuanced. One source reports that earlier preliminary leaderboard states showed DeepMind models transiently appearing above the superforecaster median before additional question resolutions shifted rankings — consistent with the kind of transient crossing that could generate a "parity" headline without representing durable parity [1]. The ForecastBench tournament leaderboard allows scaffolded and ensembled entries, and results can be noisy in small samples or on preliminary boards [1, 2].
The most defensible interpretation: Green Tree may have briefly appeared at or above the superforecaster median on a specific leaderboard snapshot or question subset (particularly data-rich "dataset questions") around that timeframe, generating the parity claim. But as of the authoritative May 23, 2026 leaderboard update, Green Tree ranked #2 overall — below the superforecaster median — indicating that if transient parity occurred, it did not persist on the full benchmark [1, 2].
Current Leaderboard Position (May 23, 2026)
The May 23, 2026 ForecastBench tournament leaderboard update provides the most current authoritative figures [1, 2]:
| Forecaster | Overall Brier Index | Dataset Questions | Market Questions | Rank |
|---|---|---|---|---|
| Superforecaster Median | 70.2 | 63.6 | 78.6 | #1 |
| DeepMind "green tree" | ~67.8–67.9 | ~64.6–64.8 | ~71.4 | #2 |
| DeepMind "yellow mouse" | ~67.6 | — | — | #3 (approx.) |
| xAI Grok 4.20 Preview | ~67.4 | — | — | #4 (approx.) |
| Public Median Forecaster | ~64.5 | — | — | Lower |
| Claude Sonnet 4.5 (zero-shot) | ~63.3 | — | — | Lower |
| OpenAI o3 (scratchpad) | ~63.2 | — | — | Lower |
Sources: [1, 2]
Several features of this table are analytically important. First, Green Tree leads all AI systems but trails superforecasters by approximately 2.3–2.4 Brier Index points overall. Second, on dataset questions — the data-rich, shorter-horizon items where AI has structural advantages — Green Tree (64.6–64.8) actually exceeds the superforecaster median (63.6), which is the most plausible basis for the March 15 parity claim [1]. Third, on market questions — longer-horizon, judgment-intensive items — the gap reverses sharply: superforecasters score 78.6 versus Green Tree's 71.4, a 7.2-point deficit for AI [1, 2]. This asymmetry is critical to understanding what "parity" means in this context.
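The three gaps quoted above follow directly from the table. As a quick arithmetic check, the sketch below recomputes them. Using the midpoint of each reported range for "green tree" is an assumption made here for illustration, not a figure the leaderboard reports.

```python
# Brier Index values copied from the May 23, 2026 table above.
# Assumption: where the table gives a range, the midpoint is used.
SUPERFORECASTER = {"overall": 70.2, "dataset": 63.6, "market": 78.6}
GREEN_TREE = {"overall": 67.85, "dataset": 64.7, "market": 71.4}


def index_gaps(human, ai):
    """Return human-minus-AI Brier Index gaps; positive means humans lead."""
    return {category: round(human[category] - ai[category], 2) for category in human}


gaps = index_gaps(SUPERFORECASTER, GREEN_TREE)
for category, gap in gaps.items():
    leader = "superforecasters" if gap > 0 else "green tree"
    print(f"{category:8s} gap = {gap:+.2f} points ({leader} lead)")
```

The sign flip on the dataset row is the asymmetry the paragraph describes: the AI entry leads on data-rich questions and trails on market questions.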
2. ForecastBench Scores: Current Best AI Brier Score vs. Human Superforecaster Aggregate
The Brier Score and Brier Index: Technical Grounding
The Brier score, introduced by Glenn W. Brier in 1950, measures the mean squared error between predicted probabilities and realized binary outcomes: for a single forecast, (p − o)², and for N forecasts, (1/N)Σ(pᵢ − oᵢ)² [3, 4]. Lower scores indicate better calibration and accuracy; a perfect forecaster scores 0, while a naive forecaster always predicting 50% scores 0.25 [3].
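A minimal sketch of the definition above, useful for checking the two reference points (0 for a perfect forecaster, 0.25 for a constant 50% forecast). The function name and the toy inputs are illustrative, not taken from ForecastBench code.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 0, 1, 0]
print(brier_score([0.5, 0.5, 0.5, 0.5], outcomes))  # constant 50% forecast
print(brier_score([1.0, 0.0, 1.0, 0.0], outcomes))  # perfect forecast
```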
FRI's Brier Index transforms this into a more intuitive scale: Brier Index = (1 − √Brier score) × 100% [1]. A score of 100% represents perfect foresight; 50% represents coin-flip forecasting; 0% represents a maximally wrong forecaster [1]. A Brier score of 0.086 corresponds to a Brier Index of approximately 70.7% [1]. ForecastBench additionally applies difficulty-adjusted Brier scores that subtract question-level fixed effects (γⱼ) estimated from the observed distribution of scores across many forecasters, enabling apples-to-apples comparisons across forecasters who answered different question sets [3, 4].
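The conversion and the idea behind difficulty adjustment can be sketched as follows. The index formula is the one stated above. The adjustment is a deliberate simplification and an assumption of this sketch: it estimates each question's effect as that question's mean score minus the grand mean, whereas ForecastBench estimates fixed effects from a fitted model. All forecaster names and scores are hypothetical.

```python
import math


def brier_index(brier):
    """Brier Index = (1 - sqrt(Brier score)) * 100, as stated in the text."""
    return (1.0 - math.sqrt(brier)) * 100.0


def brier_from_index(index):
    """Inverse of brier_index: recover the Brier score implied by an index value."""
    return (1.0 - index / 100.0) ** 2


def difficulty_adjust(scores):
    """Simplified adjustment (assumption): gamma_j = question mean - grand mean.

    scores maps forecaster -> {question: raw Brier score}.
    Returns forecaster -> mean of (raw score - gamma_j) over answered questions.
    """
    per_question = {}
    for by_question in scores.values():
        for question, score in by_question.items():
            per_question.setdefault(question, []).append(score)
    all_scores = [s for values in per_question.values() for s in values]
    grand_mean = sum(all_scores) / len(all_scores)
    gamma = {q: sum(v) / len(v) - grand_mean for q, v in per_question.items()}
    return {
        name: sum(s - gamma[q] for q, s in by_question.items()) / len(by_question)
        for name, by_question in scores.items()
    }


# Hypothetical data: A skipped the hard question, B skipped the easy one.
raw = {
    "A": {"easy": 0.02, "shared": 0.10},
    "B": {"hard": 0.30, "shared": 0.10},
    "C": {"easy": 0.04, "hard": 0.34, "shared": 0.12},
}
adjusted = difficulty_adjust(raw)
print(f"index for Brier 0.25: {brier_index(0.25):.1f}")
print(f"Brier implied by index 67.9: {brier_from_index(67.9):.4f}")
print({name: round(value, 4) for name, value in adjusted.items()})
```

In the toy data, B's raw average looks far worse than A's only because B answered the hard question; after subtracting the question effects the two are close, which is the comparison the adjustment is meant to enable.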
The Trajectory: 2023 to May 2026
The improvement in AI forecasting accuracy over this period is well-documented across multiple sources [3, 4]:
| Date | System | Difficulty-Adjusted Brier Score | Notes |
|---|---|---|---|
| March 2023 | GPT-4 | 0.131 | Baseline LLM performance |
| February 2025 | GPT-4.5 | 0.101 | ~23% lower error than GPT-4 about 2 years earlier |
| October 2025 | Best LLM (ForecastBench) | 0.101 | Superforecasters at 0.081 |
| November 2025 | AIA Forecaster (FB-7-21) | 0.1125 | Superforecaster median: 0.1110 |
| March 2026 | Best LLM (ForecastBench) | ~0.103 | Superforecasters at 0.086 |
| May 2026 | DeepMind "green tree" | ~0.103–0.104 (est., implied by the Brier Index formula) | Brier Index ~67.8–67.9; SF at 70.2 |
Sources: [3, 4, 5, 6, 1]
The October 2025 snapshot is particularly well-documented: the superforecaster median achieved an overall difficulty-adjusted Brier score of 0.081, while the best LLM entry achieved 0.101, so the LLM's error was about 25% higher than the superforecasters' (equivalently, superforecaster error was about 20% lower) [3, 4]. FRI's October 2025 linear extrapolation, based on an estimated annual improvement rate of approximately 0.016 difficulty-adjusted Brier points, projected LLM-superforecaster parity on ForecastBench by November 2026, with a 95% confidence interval spanning December 2025 to January 2028 [3, 4].
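The logic of a linear extrapolation can be illustrated with a back-of-envelope version: divide the remaining gap by the annual improvement rate. This is a sketch under strong assumptions (a frozen human score and a constant improvement rate), and it is not FRI's method. FRI's projection comes from a fitted trend with its own intercept and confidence interval, so this shortcut should not be expected to reproduce the published date exactly.

```python
def years_to_parity(llm_score, human_score, annual_improvement):
    """Years until the LLM score reaches the human score under a constant linear trend.

    Scores are Brier scores (lower is better); annual_improvement is the yearly
    reduction in the LLM's score. Returns 0.0 if the LLM is already at or below
    the human score.
    """
    gap = llm_score - human_score
    if gap <= 0:
        return 0.0
    if annual_improvement <= 0:
        raise ValueError("annual_improvement must be positive")
    return gap / annual_improvement


# Figures quoted in the paragraph above (October 2025 snapshot).
print(round(years_to_parity(0.101, 0.081, 0.016), 2))  # 1.25
```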
By March 5, 2026, an updated FRI extrapolation revised the parity estimate to March 2027 on the Brier score metric and May 2027 on the Brier Index metric, with 95% CIs of February 2026–January 2028 and April 2026–May 2028 respectively [1, 2]. A separate linear trend projection cited by one source estimates LLM-superforecaster parity around August 2027, with a 95% CI of March 2026 to August 2028 [1].
The AIA Forecaster: The Strongest Pro-Parity Data Point
The most compelling single data point in favor of near-parity comes from Bridgewater AIA Labs' AIA Forecaster, described in Alur et al., arXiv:2511.07678, dated November 10–11, 2025 [5, 6]. This system combines agentic search over high-quality news sources, a supervisor agent that reconciles disparate forecasts for the same event, and statistical calibration techniques designed to counter behavioral biases in LLMs [5, 6].
On the FB-7-21 benchmark slice, the AIA Forecaster scored 0.1125 against a superforecaster median of 0.1110 — a difference of 0.0015 Brier points, which is within the margin of noise for this sample size [5, 6]. This represents the closest any documented AI system has come to matching the superforecaster median on a ForecastBench-derived evaluation.
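Whether a 0.0015 difference sits inside the noise depends on the spread of per-question scores and on the number of questions. One common way to check is a paired bootstrap over questions. The sketch below is illustrative only: the per-question scores are synthetic, the question count of 300 and the noise level are assumptions, and this is not the procedure used in the AIA Forecaster paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question score difference (a - b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    low = resampled_means[int((alpha / 2) * n_resamples)]
    high = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high


# Synthetic per-question Brier scores (hypothetical, for illustration only).
rng = random.Random(1)
human = [rng.uniform(0.0, 0.3) for _ in range(300)]
ai = [min(1.0, max(0.0, h + rng.gauss(0.0015, 0.05))) for h in human]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(ai, human)
print(f"mean difference = {mean_diff:+.4f}, 95% CI = [{ci_low:+.4f}, {ci_high:+.4f}]")
```

If the resulting interval contains zero, the data do not distinguish the two forecasters at that confidence level.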
However, the same paper reveals important limitations. On the FB-Market subset (questions tied to prediction markets, which tend to be harder and more judgment-intensive), the AIA Forecaster scored 0.0753 against a human state-of-the-art of 0.0740, trailing by 0.0013 Brier points [5, 6]. On the more challenging MarketLiquid benchmark, the AIA Forecaster scored 0.108 while market consensus scored 0.098, a deficit of about 10% [5, 6]. Notably, an ensemble combining the AIA Forecaster and market prices achieved 0.092, outperforming both components individually, which suggests complementarity rather than substitution [5, 6].
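Why an ensemble can beat both of its components is easy to see on a toy example: when two forecasters err on different questions, averaging their probabilities cancels part of each error. The sketch below uses simple equal-weight averaging and made-up numbers. Both are assumptions for illustration; the paper's actual ensembling method and data are not reproduced here.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def average_ensemble(probs_a, probs_b):
    """Equal-weight average of two probability forecasts (illustrative choice)."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


# Hypothetical forecasts: each source is confident on the questions
# where the other is hesitant.
outcomes = [1, 0, 1, 0]
model = [0.9, 0.4, 0.6, 0.1]
market = [0.6, 0.1, 0.9, 0.4]
ensemble = average_ensemble(model, market)

print(f"model    {brier_score(model, outcomes):.4f}")
print(f"market   {brier_score(market, outcomes):.4f}")
print(f"ensemble {brier_score(ensemble, outcomes):.4f}")
```

Averaging never does worse than the mean of the two component Brier scores, because squared error is convex. It beats both components only when their errors are not strongly correlated, as in this constructed case.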
Good Judgment's Market-Question Data
Good Judgment Inc.'s own analysis of ForecastBench performance on market questions reveals a more persistent gap [7, 8]. On market-question subsets, Good Judgment reports a superforecaster Brier score of approximately 0.039 versus approximately 0.059 for the best AI entrant, meaning the best AI's Brier score is roughly 50% higher than that of superforecasters on these questions [7, 8]. This is consistent with the tournament leaderboard data showing a 7.2-point Brier Index gap on market questions (78.6 vs. 71.4) [1].
3. Is "Parity" Actually Established, or Is It Contested and Overstated?
The Case That Parity Has Been Reached (or Is Imminent)
The strongest evidence for parity rests on several pillars. First, the AIA Forecaster's November 2025 result on FB-7-21 (0.1125 vs. 0.1110) demonstrates that a purpose-built AI forecasting system can match the superforecaster median on a well-designed benchmark slice [5, 6]. Second, on ForecastBench's dataset questions (the data-rich, shorter-horizon category), Green Tree (64.6–64.8 Brier Index) exceeds the superforecaster median (63.6) as of May 2026 [1]. Third, the trajectory of improvement is steep and consistent: the best LLM's difficulty-adjusted Brier score fell from GPT-4's 0.131 in March 2023 to 0.101 by October 2025, roughly a 23% reduction in error over about two and a half years [3, 4]. Fourth, one secondary source reports that the Forecasting Research Institute characterized the Green Tree result as a milestone achieved earlier than many expected, and that Green Tree's results showed no statistically significant difference from human top performers on certain question suites [9].
The Case That Parity Is Overstated
The evidence against comprehensive parity is, on balance, stronger and better-sourced. Several structural critiques of the parity claim deserve careful attention.
The frozen baseline problem. ForecastBench's human baseline is a frozen 2024 dataset [1, 8]. AI systems are evaluated against a static human benchmark rather than against superforecasters forecasting in real time. If superforecasters continued improving through 2025–2026 — which is plausible given their track record of calibration improvement — the true gap may be larger than the leaderboard suggests [8].
Question composition bias. ForecastBench is weighted toward data-rich, quantitative questions where AI has structural advantages [8]. The benchmark does not fully capture teaming, iterative updating, question formulation, or black-swan and geopolitical judgment — precisely the domains where superforecasters' qualitative reasoning skills are most valuable [1, 8]. Good Judgment's own analysis explicitly notes that ForecastBench does not measure what superforecasters actually do best [8].
The multiple-attempts problem. ForecastBench allows AI systems to submit updated forecasts every two weeks, while the human baseline is frozen [1]. Given enough evaluation cycles, statistical noise alone will eventually produce a snapshot in which an AI system appears to match or beat the human baseline. This is a form of multiple-comparisons inflation that can generate misleading parity headlines [1].
The market-question gap persists. On market questions, the hardest and most judgment-intensive category, superforecasters lead by 7.2 Brier Index points (78.6 vs. 71.4), and Good Judgment's figures put the best AI's Brier score roughly 50% above theirs (approximately 0.059 vs. 0.039) [1, 7, 8]. This is not a marginal gap; it represents a domain where AI systems have not approached parity.
Statistical significance of the overall gap. The ForecastBench working paper reports that superforecasters significantly outperform the top LLM on mean Brier score, with t-tests of the performance difference yielding p-values well below 0.001 on a subset of 200 questions [3, 4]. This is not a borderline result.
Expert forecasts of parity timing. The median superforecaster predicts AI will beat the ForecastBench benchmark by 2028; the median domain expert predicts 2030; the median public respondent predicts 2033 [1, 2]. The superforecaster and domain-expert estimates in particular, made by people familiar with both AI capabilities and the benchmark's demands, suggest that the forecasting community itself does not believe parity has been achieved.
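The size of this effect can be sketched with a one-line probability. If each leaderboard cycle has some small chance of showing a spurious crossing, the chance of at least one crossing grows quickly with the number of cycles. The 5% per-cycle figure and the independence between cycles are assumptions for illustration. Successive leaderboards share questions, so real cycles are correlated and the true inflation is smaller than this upper-bound style estimate.

```python
def prob_any_false_crossing(per_cycle_prob, n_cycles):
    """Probability of at least one spurious crossing in n independent cycles."""
    if not 0.0 <= per_cycle_prob <= 1.0:
        raise ValueError("per_cycle_prob must be between 0 and 1")
    return 1.0 - (1.0 - per_cycle_prob) ** n_cycles


# Hypothetical: 5% chance per cycle, biweekly updates for one year (26 cycles).
for cycles in (1, 6, 13, 26):
    print(cycles, round(prob_any_false_crossing(0.05, cycles), 3))
```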
Good Judgment's institutional position. Good Judgment's CEO Warren Hatch has stated that AI has not fully replicated the collective human "hive mind" and that the company's philosophy is "Superforecasters plus AI, not one or the other" [8]. This framing — from the organization that manages the superforecaster baseline — is a meaningful signal about the state of play.
The Honest Synthesis
The evidence supports a nuanced conclusion: AI has achieved subset parity on data-rich, shorter-horizon questions and has dramatically closed the gap overall, but comprehensive parity on the full ForecastBench benchmark has not been established as of late May 2026. The popular framing of "AI has matched superforecasters" is an overstatement that conflates performance on favorable question subsets with performance across the full distribution of forecasting tasks. The gap on the overall leaderboard (70.2 vs. 67.8–67.9 Brier Index) is small in absolute terms but statistically meaningful and consistent across multiple measurement approaches. The trajectory suggests full parity is plausible within 1–3 years, but it has not arrived yet.
4. Concrete Implications for Finance, Insurance, and Governance
Finance
The most concrete financial implication documented in the evidence base is the cost-disruption argument: if AI systems approach superforecaster-level accuracy, the cost of high-quality probabilistic forecasting collapses from the fees charged by elite human forecasting services to approximately the cost of an API subscription [10]. This threatens the business model of traditional geopolitical risk firms, which must justify premium fees against a subscription costing roughly $20 per month [10]. The democratization effect, which makes elite-level forecasting intelligence accessible to startups and SMEs that previously could not afford it, is cited as a significant structural shift [10].
In quantitative finance, more accurate probabilistic forecasting directly enables better risk pricing and allocation, improved capital deployment decisions, and more precise hedging strategies [1]. The AIA Forecaster paper offers a concrete data point: on the MarketLiquid benchmark, an ensemble of the AIA Forecaster and market prices achieved a Brier score of 0.092, outperforming market consensus alone (0.098). That is an accuracy result rather than a measured trading return, but it suggests that AI forecasting systems can add information even in liquid, information-rich markets [5, 6]. A small improvement in probabilistic accuracy compounds significantly in financial applications where decisions are made repeatedly at scale [1].
One industry survey cited by a single source reports that as of 2025, 96% of insurers surveyed were investing in data-driven forecasting and analytics technology [9]. While this figure cannot be independently verified from primary sources in the available registry, it is consistent with the broader pattern of institutional adoption.
Insurance
For insurance and reinsurance, the implications center on exposure modeling and catastrophe risk pricing. More accurate probabilistic forecasting of geopolitical events, climate-related risks, and economic disruptions directly affects underwriting decisions and reserve adequacy [1]. The domain-specific AI forecasting systems developed by Google DeepMind — including GraphCast and GenCast for weather and extreme event prediction — represent a separate but related track of AI forecasting capability that is already being integrated into catastrophe modeling workflows [1]. These weather-specific systems operate at a different level of maturity than general-purpose forecasting benchmarks; they are not the same as "green tree" but represent the broader DeepMind forecasting portfolio.
The insurance industry's concern is less about whether AI matches superforecasters on abstract benchmark questions and more about whether AI can improve the accuracy of specific, high-stakes probability estimates — flood probabilities, pandemic severity distributions, geopolitical disruption likelihoods — that feed directly into pricing models. The evidence suggests AI is increasingly useful for this purpose, particularly when combined with human judgment rather than deployed autonomously [7, 8].
Governance and Policy
For governance applications, the implications are primarily about decision support rather than automation. The evidence base consistently frames AI forecasting as a tool to augment human decision-making in policy contexts — improving the quality of scenario analysis, stress-testing policy assumptions, and providing calibrated probability estimates for legislative and regulatory decisions [10, 1]. The Forecasting Research Institute's LEAP (Longitudinal Expert AI Panel) Wave 5 report on security and geopolitics explicitly addresses how AI forecasting capabilities are reshaping the intelligence and policy analysis landscape [10].
The governance concern runs in both directions: AI forecasting tools can improve the quality of policy analysis, but they also create new risks if deployed without appropriate oversight. Institutions are advised to track calibration over time, combine human and AI forecasts rather than relying on either alone, and maintain strong governance frameworks for high-stakes deployment [1]. The risk of overconfidence in AI forecasting outputs — particularly given the benchmark limitations discussed above — is a concrete governance challenge that has not yet been systematically addressed in regulatory frameworks.
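Tracking calibration over time, as recommended above, usually means comparing stated probabilities with observed frequencies in bins. A minimal sketch follows; the bin count, function name, and sample forecasts are illustrative assumptions rather than any institution's actual procedure.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins.

    Returns rows of (bin_low, bin_high, count, mean_forecast, observed_frequency)
    for every non-empty bin. A well-calibrated forecaster has mean_forecast
    close to observed_frequency in each bin.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be the same length")
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, o))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_forecast = sum(p for p, _ in items) / len(items)
        observed = sum(o for _, o in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_forecast, observed))
    return rows


# Hypothetical resolved forecasts.
rows = calibration_table([0.1, 0.1, 0.9, 0.9, 0.5], [0, 0, 1, 1, 1])
for low, high, count, mean_forecast, observed in rows:
    print(f"[{low:.1f}, {high:.1f}) n={count} forecast={mean_forecast:.2f} observed={observed:.2f}")
```

Running the same table on each reporting period, for human and AI forecasts separately, gives a simple audit trail for the overconfidence risk described above.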
The cost-democratization effect also has governance implications: if high-quality forecasting becomes cheap and widely accessible, the information advantage currently held by well-resourced governments and large institutions over smaller actors narrows. This has implications for competitive intelligence, regulatory arbitrage, and the distribution of strategic foresight capabilities across the public and private sectors [10].
5. Current Honest Status: What We Actually Know as of Late May 2026
The evidence, taken as a whole, supports the following conclusions with varying degrees of confidence:
Well-established (high confidence, multi-source):
- ForecastBench is the leading public benchmark for AI vs. human forecasting, operated by FRI, using difficulty-adjusted Brier scores and a Brier Index [3, 4, 1, 2].
- As of May 23, 2026, superforecasters hold the #1 position on the ForecastBench tournament leaderboard with a Brier Index of 70.2 overall [1, 2].
- Google DeepMind's "green tree" is a real ForecastBench tournament submission, currently ranked #2 with a Brier Index of approximately 67.8–67.9 [1, 2].
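- The gap between AI and superforecasters has closed dramatically since 2023 (from difficulty-adjusted Brier scores of 0.131 for GPT-4 versus 0.081 for superforecasters, to a gap of about 2.3–2.4 Brier Index points overall in May 2026) but has not closed to zero [3, 4, 1].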
- On market questions, the gap remains substantial: superforecasters score 78.6 vs. Green Tree's 71.4 on the Brier Index [1].
Current evidence suggests (moderate confidence):
- The specific claim of parity "around March 15, 2026" traces to a single secondary source [9], cannot be verified from primary leaderboard data, and likely reflects either a transient leaderboard state on a specific question subset or a secondary-source overstatement [1, 2].
- Green Tree may have achieved transient or subset parity on data-rich dataset questions, where it currently scores 64.6–64.8 vs. the superforecaster median of 63.6 [1].
- Full parity on the overall ForecastBench benchmark is projected by FRI's March 2026 linear extrapolation for 2027 (point estimates of March 2027 on the Brier score and May 2027 on the Brier Index), with 95% confidence intervals running from early 2026 to mid-2028 [1, 2].
Preliminary findings indicate (lower confidence, limited sources):
- The AIA Forecaster (Alur et al., arXiv:2511.07678) achieved near-parity on the FB-7-21 slice (0.1125 vs. 0.1110 superforecaster median) in November 2025, representing the closest documented approach to parity on any well-designed evaluation [5, 6].
- Ensemble approaches combining AI forecasts with market prices or human judgment consistently outperform either component alone, suggesting the near-term optimum is human-AI collaboration rather than AI replacement [5, 6, 8].
The available sources do not support: any claim that comprehensive, durable, statistically robust parity between AI systems and elite human superforecasters has been achieved on the full ForecastBench benchmark or in real-world forecasting practice as of late May 2026. The honest status is: AI has reached parity on a favorable subset of questions, is closing the gap rapidly on the overall benchmark, but remains meaningfully behind on the hardest, most judgment-intensive forecasting tasks — and the benchmark itself may understate the true gap by using a frozen 2024 human baseline.
References
[1] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html
[2] ForecastBench. forecastbench.org. https://forecastbench.org
[3] ForecastBench: A Dynamic Benchmark of AI Forecasting .... arxiv.org. https://arxiv.org/pdf/2409.19839
[4] ForecastBench: A Dynamic .... faculty.wharton.upenn.edu. https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf
[5] AIA Forecaster: Technical Report. arxiv.org. https://arxiv.org/pdf/2511.07678
[6] AIA Forecaster: Technical Report (arXiv:2511.07678 abstract page). arxiv.org. https://arxiv.org/abs/2511.07678
[7] Human vs AI Forecasts: What Leaders Need to Know. goodjudgment.com. https://goodjudgment.com/human-vs-ai-forecasts
[8] What Superforecasters Actually Said About ForecastBench - Good Judgment. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench
[9] Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross. theinnermostloop.substack.com. https://theinnermostloop.substack.com/p/welcome-to-may-22-2026
[10] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5