Executive Summary
- "Green Tree" is real but mischaracterized: Google DeepMind's "green tree" is an anonymized leaderboard entry on the Forecasting Research Institute's ForecastBench — not a standalone branded system — and the March 15, 2026 parity date cited in popular coverage appears to conflate a dataset-question milestone with overall benchmark parity.
- Current ForecastBench standings (May 30, 2026): The Superforecaster Median holds an overall Brier Index of 70.2; green tree sits at 67.8 — a gap of 2.4 index points, placing AI firmly at #2 overall but not at parity.
- Parity is contested and overstated: On the overall leaderboard, superforecasters retain a statistically meaningful lead. On the "dataset questions" sub-category, AI has reached or exceeded human performance; on the harder "market questions," superforecasters are nearly 50% more accurate than the best AI entry.
- Methodological caveats are substantial: ForecastBench's human baseline is frozen at 2024, the benchmark allows multiple AI submission attempts, and it covers only binary yes/no questions. All three factors may flatter AI performance relative to real-world operational forecasting.
- Practical implications are directionally real but unquantified: Finance, insurance, and governance applications are actively discussed, but no peer-reviewed study has yet published granular deployment figures or economic impact estimates tied specifically to the ForecastBench parity milestone.
1. The "Green Tree" System: What It Is and What It Is Not
1.1 Identity of the System
The name "green tree" refers to an anonymized or codename leaderboard entry on ForecastBench, the Forecasting Research Institute's dynamic benchmark of AI forecasting capabilities [1, 2]. It is not a separately branded, publicly announced standalone forecasting product from Google DeepMind. The ForecastBench leaderboard lists the entry as "google deepmind, green tree," indicating organizational attribution alongside an internal codename [3, 4]. Other DeepMind entries on the same leaderboard — reportedly including one styled "yellow mouse" — suggest the naming convention is an anonymization scheme for tournament submissions rather than product branding [1].
This distinction matters for evaluating the popular coverage. A May 30, 2026 commentary piece described "GreenTree" as a purpose-built AI system that "can predict the future as well as the best humans on Earth," framing the March 15, 2026 date as the moment "AI hit parity with superforecasters for the first time." That framing is a secondary-source interpretation, not a statement from DeepMind or the Forecasting Research Institute, and it overstates what the leaderboard data actually shows.
1.2 The March 15, 2026 Date: What Happened
The March 15, 2026 date cited in popular coverage does not correspond to a sustained overall parity event on ForecastBench. The available evidence points to two more precise interpretations:
- Dataset-question leadership: On the sub-category of "dataset questions" (questions drawn from structured data sources rather than live prediction markets), green tree reached the top or near-top position around this period [1, 2]. AI systems have structural advantages on data-rich, short-horizon questions, and this sub-category leadership was real.
- A snapshot, not a sustained result: The overall ForecastBench leaderboard, updated through May 30, 2026, shows the Superforecaster Median at Brier Index 70.2 and green tree at 67.8 [1, 5]. One source reports that the parity claim aligns with a specific milestone or dataset-question performance around March 15, but explicitly notes that it does not reflect sustained overall parity [1].
A separate single-source report places the parity or top-spot event closer to May 22, 2026, not March 15 — suggesting the popular coverage may have backdated or conflated multiple leaderboard updates [6]. Given the provisional nature of this dating, both dates should be treated with caution.
1.3 The Benchmark Used
ForecastBench uses difficulty-adjusted Brier scores converted into a Brier Index for cross-question comparability [3, 5]. The Brier Index is defined as (1 − √Brier score) × 100%, producing a 0–100% scale where 50% represents an uninformative baseline (always predicting 50%) and 100% represents perfect foresight [1, 2]. Lower raw Brier scores are better; higher Brier Index values are better. The difficulty adjustment is critical: it normalizes scores across questions of varying inherent predictability, allowing meaningful comparison between AI systems and human forecasters who may have answered overlapping but non-identical question sets [3].
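The conversion described above can be sketched in a few lines. This is a minimal illustration of the stated formula only; it does not reproduce ForecastBench's difficulty adjustment, and the function names and sample outcomes are hypothetical.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_index(score):
    """Brier Index as defined in the text: (1 - sqrt(Brier score)) * 100 (higher is better)."""
    return (1.0 - math.sqrt(score)) * 100.0

# Made-up question resolutions, used only to exercise the two anchor points.
outcomes = [1, 0, 1, 1]
coin_flip = [0.5] * len(outcomes)        # always predicting 50%
perfect = [float(o) for o in outcomes]   # perfect foresight

print(brier_index(brier_score(coin_flip, outcomes)))  # 50.0
print(brier_index(brier_score(perfect, outcomes)))    # 100.0
```

The two printed values match the anchor points named in the text: an always-50% forecaster has a Brier score of 0.25, whose square root is 0.5, giving an index of 50.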
2. ForecastBench: Current Scores and Leaderboard Status
2.1 The Benchmark's Design
ForecastBench is a dynamic, contamination-free benchmark run by the Forecasting Research Institute, consisting of 1,000 probabilistic questions that are automatically generated and updated [1, 3]. Questions span geopolitics, economics, technology, and science, with resolution horizons ranging from weeks to over a year [3]. The benchmark was originally launched in September 2024 and received a major update in October 2025 [7, 2]. It is now open to public submissions [2].
Critically, ForecastBench currently includes only binary yes/no questions, excluding point predictions, multiple-choice outcomes, quantile predictions, and full probability distributions [2]. This scope limitation is relevant to any claim about "general" forecasting parity.
The human reference baseline is a frozen superforecaster cohort surveyed primarily in 2024 [5]. Superforecasters on ForecastBench are defined as individuals with an established track record on prior forecasting platforms such as Good Judgment Open and the Good Judgment Project [3]. The benchmark allows multiple AI submission attempts, which is an asymmetry relative to the human baseline [5].
2.2 Score Progression Over Time
The trajectory of AI performance on ForecastBench is well-documented across multiple sources:
| Date | Superforecaster score (Brier score unless noted) | Best AI score (Brier score unless noted) | Gap (as reported) |
|---|---|---|---|
| October 2024 | ~0.081 | ~0.101 (GPT-4.5) | ~19% |
| October 2025 | 0.081 | 0.101 (GPT-4.5) | ~20% |
| February 20, 2026 | ~0.086 | 0.102 (CassiAI ensemble_2_crowdadj) | ~2.7 index pts |
| March 2026 | 0.086 | 0.103 (CassiAI / Grok 4.20 Preview) | ~2.7 index pts |
| May 30, 2026 | Brier Index 70.2 | Brier Index 67.8 (green tree) | 2.4 index pts |
Sources: [3, 8, 2, 5, 1]
In October 2024, the top-performing LLMs reportedly lagged superforecasters by approximately 19% [1, 2]. That figure is close to the relative difference between the raw Brier scores in the table ((0.101 − 0.081) / 0.101 ≈ 20%) and should not be read as a difference in Brier Index points. By March 2026, superforecasters led with a Brier Index of 70.6% versus 67.9% for the best LLMs, a gap of 2.7 points [8]. As of the May 30, 2026 leaderboard update, the gap had narrowed marginally to 2.4 points (70.2 vs. 67.8) [1, 5].
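To put the table's rows on one scale, the raw Brier scores can be converted with the Section 1.3 formula. The sketch below assumes the reported scores are directly comparable inputs to that formula; the dictionary and function names are illustrative, and the rounded inputs mean the results are approximate.

```python
import math

def brier_index(score):
    """(1 - sqrt(Brier score)) * 100, per the definition in Section 1.3."""
    return (1.0 - math.sqrt(score)) * 100.0

def index_gap(human_score, ai_score):
    """Gap in Brier Index points; positive means the human baseline leads."""
    return brier_index(human_score) - brier_index(ai_score)

# (superforecaster Brier score, best AI Brier score) as listed in the table.
reported = {
    "October 2025": (0.081, 0.101),
    "February 20, 2026": (0.086, 0.102),
    "March 2026": (0.086, 0.103),
}

for label, (human, ai) in reported.items():
    print(f"{label}: {brier_index(human):.1f} vs {brier_index(ai):.1f}, "
          f"gap {index_gap(human, ai):.1f} index points")
```

On these inputs the 0.081 vs. 0.101 row works out to a gap of about 3.3 index points, while the March 2026 row lands near the 2.7-point gap cited in the text. The percentage figures in the table's earlier rows are therefore not on the same scale as the later index-point figures.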
2.3 Current Leaderboard (May 30, 2026)
The ForecastBench overall Brier Index leaderboard as of May 30, 2026 [1, 5]:
| Rank | Entry | Overall Brier Index |
|---|---|---|
| 1 | Superforecaster Median Forecast | 70.2 |
| 2 | Google DeepMind, green tree | 67.8 |
| 3 | xAI Grok 4.20 (Preview) | ~67.4 |
| 4 | Various ensembles | ~67.3 |
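On the dataset questions sub-category, green tree ranks #1, leading or matching superforecasters [1]. On the market questions sub-category, the gap is dramatically larger: superforecasters score approximately 0.39–0.40 while the nearest AI entry scores approximately 0.59 [3, 5]. These figures only make sense as lower-is-better error scores (on the Brier Index, where higher is better, they would favor the AI entry). On that reading, the AI entry's error is nearly 50% higher than the superforecasters', which is the basis for describing superforecasters as nearly 50% more accurate on this harder sub-category.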
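The "nearly 50%" figure appears to follow from treating the two market-question scores as lower-is-better error scores and taking the relative difference against the superforecaster score. A short check under that assumption (the function name is hypothetical):

```python
def relative_excess_error(human_score, ai_score):
    """How much higher the AI error is, as a fraction of the human error."""
    return (ai_score - human_score) / human_score

# The text gives a 0.39-0.40 range for superforecasters and about 0.59 for the best AI entry.
for human in (0.39, 0.40):
    excess = relative_excess_error(human, 0.59)
    print(f"human={human:.2f}, ai=0.59 -> AI error is {excess:.1%} higher")
```

Measured against the AI score instead, the same two numbers give a smaller percentage, so the choice of denominator should be stated whenever the figure is quoted.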
As a point of historical context, in October 2024 the median public forecast ranked #2 overall on ForecastBench, behind superforecasters and ahead of all LLMs. By the May 2026 update, the median public forecast had fallen to approximately #22, displaced by the rapid improvement of frontier AI systems [1].
2.4 February 2026 Snapshot: CassiAI and Grok
As of February 20, 2026, the leading AI entries were CassiAI's ensemble_2_crowdadj (Brier score 0.102) and xAI's Grok 4.20 (Preview), both tied for first among AI systems at joint-second overall — behind the superforecaster median but ahead of all other AI entries [3, 8]. By March 2026, these two entries remained at #2 with a Brier score of 0.103, while superforecasters held 0.086 [8]. Green tree's ascent to the #2 position (displacing CassiAI and Grok) appears to have occurred between March and May 2026.
3. Is Parity Established, Contested, or Overstated?
3.1 The Case That Parity Has Been Reached
The strongest evidence for parity rests on four pillars:
Sub-category leadership. On dataset questions — the data-rich, shorter-horizon subset of ForecastBench — green tree ranks #1, meaning AI has demonstrably surpassed the superforecaster median on this category [1]. This is not trivial: dataset questions constitute a meaningful share of the benchmark and represent the type of forecasting most directly applicable to structured analytical tasks.
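Rapid gap closure. The reported gap between frontier AI and superforecasters fell from approximately 19% in October 2024 to 2.4 Brier Index points by May 2026 [1, 2, 5]. Taken at face value, that is a roughly 87% reduction over 19 months, but the comparison holds only if both figures are on the same scale. Applying the Section 1.3 formula to the earlier Brier scores listed in Section 2.2 (0.081 vs. 0.101) gives a starting gap of roughly 3.3 index points, which would imply a much smaller reduction. The direction of the trend is nonetheless clear.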
Crowd surpassing. AI systems have clearly surpassed the median human crowd forecaster, which itself is a meaningful threshold [1, 3]. By late 2025, large language models had already outperformed the median human forecaster, with only the top ~1% of human forecasters maintaining a consistent edge [3].
Near-statistical indistinguishability on some domains. On ForecastBench's live questions, one analysis reports that AI ensemble difficulty-adjusted Brier scores are statistically indistinguishable from the superforecaster team's on many individual domains [1].
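One common way to test whether two forecasters are statistically distinguishable on a shared question set is a paired bootstrap over per-question score differences. The sketch below illustrates that general technique; it is not the procedure used by ForecastBench or by the analysis cited above, and the function name, defaults, and sample scores are assumptions.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-question difference (a - b) in Brier scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need equally sized, non-empty score lists")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up per-question Brier scores for illustration only.
ai_scores = [0.10, 0.30, 0.05, 0.20, 0.15, 0.25]
human_scores = [0.08, 0.28, 0.07, 0.15, 0.16, 0.20]
low, high = paired_bootstrap_ci(ai_scores, human_scores)
print(f"95% CI for mean difference: [{low:.3f}, {high:.3f}]")
```

If the interval contains zero, the two sets of scores are not distinguishable at the chosen level on that question set. That is a weaker statement than saying the forecasters are equally accurate, especially when a domain has few resolved questions.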
3.2 The Case That Parity Has Not Been Reached
The evidence against a parity claim is, on balance, more robust for the overall benchmark:
The leaderboard is unambiguous. As of May 30, 2026, superforecasters hold Brier Index 70.2 versus green tree's 67.8 — a 2.4-point gap that is consistent across multiple leaderboard snapshots and multiple providers' readings of the data [1, 5]. This is not a rounding error.
Market questions reveal a large structural gap. On the harder "market questions" sub-category, superforecasters score approximately 0.39–0.40 while the best AI scores approximately 0.59 — a nearly 50% accuracy advantage for humans [3, 5]. These questions are arguably the most economically relevant, involving live prediction markets where information asymmetry and judgment under genuine uncertainty matter most.
Benchmark design favors AI. Good Judgment Inc. has published substantive critiques of ForecastBench as a measure of real-world forecasting capability [9]. Key limitations include: the human baseline is frozen at 2024 (meaning superforecasters cannot improve their scores as AI systems continue to submit); AI systems are permitted multiple submission attempts while humans are not; the benchmark covers only binary questions, excluding the full range of forecasting tasks; and it does not capture teaming, aggregation, or upstream question formulation — all areas where human superforecasters excel [5, 9]. Good Judgment's Dr. Warren Hatch has stated that "when the data is sparse and the environment is in flux, machines are backward looking by definition" [9].
Temporal leakage and benchmark validity concerns. A 2026 academic study flagged "pitfalls" in assessing LLM forecasters, specifically warning that temporal leakage — where AI systems inadvertently use information from after a question's reference date due to training data or tool access — can inflate apparent AI performance [3]. The researchers explicitly caution against prematurely concluding that LLMs match or exceed human performance on forecasting tasks.
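For tool-using forecasters, one basic guard against temporal leakage is to discard any retrieved document dated after the question's reference date. The sketch below shows that idea only. The `Document` type and `filter_leakage` function are hypothetical, and the filter does nothing about leakage through training data, which cannot be fixed at retrieval time.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Document:
    """Hypothetical retrieved source with a known publication date."""
    title: str
    published: date

def filter_leakage(docs, reference_date):
    """Keep only documents published on or before the question's reference date."""
    return [doc for doc in docs if doc.published <= reference_date]

# Made-up retrieval results for a question with a reference date of 2026-03-01.
retrieved = [
    Document("Background briefing", date(2026, 2, 10)),
    Document("Same-day report", date(2026, 3, 1)),
    Document("Post-resolution recap", date(2026, 4, 2)),
]
allowed = filter_leakage(retrieved, date(2026, 3, 1))
print([doc.title for doc in allowed])
```

A filter like this depends on trustworthy publication dates. Pages that are silently updated after their listed date can still leak post-question information.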
Superforecasters' own assessment. Superforecasters surveyed through the LEAP (Longitudinal Expert AI Panel) program have a median forecast of 2028 for AI eventually surpassing their benchmark [9]. ForecastBench's own linear trend projection places overall LLM-superforecaster parity around August 2027, with a 95% confidence interval spanning March 2026 to August 2028 [1]. The lower bound of that interval includes the March 2026 date cited in popular coverage, but the central estimate falls 17 months later than that date.
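A linear trend projection of this kind can be sketched as a least-squares line through (date, gap) observations, extrapolated to the date where the fitted gap reaches zero. The example below is a generic illustration on synthetic data, not ForecastBench's actual method or inputs, and it does not compute a confidence interval. The function name `project_parity_date` is hypothetical.

```python
from datetime import date, timedelta

def project_parity_date(observations):
    """Fit gap = intercept + slope * days by least squares; return the zero-crossing date.

    `observations` is a list of (date, gap) pairs. Returns None if the fitted
    gap is not shrinking, since no parity date can then be extrapolated.
    """
    origin = observations[0][0]
    xs = [(d - origin).days for d, _ in observations]
    ys = [gap for _, gap in observations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    days_to_zero = -intercept / slope
    return origin + timedelta(days=round(days_to_zero))

# Synthetic data: the gap falls by exactly 1 point every 100 days.
synthetic = [
    (date(2025, 1, 1), 5.0),
    (date(2025, 4, 11), 4.0),
    (date(2025, 7, 20), 3.0),
]
print(project_parity_date(synthetic))  # 2026-05-16
```

With only a handful of leaderboard snapshots, the extrapolated date is very sensitive to each point, which is consistent with the wide interval quoted in the text.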
Longer-horizon questions remain unresolved. ForecastBench questions resolved to date skew toward short-horizon, data-rich topics where AI has structural advantages. Many longer-range, judgment-heavy questions are still pending resolution, and the final scores on those questions may widen the gap [9].
3.3 Verdict: Contested and Partially Overstated
The honest status as of May 31, 2026 is this: AI has reached parity with superforecasters on a specific sub-category of ForecastBench (dataset questions) and has closed the overall gap dramatically, but has not achieved sustained overall parity on the benchmark's full leaderboard. The 2.4 Brier Index point gap is small but consistent. The popular framing of "March 15, 2026 parity" conflates sub-category leadership with overall benchmark parity, and the secondary-source commentary that generated this claim is not corroborated by the ForecastBench leaderboard data itself.
Community forecasts on Metaculus place the median expected date of LLMs beating superforecasters on ForecastBench at November 2026 [10], while the linear trend projection discussed in Section 3.2 suggests parity around August 2027 [1]. Together, these two estimates give a sense of the range of uncertainty.
4. Implications for Finance, Insurance, and Governance
4.1 What the Sources Actually Say
The concrete implications cited in the literature for finance, insurance, and governance are directionally consistent but largely qualitative. No peer-reviewed study in the available source set has published granular deployment figures or quantified economic impacts tied specifically to the ForecastBench parity milestone. The following represents the state of evidence as of late May 2026.
Finance. Academic work on AI in financial forecasting documents that large language models can process earnings calls, regulatory filings, and macroeconomic indicators to generate probabilistic forecasts of market-relevant events [11]. The Federal Reserve Bank of San Francisco has explored AI simulation of the Survey of Professional Forecasters, finding that LLM-generated forecasts can replicate the distributional properties of professional economist predictions on standard macroeconomic variables [12]. The implication is that AI could reduce the cost of maintaining large panels of professional forecasters for scenario analysis and stress testing. However, the market-question gap on ForecastBench — where superforecasters retain a ~50% accuracy advantage — is a direct caution against deploying current AI systems as drop-in replacements for human judgment on live financial prediction tasks [5].
Insurance. A 2024 survey found that 73% of insurers view AI models as key to managing climate risks [13]. Catastrophe modeling firms have begun integrating AI forecasting into their risk assessment pipelines, particularly for extreme weather events where DeepMind's WeatherNext 2 and GenCast models have demonstrated performance exceeding traditional physics-based numerical weather prediction ensembles on 0- to 15-day horizons [3]. The insurance application is most mature in this domain: probabilistic AI forecasts of hurricane tracks, flood extents, and wildfire spread are being incorporated into real-time pricing and reinsurance treaty negotiations. The connection to superforecaster-style general-purpose forecasting is more speculative — insurers are primarily deploying domain-specific models, not general LLM forecasters.
Governance. The governance implications are the least concretely documented. Commentary sources note that scalable, high-accuracy prediction of geopolitical and policy events could transform intelligence analysis, regulatory impact assessment, and legislative scenario planning [1]. The LEAP program's Wave 5 report on security and geopolitics documents expert AI panel assessments of geopolitical risks, suggesting an emerging institutional infrastructure for AI-assisted policy forecasting [14]. One source notes that municipal governments have begun considering AI oversight frameworks — the New York City Council reportedly considered a bill to create a municipal AI oversight office — though this is a governance-of-AI question rather than AI-for-governance [1].
4.2 Human-AI Collaboration as the Dominant Near-Term Model
Across all three domains, the most consistently cited implication is not AI replacement of human forecasters but human-AI collaboration. A peer-reviewed study on AI-augmented predictions found that LLM assistants improve human forecasting accuracy when used as decision-support tools rather than autonomous replacements [15]. Combining AI's ability to process large structured datasets rapidly with human judgment on novel, sparse-data situations appears to outperform either alone. That finding is consistent with the persistent superforecaster advantage on market questions, which are precisely the questions where human contextual judgment matters most [15, 9].
5. Broader Context: What ForecastBench Does and Does Not Measure
ForecastBench is the most rigorous publicly available benchmark for general-purpose AI forecasting, but it is a proxy for operational forecasting capability, not a full substitute [1, 3]. The benchmark's limitations are worth stating precisely:
- Binary questions only: Real-world forecasting requires point estimates, probability distributions over continuous outcomes, and multi-outcome scenarios. ForecastBench's binary format systematically advantages AI systems that are well-calibrated on yes/no questions but may be poorly calibrated on the full distributional task [2].
- Frozen human baseline: The superforecaster cohort was surveyed primarily in 2024. As AI systems continue to submit updated entries, the comparison is increasingly between current AI and 2024-vintage human performance [5].
- No teaming or aggregation: Operational superforecasting at Good Judgment Inc. involves team deliberation, structured aggregation, and iterative updating — none of which is captured in the individual-forecast baseline used by ForecastBench [9].
- Short-horizon bias in resolved questions: Questions that have resolved to date skew toward shorter horizons where AI has demonstrated the clearest advantages. The benchmark's final verdict on longer-horizon questions remains open [9].
- Temporal leakage risk: AI systems with access to web search or recent training data may inadvertently incorporate post-question information, inflating apparent accuracy [3].
These limitations do not invalidate ForecastBench as a benchmark — it remains the best available tool for tracking AI forecasting progress — but they mean that a 2.4-point Brier Index gap on this benchmark should not be interpreted as a 2.4-point gap in real-world operational forecasting capability. The actual gap in deployment contexts is likely larger.
6. Current Honest Status
As of May 31, 2026, the evidence supports the following precise characterization:
What is established: AI forecasting systems, specifically Google DeepMind's "green tree" entry and several competing frontier models, have closed most of the performance gap that separated them from elite human superforecasters in October 2024 (approximately 87% on the as-reported figures in Section 3.1). On the dataset-question sub-category of ForecastBench, AI has reached or exceeded superforecaster-level performance. AI clearly surpasses the median human crowd forecaster across the full benchmark.
What is not established: Overall parity on ForecastBench has not been achieved. The Superforecaster Median holds a Brier Index of 70.2 versus green tree's 67.8 as of May 30, 2026 — a 2.4-point gap that is small but consistent and reproducible across multiple leaderboard snapshots. On market questions, the gap is substantially larger (~50% accuracy advantage for superforecasters). The March 15, 2026 "parity" date cited in popular coverage is not corroborated by the overall leaderboard and appears to describe a sub-category milestone or a single snapshot that did not persist.
What is contested: Whether the measured gap accurately reflects the real-world gap. ForecastBench's frozen human baseline, binary-question format, and multiple-attempt allowance for AI all introduce asymmetries that, as discussed in Sections 3.2 and 5, may flatter AI performance and so understate the remaining human advantage. Temporal leakage and short-horizon bias in resolved questions point in the same direction. What remains genuinely uncertain is the size of these effects, and therefore whether the measured 2.4-point gap is close to or well short of the operational gap.
Central estimate for overall parity: ForecastBench's own linear trend projects overall parity around August 2027 (95% CI: March 2026–August 2028) [1]. The Metaculus community forecast places the median at November 2026 [10]. Superforecasters themselves estimate 2028 [9]. The honest answer is that overall parity on a rigorous general-purpose forecasting benchmark is likely 12–24 months away, with sub-category parity already achieved in data-rich domains.
References
[1] ForecastBench. forecastbench.org. https://forecastbench.org
[2] "ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities." https://arxiv.org/html/2409.19839v4
[3] "ForecastBench: A Dynamic Benchmark of AI ...." https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf
[4] System and method for enhanced collaborative forecasting. image-ppubs.uspto.gov. https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/11941239
[5] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html
[6] Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross. theinnermostloop.substack.com. https://theinnermostloop.substack.com/p/welcome-to-may-22-2026
[7] "ForecastBench: A Dynamic Benchmark of AI Forecasting ...." https://arxiv.org/pdf/2409.19839
[8] Making Forecasting Scores Easier to Interpret: Introducing the Brier Index. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/introducing-the-brier-index
[9] What Superforecasters Actually Said About ForecastBench - Good Judgment. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench
[10] When will LLMs beat superforecasters at ForecastBench?. metaculus.com. https://metaculus.com/questions/40290/when-will-llms-beat-superforecasters-at-forecastbench
[11] "The Role of AI in Financial Forecasting: ChatGPT's Potential and Challenges." https://arxiv.org/pdf/2411.13562
[12] Simulating the Survey of Professional Forecasters - San Francisco Fed. frbsf.org. https://frbsf.org/research-and-insights/publications/system-research-richmond-fed/2025/01/simulating-the-survey-of-professional-forecasters
[13] 73% of insurers see AI models as key to managing climate risks. fintech.global. https://fintech.global/2024/11/08/73-of-insurers-see-ai-models-as-key-to-managing-climate-risks
[14] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5
[15] "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy." https://arxiv.org/pdf/2402.07862