Executive Summary
- "Green Tree" is real but mischaracterized: Google DeepMind does operate a submission labeled "green tree" on the ForecastBench tournament leaderboard, but the specific claim that it "hit parity with top human superforecasters on March 15, 2026" originates from a single aggregator newsletter, not from any DeepMind publication, peer-reviewed paper, or Forecasting Research Institute announcement — and the current leaderboard data refutes the parity claim outright.
- Current scores, as of late May 2026: On ForecastBench, the superforecaster median holds a Brier Index of 70.2 (≈ Brier score 0.089); DeepMind "green tree," the best-ranked AI system, scores 67.9 (≈ Brier score 0.103) — a gap of 2.3 Brier Index points that is meaningful and unresolved [1, 2, 3].
- Partial parity is real, full parity is not: AI has surpassed superforecasters on the "dataset questions" subset of ForecastBench, but trails significantly on "market questions," and superforecasters retain the overall #1 position on both the tournament and baseline leaderboards as of May 30, 2026 [2, 3].
- The most rigorous parity claim comes from Bridgewater's AIA Labs technical report (arXiv:2511.07678), which documents statistical indistinguishability from superforecasters on a specific question set — but this is a single-lab result on a constrained benchmark, not a general verdict [4].
- Practical implications are real but premature: Finance, insurance, and governance applications are actively incorporating AI forecasting tools, but practitioners and the Forecasting Research Institute explicitly caution that benchmark performance does not yet translate to the complex, continuously updated, judgment-heavy questions that institutional decision-makers actually face [5, 6, 7].
1. The "Green Tree" Claim: System, Date, and Benchmark
What the evidence actually shows
Multiple sources confirm that a Google DeepMind submission labeled "green tree" exists on the ForecastBench tournament leaderboard and is the highest-ranked AI system as of late May 2026 [2, 3]. On the tournament leaderboard updated May 23–30, 2026, "green tree" achieves an overall Brier Index of 67.8–67.9, ranking second overall behind the superforecaster median aggregate [2, 3].
The claim that "green tree" hit parity with superforecasters on or around March 15, 2026 is traceable to a single source: a personal tech aggregator newsletter dated May 22, 2026, which states that "DeepMind built an AI system called GreenTree that can predict the future as well as the best humans on Earth" and that "on March 15th, AI hit parity with superforecasters for the first time" [3]. This newsletter is not a DeepMind publication, not a Forecasting Research Institute announcement, and not a peer-reviewed paper. It is an aggregated commentary piece.
Searches of DeepMind's blog and publications pages for the relevant period surface weather forecasting research — WeatherNext 2, GenCast, GraphCast — but no announcement of a system named "Green Tree" achieving superforecaster parity in geopolitical or general-purpose event forecasting [8, 9, 10]. WeatherNext 2, DeepMind's most prominent recent forecasting release, is a meteorological model evaluated on standard atmospheric verification scores, not Brier scores on geopolitical questions — an entirely different domain [8].
What March 15, 2026 likely refers to
The most plausible reconstruction is that around mid-March 2026, the "green tree" submission first appeared on or near the superforecaster threshold on a subset of ForecastBench questions — specifically the "dataset questions" subset, where AI has since established a lead — and this was interpreted by secondary commentators as full parity. The Forecasting Research Institute's own analyses, published through its leaderboard and associated posts, do not characterize this as parity on the full benchmark [2, 3].
One secondary analysis describes March 15, 2026 as "the first documented instance of an LLM matching superforecaster performance on the benchmark's questions." That characterization is itself drawn from secondary sources rather than primary FRI documentation, so it should be treated as a plausible interpretation rather than an established fact.
Verdict on the "Green Tree" claim
| Claim Element | Status |
|---|---|
| System named "green tree" exists | Confirmed — ForecastBench leaderboard [2, 3] |
| Operated by Google DeepMind | Confirmed — listed as DeepMind submission [2] |
| Parity achieved on March 15, 2026 | Unverified — single newsletter source only [3] |
| Parity on full ForecastBench benchmark | Refuted — gap of 2.3 Brier Index points as of May 2026 [2] |
| Benchmark used | ForecastBench (Brier Index metric) [1, 2] |
| DeepMind official announcement | Not found [9, 10] |
2. ForecastBench: Current Scores and the Human–AI Gap
The benchmark defined
ForecastBench is a dynamic, contamination-free forecasting benchmark developed by the Forecasting Research Institute and documented in the working paper "ForecastBench: A Dynamic Benchmark of AI Forecasting Systems," circulated as a Wharton faculty working paper [1]. It evaluates probabilistic predictions on real-world events drawn from prediction markets, time series, and other sources, comparing large language models against two human baselines: the superforecaster median (top-tier forecasters from the Good Judgment Project tradition) and the public median [11, 1].
The benchmark splits questions into two categories:
- Dataset questions: data-rich, shorter-horizon questions where AI has structural advantages
- Market questions: questions drawn from prediction markets, which tend to be harder, more judgment-intensive, and where human advantages are larger [2, 3]
In March 2026, FRI introduced the Brier Index as its primary leaderboard metric, defined as (1 − √(Brier score)) × 100. This yields a 0–100 scale on which 100 represents perfect foresight (Brier score 0) and 50 corresponds to always predicting 50% (Brier score 0.25) [12]. Lower raw Brier scores are better; higher Brier Index values are better. The metric change improved interpretability but introduced some discontinuity in historical comparisons.
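As a minimal sketch of the conversion described above, assuming the formula exactly as stated in this section, the two metrics can be computed and inverted as follows. The function names are illustrative and are not taken from FRI's codebase.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_index(score):
    """Brier Index = (1 - sqrt(Brier score)) * 100, per the definition above."""
    return (1.0 - math.sqrt(score)) * 100.0


def brier_score_from_index(index):
    """Inverse of brier_index: recover the raw Brier score."""
    return (1.0 - index / 100.0) ** 2


# Always answering 50% gives a Brier score of 0.25 and an index of 50.
always_half = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])
print(always_half, brier_index(always_half))

# Convert the two headline leaderboard figures back to raw Brier scores.
for idx in (70.2, 67.9):
    print(idx, round(brier_score_from_index(idx), 3))
```

Because the index uses a square root, equal gaps in Brier Index points do not correspond to equal gaps in raw Brier score across the scale, which matters when comparing subsets with very different baseline accuracy.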
Current leaderboard figures (as of May 23–30, 2026)
The ForecastBench tournament leaderboard, updated nightly, shows the following standings [1, 2, 3]:
| Entrant | Brier Index (Overall) | Approx. Brier Score | Rank |
|---|---|---|---|
| Superforecaster median | 70.2 | ≈ 0.089 | #1 |
| DeepMind "green tree" | 67.8–67.9 | ≈ 0.103 | #2 |
| DeepMind "yellow mouse" | 67.6 | ≈ 0.105 | #3 |
| xAI Grok 4.20 Preview | ≈ 67.4–67.6 | ≈ 0.105–0.106 | #3–4 |
| Public median | 64.5 | ≈ 0.126 | not stated |
| Claude Sonnet 4.5 (zero-shot) | 63.3 | ≈ 0.135 | not stated |
| OpenAI o3 (scratchpad) | 63.2 | ≈ 0.135 | not stated |
Sources: [1, 2, 3]
On the baseline leaderboard (no additional tools), superforecasters lead at approximately 69.9–70.6 Brier Index, with top LLMs remaining below this threshold [2].
Historical trajectory
The ForecastBench working paper's initial 200-question evaluation established the foundational baselines: superforecasters at a mean Brier score of 0.096, the general public at 0.121, and the top LLM at the time (Claude 3.5 Sonnet) at 0.122 — with superforecasters outperforming both at p < 0.001 [1]. By October 2025, the best AI model (GPT-4.5, released February 2025) had improved to 0.101, while the superforecaster benchmark stood at 0.081 — a roughly 20% human edge, itself down from approximately a 50% gap two years prior [1, 5].
As of January 29, 2026, the FRI leaderboard showed superforecasters still ranked #1, with two external AI submissions (xAI's Grok 4.20 Preview and Cassi's ensemble_2_crowdadj) tied at #2, trailing by 0.017 Brier points — a gap FRI analysts characterized as representing approximately one year of LLM progress at then-current rates [2].
The dataset vs. market question split
This distinction is critical for interpreting parity claims. On dataset questions, "green tree" has achieved a Brier Index sufficient to rank #1, surpassing the superforecaster median on that subset [2, 3]. This is a genuine and significant result.
On market questions, the gap is substantially larger. Good Judgment's analysis of this subset reports superforecasters achieving a Brier Index of 80.3 (Brier score ≈ 0.039) versus the nearest AI entrant at 75.8 (Brier score ≈ 0.059) — an AI error rate roughly 50% larger than the superforecasters' [5, 13]. Good Judgment explicitly identifies market questions as the subset most directly analogous to the work its institutional clients care about [5, 13].
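The "roughly 50% larger" figure follows from converting the two Brier Index values back to raw Brier scores and taking their ratio. The sketch below assumes the (1 − √Brier score) × 100 definition attributed to FRI [12]; the function names are illustrative.

```python
def brier_score_from_index(index):
    """Invert Brier Index = (1 - sqrt(score)) * 100 to get the raw score."""
    return (1.0 - index / 100.0) ** 2


def relative_excess_error(ai_index, human_index):
    """How much larger the AI's raw Brier score is, as a fraction of the human's."""
    ai_score = brier_score_from_index(ai_index)
    human_score = brier_score_from_index(human_index)
    return ai_score / human_score - 1.0


# Market-question figures quoted in the paragraph above.
excess = relative_excess_error(ai_index=75.8, human_index=80.3)
print(f"AI Brier score is about {excess:.0%} larger than the superforecasters'")
```

The same 4.5-point Brier Index gap would imply a smaller relative difference near the overall leaderboard's range around 70, so relative error comparisons should always be made on raw scores within the same subset.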
One research prototype system labeled "BLF" achieves a state-of-the-art overall Brier Index of 69.4 on ForecastBench — 1.2 points below the 2024 superforecaster baseline of 70.6 — and notably reaches a Brier Index of 85.2 on market questions, which would represent a significant advance if confirmed [2]. However, BLF is described as a research prototype not yet standardized or widely deployed, and this figure should be treated as preliminary [2].
Projected parity timelines
FRI's linear trend projections estimate full LLM–superforecaster parity on the overall benchmark around August 2027, with a 95% confidence interval spanning March 2026 to August 2028 [2, 3]. More granular projections suggest:
- Dataset question parity: projected for approximately June 2026 [3]
- Market question parity: projected for approximately August 2026 [3]
- Overall benchmark parity: August 2027 (central estimate) [2]
If progress continued at the late-2025 rate, FRI analysts noted parity could arrive as early as November 2026 [5]. These projections carry substantial uncertainty; preliminary leads on ForecastBench have reversed as more questions resolved, and results fluctuate with sample composition [1].
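To make the idea of a linear trend projection concrete, the sketch below fits a straight line to the human–AI Brier score gap over time and reports when the line reaches zero. It is not FRI's method or data pipeline. The two inputs are gap figures quoted elsewhere in this report (0.101 versus 0.081 in October 2025, and 0.017 on January 29, 2026); dating the first point to October 1 is an assumption, and the two figures come from different snapshots that may not be strictly comparable.

```python
from datetime import date, timedelta


def linear_parity_date(observations):
    """Least-squares line through (date, gap) points.

    Returns the date at which the fitted gap reaches zero, or None if the
    fitted gap is not shrinking.
    """
    origin = observations[0][0]
    xs = [(d - origin).days for d, _ in observations]
    ys = [gap for _, gap in observations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need observations on at least two distinct dates")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    if slope >= 0:
        return None  # gap is flat or widening, so no projected crossing
    intercept = mean_y - slope * mean_x
    days_to_zero = -intercept / slope
    return origin + timedelta(days=round(days_to_zero))


gaps = [
    (date(2025, 10, 1), 0.101 - 0.081),  # day of month is an assumption
    (date(2026, 1, 29), 0.017),
]
print(linear_parity_date(gaps))
```

With these two inputs the fitted line reaches zero in late 2027, inside the interval quoted above. A two-point fit carries no uncertainty estimate of its own, which is one reason the published projections come with such wide intervals.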
3. The AIA Forecaster: The Strongest Published Parity Claim
The most rigorous published claim of AI–superforecaster parity comes not from DeepMind but from Bridgewater's AIA Labs, documented in the technical report "AIA Forecaster: Technical Report" (arXiv:2511.07678) [4, 14].
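On the FB-7-21 question set within ForecastBench, the AIA Forecaster achieves a Brier score of 0.1125, compared to the superforecaster median of 0.1110. The difference of 0.0015 falls within the statistical uncertainty intervals, making the two statistically indistinguishable [4]. For context, the comparison figures cited here are 0.1096 for OpenAI's o3 on the same set and around 0.107 for previous LLM forecasters [4]; because lower Brier scores are better, those values as quoted would place both ahead of the AIA Forecaster, so they should be checked against the report's tables before being relied on.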
The AIA Forecaster is an LLM-based system for judgmental forecasting using unstructured data [4]. Its performance represents a genuine advance: it is clearly better than both the general crowd and prior LLM systems on this benchmark, and its near-equivalence to the superforecaster median on FB-7-21 is the strongest quantitative evidence for parity currently in the literature [4].
However, several important caveats apply:
- FB-7-21 is a specific question subset, not the full ForecastBench benchmark
- This is a single-lab result from an organization (Bridgewater) with commercial interests in demonstrating AI forecasting capability
- The result has not been independently replicated on the full benchmark
- The AIA Forecaster's performance on market questions specifically — the hardest and most practically relevant subset — is not separately reported in the available documentation
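A claim of statistical indistinguishability of this kind typically rests on an interval for the mean per-question score difference. The sketch below shows one common way to obtain such an interval, a paired percentile bootstrap over questions. It is an illustration only: the technical report's actual procedure is not reproduced here, and the per-question scores are synthetic, not the FB-7-21 data.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-question score difference."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-question Brier scores for two hypothetical forecasters.
# These are NOT the FB-7-21 results; they only exercise the procedure.
data_rng = random.Random(42)
system_scores, human_scores = [], []
for _ in range(400):
    true_p = data_rng.random()
    outcome = 1 if data_rng.random() < true_p else 0
    system_p = min(1.0, max(0.0, true_p + data_rng.gauss(0, 0.10)))
    human_p = min(1.0, max(0.0, true_p + data_rng.gauss(0, 0.10)))
    system_scores.append((system_p - outcome) ** 2)
    human_scores.append((human_p - outcome) ** 2)

lower, upper = paired_bootstrap_ci(system_scores, human_scores)
print(f"95% interval for mean difference: [{lower:.4f}, {upper:.4f}]")
print("interval contains zero:", lower <= 0.0 <= upper)
```

An interval that contains zero means the data cannot separate the two forecasters at that confidence level. It does not show that they are equally accurate, which is why replication on a larger question set matters.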
4. Is "Parity" Established, Contested, or Overstated?
The honest status
The evidence supports a carefully differentiated verdict rather than a binary yes/no:
Where AI has reached or approached parity:
- On ForecastBench's dataset questions subset, "green tree" has achieved scores at or above the superforecaster median [2, 3]
- On the FB-7-21 question set, the AIA Forecaster is statistically indistinguishable from the superforecaster median [4]
- AI systems have surpassed the general public and median human forecaster across most ForecastBench configurations [1, 2]
- In specialized, data-rich domains (weather, structured time series), AI has demonstrably surpassed human baselines [8]
- One report notes that an AI from startup Mantic placed 4th out of 500+ participants in the Fall 2025 Metaculus Cup, beating the weighted average of all human-forecaster predictions in that contest [5]
Where AI has not reached parity:
- On the full ForecastBench tournament leaderboard, superforecasters retain the #1 position with a 2.3 Brier Index point lead as of May 30, 2026 [2, 3]
- On market questions — the most practically relevant subset — the AI error rate is approximately 50% larger than superforecasters' [5, 13]
- On complex, judgment-heavy, continuously updated questions of the type institutional decision-makers actually ask, Good Judgment's analysts doubt the gap will close within the next year [5, 13]
- AI systems currently operate as "solo" forecasters, lacking the group deliberation and aggregation techniques that boost human superforecaster accuracy by 10–25% [5]
- Human forecasters continuously update in response to new information; current AI systems typically provide one-off predictions [5]
- Earlier hybrid trials found superforecasters outperforming combined human–AI teams in some configurations [5]
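One mechanical reason aggregation helps is that the Brier score is convex in the forecast, so the score of an averaged forecast can never be worse than the average of the individual scores. The sketch below demonstrates this with synthetic forecasters. It does not attempt to reproduce the 10–25% figure cited above, and it uses a simple mean rather than the median used for the ForecastBench human baseline, because the inequality is guaranteed only for the mean.

```python
import random


def brier(p, outcome):
    """Squared error of a single probability forecast against a 0/1 outcome."""
    return (p - outcome) ** 2


rng = random.Random(7)
n_questions, n_forecasters = 200, 9

individual_total = 0.0
aggregate_total = 0.0
for _ in range(n_questions):
    true_p = rng.random()
    outcome = 1 if rng.random() < true_p else 0
    # Each hypothetical forecaster sees the true probability plus private noise.
    forecasts = [
        min(1.0, max(0.0, true_p + rng.gauss(0, 0.15)))
        for _ in range(n_forecasters)
    ]
    individual_total += sum(brier(p, outcome) for p in forecasts) / n_forecasters
    crowd = sum(forecasts) / n_forecasters
    aggregate_total += brier(crowd, outcome)

mean_individual = individual_total / n_questions
mean_aggregate = aggregate_total / n_questions
print(f"average individual Brier score: {mean_individual:.4f}")
print(f"Brier score of averaged forecast: {mean_aggregate:.4f}")
```

The size of the improvement depends on how independent the forecasters' errors are. Several runs of one model that share the same blind spots would gain less than a diverse human team.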
Structural limitations of the benchmark itself:
- ForecastBench uses a frozen 2024 human baseline, meaning the human comparison group does not improve over time while AI systems do — creating an asymmetric comparison [2, 3]
- The benchmark is skewed toward short-horizon, data-rich questions that structurally favor AI [2, 3]
- The difficulty adjustment methodology adds uncertainty to cross-system comparisons [1, 2]
- ForecastBench does not capture teaming, iterative updating, or question formulation — all dimensions where human superforecasters have demonstrated advantages [2, 3]
Summary assessment
| Dimension | AI Status vs. Superforecasters |
|---|---|
| ForecastBench overall (May 2026) | Behind — 2.3 Brier Index points [2] |
| Dataset questions subset | At or above parity [2, 3] |
| Market questions subset | Significantly behind (~50% larger error rate) [5, 13] |
| FB-7-21 specific set (AIA Forecaster) | Statistical parity [4] |
| General public baseline | Surpassed [1, 2] |
| Complex judgment questions | Behind [5, 13] |
| Continuous updating | Behind [5] |
| Group deliberation | Not applicable to current AI [5] |
The claim that parity has been "reached" is overstated as a general proposition and contested by the primary benchmark data. It is accurate only for specific subsets and specific question configurations. The Forecasting Research Institute's own leaderboard — the primary evidence base for these claims — shows superforecasters still leading as of May 30, 2026 [2, 3].
5. Concrete Implications for Finance, Insurance, and Governance
Finance
AI forecasting systems are being actively integrated into quantitative finance workflows, with the primary use cases being macroeconomic scenario generation, earnings surprise prediction, and geopolitical risk pricing. The ForecastBench data showing the best AI systems at Brier scores of roughly 0.10, approaching but not yet matching elite human accuracy, is directly relevant to systematic trading strategies where a marginal predictive edge compounds over large numbers of decisions [5, 6].
The Bank for International Settlements has examined AI forecasting adoption in financial institutions, noting that the governance challenge is not merely technical accuracy but the auditability and explainability of probabilistic outputs [15]. Current AI forecasting systems produce calibrated probabilities that are in principle more auditable than human judgment, but the models' sensitivity to training data composition and prompt framing introduces new forms of systematic error that are harder to detect than human cognitive biases.
The practical implication of the current human edge on market questions is significant for finance: on questions directly analogous to financial market outcomes, superforecasters' Brier score of approximately 0.039 versus AI's approximately 0.059 means the AI error is roughly 50% larger, a differential that would translate to material performance differences in systematic strategies [5, 13].
Insurance
For insurance applications — catastrophe modeling, longevity risk, liability tail estimation — the relevant comparison is not the overall ForecastBench leaderboard but performance on low-frequency, high-consequence events where training data is sparse. This is precisely the domain where the benchmark's skew toward data-rich, short-horizon questions makes it least informative.
The AIA Forecaster's technical report documents performance on judgmental forecasting using unstructured data [4], which is closer to the insurance use case than standard ForecastBench questions. The statistical indistinguishability from superforecasters on FB-7-21 is therefore more relevant to insurance applications than the overall leaderboard gap — but the caveat that this is a single-lab result on a constrained question set applies with full force.
Actuarial bodies have not yet formally recognized AI forecasting systems as equivalent to human expert judgment for regulatory purposes, and the frozen 2024 human baseline in ForecastBench means the benchmark cannot capture improvements in human forecasting practice that may be occurring simultaneously [2].
Governance and Policy
Good Judgment's work on AI governance applications documents superforecasters' use in policy scenario planning, intelligence assessment, and regulatory foresight [6, 7]. The Longitudinal Expert AI Panel (LEAP) Wave 5 report on security and geopolitics provides the most direct evidence of AI forecasting in governance contexts, showing that AI systems are being used as inputs to — but not replacements for — human expert judgment in high-stakes policy settings [16].
The OECD has examined AI-assisted forecasting in migration anticipation and preparedness contexts, noting that model evaluation frameworks for policy use require different criteria than benchmark leaderboards: specifically, the ability to handle novel situations, explain reasoning to non-technical stakeholders, and update in response to policy interventions [17].
The governance implication of the current evidence is that AI forecasting tools are most defensibly used to augment rather than replace human superforecasters. Research on AI-augmented predictions finds that LLM assistants improve human forecasting accuracy [18], which suggests that the near-term value proposition is hybrid rather than autonomous [16, 18].
6. Strongest Evidence For and Against Parity
Strongest evidence that parity has been reached
- AIA Forecaster on FB-7-21 [4]: Brier score of 0.1125 versus superforecaster median of 0.1110. The two are statistically indistinguishable, with the AI outperforming prior LLM systems (o3 at 0.1096) and the general crowd by a clear margin.
- Dataset question subset leadership [2, 3]: "Green tree" ranks #1 on ForecastBench's dataset questions subset, surpassing the superforecaster median on that category.
- Rapid convergence trajectory [1, 5]: The gap has closed from approximately 50% two years ago to approximately 20% by late 2025, with FRI projecting full parity by August 2027 (central estimate) and potentially as early as November 2026 under optimistic assumptions.
- Specialized tournament performance [5]: An AI system placed 4th out of 500+ in the Fall 2025 Metaculus Cup, beating the weighted average of all human-forecaster predictions.
- BLF market-question performance [2]: A research prototype reportedly achieves a Brier Index of 85.2 on market questions, which would exceed the superforecaster market-question baseline of 80.3, though this is a preliminary, undeployed system.
Strongest evidence that parity has not been reached
- ForecastBench overall leaderboard, May 30, 2026 [2, 3]: Superforecasters lead at Brier Index 70.2 versus "green tree" at 67.8–67.9, a 2.3-point gap that is not within noise.
- Market questions gap [5, 13]: On the most practically relevant question subset, AI error rates are approximately 50% larger than superforecasters' (Brier score ≈ 0.059 vs. ≈ 0.039).
- Frozen baseline asymmetry [2, 3]: The human baseline is fixed at 2024 performance; if superforecasters have continued improving, the true gap may be larger than the leaderboard suggests.
- Benchmark structural limitations [2, 3]: ForecastBench favors short-horizon, data-rich questions and lacks teaming, updating, and question formulation, all dimensions where human advantages are documented.
- No DeepMind confirmation [9, 10]: The "March 15, 2026 parity" claim has no primary-source confirmation from DeepMind, FRI, or any peer-reviewed publication.
- Expert skepticism on complex questions [5, 13]: Good Judgment's analysts, who operate the superforecaster programs used as the human baseline, explicitly doubt the gap will close within the next year for the complex, fuzzy questions that institutional clients actually ask.
7. Current Honest Status
As of May 31, 2026, the honest status is as follows:
AI has achieved partial, domain-specific parity with elite human superforecasters on constrained benchmarks. On ForecastBench's dataset questions subset, the best AI systems (led by DeepMind "green tree") have matched or exceeded the superforecaster median. On a specific question set (FB-7-21), Bridgewater's AIA Forecaster is statistically indistinguishable from superforecasters. These are genuine, significant results that represent a qualitative shift from the state of AI forecasting two to three years ago [1, 2, 4].
Broad, sustained parity across the full benchmark has not been established. The superforecaster aggregate retains the #1 position on both the tournament and baseline ForecastBench leaderboards as of May 30, 2026, with a 2.3 Brier Index point lead. On market questions — the most practically relevant category — the human advantage is substantially larger [2, 3, 13].
The "Green Tree parity on March 15, 2026" claim is not verifiable from primary sources and is contradicted by the current leaderboard data showing a persistent gap. The claim appears to have originated from a secondary newsletter's interpretation of subset-level performance, not from any official announcement [9, 10, 3].
The trajectory suggests full benchmark parity within roughly the next two years if current improvement rates continue, with FRI's central estimate at August 2027 and optimistic scenarios pointing to late 2026 [2, 5]. Whether benchmark parity will translate to practical parity on the complex, continuously updated, judgment-intensive questions that matter most to finance, insurance, and governance practitioners remains an open and actively contested question [5, 6, 7, 13].
The most defensible summary for practitioners: AI forecasting systems have entered the same performance regime as elite human superforecasters on structured, data-rich questions, and are closing the gap on harder questions at a pace that warrants serious attention. They have not yet replaced superforecasters as the gold standard for the full range of real-world forecasting tasks that institutional decision-makers care about.
References
[1] "ForecastBench: A Dynamic Benchmark of AI Forecasting Systems." https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf
[2] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html
[3] ForecastBench. forecastbench.org. https://forecastbench.org
[4] "AIA Forecaster: Technical Report." https://arxiv.org/html/2511.07678v1
[5] Human vs AI Forecasts: What Leaders Need to Know. goodjudgment.com. https://goodjudgment.com/human-vs-ai-forecasts
[6] Superforecasting AI Governance. goodjudgment.io. https://goodjudgment.io/docs/Superforecasting_AI_Governance_s.pdf
[7] Superforecasting AI Governance. goodjudgment.com. https://goodjudgment.com/superforecasting-ai-governance
[8] WeatherNext 2: Google DeepMind’s most advanced forecasting model. blog.google. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/weathernext-2
[9] News — Google DeepMind. deepmind.google. https://deepmind.google/blog
[10] Publications — Google DeepMind. deepmind.google. https://deepmind.google/research/publications
[11] "ForecastBench: A Dynamic Benchmark of AI Forecasting Systems." https://arxiv.org/pdf/2409.19839
[12] Making Forecasting Scores Easier to Interpret: Introducing the Brier Index. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/introducing-the-brier-index
[13] What Superforecasters Actually Said About ForecastBench. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench
[14] "AIA Forecaster: Technical Report." https://arxiv.org/pdf/2511.07678
[15] FSI Insights No. 63. bis.org. https://bis.org/fsi/publ/insights63.pdf
[16] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5
[17] How can the model be evaluated? In: Migration anticipation and preparedness. oecd.org. https://oecd.org/en/publications/2026/03/migration-anticipation-and-preparedness_c7c13bc4/full-report/how-can-the-model-be-evaluated_1c893567.html
[18] "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy." https://arxiv.org/pdf/2402.07862