Executive Summary
- "Green Tree" is real but mischaracterized: Google DeepMind did submit a system codenamed "green tree" to ForecastBench, and it ranks #2 overall — but the claimed March 15, 2026 parity date is unverified; the only sourced reference to this milestone appears in a single newsletter dated May 22, 2026, not in any FRI, DeepMind, or ForecastBench official publication [1].
- The numbers show a persistent gap: As of late May 2026, the superforecaster median Brier Index on ForecastBench stands at approximately 70.2, versus "green tree's" 67.8–67.9 — a gap of roughly 2–3 index points that is small but not zero, and superforecasters lead even more substantially on market and geopolitical questions [2, 3, 4].
- "Parity" is contested and overstated: The AIA Forecaster technical report claims statistical indistinguishability from superforecasters on ForecastBench [5], but the official leaderboard, the Wharton ForecastBench paper, and Good Judgment Inc. all maintain that superforecasters retain the top position [6, 7, 8].
- Trajectory is clear, arrival is not: ForecastBench's own extrapolation puts full parity at approximately August 2027 (95% CI: March 2026–August 2028), and the FRI's Wave 5 expert panel median is "by end of 2030" [2, 9].
- Practical implications are real but prospective: Finance, insurance, and governance applications are being actively discussed and piloted, but no evidence supports large-scale replacement of human superforecaster teams as of May 2026 [7, 8, 10].
1. The "Green Tree" Claim: System, Date, and Benchmark
What Is Confirmed
Google DeepMind did submit a forecasting system to the Forecasting Research Institute's ForecastBench tournament leaderboard under the codename "green tree" [2, 3]. This is well-established across multiple independent sources. ForecastBench maintains public leaderboards that list AI submissions under anonymized codenames — "green tree" and a companion entry called "yellow mouse" are both attributed to DeepMind, while xAI's submission appears as "Grok 4.20 (Preview)" and CassiAI's as "ensemble_2_crowdadj" [2, 3, 4].
The benchmark used is ForecastBench itself: a dynamic, contamination-free benchmark of LLM forecasting accuracy on real-world events, scored via a difficulty-adjusted Brier Index on a 0–100 scale where higher values indicate better performance [11, 2, 12]. The benchmark draws on approximately 1,000 automatically generated and regularly updated forecasting questions, with hidden evaluation data released only after resolution to prevent memorization from training corpora [6, 2].
What Is Not Confirmed: The March 15, 2026 Date
The specific claim that "green tree" hit parity with top human superforecasters "around March 15, 2026" is not corroborated by any primary source. No FRI publication, DeepMind blog post, or official ForecastBench announcement confirms this date [2, 3, 13, 1]. The only sourced appearance of the "green tree parity" narrative is a single newsletter by Dr. Alex Wissner-Gross, "Welcome to May 22, 2026", dated nearly ten weeks after the claimed event [1]. One analysis notes that public announcements and leaderboard claims of parity, or of topping the dataset questions, occurred in mid-to-late May 2026, not mid-March [14, 15].
It is plausible that "green tree" first topped the dataset-question subset of ForecastBench sometime in the March–May 2026 window, given the trajectory of scores. But the specific date of March 15, 2026 should be treated as unverified, and the claim of "full parity" on the overall benchmark is contradicted by the leaderboard data described below.
DeepMind's Named Public Systems Are Meteorological
It is worth clarifying a potential source of confusion: DeepMind's publicly named and documented forecasting systems during this period — GraphCast and WeatherNext 2 — are meteorological models entirely unrelated to geopolitical or event superforecasting [16, 13]. WeatherNext 2, introduced by Google DeepMind and Google Research, generates probabilistic weather forecasts for up to 15 days ahead, surpasses the previous WeatherNext model on 99.9% of variables and lead times, and operates at up to 1-hour resolution [13]. GraphCast predicts weather conditions up to 10 days in advance more accurately and faster than ECMWF's gold-standard HRES system [16]. Neither system is "green tree," and neither is designed for the kind of geopolitical, economic, and social event forecasting that ForecastBench measures.
2. ForecastBench Scores: AI vs. Human Superforecasters
The Benchmark Architecture
ForecastBench was originally launched in September 2024 [11, 12] and received a major update in October 2025 [3, 12]. The February 2026 academic paper published through Wharton formalized the methodology [6, 7]. The benchmark maintains two primary human comparison groups: the general public and elite "superforecasters" drawn from organizations like Good Judgment Inc. [6, 2]. Performance is measured via the Brier score (mean squared error of probabilistic forecasts, where 0 is perfect and 0.25 is equivalent to always predicting 50%) and the Brier Index, defined as (1 − √Brier score) × 100, producing a 0–100 scale where higher is better [2, 3].
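The two metrics defined above can be sketched in a few lines of Python. This is a minimal illustration of the formulas as stated in this section, not ForecastBench's actual scoring code (which also applies a difficulty adjustment not modeled here); the function names `brier_score` and `brier_index` are chosen for this example.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes (0 or 1)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_index(score):
    """Brier Index as defined in the text: (1 - sqrt(Brier score)) * 100."""
    return (1 - math.sqrt(score)) * 100

# Always answering 50% yields a Brier score of 0.25, i.e. an index of 50.
coin_flip = brier_score([0.5, 0.5, 0.5], [1, 0, 1])
print(coin_flip, brier_index(coin_flip))

# October 2025 Brier scores reported in this section.
print(round(brier_index(0.081), 1))  # superforecasters
print(round(brier_index(0.101), 1))  # best model
```

Under these definitions, the October 2025 Brier scores of 0.081 and 0.101 map to index values of about 71.5 and 68.2, matching the figures reported for that date.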
The Quantitative Record
The following table summarizes the documented score progression:
| Date | Entity | Brier Score | Brier Index | Source |
|---|---|---|---|---|
| October 2025 | GPT-4.5 (best model) | 0.101 | ~68.2 | [3, 4] |
| October 2025 | Superforecasters | 0.081 | ~71.5 | [3, 4] |
| January 29, 2026 | Superforecasters (#1) | ~0.086 | 70.6 | [17, 4] |
| January 29, 2026 | Grok 4.20 / ensemble_2_crowdadj (tied #2) | 0.103 | ~67.9 | [3, 4] |
| Early March 2026 | Superforecasters | 0.086 | 70.6 | [17, 4] |
| Early March 2026 | Best LLMs (CassiAI / Grok 4.20) | 0.103 | 67.9 | [17, 4] |
| Late May 2026 | Superforecasters (overall) | ~0.089 | 70.2 | [2, 3] |
| Late May 2026 | DeepMind "green tree" (#2 overall) | ~0.103–0.104 | 67.8–67.9 | [2, 3] |
Notes on the Wharton paper baseline: The ForecastBench academic paper [6, 7] evaluated a 200-item subset and found superforecasters at a mean Brier score of 0.096, the general public at 0.121, and Claude 3.5 Sonnet at 0.122. The difference between superforecasters (0.096) and the best LLMs (~0.122) was statistically significant at p < 0.001 [7]. The FRI's complementary leaderboard analysis, which uses the full dynamic question set, reports a somewhat better superforecaster Brier score of 0.086 (Brier Index 70.6) and a best-LLM Brier Index of 67.9 [17, 4]. The difference in baselines reflects different question sets and time windows, not a methodological contradiction.
On market questions specifically: Superforecasters lead by substantially more on market and geopolitical questions than on general dataset questions. One analysis reports a superforecaster score of approximately 80.3 versus approximately 75.8 for the nearest AI on the transformed scale for market questions, with traditional Brier scores showing the AI's error roughly 50% higher in this category [11, 17]. The February 2026 analysis reached the same conclusion, finding superforecasters nearly 50% more accurate than the nearest AI entrant on market questions [4, 18].
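The two ways of stating the market-question gap (80.3 versus 75.8 on the transformed scale, and an AI error roughly 50% higher) can be reconciled with a short calculation. The sketch below assumes the "transformed scale" is the Brier Index defined earlier, (1 − √Brier score) × 100; that reading is an assumption, and `brier_from_index` is a helper name introduced only for this example.

```python
def brier_from_index(index):
    """Invert Brier Index = (1 - sqrt(Brier)) * 100 to recover the Brier score."""
    return (1 - index / 100) ** 2

# Market-question figures quoted in the text (assumed to be Brier Index values).
human = brier_from_index(80.3)
ai = brier_from_index(75.8)

# How much larger the AI's Brier score is, relative to the superforecasters'.
relative_excess = ai / human - 1
print(round(human, 4), round(ai, 4), round(relative_excess, 2))
```

If the assumption holds, the implied Brier scores are about 0.039 and 0.059, so the AI's squared error is about half again as large as the superforecasters', in line with the "approximately 50%" figure.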
The Metaculus comparison: A separate arXiv study assessed state-of-the-art models on 464 forecasting questions from Metaculus [15]. OpenAI o3 achieved a Brier score of 0.1352 versus a human crowd baseline of 0.149 — confirming that frontier AI has surpassed the median human crowd [15]. However, Metaculus superforecasters have achieved Brier scores as low as 0.023 on that platform's question set [15], illustrating the substantial ceiling that elite human forecasters represent on well-curated, long-horizon questions.
The AIA Forecaster Claim
The AIA Forecaster technical report [5] makes the strongest AI-parity claim in the academic literature: it states that the AIA Forecaster achieves performance "statistically indistinguishable" from human superforecasters on ForecastBench and surpasses prior LLM baselines. However, the same paper acknowledges that on a harder benchmark built from liquid prediction markets, the AIA Forecaster underperforms market consensus — though an ensemble combining the AIA Forecaster with market consensus outperforms consensus alone [5]. This pattern — parity on the academic benchmark, underperformance on harder real-money markets — is a recurring theme in the evidence.
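To show how an ensemble can beat market consensus even when the model alone does not, here is a toy sketch. All probabilities and outcomes are invented for illustration, and the equal-weight average in `blend` is an assumption of this example, not the combination method used in the AIA Forecaster report.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

def blend(a, b, weight_a=0.5):
    """Weighted average of two forecast lists (hypothetical ensembling rule)."""
    return [weight_a * x + (1 - weight_a) * y for x, y in zip(a, b)]

# Hypothetical data: four resolved questions, invented for this example only.
outcomes = [1, 0, 1, 0]
market = [0.7, 0.2, 0.6, 0.4]   # stand-in for market consensus
model = [0.5, 0.5, 0.9, 0.1]    # stand-in for a model that trails the market

ensemble = blend(model, market)

scores = {
    "market": brier_score(market, outcomes),
    "model": brier_score(model, outcomes),
    "ensemble": brier_score(ensemble, outcomes),
}
for name, value in scores.items():
    print(name, round(value, 4))
```

In this made-up case the model scores worse than the market on its own, yet the blend scores better than either, because the two sources err on different questions. Whether that holds on real questions is an empirical matter that toy data cannot settle.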
3. Is "Parity" Established, Contested, or Overstated?
The Strongest Evidence That Parity Has Been Reached
The most credible pro-parity evidence comes from four directions:
- Score convergence on ForecastBench: The gap between the best AI (Brier Index ~67.9) and superforecasters (70.2–70.6) is now approximately 2–3 index points, down from about 3.3 index points (roughly a 20% relative gap in Brier score) in October 2025 [17, 2, 3, 4]. The trajectory of improvement, roughly 0.017 Brier points per year as framed by FRI [18], suggests the remaining gap is continuing to narrow.
- Dataset-question subset leadership: One analysis, drawing on leaderboard snapshots, indicates that "green tree" ranked #1 on dataset questions in some snapshots, meaning AI has at minimum reached or exceeded the superforecaster aggregate on the non-market portion of ForecastBench [2, 3]. This is a meaningful milestone even if it does not constitute overall parity.
- The AIA Forecaster paper: The technical report [5] explicitly claims statistical indistinguishability from superforecasters on ForecastBench, providing the most formal academic support for a parity claim.
- Tournament performance: An AI developed by start-up Mantic placed 4th out of 500+ participants in a major 2025 forecasting tournament and beat the wisdom-of-crowd average of all human forecasters in that event [19]. A 2026 study found that Gemini 2.5 Pro exceeded human venture evaluators in predicting startup success [17].
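The score-convergence point can be checked directly against the figures reported in this document. The sketch below recomputes the index-point gap at three snapshots and the relative Brier gap for October 2025. Using 67.85 for late May is an assumption of this example (the midpoint of the reported 67.8–67.9 range), and measuring the relative gap against the AI's score is one of two reasonable conventions.

```python
# (label, superforecaster Brier Index, best-AI Brier Index) as reported in this document.
snapshots = [
    ("October 2025", 71.5, 68.2),
    ("Early March 2026", 70.6, 67.9),
    ("Late May 2026", 70.2, 67.85),  # assumed midpoint of the 67.8-67.9 range
]

def index_gap(human, ai):
    """Gap in Brier Index points, rounded to two decimals."""
    return round(human - ai, 2)

gaps = {label: index_gap(h, a) for label, h, a in snapshots}
for label, gap in gaps.items():
    print(label, gap)

# Relative Brier-score gap in October 2025 (0.081 vs 0.101), measured against the AI score.
relative_oct_2025 = (0.101 - 0.081) / 0.101
print(round(relative_oct_2025, 3))
```

On these inputs the gap shrinks at each snapshot, and the October 2025 relative gap comes out close to the "roughly 20%" quoted above.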
The Strongest Evidence That Parity Has Not Been Reached
The preponderance of primary-source evidence supports the conclusion that full, consistent parity has not been established:
- The official leaderboard: As of late May 2026, the ForecastBench official standings show the superforecaster aggregate ahead of every AI submission by a small but non-zero margin on the overall leaderboard [2, 4]. No AI has exceeded the top human benchmark in a sustained, statistically significant way on the overall score [2].
- Good Judgment Inc.'s position: Good Judgment Inc., the organization that manages the superforecaster baseline, stated explicitly in late 2025 that "Superforecasters still lead," citing a superforecaster Brier score of approximately 0.081 versus the best model's approximately 0.10 [8]. Their analysis [10] also argues that ForecastBench does not capture the full range of superforecaster capabilities.
- Market question gap: The AI deficit is substantially larger on market and geopolitical questions, the categories most relevant to real-world decision-making in finance and governance. Even if dataset-question parity has been achieved, the practically important gap remains [11, 17, 4].
- Benchmark limitations: The human baseline on ForecastBench is frozen from 2024 data [17, 20]. The benchmark is also known to favor AI on short-horizon, data-rich questions, and it does not capture teaming, iterative updating, question formulation, or long-horizon judgment, capabilities where human superforecasters are believed to hold advantages [17, 20, 10].
- Expert forecasts of parity: ForecastBench's own extrapolation puts full parity at approximately August 2027 (95% CI: March 2026–August 2028) [2, 3]. The FRI's Wave 5 Longitudinal Expert AI Panel [9] reports a median expert prediction of "by the end of 2030" for AI to outperform superforecasters. Superforecasters themselves predict AI will overtake their benchmark by approximately 2028 [17, 20].
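For intuition about how a parity date can be extrapolated, here is a deliberately naive sketch. It assumes a constant linear improvement rate and a fixed human baseline. Both are simplifications made for this example, and this is not ForecastBench's extrapolation method. The inputs are the early March 2026 Brier scores (0.103 and 0.086) and the roughly 0.017 Brier points per year improvement rate cited in this document; `months_to_parity` is a name introduced here.

```python
def months_to_parity(ai_brier, human_brier, improvement_per_year):
    """Months until the AI Brier score reaches a fixed human score at a constant rate."""
    if improvement_per_year <= 0:
        raise ValueError("improvement_per_year must be positive")
    gap = max(ai_brier - human_brier, 0.0)  # zero if the AI is already at or ahead
    return 12 * gap / improvement_per_year

# Early March 2026 figures from this document; the linear trend is an assumption.
estimate = months_to_parity(0.103, 0.086, 0.017)
print(round(estimate, 1))
```

With these inputs the straight-line estimate is about 12 months from early March 2026. That falls inside the 95% interval quoted above but earlier than the August 2027 central estimate, so it should not be read as a reproduction of that figure.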
The Honest Assessment
The evidence supports a nuanced position: AI has achieved benchmark-specific, subset-level parity on the dataset-question component of ForecastBench, and the overall gap has narrowed to a range where statistical noise and methodological choices can make it appear to vanish in some analyses. The AIA Forecaster's claim of statistical indistinguishability is the strongest formal support for parity, but it applies to a specific benchmark under specific conditions. The official leaderboard, the Wharton paper, and Good Judgment Inc. all maintain that superforecasters retain the lead on the overall, contamination-controlled benchmark.
The claim that "green tree" hit parity on March 15, 2026 is unverified by primary sources and likely conflates subset-level performance with overall parity. The honest status as of May 31, 2026 is: near-parity on structured, short-to-medium-horizon questions; persistent human advantage on market, geopolitical, and long-horizon questions; full overall parity not yet established but plausibly 12–30 months away.
4. Implications for Finance, Insurance, and Governance
Finance
The most concrete near-term application is in trading and portfolio risk management. Financial institutions including Goldman Sachs have begun integrating AI-assisted forecasts into trading and macro views, including analysis of prediction market data from platforms like Kalshi and Polymarket [21]. AI forecasting improvements are cited as potentially enhancing market predictions, portfolio risk management (including VaR and tail risk estimation), credit assessment, and fraud detection [7, 8].
The systemic risk dimension is also being discussed: if AI systems begin moving markets autonomously through their predictions, new oversight regimes may be needed to prevent destabilizing feedback loops analogous to the flash crashes associated with algorithmic trading [7]. Regulators are already debating whether AI-generated probabilistic market forecasts should be regulated as financial advice or as automated trading signals [7]. The OECD has described AI systems as supporting ex ante policy evaluations by building predictive systems and simulations that help policymakers anticipate potential impacts before implementation [22].
The democratization argument is frequently cited: if superforecaster-level predictive insight becomes available at near-zero marginal cost through AI, the informational advantages currently held by well-resourced institutional investors would compress [7]. This cuts both ways — more efficient markets benefit allocative efficiency but reduce the returns to human analytical skill.
Insurance
Insurance applications are among the most directly actionable. AI underwriting — defined as the application of machine learning, NLP, and predictive analytics to evaluate risk, price policies, and automate approvals using 500 to 1,500 or more data variables per submission [18] — is already in deployment at various carriers. The prospective implication of superforecaster-level AI is more precise catastrophe modeling, improved solvency forecasting, and better-calibrated premium pricing for tail risks including natural disasters and pandemic scenarios [7, 23].
The Forecasting Research Institute's economic effects research [24] and industry analyses [23] both note that AI could improve probabilistic forecasting for claims, solvency, and catastrophe modeling. However, no evidence as of May 2026 supports large-scale replacement of human actuarial or underwriting teams — the current state is augmentation rather than substitution [8, 10].
Governance
The governance implications are the most speculative but potentially the most consequential. The OECD's framework for AI in policy evaluation [22] describes AI systems as enabling ex ante impact assessments that were previously too resource-intensive for routine policy analysis. If AI forecasting reaches reliable superforecaster-level performance on geopolitical and policy questions, governments would have access to institutional-grade probabilistic forecasts at near-zero cost [7].
The risks cited include over-reliance by decision-makers, erosion of public transparency if AI forecasts are used to pre-empt democratic deliberation, and the potential for bad actors to exploit advanced forecasts to game geopolitical events [7]. The Brookings Institution's analysis of public attitudes toward AI governance [25] notes that public trust in AI-generated forecasts for high-stakes decisions remains limited, which may constrain adoption even as technical capabilities advance.
No evidence of immediate, at-scale deployment replacing superforecaster teams in governance contexts has been reported as of May 2026 [8, 10]. The current practical state is pilot programs and advisory integration, not operational substitution.
5. Methodological Caveats and the Limits of "Parity" as a Concept
Several structural issues complicate any clean parity determination:
The frozen baseline problem: ForecastBench's human superforecaster baseline was established from 2024 data [17, 20]. If superforecasters have continued to improve — including by using AI assistance — the benchmark may understate the current human frontier. The ForecastBench paper [6] and Good Judgment Inc. [10] both note that the benchmark does not capture the full range of superforecaster capabilities, including iterative updating, collaborative forecasting, question decomposition, and explicit pre-mortem reasoning.
The pre-mortem gap: State-of-the-art models almost never perform explicit pre-mortems (systematically outlining how a prediction could be wrong) or weigh wildcard scenarios — practices that seasoned human forecasters employ routinely and that are believed to contribute substantially to their calibration on low-probability, high-impact events [8]. This gap is not captured by Brier scores on standard benchmark questions.
Question selection effects: AI systems show advantages on short-horizon, data-rich questions and disadvantages on long-horizon, sparse-data, or structurally novel questions [17, 20]. ForecastBench's question distribution may favor AI relative to the question types that matter most for real-world decision-making.
The Goodhart's Law risk: As AI systems are explicitly optimized against ForecastBench, the benchmark's validity as a measure of general forecasting capability may degrade [20]. This is a standard concern with any fixed benchmark that becomes a training target.
Prediction market evidence: The AIA Forecaster's underperformance relative to market consensus on liquid prediction markets [5] — where real money creates strong incentives for calibration — suggests that benchmark performance may overstate real-world forecasting capability. Prediction markets aggregate information from many participants with skin in the game; an AI that trails this consensus on liquid markets has not demonstrated practical superforecaster-level performance regardless of its ForecastBench score.
Summary Table: Evidence State as of May 31, 2026
| Dimension | AI Status | Human Status | Gap | Confidence |
|---|---|---|---|---|
| ForecastBench overall Brier Index | ~67.8–67.9 (green tree) | ~70.2 (superforecaster median) | ~2.3–2.4 points | High [2, 3] |
| ForecastBench dataset questions | At or near parity in some snapshots | Baseline ~70.6 | Near zero / contested | Moderate [2, 3] |
| ForecastBench market questions | ~75.8 (transformed scale) | ~80.3 | ~4.5 points | Moderate [11, 17] |
| Wharton paper (200-item subset) | 0.122 (Claude 3.5 Sonnet) | 0.096 (superforecasters) | 0.026 Brier (p<0.001) | High [6, 7] |
| Metaculus crowd comparison | 0.1352 (o3) — beats crowd | 0.149 (crowd) / 0.023 (superforecasters) | Beats crowd; far from elite | High [15] |
| Liquid prediction markets | Below market consensus | Market consensus | AI trails | Moderate [5] |
| "Green tree" March 15 parity date | Unverified | N/A | N/A | Low [1] |
| Projected full parity (ForecastBench) | ~August 2027 (median) | N/A | 95% CI: Mar 2026–Aug 2028 | Moderate [2, 3] |
| Expert panel median parity estimate | N/A | N/A | By end of 2030 | Moderate [9] |
The current honest status is that AI forecasting has made remarkable and well-documented progress, closing a gap that looked insurmountable three years ago. On structured, short-to-medium-horizon questions with rich data, frontier AI systems are now operating within the statistical neighborhood of elite human superforecasters. But the official leaderboard, the most rigorous academic paper on the benchmark, and the organizations that manage the human baseline all agree: superforecasters retain the lead, the gap on practically important question categories (markets, geopolitics) remains meaningful, and the claim of established parity as of March or May 2026 is not supported by primary-source evidence.
References
[1] Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross. theinnermostloop.substack.com. https://theinnermostloop.substack.com/p/welcome-to-may-22-2026
[2] ForecastBench. forecastbench.org. https://forecastbench.org
[3] State-of-the-art model forecasting performance over time. forecastbench.org. https://forecastbench.org/explore/index.html
[4] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html
[5] "AIA Forecaster: Technical Report." https://arxiv.org/html/2511.07678v1
[6] "ForecastBench: A Dynamic Benchmark of AI ...." https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf
[7] Human vs AI Forecasts: What Leaders Need to Know. goodjudgment.com. https://goodjudgment.com/human-vs-ai-forecasts
[8] What Superforecasters Actually Said About ForecastBench. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench
[9] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5
[10] What ForecastBench Doesn't Measure. goodjudgment.com. https://goodjudgment.com/what-forecastbench-doesnt-measure
[11] "ForecastBench: A Dynamic Benchmark of AI Forecasting ...." https://arxiv.org/pdf/2409.19839
[12] "ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities." https://arxiv.org/html/2409.19839v4
[13] WeatherNext 2: Google DeepMind’s most advanced forecasting model. blog.google. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/weathernext-2
[14] "Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts." https://arxiv.org/pdf/2507.19477
[15] "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy." https://arxiv.org/pdf/2402.07862
[16] GraphCast: AI model for faster and more accurate global weather forecasting — Google DeepMind. deepmind.google. https://deepmind.google/blog/graphcast-ai-model-for-faster-and-more-accurate-global-weather-forecasting
[17] Making Forecasting Scores Easier to Interpret: Introducing the Brier Index. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/introducing-the-brier-index
[18] Research — Forecasting Research Institute. forecastingresearch.org. https://forecastingresearch.org/research
[19] AI Is Getting Scary Good at Making Predictions. theatlantic.com. https://theatlantic.com/technology/2026/02/ai-prediction-human-forecasters/685955
[20] How well can large language models predict the future?. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/ai-llm-forecasting-model-forecastbench-benchmark
[21] Goldman Sachs Using AI to Analyze Kalshi, Polymarket Prediction Markets. investmentnews.com. https://investmentnews.com/emerging-markets/goldman-sachs-using-ai-to-analyze-kalshi-polymarket-prediction-markets/266620
[22] AI in Policy Evaluation. oecd.org. https://oecd.org/en/publications/2025/06/governing-with-artificial-intelligence_398fa287/full-report/ai-in-policy-evaluation_c88cc2fd.html
[23] What Comes Next for Insurance AI | WaterStreet Company. waterstreetcompany.com. https://waterstreetcompany.com/what-comes-next-for-insurance-ai
[24] Economic Effects of AI. forecastingresearch.org. https://forecastingresearch.org/research/economic-effects-of-ai
[25] What the Public Thinks About AI and the Implications for Governance. brookings.edu. https://brookings.edu/articles/what-the-public-thinks-about-ai-and-the-implications-for-governance