May 31, 2026·14 min read·17 views·4 providers

AI vs Superforecasters — ForecastBench May 2026 Verdict

By May 30, 2026, ForecastBench shows DeepMind's 'green tree' at Brier Index 67.8 vs. superforecasters at 70.2. Claims of parity are selective and contested.

Key Finding

ForecastBench’s official leaderboard shows the Superforecaster median forecast leading overall, with Google DeepMind’s “green tree” submission ranked behind it as the top AI entrant, and ForecastBench uses difficulty-adjusted Brier scores converted into a Brier Index for comparison.

High confidence. Supported by anthropic, perplexity, openai, grok

Justin Furniss

@Parallect.ai and @SecureCoders. Founder. Hacker. Father. Seeker of all things AI

Executive Summary

"Green Tree" is real but mischaracterized: Google DeepMind's "green tree" is an anonymized leaderboard entry on the Forecasting Research Institute's ForecastBench — not a standalone branded system — and the March 15, 2026 parity date cited in popular coverage appears to conflate a dataset-question milestone with overall benchmark parity.
Current ForecastBench standings (May 30, 2026): The Superforecaster Median holds an overall Brier Index of 70.2; green tree sits at 67.8 — a gap of 2.4 index points, placing AI firmly at #2 overall but not at parity.
Parity is contested and overstated: On the overall leaderboard, superforecasters retain a statistically meaningful lead. On the "dataset questions" sub-category, AI has reached or exceeded human performance; on the harder "market questions," superforecasters are nearly 50% more accurate than the best AI entry.
Methodological caveats are substantial: ForecastBench freezes its human baseline at 2024, allows multiple AI submission attempts, and covers only binary yes/no questions, all of which may flatter AI performance relative to real-world operational forecasting.
Practical implications are directionally real but unquantified: Finance, insurance, and governance applications are actively discussed, but no peer-reviewed study has yet published granular deployment figures or economic impact estimates tied specifically to the ForecastBench parity milestone.

1. The "Green Tree" System: What It Is and What It Is Not

1.1 Identity of the System

The name "green tree" refers to an anonymized or codename leaderboard entry on ForecastBench, the Forecasting Research Institute's dynamic benchmark of AI forecasting capabilities [1, 2]. It is not a separately branded, publicly announced standalone forecasting product from Google DeepMind. The ForecastBench leaderboard lists the entry as "google deepmind, green tree," indicating organizational attribution alongside an internal codename [3, 4]. Other DeepMind entries on the same leaderboard — reportedly including one styled "yellow mouse" — suggest the naming convention is an anonymization scheme for tournament submissions rather than product branding [1].

This distinction matters for evaluating the popular coverage. A May 30, 2026 commentary piece described "GreenTree" as a purpose-built AI system that "can predict the future as well as the best humans on Earth," framing the March 15, 2026 date as the moment "AI hit parity with superforecasters for the first time." That framing is a secondary-source interpretation, not a statement from DeepMind or the Forecasting Research Institute, and it overstates what the leaderboard data actually shows.

1.2 The March 15, 2026 Date: What Happened

The March 15, 2026 date cited in popular coverage does not correspond to a sustained overall parity event on ForecastBench. The available evidence points to two more precise interpretations:

Dataset-question leadership: On the sub-category of "dataset questions" — questions drawn from structured data sources rather than live prediction markets — green tree reached the top position or near-top position around this period [1, 2]. AI systems have structural advantages on data-rich, short-horizon questions, and this sub-category leadership was real.
A snapshot, not a sustained result: The overall ForecastBench leaderboard, updated through May 30, 2026, shows the Superforecaster Median at Brier Index 70.2 and green tree at 67.8 [1, 5]. One source reports that the parity claim aligns with a specific milestone or with dataset-question performance around March 15, but explicitly notes that it does not reflect sustained overall parity [1].

A separate single-source report places the parity or top-spot event closer to May 22, 2026, not March 15 — suggesting the popular coverage may have backdated or conflated multiple leaderboard updates [6]. Given the provisional nature of this dating, both dates should be treated with caution.

1.3 The Benchmark Used

ForecastBench uses difficulty-adjusted Brier scores converted into a Brier Index for cross-question comparability [3, 5]. The Brier Index is defined as (1 − √Brier score) × 100%, producing a 0–100% scale where 50% represents an uninformative baseline (always predicting 50%) and 100% represents perfect foresight [1, 2]. Lower raw Brier scores are better; higher Brier Index values are better. The difficulty adjustment is critical: it normalizes scores across questions of varying inherent predictability, allowing meaningful comparison between AI systems and human forecasters who may have answered overlapping but non-identical question sets [3].
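The definition above can be checked with a few lines of Python. This is a minimal sketch that applies the stated formula to a plain Brier score; the difficulty adjustment ForecastBench applies beforehand is not reproduced here, and the function names are illustrative rather than taken from any ForecastBench code.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty sequences")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_index(score):
    """Brier Index as defined in the text: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(score)) * 100.0

# Always predicting 50% gives a Brier score of 0.25 whatever the outcomes are.
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])
# Predicting each outcome with certainty gives a Brier score of 0.
perfect = brier_score([1.0, 0.0], [1, 0])

print(brier_index(uninformative))  # 50.0
print(brier_index(perfect))        # 100.0
```

The two printed values match the 50% uninformative baseline and the 100% perfect-foresight ceiling described above.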

2. ForecastBench: Current Scores and Leaderboard Status

2.1 The Benchmark's Design

ForecastBench is a dynamic, contamination-free benchmark run by the Forecasting Research Institute, consisting of 1,000 probabilistic questions that are automatically generated and updated [1, 3]. Questions span geopolitics, economics, technology, and science, with resolution horizons ranging from weeks to over a year [3]. The benchmark was originally launched in September 2024 and received a major update in October 2025 [7, 2]. It is now open to public submissions [2].

Critically, ForecastBench currently includes only binary yes/no questions, excluding point predictions, multiple-choice outcomes, quantile predictions, and full probability distributions [2]. This scope limitation is relevant to any claim about "general" forecasting parity.

The human reference baseline is a frozen superforecaster cohort surveyed primarily in 2024 [5]. Superforecasters on ForecastBench are defined as individuals with an established track record on prior forecasting platforms such as Good Judgment Open and the Good Judgment Project [3]. The benchmark allows multiple AI submission attempts, which is an asymmetry relative to the human baseline [5].

2.2 Score Progression Over Time

The trajectory of AI performance on ForecastBench is well-documented across multiple sources:

Date	Superforecaster Brier Score	Best AI Brier Score	Gap (as reported)
October 2024	~0.081	~0.101 (GPT-4.5)	~19%
October 2025	0.081	0.101 (GPT-4.5)	~20%
February 20, 2026	~0.086	0.102 (CassiAI ensemble_2_crowdadj)	~2.7 index pts
March 2026	0.086	0.103 (CassiAI / Grok 4.20 Preview)	~2.7 index pts
May 30, 2026	Brier Index 70.2	Brier Index 67.8 (green tree)	2.4 index pts

Sources: [1, 2, 3, 5, 8]

In October 2024, the top-performing LLMs lagged superforecasters by approximately 19% in Brier Index terms [1, 2]. By March 2026, superforecasters led with a Brier Index of 70.6% versus 67.9% for the best LLMs — a gap of 2.7 percentage points [8]. As of the May 30, 2026 leaderboard update, the gap had narrowed marginally to 2.4 points (70.2 vs. 67.8) [1, 5].
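As a rough cross-check, the raw Brier scores in the table can be converted with the Brier Index formula from Section 1.3. The sketch below assumes the table's three-decimal scores are the difficulty-adjusted values that feed the index; because those inputs are rounded, the results may differ from published index values by about a tenth of a point.

```python
import math

def brier_index(score):
    """Brier Index per Section 1.3: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(score)) * 100.0

def index_gap(human_score, ai_score):
    """Superforecaster lead in Brier Index points (positive means humans lead)."""
    return brier_index(human_score) - brier_index(ai_score)

# (superforecaster, best AI) raw Brier scores as they appear in the table.
snapshots = {
    "October 2025": (0.081, 0.101),
    "February 20, 2026": (0.086, 0.102),
    "March 2026": (0.086, 0.103),
}

for label, (human, ai) in snapshots.items():
    print(
        f"{label}: human {brier_index(human):.1f}, "
        f"AI {brier_index(ai):.1f}, gap {index_gap(human, ai):.1f}"
    )
```

The March 2026 row lands close to the 70.6% and 67.9% figures quoted above. Under the same formula the October 2025 raw scores imply a gap of roughly 3.3 index points, which suggests the ~19% and ~20% entries in the table are relative differences on some other measure rather than Brier Index points. That reading is an inference from the arithmetic, not something the sources state.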

2.3 Current Leaderboard (May 30, 2026)

The ForecastBench overall Brier Index leaderboard as of May 30, 2026 [1, 5]:

Rank	Entry	Overall Brier Index
1	Superforecaster Median Forecast	70.2
2	Google DeepMind, green tree	67.8
3	xAI Grok 4.20 (Preview)	~67.4
4	Various ensembles	~67.3

On the dataset questions sub-category, green tree ranks #1, leading or matching superforecasters [1]. On the market questions sub-category, the gap is dramatically larger: superforecasters score approximately 0.39–0.40 while the nearest AI entry scores approximately 0.59 [3, 5]. Because superforecasters are described as the more accurate group, these figures must be on a lower-is-better scale rather than Brier Index values, and the AI entry's score is nearly 50% higher (worse) than the superforecasters' on this harder sub-category.
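The "nearly 50%" figure for market questions can be reproduced from the two reported scores. The sketch assumes both numbers are lower-is-better scores, which is the only reading consistent with superforecasters being the more accurate group; the helper name is illustrative.

```python
def relative_excess(ai_score, human_score):
    """How much larger the AI's lower-is-better score is, as a fraction of the human score."""
    return (ai_score - human_score) / human_score

# Reported market-question scores: superforecasters 0.39-0.40, best AI 0.59.
for human in (0.39, 0.40):
    excess = relative_excess(0.59, human)
    print(f"human {human:.2f}: AI score is about {excess * 100:.1f}% higher")
```

Both ends of the reported superforecaster range give a result close to 50%, so the claim is at least arithmetically consistent with the scores cited.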

As a point of historical context, in October 2024 the median public forecast ranked #2 overall on ForecastBench, behind superforecasters and ahead of all LLMs. By the May 2026 update, the median public forecast had fallen to approximately #22, displaced by the rapid improvement of frontier AI systems [1].

2.4 February 2026 Snapshot: CassiAI and Grok

As of February 20, 2026, the leading AI entries were CassiAI's ensemble_2_crowdadj (Brier score 0.102) and xAI's Grok 4.20 (Preview). The two tied for first among AI systems and for second place overall, behind the superforecaster median but ahead of all other AI entries [3, 8]. By March 2026, these two entries remained at #2 with a Brier score of 0.103, while superforecasters held 0.086 [8]. Green tree's ascent to the #2 position (displacing CassiAI and Grok) appears to have occurred between March and May 2026.

3. Is Parity Established, Contested, or Overstated?

3.1 The Case That Parity Has Been Reached

The strongest evidence for parity rests on four pillars:

Sub-category leadership. On dataset questions — the data-rich, shorter-horizon subset of ForecastBench — green tree ranks #1, meaning AI has demonstrably surpassed the superforecaster median on this category [1]. This is not trivial: dataset questions constitute a meaningful share of the benchmark and represent the type of forecasting most directly applicable to structured analytical tasks.

Rapid gap closure. The gap between frontier AI and superforecasters compressed from approximately 19% in October 2024 to 2.4 Brier Index points by May 2026. If both figures are read on the same scale, that is a roughly 87% reduction in the performance gap over 19 months [1, 2, 5]. The direction of the trajectory is unambiguous.

Crowd surpassing. AI systems have clearly surpassed the median human crowd forecaster, which itself is a meaningful threshold [1, 3]. By late 2025, large language models had already outperformed the median human forecaster, with only the top ~1% of human forecasters maintaining a consistent edge [3].

Near-statistical indistinguishability on some domains. On ForecastBench's live questions, one analysis reports that AI ensemble difficulty-adjusted Brier scores are statistically indistinguishable from the superforecaster team's on many individual domains [1].
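The 87% figure quoted under "Rapid gap closure" follows from simple arithmetic on the two reported gaps. The sketch below treats the October 2024 gap (about 19) and the May 2026 gap (2.4) as if they were expressed in the same units, which is an assumption the sources do not confirm.

```python
def relative_reduction(start_gap, end_gap):
    """Fraction of the starting gap that has been closed."""
    if start_gap <= 0:
        raise ValueError("start_gap must be positive")
    return (start_gap - end_gap) / start_gap

# Reported gaps: about 19 in October 2024, 2.4 in May 2026 (units assumed equal).
closed = relative_reduction(19.0, 2.4)
print(f"{closed:.1%}")  # 87.4%
```

If the two gaps turn out to be measured on different scales, the same function still applies, but both inputs would first need to be converted to a common measure.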

3.2 The Case That Parity Has Not Been Reached

The evidence against a parity claim is, on balance, more robust for the overall benchmark:

The leaderboard is unambiguous. As of May 30, 2026, superforecasters hold Brier Index 70.2 versus green tree's 67.8 — a 2.4-point gap that is consistent across multiple leaderboard snapshots and multiple providers' readings of the data [1, 5]. This is not a rounding error.

Market questions reveal a large structural gap. On the harder "market questions" sub-category, superforecasters score approximately 0.39–0.40 while the best AI scores approximately 0.59 — a nearly 50% accuracy advantage for humans [3, 5]. These questions are arguably the most economically relevant, involving live prediction markets where information asymmetry and judgment under genuine uncertainty matter most.

Benchmark design favors AI. Good Judgment Inc. has published substantive critiques of ForecastBench as a measure of real-world forecasting capability [9]. Key limitations include: the human baseline is frozen at 2024 (meaning superforecasters cannot improve their scores as AI systems continue to submit); AI systems are permitted multiple submission attempts while humans are not; the benchmark covers only binary questions, excluding the full range of forecasting tasks; and it does not capture teaming, aggregation, or upstream question formulation — all areas where human superforecasters excel [5, 9]. Good Judgment's Dr. Warren Hatch has stated that "when the data is sparse and the environment is in flux, machines are backward looking by definition" [9].

Temporal leakage and benchmark validity concerns. A 2026 academic study flagged "pitfalls" in assessing LLM forecasters, specifically warning that temporal leakage — where AI systems inadvertently use information from after a question's reference date due to training data or tool access — can inflate apparent AI performance [3]. The researchers explicitly caution against prematurely concluding that LLMs match or exceed human performance on forecasting tasks.

Superforecasters' own assessment. Superforecasters surveyed through the LEAP (Longitudinal Expert AI Panel) program have a median forecast of 2028 for AI eventually surpassing their benchmark [9]. ForecastBench's own linear trend projection places overall LLM-superforecaster parity around August 2027, with a 95% confidence interval spanning March 2026 to August 2028 [1]. The lower bound of that CI coincides with the March 2026 date cited in popular coverage, but the central estimate falls 17 months later.

Longer-horizon questions remain unresolved. ForecastBench questions resolved to date skew toward short-horizon, data-rich topics where AI has structural advantages. Many longer-range, judgment-heavy questions are still pending resolution, and the final scores on those questions may widen the gap [9].

3.3 Verdict: Contested and Partially Overstated

The honest status as of May 31, 2026 is this: AI has reached parity with superforecasters on a specific sub-category of ForecastBench (dataset questions) and has closed the overall gap dramatically, but has not achieved sustained overall parity on the benchmark's full leaderboard. The 2.4 Brier Index point gap is small but consistent. The popular framing of "March 15, 2026 parity" conflates sub-category leadership with overall benchmark parity, and the secondary-source commentary that generated this claim is not corroborated by the ForecastBench leaderboard data itself.

Forecasts on Metaculus place the median expected date of AI-human forecasting parity at November 2026, while PredictStreet's linear trend had suggested parity around August 2027 [1, 10]. Together, these estimates bracket the uncertainty.

4. Implications for Finance, Insurance, and Governance

4.1 What the Sources Actually Say

The concrete implications cited in the literature for finance, insurance, and governance are directionally consistent but largely qualitative. No peer-reviewed study in the available source set has published granular deployment figures or quantified economic impacts tied specifically to the ForecastBench parity milestone. The following represents the state of evidence as of late May 2026.

Finance. Academic work on AI in financial forecasting documents that large language models can process earnings calls, regulatory filings, and macroeconomic indicators to generate probabilistic forecasts of market-relevant events [11]. The Federal Reserve Bank of San Francisco has explored AI simulation of the Survey of Professional Forecasters, finding that LLM-generated forecasts can replicate the distributional properties of professional economist predictions on standard macroeconomic variables [12]. The implication is that AI could reduce the cost of maintaining large panels of professional forecasters for scenario analysis and stress testing. However, the market-question gap on ForecastBench — where superforecasters retain a ~50% accuracy advantage — is a direct caution against deploying current AI systems as drop-in replacements for human judgment on live financial prediction tasks [5].

Insurance. A 2024 survey found that 73% of insurers view AI models as key to managing climate risks [13]. Catastrophe modeling firms have begun integrating AI forecasting into their risk assessment pipelines, particularly for extreme weather events where DeepMind's WeatherNext 2 and GenCast models have demonstrated performance exceeding traditional physics-based numerical weather prediction ensembles on 0- to 15-day horizons [3]. The insurance application is most mature in this domain: probabilistic AI forecasts of hurricane tracks, flood extents, and wildfire spread are being incorporated into real-time pricing and reinsurance treaty negotiations. The connection to superforecaster-style general-purpose forecasting is more speculative — insurers are primarily deploying domain-specific models, not general LLM forecasters.

Governance. The governance implications are the least concretely documented. Commentary sources note that scalable, high-accuracy prediction of geopolitical and policy events could transform intelligence analysis, regulatory impact assessment, and legislative scenario planning [1]. The LEAP program's Wave 5 report on security and geopolitics documents expert AI panel assessments of geopolitical risks, suggesting an emerging institutional infrastructure for AI-assisted policy forecasting [14]. One source notes that municipal governments have begun considering AI oversight frameworks — the New York City Council reportedly considered a bill to create a municipal AI oversight office — though this is a governance-of-AI question rather than AI-for-governance [1].

4.2 Human-AI Collaboration as the Dominant Near-Term Model

Across all three domains, the most consistently cited implication is not AI replacement of human forecasters but human-AI collaboration. Research published in a peer-reviewed study on AI-augmented predictions found that LLM assistants improve human forecasting accuracy when used as decision-support tools rather than autonomous replacements [15]. The combination of AI's ability to process large structured datasets rapidly with human judgment on novel, sparse-data situations appears to outperform either alone — a finding consistent with the persistent superforecaster advantage on market questions, which are precisely the questions where human contextual judgment matters most [15, 9].
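One simple way to picture the collaboration argument is to average a human forecast with an AI forecast and score the result. The sketch below is a toy illustration only: the probabilities are made up, unweighted averaging is a stand-in technique chosen for simplicity, and it is not the method used in the study cited above.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def average_forecasts(a, b):
    """Unweighted mean of two forecasters' probabilities, question by question."""
    return [(x + y) / 2 for x, y in zip(a, b)]

# Made-up numbers for three binary questions; not taken from any benchmark.
outcomes = [1, 0, 1]
human = [0.9, 0.2, 0.6]
ai = [0.6, 0.1, 0.9]
combined = average_forecasts(human, ai)

for name, forecasts in [("human", human), ("ai", ai), ("combined", combined)]:
    print(f"{name}: {brier_score(forecasts, outcomes):.3f}")
```

In this toy case the combined forecast scores better (lower) than either input because the two forecasters err on different questions. In general, averaging never scores worse than the mean of the two individual Brier scores, since squared error is convex, but it is not guaranteed to beat the better of the two.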

5. Broader Context: What ForecastBench Does and Does Not Measure

ForecastBench is the most rigorous publicly available benchmark for general-purpose AI forecasting, but it is a proxy for operational forecasting capability, not a full substitute [1, 3]. The benchmark's limitations are worth stating precisely:

Binary questions only: Real-world forecasting requires point estimates, probability distributions over continuous outcomes, and multi-outcome scenarios. ForecastBench's binary format systematically advantages AI systems that are well-calibrated on yes/no questions but may be poorly calibrated on the full distributional task [2].
Frozen human baseline: The superforecaster cohort was surveyed primarily in 2024. As AI systems continue to submit updated entries, the comparison is increasingly between current AI and 2024-vintage human performance [5].
No teaming or aggregation: Operational superforecasting at Good Judgment Inc. involves team deliberation, structured aggregation, and iterative updating — none of which is captured in the individual-forecast baseline used by ForecastBench [9].
Short-horizon bias in resolved questions: Questions that have resolved to date skew toward shorter horizons where AI has demonstrated the clearest advantages. The benchmark's final verdict on longer-horizon questions remains open [9].
Temporal leakage risk: AI systems with access to web search or recent training data may inadvertently incorporate post-question information, inflating apparent accuracy [3].
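The temporal-leakage item in the list above can be made concrete with a small guard that drops any source published on or after the date a forecast is made. This is a hedged sketch: the `Document` type, the `leak_free` helper, and the example dates are hypothetical and do not describe how ForecastBench or any submitted system actually filters inputs.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Document:
    """Hypothetical retrieved source; not a ForecastBench data structure."""
    title: str
    published: date

def leak_free(documents, forecast_date):
    """Keep only documents published strictly before the forecast date."""
    return [d for d in documents if d.published < forecast_date]

docs = [
    Document("pre-question analysis", date(2026, 1, 10)),
    Document("post-resolution news report", date(2026, 4, 2)),
]
usable = leak_free(docs, forecast_date=date(2026, 3, 1))
print([d.title for d in usable])  # ['pre-question analysis']
```

A filter like this only addresses tool access. Leakage through training data would need a separate control, such as restricting evaluation to questions created after a model's training cutoff.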

These limitations do not invalidate ForecastBench as a benchmark — it remains the best available tool for tracking AI forecasting progress — but they mean that a 2.4-point Brier Index gap on this benchmark should not be interpreted as a 2.4-point gap in real-world operational forecasting capability. The actual gap in deployment contexts is likely larger.

6. Current Honest Status

As of May 31, 2026, the evidence supports the following precise characterization:

What is established: AI forecasting systems — specifically Google DeepMind's "green tree" entry and several competing frontier models — have closed approximately 87% of the performance gap that separated them from elite human superforecasters in October 2024. On the dataset-question sub-category of ForecastBench, AI has reached or exceeded superforecaster-level performance. AI clearly surpasses the median human crowd forecaster across the full benchmark.

What is not established: Overall parity on ForecastBench has not been achieved. The Superforecaster Median holds a Brier Index of 70.2 versus green tree's 67.8 as of May 30, 2026 — a 2.4-point gap that is small but consistent and reproducible across multiple leaderboard snapshots. On market questions, the gap is substantially larger (~50% accuracy advantage for superforecasters). The March 15, 2026 "parity" date cited in popular coverage is not corroborated by the overall leaderboard and appears to describe a sub-category milestone or a single snapshot that did not persist.

What is contested: Whether the remaining gap is methodologically meaningful or an artifact of benchmark design. ForecastBench's frozen human baseline, binary-question format, and multiple-attempt allowance for AI all introduce asymmetries that may overstate the remaining human advantage. Conversely, temporal leakage and short-horizon bias in resolved questions may overstate AI performance. The net direction of these biases is genuinely uncertain.

Central estimate for overall parity: ForecastBench's own linear trend projects overall parity around August 2027 (95% CI: March 2026–August 2028) [1]. Forecasts on Metaculus place the median at November 2026 [10]. Superforecasters themselves estimate 2028 [9]. The honest answer is that overall parity on a rigorous general-purpose forecasting benchmark is likely 12–24 months away, with sub-category parity already achieved in data-rich domains.
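The competing parity estimates are easier to compare once they are expressed as months from the report date. The sketch below uses the dates given above; the sources give only month and year, so the day-of-month values are placeholders, and the superforecasters' "2028" estimate is omitted because no month is stated.

```python
from datetime import date

def months_between(start, end):
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

as_of = date(2026, 5, 31)  # report date

# Month-level estimates from the text; the day component is a placeholder.
estimates = {
    "linear trend, central": date(2027, 8, 1),
    "linear trend, CI lower bound": date(2026, 3, 1),
    "linear trend, CI upper bound": date(2028, 8, 1),
    "Metaculus median": date(2026, 11, 1),
}

for label, when in estimates.items():
    print(f"{label}: {months_between(as_of, when):+d} months")
```

The same helper reproduces the 17-month distance between March 2026 and the August 2027 central estimate mentioned in Section 3.2.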

References

[1] ForecastBench. forecastbench.org. https://forecastbench.org

[2] "ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities." https://arxiv.org/html/2409.19839v4

[3] "ForecastBench: A Dynamic Benchmark of AI ...." https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf

[4] System and method for enhanced collaborative forecasting. image-ppubs.uspto.gov. https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/11941239

[5] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html

[6] Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross. theinnermostloop.substack.com. https://theinnermostloop.substack.com/p/welcome-to-may-22-2026

[7] "ForecastBench: A Dynamic Benchmark of AI Forecasting ...." https://arxiv.org/pdf/2409.19839

[8] Making Forecasting Scores Easier to Interpret: Introducing the Brier Index. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/introducing-the-brier-index

[9] What Superforecasters Actually Said About ForecastBench - Good Judgment. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench

[10] When will LLMs beat superforecasters at ForecastBench?. metaculus.com. https://metaculus.com/questions/40290/when-will-llms-beat-superforecasters-at-forecastbench

[11] "The Role of AI in Financial Forecasting: ChatGPT's Potential and Challenges." https://arxiv.org/pdf/2411.13562

[12] Simulating the Survey of Professional Forecasters - San Francisco Fed. frbsf.org. https://frbsf.org/research-and-insights/publications/system-research-richmond-fed/2025/01/simulating-the-survey-of-professional-forecasters

[13] 73% of insurers see AI models as key to managing climate risks. fintech.global. https://fintech.global/2024/11/08/73-of-insurers-see-ai-models-as-key-to-managing-climate-risks

[14] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5

[15] "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy." https://arxiv.org/pdf/2402.07862

Cross-provider analysis

How 4 providers compared on 191 claims across 87 topic clusters

Consensus findings (7)

Multiple providers independently confirmed these. Treat as the most reliable evidence.

Good Judgment said the gap between frontier LLMs and superforecasters was about 20% in October 2025, while PredictStreet’s linear trend suggested parity would arrive around August 2027, with a 95% CI from March 2026 to August 2028.
86%
anthropic, perplexity, grok
[10][1][28][2][61]
ForecastBench’s official leaderboard shows the Superforecaster median forecast leading overall, with Google DeepMind’s “green tree” submission ranked behind it as the top AI entrant, and ForecastBench uses difficulty-adjusted Brier scores converted into a Brier Index for comparison.
84%
anthropic, perplexity, openai, grok
[101][10][11][12][17][1][20][21][27][2][47][4][61][6][7][86]
The Brier Index is a monotonic 0–100% transformation of the difficulty-adjusted Brier score, defined as (1 − √Brier score) × 100%.
82%
anthropic, openai, grok
[12][1][61][7]
By ForecastBench’s official accounting, as of late May 2026 AI had not yet reached overall parity with elite human superforecasters; the human Superforecaster Median still held a small edge, though AI may have matched the human aggregate on at least one snapshot of the benchmark.
79%
anthropic, perplexity, openai, grok
[1][27][61][7]
ForecastBench is a dynamic benchmark of LLM forecasting accuracy, consisting of 1,000 probabilistic questions that are automatically generated and updated.
77%
anthropic, perplexity, grok
[11][17][1][27][29][4][7]
ForecastBench evaluates probabilistic forecasts on active geopolitical, economic, and scientific questions.
72%
perplexity, openai, grok
[11][1][4][7]
AI systems are matching or approaching top-tier human performance on short-horizon, data-rich forecasting tasks, with some niches showing parity or even an advantage.
71%
perplexity, openai, grok
[10][1][31][5][6][86]

Single-source insights (63)

Reported by only one provider. Treat as preliminary unless independently verified.

In March 2026, the best LLMs scored 67.9% in Brier Index terms.
74%
anthropic
[12][61]
In October 2024, ForecastBench showed that the top-performing LLMs lagged behind superforecasters by 19%.
73%
anthropic
[27][61][7]
The term “superforecaster” was popularized by Philip E. Tetlock and Dan Gardner in Superforecasting: The Art and Science of Prediction (Crown, 2015).
70%
perplexity
[1][28]
The U.S. Intelligence Advanced Research Projects Activity (IARPA) Good Judgment Project started multiple annual forecasting contests in 2011.
70%
perplexity
[1][28]
A May 30, 2026 Metatrends Substack post states that GreenTree can predict the future as well as the best humans on Earth.
69%
grok
Prediction markets reacted by early 2026 with a ~74% probability that an AI would win a major forecasting tournament by year’s end.
69%
openai
[2]
+ 57 more single-source insights

Low-confidence claims (29)

Weak signals the verifier flagged for hedged language in the report.

The report says the parity detail is provisional.
55%
anthropic
The report says the headline "Green Tree" event is real in outline.
55%
anthropic
A year earlier, the median public forecast was ranked #2 on ForecastBench, behind superforecasters and ahead of all LLMs.
56%
anthropic
ForecastBench is now open to public submissions.
56%
anthropic
A municipal AI oversight office would help ensure AI systems are used responsibly.
56%
openai
+ 24 more low-confidence claims

Go Deeper

Follow-up questions based on where providers disagreed or confidence was low.

Verify whether the reported DeepMind forecasting system was actually named “Green Tree” (or GreenTree/green tree), and pin down the exact event date and benchmark used for the claimed parity milestone

This is the most basic unresolved factual hinge: multiple weak signals suggest the headline event is real in outline, but the parity detail is provisional, the system may not be separately branded, and the date may be around May 22, 2026 rather than March 15, 2026. Confirming the system identity, date, and benchmark is necessary before interpreting any parity claim.

Low Confidence · XS tier

Recompute or source the current best AI Brier score on ForecastBench and compare it directly against the human superforecaster aggregate under the same scoring protocol, including whether the benchmark is binary-only and whether public submissions changed the leaderboard

The current best-AI-versus-human comparison remains unsettled: signals indicate ForecastBench may now be open to public submissions, may include only binary yes/no questions, and there are conflicting score snapshots (e.g., top LLMs at 67.9% Brier Index terms vs other reported ensemble values). A direct apples-to-apples comparison is needed to establish the honest status of parity.

Low Confidence · S tier

Test whether “parity” on forecasting is being overstated by checking if AI matches superforecasters only on subsets such as dataset questions, while still trailing on overall leaderboard and market questions

Several signals point to a narrow or partial form of parity: AI may rank #1 on dataset questions, gaps may be narrower or reversed on dataset questions, but superforecasters still lead overall and gaps may be larger on market questions. This follow-on would determine whether the parity claim holds across question types or only within favorable slices.

Disagreement · M tier

Assess whether human+AI combinations outperform standalone AI and standalone superforecasters on ForecastBench or comparable autonomous forecasting benchmarks, and identify which task classes benefit most from teaming or aggregation

If human+AI combinations are superior, that materially changes the interpretation of parity: the operational frontier may be augmentation rather than substitution. One weak signal also notes ForecastBench does not capture teaming or aggregation, so this question addresses an important omitted dimension in the current benchmark evidence.

Implication · M tier

Verify the concrete downstream implications cited for finance, insurance, and governance, and determine whether any primary sources quantify those impacts rather than stating them rhetorically

The weak signals suggest broad implications are being cited because of scalable, high-accuracy prediction, but there are also signals that no granular deployment figures or quantified economic impacts appear in the primary sources reviewed. This follow-on would separate evidence-based implications from speculative extrapolation.

Implication · M tier

Key Claims

Cross-provider analysis with confidence ratings and agreement tracking.

87 claims · sorted by confidence

high·anthropic, perplexity, openai, grok·forecastbench.org deepmind.google markets.financialcontent.com+14·

By ForecastBench’s official accounting, as of late May 2026 AI had not yet reached overall parity with elite human superforecasters; the human Superforecaster Median still held a small edge, though AI may have matched the human aggregate on at least one snapshot of the benchmark.

high·anthropic, perplexity, openai, grok·arxiv.org forecastbench.org press.airstreet.com+1·

Good Judgment said the gap between frontier LLMs and superforecasters was about 20% in October 2025, while PredictStreet’s linear trend suggested parity would arrive around August 2027, with a 95% CI from March 2026 to August 2028.

high·anthropic, perplexity, grok·linkedin.com markets.financialcontent.com goodjudgment.substack.com+3·

The Brier Index is a monotonic 0–100% transformation of the difficulty-adjusted Brier score, defined as (1 − √Brier score) × 100%.

high · anthropic, openai, grok · forecastbench.org, press.airstreet.com, forecastingresearch.substack.com +1

ForecastBench is a dynamic benchmark of LLM forecasting accuracy, consisting of 1,000 probabilistic questions that are automatically generated and updated.

high · anthropic, perplexity, grok · forecastbench.org, faculty.wharton.upenn.edu, image-ppubs.uspto.gov +5

ForecastBench evaluates probabilistic forecasts on active geopolitical, economic, and scientific questions.

high · perplexity, openai, grok · forecastbench.org, faculty.wharton.upenn.edu, articsledge.com +2

AI systems are matching or approaching top-tier human performance on short-horizon, data-rich forecasting tasks, with some niches showing parity or even an advantage.

high · perplexity, openai, grok · forum.effectivealtruism.org, linkedin.com, metatrends.substack.com +6

In the Brier Index, 50% is an uninformative baseline equivalent to always predicting 50%.

high · anthropic, openai · forecastbench.org, press.airstreet.com, forecastingresearch.substack.com +1

In the Brier Index, 100% means perfect accuracy.

high · anthropic, openai · forecastbench.org, press.airstreet.com, forecastingresearch.substack.com +1

Good Judgment Inc. says human superforecasters still maintain a narrow lead over DeepMind models in key areas, with only one DeepMind model above the Superforecaster median on the preliminary board as of April 2026.

high · perplexity, openai · goodjudgment.com, goodjudgment.substack.com, arxiv.org +2

A May 30, 2026 Metatrends Substack post states that DeepMind built an AI system called GreenTree.

high · openai, grok · metatrends.substack.com, leap.forecastingresearch.org

In late 2025, GPT-4.5 was the top-performing AI model in the sample.

high · anthropic, openai · markets.financialcontent.com, arxiv.org, press.airstreet.com

xAI’s Grok 4.20 (Preview) has an overall Brier Index of about 67.4–67.6 on the ForecastBench leaderboard.

high · perplexity, grok · faculty.wharton.upenn.edu, image-ppubs.uspto.gov, forecastbench.org +1

On March 15, 2026, an AI system first matched the Brier-score performance of the best human superforecasters on the benchmark.

high · openai, grok · metatrends.substack.com, leap.forecastingresearch.org

On ForecastBench, the gap between the human Superforecaster Median and the best AI entry is about 2.4 Brier Index points (70.2 versus 67.8 on the overall leaderboard as of May 30, 2026).

high · openai, grok · forecastbench.org, leap.forecastingresearch.org
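
The Brier Index definition and its two reference points listed among the claims above can be checked numerically. What follows is a minimal Python sketch, assuming the formula exactly as this report states it, (1 − √Brier score) × 100%. The function names are hypothetical, and 70.2 and 67.8 are the overall scores quoted in the executive summary. ForecastBench applies a difficulty adjustment before the transformation, so the implied scores computed at the end are illustrative only and are not official leaderboard values.

```python
import math


def brier_index(brier_score: float) -> float:
    """Map a Brier score in [0, 1] to a Brier Index percentage (higher is better)."""
    if not 0.0 <= brier_score <= 1.0:
        raise ValueError("Brier score must lie in [0, 1]")
    return (1.0 - math.sqrt(brier_score)) * 100.0


def brier_score_from_index(index: float) -> float:
    """Invert the transformation: recover the Brier score implied by an index."""
    if not 0.0 <= index <= 100.0:
        raise ValueError("Brier Index must lie in [0, 100]")
    return (1.0 - index / 100.0) ** 2


# Always answering 50% on a binary question costs (0.5 - outcome)^2 = 0.25
# whether the outcome is 0 or 1, so this is the uninformative baseline.
always_half = (0.5 - 1) ** 2
print(brier_index(always_half))  # 50.0
print(brier_index(0.0))          # 100.0 (perfect accuracy)

# Illustrative only: scores implied by the overall figures in the executive
# summary, ignoring ForecastBench's difficulty adjustment.
human_index, ai_index = 70.2, 67.8
print(round(human_index - ai_index, 1))             # gap in index points
print(round(brier_score_from_index(human_index), 4))
print(round(brier_score_from_index(ai_index), 4))
```

Because the square root compresses differences, a fixed gap in index points corresponds to a different gap in raw Brier score depending on where on the scale it sits. Any comparison of point gaps across benchmark snapshots should therefore state which scale it uses.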

Sources

34 unique sources cited across 87 claims.

Academic · 6 sources

ForecastBench: A Dynamic Benchmark of AI ...

faculty.wharton.upenn.edu · via anthropic, perplexity, openai, grok

11 claims

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

arxiv.org · via anthropic, perplexity, openai, grok

11 claims

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

arxiv.org · via anthropic, perplexity, openai, grok

4 claims

ForecastBench: A Dynamic Benchmark of AI Forecasting ...

arxiv.org · via anthropic, perplexity

3 claims

Prediction of emergency department revisits among child and youth mental health outpatients using deep learning techniques

pmc.ncbi.nlm.nih.gov · via perplexity

2 claims

Google DeepMind - Wikipedia

en.wikipedia.org · via anthropic, perplexity, openai, grok

1 claim

Government · 2 sources

System and method for enhanced collaborative forecasting

image-ppubs.uspto.gov · via anthropic, perplexity, openai, grok

3 claims

Deep learning model for energy forecasting

image-ppubs.uspto.gov · via perplexity

1 claim

News & Media · 8 sources

What Superforecasters Actually Said About ForecastBench

goodjudgment.substack.com · via anthropic, perplexity, grok, openai

13 claims

Tracking the singularity week of f45 (metatrends.substack.com)

metatrends.substack.com · via openai, grok, perplexity, anthropic

9 claims

FinancialContent - The Great Forecast Convergence: AI Closing the 20% Gap on Human Superforecasters

markets.financialcontent.com · via anthropic, perplexity, grok, openai

7 claims

Making Forecasting Scores Easier to Interpret: Introducing the Brier Index

forecastingresearch.substack.com · via anthropic, perplexity, openai, grok

7 claims

What ForecastBench doesn't measure (goodjudgment.substack.com)

goodjudgment.substack.com · via perplexity, openai, grok, anthropic

6 claims

Predictstreet 2026 1 18 the great forecast convergence ai closing the 20 gap on human superforecasters (markets.financialcontent.com)

markets.financialcontent.com · via anthropic, perplexity, openai, grok

3 claims

Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross

theinnermostloop.substack.com · via anthropic

1 claim

LLMs are closing the gap on human (forecastingresearch.substack.com)

forecastingresearch.substack.com · via grok

1 claim

ForecastBench 2026 · green tree DeepMind · AI vs superforecasters Brier score · forecasting benchmarks AI · AI forecasting parity 2026 · Brier Index comparison · AI in finance insurance governance

Synthesized from 4 providers on May 31, 2026 using fast mode