May 31, 2026 · 4 providers

AI vs Superforecasters: Parity Status May 2026

Q: Verify whether the alleged DeepMind system name “Green Tree” and the March 15, 2026 parity claim appear in any official DeepMind publication, blog post, arXiv paper, or ForecastBench/Metaculus documentation, and identify the exact benchmark and score if they do.

The signal says the widely repeated “Green Tree” claim does not check out as a verifiable DeepMind announcement, but other sources still reference a March 15 parity event. This is a high-value provenance check because the headline claim may be based on misattribution or newsletter commentary rather than a real DeepMind release.

Q: Establish the exact current best AI Brier score on ForecastBench versus the frozen 2024 superforecaster baseline, including whether the comparison is on the full benchmark or only on a subset such as dataset questions versus market questions.

Multiple weak signals suggest the apparent parity depends on benchmark slice, question mix, or difficulty adjustment, with some sources saying AI is ahead on dataset questions but behind overall, and others citing Brier Index values in the 67–69 range. A precise score-and-baseline audit is needed to determine whether parity is actually established or only partial.

Q: Compare the Bridgewater AIA Forecaster technical report with the ForecastBench leaderboard to determine whether Bridgewater’s result is the strongest peer-grade evidence for parity, and whether its advantage over crowd and prior LLM systems holds across question types.

One signal says Bridgewater’s AIA Forecaster is the substantive, peer-grade evidence for parity and that it is clearly better than both the crowd and prior LLM systems. Because this is a single-source interpretation, it should be checked against the underlying technical report and any independent reproductions or leaderboard entries.

Q: Check whether the reported 2.7 Brier-Index point gap is consistent with the stated 0.1125 vs 0.1110 FB-7-21 scores and with the claim that the elite human benchmark scored 0.081, and clarify which normalization is being used.

The signals contain potentially conflicting numeric statements about the size of the gap, the human benchmark, and whether lower is better. A narrow numeric reconciliation would materially affect the honest status of parity and could resolve whether the alleged gap is trivial, meaningful, or a scale artifact.

Q: Assess whether the claimed implications for finance, insurance, and governance are supported by benchmark-specific evidence or are extrapolations from general forecasting performance, especially given that humans still outperform AI on market questions and complex “black swan” cases.

Several signals suggest the practical gap remains larger on market questions, that humans retain strengths on trend breaks and low-probability wildcard events, and that current AI is still effectively solo forecasting. This matters because downstream claims about finance, insurance, and governance depend on whether the benchmark is a good proxy for real decision environments.

Late May 2026: AI approaches but hasn't fully matched superforecasters—top LLMs trail ~2.3 Brier Index points; dataset vs market performance diverges.

Key Finding

In the ForecastBench leaderboard (as of late May 2026), superforecasters score about 70.2–70.9 on the Brier Index, while the best AI system, Google DeepMind’s "green tree," scores about 67.8–67.9, leaving a gap of about 0.017 difficulty-adjusted Brier score that the report says is roughly one year of LLM progress.

High confidence. Supported by anthropic, grok, perplexity, openai.

Justin Furniss

@Parallect.ai and @SecureCoders. Founder. Hacker. Father. Seeker of all things AI

Executive Summary

"Green Tree" is real but mischaracterized: Google DeepMind does operate a submission labeled "green tree" on the ForecastBench tournament leaderboard, but the specific claim that it "hit parity with top human superforecasters on March 15, 2026" originates from a single aggregator newsletter, not from any DeepMind publication, peer-reviewed paper, or Forecasting Research Institute announcement — and the current leaderboard data refutes the parity claim outright.
Current scores, as of late May 2026: On ForecastBench, the superforecaster median holds a Brier Index of 70.2 (≈ Brier score 0.089); DeepMind "green tree," the best-ranked AI system, scores 67.9 (≈ Brier score 0.103) — a gap of 2.3 Brier Index points that is meaningful and unresolved [1, 2, 3].
Partial parity is real, full parity is not: AI has surpassed superforecasters on the "dataset questions" subset of ForecastBench, but trails significantly on "market questions," and superforecasters retain the overall #1 position on both the tournament and baseline leaderboards as of May 30, 2026 [2, 3].
The most rigorous parity claim comes from Bridgewater's AIA Labs technical report (arXiv:2511.07678), which documents statistical indistinguishability from superforecasters on a specific question set — but this is a single-lab result on a constrained benchmark, not a general verdict [4].
Practical implications are real but premature: Finance, insurance, and governance applications are actively incorporating AI forecasting tools, but practitioners and the Forecasting Research Institute explicitly caution that benchmark performance does not yet translate to the complex, continuously updated, judgment-heavy questions that institutional decision-makers actually face [5, 6, 7].

1. The "Green Tree" Claim: System, Date, and Benchmark

What the evidence actually shows

Multiple sources confirm that a Google DeepMind submission labeled "green tree" exists on the ForecastBench tournament leaderboard and is the highest-ranked AI system as of late May 2026 [2, 3]. On the tournament leaderboard updated May 23–30, 2026, "green tree" achieves an overall Brier Index of 67.8–67.9, ranking second overall behind the superforecaster median aggregate [2, 3].

The claim that "green tree" hit parity with superforecasters on or around March 15, 2026 is traceable to a single source: a personal tech aggregator newsletter dated May 22, 2026, which states that "DeepMind built an AI system called GreenTree that can predict the future as well as the best humans on Earth" and that "on March 15th, AI hit parity with superforecasters for the first time" [3]. This newsletter is not a DeepMind publication, not a Forecasting Research Institute announcement, and not a peer-reviewed paper. It is an aggregated commentary piece.

Searches of DeepMind's blog and publications pages for the relevant period surface weather forecasting research — WeatherNext 2, GenCast, GraphCast — but no announcement of a system named "Green Tree" achieving superforecaster parity in geopolitical or general-purpose event forecasting [8, 9, 10]. WeatherNext 2, DeepMind's most prominent recent forecasting release, is a meteorological model evaluated on standard atmospheric verification scores, not Brier scores on geopolitical questions — an entirely different domain [8].

What March 15, 2026 likely refers to

The most plausible reconstruction is that around mid-March 2026, the "green tree" submission first reached or came close to the superforecaster threshold on a subset of ForecastBench questions (specifically the "dataset questions" subset, where AI has since established a lead), and that secondary commentators interpreted this as full parity. The Forecasting Research Institute's own analyses, published through its leaderboard and associated posts, do not characterize this as parity on the full benchmark [2, 3].

One provider's analysis notes that March 15, 2026 was "the first documented instance of an LLM matching superforecaster performance on the benchmark's questions" — but this characterization is itself drawn from secondary sources rather than primary FRI documentation, and should be treated as a plausible interpretation rather than an established fact.

Verdict on the "Green Tree" claim

Claim Element	Status
System named "green tree" exists	Confirmed — ForecastBench leaderboard [2, 3]
Operated by Google DeepMind	Confirmed — listed as DeepMind submission [2]
Parity achieved on March 15, 2026	Unverified — single newsletter source only [3]
Parity on full ForecastBench benchmark	Refuted — gap of 2.3 Brier Index points as of May 2026 [2]
Benchmark used	ForecastBench (Brier Index metric) [1, 2]
DeepMind official announcement	Not found [9, 10]

2. ForecastBench: Current Scores and the Human–AI Gap

The benchmark defined

ForecastBench is a dynamic, contamination-free forecasting benchmark developed by the Forecasting Research Institute and documented in the working paper "ForecastBench: A Dynamic Benchmark of AI Forecasting Systems," circulated as a Wharton faculty working paper [1]. It evaluates probabilistic predictions on real-world events drawn from prediction markets, time series, and other sources, comparing large language models against two human baselines: the superforecaster median (top-tier forecasters from the Good Judgment Project tradition) and the public median [11, 1].
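As a point of reference for the scores discussed below, here is a minimal sketch of the plain Brier score for binary questions. The forecasts and outcomes are hypothetical, and the sketch does not reproduce ForecastBench's difficulty adjustment or question weighting.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts on three resolved binary questions (not benchmark data).
forecasts = [0.9, 0.2, 0.6]
resolved = [1, 0, 1]

score = brier_score(forecasts, resolved)               # (0.01 + 0.04 + 0.16) / 3
always_half = brier_score([0.5, 0.5, 0.5], resolved)   # 0.25 for any outcomes
```

Lower values are better: a forecaster who always answers 50% scores 0.25 regardless of how the questions resolve.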

The benchmark splits questions into two categories:

Dataset questions: data-rich, shorter-horizon questions where AI has structural advantages
Market questions: questions drawn from prediction markets, which tend to be harder, more judgment-intensive, and where human advantages are larger [2, 3]

In March 2026, FRI introduced the Brier Index as its primary leaderboard metric, defined as (1 − √Brier score) × 100%, yielding a 0–100 scale where 100 represents perfect foresight and 50 represents always predicting 50% [12]. Lower raw Brier scores are better; higher Brier Index values are better. This metric change improved interpretability but introduced some discontinuity in historical comparisons.
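The conversion between the two scales can be sketched as follows. The function names are illustrative rather than part of any FRI tooling, and the inputs are the Brier Index values quoted in this report.

```python
import math


def brier_index(brier_score: float) -> float:
    """Brier Index as defined above: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(brier_score)) * 100.0


def brier_score_from_index(index: float) -> float:
    """Invert the Brier Index back to a raw Brier score."""
    return (1.0 - index / 100.0) ** 2


coin_flip = brier_index(0.25)   # always predicting 50% gives an index of 50
perfect = brier_index(0.0)      # perfect foresight gives an index of 100

# Values quoted in this report for the overall leaderboard.
superforecasters = brier_score_from_index(70.2)   # about 0.089
best_ai = brier_score_from_index(67.9)            # about 0.103
gap_in_brier_points = best_ai - superforecasters
```

Because the index takes a square root, a fixed gap in Brier Index points corresponds to different raw Brier gaps at different score levels, so gaps should be compared on one scale at a time.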

Current leaderboard figures (as of May 23–30, 2026)

The ForecastBench tournament leaderboard, updated nightly, shows the following standings [1, 2, 3]:

Entrant	Brier Index (Overall)	Approx. Brier Score	Rank
Superforecaster median	70.2	≈ 0.089	#1
DeepMind "green tree"	67.8–67.9	≈ 0.103–0.104	#2
DeepMind "yellow mouse"	67.6	≈ 0.105	#3
xAI Grok 4.20 Preview	≈ 67.4–67.6	≈ 0.105–0.106	#3–4
Public median	64.5	≈ 0.126	n/a
Claude Sonnet 4.5 (zero-shot)	63.3	≈ 0.135	n/a
OpenAI o3 (scratchpad)	63.2	≈ 0.135	n/a

Sources: [1, 2, 3]

On the baseline leaderboard (no additional tools), superforecasters lead at approximately 69.9–70.6 Brier Index, with top LLMs remaining below this threshold [2].

Historical trajectory

The ForecastBench working paper's initial 200-question evaluation established the foundational baselines: superforecasters at a mean Brier score of 0.096, the general public at 0.121, and the top LLM at the time (Claude 3.5 Sonnet) at 0.122 — with superforecasters outperforming both at p < 0.001 [1]. By October 2025, the best AI model (GPT-4.5, released February 2025) had improved to 0.101, while the superforecaster benchmark stood at 0.081 — a roughly 20% human edge, itself down from approximately a 50% gap two years prior [1, 5].

As of January 29, 2026, the FRI leaderboard showed superforecasters still ranked #1, with two external AI submissions (xAI's Grok 4.20 Preview and Cassi's ensemble_2_crowdadj) tied at #2, trailing by 0.017 Brier points — a gap FRI analysts characterized as representing approximately one year of LLM progress at then-current rates [2].

The dataset vs. market question split

This distinction is critical for interpreting parity claims. On dataset questions, "green tree" has achieved a Brier Index sufficient to rank #1, surpassing the superforecaster median on that subset [2, 3]. This is a genuine and significant result.

On market questions, the gap is substantially larger. Good Judgment's analysis of this subset reports superforecasters achieving a Brier Index of 80.3 (Brier score ≈ 0.039) versus the nearest AI entrant at 75.8 (Brier score ≈ 0.059) — an AI error rate roughly 50% larger than the superforecasters' [5, 13]. Good Judgment explicitly identifies market questions as the subset most directly analogous to the work its institutional clients care about [5, 13].
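The "roughly 50% larger" figure follows from the two Brier Index values by simple arithmetic, sketched here. The helper name is illustrative, and the inputs are the market-question values quoted above.

```python
def brier_score_from_index(index: float) -> float:
    """Invert the Brier Index, (1 - sqrt(Brier score)) * 100, to a raw Brier score."""
    return (1.0 - index / 100.0) ** 2


# Market-question Brier Index values quoted above.
human_brier = brier_score_from_index(80.3)   # about 0.039
ai_brier = brier_score_from_index(75.8)      # about 0.059

# Fraction by which the AI's squared error exceeds the superforecasters'.
relative_excess_error = ai_brier / human_brier - 1.0
```

The same 4.5-point index gap would imply a smaller relative error difference lower on the scale, which is one reason the market-question gap reads as larger than the overall gap.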

One research prototype system labeled "BLF" achieves a state-of-the-art overall Brier Index of 69.4 on ForecastBench — 1.2 points below the 2024 superforecaster baseline of 70.6 — and notably reaches a Brier Index of 85.2 on market questions, which would represent a significant advance if confirmed [2]. However, BLF is described as a research prototype not yet standardized or widely deployed, and this figure should be treated as preliminary [2].

Projected parity timelines

FRI's linear trend projections estimate full LLM–superforecaster parity on the overall benchmark around August 2027, with a 95% confidence interval spanning March 2026 to August 2028 [2, 3]. More granular projections suggest:

Dataset question parity: projected for approximately June 2026 [3]
Market question parity: projected for approximately August 2026 [3]
Overall benchmark parity: August 2027 (central estimate) [2]

If progress continued at the late-2025 rate, FRI analysts noted parity could arrive as early as November 2026 [5]. These projections carry substantial uncertainty; preliminary leads on ForecastBench have reversed as more questions resolved, and results fluctuate with sample composition [1].
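To show how sensitive such dates are to the assumed rate of progress, here is a minimal two-point linear extrapolation. It is a simplified stand-in, not FRI's projection method. The Brier scores are those quoted in the historical trajectory above; the 13-month spacing between the two LLM measurements is an assumption made for this sketch.

```python
def months_until_target(earlier_score: float, later_score: float,
                        months_between: float, target_score: float) -> float:
    """Extend a straight line through two Brier scores until it hits the target."""
    slope = (later_score - earlier_score) / months_between  # change per month
    if slope >= 0:
        raise ValueError("scores are not improving, so the line never reaches the target")
    # A negative result would mean the target was already passed.
    return (target_score - later_score) / slope


# Top LLM: 0.122 in the initial evaluation, 0.101 by October 2025.
# Superforecaster benchmark at that time: 0.081.
# ASSUMPTION: the two LLM measurements are 13 months apart.
months_after_october_2025 = months_until_target(0.122, 0.101, 13, 0.081)
```

Under these inputs the line reaches the human benchmark roughly a year after October 2025. Changing the assumed spacing or either endpoint by a small amount moves the date by months, which is consistent with the wide confidence interval reported above.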

3. The AIA Forecaster: The Strongest Published Parity Claim

The most rigorous published claim of AI–superforecaster parity comes not from DeepMind but from Bridgewater's AIA Labs, documented in the technical report "AIA Forecaster: Technical Report" (arXiv:2511.07678) [4, 14].

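On the FB-7-21 question set within ForecastBench, the AIA Forecaster achieves a Brier score of 0.1125, compared to the superforecaster median of 0.1110. The difference of 0.0015 falls within the statistical uncertainty intervals, so the two are statistically indistinguishable on this set [4]. The same source is quoted as giving OpenAI's o3 a score of 0.1096 and placing previous LLM forecasters around 0.107 [4]; because lower Brier scores are better, those figures as quoted would rank ahead of both the AIA Forecaster and the superforecasters, so they are probably not like-for-like with the two headline numbers and should be checked against the technical report before being used as a comparison.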

The AIA Forecaster is an LLM-based system for judgmental forecasting using unstructured data [4]. Its performance represents a genuine advance: it is clearly better than both the general crowd and prior LLM systems on this benchmark, and its near-equivalence to the superforecaster median on FB-7-21 is the strongest quantitative evidence for parity currently in the literature [4].
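One common way to judge whether two forecasters are distinguishable on a shared question set is a paired bootstrap on per-question score differences. The sketch below illustrates that general idea only; it is not the procedure used in the technical report, and the per-question scores are synthetic.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_mean_difference(scores_a: Sequence[float], scores_b: Sequence[float],
                              n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Approximate 95% percentile interval for mean(scores_a - scores_b), paired by question."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need equally sized, non-empty paired scores")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


# Synthetic per-question Brier scores, made up for illustration (not from the report).
data_rng = random.Random(1)
system_scores = [data_rng.uniform(0.0, 0.25) for _ in range(300)]
human_scores = [s + data_rng.gauss(0.0, 0.05) for s in system_scores]

low, high = bootstrap_mean_difference(system_scores, human_scores)
interval_contains_zero = low <= 0.0 <= high
```

If the interval contains zero, the data do not separate the two forecasters at that confidence level. That is a weaker statement than showing they are equally accurate, particularly on a small question set.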

However, several important caveats apply:

FB-7-21 is a specific question subset, not the full ForecastBench benchmark
This is a single-lab result from an organization (Bridgewater) with commercial interests in demonstrating AI forecasting capability
The result has not been independently replicated on the full benchmark
The AIA Forecaster's performance on market questions specifically — the hardest and most practically relevant subset — is not separately reported in the available documentation

4. Is "Parity" Established, Contested, or Overstated?

The honest status

The evidence supports a carefully differentiated verdict rather than a binary yes/no:

Where AI has reached or approached parity:

On ForecastBench's dataset questions subset, "green tree" has achieved scores at or above the superforecaster median [2, 3]
On the FB-7-21 question set, the AIA Forecaster is statistically indistinguishable from the superforecaster median [4]
AI systems have surpassed the general public and median human forecaster across most ForecastBench configurations [1, 2]
In specialized, data-rich domains (weather, structured time series), AI has demonstrably surpassed human baselines [8]
One report notes that an AI from startup Mantic placed 4th out of 500+ participants in the Fall 2025 Metaculus Cup, beating the weighted average of all human-forecaster predictions in that contest [5]

Where AI has not reached parity:

On the full ForecastBench tournament leaderboard, superforecasters retain the #1 position with a 2.3 Brier Index point lead as of May 30, 2026 [2, 3]
On market questions — the most practically relevant subset — the AI error rate is approximately 50% larger than superforecasters' [5, 13]
On complex, judgment-heavy, continuously updated questions of the type institutional decision-makers actually ask, Good Judgment's analysts doubt the gap will close within the next year [5, 13]
AI systems currently operate as "solo" forecasters, lacking the group deliberation and aggregation techniques that boost human superforecaster accuracy by 10–25% [5]
Human forecasters continuously update in response to new information; current AI systems typically provide one-off predictions [5]
Earlier hybrid trials found superforecasters outperforming combined human–AI teams in some configurations [5]
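The aggregation point can be illustrated with a toy panel: taking the median forecast per question and scoring it against the individual forecasters. The panel and outcomes are hypothetical, and a simple median is only a stand-in for the more elaborate aggregation and deliberation methods referred to above.

```python
from statistics import median
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical panel: one row per forecaster, one column per question.
panel = [
    [0.9, 0.3, 0.6, 0.1],
    [0.6, 0.1, 0.8, 0.4],
    [0.8, 0.4, 0.4, 0.2],
]
outcomes = [1, 0, 1, 0]

# Median forecast per question, i.e. per column of the panel.
aggregate = [median(column) for column in zip(*panel)]

individual_scores = [brier_score(row, outcomes) for row in panel]
mean_individual_score = sum(individual_scores) / len(individual_scores)
aggregate_score = brier_score(aggregate, outcomes)
```

In this toy panel the median forecast scores better than the average individual, though not better than the single best forecaster. A model that submits one standalone forecast gets no comparable benefit unless it is ensembled.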

Structural limitations of the benchmark itself:

ForecastBench uses a frozen 2024 human baseline, meaning the human comparison group does not improve over time while AI systems do — creating an asymmetric comparison [2, 3]
The benchmark is skewed toward short-horizon, data-rich questions that structurally favor AI [2, 3]
The difficulty adjustment methodology adds uncertainty to cross-system comparisons [1, 2]
ForecastBench does not capture teaming, iterative updating, or question formulation — all dimensions where human superforecasters have demonstrated advantages [2, 3]

Summary assessment

Dimension	AI Status vs. Superforecasters
ForecastBench overall (May 2026)	Behind — 2.3 Brier Index points [2]
Dataset questions subset	At or above parity [2, 3]
Market questions subset	Significantly behind (~50% larger error rate) [5, 13]
FB-7-21 specific set (AIA Forecaster)	Statistical parity [4]
General public baseline	Surpassed [1, 2]
Complex judgment questions	Behind [5, 13]
Continuous updating	Behind [5]
Group deliberation	Not applicable to current AI [5]

The claim that parity has been "reached" is overstated as a general proposition and contested by the primary benchmark data. It is accurate only for specific subsets and specific question configurations. The Forecasting Research Institute's own leaderboard — the primary evidence base for these claims — shows superforecasters still leading as of May 30, 2026 [2, 3].

5. Concrete Implications for Finance, Insurance, and Governance

Finance

AI forecasting systems are being actively integrated into quantitative finance workflows, with the primary use cases being macroeconomic scenario generation, earnings surprise prediction, and geopolitical risk pricing. The ForecastBench data showing AI performance in the high-0.09 to 0.10 Brier score range — approaching but not yet matching elite human accuracy — is directly relevant to systematic trading strategies where marginal predictive edge compounds over large numbers of decisions [5, 6].

The Bank for International Settlements has examined AI forecasting adoption in financial institutions, noting that the governance challenge is not merely technical accuracy but the auditability and explainability of probabilistic outputs [15]. Current AI forecasting systems produce calibrated probabilities that are in principle more auditable than human judgment, but the models' sensitivity to training data composition and prompt framing introduces new forms of systematic error that are harder to detect than human cognitive biases.

The current human edge on market questions is significant for finance: on questions directly analogous to financial market outcomes, superforecasters' Brier score of approximately 0.039 versus AI's approximately 0.059 means an AI error roughly 50% larger, a substantial accuracy differential that would translate to material performance differences in systematic strategies [5, 13].

Insurance

For insurance applications — catastrophe modeling, longevity risk, liability tail estimation — the relevant comparison is not the overall ForecastBench leaderboard but performance on low-frequency, high-consequence events where training data is sparse. This is precisely the domain where the benchmark's skew toward data-rich, short-horizon questions makes it least informative.

The AIA Forecaster's technical report documents performance on judgmental forecasting using unstructured data [4], which is closer to the insurance use case than standard ForecastBench questions. The statistical indistinguishability from superforecasters on FB-7-21 is therefore more relevant to insurance applications than the overall leaderboard gap — but the caveat that this is a single-lab result on a constrained question set applies with full force.

Actuarial bodies have not yet formally recognized AI forecasting systems as equivalent to human expert judgment for regulatory purposes, and the frozen 2024 human baseline in ForecastBench means the benchmark cannot capture improvements in human forecasting practice that may be occurring simultaneously [2].

Governance and Policy

Good Judgment's work on AI governance applications documents superforecasters' use in policy scenario planning, intelligence assessment, and regulatory foresight [6, 7]. The Longitudinal Expert AI Panel (LEAP) Wave 5 report on security and geopolitics provides the most direct evidence of AI forecasting in governance contexts, showing that AI systems are being used as inputs to — but not replacements for — human expert judgment in high-stakes policy settings [16].

The OECD has examined AI-assisted forecasting in migration anticipation and preparedness contexts, noting that model evaluation frameworks for policy use require different criteria than benchmark leaderboards: specifically, the ability to handle novel situations, explain reasoning to non-technical stakeholders, and update in response to policy interventions [17].

The governance implication of the current evidence is that AI forecasting tools are most defensibly used as augmentation rather than replacement of human superforecasters. The academic literature on AI-augmented predictions confirms that LLM assistants improve human forecasting accuracy [18], suggesting the near-term value proposition is hybrid rather than autonomous [18, 16].

6. Strongest Evidence For and Against Parity

Strongest evidence that parity has been reached

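AIA Forecaster on FB-7-21 [4]: Brier score of 0.1125 versus superforecaster median of 0.1110, a statistically indistinguishable result, with the report describing the system as clearly ahead of prior LLM systems and the general crowd. The o3 figure of 0.1096 quoted for the same set is numerically lower than both and needs reconciling with that description.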
Dataset question subset leadership [2, 3]: "Green tree" ranks #1 on ForecastBench's dataset questions subset, surpassing the superforecaster median on that category.
Rapid convergence trajectory [1, 5]: The gap has closed from approximately 50% two years ago to approximately 20% by late 2025, with FRI projecting full parity by August 2027 (central estimate) and potentially as early as November 2026 under optimistic assumptions.
Specialized tournament performance [5]: An AI system placed 4th out of 500+ in the Fall 2025 Metaculus Cup, beating the weighted average of all human-forecaster predictions.
BLF market-question performance [2]: A research prototype reportedly achieves a Brier Index of 85.2 on market questions, which would exceed the superforecaster market-question baseline of 80.3 — though this is a preliminary, undeployed system.

Strongest evidence that parity has not been reached

ForecastBench overall leaderboard, May 30, 2026 [2, 3]: Superforecasters lead at Brier Index 70.2 versus "green tree" at 67.8–67.9 — a 2.3-point gap that is not within noise.
Market questions gap [5, 13]: On the most practically relevant question subset, AI error rates are approximately 50% larger than superforecasters' (Brier score ≈ 0.059 vs. ≈ 0.039).
Frozen baseline asymmetry [2, 3]: The human baseline is fixed at 2024 performance; if superforecasters have continued improving, the true gap may be larger than the leaderboard suggests.
Benchmark structural limitations [2, 3]: ForecastBench favors short-horizon, data-rich questions; lacks teaming, updating, and question formulation — all dimensions where human advantages are documented.
No DeepMind confirmation [9, 10]: The "March 15, 2026 parity" claim has no primary-source confirmation from DeepMind, FRI, or any peer-reviewed publication.
Expert skepticism on complex questions [5, 13]: Good Judgment's analysts, who operate the superforecaster programs used as the human baseline, explicitly doubt the gap will close within the next year for the complex, fuzzy questions that institutional clients actually ask.

7. Current Honest Status

As of May 31, 2026, the honest status is as follows:

AI has achieved partial, domain-specific parity with elite human superforecasters on constrained benchmarks. On ForecastBench's dataset questions subset, the best AI systems (led by DeepMind "green tree") have matched or exceeded the superforecaster median. On a specific question set (FB-7-21), Bridgewater's AIA Forecaster is statistically indistinguishable from superforecasters. These are genuine, significant results that represent a qualitative shift from the state of AI forecasting two to three years ago [1, 2, 4].

Broad, sustained parity across the full benchmark has not been established. The superforecaster aggregate retains the #1 position on both the tournament and baseline ForecastBench leaderboards as of May 30, 2026, with a 2.3 Brier Index point lead. On market questions — the most practically relevant category — the human advantage is substantially larger [2, 3, 13].

The "Green Tree parity on March 15, 2026" claim is not verifiable from primary sources and is contradicted by the current leaderboard data showing a persistent gap. The claim appears to have originated from a secondary newsletter's interpretation of subset-level performance, not from any official announcement [9, 10, 3].

The trajectory strongly suggests full benchmark parity within 12–24 months if current improvement rates continue, with FRI's central estimate at August 2027 and optimistic scenarios pointing to late 2026 [5, 2]. Whether benchmark parity will translate to practical parity on the complex, continuously updated, judgment-intensive questions that matter most to finance, insurance, and governance practitioners remains an open and actively contested question [5, 13, 6, 7].

The most defensible summary for practitioners: AI forecasting systems have entered the same performance regime as elite human superforecasters on structured, data-rich questions, and are closing the gap on harder questions at a pace that warrants serious attention. They have not yet replaced superforecasters as the gold standard for the full range of real-world forecasting tasks that institutional decision-makers care about.

References

[1] "ForecastBench: A Dynamic Benchmark of AI Forecasting Systems." https://faculty.wharton.upenn.edu/wp-content/uploads/2026/02/ForecastBench_A_Dynamic_.pdf

[2] Leaderboards - ForecastBench. forecastbench.org. https://forecastbench.org/leaderboards/index.html

[3] ForecastBench. forecastbench.org. https://forecastbench.org

[4] "AIA Forecaster: Technical Report." https://arxiv.org/html/2511.07678v1

[5] Human vs AI Forecasts: What Leaders Need to Know. goodjudgment.com. https://goodjudgment.com/human-vs-ai-forecasts

[6] Superforecasting AI Governance. goodjudgment.io. https://goodjudgment.io/docs/Superforecasting_AI_Governance_s.pdf

[7] Superforecasting AI Governance. goodjudgment.com. https://goodjudgment.com/superforecasting-ai-governance

[8] WeatherNext 2: Google DeepMind’s most advanced forecasting model. blog.google. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/weathernext-2

[9] News — Google DeepMind. deepmind.google. https://deepmind.google/blog

[10] Publications — Google DeepMind. deepmind.google. https://deepmind.google/research/publications

[11] "ForecastBench: A Dynamic Benchmark of AI Forecasting Systems." https://arxiv.org/pdf/2409.19839

[12] Making Forecasting Scores Easier to Interpret: Introducing the Brier Index. forecastingresearch.substack.com. https://forecastingresearch.substack.com/p/introducing-the-brier-index

[13] What Superforecasters Actually Said About ForecastBench. goodjudgment.com. https://goodjudgment.com/what-superforecasters-actually-said-about-forecastbench

[14] "AIA Forecaster: Technical Report." https://arxiv.org/pdf/2511.07678

[15] FSI Insights 63. bis.org. https://bis.org/fsi/publ/insights63.pdf

[16] Wave 5: Security and Geopolitics - Longitudinal Expert AI Panel. leap.forecastingresearch.org. https://leap.forecastingresearch.org/reports/wave5

[17] How can the model be evaluated. In: Migration anticipation and preparedness (full report). oecd.org. https://oecd.org/en/publications/2026/03/migration-anticipation-and-preparedness_c7c13bc4/full-report/how-can-the-model-be-evaluated_1c893567.html

[18] "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy." https://arxiv.org/pdf/2402.07862

Cross-provider analysis

How 4 providers compared on 193 claims across 109 topic clusters

Consensus findings (6)

Multiple providers independently confirmed these. Treat as the most reliable evidence.

The Brier Index is defined as (1 − √Brier score) × 100%, giving a 0–100% scale where higher values indicate better forecasts.
84% · anthropic, perplexity, grok, openai · [16][1][2][5][6][7]

Superforecasters were ranked #1 on the forecasting tournament and baseline leaderboards as of May 30, 2026.
76% · anthropic, grok, perplexity · [20][21][2]

As of late January 2026, superforecasters were still slightly ahead of state-of-the-art LLMs, by about 0.017 Brier points, though the gap was historically small and may have closed in some narrower domains.
75% · anthropic, perplexity, openai · [11][12][21][28][3]

In the ForecastBench leaderboard (as of late May 2026), superforecasters score about 70.2–70.9 on the Brier Index, while the best AI system, Google DeepMind’s "green tree," scores about 67.8–67.9, leaving a gap of about 0.017 difficulty-adjusted Brier score that the report says is roughly one year of LLM progress.
74% · anthropic, grok, perplexity, openai · [11][12][14][16][19][20][21][23][28][2][33][3][4][5][6]

As of late May 2026, AI had reached or approached parity with elite human superforecasters on some specific ForecastBench subsets, but broad, sustained parity across ForecastBench as a whole had not yet been conclusively established.
74% · anthropic, grok, perplexity, openai · [11][12][19][1][21][24][28][2][3][6]

ForecastBench is a dynamic forecasting benchmark associated with the working paper “ForecastBench: A Dynamic Benchmark of AI Forecasting Systems” and the Forecasting Research Institute.
72% · grok, perplexity, openai · [12][19][2][3]

Single-source insights (85)

Reported by only one provider. Treat as preliminary unless independently verified.

In the original Good Judgment Project experiments, superforecasters were defined operationally as roughly the top 2% of participants in multi-year geopolitical forecasting tournaments.
70% · perplexity · [20]

Anthropic’s Claude Sonnet 4.5 zero-shot configuration has a Brier Index of 63.3.
70% · perplexity · [19][28]

OpenAI’s o3 scratchpad entry has a Brier Index of 63.2.
70% · perplexity · [19][28]

In the original Good Judgment Project experiments, superforecasters showed accuracy improvements of around 30% over intelligence analysts with access to classified information.
70% · perplexity · [20]

OpenAI's o3 is 0.1096.
70% · anthropic · [38][39][40]

BLF outperforms baseline LLM entries such as Anthropic Claude Sonnet 4.5 and OpenAI o3 on the tournament leaderboard.
70% · perplexity · [17][23][25]
+ 79 more single-source insights

Low-confidence claims (28)

Weak signals the verifier flagged for hedged language in the report.

Rapid iteration continues.
56% · grok

Current AI models still effectively “solo” forecast.
58% · openai

The report says WeatherNext 2 is a weather model, not the geopolitical/event "superforecaster" domain in question.
58% · anthropic

The report says Bridgewater's AIA Forecaster is the substantive, peer-grade evidence for parity.
58% · anthropic

The report says AIA Forecaster is clearly better than both the crowd and prior LLM systems on this benchmark.
58% · anthropic
+ 23 more low-confidence claims

Go Deeper

Follow-up questions based on where providers disagreed or confidence was low.

Verify whether the alleged DeepMind system name “Green Tree” and the March 15, 2026 parity claim appear in any official DeepMind publication, blog post, arXiv paper, or ForecastBench/Metaculus documentation, and identify the exact benchmark and score if they do.

The signal says the widely repeated “Green Tree” claim does not check out as a verifiable DeepMind announcement, but other sources still reference a March 15 parity event. This is a high-value provenance check because the headline claim may be based on misattribution or newsletter commentary rather than a real DeepMind release.

Disagreement · XS tier

Establish the exact current best AI Brier score on ForecastBench versus the frozen 2024 superforecaster baseline, including whether the comparison is on the full benchmark or only on a subset such as dataset questions versus market questions.

Multiple weak signals suggest the apparent parity depends on benchmark slice, question mix, or difficulty adjustment, with some sources saying AI is ahead on dataset questions but behind overall, and others citing Brier Index values in the 67–69 range. A precise score-and-baseline audit is needed to determine whether parity is actually established or only partial.

Low Confidence · M tier

Compare the Bridgewater AIA Forecaster technical report with the ForecastBench leaderboard to determine whether Bridgewater’s result is the strongest peer-grade evidence for parity, and whether its advantage over crowd and prior LLM systems holds across question types.

One signal says Bridgewater’s AIA Forecaster is the substantive, peer-grade evidence for parity and that it is clearly better than both the crowd and prior LLM systems. Because this is a single-source interpretation, it should be checked against the underlying technical report and any independent reproductions or leaderboard entries.

Implication · S tier

Check whether the reported 2.7 Brier-Index point gap is consistent with the stated 0.1125 vs 0.1110 FB-7-21 scores and with the claim that the elite human benchmark scored 0.081, and clarify which normalization is being used.

The signals contain potentially conflicting numeric statements about the size of the gap, the human benchmark, and whether lower is better. A narrow numeric reconciliation would materially affect the honest status of parity and could resolve whether the alleged gap is trivial, meaningful, or a scale artifact.

Disagreement · XS tier

Assess whether the claimed implications for finance, insurance, and governance are supported by benchmark-specific evidence or are extrapolations from general forecasting performance, especially given that humans still outperform AI on market questions and complex “black swan” cases.

Several signals suggest the practical gap remains larger on market questions, that humans retain strengths on trend breaks and low-probability wildcard events, and that current AI is still effectively solo forecasting. This matters because downstream claims about finance, insurance, and governance depend on whether the benchmark is a good proxy for real decision environments.

Implication · M tier

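The numeric reconciliation requested in the fourth follow-up can be sketched in a few lines. The snippet below is a minimal illustration, assuming the Brier Index normalization quoted in this report, (1 − √Brier score) × 100, applies to all three figures. Whether the 0.1125, 0.1110, and 0.081 values were actually scored on the same question set, or with the same difficulty adjustment, is not established by the sources. The function and variable names are hypothetical.

```python
import math


def brier_index(brier_score: float) -> float:
    """Brier Index as quoted in this report: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(brier_score)) * 100.0


# Figures quoted in the follow-up question (lower Brier score is better).
# Assumption: all three are raw Brier scores on a comparable scale.
fb_score_a = 0.1125
fb_score_b = 0.1110
elite_human = 0.081

index_a = brier_index(fb_score_a)
index_b = brier_index(fb_score_b)
index_human = brier_index(elite_human)

# Gaps expressed in Brier Index points (higher index is better).
gap_fb_pair = index_b - index_a
gap_to_human = index_human - index_a

print(f"0.1125 -> {index_a:.2f}, 0.1110 -> {index_b:.2f}, 0.081 -> {index_human:.2f}")
print(f"gap between the two FB-7-21 scores: {gap_fb_pair:.2f} points")
print(f"gap from 0.1125 to the 0.081 human figure: {gap_to_human:.2f} points")
```

Under that assumption, the 0.1125 and 0.1110 scores differ by well under one Brier Index point, while 0.1125 versus 0.081 differs by roughly five, so neither pairing reproduces a 2.7-point gap on its own. That is consistent with the concern that the figures mix scales or question sets, but it does not settle which normalization the original sources used.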

Key Claims

Cross-provider analysis with confidence ratings and agreement tracking.

109 claims · sorted by confidence

The Brier Index is defined as (1 − √Brier score) × 100%, giving a 0–100% scale where higher values indicate better forecasts.

high · anthropic, perplexity, grok, openai · markets.financialcontent.com, image-ppubs.uspto.gov, forecastingresearch.substack.com, +3

high · anthropic, grok, perplexity, openai · goodjudgment.com, forecastbench.org, arxiv.org, +12

As of late May 2026, AI had reached or approached parity with elite human superforecasters on some specific ForecastBench subsets, but broad, sustained parity across ForecastBench as a whole had not yet been conclusively established.

high · anthropic, grok, perplexity, openai · goodjudgment.com, forecastbench.org, deepmind.google, +7

Superforecasters were ranked #1 on the forecasting tournament and baseline leaderboards as of May 30, 2026.

high · anthropic, grok, perplexity · forecastbench.org, research.google, forecastingresearch.substack.com

As of late January 2026, superforecasters were still slightly ahead of state-of-the-art LLMs, by about 0.017 Brier points, though the gap was historically small and may have closed in some narrower domains.

high · anthropic, perplexity, openai · goodjudgment.com, forecastbench.org, goodjudgment.substack.com, +2

ForecastBench is a dynamic forecasting benchmark associated with the working paper “ForecastBench: A Dynamic Benchmark of AI Forecasting Systems” and the Forecasting Research Institute.

high · grok, perplexity, openai · goodjudgment.com, deepmind.google, forecastingresearch.substack.com, +1

A Brier Index of 50 means always predicting 50%.

high · anthropic, perplexity · digicrome.com, forecastingresearch.substack.com

A Brier Index of 100 corresponds to perfect foresight.

high · anthropic, perplexity · digicrome.com, forecastingresearch.substack.com

FRI’s post “LLMs Are Closing the Gap on Human Superforecasters” projects dataset-question parity for June 2026, market-question parity for August 2026, and full LLM-superforecaster parity with a 95% CI of March 2026–August 2028.

high · grok, perplexity · goodjudgment.substack.com, forecastingresearch.substack.com, forecastingresearch.substack.com

FRI switched its primary leaderboard metric to the Brier Index in March 2026.

high · anthropic, perplexity · digicrome.com, forecastingresearch.substack.com

A personal tech newsletter post dated May 22, 2026 is the only source the report found that names a DeepMind "GreenTree" ("green tree") forecasting system.

high · anthropic, perplexity · theinnermostloop.substack.com, forecastbench.org

As of late May 2026, ForecastBench evidence suggests LLM-based systems have not yet fully matched elite human superforecasters, with linear trend projections estimating full parity around August 2027.

high · grok, perplexity · forecastingresearch.substack.com, deepmind.google, arxiv.org, +2

Anthropic’s Claude Sonnet 4.5 zero-shot configuration has a Brier Index of 63.3.

high · perplexity · deepmind.google, markets.chroniclejournal.com

In the original Good Judgment Project experiments, superforecasters were defined operationally as roughly the top 2% of participants in multi-year geopolitical forecasting tournaments.

high · perplexity · research.google

OpenAI’s o3 scratchpad entry has a Brier Index of 63.2.

high · perplexity · deepmind.google, markets.chroniclejournal.com
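The Brier Index claims listed above can be checked with a short calculation. The sketch below assumes the standard mean-squared-error Brier score for binary outcomes and the index formula stated in this report. The toy outcomes and helper names are illustrative only, and the midpoints of the quoted leaderboard ranges are used as a rough consistency check, on the assumption that the index is computed from the same difficulty-adjusted score.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_index(score):
    """Index as stated in this report: (1 - sqrt(Brier score)) * 100."""
    return (1.0 - math.sqrt(score)) * 100.0


def brier_from_index(index):
    """Inverse of brier_index: recover the Brier score from an index value."""
    return (1.0 - index / 100.0) ** 2


# Toy resolved questions (illustrative, not benchmark data).
outcomes = [1, 0, 1, 1, 0]

# Anchor 1: always predicting 50% gives Brier 0.25 and index 50.
always_half = brier_index(brier_score([0.5] * len(outcomes), outcomes))

# Anchor 2: perfect foresight gives Brier 0 and index 100.
perfect = brier_index(brier_score([float(o) for o in outcomes], outcomes))

# Rough check of the quoted leaderboard ranges, using range midpoints.
human_mid = (70.2 + 70.9) / 2
ai_mid = (67.8 + 67.9) / 2
implied_gap = brier_from_index(ai_mid) - brier_from_index(human_mid)

print(f"always 50%: {always_half:.1f}, perfect: {perfect:.1f}")
print(f"implied Brier-score gap from index midpoints: {implied_gap:.4f}")
```

On these midpoints the implied Brier-score gap comes out close to the 0.017 figure cited in the claims, which suggests the index values and the Brier-point gap are at least mutually consistent under this formula.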

Sources

33 unique sources cited across 109 claims.

Academic · 10 sources

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

arxiv.org · via anthropic, perplexity, openai, grok

16 claims

[2511.07678] AIA Forecaster: Technical Report

arxiv.org · via anthropic, perplexity, grok, openai

8 claims

AIA Forecaster: Technical Report

arxiv.org · via anthropic, perplexity, grok

7 claims

AIA Forecaster: Technical Report

arxiv.org · via anthropic, perplexity, grok

7 claims

ForecastBench: A Dynamic Benchmark of AI Forecasting ...

arxiv.org · via anthropic, perplexity, grok, openai

6 claims

Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

arxiv.org · via anthropic, grok, perplexity, openai

6 claims

Inspiring from Galaxies to Green AI in Earth: Benchmarking Energy-Efficient Models for Galaxy Morphology Classification

mdpi.com · via perplexity

3 claims

ForecastBench: A Dynamic Benchmark of AI ...

faculty.wharton.upenn.edu · via grok

2 claims

Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

arxiv.org · via anthropic, grok, perplexity, openai

1 claim

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

arxiv.org · via perplexity

1 claim

Government · 1 source

System and method for enhanced collaborative forecasting

image-ppubs.uspto.gov · via anthropic, perplexity, grok, openai

5 claims

News & Media · 8 sources

A professional superforecaster walks (forecastingresearch.substack.com)

forecastingresearch.substack.com · via anthropic, perplexity, grok, openai

18 claims

FinancialContent - The Great Forecast Convergence: AI Closing the 20% Gap on Human Superforecasters

markets.financialcontent.com · via anthropic, perplexity, grok, openai

17 claims

What Superforecasters Actually Said About ForecastBench

goodjudgment.substack.com · via grok, perplexity, anthropic, openai

12 claims

User | chroniclejournal.com - The Death of the ‘Gut Feeling’: AI Agents Close the 20% Gap to Human Superforecasters

markets.chroniclejournal.com · via anthropic, perplexity, openai, grok

12 claims

Making Forecasting Scores Easier to Interpret: Introducing the Brier Index

forecastingresearch.substack.com · via anthropic, perplexity, grok, openai

9 claims

How well can large language models predict the future?

forecastingresearch.substack.com · via grok, openai

5 claims

Welcome to May 22, 2026 - by Dr. Alex Wissner-Gross

theinnermostloop.substack.com · via anthropic, perplexity

4 claims

Llms are closing the gap on human (forecastingresearch.substack.com)

forecastingresearch.substack.com · via openai

3 claims

AI forecasting parity · superforecasters · ForecastBench Brier score · DeepMind green tree · forecasting benchmarks 2026 · AI forecasting finance insurance governance · LLM forecasting performance

AI vs Superforecasters: Parity Status May 2026

Executive Summary

1. The "Green Tree" Claim: System, Date, and Benchmark

What the evidence actually shows

What March 15, 2026 likely refers to

Verdict on the "Green Tree" claim

2. ForecastBench: Current Scores and the Human–AI Gap

The benchmark defined

Current leaderboard figures (as of May 23–30, 2026)

Historical trajectory

The dataset vs. market question split

Projected parity timelines

3. The AIA Forecaster: The Strongest Published Parity Claim

4. Is "Parity" Established, Contested, or Overstated?

The honest status

Summary assessment

5. Concrete Implications for Finance, Insurance, and Governance

Finance

Insurance

Governance and Policy

6. Strongest Evidence For and Against Parity

Strongest evidence that parity has been reached

Strongest evidence that parity has not been reached

7. Current Honest Status

References

Evidence Explorer

Synthesized from 4 providers on May 31, 2026 using fast mode
