I Fact-Checked a Podcast's AI Claim With 4 Models. Only the Swarm Caught It.

First, some love. I listen to Moonshots with Peter Diamandis and All-In every week. They are some of the best signal in tech. Smart people, real arguments, ideas you won't get anywhere else. This post is not a dunk on either show. It's the opposite. It's about what happens downstream of even a great podcast, when a sharp-sounding claim leaves the studio and lands in your research tool.

Here's a thesis I've been sitting on for a while. A great answer to a poorly constructed question is a bad answer.

And the most common way a question gets poorly constructed isn't malice. It's that you picked up a confident claim somewhere, a podcast or a tweet or a hallway conversation, and you ask the AI about it as if it's already true.

So I ran the experiment with a real one.

The claim

In a recent Moonshots episode the hosts covered a DeepMind forecasting system and said this, word for word (watch the moment, 51:38):

"...a system called Green Tree that can predict the future as well as the best humans on Earth... So on March 15th, AI hit parity with the superforecasters for the first time. Let me say that again. An AI can now predict geopolitical events, economic trends, political outcomes as good as the absolute best human minds."

That's a big claim, said as settled fact. A named system, a specific date, a clean milestone. Exactly the kind of thing you go and research. So I did what a normal person does. I took the claim more or less as I heard it and asked an AI to fill in the details:

"A DeepMind forecasting system (reported name Green Tree) was claimed to hit parity with top human superforecasters around March 15, 2026. Verify the system, the date, and the benchmark used."

Here's what makes it a hard question. "Green Tree" is real. It's an actual entry on the ForecastBench leaderboard. So a lazy check ("does Green Tree exist? yes") confirms the wrong thing. The load-bearing parts are "hit parity," "on March 15," and "first time," and those are the parts that needed verifying. A good researcher separates the real entity from the unverified milestone. A careless one sees the real name and waves the whole claim through.
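To make that separation concrete, here is a minimal sketch of the claim split into its parts. The `SubClaim` type, the wording of each part, and both check functions are hypothetical illustrations, not anything Parallect or the models actually run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubClaim:
    text: str
    verified: Optional[bool] = None  # None means nobody has checked it yet


# Hypothetical decomposition of the podcast claim (the phrasing is illustrative).
claim = [
    SubClaim("A DeepMind system called Green Tree exists", verified=True),
    SubClaim("It hit parity with superforecasters"),
    SubClaim("The parity event happened on March 15"),
    SubClaim("This was the first time an AI reached parity"),
]


def lazy_check(parts):
    """Waves the whole claim through as soon as any one part checks out."""
    return any(p.verified for p in parts)


def careful_check(parts):
    """Lists the parts that still need a source before the claim can stand."""
    return [p.text for p in parts if p.verified is not True]


print("lazy check passes:", lazy_check(claim))
for text in careful_check(claim):
    print("still unverified:", text)
```

The lazy check passes on the strength of the name alone, while the careful check leaves three of the four parts open.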

I ran the query through Parallect, which fans every question out to four frontier models in parallel (OpenAI, Perplexity, Grok, and Anthropic) and then writes one synthesized report from all four. Then I ran it again. And again. And again. Four times, same query, same day. The only thing that changed from run to run was what the models came back with.

That's 16 individual model answers plus 4 synthesized reports. Don't take my word for any of this. Every run is public twice over. You can read the finished report on Parallect, and you can pull the raw, signed research bundle ¹ from prxhub:

Run	Read it	Download the bundle
1	Parallect report	prxhub bundle
2	Parallect report	prxhub bundle
3	Parallect report	prxhub bundle
4	Parallect report	prxhub bundle

The data

For every run I scored each model on one thing. Did it repeat "hit parity on March 15" as established fact, or did it flag it, meaning it noted that the milestone has no DeepMind publication behind it and that superforecasters still lead the actual leaderboard?

Run	OpenAI	Perplexity	Grok	Anthropic	Synthesis
1	repeated	flagged	repeated	flagged	caught
2	repeated	repeated	repeated	flagged	caught
3	repeated	flagged	flagged	flagged	caught
4	repeated	repeated	repeated	flagged	caught
Caught	0 / 4	2 / 4	1 / 4	4 / 4	4 / 4

Read the bottom row twice.

OpenAI repeated the claim as fact all four times. It wrote a fluent, well-organized, confidently sourced report. One run opened with "In mid-March 2026, reports surfaced that a new Google DeepMind forecasting agent, code-named 'Green Tree,' had matched the accuracy of elite forecasters." That sentence reads like journalism. It's the podcast's claim, laundered into a citation. If you ran this query once through a single model, which is what almost everyone does, you'd walk away more confident in the unverified claim than when you started.

Across all sixteen single-model answers, the claim survived 9 times (OpenAI 4, Grok 3, Perplexity 2). Worse than a coin flip.
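If you want to re-derive those counts, the tally is a few lines. The labels below are transcribed from the results table; the variable names are only for illustration.

```python
# Labels transcribed from the results table: True = flagged, False = repeated.
runs = {
    "OpenAI":     [False, False, False, False],
    "Perplexity": [True,  False, True,  False],
    "Grok":       [False, False, True,  False],
    "Anthropic":  [True,  True,  True,  True],
}

# Per-model count of runs in which the shaky milestone was flagged.
caught = {model: sum(flags) for model, flags in runs.items()}

# Total single-model answers, and how many repeated the claim as fact.
total_answers = sum(len(flags) for flags in runs.values())
survived = total_answers - sum(caught.values())

for model, n in caught.items():
    print(f"{model}: caught {n} / {len(runs[model])}")
print(f"claim survived in {survived} of {total_answers} single-model answers")
```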

The synthesis caught it 4 out of 4. Every time, the unified report led with the correction: "Green Tree is real but mischaracterized. Google DeepMind does have a 'green tree' entry on the ForecastBench leaderboard, but the specific 'hit parity on March 15' claim traces to a single secondary write-up, not any DeepMind publication, and the current leaderboard still shows superforecasters ahead."

For the record, that correction is right, and you can check it yourself. Green Tree is a genuine entry on the ForecastBench leaderboard, ranked near the top (about #2, Brier Index around 67.9). But the human superforecaster aggregate still leads (around 70.2), and there's no primary source for a dated "parity" event. ForecastBench is an independent, peer-reviewed benchmark from the Forecasting Research Institute (paper). Go look at the standings and judge for yourself. The podcast compressed a real, impressive result into a cleaner headline than the data supports. Easy to do live. Hard to catch if you only ask one model.

Why the synthesis wins, and it isn't "more models = smarter"

The naive read is "average four models, beat one." That's not it. Average four confident wrong answers and you get a confident wrong answer.

What actually happens is narrower. In every run, at least one model doubted the milestone. Always Anthropic, sometimes Perplexity or Grok as well. On one run, Grok went straight to the ForecastBench leaderboard and checked the standings before answering. The synthesis step reads all four reports side by side, sees three asserting a fact while one says "I checked the leaderboard and that's not what it shows," and treats that disagreement as a reason to dig instead of a vote to average away.
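The difference between voting and listening for dissent fits in a toy sketch. This is not Parallect's synthesis step, which reads full reports; the two functions and the `investigate` and `accept` labels are hypothetical stand-ins that work on the one-word scores from the table.

```python
from collections import Counter


def majority_vote(labels):
    """Naive aggregation: whichever label most models gave wins."""
    return Counter(labels).most_common(1)[0][0]


def dissent_aware(labels):
    """A single 'flagged' report is enough to send the premise back for checking."""
    return "investigate" if "flagged" in labels else "accept"


# Run 2 from the table: OpenAI, Perplexity, and Grok repeated; Anthropic flagged.
run_2 = ["repeated", "repeated", "repeated", "flagged"]

print(majority_vote(run_2))   # repeated
print(dissent_aware(run_2))   # investigate
```

On that run a vote sides with the three confident reports, while the dissent-aware rule stops on the one objection.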

One skeptic in the room is enough, if something is listening for the skeptic. A single model has no skeptic. It is its own only witness, and on this question OpenAI's only witness was confidently wrong four times running.

That's the whole mechanism. The question carried a shaky premise. On two of the four runs, three of the four models answered the question as asked. Beautifully, uselessly. The synthesis answered the question that should have been asked: wait, is the Green Tree parity event even real? No. So the genuinely useful answer to the original question is "your premise doesn't hold," and only the multi-model path got there every time.

A great answer to a bad question is a bad answer. The whole game is noticing the question was bad.

Could this be wrong?

The result is clean enough that I want to attack it myself.

Is the scoring just keyword-matching? No. I read the actual sentences. OpenAI writing "reports surfaced that a... agent had matched the accuracy of elite forecasters" is a repeat. Grok pulling the live leaderboard before answering is a flag. The labels reflect what the models did, not which words they used.

Is n=4 enough? For a rate, no. I would not publish "OpenAI confirms X% of shaky premises" off one claim run four times. What n=4 gives you is qualitative and stark. The synthesis caught it every time, Anthropic flagged it every time, OpenAI missed it every time, and the other two were inconsistent (Perplexity 2 of 4, Grok 1 of 4). The direction isn't ambiguous. The precise frequency needs more claims across more topics. That's the next experiment.
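A rough way to see why four trials can't pin down a rate is to put an interval around each count. The sketch below uses the standard Wilson score interval at roughly 95% confidence. The choice of interval is mine, not part of the original scoring, and it treats the four runs as independent trials, which is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Catch counts out of 4 runs, taken from the results table.
for name, k in [("Synthesis", 4), ("OpenAI", 0), ("Perplexity", 2), ("Grok", 1)]:
    lo, hi = wilson_interval(k, 4)
    print(f"{name}: caught {k}/4, plausible catch rate {lo:.2f} to {hi:.2f}")
```

Even a perfect 4 of 4 leaves an interval that reaches down to roughly one half, so the direction is the finding here and the rate is not.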

Did the synthesis just get lucky? It's a separate authoring step that reads all four reports, and it caught the claim on the two runs where only one of the four providers flagged it. Twice it had to pull a single dissenting thread out of three confident affirmations and follow it. That's not luck riding clean inputs.

Isn't this unfair to the podcast? That's the objection I care about most, so let me be plain. The hosts didn't fabricate anything. They reported a real, genuinely impressive DeepMind result and rounded it up to a crisp milestone in the flow of a live conversation. That's normal. Podcasts compress, and that's part of why they're great. The failure I'm measuring isn't theirs. It's what happens when that compressed claim meets a single AI model that's eager to confirm whatever you bring it.

What this means if you research with AI

Single-model deep research has a failure mode that's invisible by design. When your question carries an assumption, the model usually hands you a polished answer built on top of it, and the polish is what fools you. The report looks identical whether the premise was solid or not. You can't see the assumption it never questioned.

Picking a better model doesn't save you. OpenAI is excellent. It confirmed the claim four times not because it's weak, but because a single model has no one to argue with.

What saves you is making the models check each other. That's the whole reason I built Parallect. Not to crown the one best model, but to put several of them in a room so the one that smells something off can stop the others from repeating it.

I didn't even have to invent a trap. I just believed something I heard on a show I love and asked about it the way anyone would. One model confirmed it every time. The room caught it every time.

One more thing, and it matters

Don't let the fact-check bury the actual story, because the Moonshots crew were onto something real.

Strip away the too-clean date and Green Tree is a DeepMind system sitting at #2 on a serious, contamination-resistant forecasting benchmark, a couple of points behind the best humans on Earth and closing fast. A few years ago that gap was a chasm. It is now a rounding error on the easy questions and a short, shrinking lead on the hard ones. The headline ran ahead of the data by a few months. The trend the hosts were pointing at is dead right, and it's one of the more important things happening in AI right now.

So this isn't a gotcha. It's a thank-you. I'd never have dug into ForecastBench, the Brier Index, or how close the machines actually are if Peter and the crew hadn't put it in front of me. They got me to look. The tools just helped me look carefully.

That's the relationship I want with all of this. Great people surface the signal. Good tools keep you honest about the details. You need both.

Parallect runs your research question across Perplexity, OpenAI, Grok, Gemini, and Anthropic in parallel, then writes one report that has to reconcile what they disagree on. The four runs above were standard L-tier queries. Heard a bold claim somewhere? Bring it and check it at parallect.ai.

Every claim in this post is checkable. The four full research runs are public (1, 2, 3, 4), the podcast moment is linked to the timestamp, and the ForecastBench standings are public. Don't trust us. Look.

¹ prxhub is an open registry for research bundles. A bundle is a portable, cryptographically signed file containing the full research run: every provider's report, the extracted claims, the sources, and the synthesis. Publishing to prxhub means anyone can download the exact artifact, inspect it, or re-run their own analysis on it. The Parallect link is the readable version; the prxhub bundle is the receipts.