Episode Summary
An honest look at where AI actually stands in April 2026, and why the MIT Project NANDA study reported that roughly 95% of enterprise AI pilots produce no measurable return. The headline is anchoring every bear pitch on Wall Street, but the study itself is more textured than the number suggests.
Frontier capability has moved fast. Claude Opus 4.7 and GPT-5.4 both shipped this spring with meaningful reliability gains; Chinese open-weight models are within shouting distance of the US frontier; METR's long-horizon task evaluations put top models at three to five hours of effective autonomous work, up from about one hour a year ago. But the binding constraint on AI for business owners right now is not capability. It is reliability. And reliability is moving slower than the capability curve suggests.
This is Part 1 of a two-part deep dive pulled from our cornerstone analysis "The State of AI in April 2026: Capability Is Not the Problem."
Chapters
- 00:00 The thesis: capability is not the problem
- 02:30 Where we actually are right now (Opus 4.7, GPT-5.4, Gemini 3.1 Pro)
- 10:00 The forces pushing capability forward (compute, post-training, interpretability)
- 18:00 The honest limits: hallucination rates, long-horizon reliability, evaluation gaps
- 28:00 The MIT Project NANDA study, properly
- 34:00 The economics: $700B of circular capital
- 45:00 Power as the next binding constraint
Topics Covered
- The real state of the frontier: Claude Opus 4.7's 70% on CursorBench, GPT-5.4's 67% on WebArena-Verified, Gemini 3.1 Pro, DeepSeek V4, Qwen 3.5, GLM-5 all within noise on GPQA Diamond.
- Why METR's "time horizon" metric matters more than one-shot benchmarks: it measures the length of task a model can reliably complete, which is a better proxy for real-world usefulness than any single score.
- Three forces driving current capability gains: compute scaling through Blackwell and Rubin, post-training eating pre-training, and mechanistic interpretability finally getting somewhere.
- The limits keynote demos hide: the median enterprise LLM hallucinates on 15-50% of open-ended questions; at 95% reliability per step, a five-step agent completes about 77% of runs while a twenty-step agent drops to about 36%.
- What the MIT Project NANDA study actually found: bespoke builds failed, vendor-delivered integrations worked two-thirds of the time, shadow AI was quietly succeeding.
- The circular-capital loop: $700B of 2026 hyperscaler capex, NVIDIA investing in OpenAI that rents from Oracle that buys from NVIDIA.
- Power as the new constraint: roughly half of planned 2026 US data center builds expected to slip or cancel because the electrical grid cannot support them.
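The agent reliability figures in the list above (77% at five steps, 36% at twenty) are what you get by compounding a 95% per-step success rate. Here is a minimal sketch of that arithmetic. It assumes every step succeeds or fails independently and that one failed step fails the whole run, which is a simplification; the function name `chain_reliability` is ours, for illustration only.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (ours): steps are independent and a single failure
    fails the whole run, so end-to-end reliability is per_step ** steps.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Compare short and long agent runs at the same per-step reliability.
    for n in (5, 20):
        print(f"{n:>2} steps at 95% per step: {chain_reliability(0.95, n):.0%}")
```

Real agents can retry or recover from a bad step, so treat this as a rough model of why long task chains are harder to trust than short ones, not a prediction for any specific product.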
Key Numbers Cited
- $30B - Anthropic's revenue run rate as of April 2026, up from ~$9B at end of 2025
- $24B - OpenAI's current revenue run rate
- $700B - combined 2026 capex guidance from Amazon, Alphabet, Microsoft, Meta
- $1.15T - OpenAI's forward compute commitments through 2035
- 320 min - Claude Opus 4.5's METR time horizon (up from ~60 min a year earlier)
- 94.2% / 94.4% / 94.3% - Opus 4.7 / GPT-5.4 Pro / Gemini 3.1 Pro on GPQA Diamond (within noise)
- 95% - MIT Project NANDA's headline AI pilot failure rate (with real nuance behind it)
- ~50% - planned 2026 US data center builds expected to slip due to grid limits
Related Reading on Prentus AI