I have been running AI agents in my own business for most of the last year. I read the benchmarks the day they post. I talk to operators shipping this stuff and operators who tried and quietly stopped. What I want to do here is give you the clearest, most honest picture I can of where AI actually is in April 2026, and where I think it is going. Not doomer, not accelerationist, not the safe middle take either. The real picture, as best I can assemble it from primary sources. If I have a take I will tell you it is a take.
Where we actually are right now
Start with capability. On April 16, Anthropic released Claude Opus 4.7. It processes images up to roughly 3.75 megapixels, resolves "3x more production tasks" than Opus 4.6 on SWE-bench, scores 70% on CursorBench (up from 58%), and in Anthropic's words, "works coherently for hours" on autonomous tasks. CNBC called it "broadly less capable" than Anthropic's still-unreleased Claude Mythos model, which is only being given to a handful of vetted partners. Read that one more time. The model Anthropic did release, their most powerful generally available system, is a model they consider a step down from what they have already built but will not ship.
On the OpenAI side, GPT-5.4 arrived March 5 with native computer-use capabilities, a 1 million-token context window, and a reported 67.3% success rate on WebArena-Verified. Google's Gemini 3.1 Pro and Chinese open-weight models (GLM-5, DeepSeek V4, Qwen 3.5) are within shouting distance on most benchmarks. On graduate-level reasoning (GPQA Diamond), the three frontier US labs are inside the noise of each other: Opus 4.7 at 94.2%, GPT-5.4 Pro at 94.4%, Gemini 3.1 Pro at 94.3%. The interesting question is no longer "can the model answer this" but "will it reliably answer this when it is part of a ten-step workflow a junior analyst kicked off two hours ago."
That is a very different question. And the honest answer in April 2026 is: sometimes.
The METR research group, which evaluates frontier models on long-horizon autonomous tasks, measures what they call a model's "time horizon": the length of task at which the model completes things roughly half the time. As of the most recent round of reporting, Claude Opus 4.5 sits at about 320 minutes on METR's suite and GPT-5 at 214 minutes, with Opus 4.7 expected to move the number further. That is real progress; a year ago the figure was closer to an hour. It is also less than it sounds. Reliability degrades super-linearly with task duration. The model that can do a one-hour task 80% of the time does the eight-hour version of that task maybe 20% of the time, because errors compound. In practice, this means frontier agents today can genuinely do real work, but they need human checkpoints more often than most demos suggest.
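To make the compounding concrete, here is a toy calculation, mine rather than METR's methodology: assume the model succeeds on each hour of a task independently with a fixed probability, and watch what happens as the task gets longer.

```python
# Toy model of compounding reliability on long-horizon tasks.
# Assumption (mine, for illustration): each hour of work succeeds
# independently with the same probability. Real agent failures are
# messier than this, but the compounding effect is the same in spirit.

def end_to_end_success(per_hour_success: float, hours: int) -> float:
    """Probability the whole task succeeds if every hour must succeed."""
    return per_hour_success ** hours

p = 0.80  # the model that handles a one-hour task 80% of the time
for hours in (1, 2, 4, 8):
    print(f"{hours}h task: {end_to_end_success(p, hours):.0%}")
# Prints roughly: 80%, 64%, 41%, 17% -- in line with the ~20% figure above.
```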
Put another way: the difference between "the model can do the task" and "I would let the model do the task unsupervised in front of a paying customer" is still very large. That gap is the entire story.
The forces pushing it forward
Three things are driving capability gains right now, and they are stacking.
Compute is scaling and being booked out years in advance. The four largest hyperscalers - Amazon, Alphabet, Microsoft, Meta - are planning combined capex near $700 billion in 2026, roughly double what they spent in 2025. NVIDIA sits on top of that pile; its advanced packaging supply at TSMC is essentially booked through 2027, leaving AMD and everyone else fighting over the scraps. The Vera Rubin platform, the follow-on to Blackwell, is now in volume production. In practical terms, this means the frontier labs are going to have materially more compute to train with in 2026 than they had in 2025, and materially more compute to serve inference on, which matters because agentic workloads burn tokens the way a cargo ship burns diesel.
Post-training is eating pre-training. A lot of the 2025-2026 capability gains did not come from scaling the base model. They came from better reinforcement learning on tool use, better curriculum design, and synthetic-data pipelines. GPT-5.4's computer-use mode was not a function of a bigger model. It was OpenAI training specifically on the task of driving a computer, then iterating. Good news for cost and speed. Bad news: post-training gains are easier to copy than pre-training gains, which is why Chinese labs are closing the gap so quickly.
Mechanistic interpretability is finally getting somewhere. Anthropic's interpretability team has spent two years publishing increasingly precise papers on the features and circuits inside Claude. Not solved, not close. But "black box" is less true than it was, and Anthropic now uses interpretability tools in pre-deployment safety evaluations. This matters because the largest obstacle to deploying AI in serious industries is "we cannot explain why it did that." We are moving, slowly, toward a world where we can.
The honest limits
Now the hard part. The stuff that does not show up in the keynote demos.
Hallucination has not been solved. It has barely moved. On well-scoped summarization tasks, the best models hallucinate 0.7% to 1.5% of the time. That sounds great until you realize that typical enterprise deployments see hallucination rates of 15% to 50%+ on real-world questions, and in legal and medical domains the rates run much higher. Retrieval-augmented generation (RAG) brings those numbers down meaningfully but does not bring them to zero. The business implication is simple and painful: if the cost of being wrong is higher than the cost of being slow, you still need a human in the loop. That is true for frontier models too.
Long-horizon reliability is where agents fall apart. It is hard to build a five-step agent that works 90% of the time. It is shockingly hard to build a twenty-step agent that works 90% of the time: if each step has a 95% success rate, end-to-end reliability is about 36%. The market has been slow to internalize this. Most production deployments solve it with aggressive scoping, heavy retry logic, human-in-the-loop checkpoints, or all three. "I set a goal and the agent does it" still mostly does not survive contact with messy real-world systems.
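The arithmetic behind that 36% figure is worth seeing directly, and it also shows why the checkpoint-heavy pattern wins. A toy calculation: the 95% per-step rate comes from the paragraph above; the 99% reliability of a human-checkpointed segment is an assumption of mine for illustration.

```python
# The compounding arithmetic behind the ~36% figure above, plus a toy
# illustration of why human checkpoints help. The 95% per-step rate is
# from the text; the 99% checkpointed-segment rate is an assumption.

def pipeline_success(per_step: float, steps: int) -> float:
    """End-to-end success when every step must succeed in sequence."""
    return per_step ** steps

print(f"20 steps at 95% each: {pipeline_success(0.95, 20):.0%}")   # ~36%

# Insert a human checkpoint every 5 steps that catches and fixes errors,
# so each 5-step segment effectively succeeds ~99% of the time (assumed):
print(f"4 checkpointed segments: {pipeline_success(0.99, 4):.0%}")  # ~96%
```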
The MIT study everyone is arguing about is not wrong, just incomplete. MIT's Project NANDA found in late 2025 that roughly 95% of enterprise AI pilot projects were producing no measurable return. That headline anchored every bear pitch on Wall Street. What the study actually found was more textured: bespoke internal builds were failing; vendor-delivered systems with workflow integration succeeded about two-thirds of the time; a large "shadow AI" economy of employees using ChatGPT personally was doing fine. The lesson is not "AI does not work." It is "generic AI does not integrate, and without integration, generic AI does not produce measurable P&L impact." The same lesson every prior wave of enterprise software taught.
Evaluation is worse than it looks. Most benchmarks you see in press releases are saturating. SWE-bench scores have climbed so fast the field is rolling out SWE-bench Pro; WebArena became WebArena-Verified; MMLU is effectively dead as a frontier benchmark. Labs know how to train for benchmarks. What is harder to measure is whether a model is good at messy, underspecified, context-dependent work. The gap between "scored 83% on GDPval" and "reliably produced the week's board deck" is the gap that matters, and it does not have a clean benchmark.
The economics: a lot of money, moving in circles
The single most important financial fact about AI in April 2026 is that the customers and the suppliers increasingly live in the same loop.
Anthropic hit a $30 billion revenue run rate this month, up from about $9 billion at the end of 2025. OpenAI is at roughly $24 billion. Those are astonishing numbers; they put both companies in the same revenue league as many Fortune 100 firms, after less than a decade of commercial existence. Eight of the Fortune 10 are Anthropic customers. Over 1,000 companies now spend $1 million-plus a year with them.
But follow the money a little further. OpenAI has signed forward compute commitments of roughly $1.15 trillion through 2035 with Broadcom, Oracle, Microsoft, NVIDIA, AMD, Amazon, and CoreWeave. NVIDIA then invests in OpenAI; OpenAI rents compute from Oracle; Oracle buys hardware from NVIDIA; NVIDIA books that as revenue, the stock goes up, the valuation rises, and the cycle continues. Bloomberg has mapped the deal web and the picture is unambiguous. A significant slice of "AI revenue" is really AI suppliers paying other AI suppliers with money originally raised from equity investors who believe the cycle reflects real end-customer demand. Some of it does. Not all of it.
The comparison I hear most often is the dot-com fiber build-out of 1999. That is imperfect - the data centers are useful, the GPUs are revenue-generating, productivity gains are real in a way the 1999 build-out's were not yet - but it is not crazy. If end-customer demand grows into the capacity, the cycle is healthy. If it does not grow fast enough, the circular revenue unwinds all at once and the whole chain shudders. My honest read: demand is real, but the magnitude of spending is running ahead of the pace of measurable enterprise adoption. Mismatches like that resolve one way or the other.
One more thing worth knowing: the next binding constraint is not money. It is power. Roughly half of the US data center builds planned for 2026 are expected to slip or cancel because the electrical grid cannot support them on the timeline the hyperscalers want. Transformer lead times have stretched out to five years. Hyperscalers are increasingly building dedicated power, including buying failed nuclear plants. The infrastructure story for the next 24 months is less about chips (though those matter) and more about whether the United States can physically energize what has already been ordered.
Geopolitics and policy
Three developments matter here.
First, the money keeps flowing into the US labs, but the model landscape has been decisively de-monopolized by China. DeepSeek V4, Qwen 3.5, Kimi K2.5, GLM-5 - all within striking distance of US frontier performance on most benchmarks, all released with open weights under permissive licenses. Chinese open-weight models reportedly now account for roughly 30% of global model usage on aggregator platforms, up from essentially zero in late 2024. This is not a national-security story anymore; it is a commoditization story. If you are a small business and your AI vendor raises prices, you now have a credible, near-frontier alternative you can host yourself.
Second, US policy decisively shifted toward federal preemption. The Trump administration's December 2025 executive order (since followed by a legislative framework in March 2026) explicitly targets state AI laws the administration considers inconsistent with federal innovation priorities. The practical effect for business owners: fewer compliance surprises in the short term, more concentration of power in federal regulators, and continued uncertainty about what the durable regulatory regime looks like. The EU's AI Act, by contrast, moves into broader enforcement in August 2026, with penalty provisions for general-purpose AI model providers taking effect. If you sell into Europe, you should already be talking to your lawyer. If you sell only domestically in the US, your 2026 exposure is lower than it was a year ago.
Third, and this gets the least press: model providers themselves are becoming regulatory entities. Anthropic's Acceptable Use Policy, OpenAI's content rules, and Google's safety filters collectively determine what tens of millions of businesses can and cannot do with AI. When Anthropic added cybersecurity-specific safeguards to Opus 4.7, that was not a government action. It was a private-company policy change that reshaped what legitimate security firms can and cannot ask the model to do. I do not have an opinion on whether this is good or bad. I do have an opinion that business owners should understand: your product's capabilities are a function of your model vendor's policies, and those can change with a blog post.
What this means for humans
The honest answer about jobs is: we do not know yet, but the first signals are in, and they are mixed.
Goldman Sachs analysis cited by multiple outlets suggests AI has contributed to a net reduction of roughly 16,000 US jobs per month since mid-2025, concentrated in routine office work. At the same time, the aggregate labor market keeps adding jobs, primarily in healthcare and construction. Entry-level white-collar hiring is visibly softer than mid-career hiring, which is consistent with AI being a closer substitute for the work a twenty-two-year-old would otherwise do than the work a forty-two-year-old does. McKinsey surveys suggest roughly a third of companies expect workforce reductions of 3% or more within the next year attributable in part to AI. The World Economic Forum's 2025 projections suggested net job creation through 2030, though with massive internal reshuffling.
My read: the aggregate labor numbers in 2026 will not look like a crisis. The distribution will look uncomfortable. Certain roles - customer service, basic contract review, entry-level research, routine code - are seeing real wage and hiring pressure now. Others - skilled trades, senior judgment work, work that requires physical presence, work that requires legal or fiduciary accountability - are basically untouched. If you are a business owner, the important practical fact is that you can now, today, reasonably get the output of a junior employee from an AI for about 1-5% of the cost. Whether you want to take that trade is a separate question. Those outputs still require supervision, and they do not build the pipeline of senior judgment you will need in ten years.
On education: the actual risk is not cheating. It is that students who lean on AI through school come out with strong "prompting" skills and weak underlying comprehension. The students who use AI as a tutor (I have watched my own nephew do this) are learning faster than any cohort before them. The students who use it as a substitute for thinking are hollowing out. The gap between the two is widening, and schools are, in my experience, mostly not equipped to tell the difference.
On trust: this is the sleeper issue. AI will be right often enough to be useful, wrong often enough that high-stakes use without verification will occasionally produce disasters. The first few of those will shape public opinion for a decade. Anyone deploying AI in high-consequence environments - medicine, law, finance, critical infrastructure - is placing a bet on their verification process, not on the model.
What I think happens next
I am going to hold these loosely, because the last two years should have taught everyone some humility about AI forecasting. But here is how I see the next 12 to 24 months.
Capability gains continue but start feeling less dramatic. We will get Opus 5, GPT-6, Gemini 4. They will be better. The public perception will increasingly be "still hallucinates, still can't reliably do my job," because the gap between benchmark progress and felt reliability progress will widen. The median user will stop noticing version bumps. Power users will keep noticing because their use cases sit closer to the frontier.
Agentic workflows become real in narrow verticals. Coding, customer support, research, and specific analytical workflows will see production AI agents that actually work 95%+ reliably inside tight scopes. General-purpose "AI employees" will continue to disappoint relative to marketing. The successful pattern will look much more like "well-designed automation with AI steps inside it" than "autonomous agent that figures it out." I think the field will stop using the word "agent" so loosely by 2027.
Some version of an AI capex correction arrives. Not a 2000-style crater, because the underlying demand is real and the assets are productive. But the 2026 spending pace is out ahead of the 2026 revenue pace, and the market generally notices that kind of thing within eighteen months. A hyperscaler or two will slow capex growth, the cycle will partially unwind, and the private AI lab valuations that are priced for 100x growth will reprice to something closer to 30x. That would be healthy. My confidence on this: moderate.
The regulatory regime splinters further. The EU will enforce under the AI Act; the US will continue federal preemption with relatively light-touch rules; China will continue state-directed development. The biggest practical consequence for business owners is that AI compliance will increasingly look like tax: a cost of doing business that varies by jurisdiction, with specialist advisors needed to navigate it. The idea of a single global AI regulatory regime is, I think, done.
Interpretability and reliability, not raw capability, become the competitive edge. This is my strongest take. The lab that ships a model that is less capable on benchmarks but more reliable end-to-end on real work - fewer hallucinations, more honest refusals, better calibration on its own uncertainty - will outsell the lab that leads on raw capability. Enterprises will pay a premium for "boring" AI. Anthropic has been positioning for this for two years. I expect it to be vindicated.
A practical roadmap for the business owner reading this
If you are running a business and you want to make one concrete set of decisions out of all of this, here is mine.
Start small and real. Pick one repetitive, low-stakes task that you actually do every week. Build an AI workflow for it. You will learn more from shipping one imperfect automation than from reading fifty industry reports. Our Start Here page is the landing page for exactly that, and this list of use cases gives you concrete starting points.
Do not build anything that cannot tolerate occasional wrongness. Or, if you must, build verification in as a first-class feature rather than an afterthought. Human-in-the-loop is not a limitation; in April 2026 it is the correct architecture for most commercial AI deployments.
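For concreteness, here is one minimal shape that "verification as a first-class feature" can take. It is a sketch under my own assumptions, not any vendor's API; the names and the 0.9 threshold are illustrative.

```python
# A minimal human-in-the-loop shape: the AI drafts, a cheap automatic
# check screens, and anything uncertain or high-stakes is routed to a
# person before it reaches a customer. Names and the threshold are
# illustrative assumptions, not a specific vendor's API.
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float   # a checker's estimate of correctness, 0..1
    high_stakes: bool   # e.g. touches money, legal language, or PII

def handle(draft: Draft, send, queue_for_review) -> None:
    if draft.high_stakes or draft.confidence < 0.9:
        queue_for_review(draft)   # a human approves, edits, or rejects
    else:
        send(draft)               # low-stakes and high-confidence: ship it

# Usage (send_email and add_to_review_queue are whatever your stack uses):
# handle(Draft("Refund approved...", 0.72, high_stakes=True),
#        send=send_email, queue_for_review=add_to_review_queue)
```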
Do not lock yourself into a single vendor more than you have to. The model market has gotten genuinely competitive over the last year. The frontier labs are within a two-to-three-month update cycle of one another on capability. Open-weight models are within an order of magnitude and sometimes much closer. Building abstraction layers into your stack so you can swap models is, right now, one of the highest-leverage architectural decisions you can make. Our coverage of the OpenClaw concept layer and of NemoClaw's security architecture comes at this from two different angles.
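In practice the abstraction layer can be very thin. A minimal sketch, with class and function names that are mine rather than any particular SDK's; the point is that the rest of your code only ever calls complete().

```python
# A thin model-abstraction layer: the rest of the codebase calls
# complete() and never imports a vendor SDK directly, so swapping
# providers becomes a config change rather than a rewrite. The
# provider classes are stubs; wire them to whichever SDKs you use.
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 1024) -> str: ...

class AnthropicProvider(ModelProvider):
    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        raise NotImplementedError("call the Anthropic SDK here")

class OpenWeightProvider(ModelProvider):
    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        raise NotImplementedError("call your self-hosted endpoint here")

def get_provider(name: str) -> ModelProvider:
    return {"anthropic": AnthropicProvider,
            "open_weight": OpenWeightProvider}[name]()

# Swapping vendors is now: provider = get_provider(config["model_vendor"])
```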
Do not sit out. The second-order costs of being the last business in your category to adopt will, in my best estimate, exceed the first-order costs of adopting imperfectly right now. Not by a lot, but by enough. Shipping something real this quarter, even a mediocre something, buys you learning that your competitors who waited will have to pay for later.
And keep your eyes on reliability, not on headline capability. When you read about a new model, ask: "what is the reliability number on a task I actually care about?" If that number is not cited, assume it is worse than the capability number you are being shown.
The bottom line
Here is what I think the story of April 2026 is, in one paragraph.
The models can do breathtaking things. The models cannot reliably do most of those things alone, for most business purposes, without a human checking their work. Hundreds of billions of dollars are being spent on infrastructure that assumes the gap between capability and reliability will close faster than it is actually closing. Some of that capital will earn huge returns. Some of it will be written off. The winners of the next two years will be the operators who build the right verification and integration around imperfect AI, not the operators who waited for perfect AI or the operators who trusted AI to be perfect. If you are a thoughtful business owner, your job right now is to stop reading about it and build one thing.
That is the whole picture, as honestly as I know how to tell it.