Namsang LABS
Radar · #ai #ai-engineering #harness-engineering #conference #agent

AI Engineer Europe 2026 — Code Became Free, but Quality Got More Expensive

Sangkyoon Nam

From April 8 to 10, AI Engineer Europe 2026 debuted in London with over 100 talks, 23 workshops, and 11 tracks. It was the widest lens available on where AI engineering stands right now. The message running through the conference was “Models are good enough. You are not ready.” In other words, the models are sufficient; it is the organizations and environments around them that lag. Here are five inflection points that emerged from the key talks.

#1. “Code is free” vs “Bad code is the most expensive it’s ever been”

Opposing claims came from the same stage.

OpenAI’s Ryan Lopopolo barred his team from touching their editors for nine months, having them build software using only agents.

“Code is free. We have an abundance of code to solve the problems that we come across in our day-to-day.”

— Ryan Lopopolo, OpenAI

Lopopolo’s argument: since GPT 5.2, models can perform the full job of a software engineer, so implementation is no longer a scarce resource. What remains scarce has narrowed to three things: human time, human and model attention, and the model’s context window. In the past, P0, P1, and P2 work got handled first and P3 was deferred forever. Now all four priority levels run in parallel and you pick the best result. The engineer’s role shifts from implementation to systems thinking, design, and delegation.

That same afternoon, TypeScript educator Matt Pocock pushed back.

“I don’t think this is right. I think code is not cheap. In fact, bad code is the most expensive it’s ever been.”

— Matt Pocock

Pocock’s point: AI works far better on good codebases. As the cost of producing code converges to zero, the value gap created by quality widens sharply. Deep modules, stable boundaries, and clear contracts used to be “nice to have” virtues; in the AI era they are competitive variables.

The two claims aren’t contradictory — they’re two sides of the same coin. The cost of producing code converges to zero, while the value of code quality actually goes up.

Related talks followed. Pi’s Mario Zechner said “agents don’t feel pain.” Human developer discomfort served as a quality feedback loop, but agents skip that pain entirely. Flask creator Armin Ronacher said “Friction is your judgment.” Without friction, you can’t steer. Linear CTO Tuomas Artman introduced Quality Wednesday and a zero-bug policy, noting that “35 quality issues came out of a single small menu.”

It was striking to watch people who build agents telling everyone else to slow down.

#2. Harness Engineering — A New Axis of Differentiation

A term crystallized in Ryan Lopopolo’s keynote: Harness Engineering. Designing the environment (harness) in which agents operate matters more than the agents themselves in determining real-world success.

Numbers came with it. Lopopolo’s team of 7 built a 1-million-line codebase over 5 months. Human-written code: 0%. Human-reviewed code: also 0%. Daily token consumption: 1 billion. 1,500 PRs were generated and merged by agents. What made this scale possible was harness design.

There’s a lineage to this trend.

  • Vibe Coding (2025, Karpathy) — describe in natural language, get code
  • Agentic Engineering (early 2026) — orchestrate agents
  • Harness Engineering (2026.04, Lopopolo) — design the environment where agents operate

Concrete advice was shared. Make your codebase legible to agents: ADRs, persona-oriented docs, historical tickets, and code review logs. Every record that trained human engineers becomes a path agents can follow. Because the context window is scarce, use the same patterns consistently, so the model spends less attention recognizing them. Large-scale refactoring now costs almost nothing: migrations that dragged on for six months can be handled by 15 agents running simultaneously.

Claude Code was frequently cited as an example. The terminal is the execution environment, the filesystem is context, git worktree is isolation, hooks are guardrails. The harness isn’t the model — it’s this entire environment. OpenAI Codex and Cursor share the same structure, but because each harness differs, the user experience differs too.
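One way to picture the four harness pieces named above is as plain data plus a guardrail check. The sketch below is illustrative only: `Harness`, `Hook`, and `block_force_push` are hypothetical names invented for this example, not Claude Code's actual configuration format or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A hook inspects a proposed shell command and returns a reason to block it,
# or None to allow it. (Hypothetical shape, for illustration only.)
Hook = Callable[[str], Optional[str]]


def block_force_push(command: str) -> Optional[str]:
    """Example guardrail: refuse force pushes."""
    if "git push" in command and "--force" in command:
        return "force push is not allowed"
    return None


@dataclass
class Harness:
    execution_env: str   # where commands run, e.g. a terminal
    context_root: str    # filesystem directory the agent reads as context
    worktree: str        # isolated git worktree for this agent
    hooks: List[Hook] = field(default_factory=list)  # guardrails

    def check(self, command: str) -> Optional[str]:
        """Return the first blocking reason, or None if every hook allows it."""
        for hook in self.hooks:
            reason = hook(command)
            if reason is not None:
                return reason
        return None


harness = Harness(
    execution_env="terminal",
    context_root="./repo",
    worktree="./worktrees/agent-1",
    hooks=[block_force_push],
)
print(harness.check("git push --force origin main"))
print(harness.check("git status"))
```

The point of the sketch is that none of these fields is the model: swapping the model leaves the harness unchanged, and changing the harness changes what the same model is allowed to do.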

As model performance converges, differentiation shifts to the harness. This also explains why organizations that just added ChatGPT Enterprise seats without designing a harness aren’t seeing results.

#3. Agent-Native Infrastructure Is Already Here

Vercel CTO Malte Ubl opened his keynote with a single number: over 60% of Vercel’s website visitors are AI agents.

“There was always all this stuff we wanted to automate, but not all of it was economically viable to do with traditional software. But it is with agents.”

— Malte Ubl, Vercel CTO

The result: we’ve reached the point where agents are both creating and consuming software.

The domains agents are entering extend well beyond web infrastructure. GenCast, introduced by DeepMind’s Raia Hadsell, was reported to beat physics-based weather models on 97% of evaluated forecast targets. AI is moving into a domain that numerical weather prediction has held for 65 years.

Peter Steinberger of OpenClaw, an open-source coding agent, showed the other side of this reality. Alongside surging installations, 1,142 security advisories piled up — 16.6 per day. Many of them were AI-generated slop reports. Agent-created noise pours in at the same rate agents contribute to open source. The message: security and governance need to be redesigned.

#4. The Token Economy Evolves — Routing and Code Mode

The question is shifting from “which model to use” to “how to mix multiple models.”

Anthropic shared data on two combinations. A Haiku + Opus mix more than doubled BrowseComp scores. Low-cost Haiku sweeps the web, and Opus only steps in at difficult judgment points. A Sonnet + Opus combination optimized both performance and cost on SWE-bench Multilingual.

This pattern is called Cheap Executor + Expensive Advisor. Execution is cheap, advice is expensive. It’s the point where cost optimization and performance optimization align.
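A minimal sketch of the Cheap Executor + Expensive Advisor idea follows. The post does not describe how escalation is decided, so the confidence threshold here is an assumption, and both "models" are stand-in functions rather than real API calls.

```python
from typing import Callable, Tuple

# A model is any callable from prompt to (answer, confidence in [0, 1]).
# Both models below are stand-ins; a real system would call an LLM API.
Model = Callable[[str], Tuple[str, float]]


def cheap_executor(prompt: str) -> Tuple[str, float]:
    # Pretend the cheap model is unsure about anything marked "hard".
    confidence = 0.3 if "hard" in prompt else 0.9
    return f"cheap:{prompt}", confidence


def expensive_advisor(prompt: str) -> Tuple[str, float]:
    return f"expensive:{prompt}", 0.95


def route(
    prompt: str,
    executor: Model,
    advisor: Model,
    threshold: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> Tuple[str, str]:
    """Run the cheap model first; escalate only when its confidence is low."""
    answer, confidence = executor(prompt)
    if confidence >= threshold:
        return answer, "executor"
    answer, _ = advisor(prompt)
    return answer, "advisor"


for prompt in ["easy lookup", "hard judgment call"]:
    print(route(prompt, cheap_executor, expensive_advisor))
```

Because the expensive model is only invoked on the low-confidence branch, cost scales with the share of difficult steps rather than with total volume.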

A more radical approach also emerged. Code Mode, introduced by Sunil Pai, has agents generate and execute code directly instead of calling tools. On a typical task, 1.2M tokens dropped to 1K. A 99.9% reduction.

“It stopped generating a program and it instead started inhabiting the state machine.”

— Sunil Pai

An approach that could shake MCP’s standing once again.
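The reduction figure quoted for Code Mode follows directly from the two token counts. The check below assumes "1.2M" means 1,200,000 tokens and "1K" means 1,000 tokens; the function name is made up for this example.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage of tokens saved when usage drops from `before` to `after`."""
    return (before - after) / before * 100


saved = reduction_percent(1_200_000, 1_000)
factor = 1_200_000 // 1_000  # how many times fewer tokens are used

print(f"{saved:.1f}% fewer tokens, a {factor}x reduction")
# prints: 99.9% fewer tokens, a 1200x reduction
```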

The opposing warning was also present. Gergely Orosz of The Pragmatic Engineer called out “token maxing.” Big tech companies like Meta and Microsoft have started measuring developer productivity as “how many tokens were consumed.” Using more doesn’t mean doing better, and when the metric becomes the goal, tokens just get wasted. It is the AI version of Goodhart’s Law.

#5. Benchmark Reality Check — ClawBench 6.5%

ClawBench, presented by Peter Gostev (Arena), evaluates agents on 153 real online tasks.

  • Existing sandbox benchmark accuracy: 70%
  • Real website accuracy: 6.5%

Single digits. The gap between how well agents perform in toy environments versus the real web was laid bare in numbers. The same talk showed that “even when top models compete head-to-head, 9% of the time both sides are dissatisfied.” Benchmark scores going up doesn’t mean user satisfaction follows.
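To make the gap concrete, the two headline figures can be turned into a task count and a ratio. This is a back-of-the-envelope sketch: it assumes the 6.5% was measured over all 153 tasks, which the numbers above imply but do not state outright, and the 70% comes from a different (sandbox) benchmark.

```python
TASKS = 153
real_web_accuracy = 0.065
sandbox_accuracy = 0.70

# Roughly how many of the 153 real tasks were completed.
solved_on_real_web = round(TASKS * real_web_accuracy)

# How many times higher the sandbox score is than the real-web score.
gap_ratio = sandbox_accuracy / real_web_accuracy

print(f"about {solved_on_real_web} of {TASKS} tasks; sandbox scores are {gap_ratio:.1f}x higher")
# prints: about 10 of 153 tasks; sandbox scores are 10.8x higher
```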

Evidence from the other direction also exists. On the MirrorCode benchmark, Claude Opus 4.6 rebuilt a 16,000-line bioinformatics toolkit from scratch. Estimated human effort: several weeks.

Dominant where it excels, dismal where it doesn’t. The era of talking in averages is ending, replaced by the era of mapping “where does it excel.”

#Wrapping Up

Looking across the five inflection points, one direction emerges. Differentiation is shifting from models to environments. Model selection is becoming routing design; agent implementation is becoming harness design; benchmark trust is becoming domain-specific real-world validation.

For organizations considering agent adoption, the Progressive Autonomy framework shared at the conference is a useful reference. Start with Shadow Mode, progress through Advisory and Controlled Autonomy, then reach Expanded Autonomy. Don’t deploy at full autonomy from day one — build evidence as you move up the levels.
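The four levels can be sketched as an ordered enum with an evidence gate between steps. Only the level names come from the framework as described above; the one-line glosses in the comments, the `next_level` helper, and the trial and success-rate thresholds are assumptions made for illustration.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    # Glosses are this sketch's interpretation, not definitions from the talk.
    SHADOW = 0      # agent runs alongside humans; its output is not acted on
    ADVISORY = 1    # agent suggests; a human decides
    CONTROLLED = 2  # agent acts within approved limits
    EXPANDED = 3    # agent acts across a wider scope


def next_level(
    level: AutonomyLevel,
    successes: int,
    trials: int,
    min_trials: int = 50,    # hypothetical evidence requirement
    min_rate: float = 0.95,  # hypothetical success-rate requirement
) -> AutonomyLevel:
    """Promote by exactly one level, and only when enough evidence exists."""
    if level == AutonomyLevel.EXPANDED:
        return level
    if trials >= min_trials and successes / trials >= min_rate:
        return AutonomyLevel(level + 1)
    return level


print(next_level(AutonomyLevel.SHADOW, successes=48, trials=50).name)
print(next_level(AutonomyLevel.SHADOW, successes=10, trials=10).name)
```

The gate never skips a level, which mirrors the advice not to deploy at full autonomy from day one.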

And multiple speakers said the same thing: since agents handle speed, we invest in quality. The observation is that the era when speed and quality were a trade-off is ending.

#Talk Guide

Twenty notable talks, curated.

#Day 1 (4/9) — Keynotes & OpenClaw

| Speaker | Affiliation | Title / One-liner | Video |
| --- | --- | --- | --- |
| Malte Ubl | Vercel CTO | The New Application Layer: 60% of Vercel traffic is AI agents | YouTube |
| Raia Hadsell | Google DeepMind VP | Frontier AI: GenCast surpasses physics simulations at 97% accuracy | YouTube |
| Ryan Lopopolo | OpenAI | Harness Engineering: Code is free, what’s scarce is human time | YouTube |
| Peter Steinberger | OpenAI | OpenClaw Update: 1,142 security advisories and AI slop reports | YouTube |
| Vincent Koc | Comet ML | Dark Factory: 3,000 commits a day via parallel agent management | YouTube |
| Maggie Appleton | GitHub Next | One Developer, Two Dozen Agents, Zero Alignment | |
| Radek Sienkiewicz | VelvetShark | Handing agents the keys to your life: 3,000 Obsidian notes as a knowledge base | YouTube |
| Gergely Orosz | Pragmatic Engineer | Token Maxing: Big tech’s new Goodhart’s Law | YouTube |
| Matt Pocock | TypeScript educator | Bad code is the most expensive it’s ever been | YouTube |
| Sunil Pai | | Code Mode: Generate code instead of tool calls, 1.2M tokens to 1K | YouTube |

#Day 2 (4/10) — MCP, Quality, Agent Orchestration

| Speaker | Affiliation | Title / One-liner | Video |
| --- | --- | --- | --- |
| Omar Sanseviero | Google DeepMind | Gemma 4: DeepMind’s open model family | |
| David Soria Parra | Anthropic (MCP creator) | The Future of MCP: Progressive Discovery and MCP Apps | YouTube |
| Ido Salomon | MCP Apps | AgentCraft: Putting the Orc in Agent Orchestration | |
| Mario Zechner | Pi creator | Building Pi in a World of Slop: Agents don’t feel pain | YouTube |
| Armin Ronacher | Flask creator | The Friction Is Your Judgment: You can’t steer without friction | YouTube |
| | Cursor | Replacing 12,000 lines of code with 200 lines of markdown skills | YouTube |
| Luke | Factory | Orchestrator-Worker-Validator: A 16-day autonomous mission system | YouTube |
| Sarah Chang | Cerebras | Fast models make verification free: The 1,200 tok/s era | YouTube |
| Tuomas Artman | Linear CTO | Quality Wednesday and zero-bug policy: 35 issues from a single small menu | YouTube |
| Peter Gostev | Arena | ClawBench: Sandbox 70% vs real web 6.5% | YouTube |
