Engineering notes from the agent era

Most posts here are about what changes when the agent becomes a first-class collaborator: local code intelligence, testing systems that analyze behavior over time, knowledge tooling that compounds. The rest of the career (cloud, low-latency, mobile) shows up when something's worth writing down.

An editorial illustration on warm cream paper showing three concentric ink-line rectangular frames nested inside one another like a stepped box-in-box diagram. The outermost frame is labelled microVM boundary and carries a small chip glyph in its corner, the middle frame is labelled OS sandbox boundary, and the innermost frame is labelled git worktree with a small git-branch glyph, holding one tiny contained robot figure marked with a code glyph. A small all-caps serif title reading SANDBOXING CODING AGENTS sits in the upper left.

Jun 23, 2026 · 23 min read

Sandboxing Coding Agents: The 9-Second Argument for Isolation

A Cursor agent deleted PocketOS's prod database and backups in 9 seconds. Prompt rules aren't a boundary; the kernel is. An isolation ladder for coding agents.

An editorial illustration on warm cream paper showing a single markdown document, marked with a small hash glyph, resting on one pan of an old balance scale while the other pan holds a short stack of coin-like token discs. The two pans sit almost level with the document side dipping only slightly, used as a visual metaphor for whether a CLAUDE.md context file earns the tokens it costs. A small all-caps serif title reading DOES YOUR CLAUDE.MD HELP sits in the upper left, and one thin ink-blue line runs along the bottom margin.

Jun 20 · 21 min read

Does Your CLAUDE.md Actually Help? The Research Says Maybe Not

An ETH Zurich study found context files cut coding-agent success up to 2% and raised cost 20-23%. Field data disagrees. When your CLAUDE.md earns its tokens.

Claude Code
AGENTS.md
context engineering
AI agents
developer productivity

An editorial illustration on warm cream paper showing three labelled arrows emerging from a single origin point in the lower-left third of the frame. One arrow points up to a small black-ink icon of a hand at a keyboard and is labelled vertical. One arrow points right to three small stylised agent silhouettes in a row and is labelled horizontal. A third arrow rendered with a slight angled projection to suggest depth points outward to a fresh blank page labelled temporal session, and is the only element rendered in ink-blue. A small all-caps serif title reads THE SESSION HANDOFF in the upper-left margin and a thin ink-blue line at the bottom reads ATTENTION BUDGET, used as a visual metaphor for the three orthogonal axes of agent handoff and the temporal cut this post argues for.

Jun 18 · 28 min read

The Session Handoff: When Your Attention Budget Is Spent

Anthropic frames context as an attention budget. Lost-in-the-middle costs 30%+ retrieval accuracy. Three axes of handoff, five triggers, one cut.

AI agents
context engineering
Claude Code
session handoff
attention

An editorial illustration on warm cream paper showing a stylised pager device in the centre of the frame. The pager's dial is divided into five wedge-shaped segments, each carrying a single black-ink icon for one of the five SRE primitives applied to coding agents: a circle for blast radius, a gauge for confidence threshold, a clipboard for runbook, an arrow for escalation, and a magnifying glass for postmortem. A small all-caps title reads AGENT SRE in the upper-left margin. A thin ink-blue horizontal line runs across the bottom of the frame labelled ONCALL FOR AUTONOMOUS CODING AGENTS, used as a visual metaphor for borrowing SRE doctrine for the coding-agent fleet.

Jun 13 · 27 min read

Agent SRE: Oncall and Escalation for Coding Agents

The incident behind Anthropic's Apr 23, 2026 postmortem took 6 weeks to detect. Five SRE primitives map to Claude Code hooks and turn agent oncall into a shipped artifact.

AI agents
SRE
oncall
Claude Code
hooks

An editorial schematic on warm cream paper showing four labelled rungs stacked top to bottom: a wavy prompt rung at the top in faded ink, a structured skill rung below it with a routing-description tag, a deterministic-gated hook rung with an exit-2 stamp, and a fully-deterministic tool rung at the bottom drawn as a sharp-edged callable; an arrow on the right side travels in both directions to show promotion and demotion, used as a metaphor for the named four-rung architecture that ties skills, hooks, and MCP together

Jun 11 · 34 min read

The Promotion Ladder: Prompt, Skill, Hook, Tool

Four rungs trade flexibility for determinism. AGENTIF puts the best model at under 30% adherence; a well-designed schema cuts MCP cost 99.9%. Match the rung.

skills
hooks
mcp
agents
claude-code

An editorial schematic on warm cream paper showing a single agent session timeline running left to right, four numbered cue markers firing along it at separate moments, and a human hand glyph reaching down to a keyboard at the second cue, with the unused right half of the timeline labelled reclaim, used as a metaphor for the keyboard-reclaim decision as a per-task confidence interval

Jun 7 · 33 min read

The Handoff Problem: When to Take the Keyboard Back

Developers use AI in 60% of work but fully delegate only 0 to 20% of tasks. Four named cues for the moment you reclaim: drift, scope, novel error, 80%.

autonomy
agents
claude-code
AI pair programming
agentic coding

An editorial schematic on warm cream paper showing three parallel agent-session traces: the top trace ends at a muted red square labelled restart, the middle trace continues with an ochre arrow labelled resume, and the bottom trace forks at a checkpoint glyph into a pinned suspended branch and an ink-blue corrected branch that reaches a crosshair goal target labelled fork, used as a metaphor for the three recovery paths after a drift threshold breach

Jun 5 · 31 min read

Long-Running Autonomous Agents: Drift, Checkpointing, Recovery

METR pegs Opus 4.6 at a 14.5-hour 50%-time-horizon. Pass@1 collapses 24 points on long tasks. Drift as eval, checkpoint on threshold, fork not restart.

AI agents
Claude Code
agent reliability
long-horizon agents
agentic coding

An editorial illustration of a tall stable stack of identical paper folders sitting beneath a thin band of fluttering loose pages, with a cache-breakpoint line drawn between them, used as a metaphor for the stable prefix and ephemeral suffix discipline of cache-aware prompting

Jun 3 · 23 min read

Cache-Aware Prompting: Engineering for 90%+ Hit Rate

ProjectDiscovery moved from 7% to 84% cache hit rate without changing the model. The discipline, the workload taxonomy, the five named failure modes.

AI engineering
prompt engineering
developer productivity
Anthropic
cost optimization

An editorial illustration contrasting a quiet single-meter desk gauge labelled ccusage on the left with a 4 by 3 grid of team-level meters on the right where one runs into a red over-budget zone, used as a metaphor for personal versus team-tier AI cost observability

Jun 1 · 23 min read

Agent Cost Observability: From Personal Token Budget to Team-Wide

98% of FinOps teams now manage AI spend, up from 31% in 2024. Solo ccusage solved one developer; here is the three-axis, two-cap recipe for 200.

FinOps
AI engineering
developer productivity
observability
platform engineering

An editorial illustration on warm cream paper showing on the left a row of small generic boxes labelled tool one and tool two and tool three each tagged with the words DOES X and on the right a smaller set of differently shaped index cards labelled PURPOSE and WHEN TO USE and INPUTS and OUTPUTS and EXAMPLES and ANTI-EXAMPLES with neat tabs along their edges. A horizontal arrow between the two halves is labelled SCHEMA IS THE PROMPT. The image is used as a metaphor for tool descriptions functioning as prompt engineering for AI coding agents rather than as docstrings written for humans.

May 31 · 22 min read

Tool Design for Agents: Schema Is the Prompt

97% of MCP tool descriptions have at least one code smell. 56% fail to state purpose. The description is the prompt your agent reads to pick a tool. Here is the rubric.

AI agents
MCP
developer tools
prompt engineering
platform engineering

An editorial illustration on warm cream paper showing on the left side a folder tree drawn as nested boxes with arrows that all point downward through child directories and end at flat rectangles labelled STRING and STRING and STRING, and on the right side a graph of small circular nodes labelled symbol and type and module and commit and owner connected by short curved edges labelled CALLS and REFERENCES and DECLARED-IN and LAST-TOUCHED-BY, an arrow between the two halves labelled PROJECT GRAPH, used as a metaphor for the filesystem abstraction versus the graph abstraction that coding agents actually need.

May 30 · 29 min read

The Project Graph: What Agents Need That Filesystems Can't Give

40 questions, two large repos, three LLM judges. Code-intelligence: judge 7.12 vs default 6.30 (+0.82), 29% faster, +8% tokens. Cites 50% vs CodeGraph's 32%.

AI agents
developer tools
MCP
code intelligence
platform engineering

An editorial illustration on warm cream paper showing five tall narrow lanes side by side under a small all-caps title reading FIVE FAILURE MODES OF AUTONOMOUS CODING AGENTS. The lanes are labelled left to right CONTEXT BLEED, SCOPE CREEP, SILENT COMPLETION, CASCADE ERROR, MODEL DRIFT. Each lane contains a single black-ink icon for its detection substrate: a clipboard with a checkmark for SubagentStop hook, a clipboard with a checkmark for PostToolUse hook, a circular checkmark seal for eval, a clipboard with a stop-hand for PreToolUse hook, and an analog clock face for the daily golden-prompt eval. A single thin ink-blue horizontal line runs across all five lanes and is labelled RETRO TEMPLATE on the right margin, used as a visual metaphor for the failure-mode taxonomy and its one-page incident retrospective.

May 28 · 28 min read

The Five Failure Modes of Autonomous Coding Agents

Five named failure modes for autonomous coding agents, each with a real 2026 incident, a detection signal, and a retro template you can drop into CLAUDE.md today.

AI agents
incident response
Claude Code
evals
hooks

An editorial split-frame illustration: a bright, friendly desk with a laptop running a finished app on the left labelled localhost, separated by a thin vertical line from a darker server room with a billing alarm light glowing red on the right labelled production, used as a metaphor for the gap between an AI-built MVP and a deployed production application.

May 27 · 29 min read

From Localhost to Production: The Handoff Brief for AI-Built Apps

45% of AI-generated code ships OWASP vulns. 380K vibe-coded apps public right now. The seven-gap handoff brief for builders and engineers.

AI engineering
vibe coding
production
security
developer experience

An editorial still-life of a three-tier workshop tableau on aged paper: a wooden letterpress typecase drawer of hundreds of identical metal sorts at the base, a small row of pre-cast composite component blocks in the middle, and a single completed galley page on top marked with handwritten proofreader's annotations, used as a metaphor for the layered substrate beneath AI-generated UI: Tailwind atoms, shadcn primitives, and AI-edited compositions

May 25 · 23 min read

UI Libraries vs AI-Generated Components: The Tailwind Substrate

Tailwind 51%, v0 4M users, shadcn passed Chakra. The library-vs-AI debate is the wrong frame: substrate placement is the right one. The four-quadrant framework.

frontend
AI engineering
Tailwind
shadcn
developer productivity

An editorial illustration contrasting a wall of identical generic boxes labelled API mirror with a smaller set of differently-shaped purpose-built tools labelled tool shape, used as a metaphor for two opposing approaches to MCP server design

May 23 · 24 min read

MCP Server for Your Codebase: Tool-Shape, Not API-Mirror

Cloudflare's first MCP server would have eaten 1.17M input tokens. Their redesign got it to roughly 1,000. Here is the framework, applied to a codebase server.

MCP
model context protocol
AI engineering
developer tools
platform engineering

A schematic gate-valve illustration with four wavy probabilistic input pipes labelled CLAUDE.md, skill, subagent, and memory converging into a hard-edged gate body marked exit 2 equals stop, then exiting as a single deterministic line, used as a metaphor for hooks as the only deterministic substrate in Claude Code

May 21 · 20 min read

Claude Code Hooks: The Only Deterministic Substrate

The best frontier model follows under 30% of agentic instructions perfectly. Hooks run as code on every matched event regardless. Here is the substrate map.

Claude Code
AI agents
hooks
policy enforcement
agentic coding

A test bench with four labelled assertion gauges (skill trigger, subagent spawn, hook firing, MCP reachability) wired to a Claude Code session, used as a metaphor for control-plane evals against agentic machinery

May 19 · 22 min read

Agent Evals: A Test Suite for Your Claude Code Setup

Observability says what happened. Evals say if the right thing happened. 89% ship the first, 52% the second. Four control-plane evals for Claude Code.

Claude Code
AI agents
evals
agentic coding
developer productivity

A worn approval prompt gives way to structured policy gates for AI coding agent permissions

May 19 · 17 min read

The Permission Prompt Is Dying in AI Coding Agents

Claude Code users approve 93% of prompts. For AI coding agents, prompt walls failed as governance; safety is policy: allow, gate, block, log.

Claude Code
AI agents
permissions
policy enforcement
agentic coding

An editorial illustration on warm cream paper showing on the left side a row of five identical small server boxes each carrying a heavy duplicated cube labelled MODEL, stacked in a fragile column with bandages and a rope labelled LEADER ELECTION tying two of them together, and on the right side a single tall daemon box on a wide stable base with one cube of the same size feeding many thin pipes that fan out to five subagent silhouettes, an arrow between the two halves labelled V4 PIVOT, used as a metaphor for the stdio per-subprocess MCP model versus a single shared daemon

May 18 · 26 min read

Stdio MCP Doesn't Scale: Dropping 3,662 Lines for a Daemon

Five subagents across three repos loaded 2.6 GB of duplicated embedding models. v4 deleted the stdio path; the daemon shares everything. Here is the migration.

MCP
model context protocol
developer tools
AI engineering
platform engineering

Top-down photo of an open notebook spread. The left page is titled Mythos Audit May 11 2026 and shows five handwritten entries with four of them crossed out and one tagged low. The right page is titled AI-Coded CVE Log Nov 2025 to May 2026 and lists CVE identifiers with their CVSS scores. A red mechanical pencil rests across both pages along a thin red horizontal rule, used as a metaphor for the dual-angle read on AI as both bug-finder and bug-creator.

May 16 · 22 min read

Claude Mythos vs. the CVE Surge: AI Security in May 2026

On May 11 curl's Daniel Stenberg called Anthropic's Mythos report mostly marketing. The same six months delivered the CurXecute RCE, the Claude Code chain, and a 35-CVE March.

AI security
CVE
Claude
Copilot
AppSec
developer productivity

A workshop wall with four labelled drawers (working, episodic, semantic, procedural) feeding a Claude Code session, used as a metaphor for the four-memory architecture mapped onto MEMORY.md, JSONL transcripts, the wiki, and skills

May 15 · 22 min read

Agent Memory Architecture: Four Memories, Four Fixes

200K-context models rot by 50K tokens. Coding agents hit 150K in 35 minutes. Map four memories onto Claude Code: MEMORY.md, .remember, JSONL, skills.

Claude Code
AI agents
memory architecture
context engineering
agentic coding

An editorial illustration of two pools connected by a pipe: the larger left-hand pool is labelled Subscription Claude Code (interactive); the smaller right-hand pool is labelled Programmatic Credit at API rates; a vertical red bar labelled June 15 2026 severs the pipe between them, used as a metaphor for the split between interactive and programmatic Claude usage.

May 15 · 21 min read

Anthropic Just Metered the Agent SDK: What Breaks on June 15

On May 13 Anthropic split Claude subscriptions into interactive and programmatic pools. Power users call it a 25x cost cut. Here is the strategic read.

AI engineering
Claude
Codex
agent SDK
developer productivity

An editorial illustration of a four-dial dashboard with three of the dials visibly disconnected from the underlying meters they claim to read, used as a metaphor for DORA's four classic metrics drifting from the system they were built to measure

May 13 · 23 min read

DORA in the Agent Era: Three Metrics Stop Measuring

DORA's four metrics measured human-paced delivery. With agents writing 46% of code and review time up 441% YoY, three no longer measure what they claim.

DORA metrics
AI engineering
developer productivity
DevEx
engineering management

A single index card pinned to a workshop wall with seventeen lines of failing test handwritten on it, used as a metaphor for the failing test as the durable artefact for follow-up agent work

May 11 · 18 min read

Agentic TDD: When the Failing Test Is the Spec

Spec-driven was last week's new feature. Today's spec: 17 lines of failing test. Artefact-driven TDD for follow-up agent work, against the 1.7x AI-issue rate.

Claude Code
AI agents
TDD
agentic coding
developer productivity

A workshop bench with three labelled drawers (brainstorm, design, plan) feeding a single output, used as a metaphor for the durable artefact pipeline that produces an agent-executable plan

May 10 · 21 min read

Spec-Driven Agent Development: Brainstorm, Design, Plan

PR 23 told its reviewers to use MCP. They didn't. Per-agent tool calls jumped from 0.6 to 2.6 after a design doc surfaced the wiring bug prompts hid for weeks.

Claude Code
AI agents
spec-driven development
agentic coding
developer productivity

A wooden card-index drawer on a warm linen surface, packed with cream archival cards and a single card tilted forward, a physical metaphor for progressive disclosure

May 8 · 21 min read

Agent Skills: Progressive Disclosure That Actually Scales

Naive skill loading costs roughly 22x more tokens than progressive disclosure, and the attention math gets worse with every model upgrade. The pattern, the catalog math, and the authoring mistakes that break it.

agent skills
Claude Code
context engineering
progressive disclosure
AI agents

An overhead view of a precision manufacturing floor with three parallel conveyor lanes, used as a metaphor for the wave-based dispatch pattern in subagent-driven development

May 8 · 22 min read

Subagent-Driven Development: How to Fan Out a Feature Build

Subagents fan out feature builds at ~15x token cost. Wave dispatch, a frozen plan, and the five failure modes specific to subagent-driven development.

Claude Code
AI agents
subagents
developer productivity
agentic coding

A stack of well-worn engineering books on a desk in warm daylight, suggesting the durable layer beneath any paradigm

May 6 · 20 min read

Engineering That Outlasts the Paradigm

Trust in AI accuracy hit 29% the same year vibe coding became Word of the Year. Both numbers describe the same mistake. Engineering outlasts the paradigm.

AI agents
software engineering
agentic coding
career
thought leadership

A flock of starlings rendering an emergent coordination pattern across an open landscape, used as a metaphor for parallel subagent dispatch

May 4 · 21 min read

Subagent Patterns: When to Spawn vs Stay In-Context

Multi-agent burns 15x more tokens than chat. Five-question decision tree, 2026 token math, and three reproducible failure modes for Claude Code subagents.

Claude Code
AI agents
subagents
developer productivity
agentic coding

Paper with the line 'Use REST.' intact above red redaction bars covering the rest of the paragraph.

May 1 · 19 min read

We Tried to Cut Claude's Output 50%. We Got 5%. So Did Anthropic.

We aimed for 50% Claude output compression. We hit 4.7%. Anthropic hit the same wall and reverted at 3%. Here is the data and the failure mode.

claude-code
llm-output-compression
prompt-engineering
claude-skills
anthropic

An editorial illustration of a code repository as a layered query graph with typed signatures, file boundaries, and import edges traced by a precise probe

Apr 29 · 14 min read

Your Codebase Is the Agent's Operating Environment

Frontier agents hit 90% on SWE-Bench Verified and 21% on SWE-EVO. The variable is the shape of the codebase, not the size of the model.

AI agents
monorepo
code intelligence
code graph
developer tooling

A clean editorial illustration of a code diff flowing through a fine-mesh sieve catching mechanical defects, then through a brass jeweler's loupe inspecting one architectural piece

Apr 26 · 19 min read

AI Reviews the Diff. Humans Review the Decision.

AI code-review adoption tripled to 51.4% in 2025, but 31% of PRs now merge unreviewed. Honest market scan, security posture, and a Claude Code DIY recipe.

AI PR review
AI code review
Claude Code
GitHub Actions
developer tooling

A clean editorial illustration evoking record, replay, diff, and judge as a temporal evaluation loop

Apr 25 · 20 min read

Backtesting AI Agents: Replay to Catch Regressions

54% of enterprises ship AI agents in production. Most cannot tell when a CLAUDE.md edit silently regresses behavior. Backtesting is the missing discipline.

agent evaluation
backtesting
Claude Code
AI agents
regression testing
LLM as judge

A 2x2 decision matrix mapping the four context surfaces (CLAUDE.md, skills, memory, MCP) against frequency and stability axes

Apr 25 · 18 min read

Context Engineering in Practice: Where Does Each Piece Go?

Context engineering became the #1 2026 skill shift. Anthropic's research notes context exhibits n² token relationships. Here's the per-surface decision framework.

context engineering
Claude Code
MCP
AI agents
developer productivity

A sleek, conceptual representation of a stacked architectural infrastructure system

Apr 18 · 31 min read

Treat AI as a Team Member, Not a Chat Window

84% of developers use AI, 46% distrust it. The right scaffolding (constitution, skills, memory, MCP, subagents) turns an assistant into a team member.

AI agents
developer productivity
Claude Code
MCP
team workflows

A premium, sophisticated abstract representation of a glowing technical dashboard or progress ring tracking data tokens

Apr 16 · 11 min read

How to Track Claude Code 5-Hour Window Usage

40.8% of devs use Claude Code, but the 5-hour window is opaque. Build a local dashboard that parses transcripts, estimates your token budget, and rolls up team-wide cost via Grafana Loki.

claude-code
token-usage
developer-tools
ai-coding
cost-tracking

Abstract 3D visualization of code structure as a glowing interconnected graph of symbols and relationships

Apr 9 · 31 min read

Your AI Agent Is Flying Blind Without Local Code Intelligence

84% of developers use AI tools but 46% distrust the output. Three on-device models, 32 MCP tools, 9.93/10 relevance, and zero source code leaving your machine.

local code intelligence
AI agents
MCP
code search
developer tools

A dynamic, high-tech abstract visualization of a glowing interconnected wiki graph forming from raw data blocks

Apr 8 · 11 min read

Building an LLM Wiki: From Karpathy's Gist to a Working CLI

I turned Andrej Karpathy's LLM wiki concept into a Bun CLI (~500 lines of TypeScript) that automatically builds a persistent knowledge base from Claude Code sessions, files, and URLs.

llm
cli
knowledge-management
claude-code
bun

A clean, minimal editorial illustration of a data timeline and temporal behavior patterns

Apr 2 · 15 min read

How Do You Test Systems That Analyze Behavior Over Time?

Backtesting borrows from quant finance to catch temporal bugs unit tests miss. Poor US software quality costs $2.41T per year. Here's the technique.

backtesting
software-engineering
data-pipelines
temporal-data
regression-testing
synthetic-data
developer-tooling