I spent the last few weeks building small tools and products almost entirely through coding agents — Claude Code, Codex, Gemini, Copilot. Across a dozen projects spanning CLI utilities, macOS apps, an iOS sleep tracker, Apple Watch apps, a full-stack web application, and a computer vision pipeline for ring sizing, I developed a workflow that reliably gets things done. Here’s what I learned.
The Memory Problem
The first thing you hit with any coding agent is amnesia. Each conversation starts blank. The agent that just built your authentication flow has no idea it exists when you open a new session.
I tried session IDs. I tried relying on tool-specific project modes. None of it stuck. What works is brutally simple: local files.
Every project gets three documents — PRD.md, Plan.md, and Progress.md — plus an AGENTS.md that defines the agent’s workflow constraints. I call this the 3P protocol. At the start of every session, the agent reads these files. At the end, it updates Progress.md and commits. The state lives in the repo, not in any platform’s memory. This means I can switch between Claude Code and Codex mid-project without losing context. The files are the context.
I templatized this into a base repo. New project setup takes under a minute: clone, reinitialize git, write the PRD, generate the plan, run /init, start building.
The Workflow That Holds
The AGENTS.md file enforces a five-step workflow for new features:
- Read PRD.md, Plan.md, Progress.md before coding
- Summarize current project state before implementation
- Implement, then build and test
- Update Progress.md
- Commit with a clear message
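A minimal AGENTS.md encoding these steps might look like the following. This is an illustrative sketch, not the author's actual file; the wording is invented, though the scope rule sits at line 3 as described below.

```markdown
# Agent Workflow

1. Read PRD.md, Plan.md, and Progress.md before writing any code.
2. Summarize the current project state before implementation.
3. Do not change scope unless explicitly instructed.
4. Implement the task, then build and test.
5. Update Progress.md, then commit with a clear message.
```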
For bug fixes, it’s simpler: summarize the bug and proposed fix, implement, test, update, commit. The key constraint is line 3 of every AGENTS.md I write: Do not change scope unless explicitly instructed. Without this, agents wander. They refactor what doesn’t need refactoring. They add features you didn’t ask for. Scope discipline is the single most important rule.
Where Agents Excel
Raw implementation speed across unfamiliar platforms is remarkable. I built my first Apple Watch app, my first iOS app, and my first full-stack web application (Echo English, a language learning tool on Next.js + Supabase) within this period. The agent handles boilerplate, platform-specific APIs, and framework conventions far faster than I could by reading documentation.
Code refactoring is another strong suit. During the Ring Sizer project, I noticed the vision pipeline was littered with hard-coded pixel coordinates for debug overlays. One prompt — “refactor to use variables instead of hard-coded coordinates” — and it was done cleanly. Same with Progress.md files that grow unwieldy: “refactor and clean” works.
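To make the coordinate refactor concrete, here is a hypothetical before-and-after sketch: scattered magic numbers for debug-overlay positions become named constants and a derived layout. The frame size, margin, and spacing values are invented for illustration.

```python
# Before the refactor, overlay anchors were hard-coded all over the
# pipeline: (24, 24), (24, 56), (24, 88), ... After, they derive from
# a few named constants. All values here are made up for illustration.

FRAME_W, FRAME_H = 1920, 1080  # hypothetical frame dimensions
MARGIN = 24                    # left/top margin for debug text
LINE_HEIGHT = 32               # vertical spacing between labels

def overlay_positions(n_labels: int) -> list[tuple[int, int]]:
    """Return (x, y) anchors for n debug labels down the left margin."""
    return [(MARGIN, MARGIN + i * LINE_HEIGHT) for i in range(n_labels)]

print(overlay_positions(3))  # [(24, 24), (24, 56), (24, 88)]
```

The payoff is exactly what a one-line prompt can buy: change MARGIN once instead of hunting down every hard-coded pair.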
The Ring Sizer project also revealed that agents can run structured experiments. I provided a test dataset of real finger measurements with known ring sizes, along with evaluation criteria, and Claude Code executed the full test suite and produced a calibration report. The agent doesn’t just write code — it can be a competent research assistant when given clear data and clear objectives.
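The kind of structured experiment described above can be sketched as a small evaluation harness: run the pipeline over a labeled dataset and report error statistics. Both the `predict_size` stub and the dataset below are invented for illustration; the real pipeline and measurements are not shown in the source.

```python
# Hypothetical evaluation harness: predict ring sizes from measured
# finger widths and compare against known ground truth.

def predict_size(width_cm: float) -> int:
    # stand-in for the real vision pipeline's size prediction
    return round((width_cm - 1.40) / 0.08)

# (measured finger width in cm, known ring size) — invented data
dataset = [(2.12, 9), (2.20, 10), (2.28, 10), (2.36, 12)]

errors = [predict_size(w) - size for w, size in dataset]
mae = sum(abs(e) for e in errors) / len(errors)
within_one = sum(abs(e) <= 1 for e in errors) / len(errors)

print(f"MAE: {mae:.2f} sizes, within ±1 size: {within_one:.0%}")
```

Given clear data and criteria like this, the agent can execute the loop and write up the report on its own.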
Where the Human Stays Essential
The Ring Sizer was my first production-grade project built with an agent, and it taught me the most important lesson: directional judgment is the bottleneck, not execution speed.
When the computer vision pipeline consistently overestimated finger width by 0.25cm — enough to push a size 11 prediction to size 13 — the agent could not diagnose the root cause on its own. I had to reason about the physical setup: the finger, being closer to the camera than the reference card, appeared larger due to perspective distortion. That insight reframed the problem from “fix the edge detection” to “apply systematic bias correction,” which led to a linear regression calibration approach that worked.
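The calibration approach can be sketched in a few lines, assuming the perspective bias is roughly linear in the measured width. The sample measurements below are invented; the real calibration used the author's test dataset.

```python
# Sketch of a linear-regression bias correction: fit true width as a
# linear function of the pipeline's (over)measured width.

def fit_linear(xs, ys):
    """Least-squares fit ys ≈ a*xs + b, returned as (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# pipeline-measured width (cm) vs. ground-truth caliper width (cm)
# — invented data exhibiting a constant 0.25 cm overestimate
measured = [1.95, 2.10, 2.25, 2.40]
true_w =   [1.70, 1.85, 2.00, 2.15]

a, b = fit_linear(measured, true_w)

def corrected_width(measured_cm: float) -> float:
    return a * measured_cm + b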
The pattern repeated: the agent is fast at executing a direction, but choosing the right direction remains a human job. Providing that directional judgment — even a rough one — dramatically improves collaboration efficiency.
The 10-Minute Rule
Debugging deserves its own heuristic: never trust a single model’s debugging instinct beyond two attempts — roughly ten minutes of work. After two failed tries, the model’s responses get longer and less useful — it’s spinning. At that point, switch models. If both models fail, the approach itself is wrong. Change the approach, not the prompt.
No model loyalty. Sometimes Claude Code solves it. Sometimes Codex does. The pragmatic move is always to switch early rather than burn time on diminishing returns.