Deep Dive with Claude Code: Two Months on a Production App

2026/05/26

Earlier this year I wrote Three Weeks with a Coding Agent — a quick sketch of a workflow that was starting to work. This is the longer follow-up. The Ring Sizer Project is now winding down, and it’s the first product-grade application I’ve shipped almost entirely through Claude Code. The day-to-day of being a software engineer has changed enough that it’s worth a careful note.

The Shape of the Project

Two months. 285 commits. Six versioned milestones (v0 → v6). About 12k lines of Python plus 5k of JS/CSS/HTML.

A non-trivial product surface for a single person:

Mapped to a traditional org: PM, designer, CV/algorithm engineer, backend engineer, frontend engineer, mobile developer, QA, DevOps. Here it was one person plus Claude Code, with Claude Design pulled in for visual work.

How a One-Person Team Actually Works

Version docs are the agent’s working memory

Every major direction is its own folder — doc/v0/, doc/v4/, doc/v5/, doc/v6/ — with PRD.md (what we’re trying to do), Plan.md (how), and Progress.md (what actually happened, including the failures). The repo’s CLAUDE.md makes “read the v-folder before coding” step zero of every reboot.
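The folder convention is simple enough to check mechanically. Here is a minimal sketch, assuming the layout described above (doc/vN/ folders holding PRD.md, Plan.md, and Progress.md); the function name and the idea of running it as a pre-session check are illustrative, not something taken from the repo.

```python
from pathlib import Path

# File names come from the convention described in the post.
REQUIRED_DOCS = ("PRD.md", "Plan.md", "Progress.md")


def missing_version_docs(doc_root: Path) -> dict[str, list[str]]:
    """Map each version folder under doc_root to the required files it lacks."""
    missing: dict[str, list[str]] = {}
    for folder in sorted(doc_root.glob("v*")):
        if not folder.is_dir():
            continue
        absent = [name for name in REQUIRED_DOCS if not (folder / name).is_file()]
        if absent:
            missing[folder.name] = absent
    return missing
```

Calling missing_version_docs(Path("doc")) returns an empty dict when every version folder is complete, so it could sit in front of a session as a cheap sanity check.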

This sounds like documentation theater. It’s not. The Progress.md files are written for the next session of the agent, not for me. They include things like:

“While reading src/edge_refinement.py in preparation for Phase 2, I discovered that the MediaPipe mask is not used as the search constraint in Sobel edge refinement today. extract_ring_zone_roi() builds roi_mask = np.ones((h, w)) * 255… The ‘mask-constrained mode’ is therefore actually ‘rectangle-constrained mode.’”

That note — found by the agent itself while reading code before a refactor — drove the single biggest robustness gain of v4. Without the Progress.md convention it would have been a one-conversation insight, lost the moment the context window cleared. With it, it became a durable architectural finding I can hand to the next session in two seconds.

Workflows crystallize into skills

The repo has a .claude/skills/ folder with two skills so far: recalibrate and cleanup-supabase-f. The first encodes an eight-step calibration workflow I burned myself on once — a “double-calibration trap” where I almost re-fit the linear model on already-corrected measurements. The skill now warns loudly on an identity-fit (slope≈1, intercept≈0) and forces --no-calibration on the batch.

Tribal knowledge that used to live in my head — “remember to do X before Y, and watch out for Z” — becomes an executable thing the agent runs verbatim. The asymmetry is enormous: I have to learn a procedure once to write the skill; the agent executes it perfectly every time after.

The Agent as Research Assistant

The biggest mental update: this isn’t just code generation. Given clear data and clear objectives, Claude Code is a competent experimental analyst.

Picking a distance-gate threshold. v5 needed a “move closer” prompt for the mobile capture coach. The threshold needed to be data-driven, not eyeballed. The agent wrote script/analyze_hand_span.py, ran the full SAM pipeline across 72 KOL images, bucketed by authoritative fail_reason, and produced this:

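| bucket | n | P10 | P50 | P90 |
|---|---|---|---|---|
| success | 30 | 0.239 | 0.282 | 0.327 |
| card_too_small | 20 | 0.200 | 0.223 | 0.249 |
| card_not_detected | 12 | 0.233 | 0.278 | 0.336 |
| card_not_parallel | 8 | 0.205 | 0.232 | 0.256 |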

Conclusion: pin HAND_SPAN_RATIO_MIN = 0.239 (P10 of the success cohort). Catches 80% of card_too_small failures, passes 93% of currently-successful captures, correctly does nothing to the unrelated card_not_detected / card_not_parallel distributions. That’s the actual reasoning chain that landed in Progress.md, with the trade-off (“tightening to ≥0.249 catches 90% of failures but blocks 25–30% of small-hand users — bad tradeoff”) spelled out.
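The reasoning behind a percentile-based gate can be sketched in a few lines. This is not the contents of script/analyze_hand_span.py; it is a minimal illustration with hypothetical function names and made-up ratios, showing how a threshold taken from the success cohort's P10 is then scored against both cohorts.

```python
def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in 0..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def evaluate_gate(threshold, success, failures):
    """Return (share of successes that pass, share of failures that are caught)."""
    passed = sum(v >= threshold for v in success) / len(success)
    caught = sum(v < threshold for v in failures) / len(failures)
    return passed, caught


# Made-up hand-span ratios, for illustration only (not the project's data).
success_ratios = [0.23, 0.25, 0.27, 0.28, 0.29, 0.31, 0.33]
too_small_ratios = [0.19, 0.21, 0.22, 0.23, 0.25]

gate = percentile(success_ratios, 10)
passed, caught = evaluate_gate(gate, success_ratios, too_small_ratios)
print(f"gate={gate:.3f} passes {passed:.0%} of successes, catches {caught:.0%} of failures")
```

Sweeping the threshold through a few candidate values with evaluate_gate is what exposes the trade-off between catching failures and blocking small-hand users.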

Choosing a segmentation backend. v4 needed to decide between SAM 2.1 Tiny and Small. The agent ran a single-image comparison with point-prompt mode, measured IoU and latency, and produced a table. SAM Tiny scored 0.987 IoU at 0.6s; Small scored 0.982 at 0.7s, and the masks were visually indistinguishable. Decision: ship Tiny. No need to pay roughly 17% more latency (0.7s vs 0.6s) for no gain in IoU.
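For readers unfamiliar with the metric: IoU (intersection over union) compares two binary masks pixel by pixel. A minimal sketch follows, assuming masks are plain nested lists of 0/1; the post does not say what reference mask the scores were measured against, so the inputs here are purely illustrative.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-shaped binary masks (nested lists)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        if len(row_a) != len(row_b):
            raise ValueError("mask rows must have the same length")
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks are treated as identical.
    return intersection / union if union else 1.0
```

A score of 1.0 means the masks coincide exactly, which is why a difference between 0.987 and 0.982 on a single image is too small to justify a slower model.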

I would not have written either of these scripts if I were doing this myself. They take a couple of hours each. The agent writes them in a couple of minutes, and the conclusion is good enough to commit.

Where the Human Stays Essential

Diagnostic judgment doesn’t transfer. The very first calibration result was off by 0.25cm — enough to push a size-11 prediction to size 13. The agent could not find the cause. I had to look at the geometry myself: the finger sits closer to the camera than the reference card, so it photographs larger due to perspective distortion. That insight reframed the entire problem from “fix the edge detection” to “fit a linear bias correction” — a tractable problem the agent could then execute on cleanly.
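The geometry can be written down with a simple pinhole-camera model, in which apparent size is inversely proportional to distance from the camera. The sketch below is an illustration under that assumption; the function name and the distances are hypothetical, not measurements from the project.

```python
def card_calibrated_width(true_width_cm, card_distance_cm, object_distance_cm):
    """Width reported for an object when the pixel-to-cm scale was calibrated
    on a reference card at card_distance_cm, under a pinhole-camera model.

    An object nearer to the camera than the card is over-measured by the
    factor card_distance_cm / object_distance_cm.
    """
    if card_distance_cm <= 0 or object_distance_cm <= 0:
        raise ValueError("distances must be positive")
    return true_width_cm * card_distance_cm / object_distance_cm


# Hypothetical setup: card 30 cm from the lens, finger 2 cm nearer.
print(card_calibrated_width(1.8, 30.0, 28.0))
```

If the depth gap between finger and card is roughly stable across captures, the over-measurement is close to a constant scale factor, which is the kind of bias a linear correction can absorb.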

Same pattern, smaller scale, last week: two KOL submissions failed because the users had tattoos on their palms. The single-point SAM prompt was landing on the ink and latching onto the tattoo’s shape instead of the hand. I noticed the pattern from the admin dashboard; the agent then proposed and implemented a coverage-gated fallback (if <70% of the 21 MediaPipe landmarks fall inside the mask, retry with a 6-point prompt) plus a 232-line regression-test harness. The pattern recognition was mine; the implementation was the agent’s.
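The coverage gate described above can be sketched as follows. This is an illustration rather than the project's code: segment is a hypothetical callable standing in for the SAM call, landmarks are assumed to be already converted to pixel (x, y) coordinates, and masks are plain nested lists.

```python
def landmark_coverage(mask, landmarks):
    """Fraction of (x, y) pixel landmarks that land on a truthy mask pixel."""
    if not landmarks:
        return 0.0
    height = len(mask)
    width = len(mask[0]) if height else 0
    inside = 0
    for x, y in landmarks:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width and mask[row][col]:
            inside += 1
    return inside / len(landmarks)


def segment_with_fallback(segment, landmarks, single_point, multi_points,
                          min_coverage=0.70):
    """Try a single-point prompt first; retry with the multi-point prompt
    when too few hand landmarks fall inside the resulting mask.

    segment is a hypothetical stand-in: it takes a list of prompt points
    and returns a binary mask.
    """
    mask = segment([single_point])
    if landmark_coverage(mask, landmarks) < min_coverage:
        mask = segment(multi_points)
    return mask
```

The gate works because the hand landmarks are an independent signal: a mask that latched onto a tattoo covers only a few of them, whatever the segmentation model's own confidence says.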

Scope discipline is constant work. Without explicit pushback, the agent will refactor what doesn’t need refactoring and add what you didn’t ask for. The single most-quoted rule in my AGENTS.md template is “Do not change scope unless explicitly instructed.” This rule is not optional.

Trust but verify, especially on external state. The mobile capture coach uses MediaPipe Hands from a CDN. The agent pinned a specific @mediapipe package version, a number recalled from training-data memory that turned out not to exist on jsdelivr. Cost: half a debugging session. The lessons banked in Progress.md: verify versions against the registry before pinning, and surface actual error strings on the failure-state chip so the next regression is debuggable from the device, not the DevTools console.
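Checking a pin against the registry is a small amount of code. The sketch below assumes the package is published on npm (whose packages jsdelivr serves) and uses the public npm registry metadata endpoint; the package name "@scope/name" and the helper names are placeholders, not the actual pin from this project.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def registry_url(package):
    """Metadata URL for an npm package; the slash in a scoped name is encoded."""
    return "https://registry.npmjs.org/" + quote(package, safe="@")


def published_versions(package, timeout=10):
    """Fetch the set of version strings the npm registry lists for a package."""
    with urlopen(registry_url(package), timeout=timeout) as response:
        metadata = json.load(response)
    return set(metadata.get("versions", {}))


def is_published(version, versions):
    """True when the exact version string appears in the registry listing."""
    return version in versions


# Example (needs network access, so it is left commented out):
# versions = published_versions("@scope/name")
# if not is_published("1.2.3", versions):
#     raise SystemExit("refusing to pin a version that is not in the registry")
```

Keeping the network fetch separate from the membership check makes the check itself trivial to test offline.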

Two-strike rule for debugging. If the model fails twice on a bug, the third response is almost always longer and less useful — it’s spinning. At that point: switch models, or change approach. No model loyalty.

What the Job Becomes

Every engineer working this way gets pushed, whether they want to or not, into the manager seat. And there’s a real difference between an individual contributor and a manager — the core skill of management is leverage: getting work out of people whose specific skills are better than yours in some dimensions.

That’s the skill that suddenly matters. The agent codes faster than I do, knows more frameworks than I do, writes cleaner debug tooling than I do. None of that is the bottleneck. The bottleneck is: do I know which experiment to run, which direction is worth pursuing, when to stop polishing, what the product should actually feel like, which failure modes are worth fixing, which to write off as user-education problems.

A few concrete shifts I’ve noticed in how I work:

The Opportunity Window

The case that’s most obvious to me: a founder with a solid product foundation and direct contact with users can now move at a speed that previously required a funded team. A team or individual who already has the audience and the product instinct, and just needs execution to keep up, can now use AI to read reviews and surface demand, Claude Code to implement features, and Claude Design to ship UI, and can often outpace a full-time team on both speed and quality.

This is genuinely the right moment for solo founders. Not because the tools make founding easy — they don’t — but because the leverage they grant to a single operator with good taste is unprecedented. The bottleneck is no longer headcount. It’s whether you have the directional judgment to use the leverage well.

The skill to build, urgently, is the one that translates into “what should we do next, and how do I phrase that so the system can execute on it.” Everything downstream of that is becoming cheap.