Autonomous Iteration

Testing coding agents' ability to improve a feature automatically

June 10, 2026 · Ted Spare · Experiment

A coding agent’s first draft is usually good, but it’s rarely the best it could be.

People close that gap by hand: look at the work, notice what’s off, fix the worst thing, look again.

We wanted to see if the agent could run that loop on its own, so we ran a series of experiments across UI, performance, and UX.

Method

Our core idea is quite simple:

a coding agent writes code,
another agent critiques the output,
the original agent addresses the feedback.

We repeat until the critic is happy.
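
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the setup we ran: `write` and `critique` are hypothetical stand-ins for calls to the coding agent and the critic sub-agent, and the `max_rounds` cap is an added assumption so the loop always terminates.

```python
from typing import Callable, Optional


def iterate(
    write: Callable[[Optional[str], Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Run write -> critique -> revise until the critic has no feedback.

    `write(previous_draft, feedback)` and `critique(draft)` are hypothetical
    stand-ins for agent calls. `critique` returns None when it is satisfied.
    Returns the final draft and the number of critique rounds that ran.
    """
    draft = write(None, None)  # one-shot first draft
    for round_number in range(1, max_rounds + 1):
        feedback = critique(draft)
        if feedback is None:  # the critic is happy
            return draft, round_number
        draft = write(draft, feedback)  # address the feedback
    return draft, max_rounds  # assumed safety cap reached


# Toy stand-ins: the "critic" accepts the third version of the draft.
def toy_write(previous: Optional[str], feedback: Optional[str]) -> str:
    return "v1" if previous is None else f"v{int(previous[1:]) + 1}"


def toy_critique(draft: str) -> Optional[str]:
    return None if int(draft[1:]) >= 3 else "needs more polish"


final_draft, rounds = iterate(toy_write, toy_critique)
```

In practice the critic would look at a screenshot, a benchmark number, or a walkthrough rather than a string, but the control flow stays the same.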

Any coding agent will do, so long as it can use a computer, see screenshots, and deploy sub-agents.

We used Pi for its simplicity.

We’ll develop the prompt over the course of three examples, in order of increasing difficulty.

Examples

1. SaaS Landing Page

We start with a straightforward example: building a marketing page for a software product.

By default, coding agents tend to produce the familiar purple-gradient look.

We want pages to be distinct.

The meta-prompt we used is captured as a SKILL.md:

---
name: iterate-ui
description: Build a UI, screenshot it, critique the screenshot, and improve it.
---
 
# To iterate on UI
 
1. Build the feature.
2. Run the dev server and take a screenshot.
3. Look at the screenshot and list what looks bad.
4. Fix the worst issues.
5. Repeat until it looks good.

It’s intentionally simple and UI-specific, and it steers clear of defining what “good” means.

We will make it more general in the coming examples.

Here is the result of three rounds of iteration on the prompt “build a mobile landing page for Cadence, a habit tracker for teams”:

[Screenshots: Iteration 1 (one-shot), Iteration 2, Iteration 3]

The one-shot result covers basic copy but lacks hierarchy and variety.

The third iteration is scannable, with a clear H1. It leads with a representation of the product itself. It adds social trust as ratings, user logos, and testimonials. Its CTA nudges the user forward with a low time commitment and the option to back out.

The site feels more polished, even without an explicit rubric.

We’ve successfully squeezed more quality out of the agent.

In the next example, we get quantitative.

2. Performance Optimization

In this example, we have the coding agent build an interactive table.

We render it with 10 000 rows and measure how long it takes to load and filter.

The agent's initial solution is slow and costly: it recomputes the data on each keystroke, taking 328 ms to mount and 299 ms to filter.

To improve it, we add the following to our meta-prompt. Note that this introduces some definition of what we consider "good". We will generalize it in the final example.

...
 
# To iterate on performance
 
1. Define a metric (e.g. p50 filter latency in ms) and write a benchmark for it.
2. Write a test suite that pins behavior. Ensure it passes.
3. Each pass: form one hypothesis, change one thing, re-run tests + benchmark.
   Keep the change only if tests pass AND the metric improved. Otherwise revert.
4. Stop when you hit the target or gains flatten. Log before/after numbers each pass.
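
As a rough sketch of steps 1 and 3, the metric and the keep-or-revert gate might look like this in Python. The helper names `p50_ms` and `keep_change`, the run count, and the toy workload are all assumptions for illustration; our actual harness measured a React table in the browser.

```python
import statistics
import time
from typing import Callable


def p50_ms(fn: Callable[[], object], runs: int = 21) -> float:
    """Median wall-clock time of fn() in milliseconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def keep_change(tests_pass: bool, before_ms: float, after_ms: float) -> bool:
    """Keep a change only if tests pass AND the metric improved."""
    return tests_pass and after_ms < before_ms


# Toy workload standing in for "filter 10,000 rows".
rows = list(range(10_000))
baseline = p50_ms(lambda: [r for r in rows if r % 7 == 0])
```

Using the median rather than the mean keeps one slow outlier run from deciding whether a change is kept.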

After two iterations, the model recalls common optimization techniques:

memoize computed data,
debounce the search input, and
virtualize the table (render only visible rows plus a small buffer).
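
Two of these techniques are easy to show in isolation. The sketch below is language-neutral Python rather than the agent's React code, and every name and constant in it (`filter_rows`, `visible_range`, the 32 px row height, the 5-row buffer) is an assumption chosen for illustration. Debouncing is timer-based and is left out.

```python
from functools import lru_cache

ROW_HEIGHT_PX = 32  # assumed fixed row height
BUFFER_ROWS = 5     # assumed extra rows rendered above and below the viewport


@lru_cache(maxsize=128)
def filter_rows(rows: tuple[str, ...], query: str) -> tuple[str, ...]:
    """Memoized filter: a repeated (rows, query) pair is served from the cache."""
    needle = query.lower()
    return tuple(row for row in rows if needle in row.lower())


def visible_range(scroll_top: int, viewport_height: int, total_rows: int) -> tuple[int, int]:
    """Return the half-open [start, end) range of row indices to render."""
    first = scroll_top // ROW_HEIGHT_PX
    # Index of the last row that is at least partially inside the viewport.
    last = (scroll_top + viewport_height - 1) // ROW_HEIGHT_PX
    start = max(0, first - BUFFER_ROWS)
    end = min(total_rows, last + 1 + BUFFER_ROWS)
    return start, end
```

With these assumed numbers, a 640 px viewport renders a few dozen rows at a time regardless of whether the table holds 10,000 rows or 10 million, which is why virtualization dominates the other two gains.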

Here, the improvement is drastic:

| Iteration | Change | Mount | Filter |
| --- | --- | --- | --- |
| 1 (one-shot) | data recomputed on each keystroke | 328 ms | 299 ms |
| 2 | memoized data + debounced filter | 70 ms | 42 ms |
| 3 | virtualized list (only visible rows rendered) | 25 ms | 3 ms |

“build a React table that lists transactions with a search filter, sortable columns, and column sums”

With a benchmarking harness, the agent made the table mount about 13x faster and filter almost 100x faster.
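
The speedups follow directly from the first and third iterations' timings. A quick check, with the variable names chosen here for illustration:

```python
# Timings in milliseconds, copied from the results above.
mount_ms = {"iteration_1": 328, "iteration_3": 25}
filter_ms = {"iteration_1": 299, "iteration_3": 3}

mount_speedup = mount_ms["iteration_1"] / mount_ms["iteration_3"]
filter_speedup = filter_ms["iteration_1"] / filter_ms["iteration_3"]

print(f"mount:  {mount_speedup:.1f}x")   # mount:  13.1x
print(f"filter: {filter_speedup:.1f}x")  # filter: 99.7x
```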

You can try each iteration here.

In the next example, we task the agent with iterating on a full-stack, product-level feature.

3. Product Onboarding Flow

Improving an onboarding flow requires all of the following:

1. a model of the underlying product,
2. the ability to click through the flow and take screenshots, and
3. the ability to judge (2) against (1).

The agent's initial solution is dense and high-friction.

The opening screen of each iteration:

Iteration 1: a single “Create your account” form with nine fields: First name, Last name, Work email, Password, Confirm password, Phone number, Company, Job title, and “How did you hear about us?”

Iteration 2: “Step 1 of 4” of “Create your account”, with three fields: Full name, Work email, and Password.

Iteration 3: “Account, 1 of 4”, titled “Let's set up Cadence”, with a note that just an email is needed to start and that it takes about a minute, and a single Work email field.

“scaffold an onboarding flow for Cadence - create an account, set up a team, and pick some habits”

The agent added optimistic validation and sensible defaults, and made optional steps skippable. It also removed fields that can be completed later.

In general, these changes reduce friction.

Friction was probably the correct metric to focus on.

A skill for all scenarios

We want the agent to be able to improve any feature without an explicit rubric, which means identifying which metrics to push forward.

To do so, we generalize the meta-prompt so that the agent targets whatever it judges to be highest-impact:

Given a static page, polish the UI.
Given a pre-built component, make it fast.
Given a product flow, remove friction.

Below is the combined skill:

---
name: improve-feature
description: Improve a feature across iterations. After building the feature, define how to judge it, measure the current state, fix the highest-impact problem, and re-measure until it meets the bar or gains flatten.
when_to_use: After producing any artifact (a UI, a component, an API, a flow) where the first attempt is unlikely to be the best one.
---
 
# Iterate
 
The first version is a draft. Your job is the loop that follows.
 
## 1. Define the bar
Before improving anything, write down how this artifact will be judged:
- Visual → a 1–5 rubric (hierarchy, contrast, type scale, spacing, CTA clarity,
  trust signals, "does it look generic?"). Always screenshot at a fixed viewport.
- Performance → one measurable metric (e.g. p50 latency in ms) plus a benchmark
  that prints it, and a test suite that pins correct behavior.
- Flow / interactive → a concrete task to complete; success = the artifact can be
  used end-to-end without confusion. Drive it with browser/computer use, not a glance.
 
## 2. Measure the current state
Run it and capture evidence (screenshot, benchmark number, walkthrough recording).
Write down the top problems, ranked by impact.
 
## 3. Change one thing
Fix only the highest-impact problem this pass. Don't refactor opportunistically.
 
## 4. Re-measure and gate
Re-capture the same evidence at the same settings.
- Keep the change only if it improved the bar AND broke no guardrail (tests stay green).
- Otherwise revert and try the next hypothesis.
 
## 5. Stop deliberately
Stop when you meet the bar, run out of budget (cap the passes), or gains flatten.
Keep a short changelog: what was wrong, what you changed, before → after.
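
Step 5 is the easiest part to leave vague, so here is one way it could be made concrete. This is a sketch under stated assumptions: the function name `stop_reason`, the 10-pass budget, and the 5% "flattened" threshold are illustrative choices, not values we tuned, and the metric is assumed to be one where lower is better.

```python
from typing import Optional


def stop_reason(
    history: list[float],
    target: float,
    max_passes: int = 10,
    min_relative_gain: float = 0.05,
) -> Optional[str]:
    """Decide whether to stop, given the metric after each pass (lower is better).

    history[0] is the baseline measurement; each later entry is one pass.
    Returns a short reason string, or None to keep iterating.
    """
    current = history[-1]
    if current <= target:
        return "met the bar"
    passes = len(history) - 1
    if passes >= max_passes:
        return "out of budget"
    if passes >= 1:
        previous = history[-2]
        # Relative improvement of the latest pass over the one before it.
        if previous > 0 and (previous - current) / previous < min_relative_gain:
            return "gains flattened"
    return None
```

For example, with an assumed 10 ms target, the filter timings from the table example give `stop_reason([299.0, 42.0], target=10.0)` returning `None` (keep going) and `stop_reason([299.0, 42.0, 3.0], target=10.0)` returning `"met the bar"`.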

We are confident this skill can be improved, maybe even automatically.

If this sparked an idea for your roadmap, let's talk.

Rubric is an applied AI lab helping teams build and ship intelligent products.
