Why the highest-leverage human work is no longer creation. It’s defining what good looks like.

“‘Is this design beautiful?’ is hard [for Claude] to answer, but ‘does this follow our principles for good design?’ gives it something concrete to grade against.”

This quote is from one of the most important blog articles released in the past few months, largely unheralded and ignored by the x.com AI influencer-sphere. 

It’s written by an Anthropic engineer, Prithvi Rajasekaran, describing how he stopped his autonomous coding agent from producing mediocre work. He’d been watching Claude build full-stack applications for hours at a time. Technically correct and functional… but completely generic. When asked to evaluate its own designs, Claude would catalog legitimate problems and then give itself a passing grade anyway.

The fix wasn’t a change to the prompt, the planning approach, or the model. It was a better definition of “good.”

The engineer broke “beautiful” into four criteria (design quality, originality, craft, functionality) and built a separate evaluator agent calibrated with examples of what each criterion looked like when done well. The system looped: build, evaluate, revise, repeat. And then he walked away.

Hours later, Claude was producing full-stack applications autonomously. It caught its own bugs, iterated on design, and even scrapped entire approaches when the evaluator agent pushed back hard enough. No human in the loop.

This article has been on my mind for the past two months. Not just because I build agents for clients, but because I live in the consulting world of deliverables (strategy decks, governance frameworks, roadmaps), whose success depends almost entirely on art and soft skills: judgment, expertise, trust building, and the ability to read a room.

The existential question I keep circling back to (and I have to imagine others do too) is this: what can AI actually replicate here? Is the consultant truly at risk? Claude can do some tasks (like building a deck or drafting an email), but what about the parts that require taste, context, or political awareness?

Strikingly, the answer I’ve come to is yes: AI can do these things. But taking advantage of that will demand something most organizations have yet to do: define “beautiful.” Transparent is moving in that direction, decomposing our own work and advising our clients on how they can do the same.

The Definition

At a developer conference last May, Anthropic’s Hannah Moran offered what Simon Willison called the first useful definition of “agent” he’d heard after years of buzzword soup: agents are models using tools in a loop.

That’s it. Not a swarm. Not a framework. A model, some tools, and a loop.

Deceptively simple. The model and the tools improve on their own every few months. The interesting part, to me, is the loop. How does the loop know when to stop? How does it distinguish good output from bad? What does it optimize against?

The answers to those questions determine everything downstream: whether you can walk away from the machine while it runs, or whether you’re chained to it, prompting the next step, forever. And the output gets more jagged the further you move toward problems that don’t have clean numerical outputs.

The challenge, then, is to give the loop a compass that works even when the metrics are messy.

The Ratchet

Andrej Karpathy, one of the most respected and discussed AI researchers, built a thing called autoresearch. The setup is almost comically simple: 630 lines of Python, one GPU, a markdown file describing research directions, and a single metric (validation loss, measured in bits per byte).

The agent proposes a change to the training code, trains the model for five minutes, checks whether the metric improved. If yes, it keeps the change. If no, it reverts. Like a ratchet wrench: progress clicks forward and locks, bad moves get discarded. Then it does it again.
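
The whole mechanic fits in a short loop. Here is a minimal, runnable sketch of the ratchet; to be clear, it is not Karpathy’s actual autoresearch code. propose_change() and evaluate() are toy stand-ins for the model editing the training code and for the five-minute training run, and the config exists only so the loop executes.

import random

def propose_change(config):
    # Toy stand-in: nudge one setting. In autoresearch, the model proposes
    # an edit to the training code itself.
    key = random.choice(list(config))
    return {**config, key: config[key] * random.uniform(0.8, 1.2)}

def evaluate(config):
    # Toy stand-in for "train for five minutes, report validation loss in
    # bits per byte." Lower is better.
    return sum(abs(v - 1.0) for v in config.values())

config = {"lr": 1.3, "weight_decay": 0.7, "warmup": 1.1}
best = evaluate(config)

for _ in range(700):                 # roughly the scale Karpathy ran
    candidate = propose_change(config)
    score = evaluate(candidate)
    if score < best:                 # loss went down: the ratchet clicks forward
        config, best = candidate, score
    # otherwise the change is simply discarded: the revert

print(config, best)

The loop itself is trivial and never changes. All the intelligence sits in the proposal step, and all the authority sits in the metric.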

Karpathy ran seven hundred experiments over two days. No human in the loop. No prompting between runs. He had already hand-tuned this model, with two decades of experience behind him, and the agent still found optimizations he’d missed: settings and interactions he’d walked past, made visible only through systematic exploration at a scale no human could match.

On the No Priors podcast, Karpathy described what might be the most important mental model for anyone thinking about AI and work right now: the goal is to take yourself outside the system entirely. Not to prompt the next thing and the next, but rather to arrange things such that they’re completely autonomous. Maximize the machine’s throughput by getting yourself out of the way.

Why did autoresearch work? Because the evaluation function was unambiguous. Loss goes down or it doesn’t. The agent didn’t need taste or context or political awareness. It needed a number and a direction.

Karpathy was honest about the limitation: “This is extremely well suited to anything that has objective metrics that are easy to evaluate. If you can’t evaluate it, then you can’t auto research it.”

If you can’t evaluate it, you can’t automate it. Hold that thought.

The Persistent Loop

A few months ago, a technique called Ralph Wiggum (yes, named after the Simpsons character) went viral in the AI coding community. Developers were using it to steer Claude on long-running tasks, building real working applications overnight. The technique, originally from Geoffrey Huntley, is a bash loop:

while :; do cat PROMPT.md | claude-code ; done

You give the agent a task and a “completion promise,” a specific condition that signals done. The agent works, tries to exit, gets sent back in, and loops until the promise is fulfilled or you hit a maximum number of iterations.
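
Here is a minimal sketch of that wrapper in Python, under two assumptions: the agent command is taken verbatim from the one-liner above, and the completion promise is a hypothetical convention where PROMPT.md instructs the agent to create a file named DONE once every item on the checklist is finished.

import pathlib
import subprocess

AGENT_CMD = "cat PROMPT.md | claude-code"   # the body of the Ralph one-liner
PROMISE = pathlib.Path("DONE")              # hypothetical marker named in PROMPT.md
MAX_ITERATIONS = 50

for i in range(MAX_ITERATIONS):
    subprocess.run(AGENT_CMD, shell=True)   # the agent works, then tries to exit
    if PROMISE.exists():                    # the promise is fulfilled: let it stop
        print(f"done after {i + 1} iterations")
        break
else:
    print("hit the iteration cap without fulfilling the promise")

The bash version never stops on its own; the promise and the cap are what turn an infinite loop into a run you can walk away from.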

OpenAI’s Codex /goal command is having a similar moment. Developers are setting up GPT 5.5 with a completion “goal”, and the system will run for hours to accomplish it. 

The mechanic here is simple.

[Graphic: the shift from directing AI step-by-step to designing systems that guide AI toward outcomes.]

You’re not doing the work. You’re defining what done looks like, and critically, what good looks like. As Huntley puts it, Ralph is “deterministically bad in an undeterministic world.” You don’t fix Ralph by giving better instructions mid-run. You fix Ralph by tuning the evaluation criteria before the run starts.

For software, the completion promise is often “all tests pass.” But someone had to write those tests. The quality of the autonomous run is entirely determined by the quality of the evaluation criteria defined upfront. Garbage tests, garbage output. Thoughtful tests, thoughtful output. An entire ecosystem of planning and “spec-driven development” frameworks (e.g., BMAD, GSD, SpecKit, Superpowers) has sprung up trying to solve precisely this: how do you break features into testable chunks, give an agent a structured-enough definition of “done,” and let it run for hours without a human checking in?
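
For what that promise looks like in practice, here is a small sketch. pytest is an assumption about the stack; any runner that returns a non-zero exit code on failure plays the same role.

import subprocess

def promise_fulfilled() -> bool:
    # Exit code 0 means every test passed; anything else sends the agent back in.
    return subprocess.run(["pytest", "-q"]).returncode == 0

Swap this in for the DONE check in the loop above and the plumbing is finished. The hard part isn’t the gate; it’s writing tests thoughtful enough to be worth passing.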

Kief Morris, writing on martinfowler.com, gets the framing right. The move isn’t from humans doing the work to machines doing the work. It’s from humans in the loop to humans on the loop. The distinction is critical. In the loop means you’re a bottleneck, reviewing every line, every prompt, every permission. On the loop means you’ve defined the criteria well enough to step back, building and managing the system rather than operating inside it.

The pattern is consistent. In ML, the human writes a markdown file and a metric. In software, the human writes specs and tests. The agent does the work. The human defines what “good” means. And the quality of that definition determines everything.

But in all cases thus far, “good” is still binary. Loss goes down or it doesn’t. Tests pass or they fail. What happens when “done” isn’t a number and “good” isn’t pass/fail?

Defining Beautiful

This is where it gets fun. And this is where the human role inverts.

Remember the four criteria from the Anthropic post: design quality, originality, craft, functionality. The decomposition alone wasn’t enough. The evaluator agent had to be honest, and early versions weren’t. They’d identify legitimate issues, then approve the work anyway. The fix was to read the evaluator’s logs, find places where its judgment diverged from a human’s, and update the prompts. It took multiple rounds, and real human work, to get it right.
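
Here is a minimal, runnable sketch of that generate-evaluate-revise loop. It is emphatically not the Anthropic team’s code: generate() and grade() are toy stand-ins for the builder agent and the calibrated evaluator agent, and the 4-out-of-5 bar is an assumption for illustration.

CRITERIA = ["design_quality", "originality", "craft", "functionality"]
PASSING = 4   # hypothetical bar on a 1-5 scale

def generate(prompt, history):
    # Toy stand-in for the builder agent: in the real system this is Claude
    # producing or revising the application in response to the critiques.
    return {"prompt": prompt, "revision": len(history)}

def grade(artifact):
    # Toy stand-in for the calibrated evaluator agent: one score and one
    # critique per criterion. Here it only simulates the idea that sustained
    # pressure forces a stronger attempt.
    score = min(5, 2 + artifact["revision"])
    return {c: (score, f"{c}: still too safe, push further") for c in CRITERIA}

def run(prompt, max_rounds=10):
    history, artifact = [], None
    for round_no in range(1, max_rounds + 1):
        artifact = generate(prompt, history)
        scores = grade(artifact)
        if all(s >= PASSING for s, _ in scores.values()):
            return artifact, round_no            # all four criteria clear the bar
        history.append([critique for _, critique in scores.values()])
    return artifact, max_rounds

print(run("a landing page for a Dutch art museum"))

The calibration work described above lives almost entirely inside grade(): reading its critiques, finding where they diverge from a human’s judgment, and rewriting until they don’t.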

Once the evaluator was calibrated, the result was a system that produced full-stack applications from a one-sentence prompt over multi-hour autonomous sessions. And then something unexpected happened. Prompted to create a website for a Dutch art museum, the generator had produced a clean, polished landing page by the ninth iteration. Expectations met. On the tenth cycle, however, it scrapped the original approach entirely. Claude reimagined the website as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls, doorway-based navigation between gallery rooms. A creative leap.

And it was only possible because the evaluator had been pushing the generator away from safe defaults for nine rounds. The evaluation criteria didn’t just catch bugs; they were designed to create pressure toward originality. And once the generator exhausted its conventional ideas under that pressure, it took a risk.

Here’s the thing I’m sitting with, and I hope I’ve brought you along thus far: The human labor, the irreplaceable contribution, wasn’t directed at the creation itself. It was directed at defining what and how the work was being evaluated. The human didn’t design the website. The human designed the criteria that made the website possible.

Beauty, which we’ve always treated as the untouchable, divine domain of human judgment, became four graded criteria. The subjective became evaluable. Not perfectly, but enough for the system to self-correct and produce meaningfully better output. 

The Frontier

If you can’t evaluate it, you can’t automate it.

Most knowledge work lives in the zone where there’s no compiler, no loss function, no test suite. The instinct is to say that’s exactly why you need humans doing this work. It feels categorically different. Too messy. Too contextual. Too human.

But that’s exactly what the Anthropic team said about frontend design before they broke “beautiful” into four criteria and built a grading system. Of course, consulting operates at a higher level of abstraction than frontend design: more stakeholders, more contradictions, and very real relational consequences for getting it wrong. The decomposition is definitely harder. But harder doesn’t mean categorically different.

The real question isn’t “can this be evaluated?” It’s “have you done the work to make it so?” And this is exactly the approach that we’re taking at Transparent.

Getting Out

Ready or not, the agents are here. The loops work. The tools for keeping an AI running for hours on a task without human intervention exist and have been improving by the month. That is no longer the bottleneck.

The bottleneck is evaluation. Whether your organization has done the hard, unglamorous work of defining what “good” looks like, with enough specificity that a system can measure it, learn from it, and improve.

Marketing organizations have never had to. When humans do the work, the person doing the work implicitly has taste. They know the client. They read the room. They make a thousand small judgment calls that never get written down because they never needed to be. But you can’t hand “I’ll know it when I see it” to an agent. You have to write it down.

The highest-leverage human work in an organization is no longer the creation of the thing. It’s the decomposition of judgments into criteria that can steer AI. Defining what good looks like. Building the rubrics. Creating the golden records. Engineering the evaluation system that makes autonomous creation possible. 

Our tangible suggestion: Stop waiting for a ‘smarter’ model or the perfect AI tool, and start building a smarter evaluation system. Your mandate is no longer to manage the output, but to engineer the criteria that define it.

Oliver Amidei, SVP of AI Solutions