Your agent shouldn't grade its own homework

The Forge

In a previous issue we put a Managed Agent on a schedule so the work runs while your laptop is asleep. There's an obvious next question: how do you know the output is any good when you weren't there to read it?

Outcomes is the answer, and it shipped for Managed Agents alongside the orchestration features. Here's the problem it solves. Agents are good at producing things that look done. Ask one for a cited research brief and you get a tidy document with footnotes. Look closer and a topic gets thin coverage, a quote drifts from its source, a citation leans on a press release instead of the actual filing. Catching that has always meant a manual review loop: you read the output, spot what's off, and prompt again. Most of what you say in those rounds is feedback you could have written down before the agent started.

That's the move. You define what "done" looks like as a rubric, and the platform provisions a separate grader in its own context window. The grader can't see the writer's reasoning and has no idea what shortcuts it took. After each writer turn the grader rereads the artifact against the rubric and either passes it or hands back a per-criterion list of gaps. The writer revises, the loop runs again, up to a cap you set.

The reason this beats a self-reflection prompt is separation. A writer that knows the criteria is still grading its own work. It will say it passed whenever it believes it did, and it won't go refetch a URL it already cited or notice the quote it remembers is slightly off from the quote on the page. The grader has no choice but to do those checks. It opens with a fresh context window and nothing but the rubric and the artifact.

The short version: when you can write down what good looks like, stop reviewing the output by hand and make the agent prove it hit the bar.

The Blueprint

Three pieces get you a grade-and-revise loop: an agent that does the work, a session, and a define_outcome event carrying your rubric. Every Managed Agents request needs the managed-agents-2026-04-01 beta header. The SDK sets it for you.

Step 1: create the writer. Give it the toolset it needs to actually do and verify the work. The grader gets spun up automatically with the same model and tools, so if the writer can fetch a page, so can the grader.

import anthropic

client = anthropic.Anthropic()
BETAS = ["managed-agents-2026-04-01"]

env = client.beta.environments.create(
    name="brief-env",
    config={"type": "anthropic_cloud", "networking": {"type": "unrestricted"}},
)

writer = client.beta.agents.create(
    name="Research Analyst",
    model="claude-opus-4-8",
    system=(
        "You write one-page competitor briefs. Cite every factual claim with an "
        "inline footnote [n]. End with a Sources section, one entry per line: "
        '[n] "verbatim quote, 25 words or fewer" - Title - URL. Only cite pages '
        "you actually fetched. Save the brief to /mnt/session/outputs/brief.md."
    ),
    tools=[{
        "type": "agent_toolset_20260401",
        "configs": [
            {"name": "web_search"}, {"name": "web_fetch"},
            {"name": "read"}, {"name": "write"},
        ],
    }],
    betas=BETAS,
)

Step 2: write the rubric. This is the centerpiece, and it's the only lever you have on the grader. The default failure mode is a grader that approves everything, so every criterion has to force the grader to produce evidence. A line like "covers pricing" lets the grader skim, see a paragraph about pricing, and pass it without opening a source. A line that says "states a specific dollar figure with a fetched citation" makes it earn the pass.

# Competitor brief rubric

You are reviewing a one-page brief at /mnt/session/outputs/brief.md.
The brief covers one competitor. This rubric defines what counts as
sufficient coverage and how to verify the citations.

## Coverage (each item names a specific bar)
1. Pricing: at least one concrete price or plan tier, in dollars.
2. Positioning: who they say they sell to, in their own words.
3. Recent move: a product, funding, or hiring change from the last 90 days.
4. Weakness: one cited criticism from a source that is NOT the company.
5. Source mix: at least one primary source (the company site, a filing,
   or a release), not all secondary coverage.

## Citation check (for every [n] in Sources)
a. LIVE: fetch the URL. LIVE only if it returns the readable page directly.
   DEAD if 404, parked, login-walled, paywalled, 403, or JS-only.
   Do NOT corroborate via mirrors or search snippets. The cited URL must fetch.
b. VERBATIM: search the page for the quoted string. QUOTE_MATCH if the exact
   string appears (curly vs straight quotes are equivalent), NOT_FOUND otherwise.
c. SUPPORTS: the quote actually backs the claim it's attached to, not tangential.

## Output format
Line 1: Coverage N/5. Citations M/K verified.
Then one bullet per failed item: name it and the specific bar it missed,
one sentence each. Example: "Item 3 Recent move - MISSING. No dated event
from the last 90 days."
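
To make the VERBATIM rule concrete, here is a minimal sketch of the comparison the rubric asks for. It is not the grader's actual code: the helper names are hypothetical, and collapsing whitespace is an extra assumption on top of the rubric's curly-versus-straight quote rule.

```python
# Hypothetical helper that mirrors the rubric's VERBATIM rule.
# It is an illustration, not how the platform's grader is implemented.

# Map curly quotes to their straight equivalents, as the rubric allows.
QUOTE_EQUIVALENTS = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes and apostrophes
})


def normalize(text: str) -> str:
    """Fold curly quotes to straight ones and collapse runs of whitespace.

    Whitespace collapsing is an assumption added for this sketch.
    """
    return " ".join(text.translate(QUOTE_EQUIVALENTS).split())


def verbatim_status(page_text: str, quote: str) -> str:
    """Return QUOTE_MATCH if the quoted string appears on the page."""
    needle = normalize(quote)
    if not needle:
        # An empty quote proves nothing, so treat it as a miss.
        return "NOT_FOUND"
    return "QUOTE_MATCH" if needle in normalize(page_text) else "NOT_FOUND"
```

A paraphrase fails this check even when it means the same thing, which is the point: the writer has to quote the page it fetched, not the page it remembers.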

Step 3: start a session and define the outcome. The agent begins work the moment it receives the event. No extra message needed.

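# RUBRIC is the Step 2 rubric as a string. Loading it from rubric.md is an
# assumption for this example; paste the text inline if you prefer.
RUBRIC = open("rubric.md", encoding="utf-8").read()

session = client.beta.sessions.create(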
    agent={"type": "agent", "id": writer.id, "version": writer.version},
    environment_id=env.id,
    title="Brief: competitor X",
    betas=BETAS,
)

client.beta.sessions.events.send(
    session.id,
    betas=BETAS,
    events=[{
        "type": "user.define_outcome",
        "description": "Write a one-page competitor brief on <competitor> at /mnt/session/outputs/brief.md.",
        "rubric": {"type": "text", "content": RUBRIC},
        "max_iterations": 5,   # default 3, max 20
    }],
)

That's the whole loop. The writer drafts, the grader rereads the file against your rubric and returns satisfied or needs_revision with per-criterion feedback, and the writer revises until it passes or hits the cap. You watch it on the event stream: span.outcome_evaluation_start when the grader picks up a pass, span.outcome_evaluation_end with the result and an explanation when it finishes one. When it lands on satisfied, the file at /mnt/session/outputs/ is the version that cleared your bar, not the first thing the model produced.
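
If you want to follow the loop in code, a rough sketch of the bookkeeping looks like this. It assumes each event arrives as a plain dict with "type", "result", and "explanation" keys; the SDK may hand you typed objects instead, so treat those field names and the watch_outcome helper as hypothetical and adapt the accessors to what your stream returns.

```python
from typing import Iterable, Optional


def watch_outcome(events: Iterable[dict]) -> Optional[str]:
    """Print grader progress and return the last evaluation result seen.

    Assumes dict-shaped events; the "result" and "explanation" keys are
    assumptions based on what the evaluation-end event is described to carry.
    """
    last_result = None
    passes = 0
    for event in events:
        kind = event.get("type")
        if kind == "span.outcome_evaluation_start":
            passes += 1
            print(f"pass {passes}: grader started")
        elif kind == "span.outcome_evaluation_end":
            last_result = event.get("result")
            print(f"pass {passes}: {last_result}: {event.get('explanation', '')}")
        # Every other event type is ignored in this sketch.
    return last_result
```

Feed it whatever iterable your event stream gives you. The return value tells you whether the final pass ended on satisfied or on something else.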

In Anthropic's own cookbook run of this pattern, the writer's first draft cited a third-party news article for a company's net loss. The rubric demanded the SEC filing itself. The grader bounced it, the writer found a sec.gov URL, the grader bounced it again because the URL was an 8-K press-release exhibit and not the 10-K the rubric asked for, and only on the third pass did the writer find the actual filing. The task description alone would have waved both versions through. The rubric is what caught them.

The Anvil

Now the part the launch demos skip: where this bites, and how to stop the bleeding.

A lenient rubric is worse than no rubric. If your criteria can be satisfied by skimming, the grader skims, passes the first draft, and the loop never runs. You paid for a grader and got a rubber stamp. The fix is to make every criterion demand a concrete artifact: a dollar figure, a fetched page, a file:line reference, a traced formula. "The data looks good" cannot be evaluated. "The CSV has a price column with numeric values" can. Write criteria you could hand to a stranger and get the same verdict.
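
One cheap guard is to lint the rubric before you ship it. The sketch below is a crude heuristic, not a platform feature: the vague_criteria helper and its VAGUE_TERMS list are hypothetical, and you would tune the list to the phrases that creep into your own rubrics.

```python
# Hypothetical pre-flight lint for rubric lines. The term list is an
# assumption for illustration; extend it with your own weasel words.
VAGUE_TERMS = (
    "looks good",
    "high quality",
    "appropriate",
    "adequate",
    "reasonable",
    "well written",
    "thorough",
)


def vague_criteria(criteria: list[str]) -> list[str]:
    """Return the criteria that contain a phrase a grader could pass by skimming."""
    flagged = []
    for line in criteria:
        lowered = line.lower()
        if any(term in lowered for term in VAGUE_TERMS):
            flagged.append(line)
    return flagged
```

It will not tell you a criterion is good. It only catches the lines that are obviously unenforceable, which is usually where a rubber-stamp grader starts.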

The description and the rubric have to agree. The description tells the writer what to make. The rubric tells the grader how to check it. If they contradict each other (say the description asks for inline output and the rubric grades a file at /mnt/session/outputs/), the loop returns failed instead of thrashing. Make them point at the same artifact, same location, same format, every time.
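
You can catch the most common mismatch, a different output path on each side, before the session starts. This is a sketch under one assumption: that both texts name the artifact by its /mnt/session/outputs/ path. The helper names are hypothetical, and the check says nothing about format disagreements.

```python
import re

# Matches paths under the session outputs directory used in this issue.
OUTPUT_PATH = re.compile(r"/mnt/session/outputs/[\w./-]*")


def artifact_paths(text: str) -> set[str]:
    """Collect output paths mentioned in the text, minus trailing punctuation."""
    return {match.rstrip(".,;:") for match in OUTPUT_PATH.findall(text)}


def same_artifact(description: str, rubric: str) -> bool:
    """True only if both texts name at least one path and name the same ones."""
    described = artifact_paths(description)
    graded = artifact_paths(rubric)
    return bool(described) and described == graded
```

A description that asks for inline output names no path at all, so the check fails, which is the contradiction you want flagged before you pay for a session.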

Raising max_iterations is not how you fix a loop that won't converge. The default is 3, the max is 20. If every run hits the cap with the grader flagging the same kind of issue, the writer can't act on the feedback and you're paying for iterations that don't move the draft. That's a rubric problem, not a budget problem. Read the grader's explanation, find the line it keeps failing, and make that line clearer or more reachable. A grader that's too strict costs you an extra loop. One that's too lenient ends the loop with the bad version still in place.
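
Spotting the stuck line is easy to automate once you have the failed items from each pass. The sketch below assumes you have already pulled those item names out of the grader's explanations yourself; stuck_items is a hypothetical helper, not part of the SDK.

```python
def stuck_items(failures_per_pass: list[set[str]]) -> set[str]:
    """Return the rubric items the grader flagged on every single pass.

    Needs at least two passes; one failure is feedback, not a pattern.
    """
    if len(failures_per_pass) < 2:
        return set()
    # Items present in every pass are the ones the writer never fixed.
    return set.intersection(*map(set, failures_per_pass))
```

Anything this returns is a candidate for a rewrite in the rubric, not a reason to raise the cap.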

It's still a Managed Agent, with the same catch from last time. The grade-and-revise loop runs real sessions with real tool calls. The cookbook run took about 13 minutes across three passes. This is for work where quality matters more than the few minutes and tokens it costs, not for a one-line answer. And the stateful-by-design tradeoff still holds: Managed Agents are not eligible for Zero Data Retention or a HIPAA BAA right now, so if the brief touches a client's regulated data, plan the cleanup before you point an agent at it.

The rule of thumb: Outcomes fits when you can turn "good" into lines a stranger could check. If you can't write the rubric, the grader can't enforce one, and you're back to reading the output by hand.

Sparks

A few more things worth your attention this week:

The rubric doesn't have to live inline. Upload it once through the Files API (beta header files-api-2025-04-14) and pass rubric: {"type": "file", "file_id": ...} so it's reusable across sessions and reviewable like code. Check your rubrics into the repo and treat a rubric change like a code change.
Don't have a rubric yet? Hand Claude a known-good example of the deliverable and ask it to analyze what makes it good, then turn that analysis into criteria. That middle ground beats writing a rubric from a blank page.
One outcome runs at a time, but you can chain them. After a loop terminates the session is conversational again, and a new user.define_outcome starts the next one against the same history.
Multi-agent orchestration is the other half of this release: a lead agent that fans work out to specialist subagents with their own models, prompts, and tools. We said in Issue #12 we'd build one, and pairing it with a grader is the natural shape. We'll wire a planner to its specialists in a future issue.
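
Back to the file-based rubric from the first item: the event is the same shape as in Step 3 with the rubric swapped for a file reference. Here is a minimal sketch of building it. The outcome_event helper is hypothetical, and the lower bound of 1 on max_iterations is an assumption; only the default of 3 and the max of 20 come from the limits covered earlier.

```python
def outcome_event(description: str, file_id: str, max_iterations: int = 3) -> dict:
    """Build a user.define_outcome event that points at an uploaded rubric file."""
    # Upper bound of 20 matches the documented max; the lower bound is assumed.
    if not 1 <= max_iterations <= 20:
        raise ValueError("max_iterations must be between 1 and 20")
    return {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "file", "file_id": file_id},
        "max_iterations": max_iterations,
    }
```

Pass the result in the events list of the same send call used in Step 3, with the file ID you got back from the upload.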

The Smith's Take

For a long time, "done" was a judgment you made by reading the output. The agent produced something, you decided whether it was good enough, and the standard lived in your head. Outcomes moves that standard out of your head and into a rubric the agent has to satisfy before the session goes idle. "Done" stops being a vibe and becomes a spec.

The builders who get value from this aren't the ones writing longer system prompts hoping the model tries harder. They're the ones who took the review they were already doing by hand, the checklist they run in their head every time they read a draft, and wrote it down where a stateless grader can enforce it every single time. The standard doesn't drift, doesn't get tired on the tenth brief, and doesn't take your shortcuts.

Pick one deliverable you produce on a cadence (the weekly competitor scan, the data-quality report, the cited brief) and write the rubric you'd use to reject a junior's first draft. Run it as an outcome this week and read the grader's feedback on the first pass. You'll learn more about your own standard from watching it get enforced than from any prompt you could write.

Build agents that actually work.

Michael