Agent Checks

Agent checks review an agent configuration and surface advisory findings while you build: structural problems (duplicated instructions, conflicting style guidance), completeness gaps (tool references that do not exist), and cost warnings (oversized prompts).

Checks are advisory only. Findings never block saving, publishing, or version creation.

Findings surface in three places:

  • Agent editor → Preview tab: a Checks card lists findings for the current draft and updates as you edit.
  • API: POST /v1/agents/preview returns a findings array alongside the resolved system prompt and tools.
  • MCP / platform commands: the preview_agent and analyze_agent commands return the same findings, so agents and automations can review configurations programmatically.

Each finding includes:

| Field | Description |
| --- | --- |
| rule_id | Stable rule identifier, e.g. prompt.duplicate_paragraphs |
| severity | warning, info, or suggestion. There is no error severity because checks never block |
| category | structure, completeness, effectiveness, safety, or cost |
| message | Human-readable explanation |
| location | The config field (and byte span, when applicable) the finding points at |
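
As a rough illustration of consuming these fields, the sketch below parses a findings array and groups it by severity. The five field names come from the table above; the top-level findings key follows the API description, while the shape of location and the sample payload used with it are assumptions, not documented behavior.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Display order only; the three values are the documented severities.
SEVERITY_ORDER = ["warning", "info", "suggestion"]


@dataclass
class Finding:
    rule_id: str
    severity: str
    category: str
    message: str
    location: Optional[dict] = None  # exact shape is an assumption


def parse_findings(response: dict) -> list[Finding]:
    """Read the findings array from a preview-style response body."""
    return [
        Finding(
            rule_id=item["rule_id"],
            severity=item["severity"],
            category=item["category"],
            message=item["message"],
            location=item.get("location"),
        )
        for item in response.get("findings", [])
    ]


def group_by_severity(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Group findings by severity, known severities first."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.severity].append(finding)
    # Unknown severities are kept and sorted after the documented ones.
    extras = sorted(s for s in grouped if s not in SEVERITY_ORDER)
    return {s: grouped[s] for s in SEVERITY_ORDER + extras if s in grouped}
```

Because findings are advisory, a caller would typically print or log the grouped result rather than fail a build on it.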

The Analyze button on the Checks card runs a deeper on-demand review using the platform’s internal utility LLM (requires UTILITY_OPENAI_API_KEY on the deployment). Three scoped checkers run in parallel:

| Rule | What it catches |
| --- | --- |
| llm.contradiction | Instructions that cannot both be followed, including conflicts between the prompt and capability contributions |
| llm.structure | Redundancy, verbosity, vague instructions, and structure that buries critical rules |
| llm.tool_guidance | Prompt guidance that misdescribes available tools or assumes functionality no tool provides |

LLM findings can carry a suggested replacement for the offending text; when the finding is anchored to a span of your prompt, an Apply fix button replaces that span in place. Analysis is also available through the API: POST /v1/agents/analyze returns the built-in and LLM findings merged together.
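
To make the in-place replacement concrete, here is a minimal sketch of applying a suggested fix to a span. It assumes the span is a half-open pair of byte offsets into the UTF-8 encoded prompt; the function name and parameters are hypothetical and are not the platform's implementation.

```python
def apply_fix(prompt: str, start: int, end: int, replacement: str) -> str:
    """Replace bytes [start, end) of the UTF-8 encoded prompt.

    The offsets are assumed to be byte offsets (not character offsets),
    matching the "byte span" wording of the location field.
    """
    raw = prompt.encode("utf-8")
    if not 0 <= start <= end <= len(raw):
        raise ValueError("span is outside the prompt")
    patched = raw[:start] + replacement.encode("utf-8") + raw[end:]
    # Decoding raises UnicodeDecodeError if the span split a multi-byte character.
    return patched.decode("utf-8")
```

Working on bytes rather than characters matters once the prompt contains non-ASCII text, because byte offsets and character offsets then diverge.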

The reviewed prompt is treated strictly as data: checker outputs are bounded, severities are clamped, and findings are advisory text only.

A health check runs the agent for real. It generates a handful of smoke-test cases from the agent’s description, system prompt, and capabilities, runs each as an actual session against the agent’s configured model, and scores the result two ways:

  • Deterministic: the agent produced a non-empty answer and finished within a turn budget.
  • AI judge: the platform’s utility LLM grades the agent’s final response against a rubric generated for that case.

A case passes only when both checks pass. The run surfaces a score card (pass rate, passed count, average score, average turns) and a per-case list — each case links to the real session so you can inspect the full conversation, tool calls, and events.
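
The pass rule and score card described above can be sketched as follows. The turn budget, the 0.0 to 1.0 judge score, and the pass threshold are all illustrative assumptions; the documentation does not state the actual values or scoring scale.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    answer: str         # the agent's final response
    turns: int          # turns the session took
    judge_score: float  # hypothetical 0.0-1.0 grade from the AI judge


def case_passes(case: CaseResult, turn_budget: int = 10,
                judge_threshold: float = 0.7) -> bool:
    """A case passes only when the deterministic AND judge checks pass."""
    deterministic = bool(case.answer.strip()) and case.turns <= turn_budget
    judged = case.judge_score >= judge_threshold
    return deterministic and judged


def score_card(cases: list[CaseResult], turn_budget: int = 10,
               judge_threshold: float = 0.7) -> dict:
    """Summarize a run: pass rate, passed count, average score and turns."""
    if not cases:
        raise ValueError("a run needs at least one case")
    total = len(cases)
    passed = sum(case_passes(c, turn_budget, judge_threshold) for c in cases)
    return {
        "pass_rate": passed / total,
        "passed": passed,
        "total": total,
        "average_score": sum(c.judge_score for c in cases) / total,
        "average_turns": sum(c.turns for c in cases) / total,
    }
```

Note that a case with a perfect judge score still fails if it exceeds the turn budget or returns an empty answer, which is the point of requiring both checks.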

Health checks are asynchronous (they run several real sessions and take a minute or two). Trigger one and poll for the result:

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/agents/{agent_id}/health-checks | Start a run; returns a pending run with an id |
| GET | /v1/agents/{agent_id}/health-checks/{run_id} | Poll the run; status goes pending → running → completed/failed |
| GET | /v1/agents/{agent_id}/health-checks | List recent runs for the agent |
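
A trigger-and-poll loop over these endpoints might look like the sketch below. The paths, the id field, and the status values come from the table; the bearer-token header, the empty POST body, the polling interval, and the timeout are assumptions. The HTTP call is injected so the loop can be exercised without a live deployment.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def make_request(base_url: str, token: str) -> Callable[[str, str], dict]:
    """Build a JSON request function; the auth scheme is an assumption."""
    def request(method: str, path: str) -> dict:
        req = urllib.request.Request(
            base_url.rstrip("/") + path,
            method=method,
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return request


def run_health_check(request: Callable[[str, str], dict], agent_id: str,
                     interval: float = 5.0, timeout: float = 300.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Start a health check and poll until it completes or fails."""
    run = request("POST", f"/v1/agents/{agent_id}/health-checks")
    run_id = run["id"]
    deadline = time.monotonic() + timeout
    while run.get("status") not in TERMINAL_STATUSES:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"health check {run_id} did not finish in time")
        sleep(interval)
        run = request("GET", f"/v1/agents/{agent_id}/health-checks/{run_id}")
    return run
```

Since a run takes a minute or two, a polling interval of a few seconds and a timeout of several minutes are reasonable starting points.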

Runs are stored per agent and keyed by the resolved config hash. Health checks require the utility LLM (UTILITY_OPENAI_API_KEY) to generate and judge cases, and the agent’s own model must be usable. They are advisory: a low score never blocks anything.

Checks run against the resolved configuration — after harness and capability contributions are merged — so they can catch issues that span layers.

| Rule | Severity | What it catches |
| --- | --- | --- |
| prompt.empty | info | Agent has no system prompt of its own |
| prompt.very_long | warning | Authored prompt over 32 KiB, sent on every model turn |
| prompt.resolved_very_long | info | Full prompt over 96 KiB after harness/capability contributions |
| prompt.template_variables | warning | {{placeholder}} text that would reach the model literally |
| prompt.duplicate_paragraphs | warning | The same paragraph appears more than once |
| prompt.restates_contribution | info | Prompt duplicates text already contributed by the harness or a capability |
| prompt.conflicting_style | info | Asks for both brevity and detail without stating conditions |
| tools.unknown_reference | info | Prompt references a tool that no enabled tool or capability provides |
| tools.duplicate_names | warning | Two tools share a name, so the model cannot distinguish them |
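
For intuition, a few of these rules can be approximated in a handful of lines. The sketch below is not the platform's implementation: the thresholds and severities come from the table, but measuring size in UTF-8 bytes, scanning the resolved prompt for placeholders, and the placeholder pattern itself are assumptions.

```python
import re

KIB = 1024
AUTHORED_LIMIT = 32 * KIB   # prompt.very_long threshold
RESOLVED_LIMIT = 96 * KIB   # prompt.resolved_very_long threshold
PLACEHOLDER = re.compile(r"\{\{[^{}]+\}\}")  # assumed {{placeholder}} syntax


def _finding(rule_id: str, severity: str, message: str) -> dict:
    return {"rule_id": rule_id, "severity": severity, "message": message}


def check_prompt(authored: str, resolved: str) -> list[dict]:
    """Run four illustrative checks over the authored and resolved prompts."""
    findings = []
    if not authored.strip():
        findings.append(_finding(
            "prompt.empty", "info", "Agent has no system prompt of its own"))
    if len(authored.encode("utf-8")) > AUTHORED_LIMIT:
        findings.append(_finding(
            "prompt.very_long", "warning", "Authored prompt over 32 KiB"))
    if len(resolved.encode("utf-8")) > RESOLVED_LIMIT:
        findings.append(_finding(
            "prompt.resolved_very_long", "info", "Full prompt over 96 KiB"))
    for match in PLACEHOLDER.finditer(resolved):
        findings.append(_finding(
            "prompt.template_variables", "warning",
            f"{match.group(0)} would reach the model literally"))
    return findings
```

Passing both the authored and the resolved prompt mirrors the way checks run after harness and capability contributions are merged.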

High-cardinality rules (prompt.duplicate_paragraphs, tools.unknown_reference, tools.duplicate_names) cap how many findings they emit. When the cap is exceeded they add a single companion info finding with the rule ID suffixed .summary (e.g. prompt.duplicate_paragraphs.summary) noting that only the first N were shown, so a large prompt cannot amplify into an unbounded response.
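
The capping behavior can be illustrated with a short sketch. The .summary suffix and the info severity come from the paragraph above; the cap value, the message wording, and the function name are hypothetical.

```python
def cap_findings(findings: list[dict], rule_id: str, limit: int) -> list[dict]:
    """Keep at most `limit` findings for rule_id and add a summary finding.

    Findings for other rules pass through unchanged and keep their order.
    """
    total = sum(1 for f in findings if f["rule_id"] == rule_id)
    if total <= limit:
        return list(findings)

    kept, seen = [], 0
    for finding in findings:
        if finding["rule_id"] == rule_id:
            seen += 1
            if seen > limit:
                continue  # drop everything past the cap
        kept.append(finding)

    # One companion finding records that the list was truncated.
    kept.append({
        "rule_id": f"{rule_id}.summary",
        "severity": "info",
        "message": f"Showing the first {limit} of {total} findings for {rule_id}.",
    })
    return kept
```

With a cap in place, the response size grows with the number of rules rather than with the size of the prompt being checked.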

A later phase adds org-configurable rules: per-rule enable/severity settings for the built-ins plus custom declarative and natural-language-rubric rules.