Building Agent Skills: Intent, Determinism, and Stability

The Core Idea

Agent Skills package workflows into reusable instructions your AI agent can follow. The question is: how much structure do you actually need?

The answer depends on three things: Intent (how clearly you know what you want), Determinism (how much mechanical work you offload to tools), and Stability (how well the skill holds up when conditions change). These map to layers you add incrementally.

The Mental Model

An Agent Skill has three layers:

Intent -> Markdown instructions that describe what should happen
Determinism -> Tools and scripts for steps that should be mechanical
Stability -> Tests and AI evaluations that catch drift

The shape is progressive. Every skill starts at Level 0. You add layers when you need them, not before.

  flowchart TD
  A([Start: I have or am developing<br/>an Agent Skill]) --> Q1{Only for you<br/>and you're happy manually<br/>reviewing outputs?}

  Q1 -- Yes --> L0["Level 0: Intent<br/>Markdown instructions only<br/>Clear inputs/outputs<br/>Examples + define 'good enough'"]

  Q1 -- No --> Q2{Need repeatable<br/>structure or mechanical<br/>consistency?}

  Q2 -- No --> L0
  Q2 -- Yes --> L1["Level 1: Determinism<br/>Move mechanical steps<br/>into tools/scripts<br/>Use structured output<br/>Log tool inputs/outputs"]

  L1 --> Q3{Will others use it<br/>or will you modify it often<br/>without manual re-checking?}

  Q3 -- No --> L1
  Q3 -- Yes --> L2["Level 2: Stability<br/>Unit tests for tools<br/>AI evals for behavior<br/>Golden cases + edge cases"]

  L2 --> Q4{Can it access sensitive<br/>data or take impactful<br/>actions or run unattended?}

  Q4 -- No --> L2
  Q4 -- Yes --> L3["Level 3: Safety/Scale<br/>Guardrails + least privilege<br/>Human approval for high impact<br/>Security-focused evals"]
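
The same decision path can be read as a small function. This is only a sketch of the flowchart above: the function name and the four boolean parameters are hypothetical labels for the four questions, not part of any specification.

```python
def recommended_level(
    solo_with_manual_review: bool,
    needs_mechanical_consistency: bool,
    shared_or_often_modified: bool,
    sensitive_or_unattended: bool,
) -> int:
    """Walk the four flowchart questions in order and return the level to stop at."""
    # Q1 and Q2: a solo, manually reviewed skill, or one with no need for
    # mechanical consistency, stays at markdown instructions only.
    if solo_with_manual_review or not needs_mechanical_consistency:
        return 0
    # Q3: tools and scripts are enough until others rely on the skill
    # or it changes often without manual re-checking.
    if not shared_or_often_modified:
        return 1
    # Q4: tests and evals are enough unless the skill touches sensitive
    # data, takes impactful actions, or runs unattended.
    if not sensitive_or_unattended:
        return 2
    return 3
```

Each question is only asked if the previous one pushed you further, which is why the later parameters have no effect once an earlier check returns.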

Level 0: Intent (Markdown Only)

This is the default state. Write clear markdown instructions describing what the skill should do. Most agent CLIs have a built-in skill builder for this.

When this is enough:

You are the only user
You manually review every output
The task is well-defined and unlikely to drift
You stay in the loop for approvals

Keep instructions tight: define inputs and outputs explicitly, include examples of good results, and describe what “good enough” looks like.
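
As an illustration of what tight instructions might contain, here is a hypothetical skill text with one section per item above. The headings, the example task, and the helper `missing_sections` are all assumptions made for this sketch, not a required format. The check is optional: at Level 0 the markdown alone is the skill.

```python
# Hypothetical skill instructions; the headings are illustrative only.
SKILL_TEXT = """\
# Weekly report summary

## Inputs
A CSV export with the columns: date, team, hours.

## Outputs
A bulleted summary of at most five lines, one per team.

## Examples
- "Platform: 42 hours, up from 38 last week."

## Good enough
Every team in the CSV appears once and the totals match the CSV.
"""

REQUIRED_HEADINGS = ("## Inputs", "## Outputs", "## Examples", "## Good enough")


def missing_sections(text: str) -> list[str]:
    """List the required headings that do not appear as a line of their own."""
    lines = {line.strip() for line in text.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in lines]
```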

Level 1: Determinism (Tools and Scripts)

The model should not reason about things a calculator, parser, or script could handle. Move mechanical steps into deterministic tools.

What to move:

Data validation and format checking
File parsing and transformation
Calculations and aggregations
API calls with fixed schemas
Any step that should produce the same output given the same input
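
A minimal sketch of such a tool, assuming a hypothetical expenses CSV with `category` and `amount` columns. It validates each row and totals the amounts, so the model never has to do the arithmetic itself. The function name and the column names are invented for illustration.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


def summarize_expenses(csv_text: str) -> dict:
    """Validate rows and total the amounts per category, deterministically."""
    totals: dict[str, Decimal] = {}
    errors: list[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        category = (row.get("category") or "").strip()
        raw_amount = (row.get("amount") or "").strip()
        if not category:
            errors.append(f"line {line_no}: missing category")
            continue
        try:
            amount = Decimal(raw_amount)
        except InvalidOperation:
            errors.append(f"line {line_no}: bad amount {raw_amount!r}")
            continue
        if not amount.is_finite():  # reject "NaN" and "Infinity"
            errors.append(f"line {line_no}: bad amount {raw_amount!r}")
            continue
        totals[category] = totals.get(category, Decimal("0")) + amount
    # Amounts are returned as strings so the result stays exact and JSON-friendly.
    return {
        "totals": {name: str(total) for name, total in sorted(totals.items())},
        "errors": errors,
    }
```

The same input always yields the same totals and the same error list, which is the property the last item in the list asks for.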

Benefits:

Prevents error compounding: a script cannot hallucinate a calculation result
Reduces token spend: the model thinks less about routine steps
Keeps the output of mechanical steps consistent across model versions

If possible, log tool inputs and outputs. This makes debugging and observability much easier when something goes wrong.
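
One way to do this, sketched with Python's standard `logging` module: a decorator that writes each call's inputs, output, and duration as a single JSON line. The decorator name `logged_tool`, the logger name, and the example tool `word_count` are hypothetical.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("skill.tools")


def logged_tool(func):
    """Log each call's inputs, output or error, and duration as one JSON line."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        record = {"tool": func.__name__, "args": args, "kwargs": kwargs}
        try:
            result = func(*args, **kwargs)
            record["output"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_ms"] = round((time.perf_counter() - started) * 1000, 2)
            # default=repr keeps logging from failing on values JSON cannot encode.
            logger.info(json.dumps(record, default=repr))

    return wrapper


@logged_tool
def word_count(text: str) -> int:
    """Example tool: count whitespace-separated words."""
    return len(text.split())
```

Nothing is emitted until the application configures logging, for example with `logging.basicConfig(level=logging.INFO)`. If tool inputs can contain sensitive data, redact them before they reach the log.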

Level 2: Stability (Tests and AI Evals)

Once you share a skill with others or modify it frequently, you need a way to know it still works without manually checking every run.

Two kinds of tests:

Unit tests for tools and scripts: Standard software testing. If your skill calls a tool that parses JSON, filters rows, or makes an API call, that tool should have unit tests. These are cheap, deterministic, and catch regressions immediately.
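
A minimal sketch using Python's built-in `unittest`, assuming a hypothetical row-filtering tool named `filter_rows`. Both the tool and its behavior for missing scores are invented for illustration.

```python
import unittest


def filter_rows(rows: list[dict], min_score: int) -> list[dict]:
    """Hypothetical tool: keep rows whose 'score' is at least min_score."""
    return [row for row in rows if row.get("score", 0) >= min_score]


class FilterRowsTest(unittest.TestCase):
    def test_keeps_rows_at_or_above_threshold(self):
        rows = [{"id": 1, "score": 5}, {"id": 2, "score": 9}]
        self.assertEqual(filter_rows(rows, 9), [{"id": 2, "score": 9}])

    def test_missing_score_counts_as_zero(self):
        self.assertEqual(filter_rows([{"id": 3}], 1), [])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(filter_rows([], 0), [])
```

Saved in a test file, these run with `python -m unittest` and need no model call, which is what keeps them cheap.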

AI evals for behavior: These evaluate the model’s output against what you actually want. Start with a minimal set of golden cases (known good inputs with expected outputs) and a few edge cases. The eval checks that the skill produces the right kind of response, stays on topic, handles errors gracefully, and does not invent things.

An eval suite does not need to be comprehensive on day one. A few well-chosen test cases that cover the critical paths are enough to catch most regressions.
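
A small starting point might look like the sketch below. It assumes the skill can be called as a function from prompt to text, and it uses plain substring checks as a deliberately crude grader; real evals often use richer scoring. `GoldenCase`, `run_evals`, and `fake_skill` are hypothetical names, and `fake_skill` stands in for the real agent call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    name: str
    prompt: str
    must_include: list[str]      # phrases a good answer should contain
    must_not_include: list[str]  # phrases that would signal invention or drift


def run_evals(run_skill: Callable[[str], str], cases: list[GoldenCase]) -> dict[str, list[str]]:
    """Return the problems found per case name; an empty dict means all passed."""
    failures: dict[str, list[str]] = {}
    for case in cases:
        output = run_skill(case.prompt).lower()
        problems = [f"missing {p!r}" for p in case.must_include if p.lower() not in output]
        problems += [f"forbidden {p!r}" for p in case.must_not_include if p.lower() in output]
        if problems:
            failures[case.name] = problems
    return failures


def fake_skill(prompt: str) -> str:
    """Stand-in for the real agent call."""
    if not prompt.strip():
        return "Error: no input provided."
    return "Summary: revenue grew 4% in Q2."


CASES = [
    GoldenCase("golden-summary", "Summarize: revenue grew 4% in Q2.", ["4%", "Q2"], ["Q3"]),
    GoldenCase("edge-empty-input", "", ["error"], ["summary:"]),
]
```

Calling `run_evals(fake_skill, CASES)` returns an empty dict here. Swapping `fake_skill` for the real skill turns the same two cases into a regression check.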

Level 3: Safety and Scale

If your skill can access sensitive data, take impactful actions, or run unattended, guardrails shift from nice-to-have to essential.

Least privilege: Grant the minimum permissions needed. Do not give a summarization skill database write access.
Human approval gates: Require confirmation for high-impact actions such as sending emails, modifying production data, or spending money.
Security-focused evals: Test against prompt injection, unauthorized data access, and action boundary violations.
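
The first two guardrails can be sketched as a wrapper around every action the skill takes. The action names, the `guarded_call` function, and the approval callback are hypothetical. A real deployment would also enforce permissions at the credential level, not only in code.

```python
from typing import Callable

# Hypothetical policy: what this skill may do, and which actions need sign-off.
ALLOWED_ACTIONS = {"read_report", "send_email"}
NEEDS_APPROVAL = {"send_email"}


class ActionDenied(Exception):
    """Raised when an action is outside the allowlist or was not approved."""


def guarded_call(
    action: str,
    handler: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    # Least privilege: anything not explicitly allowed is refused.
    if action not in ALLOWED_ACTIONS:
        raise ActionDenied(f"{action} is not permitted for this skill")
    # Approval gate: high-impact actions wait for a human decision.
    if action in NEEDS_APPROVAL and not approve(action):
        raise ActionDenied(f"{action} was not approved")
    return handler()


def ask_on_terminal(action: str) -> bool:
    """One possible approval callback: ask the operator interactively."""
    return input(f"Allow '{action}'? [y/N] ").strip().lower() == "y"
```

Passing the approval step in as a callback keeps the gate testable: security-focused evals can supply a callback that always refuses and assert that nothing high-impact runs.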

Add observability as soon as you notice you are missing information. Monitor token spend, latency, tool selection, and failure rates. Without telemetry, you cannot answer the question “what happened in that run?”
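
A minimal in-memory sketch of those four signals, assuming each run reports its own token count, latency, tools used, and outcome. `RunRecord` and `Telemetry` are hypothetical names. A real setup would export these to whatever monitoring system you already use.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    tokens: int
    latency_s: float
    tools: list[str]
    failed: bool = False


@dataclass
class Telemetry:
    runs: list[RunRecord] = field(default_factory=list)

    def record(self, run: RunRecord) -> None:
        self.runs.append(run)

    def summary(self) -> dict:
        """Aggregate token spend, latency, failure rate, and tool selection."""
        count = len(self.runs)
        if count == 0:
            return {"runs": 0}
        return {
            "runs": count,
            "total_tokens": sum(r.tokens for r in self.runs),
            "avg_latency_s": round(sum(r.latency_s for r in self.runs) / count, 3),
            "failure_rate": round(sum(r.failed for r in self.runs) / count, 3),
            "tool_calls": dict(Counter(t for r in self.runs for t in r.tools)),
        }
```

Keeping the individual run records, not only the aggregates, is what lets you go back and inspect a single run.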

Practical Takeaways

The levels are illustrative. Their purpose is to prevent paralysis, either from fear of breaking things or from too many choices at the start.

Build only what you need. If markdown is working, stop there. When a step stabilizes and you find yourself repeating the same manual check, move it into a tool. When you start sharing the skill or worrying about regressions, add evals.

After you build a few skills, recognizing patterns becomes easier. You may start thinking about intent, determinism, stability, and safety from the beginning. That does not mean implementing everything at once. It means being aware of more tradeoffs earlier.

The progression works for any agent CLI (Claude Code, Codex, OpenCode) and for frameworks such as Doug Trajano’s Agent Skills implementation for PydanticAI.

References

Alex Guglielmone Nemi, Building Agent Skills: Intent, Determinism, and Stability, alexhans.github.io
Agent Skills specification, agentskills.io
AI Evals primer, ai-evals.io
Error Compounding in GenAI Systems, alexhans.github.io
Agent Skills for PydanticAI (Doug Trajano), github.com/DougTrajano/pydantic-ai-skills

Crepi il lupo! 🐺