How to really stop your agents from making the same mistakes (7 minute read)
Y Combinator CEO Garry Tan presents "skillification," a systematic 10-step workflow that converts AI agent failures into permanent, tested skills instead of relying on prompt engineering and hoping for the best.
What: Skillification is a development practice where each AI agent failure becomes a durable "skill" consisting of a markdown procedure, deterministic scripts for precise tasks, comprehensive tests (unit, integration, LLM evals), and routing rules that structurally prevent the same mistake from recurring. Tan built this into GBrain, an open-source knowledge engine that enforces quality gates.
Why it matters: Most AI agent frameworks (including LangChain with its $160 million in funding) provide testing primitives but no opinionated workflow for when to test, what to test, or how to make fixes permanent, leaving developers to reinvent reliability practices from scattered tools while their agents repeat the same mistakes across conversations.
Takeaway: Developers can adopt the 10-step skillify checklist (SKILL.md, deterministic code, unit/integration tests, LLM evals, resolver triggers/evals, DRY audit, smoke tests, filing rules) or install GBrain from github.com/garrytan/gbrain to enforce the workflow automatically.
Deep dive
- Tan demonstrates the problem with two real failures: in one, his agent spent 5 minutes calling blocked calendar APIs and searching email instead of grepping local calendar files that already contained the answer; in the other, the agent did UTC-to-Pacific timezone math in its head and was off by exactly one hour
- The core insight is distinguishing latent work (requires judgment, belongs in LLM reasoning) from deterministic work (same input/output every time, should be handled by scripts) - letting agents waste compute on deterministic tasks in their reasoning space is the architectural bug
- Each skill is a markdown contract that teaches the model how to approach a task, not what to do - the agent reads the skill, understands that certain work is deterministic, and generates a script to handle it (e.g., calendar-recall.mjs runs grep in under 100ms instead of multi-minute LLM reasoning)
- The 10-step checklist ensures every skill is production-ready: SKILL.md contract, deterministic scripts, unit tests (179 tests across 5 suites run in 2 seconds), integration tests against real data, LLM-as-judge evals (35 run daily for context-now), resolver trigger entries, resolver routing evals, check-resolvable + DRY audit to find unreachable/duplicate skills, E2E smoke tests, and brain filing rules
- Skillify becomes a verb in daily workflow - Tan prototypes something in conversation, sees it work, says "skillify it," and the agent automatically executes all 10 steps to make the solution permanent infrastructure without filing tickets or writing specs
- First run of check-resolvable found 6 out of 40+ skills were "dark" (unreachable capabilities with no routing path) including a flight tracker nobody could invoke and a citation fixer not listed in the resolver at all - 15% of the system's capabilities were invisible
- LLM evals catch process failures, not just wrong answers - one eval feeds the agent "my flight leaves in 45 minutes, will I make it to SFO?" and fails if the agent does mental math instead of running the context-now.mjs script, because even correct mental math will be wrong next time
- The DRY audit prevents skill proliferation - Tan built a matrix showing calendar-recall (historical events), calendar-check (future planning), google-calendar (live API), and context-now (immediate context) each have distinct non-overlapping domains to prevent ambiguous routing
- GBrain is the open-source knowledge engine that enforces this: gbrain doctor checks the full checklist, and gbrain doctor --fix auto-repairs DRY violations and duplicates, with git working-tree checks to prevent data loss
- Tan critiques Hermes Agent's skill_manage tool for letting agents autonomously create skills but never testing them, leading to rot: duplicate skills with ambiguous routing, skills that silently break when APIs change shape, and orphan skills with weak triggers that never match
- The thesis: in healthy software engineering every bug gets a test that lives forever making recurrence structurally impossible - AI agents should work the same way, with every failure becoming a permanent skill with evals that run daily
- GBrain SkillPacks are portable bundles of skills, resolver triggers, scripts, and tests that can be installed into any agent setup, making capabilities built for one agent reusable across others
- Tan positions this as the workflow piece that $160 million in LangChain funding missed - not the testing primitives or eval tooling, but the moment where a human says "that worked, now make it permanent" and the system knows exactly what permanent means
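The off-by-one-hour timezone failure above is the clearest case of work that belongs in a script rather than in latent reasoning. A minimal sketch of that kind of deterministic helper, in the spirit of the article's context-now.mjs (the function name toPacific is illustrative, not GBrain's actual code):

```javascript
// Deterministic UTC-to-Pacific conversion: same input, same output,
// every time - the category of work the article says should never
// happen in the model's reasoning space.
// toPacific is a hypothetical helper, not part of GBrain.
function toPacific(utcIso) {
  const date = new Date(utcIso);
  // Intl handles the PST/PDT offset, so DST can't cause a silent
  // one-hour error the way mental math did in Tan's example.
  return new Intl.DateTimeFormat("en-US", {
    timeZone: "America/Los_Angeles",
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).format(date);
}
```

Once a script like this exists, the skill's job is only to teach the agent to call it instead of reasoning about offsets.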
Decoder
- Skillification: The practice of converting AI agent failures into permanent "skills" - markdown procedures paired with deterministic scripts and comprehensive tests that prevent recurring mistakes
- Latent vs deterministic work: Latent work requires LLM judgment and reasoning; deterministic work has the same input/output every time and should be handled by precise scripts instead of model inference
- Thin harness/fat skills: An architecture pattern where the agent runtime (harness) is minimal and most capability lives in well-tested, documented skills that the agent invokes
- Resolver: A routing table that maps task types to skills, determining which skill fires when the agent encounters a particular intent or phrase
- LLM-as-judge: Using one language model to evaluate another model's outputs against a rubric, useful when correctness requires judgment rather than exact matching
- GBrain: Tan's open-source knowledge engine that manages a "brain repo" of markdown files, runs evals, and enforces the 10-step quality gates that make skills durable
- Dark skills: Capabilities that exist in the codebase with working scripts but have no routing path from the resolver, making them invisible and unreachable to the agent
- DRY audit: "Don't Repeat Yourself" check that identifies duplicate or overlapping skills that would cause ambiguous routing or redundant capabilities
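The resolver and dark-skill concepts above can be sketched together: a routing table mapping trigger patterns to skills, plus a check-resolvable-style pass that finds skills with no routing path. All names and the table shape are illustrative assumptions, not GBrain's actual data model:

```javascript
// Hypothetical resolver: trigger patterns routed to skill names.
const resolver = [
  { triggers: [/calendar.*last week|what did i do/i], skill: "calendar-recall" },
  { triggers: [/flight|will i make it/i], skill: "context-now" },
];

// All skills present in the repo, including one with no trigger entry.
const skills = ["calendar-recall", "context-now", "flight-tracker"];

// Route an utterance to the first skill whose trigger matches.
function route(utterance) {
  for (const entry of resolver) {
    if (entry.triggers.some((t) => t.test(utterance))) return entry.skill;
  }
  return null;
}

// check-resolvable analogue: skills with no routing path are "dark" -
// working code the agent can never reach.
function darkSkills() {
  const reachable = new Set(resolver.map((e) => e.skill));
  return skills.filter((s) => !reachable.has(s));
}
```

Here flight-tracker is dark: its script exists, but no trigger ever routes to it, mirroring the unreachable flight tracker the article describes.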
Original article
Relying on prompts to correct recurring AI agent mistakes is an unreliable, "vibes-based" approach that decays as soon as conversations become complex. To solve this, Y Combinator CEO Garry Tan advocates for "skillification." Instead of letting an agent waste compute attempting to solve deterministic tasks (like historical calendar lookups) in its latent space, this framework forces the AI to execute precise local scripts.
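The eval style the article describes, grading how the agent answered rather than whether the answer happened to be right, can be sketched as a transcript check. The transcript shape and function name are assumptions for illustration:

```javascript
// Process-checking eval: fail if the agent did mental time math
// instead of running the deterministic script, even if the answer
// it produced was correct this time.
function evalUsedScript(transcript) {
  const ranScript = transcript.toolCalls.some((call) =>
    call.command.includes("context-now.mjs")
  );
  return {
    pass: ranScript,
    reason: ranScript
      ? "used deterministic script"
      : "did time math in latent space",
  };
}
```

This is the point of the "will I make it to SFO?" eval: correct mental math still fails, because the process is what will break next time.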