Lessons on Building MCP Servers (5 minute read)
A practical guide to designing MCP servers that guide AI models through multi-step workflows by embedding breadcrumbs rather than expecting models to plan ahead.
Deep dive
- Models don't have hidden planners—they scan available tools and pick whatever seems most probable based on conversation context, so servers must make the next call blindingly obvious at every step
- The author's Office server exposes 100+ tools but funnels models toward 8 core verbs through instructions, treating specialized tools as fallback/diagnostic options to prevent five-call detours for one-call jobs
- Consistent naming exploits probability: all Word tools are word_*, Excel tools excel_*, unified tools office_*; a model that just called office_inspect will naturally reach for office_patch next because the prefix matches
- Every tool response should include a breadcrumb dictionary with next_tools and usage hints showing exact call syntax; smaller models will copy these verbatim because it's the most likely token sequence
- Discovery should be a callable tool like office_help(goal=...) that returns structured recommendations with rationale and next steps, not prose documentation; called with no arguments it returns the catalogue, and with unknown input it returns the supported set instead of erroring
- Use stable addressing like anchors, IDs, or structured paths instead of byte offsets or natural-language descriptions that models lose between calls; if you return data the model has to describe back in natural language, your chain will misfire
- Collapse similar tools into mode parameters (dry_run, best_effort, safe, strict) rather than separate tools; discovery cost scales with tool count, not mode count, and models figure out escalation chains like dry_run → safe → strict on their own
- Return standardized diagnostic envelopes with named fields like matched_targets and unmatched_targets that create branching points and recovery loops without forcing the model to re-read the entire context
- Always provide read-only introspection tools so confused models can "look again" without destructive consequences; the penalty becomes one extra round-trip instead of broken files
- The design checklist includes: pick 5-10 core verbs and name them in instructions, use consistent prefixes, embed forward breadcrumbs in responses, provide stable addresses, give mutation tools mode enums, cache recovery loop calls, make repeat calls safe, and reject unknown arguments strictly
Decoder
- MCP (Model Context Protocol): A protocol for exposing tools and functions that AI models can call to interact with external systems and data sources
- Activation sets: The subset of available tools that are surfaced to the model at any given time, keeping the visible tool list small while maintaining access to a larger set
- Breadcrumbs: Structured hints embedded in tool responses that guide the model toward the next appropriate tool call in a workflow chain
Original article
Lessons on Building MCP Servers
I've been building MCP servers for a while now–I wrote about the general approach last year, started out by creating umcp, and I've recently opened up an Office server that's been battered by enough models against enough real documents that the patterns have settled.
I'm still not a fan of MCP, but what follows is what I've learned about making tool chains actually work, condensed from swearing at logs rather than reading papers.
Disclaimer: This is a condensed version of CHAINING.md, which was itself stapled together from a bunch of notes in my Obsidian vault. The full version has more code examples and a techniques inventory table that Opus just _had_ to add, and I've since beaten that out of it and restored most of the original text (minus typos).
The short version: the MCP servers I design do most of the work, while the model walks breadcrumbs.
Models don't plan
They look at the conversation, scan the tool list, and grab whatever looks more probable. That's it. There is no hidden planner. If you want chains that finish somewhere sensible, the server has to make the next call blindingly obvious at every step.
After a year or so, I have pared my approach down to these three things, roughly in order of how much pain they save you:
- A small named core verb set covering most intents
- Output that suggests the next call
- An addressing scheme that survives between calls–anchors, IDs, paths, anything but line numbers.
Core verbs beat surface area
The Office server exposes over 100 tools. Its get_instructions() funnels models toward eight:
…start with office_help, then prefer office_read, office_inspect, office_patch, office_table, office_template, office_audit, and word_insert_at_anchor. Treat specialised tools as fallback, diagnostic, legacy-compatibility, or expert tools when the core flow is insufficient.
That single sentence does an outsized amount of work–it tells the model there is a recommended path, that the path is verb-shaped (help -> read -> inspect -> patch -> audit), and that everything else is opt-in.
Without it, models cheerfully reach for word_parse_sow_template when office_read would do, and you end up with five-call detours for one-call jobs.
So I quickly realized that I needed to be ruthless about which tools to surface and when. The specialised ones still ship–hidden under a "for experts" framing, and a handful of legacy ones filtered out of tools/list entirely.
I also make liberal use of activation sets–the surface the model sees is small; the surface it can reach is large.
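As a rough sketch of the funnel plus activation sets in plain Python (the verb list matches the quote above, but the tool table and filter are illustrative, not the real server's code):

```python
# Illustrative only: get_instructions() names the core verbs, and an
# activation-set filter keeps the visible tool list small.
CORE_VERBS = [
    "office_help", "office_read", "office_inspect", "office_patch",
    "office_table", "office_template", "office_audit", "word_insert_at_anchor",
]

ALL_TOOLS = {  # name -> description; the real server has over 100 of these
    "office_read": "Read a document as structured text.",
    "office_patch": "Apply edits at stable anchors.",
    "word_parse_sow_template": "Expert/legacy: parse one specific template layout.",
}

def get_instructions() -> str:
    core = ", ".join(CORE_VERBS[1:])
    return (
        f"Start with {CORE_VERBS[0]}, then prefer {core}. "
        "Treat specialised tools as fallback, diagnostic, or expert options."
    )

def list_tools(active: set[str] | None = None) -> dict[str, str]:
    """Surface only the activation set; everything else stays reachable but unlisted."""
    active = active or set(CORE_VERBS)
    return {name: desc for name, desc in ALL_TOOLS.items() if name in active}
```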
Naming is the chain
Again, models chain whatever is most likely (or rhymes), and the most effective tactic, for me, has been taking advantage of that.
All Word tools are word_*, all Excel excel_*, all unified office_*. A model that just called office_inspect will reach for office_patch next, not word_patch_with_track_changes, because the prefix matches.
This particular server also makes liberal use of annotations and a little intent/inferrer hack that reads those prefixes to assign readOnlyHint/destructiveHint automatically, so naming discipline turns into safety metadata for free.
The prefix is the plan. The verb is the step. If you take one thing from this entire post, I'd suggest this notion…
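A hedged sketch of what that inferrer can look like, deriving the MCP readOnlyHint/destructiveHint annotations from the name alone; the verb lists here are invented for illustration, not the server's actual rules:

```python
# Naming discipline turned into safety metadata: prefix + verb is all we look at.
READ_VERBS = {"read", "inspect", "help", "audit", "list", "get"}
DESTRUCTIVE_VERBS = {"delete", "replace", "overwrite"}

def infer_annotations(tool_name: str) -> dict:
    # "word_insert_at_anchor" -> surface "word", verb "insert"
    surface, _, rest = tool_name.partition("_")
    verb = rest.split("_", 1)[0] if rest else surface
    return {
        "readOnlyHint": verb in READ_VERBS,
        "destructiveHint": verb in DESTRUCTIVE_VERBS,
    }

infer_annotations("office_inspect")      # readOnlyHint: True
infer_annotations("excel_delete_sheet")  # destructiveHint: True
```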
Every response nominates the next call
This was the single change that made things behave on smaller models. The big ones will plan a chain from a tool list and a goal; the wee ones won't–they grab the first plausible tool and stop.
The fix is stupid simple: every response ends with a breadcrumb dictionary of hints to follow. At minimum next_tools: [...], plus usage: "<exact call>" whenever the current tool produced a value the next one needs.
A model that can't assemble arguments from a schema can copy the usage string verbatim. In fact, it will copy it, because that is still the most likely outcome as it fills in tokens, and thus those usage hints funnel the path the model takes.
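The shape is roughly this (a sketch: the anchor values and the exact usage string are invented, only the next_tools and usage fields mirror the real envelopes):

```python
def office_inspect(path: str) -> dict:
    anchors = ["h1:introduction", "h2:scope", "h2:deliverables"]  # pretend we parsed the file
    return {
        "status": "ok",
        "anchors": anchors,
        # Breadcrumbs: nominate the next call and give syntax the model can copy verbatim.
        "next_tools": ["office_patch", "word_insert_at_anchor"],
        "usage": f'office_patch(path="{path}", anchor="h2:scope", mode="dry_run", text="...")',
    }
```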
Discovery as a tool, not documentation
Another thing I hit upon was that signposting needed to be curated.
Borrowing a page from intent mapping, office_help(goal=...) returns a structured record–recommended chain with rationale, fallbacks, diagnostic strings to watch for, one imperative next_step sentence. Not prose. Not a README, not skills. Data the model can act on without reading comprehension.
Called with no arguments, it returns the catalogue. Called with an unknown goal, it returns the supported set rather than an error, which turns a potential workflow-stopping failure into an actually useful catalogue.
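Something like this, sketched in plain Python (the goal keys and playbooks are invented; the point is the structured record and the graceful fallback):

```python
PLAYBOOKS = {
    "edit_document": {
        "chain": ["office_read", "office_inspect", "office_patch", "office_audit"],
        "rationale": "Read, locate anchors, patch at anchors, then verify.",
        "fallbacks": ["word_insert_at_anchor"],
        "watch_for": ["unmatched_targets"],
        "next_step": "Call office_read on the target file.",
    },
}

def office_help(goal: str | None = None) -> dict:
    if goal is None:              # no arguments: return the catalogue
        return {"supported_goals": sorted(PLAYBOOKS)}
    if goal not in PLAYBOOKS:     # unknown goal: return the supported set, don't error
        return {"unknown_goal": goal, "supported_goals": sorted(PLAYBOOKS)}
    return PLAYBOOKS[goal]
```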
Addressing: anchors, not offsets
The biggest reason simple models can't follow chains is the model losing the thread between calls. "Insert a paragraph after the introduction" is fine in English but catastrophic if you expect it to remember a byte offset across three tool calls.
In this particular scenario, I cheated: since most Office documents have headings (or cells, or internal structured paths inside OOXML), I used either verbatim text from the document or immovable coordinates (which was particularly hard in PowerPoint, by the way).
So besides suggestions and hints, return identifiers your tools will later accept as input. If you find yourself returning data the model has to describe back to you in natural language, you've made a chain that will misfire on a Tuesday afternoon when you're not watching.
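In code terms, the pattern is just: return identifiers that round-trip. A minimal sketch (the tool names and anchor format are illustrative):

```python
def office_map(path: str) -> dict:
    # Pretend we walked the document and collected headings as stable anchors.
    return {
        "anchors": [
            {"id": "h1:introduction", "text": "Introduction"},
            {"id": "h2:deliverables", "text": "Deliverables"},
        ],
        "usage": f'word_insert_at_anchor(path="{path}", anchor="h2:deliverables", text="...")',
    }

def word_insert_at_anchor(path: str, anchor: str, text: str) -> dict:
    # The anchor string round-trips unchanged: no byte offsets, no paraphrasing.
    return {"status": "ok", "inserted_after": anchor, "next_tools": ["office_audit"]}
```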
Modes turn one tool into four
I started out with individual editing tools per format, which made automated testing easy but was incredibly wasteful of context. At some point I decided to make things much simpler for initial discovery, and since I needed to make all outputs auditable, I tagged the available sub-operations by risk.
office_patch is the same code path whether you ask for dry_run, best_effort, safe, or strict. One tool, four modes, one entry in tools/list.
Discovery cost scales with tool count, not mode count. And dry_run -> safe -> strict is an escalation chain the model figures out on its own without being told.
If you have N tools that differ only in how cautious they are, collapse them. You're wasting everyone's context budget.
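Collapsing into modes can be as dull as an enum parameter; a sketch, with the mode names from above and everything else invented:

```python
from enum import Enum

class PatchMode(str, Enum):
    DRY_RUN = "dry_run"
    BEST_EFFORT = "best_effort"
    SAFE = "safe"
    STRICT = "strict"

def office_patch(path: str, anchor: str, text: str,
                 mode: PatchMode = PatchMode.DRY_RUN) -> dict:
    change = {"path": path, "anchor": anchor, "text": text}
    if mode is PatchMode.DRY_RUN:   # plan only, nothing touched
        return {"status": "planned", "would_change": change, "next_tools": ["office_patch"]}
    if mode is PatchMode.STRICT and not anchor_exists(path, anchor):
        return {"status": "error", "unmatched_targets": [anchor], "next_tools": ["office_inspect"]}
    # best_effort and safe share the same code path, with different tolerance for misses
    return {"status": "applied", "changed": change, "next_tools": ["office_audit"]}

def anchor_exists(path: str, anchor: str) -> bool:
    return True  # stand-in for real document introspection
```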
Diagnostics as the back-edge
Linear chains are easy. Real chains have loops, and loops only happen when the server invites the model back in. Every mutating tool returns a standard envelope with status, matched_targets, unmatched_targets, and next_tools.
The model then branches on a small subset of options "locally" without needing to go over the entire context, and if you name the diagnostic fields with exact strings the model will see again in your instructions, it will just reinforce them.
In this particular case, again, I cheated. I figured out that models were starting to call tools at random because they couldn't introspect the document well enough, and they ended up breaking files. So I always gave them at least one read-only tool, and the penalty for "I'm confused, let me look again" becomes one extra round-trip, not a destructive cock-up.
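Put together, the envelope plus the read-only escape hatch look roughly like this (field names from above; the branching values are illustrative):

```python
def mutation_envelope(matched: list[str], unmatched: list[str]) -> dict:
    if unmatched:
        return {
            "status": "partial",
            "matched_targets": matched,
            "unmatched_targets": unmatched,
            # Back-edge: send the model to a read-only tool before it retries blindly.
            "next_tools": ["office_inspect"],
        }
    return {
        "status": "ok",
        "matched_targets": matched,
        "unmatched_targets": [],
        "next_tools": ["office_audit"],
    }
```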
My MCP Design Checklist
- Pick five to ten core verbs and name them in get_instructions() or your local equivalent
- Use consistent prefixes by surface
- Provide a discovery tool that returns recommendations as data, not prose
- Make the discovery tool browseable–no-arg returns the catalogue, unknown input returns the supported set
- Embed forward breadcrumbs in every tool response
- Provide a map/anchors tool so addresses survive between calls
- Give every mutating tool a mode enum including dry_run
- Return named diagnostic fields and cite the recovery tools
- Standardise the mutation envelope. If one tool changes something in a specific way, make sure the others are consistent (arguments, semantics, etc.)
- Reject unknown arguments strictly (this is much easier in some runtimes than others)
- Provide an audit tool so the model has somewhere to land
- Cache anything the recovery loop calls more than once, because, well, it will get called dozens of times even if you carefully curate paths through your tooling with hints.
- Make repeat calls safe–models retry, and they should be allowed to (idempotence is hard, and often impossible).
Do the boring work in the schema and the descriptions. The model will happily do the clever bit if you stop making it guess.