Agents can't choose between structure and flexibility (8 minute read)
The debate between using code or natural language to specify AI agent behavior is a false choice, as production systems require both structure and flexibility.
Deep dive
- Code-maximalism enforces reliability through deterministic workflows but fails to be agent-native because it strips out the reasoning capability that makes agents useful in the first place
- The runbook approach in AI SRE tools exemplifies code-maximalism's failure: agents execute predefined workflows reliably but become useless when alerts differ from expected patterns or infrastructure changes
- Code-maximalist approaches prevent agents from exploring multiple hypotheses in parallel, forcing them to follow the same single-path debugging humans would use instead of leveraging their computational advantages
- Encoded workflows don't evolve autonomously and provide no meaningful visibility into agent reasoning, only confirmation that predefined steps were executed
- Markdown-maximalism optimizes for flexibility but breaks down in production where engineering decisions require strict constraints around context management, model selection, cost control, and coordination
- AI slide generation tools illustrate Markdown-maximalism's failure mode: outputs are unpredictable and cannot be corrected at granular levels, forcing users to regenerate everything when small details are wrong
- Even sophisticated Markdown-maximalist approaches that use skills.md and agent loops end up requiring code harnesses for context management, model routing, and orchestration
- Hybrid architectures have emerged independently across serious agent implementations (Claude Code, RunLLM) because they're the only approach that supports what agents actually need to do
- The architectural work that matters is determining which parts of a system need reasoning flexibility versus which need deterministic enforcement and constraints
- Agent-native design requires agents to evaluate multiple hypotheses in parallel, provide visibility into their reasoning, adapt to system changes autonomously, and allow correction at appropriate granularity levels
- The Python versus Markdown debate is actually a symptom of the industry still treating agents as workflow automators rather than as systems capable of intelligent planning and execution
Decoder
- Code-maximalism: Using programming languages like Python to define strict, deterministic workflows that agents must follow step-by-step, prioritizing reliability over flexibility
- Markdown-maximalism: Using natural language instructions to describe goals and constraints, allowing agents to plan their own approach rather than following predefined steps
- Agent-native: Design approaches that leverage agents' unique capabilities (parallel hypothesis testing, reasoning, adaptation) rather than simply copying human workflows
- Runbook: A predefined set of procedures for handling specific scenarios, commonly used in operations and incident response
- Harness: The code infrastructure and tooling that manages agent execution, including context management, model routing, and orchestration
Original article
Agents can't choose between structure and flexibility
Why maximizing in either direction is a failure mode
I think it's safe to say that when the LLM hype cycle started a few years ago, no one expected one of the great debates of our time would be between Python and Markdown as agent specification languages. But here we are, and this has quickly turned into one of the most consequential architectural questions in AI.
Before we dive into the consequences of this debate, we'll take a moment to define our terms.
The Python camp uses code to express strict requirements for the steps an agent should take to accomplish a task. The Markdown camp uses English to express broad goals and constraints and lets the agent plan its way to the outcome. The tradeoffs are fairly straightforward. Code creates strong guardrails and reduces the chance that the agent's plan goes off the rails. Markdown gives powerful models the freedom to explore and adapts flexibly across tools and models, but it risks the agent doing something unexpected and undesirable.
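To make the contrast concrete, here's a minimal sketch of both camps specifying the same SRE task. Everything in it is illustrative: the helper names, the branch logic, and the skill file are hypothetical stand-ins, not any product's real API.

```python
# Hypothetical helpers standing in for real integrations.
def fetch_logs(service: str) -> str: ...
def restart_pods(service: str) -> str: ...
def escalate(service: str) -> str: ...

# Code camp: the steps and branches are fixed ahead of time.
def triage_alert(service: str) -> str:
    logs = fetch_logs(service) or ""     # step 1: always pull logs
    if "OOMKilled" in logs:              # step 2: branch on known patterns
        return restart_pods(service)     # step 3: predefined remediation
    return escalate(service)             # anything unrecognized gets punted

# Markdown camp: the same task as goals and constraints; the model plans the steps.
TRIAGE_SKILL = """
# Alert triage
Diagnose the alert and propose a remediation.
- Start from the alerting service and its recent deploys.
- Rule out infrastructure causes before application causes.
- State your hypothesis before taking any action.
"""
```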
Most of the debate treats this as a choice between two defensible positions. It isn't. Both maximalist positions are, in fact, failure modes, and the reason is the same: Neither one is actually agent-native. Agents, like humans, are increasingly being given complex tasks, and that requires the flexibility to choose the right tool for the right task (or subtask). Code-maximalism forces agents to follow deterministic workflows and strips out the reasoning that makes them useful. Markdown-maximalism abdicates control and produces systems you can't debug, correct, or improve. Picking a side is how you avoid the hard work of designing an agent.
We're publishing this as part of the Agent Native series because these two approaches increasingly define how agent interactions get built — and because both maximalist versions end up in the same place we wrote about last week: copy-pasting what a human would do, just in different syntax.
What code-maximalism gets wrong
The code-maximalist pitch is reliability. You tell the agent exactly what to do in specific cases, surface errors when things break, and get tightly scoped results. Given that LLMs make mistakes, misunderstand intent, and generally do all sorts of weird things, this sounds appealing in theory. Enforce correctness at the code layer. Don't trust the model to do the right thing.
We're intimately familiar with where this can go wrong in the AI SRE space. Almost every vendor tells customers they have to write runbooks. The product then encodes those runbooks as workflows and has the agent execute them in response to specific alerts. The results are trustworthy in the narrow sense: the agent does roughly what you expected. It's also useless the moment an alert looks different from anything that's come before or the underlying architecture changes. We started down this misguided path ourselves in the early days and quickly learned that it would rarely work in practice.
This approach fails to be agent-native in three ways. First, it copy-pastes what a human does. A human picks one hypothesis — the most likely based on experience — and runs it down. That works when the human is confident, but when the initial hypothesis is wrong, it creates a lot of wasted work. An agent doesn't have to fall into that trap. It can evaluate multiple hypotheses in parallel, and some will be dead ends, but the chance it lands on the right answer goes up dramatically. That's the architecture we've built RunLLM around, and it's consistently how we see real incidents get resolved.
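We can't speak for every implementation, but the shape of parallel hypothesis evaluation is easy to sketch. Everything below, including the hypothesis list and the fake investigation step, is illustrative:

```python
import asyncio

async def investigate(hypothesis: str) -> tuple[str, bool]:
    # Stand-in for an LLM-driven investigation with real tools; here we just
    # simulate work and pretend one hypothesis pans out.
    await asyncio.sleep(0.1)
    return hypothesis, hypothesis == "bad deploy"

async def triage(hypotheses: list[str]) -> list[str]:
    # Fan out over every hypothesis concurrently; dead ends are cheap,
    # and the winning path doesn't wait behind the losing ones.
    results = await asyncio.gather(*(investigate(h) for h in hypotheses))
    return [h for h, confirmed in results if confirmed]

print(asyncio.run(triage(["bad deploy", "node pressure", "upstream outage"])))
# -> ['bad deploy']
```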
Second, the runbook approach gives humans no meaningful visibility. SREs don't need to confirm that the agent executed Step 3 of the runbook. They need to know what the agent tried, what it ruled out, and why. A well-worn path automates some tedious work, but it doesn't let the human trust or learn from the agent's reasoning.
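One way to make that visibility concrete is a structured trace of hypotheses, evidence, and eliminations rather than a step counter. This schema and its sample data are a sketch, not any vendor's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class HypothesisTrace:
    hypothesis: str
    evidence: list[str] = field(default_factory=list)
    ruled_out: bool = False
    reason: str = ""

# What an SRE actually wants to review: what was tried, what was ruled out, why.
trace = [
    HypothesisTrace("bad deploy", ["rollout at 14:02", "error spike at 14:03"]),
    HypothesisTrace("node pressure", ["no evictions in the window"],
                    ruled_out=True, reason="memory headroom was normal"),
]
for t in trace:
    status = f"ruled out: {t.reason}" if t.ruled_out else "supported"
    print(f"{t.hypothesis} ({status}) - evidence: {'; '.join(t.evidence)}")
```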
Third, encoded workflows don't evolve – they lose the intelligence that agents promise. When the underlying system changes or requirements shift, every encoding has to change with it. There's no way for the agent to take feedback, understand that the expected behavior has changed, and adapt on its own without someone going back into the harness.
What Markdown-maximalism gets wrong
The Markdown-maximalist pitch is flexibility. Describe the goal, hand it to a capable model, let it figure things out. This is portable, expressive, and gets you something working quickly. Where creativity or open-ended problem-solving matters, it can be dramatically more useful than a fixed workflow.
The degenerate version of this is AI slide generation. We don't know the exact architecture behind these tools, but from the outside they read as "let the LLM do everything" applications — one prompt in, a whole slide deck out. The failure mode is familiar to anyone who's used one. Something is off. The layout is weird on slide 7, the chart doesn't match the claim, the flow of the argument is scrambled. You want to say: "On slide 7, make the flow vertical instead of horizontal and move the chart to the bottom." You usually can't get this to work the way you expect. There's no discrete layout logic to adjust, no separable step for chart placement, no addressable unit smaller than the whole generation. You re-prompt, get a new deck that's wrong in a different way, and start over.
It would be easy to write this off as a strawman. Serious Markdown-maximalists aren't arguing for one-shotting every single application. The sophisticated version of the position is skills.md plus a basic agent loop — rich context, thoughtful instructions, and a capable model reasoning its way through. Guide the agent through context, the argument goes, rather than constraining it with fine-grained LLM calls.
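That loop is simple enough to sketch. The client and tool dispatcher below are placeholders; any real implementation would wire in an actual chat-completion API and tool registry:

```python
from pathlib import Path

def llm(messages: list[dict]) -> dict:
    # Placeholder: call your chat-completion API here.
    return {"content": "done", "tool_call": None}

def run_tool(name: str, args: dict) -> str:
    # Placeholder: dispatch to a real tool registry here.
    return f"ran {name} with {args}"

def agent_loop(task: str, max_steps: int = 10) -> str:
    # skills.md does the guiding; the loop just relays tool calls.
    messages = [
        {"role": "system", "content": Path("skills.md").read_text()},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = llm(messages)
        if reply["tool_call"]:                   # the model decided to act
            name, args = reply["tool_call"]
            messages.append({"role": "tool", "content": run_tool(name, args)})
        else:                                    # the model decided it's done
            return reply["content"]
    return "max steps reached"
```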
Complex applications expose the gap. When you're grappling with reality, there are plenty of engineering decisions that still require strict constraints: context management and summarization, model selection, cost management, and cross-agent coordination, to name a few. In each of these cases, the challenge is not trusting the model to reason intelligently. It is building the tooling and infrastructure that allows a thoughtful model to execute these tasks efficiently and reliably.
In production, this results in a code harness that manages context, routes between models, orchestrates sub-agents, and handles the predictable places where pure prompting breaks down. That ends up being a hybrid architecture with markdown doing the guidance work and code doing the structural work — exactly the middle ground the two maximalist camps insist you have to choose between.
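In miniature, that harness looks something like the sketch below. The budget, model names, and eviction rule are assumptions chosen for illustration; the point is that these decisions live in code, not prose:

```python
MAX_CONTEXT_CHARS = 40_000  # assumed budget; enforced deterministically

def manage_context(messages: list[str]) -> list[str]:
    # Context management in code: drop the oldest non-system turns until the
    # conversation fits the budget. A real harness would summarize, not drop.
    while sum(len(m) for m in messages) > MAX_CONTEXT_CHARS and len(messages) > 2:
        del messages[1]
    return messages

def route_model(task_kind: str) -> str:
    # Model routing in code: a cheap model for mechanical work, a big model
    # for open-ended reasoning. The model names are made up.
    if task_kind in {"summarize", "extract"}:
        return "small-fast-model"
    return "large-reasoning-model"
```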
If you start with a Markdown-maximalist architecture, you're probably going to end up building plenty of narrow, harness-like capabilities – context management, model routing, etc. – to enforce constraints whether you like it or not. The question is just whether you design those hooks intentionally or let the code component grow organically. You should be intentional about the design.
The hybrid isn't a compromise
The teams building serious agents have, largely independently, landed in the same place: Markdown for intent and domain guidance, code for enforcement, tool execution, and anything that must not fail silently. Claude Code works this way. We built RunLLM this way.
It's tempting to read this as an unopinionated compromise. That's the wrong framing. The whole point of agents is that – unlike traditional software – they have an understanding of the problem to be solved and can use the right tools to get there. Code-maximalism compromises on planning, and Markdown-maximalism compromises on execution and learning.
Hybrid architectures are winning because they're the only architectures that support what agents are actually supposed to do. An agent needs reasoning flexibility to handle situations it hasn't seen before, and it needs deterministic guardrails so humans can trust it and intervene when needed. Neither extreme gives you both, which means neither maximalist position gives you a truly flexible agent. It gives you either a workflow with aspirations or a wish with nothing to execute it.
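A concrete example of a guardrail that must not fail silently: the agent reasons freely about what to do, but code enforces the action allowlist and the cost ceiling. The specific actions and limits here are invented for illustration:

```python
ALLOWED_ACTIONS = {"fetch_logs", "fetch_metrics", "restart_pods"}
MAX_COST_USD = 2.00  # invented ceiling

def execute(action: str, cost_so_far: float) -> None:
    # Enforcement lives in code: violations raise loudly instead of being
    # left to the model's judgment.
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"agent proposed unlisted action: {action}")
    if cost_so_far > MAX_COST_USD:
        raise RuntimeError("cost ceiling exceeded; escalating to a human")
    ...  # dispatch to the real tool here
```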
The architectural work is figuring out, for each part of your system, which layer it belongs to. What needs to be expressed as intent and reasoned about? What needs to be enforced and checked? Where does the agent need creativity, and where does it need constraints? This is the hard part, and it's the part that picking a side lets you avoid.
What agent-native actually requires
When you stop treating Python vs. Markdown as the debate, the architectural priorities come into focus. Can your agent evaluate multiple hypotheses in parallel, or does it march down one? Can a human see what the agent tried and why, or do they just get a final answer? Can the agent adapt when the underlying system changes, or does someone need to go edit the harness? Can a user correct the output at the level of granularity they care about, or is it all-or-nothing?
The maximalist debates are a symptom of an industry still thinking about agents as workflow automators — either very rigid ones, or very loose ones. The teams building agent-native products are past that argument, because they've figured out that the argument was never really about Python or Markdown. It was about whether you were willing to do the work to build something that actually behaves like an agent.