
Structured data is the foundation of trustworthy AI

There's a lot of noise about AI coding agents right now. Most of it focuses on the wrong thing: how fast they generate code. Speed is easy. What's hard is getting the code to be correct — meaning it does what you actually wanted, fits into the system you're building, and doesn't silently break something three layers away.

The reason AI agents produce unreliable output isn't intelligence. It's context.

When you give an agent a vague prompt, you get vague code. When you give it structured, traceable context — requirements with acceptance criteria, architecture specs with exact schemas and API contracts, work items that reference those specs — you get code that matches what you specified. The agent didn't get smarter. The input got better.

The pipeline

I've been calling this the Product Development Lifecycle (PDLC): Requirements, Architecture, Planning, Implementation, Testing, Feedback. Each phase produces structured artifacts that the next phase consumes.

The chain is the thing. Any break — a vague requirement, a missing architecture spec, a work item with no acceptance criteria — degrades everything downstream.

The discipline isn't "use AI." The discipline is completing each phase with real artifacts before moving to the next one. AI just happens to be very good at consuming the output when you do.

This isn't a new idea. Good engineering teams have always done this. The difference is that AI agents can consume structured artifacts at scale in a way that human developers never could. A developer skims a 40-page architecture doc. An agent reads every line and cross-references it with 32 other documents.

What makes the difference

I recently built a full-stack application using AI agents for the majority of the implementation — 75 work items across 7 phases. Three things separated useful output from garbage:

Specs with actual code examples, not descriptions. A spec that says "implement contact scoring" produces wildly different code than one that says "compute a 0-100 score using this weighted formula, store it in this schema with these exact field names, expose it via this endpoint returning this JSON shape." The second spec leaves almost no room for misinterpretation.
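To make the contrast concrete, here is a minimal sketch of what the second kind of spec pins down. The weights, field names, and response shape below are illustrative stand-ins, not the actual project's formula:

```python
# Hypothetical spec-level detail: exact formula, field names, and JSON shape.
# The weights and signal names are illustrative, not from any real spec.

CONTACT_SCORE_WEIGHTS = {
    "email_engagement": 0.4,       # each signal is a 0-100 sub-score
    "recency": 0.35,
    "profile_completeness": 0.25,  # weights sum to 1.0
}

def compute_contact_score(signals: dict) -> int:
    """Weighted sum of 0-100 sub-scores, clamped to the 0-100 range."""
    score = sum(CONTACT_SCORE_WEIGHTS[k] * signals.get(k, 0)
                for k in CONTACT_SCORE_WEIGHTS)
    return max(0, min(100, round(score)))

def contact_score_response(contact_id: str, signals: dict) -> dict:
    """The exact JSON shape the endpoint returns, as the spec would dictate."""
    return {"contact_id": contact_id, "score": compute_contact_score(signals)}
```

When the spec names the weights, the fields, and the response keys, there is exactly one correct implementation, and both the agent and the reviewer can check against it.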

Work items that reference their source. Every work item linked to the architecture spec and requirement it implemented. When the agent picked up a task, it didn't guess. It read the spec, found the exact definitions, and implemented them. When I reviewed the output, I checked it against the same spec. The spec is the shared contract between human and agent.
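A work item that carries its provenance can be a small structured record. The fields and IDs here are hypothetical, just to show the shape of the traceability:

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    """A task that points back at the spec and requirement it implements."""
    id: str
    title: str
    spec_refs: list[str]            # anchors into architecture specs
    requirement_refs: list[str]     # anchors into the requirements doc
    acceptance_criteria: list[str]

# Illustrative instance; paths and IDs are made up.
item = WorkItem(
    id="WI-042",
    title="Implement contact scoring endpoint",
    spec_refs=["architecture/contact-scoring.md#formula"],
    requirement_refs=["requirements.md#REQ-7"],
    acceptance_criteria=["GET /contacts/{id}/score returns a 0-100 integer"],
)

# Traceability check: no work item should ship without its sources.
assert item.spec_refs and item.requirement_refs
```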

Acceptance criteria that are testable. "The system should handle errors gracefully" produces vague implementations. "Login with wrong password returns 401 with error message, no token" produces tests that verify exactly that.
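That second criterion translates almost mechanically into a test. The `login` function below is a toy stand-in for a call to the real auth endpoint, not the project's implementation:

```python
# Toy stand-in for the auth endpoint: returns (status_code, body).
# The password and token values are illustrative only.
def login(username: str, password: str) -> tuple:
    if password == "correct-horse":
        return 200, {"token": "abc123"}
    return 401, {"error": "invalid credentials"}

def test_wrong_password_returns_401_no_token():
    """Verbatim translation of the acceptance criterion into assertions."""
    status, body = login("alice", "wrong")
    assert status == 401
    assert "error" in body
    assert "token" not in body

test_wrong_password_returns_401_no_token()
```

The point is not the test itself but that the criterion left nothing to interpret: status code, error message, and the absence of a token are all checkable.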

Where structured data isn't enough

Even with all this structure, things went wrong. The interesting part is where.

Code structure, API contracts, schema definitions, business logic — when the spec was specific, the agents got it right almost every time. But runtime infrastructure behavior was a different story. I had over 1,600 passing unit tests and multiple AI code reviewers approve the codebase. The app still had a critical bug where queue consumers were starving the rest of the app of a shared database connection. Login took 29 seconds. Every authenticated endpoint hung.

No spec would have prevented this. The spec correctly said "use a queue." The implementation correctly used a queue. The problem was an emergent interaction pattern that only surfaces under real conditions.

The fix? Running the actual system. An AI agent booted the API against real infrastructure, hit every endpoint, measured response times, flagged the 29-second login, traced the root cause through the logs, and fixed it in minutes. This is the testing phase of the PDLC doing its job — e2e and integration testing catch an entire class of bugs that unit tests and code reviews are structurally blind to.

What I'd recommend

If you're using AI agents for anything beyond single-file edits:

Write real specs before generating code. Not paragraphs of prose — structured documents with schema definitions, API contracts, and testable acceptance criteria. This is the highest-leverage activity in the entire process.

Make specs queryable. Your agents need to read specs programmatically. Copy-pasting into prompts doesn't scale past a handful of files.
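One low-tech way to make specs queryable is to index them by section, so an agent (or a tool it calls) can fetch exactly the part it needs. The file layout and heading convention here are assumptions, not a real tool:

```python
import pathlib

def index_specs(root: str) -> dict:
    """Map each '## heading' to its section body across all .md files under root."""
    sections = {}
    for path in pathlib.Path(root).rglob("*.md"):
        current, lines = None, []
        for line in path.read_text().splitlines():
            if line.startswith("## "):
                if current is not None:
                    sections[current] = "\n".join(lines)
                current, lines = line[3:].strip(), []
            elif current is not None:
                lines.append(line)
        if current is not None:
            sections[current] = "\n".join(lines)
    return sections
```

With an index like this, "read the contact scoring spec" becomes a lookup rather than a copy-paste, and it scales to as many documents as the project accumulates.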

Use planning phases for batches of related work. When you have interdependent features touching the same schemas, routes, and services, a planning phase catches conflicts and ordering issues that sequential implementation misses.

Run reviews with multiple AI models. Different models have different blind spots. Running the same review with 2-3 models in parallel catches things any single model misses.

Test against real infrastructure after every batch of changes. Unit tests mock the hard parts. Integration tests against your actual services catch the bugs that specs, reviews, and thousands of passing unit tests can't.
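The 29-second login above is exactly what a crude latency sweep catches. A sketch, where `call` stands in for a hypothetical HTTP request against the running API:

```python
import time

def flag_slow_endpoints(endpoints, call, budget_s=1.0):
    """Hit each endpoint via `call` and return (endpoint, seconds) pairs over budget."""
    slow = []
    for ep in endpoints:
        start = time.perf_counter()
        call(ep)  # stand-in for an HTTP request to the live system
        elapsed = time.perf_counter() - start
        if elapsed > budget_s:
            slow.append((ep, round(elapsed, 2)))
    return slow
```

A check this simple, run against real infrastructure after each batch, would have flagged the login hang on the first pass — no mocks involved, so no mocks to hide behind.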

The takeaway

The quality of AI-generated code is directly proportional to the quality of the structured context you give it. Vague prompts produce vague code. Structured specs with exact schemas, contracts, and testable criteria produce code that works.

This isn't about any specific tool or platform. It's about the discipline of writing down what you want before asking an agent to build it, and giving the agent a framework to operate within. The agents are good enough. The bottleneck is the structure around them.