Building Context Management Systems for Enterprises

June 2026

This essay builds on the prologue, Everything Is Context, and is the first of the three connected parts that follow it. The next two are The Human Becomes the Goal Owner and A Digital Twin.

From Externalize-and-Distill to Govern-at-Scale

The prologue argued that context is the real operating surface of organizational life — and that when agents become colleagues, the old osmotic back-channel disappears. You must externalize what you used to let spread informally, and then distill it: raw capture is noise, not context. That argument stands on its own. This essay does not re-argue it.

What the prologue left open is the organizational question. Externalizing and distilling at an individual or team level is a practice. Governing that practice across a workforce of agents — so that context is tracked, versioned, auditable, and fresh — is an architecture problem. That is what this essay addresses.

The instinct is still to throw more context at the problem: larger context windows, vector databases indexing everything, memory modules bolted onto every interface. More context is not the answer when you lack context discipline. The question is not volume. It is knowing which information is relevant right now, who produced it, whether it is current, and whether it has been approved for use — and building the system that answers those questions reliably.

The Arc: From Prompts to Context to Harness

The AI engineering discipline has evolved through three distinct epochs, each addressing different bottlenecks. Understanding this evolution explains why context management systems are the necessary next layer — and why building them requires new architectural primitives, not just better prompts or smarter retrieval.

Prompt Engineering (2022–2023) focused on how to phrase requests. Chain-of-thought, few-shot examples, role prompts, output formatting. The assumption was that the bottleneck lived in word choice — phrase the question correctly, get the right answer. This worked for single-turn interactions. But production applications quickly hit limits: you cannot fit an entire codebase in a prompt, the model forgets everything between sessions, and single-shot execution has no iteration or recovery.

As Andrej Karpathy observed in June 2025: "People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

Context Engineering (2024–2025) shifted focus from the question to the information environment. What does the model see when it reasons? RAG, large context windows, persistent system prompts, memory modules, structured context formatting. Context engineering acknowledged that the model's answer depends on what it knows when answering. Better context assembly yielded better outputs.

But context engineering still assumed a single agent working in a single session on a single execution pass. It answered "what should the model know?" but not "who can modify what?", "who approves changes?", or "how do we verify multi-agent outputs against the original goals?"

The ceiling appeared when teams deployed multi-agent systems. Context pollution became a real problem — too much information caused models to lose track of what mattered. Stale context caused hallucinations about outdated information. Multiple agents writing without coordination created chaos. No audit trail existed for enterprise governance requirements.

Agent Harness (2025–2026) adds the orchestration layer that context engineering lacks. Anthropic's multi-agent research system uses "an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel." Cursor's self-driving codebases architecture introduced "a root planner [that] owns the entire scope of the user's instructions. It does no coding itself. Workers pick up tasks and are solely responsible for driving them to completion." These convergent patterns represent the field's recognition that coordination, boundaries, and governance are not optional extras — they are the foundation.

The harness epoch brings new primitives: multi-agent coordination, write boundaries controlling what each agent can modify, approval gates separating machine verification from human authorization, version control for audit trails, and isolation patterns preventing interference. Neither prompting nor retrieval can supply any of these on its own.

The Token Economics Problem

Tokens are the new oil, not because they are expensive, but because demand explodes as the price falls. This is the Jevons paradox applied to compute: every efficiency gain makes previously uneconomical use cases viable, so consumption rises faster than unit cost falls. Models get cheaper; the tasks we ask them to do get longer, more numerous, and more context-heavy. The corpus of distilled context each task must consume grows as the organization builds it. The likely economic result is that aggregate token demand outpaces infrastructure. I hold that as a reasoned thesis, not a certainty, because compute is also scaling and the race is genuinely open.

What is not open is the firm-level implication, and that is where the durable payoff lives. Your token budget is finite and rivalrous regardless of what happens globally. Every token your agents spend re-establishing context they should already have is a token not spent on actual work, while a competitor with better context discipline spends theirs on the work itself. The efficiency gap compounds. Agents, platforms, and models are advantages any organization can acquire. What cannot be bought is the experience you have lived, externalized, and distilled into usable context. That is your digitized, distilled experience, and it is the only part of the AI stack that is genuinely irreproducible. Raw data is commodity and noise (the prologue named this: raw capture is not context). The distilled corpus is the rare, defensible asset. Distilled intelligence drives token efficiency; token efficiency is the moat.

Two failure modes undermine that asset before it forms:

Ungoverned generation destroys signal and trust. When agents generate at volume without governance, the context corpus fills with contradictions, outdated assertions, and unverified outputs. The downstream effect is hallucination at scale — not a model failure, an architectural one. Teams that experience a high-profile hallucination incident do not conclude that the model is unreliable. They conclude that the system is unreliable. Adoption stalls. The distilled corpus never forms.

Volume is not value. To hit deployment metrics, teams burn tokens. Error rates climb with consumption. Without context discipline, increased spend yields diminishing returns, and there is no observability into which calls are high-value and which are wasted context.

The primitives in the rest of this essay exist to close exactly this gap: governed generation that builds the distilled corpus rather than eroding it.

Why Existing Tools Fall Short

Most enterprise AI tools remain anchored in the context engineering era. They optimize retrieval. They do not govern execution.

Two specific gaps:

No write governance. When multiple agents can modify the same artifact, you lose traceability. Debugging cascading errors requires reconstructing who wrote what and when — often impossible without explicit boundaries.

No verification hierarchy. Machine-level verification (did the code compile?) is not the same as business verification (should we ship this?). Without explicit gates separating mechanical checks from human approval, organizations either over-trust AI outputs or create review bottlenecks that negate the speed advantage.

Context management systems fill these gaps by treating context not as input to retrieval, but as a governed, versioned, auditable asset.

The Write-Boundary Principle

Picture a workspace where every agent can write anywhere it likes. A research agent drops its findings into a shared space; an implementation agent, needing a scratch file, creates one with a similar name beside it; the research agent's next update quietly overwrites the scratch file; the implementation agent's next read picks up the research content instead of its own. Nothing is broken. No agent has a bug. Each did exactly its job. But a deliverable now contains fragments of two unrelated pieces of work, and finding out why means reconstructing who touched which file, in what order, on the basis of what — an afternoon of forensic archaeology for an error that was never anyone's mistake.

The fix is not a smarter prompt or better retrieval. It is a boundary: give each agent a space only it can write to, and let work cross between agents only through an explicit handoff. Now every artifact has exactly one possible author. When something is wrong, you do not reconstruct a history — you read the address. The constraint sounds limiting and turns out to be liberating: it makes an entire class of failure structurally impossible.

Write boundaries buy clear authorship — every artifact has exactly one possible author, so attribution is instant. They buy isolated failures — an error in one agent's output stays there until explicitly promoted, and cannot cascade invisibly. They buy an audit trail by default — the location of an artifact documents who created it, at what stage. And they enable genuine collaboration through explicit handoff rather than shared mutable state: agents reference each other's work without modifying it, which is how the cleanest distributed systems have always worked.
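The boundary described above can be sketched in a few lines. This is a minimal in-memory illustration, not the system behind this series: the Workspace class, the convention that an agent owns the top-level folder named after it, and the hand_off method are all assumptions made for the example.

```python
class WriteBoundaryError(PermissionError):
    """Raised when an agent tries to write outside its own space."""


class Workspace:
    """In-memory workspace: each agent owns the top-level folder named after it."""

    def __init__(self, agents):
        self.agents = set(agents)
        self.files = {}      # path -> content
        self.handoffs = []   # (sender, receiver, path) references, never copies

    @staticmethod
    def owner_of(path):
        # The address is the authorship record: "research/findings.md" -> "research".
        return path.split("/", 1)[0]

    def write(self, agent, path, content):
        if agent not in self.agents:
            raise WriteBoundaryError(f"unknown agent: {agent}")
        if ".." in path.split("/") or self.owner_of(path) != agent:
            raise WriteBoundaryError(f"{agent} may not write to {path}")
        self.files[path] = content

    def read(self, path):
        # Reads are open; only writes are bounded.
        return self.files[path]

    def hand_off(self, sender, receiver, path):
        # Work crosses agents by reference; the receiver never edits the original.
        if self.owner_of(path) != sender or path not in self.files:
            raise WriteBoundaryError(f"{sender} cannot hand off {path}")
        self.handoffs.append((sender, receiver, path))


ws = Workspace(["research", "implementation"])
ws.write("research", "research/findings.md", "market notes")
ws.write("implementation", "implementation/findings.md", "scratch")
ws.hand_off("research", "implementation", "research/findings.md")

# The silent overwrite from the scenario above is now a loud, attributable error.
try:
    ws.write("research", "implementation/findings.md", "overwrite")
except WriteBoundaryError as err:
    blocked = str(err)
```

Two files with the same name can coexist because the owning folder, not the filename, carries the identity. A real deployment would enforce the same rule at the filesystem or repository layer rather than in application code.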

The Two-Phase Approval Principle

Machine verification confirms structural correctness. A human confirms business intent. Keeping these separate is the second load-bearing principle.

In practice this means: when a verification agent approves an artifact — confirms it meets the plan criteria, passes structural checks, contains no obvious errors — that is not a shipping decision. It is a proposal. The artifact moves to a staging state, visible and reviewable, but not yet authoritative. Only an authorized person can promote it to current status.

VERIFY-OK is not PASS. Machine verification catches structural problems. Human promotion catches business problems. Neither alone is sufficient. Together they create appropriate governance without making humans verify every intermediate step — agents move fast, verification is automated, but the final gate requires human judgment on decisions that should not be fully automated.

For enterprises this matters because compliance requires a traceable chain of custody: machine verified at this timestamp, human promoted at this timestamp. The proposed version is archived. Every transition is logged. Rollback is straightforward — revert to the previous version, investigate the proposed change, promote again when satisfied. Accountability is clear without creating a review bottleneck on every artifact.
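One way to picture the two gates is as a small state machine. The sketch below is illustrative only: the state names, the Artifact fields, the example checks, and the approver name "alice" are assumptions, and rollback is left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    PROPOSED = "proposed"  # machine verified (VERIFY-OK): staged, not authoritative
    CURRENT = "current"    # human promoted (PASS): authoritative


@dataclass
class Artifact:
    name: str
    content: str
    state: State = State.DRAFT
    log: list = field(default_factory=list)  # (timestamp, actor, event)

    def record(self, actor, event):
        stamp = datetime.now(timezone.utc).isoformat()
        self.log.append((stamp, actor, event))


def machine_verify(artifact, checks, verifier="verifier-agent"):
    """Phase one: structural checks. Success yields a proposal, not a release."""
    if artifact.state is not State.DRAFT:
        raise ValueError("only drafts can be verified")
    failed = [name for name, check in checks.items() if not check(artifact.content)]
    if failed:
        artifact.record(verifier, "VERIFY-FAIL: " + ", ".join(failed))
        return False
    artifact.state = State.PROPOSED
    artifact.record(verifier, "VERIFY-OK")
    return True


def human_promote(artifact, approver, authorized):
    """Phase two: only an authorized person can make a proposal current."""
    if approver not in authorized:
        raise PermissionError(f"{approver} is not authorized to promote")
    if artifact.state is not State.PROPOSED:
        raise ValueError("only machine-verified proposals can be promoted")
    artifact.state = State.CURRENT
    artifact.record(approver, "PASS")


page = Artifact("landing-page", "<h1>Start your free trial</h1>")
checks = {
    "non_empty": lambda text: bool(text.strip()),
    "has_heading": lambda text: "<h1>" in text,
}
machine_verify(page, checks)
state_after_verify = page.state  # staged, still not authoritative

try:
    human_promote(page, "verifier-agent", authorized={"alice"})
except PermissionError:
    pass  # an agent cannot promote its own proposal

human_promote(page, "alice", authorized={"alice"})
```

The log ends up holding one timestamped entry per transition, which is the chain of custody the paragraph above describes.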

The Goal-First Verification Principle

Without explicit goals, a verification agent can only check "did we build the thing right?" — not "did we build the right thing?" The gap between those two questions is where most enterprise AI failures hide.

The fix is to capture goals as verifiable criteria before any work begins, and verify against those criteria at every stage. A workspace goal captures the high-level objective. Cycle value questions specify what this particular effort must answer. Plan criteria translate goals into checkpoints a machine can check. Artifact verification measures outputs against those explicit criteria, not against vague intent.

This creates traceability in both directions. Forward: research focuses on the goals, plans reference the goals, implementation serves the goals, verification confirms the goals — the goals are the through-line that keeps multi-agent work coherent. Backward: you can always ask "why was this artifact approved?" and get a concrete answer pointing to specific criteria, not a reconstruction of what someone meant at the time.

It also makes verification mechanical rather than interpretive. The upfront investment in goal clarity pays off throughout the workflow. The alternative — letting intent drift through natural language handoffs — is how you get a landing page that passes every structural check and still misses the business objective it was built to serve.
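As a rough sketch of what "criteria a machine can check" might look like, each criterion below carries a reference back to the goal it serves, and verification returns evidence rather than a bare verdict. The Criterion fields, the example goal, and the string checks are hypothetical and far simpler than real plan criteria would be.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    id: str
    goal: str                     # the goal this checkpoint traces back to
    description: str
    check: Callable[[str], bool]  # mechanical test applied to the artifact


def verify(artifact_text, criteria):
    """Return (approved, evidence); evidence answers "why was this approved?"."""
    evidence = [
        {"criterion": c.id, "goal": c.goal, "passed": bool(c.check(artifact_text))}
        for c in criteria
    ]
    return all(item["passed"] for item in evidence), evidence


goal = "Drive trial signups"
criteria = [
    Criterion("C1", goal, "Page has a signup call to action",
              lambda text: "start free trial" in text.lower()),
    Criterion("C2", goal, "Page states the price",
              lambda text: "$" in text),
]

# Structurally fine, but it never asks the visitor to sign up.
page = "<h1>Welcome</h1><p>Plans from $10 per month.</p>"
approved, evidence = verify(page, criteria)
failed = [item["criterion"] for item in evidence if not item["passed"]]
```

Here the page is rejected because C1 fails, and the evidence list records which goal that criterion was protecting.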

Why Pivots Are Cheap

The prologue made a specific claim about re-pointability: for an agent, the document is the behavior. There is no private mental model to drift into, no habit overriding the source. Edit the canonical rules; the fleet re-reads on next run. What used to be a multi-quarter change-management program becomes editing a file.

I built and evolved the system underlying this series over 30 days, with the last 12 architectural pivots happening in a concentrated 48-hour sprint. A CTO reading that number could reasonably ask whether the architecture was stable at all. The right reading is the opposite: those 12 pivots are the proof of concept, not the apology for it.

Each pivot was a refinement to the canonical rules — write-boundary definitions, approval gate structure, verification criteria. Because the document is the behavior, each edit propagated instantly and fleet-wide. No retraining. No re-briefing. No rollout plan. Patching a new kind of colleague has never been this fast. The speed is not incidental; it is what the architecture is designed for.
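A toy version of "the document is the behavior" looks like this. It assumes the canonical rules live in a JSON file with a write_boundaries key, which is an invented format for illustration; the only point is that the rules are re-read on every run, so an edit takes effect without any redeploy.

```python
import json
import tempfile
from pathlib import Path


def can_write(rules_path, agent, path):
    """Re-read the canonical rules on every call; nothing is cached."""
    rules = json.loads(Path(rules_path).read_text(encoding="utf-8"))
    prefixes = rules["write_boundaries"].get(agent, [])
    return any(path.startswith(prefix) for prefix in prefixes)


with tempfile.TemporaryDirectory() as tmp:
    rules_file = Path(tmp) / "rules.json"
    rules_file.write_text(
        json.dumps({"write_boundaries": {"research": ["research/"]}}),
        encoding="utf-8",
    )
    before = can_write(rules_file, "research", "plans/draft.md")

    # The "pivot": edit the canonical file. The next run picks it up.
    rules_file.write_text(
        json.dumps({"write_boundaries": {"research": ["research/", "plans/"]}}),
        encoding="utf-8",
    )
    after = can_write(rules_file, "research", "plans/draft.md")
```

The same call returns a different answer before and after the edit, with no change to the code that runs the agent.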

Compare the counterfactual: a traditional system where each behavioral change requires code deployment, human retraining, or both. Twelve pivots in 48 hours would be impossible there. That is not because fewer problems would arise: complex adaptive systems surface edge cases at the same rate regardless of the stack. The difference is the cost of responding when they appear. Agentic systems built on governed context surface problems clearly (verification fails, state transitions fail, validation fails; each is a signal, not a surprise) and support correction immediately. The discipline is not planning every edge case in advance. It is building the system so that when the edge case arrives, the fix is a file edit.

What We Are Still Solving

Honest accounting requires acknowledging the open problems.

Token cost observability. Tracking what agents do is straightforward. Tracking precisely how many tokens each action costs — and which actions are high-value versus wasted context — remains difficult. Cost attribution is largely manual.
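One possible starting point, sketched under heavy assumptions, is a per-call ledger that tags each call with what eventually happened to its output. The Call fields and the "promoted" and "discarded" labels are hypothetical, and using promotion as a proxy for value is itself a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    agent: str
    action: str
    input_tokens: int
    output_tokens: int
    outcome: str  # hypothetical labels: "promoted" or "discarded"


def tokens_by_outcome(calls):
    """Sum tokens per (agent, outcome) so wasted context becomes visible."""
    totals = defaultdict(int)
    for call in calls:
        totals[(call.agent, call.outcome)] += call.input_tokens + call.output_tokens
    return dict(totals)


calls = [
    Call("research", "summarize", 1200, 300, "promoted"),
    Call("research", "summarize", 1500, 400, "discarded"),
    Call("planner", "draft-plan", 800, 200, "promoted"),
]
totals = tokens_by_outcome(calls)
wasted = sum(count for (_, outcome), count in totals.items() if outcome == "discarded")
```

The hard part this sketch skips is exactly the one named above: getting reliable token counts and outcome labels attached to every action without manual bookkeeping.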

Concurrent coordination. Independent parallel workstreams are well-supported by the patterns above. What is not yet solved is automatic cross-workstream conflict detection when two efforts touch the same artifact. That requires an additional coordination layer, and the right design is still an open question. A related version-awareness problem: when a shared artifact changes, dependent work is not updated automatically. The system surfaces the drift, and a human decides whether to update the dependents or accept it.
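Surfacing drift without propagating it can be as simple as comparing the version each dependent was built against with the version that is now current. The data shapes and artifact names below are assumptions for illustration; the function only reports, and the decision stays with a person.

```python
def find_drift(current_versions, dependencies):
    """Report dependents built against a version that is no longer current.

    current_versions: {artifact: version}
    dependencies: iterable of (dependent, artifact, version_seen)
    """
    return [
        {"dependent": dep, "artifact": art, "seen": seen,
         "current": current_versions[art]}
        for dep, art, seen in dependencies
        if current_versions[art] != seen
    ]


current_versions = {"pricing-research": 3, "brand-guide": 1}
dependencies = [
    ("landing-page-plan", "pricing-research", 2),  # built against an older version
    ("landing-page-plan", "brand-guide", 1),       # still up to date
]
drift = find_drift(current_versions, dependencies)
```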

Goal-first verification at scale. The principle is clear — capture goals as verifiable criteria before work begins. The harder implementation problem is maintaining that traceability as work spans many cycles and many agents, and ensuring that verification consistently checks what the goal actually requires rather than what is easiest to measure.

Research graph maturity. A knowledge graph tracking artifacts and their relationships works well. The harder problem — tracking patterns and insights across many cycles so that earlier research genuinely informs later work without context pollution — is early-stage.

These are the problems worth working on next. The patterns described above — write boundaries, two-phase approval, goal-first verification — are not a finished theory. They are the load-bearing primitives I found by running the system under real load. The open problems are what real load surfaces next.

The Principle Behind the System

Delegation should feel like talking to a colleague, not calling a function. When you give an agent a persistent identity and a clear scope — this agent researches, this agent plans, this agent verifies — the coordination becomes natural language rather than API design. Each agent knows what it owns, what it can touch, what requires a handoff, and who has authority over what it produces.

This is a design principle as much as a practical pattern. A system where every agent can do everything is a system where nothing has clear ownership, errors are untraceable, and governance is theater. A system where agents have clear scope, clear write boundaries, and clear approval relationships is a system that can actually be governed — and that can actually earn the trust that makes AI useful at enterprise scale.

Context management is not a feature you add. It is the discipline you build from the beginning. The forgetting problem, the token-burn anti-pattern, the cascade debugging failure, the goal-intent gap — all of them have the same root: a system that treats context as a retrieval target rather than a managed asset. The harness layer exists to close that gap. What you are reading was produced through one.

Next in the series: The Human Becomes the Goal Owner.