I. The Activity That Has Quietly Changed
Picture the moment as it usually happens. An engineer is implementing a feature. The IDE has Copilot loaded, or Cursor, or Claude Code, or a custom in-house agent — the specific tool has churned every six months for three years and will churn again before this paragraph is finished. The engineer types a comment describing the intent. The model produces code. The engineer reads it, accepts most of it, edits a few lines, runs the tests, moves on. Repeated thousands of times per day across the industry, this exchange has quietly become the dominant unit of software production. Nobody at the team level describes it as an architectural shift. It feels like a productivity tool — like autocomplete with better suggestions.
This is the failure. Not the use of the tool — sometimes the tool is exactly right — but the category it has been sorted into. What was treated as a tooling improvement is, in fact, a relocation of where engineering labor takes place. The engineer is no longer producing code by translation from intent; the engineer is producing a context against which the model translates intent, and then verifying that the model's output is the code the context implied. The category of work has changed. The procurement framing — "we adopted AI coding assistants this quarter" — has not.
The argument of this essay is that this shift is real, it is happening invisibly, and the failure to recognize it as architectural — rather than as a tooling improvement — is the same category error described in two earlier essays on vendor and cloud lock-in. The cost of getting it wrong does not arrive as a single visible incident. It accumulates silently, in code review backlogs that grow faster than the team that maintains them, in production incidents whose root cause is a near-miss generation accepted under time pressure, in the slow erosion of architectural coherence across a codebase the model is rewriting one file at a time. By the time the cost is measurable, the decisions that produced it are years in the past.
The discipline this essay describes is context architecture: the deliberate engineering of the informational environment the model generates against, and of the verification regime that determines whether its output is accepted. The thesis is not that AI-assisted development is wrong. It is that the labor of AI-assisted development is not the labor most teams think it is, and that the structure of the work has migrated from writing code to specifying what code must be — on a schedule the practitioners have not yet noticed they are running.
II. The Anatomy of Context
To see context architecture clearly, it helps to stop thinking of "context" as a single thing — the prompt, the chat window — and instead recognize it as a layered structure that compounds. Every generation event runs against at least five distinct layers of context, each of which determines what the model can know, what it can do, and how its output gets evaluated. Each layer is harder to see than the previous one. Each layer is where a different failure mode lives.
The first is prompt context: the explicit text the engineer sends. The comment, the chat message, the instruction. This is the visible layer — the focus of the "prompt engineering" discourse, the subject of every "ten tips for better prompts" article, the surface where individual users feel they have leverage. It is also the smallest of the five problems. A perfectly crafted prompt against a deficient context fabric will still produce inconsistent code; a barely adequate prompt against a rich, structured context will often produce correct code on the first attempt. Disproportionate attention to this layer is the most common architectural mistake in current AI-assisted-engineering culture.
The second is code context: what the model can see of the existing codebase at generation time. The file open in the editor. The imports of that file. The surrounding modules. Recently edited files. Whatever the IDE's retrieval system has decided is relevant to the current request. Modern AI-coding tools differ enormously in what they include here and how they rank it — Cursor's retrieval is not Copilot's retrieval is not Claude Code's retrieval. The same model, asked the same question, in two different IDEs, with two different repository states, produces categorically different output, because the retrieved context is different. The engineer feels as if they are using one tool. They are, in fact, using whichever combination of retrieval, ranking, and context-window-packing the IDE has decided is appropriate for that moment.
The third is specification context: the architecture documents, interface contracts, dependency rules, style guides, naming conventions, and constraint files that the team has authored and made available to the model. This is the layer this essay's thesis most directly addresses, because it is where the engineer's leverage lives — and it is the layer most teams have not built. Specification context is not the prompt and not the codebase; it is the body of intent the team has written down in machine-readable form. A team with no specification context is a team whose entire architectural rationale exists only in the heads of senior engineers, invisible to every generation event.
The fourth is tool and environment context: what the model can do at generation time, not just what it can read. The ability to grep the codebase, run a test, query a build system, read a database schema, search the web, execute a shell command. Tool access defines the model's effective competence in a way the model's parameter count alone does not. A model with grep and a working test runner is a categorically different system from the same model with only the open file. The cloud lock-in essay made a parallel argument about cloud-service binding; here the equivalent is binding to a specific tool-access surface, with its own deprecation risks and its own architectural commitments.
The fifth is verification context: the regime that determines whether the model's output is accepted into the codebase. Tests, type checking, structural lints, CI gates, code review, security-pattern scanners, dependency validators. This is the layer most teams have not updated for AI-assisted work — most code review workflows were designed for human authorship, where the author's understanding of intent could be assumed. AI authorship has no understanding of intent, only probability, and the verification regime that worked when a human typed each character is not the verification regime that works when a model emits a thousand lines in eight seconds. Context failure most often becomes visible at this layer, because this is where wrong output meets the gate that should have caught it.
None of these layers is visible at the moment the engineer types a comment and accepts a suggestion. All of them compound silently. The output of any generation event is a function of all five layers jointly — and the quality of the output is a system property of the joint context, not of any one layer. This is the architectural shape of AI-assisted development: a series of context decisions, each individually small, each individually defensible, accumulating into a system property — output quality — that nobody designed and nobody owns.
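To make the layering concrete, here is a minimal sketch of a deliberate context assembler. Everything in it is an assumption for illustration: the specification file names, the packing order, the character budget. Real tools implement retrieval and packing very differently. The structural point survives the simplification: specification context is given priority over retrieved code, so that written-down structure is never the layer sacrificed when the window runs out.

```python
from pathlib import Path

# Hypothetical specification artifacts; substitute whatever your team maintains.
SPEC_FILES = ["docs/architecture.md", "docs/dependency-rules.md", "CONVENTIONS.md"]

def assemble_context(prompt: str, retrieved: list[str], repo: Path,
                     budget_chars: int = 60_000) -> str:
    """Pack the layers in priority order: specification and task are kept
    intact; retrieved code spends whatever budget remains."""
    spec = [f"# SPEC: {rel}\n{(repo / rel).read_text()}"
            for rel in SPEC_FILES if (repo / rel).exists()]
    task = f"# TASK\n{prompt}"
    remaining = budget_chars - len("\n\n".join(spec + [task]))
    code: list[str] = []
    for rel in retrieved:                    # assumed pre-ranked, best first
        path = repo / rel
        if not path.exists():
            continue
        section = f"# FILE: {rel}\n{path.read_text()}"
        if len(section) + 2 > remaining:
            break                            # drop lower-ranked files whole
        code.append(section)
        remaining -= len(section) + 2
    return "\n\n".join(spec + code + [task])
```

The design choice worth noticing is in the budget handling: when context is scarce, the lowest-ranked code files are dropped whole, and the specification and the task are never trimmed.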
III. Why "Better Prompts" Is the Wrong Frame
The dominant decision criterion in current AI-assisted-engineering culture is better prompts. The engineer learns to write more specific instructions, includes more examples, structures requests with explicit reasoning chains, adopts the latest "prompt-engineering" technique. The supporting industry — courses, books, consulting, certifications — is built almost entirely on this criterion. The criterion is correct at the level of the individual prompt and incomplete at the level of the system.
The incompleteness shows up as follows. Better prompts, applied prompt by prompt across a project's lifetime, produce code whose aggregate properties — naming consistency, architectural coherence, dependency hygiene, error-handling discipline — are still emergent from each prompt's local context. Each generation event was locally optimal. The aggregate codebase is globally inconsistent, because the property that determines consistency does not live in the prompt at all; it lives in the specification context the prompts execute against. A series of brilliant prompts against zero shared context still produces a codebase that looks like it was written by a committee of strangers.
This is a classic systems-thinking failure: the assumption that good local decisions compose into a good global outcome, when in fact the system property that matters — in this case, architectural coherence — is emergent and is precisely the property the prompt-engineering discourse does not address. Donella Meadows formalized this distinction in Thinking in Systems: a system's behavior is determined by its structure, not by the quality of its individual components, and an obsession with component-level optimization while ignoring structural properties produces predictable failure modes [1]. The earlier essays on vendor and cloud lock-in described the same shape; the structural property at stake there was optionality. The property at stake here is specification leverage: how much of the team's intent is encoded in artifacts the model can read at every generation event. Specification leverage is preserved by the discipline of writing structure down once and injecting it into the context fabric everywhere. It is consumed by the discipline of writing structure down zero times and re-inferring it from prompt to prompt.
The historical analogy Edsger Dijkstra drew in 1972 still applies, if anything more sharply. The move from assembly language to compiled high-level languages, Dijkstra observed, shifted the programmer's task from instructing the machine in detail to specifying what operations to perform [2]. The current shift is one further turn of the same screw. The programmer's task is shifting from specifying the operations to specifying the constraints the operations must satisfy — and from manually verifying the produced operations to engineering the verification regime that does the checking. The work that compilers absorbed in the 1950s does not return; the work that generative models are absorbing now will not return either. The question is what work the engineer is doing instead.
One critical distinction separates compilers from generative models. Compilers are deterministic: input A always yields output B; the transformation is rigid and provable. Generative models are probabilistic: the model does not compile intent, it infers it from statistical likelihood; input A may yield output B, B', or C depending on the stochastic state, the retrieved context, the temperature, the model version, the time of day. The discipline that produced reliable compiled output was a discipline of correctness proofs and type systems. The discipline that produces reliable generated output is a discipline of specification and verification — a soft type system imposed externally, by the engineer, on a probabilistic generator.
The architecturally correct frame, then, is not "better prompts." It is better context, designed jointly across all the surfaces from which the model retrieves it, with a verification regime calibrated for stochastic output. Most teams do not work this way. Most teams cannot, because the prompt-engineering framing surfaces neither the context fabric nor the verification regime — and the procurement decision to adopt an AI coding tool came with no architectural deliberation about either.
IV. Where Context Failure Comes Due: Three Patterns
The argument above is theoretical. The last few years have produced a sequence of concrete failure modes that make the same argument empirically. Three patterns are worth examining in detail, because each illustrates a different way context-poor generation fails in production, and because together they cover the surface where context architecture would have made the difference.
Hallucinated dependencies and slopsquatting. Large language models trained on a snapshot of public package registries occasionally produce code that imports packages that do not exist. The model synthesizes a plausible-sounding name from neighboring libraries, confident in its phrasing, with no mechanism to check whether the package is real. In early 2024, Bar Lanyado at Lasso Security noticed that AI models were repeatedly hallucinating a Python package called huggingface-cli. Lanyado uploaded an empty package under that name to PyPI as a research experiment to see what would happen. The package received more than 30,000 authentic downloads in three months. Alibaba had copy-pasted the hallucinated install command directly into the README of one of their public repositories [3].
The pattern earned the name slopsquatting by 2025: AI's version of typosquatting, where the attacker bets on hallucinations rather than typos. A study presented at USENIX Security in 2025 tested 16 models across roughly 576,000 code samples and found that hallucinated package names follow predictable structural patterns — 38 percent are conflations of two real names, 13 percent are typo-like variants, 51 percent are pure fabrications — and that 43 percent of hallucinated packages recur across multiple queries, which is exactly the property that makes them squattable [4]. By late 2025 the attack had moved from research demonstration to documented production incident, with hallucinated package names appearing inside published collections of AI-generated agent skills [5].
Architectural lesson: this is a context failure at every layer. The model did not know the live package registry at the moment of generation, because the prompt context did not include it, the code context did not include it, the specification context did not include "validate dependencies against the actual registry," and the verification regime did not catch a hallucinated import because the hallucinated package was already registered on PyPI by the time CI ran. The fix is not "a better model" — the next model will hallucinate different package names. The fix is a context fabric that makes registry validation part of the generation event and a verification regime that treats imports as security-relevant artifacts subject to allowlisting. Every layer of context architecture is load-bearing here.
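The verification end of that fix is small enough to sketch. The script below, assuming a Python codebase, a team-maintained deps-allowlist.txt, and invocation with the changed files as arguments (all assumed conventions, not standards), rejects any import that is not on the written allowlist and reports whether the unknown name is even registered, the signal that separates a missing allowlist entry from a likely hallucination.

```python
"""Pre-merge import gate: a minimal sketch. The allowlist file name and the
invocation (pass changed .py files as arguments) are assumed conventions."""
import ast
import sys
import urllib.error
import urllib.request
from pathlib import Path

ALLOWLIST = {
    line.strip()
    for line in Path("deps-allowlist.txt").read_text().splitlines()
    if line.strip() and not line.startswith("#")
}

def third_party_imports(source: str) -> set[str]:
    """Top-level imported package names, minus the standard library."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names - set(sys.stdlib_module_names)  # Python 3.10+

def registered_on_pypi(name: str) -> bool:
    """PyPI's JSON API answers 404 for packages that do not exist."""
    try:
        urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json", timeout=10)
        return True
    except urllib.error.HTTPError:
        return False

failed = False
for arg in sys.argv[1:]:
    for name in sorted(third_party_imports(Path(arg).read_text())):
        if name in ALLOWLIST:
            continue
        verdict = ("registered on PyPI but not allowlisted; review before adding"
                   if registered_on_pypi(name)
                   else "NOT on PyPI: hallucinated or internal")
        print(f"{arg}: unlisted import '{name}' ({verdict})")
        failed = True
sys.exit(1 if failed else 0)
```

One wrinkle the sketch ignores: import names and PyPI project names do not always match (import yaml, package pyyaml), so a production gate needs a mapping between the two. Note also that a registered name is not a safe name; slopsquatting works precisely because the attacker registers the hallucination first, which is why the allowlist, not the registry check, is the actual gate.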
The measured productivity gap. In July 2025, METR (Model Evaluation and Threat Research) published the results of a randomized controlled trial of experienced open-source developers' productivity with and without AI assistance. The methodology was unusually tight for a productivity study: 16 developers, 246 tasks, mature projects on which the participating developers had an average of five years of prior experience, tasks randomly assigned to AI-assisted or unassisted conditions, and frontier AI tools at the early-2025 capability level (primarily Cursor Pro with Claude 3.5 and 3.7 Sonnet). Pre-study, the developers expected AI tools to make them roughly 24 percent faster. The measured outcome moved in the opposite direction: developers using AI tools took roughly 19 percent longer to complete tasks. The post-task perception was perhaps the most striking finding: even after experiencing the slowdown, the developers still believed AI had sped them up by about 20 percent [6].
The METR result is not a verdict against AI-assisted development; the same researchers updated their experimental design in early 2026 to capture the rapidly evolving capability frontier [7]. What the result captures, sharply, is the cost of using AI without adequate context. The population studied — experienced developers on familiar codebases — is precisely the population with the highest specification leverage to offer the model and the highest verification overhead when the model produces near-miss output. AI without context produces output that requires more review than writing the code by hand would have. The 19-percent slowdown is the verification tax made visible.
Architectural lesson: the productivity claim and the productivity reality diverge most sharply where context infrastructure is weakest. The discipline of context architecture exists to close this gap — not by making AI tools work better intrinsically, but by reducing the verification tax through structural constraint. A codebase with rich specification context lets the model produce output that lands closer to the intended target; a codebase without it forces the model to invent, and forces the engineer to verify the invention. The METR study measured the cost of operating in the second regime.
Structural drift across sessions. The third pattern is the one most teams will recognize after running AI-assisted development for a year. The same coding task, asked of the same model on different days, in different IDE sessions, with non-deterministic sampling and varying retrieved context, produces structurally different code each time. Naming conventions drift. Error-handling patterns differ between files. Service boundaries get redrawn. The shape of test fixtures changes. A REST controller written on Monday looks subtly different from one written by the same engineer with the same tool on Friday, because the implicit context was different — what files were open, what the retrieval ranked highly, what the model happened to remember from training about idiomatic patterns in that framework.
This failure mode is harder to anchor in a single named incident because it is not a single event but a slow accumulation, like the vendor lock-in pattern. Over a project's lifetime, the result is a codebase that is locally well-formed but globally inconsistent — files that look as if they were written by different teams because, in a real sense, they were generated against different implicit contexts. The research literature on generation consistency under sampling has measured the variance directly, and studies of explicit constraint files show that externally imposed specification dramatically narrows the distribution of generated outputs. The signal is clear and reproducible. Where structure is written down, the model converges on it. Where it is not, the model invents.
Architectural lesson: structure that is not written down does not exist as far as the model is concerned. The tacit knowledge senior engineers carry — "how we do things here," "what we don't do," "what shape an X normally takes in this codebase" — is invisible to every generation event unless it has been externalized into a context artifact. The remedy is not better senior engineers and not better models. It is the discipline of writing the implicit architecture down in a form the model can read, on the same schedule the code is being written.
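What writing the implicit architecture down can look like is often embarrassingly small. The sketch below encodes a single dependency rule as data and enforces it mechanically; the layer names are hypothetical, and a real team would load the rules from a versioned specification file rather than hard-code them.

```python
"""One written-down rule, mechanically enforced: code under services/ must
not import from api/. Layer names are illustrative; encode your own rules."""
import ast
import sys
from pathlib import Path

# The externalized convention: package -> packages it may not import from.
FORBIDDEN = {"services": {"api"}}

def violations(path: Path) -> list[str]:
    layer = path.parts[0] if path.parts else ""
    banned = FORBIDDEN.get(layer, set())
    found: list[str] = []
    for node in ast.walk(ast.parse(path.read_text())):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            if module.split(".")[0] in banned:
                found.append(f"{path}:{node.lineno}: {layer}/ may not import {module}")
    return found

errors = [v for p in map(Path, sys.argv[1:]) if p.suffix == ".py"
          for v in violations(p)]
print("\n".join(errors))
sys.exit(1 if errors else 0)
```

The same rule, placed in a specification file the model reads at generation time, narrows what gets generated in the first place; enforced here, it catches whatever slips through anyway. Context architecture maintains both ends.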
Three different failure modes — hallucinated dependencies, productivity tax, structural drift — converge on a single pattern. Each is the consequence of context that was insufficient for the generation event. The teams that pay the cost of bounded context in advance — through written specifications, through verification regimes calibrated for stochastic output, through tool-access discipline — pay a smaller bill in production. The teams that optimize for prompt convenience pay the larger bill, distributed across review backlogs, security incidents, and the slow erosion of architectural coherence. The mechanism by which the bill arrives is different in each case. The architecture that bounds it is the same.
V. AI Is Not Wrong; The Question Is the Tradeoff
The argument so far would be incomplete without honest treatment of the counter-position, because the counter-position has real force.
AI-assisted development is not wrong. For some workloads, minimal context and direct prompting is the architecturally correct choice — exploratory scripting, throwaway prototypes, one-off data extractions, learning the shape of an unfamiliar framework, generating boilerplate where structure is genuinely irrelevant because the code will not survive to be inconsistent with anything. A team that erected an elaborate context fabric for every shell script its engineers wrote would be paying real costs in specification overhead to constrain a generation event whose output is not load-bearing. The argument of this essay is not that every team needs heavy context architecture. It is that the category of the commitment should be recognized: the team is choosing, implicitly or explicitly, how much specification leverage to apply, and that choice should be made on the merits of the workload rather than as an inherited default.
There is also an honest middle ground that the dichotomy can obscure. Heavy context architecture — extensive specification files, custom retrieval pipelines, dedicated verification regimes, structural lints calibrated for AI-generated patterns — is appropriate for large codebases, long-lived systems, regulated environments, and high-coordination teams. Light context architecture — a single style guide, default IDE retrieval, standard test gating, a CODEOWNERS file the AI is expected to respect — is appropriate for fast-moving small-team work where the marginal value of additional structure is genuinely low. The two are not opposites; they are points on a gradient that should be chosen for the workload rather than imposed uniformly across all of an organization's engineering surface.
The boundary between the two is not always obvious at the moment of decision. A throwaway script becomes the foundation of a long-lived service. A prototype goes to production because the team needed something on Monday and the prototype worked. A small codebase grows large. The right discipline is not "always heavy context" but "explicit recognition of which regime the workload is in, and a written rationale for the choice." A team that can articulate, for every category of AI-assisted work it does, what context is required and what verification will catch failures when that context is missing, and that has actually written it down, is doing context architecture. A team that cannot is treating generative AI as autocomplete and discovering the difference at incident-response time.
The discipline this demands is not absolutism about specification. It is honesty about how much of the system's correctness depends on the model's output, on what time horizon, and at what review cost. Most teams have not had that conversation explicitly. Most teams could have it in an afternoon, and would benefit from the conversation more than from any single tool upgrade.
VI. The Discipline of Context Architecture
The constructive form of the argument is that context architecture is a discipline with identifiable practices. None of these practices is novel; what is uncommon is treating them as part of the architecture rather than as advice consulted after a problem has already occurred.
Treat the context fabric as architecture, not as prompt engineering. Specification files, retrieval rules, tool-access policies, and verification regimes are first-class architectural artifacts. They live in the repository, they are versioned alongside the code, they have named owners, they are reviewed when they change. A team whose AI-coding configuration lives only in IDE settings on individual laptops has no context fabric; it has a collection of personal preferences that produce inconsistent generation by design. The first practice is recognizing this layer as something that exists.
Externalize the structure the model needs to know. Architecture documents, interface contracts, dependency rules, style conventions, naming patterns, allowed and disallowed library lists — all of these must be in writing and in the model's accessible context. Tacit knowledge that lives only in senior engineers' heads is invisible to the model and will be silently violated. Externalization is uncomfortable work because it forces explicit articulation of decisions that were comfortable when they were implicit. That discomfort is the work. A team that cannot write down its architectural conventions does not actually have them; it has habits that look like conventions until someone new joins, or until the model joins.
Update the verification regime for stochastic output. Code review designed for human authorship assumes the author understood the intent and made errors only in execution; the reviewer's job is to catch the execution errors. AI authorship has no intent, only probability. The reviewer's job shifts from "did the author make a mistake" to "does the generated artifact satisfy the constraints we specified." Tests, type checks, structural lints, dependency-validation hooks, security-pattern detectors, and semantic-equivalence checks become load-bearing. The verification regime is the layer at which context failure becomes visible; investing in it is the highest-leverage architectural move available to a team operating in this regime.
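As a sketch of what load-bearing means operationally: the gate below composes the checks and fails closed. Every command in the list is a placeholder for whatever the repository actually runs, and check_imports.py stands for the hypothetical import gate sketched earlier.

```python
"""Fail-closed verification gate: run every check, report every failure,
accept only if all pass. The commands are placeholders for your own stack."""
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],               # structural and style lints
    ["mypy", "--strict", "src"],          # the externally imposed type discipline
    ["python", "check_imports.py"],       # hypothetical dependency gate (see above)
    ["pytest", "-q"],                     # behavior against the specification
]

failures = []
for command in CHECKS:
    if subprocess.run(command).returncode != 0:
        failures.append(" ".join(command))

if failures:
    print("verification gate rejected the change:")
    for command in failures:
        print(f"  {command}")
sys.exit(1 if failures else 0)
```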
Bound the tool surface the model operates against. What the model can read, write, execute, and reach defines what failure modes are reachable. Tool access is a security and architecture decision, not a productivity feature. Default-everything access — the agent can read the whole codebase, write any file, run any command, reach any service — is the equivalent of giving root to every contractor. The discipline is the same as for human contractors: least privilege, audit logging, blast-radius containment, and the explicit recognition that what an agent can do is the upper bound on what an attacker who hijacks the agent can do.
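A least-privilege tool surface can be expressed as an explicit default-deny policy. In the sketch below, the tool names, writable roots, and logging setup are all assumptions; the shape, an allowlist consulted on every call with an audit record either way, is the point.

```python
"""Default-deny tool policy sketch. Tool names and writable roots are
hypothetical; the structure (allowlist plus audit log) is what matters."""
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("agent.tool-calls")

ALLOWED_TOOLS = {"read_file", "write_file", "run_tests"}   # no shell, no network
WRITABLE_ROOTS = (Path("src/billing"), Path("tests/billing"))

def authorize(tool: str, target: Path) -> bool:
    """Gate every tool call; log the decision so incidents are reconstructable."""
    allowed = tool in ALLOWED_TOOLS
    if allowed and tool == "write_file":
        resolved = target.resolve()
        allowed = any(resolved.is_relative_to(root.resolve())
                      for root in WRITABLE_ROOTS)           # Python 3.9+
    audit.info("tool=%s target=%s allowed=%s", tool, target, allowed)
    return allowed

assert authorize("read_file", Path("src/billing/invoice.py"))
assert not authorize("run_shell", Path("."))                # not on the allowlist
assert not authorize("write_file", Path("/etc/passwd"))     # outside writable roots
```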
Recognize and break the lethal trifecta. Simon Willison's framing of the security boundary in agentic AI names three components: access to private data, exposure to untrusted content, and the ability to externally communicate in a way that could exfiltrate the data [8]. When a generation context combines all three, prompt injection becomes an arbitrary-data-exfiltration vector, because attacker-controlled text in any input the agent processes can instruct the agent to read the private data and send it elsewhere. The architectural response is to break the trifecta — remove one of the three legs from any agent that processes untrusted input — rather than to attempt to detect malicious prompts at the prompt layer. Exfiltration capability is usually the easiest leg to remove.
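The policy is mechanical enough to automate. A deploy-time check of the following shape, with every field name a hypothetical, refuses any agent configuration that keeps all three legs:

```python
"""Refuse to run any agent whose capabilities combine all three legs of the
lethal trifecta. Field names are illustrative assumptions."""
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool          # source code, tickets, credentials
    processes_untrusted_input: bool   # web pages, inbound email, public issues
    external_communication: bool      # HTTP egress, email, webhooks

def check_trifecta(caps: AgentCapabilities) -> None:
    if (caps.reads_private_data
            and caps.processes_untrusted_input
            and caps.external_communication):
        raise ValueError("lethal trifecta: remove one leg before deploying; "
                         "exfiltration capability is usually the cheapest to drop")

check_trifecta(AgentCapabilities(True, True, False))   # accepted: no egress
try:
    check_trifecta(AgentCapabilities(True, True, True))
except ValueError as err:
    print(f"rejected: {err}")
```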
Calibrate context investment to workload longevity. Exploratory scripts do not need an elaborate context fabric. Long-lived systems do. The architectural mistake is not "we did not build a context fabric"; it is "we applied the same context regime everywhere, regardless of what each workload required." The discipline is recognizing which regime each piece of work belongs in, choosing accordingly, and revisiting the choice when workloads cross the boundary between regimes — which they will, more often than most teams realize, because most workloads that survive turn out to have been long-lived from the moment they survived their first quarter.
Document what context produced what output. When a generation event matters — production code, infrastructure changes, security-relevant logic, anything that will outlive the moment of its creation — the context that produced it should be reproducible. The artifact, the model, the temperature, the retrieved files, the tool calls, the verification trace. Without this, post-incident analysis is impossible: an investigator looking at why a piece of code does what it does has no way to reconstruct the conditions that produced it. The discipline here is the engineering equivalent of citing one's sources. It is not currently industry practice; it is the practice that will distinguish engineering from prompt-driven gambling over the next decade.
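One possible shape for such a record is sketched below: a small structure serialized as JSON and committed beside the change it documents. Every field name is an assumed convention; the discipline is capturing enough to reconstruct the generation event.

```python
"""Sketch of a generation provenance record. All field names are assumed
conventions; capture whatever your own tooling can actually report."""
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class GenerationRecord:
    model: str                        # exact model identifier, not "the assistant"
    temperature: float
    context_sha256: str               # hash of the packed context, if it is sensitive
    retrieved_files: list[str]
    tool_calls: list[str]
    verification: dict[str, str]      # check name -> outcome
    timestamp: float = field(default_factory=time.time)

record = GenerationRecord(
    model="example-model-2026-01",    # hypothetical identifier
    temperature=0.2,
    context_sha256="<sha256 of packed context>",
    retrieved_files=["src/billing/invoice.py", "docs/dependency-rules.md"],
    tool_calls=["read_file src/billing/models.py", "run_tests"],
    verification={"pytest": "pass", "mypy": "pass", "import-gate": "pass"},
)
print(json.dumps(asdict(record), indent=2))
```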
These practices are not exotic. Most experienced engineers can recite them once the framing is in front of them. The question is not whether they are known but whether they are applied as architecture — as part of the deliberation that produces the system, rather than as advice consulted after a problem has already occurred. A team that applies these practices consistently will, over years, build a context fabric whose specifications outlast the models that consume them. A team that does not will accumulate a generation history whose only documentation is the code it produced, and will discover, every time a model is retired or a tool deprecates, that the implicit architecture it relied on was sitting in a vendor's training data the whole time.
VII. The Asset Is the Specification
Generative AI in software engineering is not the end of the discipline, but the beginning of its maturation into a true systems discipline. The activity of writing code — the activity that defined the profession for most of its history — is being absorbed by tools, on a schedule the profession did not vote on. What is not being absorbed is the activity of deciding what code must be: the specification of structure, the design of constraints, the engineering of verification regimes. That work has not been automatable in any of the previous waves of tooling and is not automatable now. It is the work that remains, and it is the work this essay is about.
The dependencies are not the asset. The optionality is. That observation governed the earlier essays on vendor lock-in and cloud lock-in, and the analogous claim governs this one. The generated code is not the asset; the specification is. A team whose context fabric is rich, written down, versioned, and machine-readable can regenerate the implementation against any sufficiently capable model — current, next year's, the one after the current vendor exits the market. The implementation becomes a consumable. A team whose specifications live only in the heads of senior engineers can regenerate nothing — the moment those engineers leave, or the moment the model changes, or the moment the IDE's retrieval logic gets rewritten, the implicit architecture is gone with them.
The tools have changed. The need for engineering judgment has not — but the surface where judgment is expressed has migrated. Knowing how things work, writing it down in the form the model can act on, and verifying that the output is what the writing-down implied — that is the engineering of the next decade, and that is the discipline that distinguishes engineers from operators of probabilistic content generators. The choice of which to be is made implicitly, every day, in the structure of the work each team accepts. The argument here is that it should be made deliberately, with the same care any other architectural decision deserves, and on a horizon longer than the next sprint.
References
[1] Meadows, D. H. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing. ISBN: 978-1-60358-055-7.
[2] Dijkstra, E. W. (1972). The Humble Programmer. ACM Turing Award Lecture. Communications of the ACM, 15(10), 859–866. https://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
[3] Lanyado, B. (Lasso Security, 2024). Research on AI-hallucinated package names; huggingface-cli case study and 30,000-download experiment. Industry coverage: The Register / Dark Reading, "AI Code Tools Widely Hallucinate Packages." https://www.darkreading.com/application-security/ai-code-tools-widely-hallucinate-packages
[4] "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs." USENIX Security Symposium 2025. Coverage: Socket, "The Rise of Slopsquatting: How AI Hallucinations Are Fueling a New Class of Supply Chain Attacks." https://socket.dev/blog/slopsquatting-how-ai-hallucinations-are-fueling-a-new-class-of-supply-chain-attacks
[5] Nesbitt, A. (2025). Slopsquatting meets Dependency Confusion. December 10, 2025. https://nesbitt.io/2025/12/10/slopsquatting-meets-dependency-confusion.html
[6] METR (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. July 10, 2025. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ — Preprint: arXiv:2507.09089. https://arxiv.org/abs/2507.09089
[7] METR (2026). We Are Changing Our Developer Productivity Experiment Design. February 24, 2026. https://metr.org/blog/2026-02-24-uplift-update/
[8] Willison, S. (2025). The Lethal Trifecta for AI Agents: Private Data, Untrusted Content, and External Communication. Talk and write-up, Bay Area AI Security Meetup, August 2025. https://simonwillison.net/2025/Aug/9/bay-area-ai/