Building in the Age of AI: Why Systems Thinking Matters More

AI makes custom tooling accessible to anyone who understands the underlying systems. The value of knowing how things work keeps going up, not down.

I. The Democratization Story, and Why It Is Half True

A familiar narrative has formed around AI-assisted software development: the barrier to building has collapsed. Anyone with an idea can now produce working code. The gatekeepers — syntax fluency, framework familiarity, the long apprenticeship of learning a stack — have been swept aside. Software, the story goes, is being democratized.

The claim is not wrong. It is incomplete, in a way that matters.

What has been democratized is output. Producing a function, a script, a service skeleton, a working prototype — these have indeed become accessible to anyone who can describe what they want. But producing a sound system — one that holds up under load, defends itself against adversaries, fails predictably, ages gracefully, and integrates honestly with the human and organizational context around it — has not been democratized. The opposite has happened. The barrier to producing sound systems has moved, and risen.

This is the argument of the present essay: AI has not lowered the bar for serious engineering work. It has raised it. The reason is structural, and it has nothing to do with the cleverness of any particular model. It is about the kind of tool AI is, the kind of systems we ask it to help build, and the gap between the two.


II. The Category Error: A Probabilistic Tool for Deterministic Systems

The first thing to understand about a large language model is that it is a probabilistic generator. It produces output sampled from a learned distribution. For any given prompt, there is no single correct response — there is a space of plausible continuations, and the model returns one. Run the same prompt twice and you may get two different answers. Both may be plausible. Neither is guaranteed to be correct.

The systems we ask such tools to help build are, by contrast, deterministic in their requirements. A payment processor must not occasionally lose a transaction. An authentication layer must not sometimes admit an unauthorized request. A medical record system must not now and then mix up two patients. Industrial control logic must not, on rare inputs, miscalibrate the equipment it controls. The acceptable failure rate of these systems, weighed honestly against their consequences, approaches zero.

This is a category error built into the workflow. We are using a probabilistic producer to construct deterministic artifacts that operate on consequential reality. The gap between probabilistic production and deterministic requirement has to be closed by something. Right now, that something is the human in the loop.

This gap is not a temporary engineering deficiency that better models will close. It is a feature of the approach. A model that generates code by sampling from a learned distribution will, by construction, produce occasional outputs that are confidently wrong — syntactically valid, plausibly structured, sometimes even superficially elegant, and incorrect in ways that are not visible from the surface. Brooks's distinction between essential and accidental complexity in software engineering is useful here [1]. AI dramatically reduces accidental complexity: the syntactic toil, the boilerplate, the framework-lookup overhead. It does not touch essential complexity — the inherent difficulty of specifying correct behavior, reasoning about edge cases, understanding the problem domain. Brooks argued in 1986 that no single development would deliver a tenfold improvement in software productivity within a decade, because the hard part of software is conceptual, not clerical. Forty years later, the prediction holds. AI eliminates clerical work spectacularly. The conceptual work remains.

The empirical evidence on this point is increasingly substantial, and increasingly inconvenient. In a controlled study at NYU, Pearce and colleagues prompted GitHub Copilot to generate code in 89 scenarios designed to elicit common security weaknesses, producing nearly 1,700 programs; approximately 40% of the generated code contained recognizable vulnerabilities drawn from MITRE's CWE Top 25 list of most dangerous software weaknesses [2]. A subsequent user study at Stanford by Perry and colleagues, published at ACM CCS 2023, found that participants given access to an AI assistant wrote less secure code than those without — and, critically, were more confident their code was secure [3]. The Stanford finding persisted when participants' prior experience was taken into account.

The Stanford finding is the more important of the two. It is not surprising that an early code-generation model produced insecure suggestions; that is a tractable engineering problem and models have improved. What is harder to engineer away is the calibration failure: the user's confidence rises faster than the user's actual competence. The tool produces output that looks correct, reads correct, and runs in the obvious test cases. The reviewer, lacking the deep context, accepts it. The defect ships.

This is the specific danger of a probabilistic tool generating deterministic artifacts. It is not that the tool sometimes produces wrong answers — every tool sometimes produces wrong answers. It is that the tool produces wrong answers that are indistinguishable from right answers to anyone who is not already competent enough to spot the difference. The danger is plausibility without correctness, and it scales with the user's trust.


III. Two Structural Problems: Throughput and Drift

Beyond the category error itself, the daily practice of AI-assisted engineering surfaces two structural problems that follow directly from the nature of the tool.

The Throughput Asymmetry

A modern code-generation model can produce 5,000 lines of plausible-looking code in minutes. A competent human review of 5,000 lines — actual review, with comprehension of intent, edge cases, security implications, integration assumptions, and architectural fit — takes hours, often days. The asymmetry is not a temporary quirk of current tooling. It is structural, and it scales unfavorably with model capability: as models produce more code faster, the review bottleneck tightens.

This has a corollary that the naive workflow ignores. The default mental model — AI writes, human reviews — does not survive contact with realistic throughput. You cannot solve it by reading faster. The METR randomized controlled trial of experienced open-source developers in early 2025 found that allowing AI tooling on real tasks in mature codebases increased completion time by 19%, even as the same developers estimated they were going 20% faster [4]. The perception-reality gap is exactly what one would predict from the calibration problem above: developers feel productive while shipping work that takes longer to land cleanly. METR has since acknowledged that interpreting this finding requires care [5], but the directional point holds: at minimum, the productivity story is more complicated than the marketing suggests, and at maximum, naive AI-assisted workflows are net negative for experienced engineers on serious codebases.

The 2024 Accelerate State of DevOps Report — DORA's annual survey of more than 39,000 software professionals — found a paradox of the same shape at organizational scale: AI adoption correlated with measurable gains in individual productivity, flow, and job satisfaction, while simultaneously correlating with a 7.2% decrease in software delivery stability and a 1.5% decrease in throughput [6]. The individual feels faster. The system the individual is part of moves slower and breaks more often. The plausible explanation, as the DORA report itself notes, is that AI allows individuals to produce larger changesets, and larger changesets are slower to integrate and more prone to instability. The local optimum and the system optimum are diverging.

If this sounds familiar, it should. Lisanne Bainbridge described exactly this dynamic in 1983, in a paper titled "Ironies of Automation" [7]. Studying industrial process control, Bainbridge observed that automating most of an operator's work did not eliminate the operator — it shifted the operator into a monitoring role for the situations the automation could not handle, while simultaneously degrading the operator's hands-on skill through disuse. Automation, Bainbridge concluded, raised the demand for operator skill rather than lowering it: the human is now responsible for exactly the cases the machine cannot handle, which are by definition the hardest cases. Forty years later, with a different kind of automation, the same irony reappears. The AI handles the easy and middling cases. The human is left to catch the hard ones — at higher volume, with less practice, and under the additional cognitive load of reviewing machine output rather than producing original work.

The Drift Problem and the Containment Question

The second structural problem is drift. A probabilistic generator without strong guardrails wanders. It hallucinates APIs that do not exist. It adopts framework conventions that look plausible but are wrong. It "improves" code outside its assigned scope. It forgets architectural decisions made earlier in the conversation. And its output is not reproducible: similar prompts do not yield similar results in the way an engineer relying on them would assume.

The empirical signature of this drift is now visible at scale. GitClear's 2025 analysis of 211 million changed lines of code from 2020 to 2024 found multiple converging indicators of architectural decay tracking the rise of AI-assisted development: a fourfold increase in code clones, an eightfold increase in code blocks of five or more lines duplicating adjacent code, and a collapse in moved lines — the metric that captures refactoring activity, the rearrangement of code into reusable modules — from 24.1% of changes in 2020 to 9.5% in 2024 [8]. Code is being added; it is no longer being shaped. Whatever organization a codebase had is being eroded faster than it is being maintained.

This is not a moral failing of the developers using AI. It is the predictable downstream consequence of the throughput asymmetry combined with the drift tendency. When generation outruns reflection, refactoring is what gets cut.

To use this tool seriously requires what we might call containment: explicit, durable constraints that keep the probabilistic generator on course. Defined scope, locked architectural decisions, reference documentation held in context, deterministic test harnesses that catch deviation, prompt and context discipline, version-controlled prompt artifacts so that the same intent produces comparable output across sessions. Prompt engineering, in the popular sense, is a small subset of this. The real engineering problem is not "how do we get the AI to write code" — that problem is solved. The real problem is: how do we contain a probabilistic generator inside a deterministic engineering pipeline such that the output is trustworthy enough for production?
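
One concrete shape for the "deterministic test harnesses that catch deviation" mentioned above is a contract test that pins the externally observable behavior of a component the AI is permitted to regenerate, so that a regeneration that drifts from the contract fails before any human reads the diff. The sketch below is illustrative only: the fee calculation, its 2.9% semantics, and the pinned values are assumptions made for the example, and the stand-in function would in practice be imported from the module under containment.

    # A deterministic contract-test sketch. `calculate_fee` is a stand-in for a
    # component an AI assistant is permitted to regenerate; in practice it would
    # be imported from the module under containment. The fee semantics (2.9% of
    # the amount, no negative inputs) are illustrative assumptions.
    import unittest
    from decimal import Decimal, ROUND_HALF_UP


    def calculate_fee(amount: Decimal) -> Decimal:
        # Stand-in implementation; the contract below is what must survive regeneration.
        if amount < 0:
            raise ValueError("amount must be non-negative")
        return (amount * Decimal("0.029")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


    class FeeContractTest(unittest.TestCase):
        def test_pinned_values(self):
            # Golden values pin the externally observable contract.
            self.assertEqual(calculate_fee(Decimal("100.00")), Decimal("2.90"))
            self.assertEqual(calculate_fee(Decimal("0.00")), Decimal("0.00"))

        def test_fee_bounds_over_fixed_inputs(self):
            # Property-style invariants over a fixed, reproducible set of inputs.
            for cents in range(0, 100_000, 997):
                amount = Decimal(cents).scaleb(-2)
                fee = calculate_fee(amount)
                self.assertGreaterEqual(fee, Decimal("0"))
                self.assertLessEqual(fee, amount)

        def test_negative_amount_rejected(self):
            with self.assertRaises(ValueError):
                calculate_fee(Decimal("-1.00"))


    if __name__ == "__main__":
        unittest.main()

The particular values matter less than the properties of the harness itself: it is deterministic, it encodes the contract rather than the implementation, and it runs in seconds no matter how many times the component is rewritten, which is the only way verification keeps pace with generation.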

Most of what is currently called AI-assisted engineering is not engineering this question at all. It is prompting, with the containment problem unsolved or ignored. The damage of that omission is currently invisible because most AI-generated code is young; the bills come due over the maintenance horizon, when the architectural drift compounds and the duplicated blocks need to be changed in fifteen places at once, and the wrong assumption embedded in the original generation surfaces in production.


IV. The New Competence Profile, and the Meta-System

What does this mean for the practitioner who wants to do serious work with these tools?

It means, first, that the specialist division of labor that organized software engineering for thirty years is collapsing into the individual at the controls. Before AI, a team distributed concerns across roles: a backend engineer wrote backend code, a security engineer hardened it, a database administrator tuned it, a quality engineer tested it, an architect designed it, an SRE operated it, a senior reviewer caught what the others missed. Each role brought specialized judgment, accumulated through years of practice. The team's competence was the sum of those specializations, mediated by their interactions.

With AI in the loop, implementation is commoditized but judgment is not. The AI can produce backend code, but it cannot tell you whether the architectural decision behind that code will hold under three years of scaling. It can generate a security middleware, but it cannot tell you whether the threat model was correctly specified to begin with. It can write database queries, but it cannot tell you whether the schema is the right shape for the questions the business will ask in eighteen months. It can scaffold tests, but it cannot tell you whether you are testing the things that matter or whether the test surface adequately covers the failure modes that will actually occur in production.

In each of these cases, the AI supplies the production and the human supplies the judgment. The AI is the only thing producing code, and the human is the only thing verifying that the code does what the larger system requires. To verify across all those dimensions — architecture, security, performance, data, operations, integration — the human must hold senior-level judgment in all of them simultaneously. Not because the human personally produces all of it, but because the human is the only verification layer the AI has.

This is the rising bar. Before AI, you could be a strong backend engineer who deferred security to the security team and operations to the SRE team. After AI, when you are working alone or in a small team with AI as a force multiplier, those handoffs do not exist. The AI does not defer to a security team. It produces what it produces, and you are the security team. And the SRE team. And the architect. And the reviewer.

This is the sense in which "knowing how things work" has become more valuable, not less. The old role of implementation specialist — the engineer whose value was deep mastery of one stack — is being commoditized. The role that grows in value is the systems generalist: the engineer who can hold the whole picture at sufficient depth to evaluate the AI's output against requirements the AI cannot itself articulate. The headline of this article, posed as a claim, can now be stated as a conclusion: the value of knowing how things work has gone up, not down, because that knowledge is now the only thing standing between AI-generated plausibility and production reality.

There is a deeper move available here, beyond the senior-generalist profile, and it is the one that distinguishes the practitioners who will build durable systems with AI from those who will accumulate technical debt at unprecedented speed.

The deeper move is to recognize that the operator's real job is not to review the AI's output line by line — that does not scale, as we have established. The operator's real job is to design the meta-system that contains the AI. This meta-system is itself an engineering artifact, and designing and maintaining it is the actual work of AI-assisted engineering. It includes:

The architectural envelope: the design decisions, interface contracts, and structural invariants that the AI is permitted to fill in but not violate. These must be specified explicitly, in artifacts the AI is given alongside the task, so that the probabilistic generator is anchored to a deterministic skeleton.

The verification scaffolding: the deterministic test harnesses, type systems, static analyzers, schema validators, contract tests, and property-based tests that catch deviations the human cannot review fast enough to catch directly. The verification surface must be designed to scale at the same rate as the generation surface, or the throughput asymmetry will dominate.

The context architecture: the body of reference material — documentation, examples, conventions, prior decisions, domain models — that the AI works against. This is an artifact that must be curated, versioned, and maintained, because the quality of AI output is a function of context quality more than prompt cleverness. Context is the non-prompt half of the prompt.

The scope discipline: explicit boundaries on what the AI is allowed to change in any single interaction, enforced operationally rather than relied upon as a convention. A model that "improves" code outside its assigned scope is a model that is drifting; the discipline must prevent the drift, not detect it after the fact. A minimal sketch of one such operational check follows this list.

The reproducibility infrastructure: prompts, contexts, model versions, and parameters captured in version control alongside the code they produced, so that a future engineer can understand not only what the system does but how it came to do it, and can reproduce the generation if it must be redone.
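
To make the scope discipline concrete as an operational control rather than a convention, the following sketch shows one way a CI step might reject an AI-assisted change that touches files outside the scope assigned for the task. The path patterns and the comparison against origin/main are assumptions for the sake of illustration, not a prescription.

    # An operational scope-check sketch, intended to run as a CI step after an
    # AI-assisted change. The allowed path patterns and the comparison against
    # origin/main are illustrative assumptions.
    import subprocess
    import sys
    from fnmatch import fnmatch

    # Scope assigned for the current task (hypothetical paths).
    ALLOWED_PATTERNS = [
        "services/billing/*.py",
        "services/billing/tests/*.py",
    ]


    def changed_files(base_ref: str = "origin/main") -> list[str]:
        # Files changed relative to the base branch, one path per line.
        result = subprocess.run(
            ["git", "diff", "--name-only", base_ref],
            check=True, capture_output=True, text=True,
        )
        return [line for line in result.stdout.splitlines() if line.strip()]


    def out_of_scope(paths: list[str]) -> list[str]:
        # Anything not matched by an allowed pattern is a scope violation.
        return [p for p in paths if not any(fnmatch(p, pattern) for pattern in ALLOWED_PATTERNS)]


    if __name__ == "__main__":
        violations = out_of_scope(changed_files())
        if violations:
            print("Change exceeds the assigned scope:")
            for path in violations:
                print(f"  {path}")
            sys.exit(1)
        print("All changed files are within the assigned scope.")

The same mechanism extends naturally to the reproducibility infrastructure: the allowed scope, the prompt, the context references, and the model version can all live in a small version-controlled task manifest that the check reads, so the constraint and the record of the generation travel together.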

This is, in the precise sense, systems thinking — Donella Meadows's discipline of seeing stocks, flows, feedback loops, and leverage points rather than discrete transactions [9]. The AI-assisted development workflow is itself a system: the human, the AI, the codebase, the verification scaffolding, the context, the deployment pipeline, all coupled with feedback loops both productive (tests catch errors, reviews improve quality) and pathological (drift accumulates, calibration degrades, debt compounds). To work in this system effectively, the practitioner must see the system, not just the next chunk of generated code. The leverage points are not in faster generation. They are in the architecture of the containment.

There is a recursive Conway's Law lurking here that is worth naming. Conway observed in 1968 that the systems an organization produces mirror the communication structures of the organization that produced them [10]. With AI as a participant in the development process, the system being built now mirrors not only the human organization but the structure of the human-AI collaboration itself. A team that uses AI without containment will produce systems that look like AI output without containment: plausible, sprawling, internally inconsistent, hard to maintain. A team that engineers the meta-system with care will produce systems that reflect that care. The discipline of the collaboration is visible in the artifact.


V. Closing: Knowing How Things Work, As Foundation

The story that AI has made software development easy is, taken as written, a comfortable lie. What AI has done is collapse the cost of producing code that looks like software, while leaving untouched — and in some respects raising — the cost of producing software that actually works at scale, under adversaries, across years, on the consequences of real decisions. The first cost is large and visible. The second is enormous and invisible until it surfaces, which it will, on a timescale measured in maintenance cycles rather than sprints.

The practitioner who treats AI as a magic productivity multiplier and skips the containment work is not doing AI-assisted engineering. They are doing fast, plausible, undisciplined production, and they will pay for it later, almost certainly with interest. The practitioner who treats AI as a powerful but unaccountable junior — one that works at superhuman speed and requires architectural supervision it cannot provide for itself — is doing the actual work. That work is harder, slower in the short term, and considerably more valuable.

The headline of this essay is, in the end, a quiet claim and a demanding one. AI does make custom tooling accessible to anyone who understands the underlying systems. The first half of that sentence is the visible democratization. The second half is the part the marketing leaves out, and it is the part that decides whether what gets built will hold. The value of knowing how things work has gone up, not down, because that knowledge is now load-bearing for using the tool at all.

It is a good time to be a systems thinker. It is a worse time than ever to pretend to be one.


References

[1] Brooks, F. P. (1987). "No Silver Bullet — Essence and Accidents of Software Engineering." IEEE Computer, 20(4), 10–19. (Originally published 1986 in Information Processing 86, IFIP.) https://www.cs.unc.edu/techreports/86-020.pdf

[2] Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., & Karri, R. (2022). "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions." 2022 IEEE Symposium on Security and Privacy (SP). arXiv:2108.09293. https://arxiv.org/abs/2108.09293

[3] Perry, N., Srivastava, M., Kumar, D., & Boneh, D. (2023). "Do Users Write More Insecure Code with AI Assistants?" Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23). arXiv:2211.03622. https://arxiv.org/abs/2211.03622

[4] Becker, J., Rush, N., Barnes, E., & Rein, D. (2025). "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." METR. arXiv:2507.09089. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

[5] METR (2026). "We are Changing our Developer Productivity Experiment Design." https://metr.org/blog/2026-02-24-uplift-update/

[6] DORA / Google Cloud (2024). 2024 Accelerate State of DevOps Report. https://services.google.com/fh/files/misc/2024_final_dora_report.pdf

[7] Bainbridge, L. (1983). "Ironies of Automation." Automatica, 19(6), 775–779. DOI: 10.1016/0005-1098(83)90046-8

[8] GitClear (2025). AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones. https://www.gitclear.com/ai_assistant_code_quality_2025_research

[9] Meadows, D. H. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing. ISBN: 978-1-60358-055-7.

[10] Conway, M. E. (1968). "How Do Committees Invent?" Datamation, 14(4), 28–31. http://www.melconway.com/Home/pdf/committees.pdf
