TrueSolvers is an independent technology publisher with a professional editorial team. Every article is independently researched, sourced from primary documentation, and cross-checked before publication.
Most developers building with AI coding agents have started using context files, and most are auto-generating them. A February 2026 benchmark study found that auto-generated files reduce task success rates while raising inference costs by over 20%. Here is what the research says about building context files that actually work.

AI coding agents, including Claude Code, Cursor, Codex, and GitHub Copilot, share an architectural constraint that no amount of model improvement will eliminate in the near term: they start each session without any memory of the last one. Every session begins at zero. The agent has no awareness of architectural decisions made last sprint, conventions the team settled on three months ago, or the mistake it made and corrected last Tuesday. That amnesia is by design, not by accident, and it is the root cause of most consistency problems developers attribute to the model itself.
Context files emerged as the community's practical answer. The idea is direct: write down what the agent needs to know at the start of every session, load it automatically, and give the agent a running start. Claude Code reads a file called CLAUDE.md; Cursor reads .cursorrules; OpenAI's Codex and GitHub Copilot agents have converged on AGENTS.md as an emerging cross-tool standard. By the time a 2026 efficiency study tallied adoption, the AGENTS.md format alone had been adopted by more than 60,000 GitHub repositories. A separate large-scale scan of 129,134 GitHub projects found AI coding agent adoption ranging from 15.85% to 22.60% across the surveyed population.
That adoption curve is accelerating, and with it, a default behavior: run the agent's initialization command, let it analyze the codebase, and auto-generate the context file. It feels productive. The file looks comprehensive. The problem is that benchmark research published in February 2026 by researchers at ETH Zurich found that LLM-generated context files reduced task success rates in 5 of 8 evaluation settings, with an average performance drop of 2 to 3 percentage points. Inference costs rose by 20 to 23% at the same time. The auto-generation workflow that feels like a productivity win is, by the benchmark's measure, the worst of the three options tested. Human-written files performed best, no file at all came second, and auto-generated files came last.
The ETH Zurich team built a purpose-designed benchmark called AGENTbench, which evaluated 138 real Python tasks drawn from 12 niche repositories. These weren't toy problems. The repositories were chosen because they already had developer-committed context files in place, giving the researchers a ground truth of what humans thought agents needed to know. The context files those developers wrote averaged 641 words across 9.7 sections. The researchers then tested whether providing those files, or AI-generated alternatives, improved agent task completion across Claude Code, Codex, and Qwen Code.
Whether the ETH Zurich results generalize equally to polyglot or monorepo codebases remains an open question for our team to watch as follow-up studies emerge. The benchmark was built on Python repositories, and the degree to which its findings transfer to other language ecosystems and codebase structures has not yet been tested empirically. That said, the mechanism the researchers identified applies at the level of how LLM inference works, not at the level of Python specifically, and that is where the more durable insight lives.
When ETH Zurich's researchers traced why context files raised agent costs by 20% without improving outcomes, the mechanism turned out to be agents' own compliance: they followed every instruction, ran every check the file prescribed, even when those checks had nothing to do with the task at hand. A context file that told agents to use the uv package manager resulted in that tool being invoked 1.6 times per instance on average, compared to fewer than 0.01 times without the file. Agents do not read context instructions and judge their relevance to the current task. They treat all instructions as active for all tasks, all the time. The behavioral trace analysis in the ETH Zurich paper suggests this mechanism is likely operating across all agents tested, though the precise degree may vary by model capability and task type.
Every transformer-based model creates a pairwise relationship between every token in its context and every other token. The number of those relationships grows as the square of the context length: doubling the tokens quadruples the attention relationships the model must track. That means every token in a context file competes directly with every token in the actual task for the model's attention capacity. Adding more instructions doesn't just give the agent more to follow; it reduces the precision with which the agent can follow everything.
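The squared growth is easy to verify with a back-of-the-envelope sketch (the token counts below are illustrative, not figures from the studies cited):

```python
def attention_pairs(n_tokens: int) -> int:
    """Pairwise relationships a transformer's attention tracks:
    every token attends to every token, including itself."""
    return n_tokens * n_tokens

# Doubling the context quadruples the relationships competing for attention.
task_only = attention_pairs(1_000)     # a 1,000-token task prompt alone
with_context = attention_pairs(2_000)  # plus a 1,000-token context file
print(with_context // task_only)       # -> 4
```

The context file here doubles the input but quadruples the attention work, which is why its cost is paid by every token in the task, not just by the file itself.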
Researchers at Chroma tested this phenomenon, which they call context rot, across 18 frontier models and found that every model degrades as input length grows. The degradation is not a steep cliff but a gradient: performance erodes steadily as more content fills the window. Models attend most reliably to content at the very beginning and very end of their context; material positioned in the middle is retrieved with measurably lower accuracy. A verbose context file loaded before every session doesn't just consume tokens; it pushes task-critical information toward the middle of the window, the lowest-attention zone.
Claude Code's behavior makes this concrete in a way the benchmark data doesn't fully capture. HumanLayer's analysis of production Claude Code deployments found that Claude Code injects a system-level reminder into every conversation, telling the model that context file content "may or may not be relevant to your tasks" and that it "should not respond to this context unless it is highly relevant to your task." A long context file doesn't just cost attention budget; it triggers a filtering step where Claude actively deprioritizes content it deems irrelevant. Longer files contain more content that will be filtered, and the filtering itself consumes capacity.
Across the sessions the ETH Zurich benchmark measured, context files added an average of 3.92 extra steps to agent execution paths. Those steps weren't always wrong; they were often exactly what the context file prescribed, such as running the full test suite or reading multiple files before making changes. They were just frequently unnecessary for the specific task being executed. How far that step inflation generalizes beyond the agents tested is a boundary our team expects future benchmarks to clarify.
Context file failures, then, are not primarily a content problem. Instructions don't fail because they're incorrect. They fail because they're unconditional. Every instruction in a flat context file is in scope for every task, and more instructions produce more steps, more cost, and more noise in the attention window, regardless of whether any given instruction applies.
The data picture looks confusing at first. One paper finds context files hurt performance. Another paper finds they save significant time. A practitioner reading both might reasonably conclude that the research is inconsistent and discount both. That conclusion would miss the most important thing the two studies, read together, establish.
The efficiency study, published in January 2026 and covering a wide sample of AGENTS.md deployments, found that providing a context file reduced median agent completion time from 98.57 to 70.34 seconds, a 28.64% reduction, with output tokens falling by 20%; both results were statistically significant. Agents with a context file finished faster and used fewer tokens. That is a real and reproducible finding.
The ETH Zurich study, published the following month, found that human-written developer context files produced only a 4% average gain in task success rates while still raising inference costs by up to 19%. LLM-generated files performed worse: tasks were harder to complete correctly.
The JAWs efficiency study found AGENTS.md files cut median agent runtime by 28.64%; the ETH Zurich effectiveness study found those same files reduced task success rates. These results are not contradictory. Speed and correctness are different measurements. A context file that tells an agent which package manager to use, how to invoke the build system, and where the test runner lives will eliminate a lot of exploratory steps, reducing time and tokens consumed. That same file, if it also specifies code style conventions, documentation requirements, and testing protocols the agent will apply regardless of task scope, will push the agent down more steps than necessary and reduce the probability it lands on the correct solution. The efficiency gain comes from narrowing the action space. The quality degradation comes from forcing that narrowed action space onto tasks where the narrowing is wrong.
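The headline runtime figure reduces to simple arithmetic on the reported medians, which makes it easy to sanity-check:

```python
before, after = 98.57, 70.34           # median completion time, seconds
reduction = (before - after) / before  # fractional runtime reduction
print(f"{reduction:.2%}")              # -> 28.64%
```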
The 4% performance gain from human-written files was measured across a specific benchmark set; teams running high-volume agent workflows may see different ratios depending on their codebase complexity. But the direction the data points is consistent: human-written, precision-focused files outperform auto-generated files on both dimensions. The developers who wrote the files in the ETH Zurich benchmark knew their repositories and wrote instructions specific to what those repositories needed. The LLM-generated files, by contrast, tended toward comprehensiveness, producing denser documents covering more ground without necessarily covering the right ground.
Detailed directory trees and codebase overviews added to context files did not help agents find relevant files faster; agents navigate file systems effectively without maps. The information that helped was precise, non-inferable instruction: which tool to use, which test to run, which pattern to follow. Everything else was overhead.
Teams that measure AI coding agent productivity primarily through speed and token consumption will see improvements with almost any context file. The faster completion times are real. The risk is that a file optimizing for speed may simultaneously be degrading correctness, and correctness failures only show up at code review, QA, or production, not in the productivity dashboard. An accurate picture of context file impact requires measuring task success alongside time and cost, and most teams aren't measuring all three.
A context file should contain the minimum set of non-inferable, task-relevant instructions that would cause the agent to make mistakes if removed. Not the maximum information the agent might find useful. The minimum information it cannot function correctly without.
Anthropic's official Claude Code best practices documentation specifies that CLAUDE.md is loaded every session, making every line a permanent tax on the agent's attention budget. The target documented is under 200 lines, and the filtering question is explicit: would removing this line cause Claude to make mistakes? If not, it should not be there. The documentation is direct about the consequence of ignoring this guidance: "Bloated CLAUDE.md files cause Claude to ignore your actual instructions." This isn't a style preference. It reflects the system reminder mechanism described earlier, where Claude actively deprioritizes content it deems irrelevant, and larger files produce more content that will be deprioritized.
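The line-count target lends itself to a mechanical check. Below is a minimal sketch of a CI guard, not an official tool; the threshold constant mirrors Anthropic's documented target, and the `CLAUDE.md` path is an assumption about your repository layout:

```python
from pathlib import Path

MAX_LINES = 200  # Anthropic's documented target for always-loaded files

def within_line_budget(path: str, max_lines: int = MAX_LINES) -> bool:
    """Return True if the context file's non-blank lines fit the budget."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    content = [line for line in lines if line.strip()]
    return len(content) <= max_lines

# Example CI usage:
# if not within_line_budget("CLAUDE.md"):
#     raise SystemExit("CLAUDE.md exceeds the 200-line attention budget")
```

A check like this catches bloat mechanically, but it cannot apply the removal test itself; only a human who knows the repository can judge which lines would cause mistakes if deleted.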
The most productive structure practitioners have found follows a WHY/WHAT/HOW sequence. The WHY is project purpose: what is this codebase for, and what problem does it solve? One or two sentences. The WHAT is technical architecture: the stack, the key dependencies, the structural decisions that the agent can't infer from reading files. The HOW is operational guidance: build commands, test invocations, conventions the agent would otherwise have to guess at. If you're setting up Claude for the first time and want to understand what broader configuration steps complement your context file, the Claude Setup Guide covers the foundational options most new users skip. What belongs in none of the WHY/WHAT/HOW categories is anything the agent can figure out by reading the codebase itself, which covers a great deal more than most developers initially assume.
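Applied to a hypothetical project, the sequence might look like the following. Every detail below is invented for illustration; nothing here comes from the studies cited:

```markdown
# CLAUDE.md

## Why
Billing reconciliation service: matches payment-provider events against
our internal ledger.

## What
- Python 3.12, FastAPI, Postgres via SQLAlchemy
- Event handlers must be idempotent (architectural decision, ADR-014)

## How
- Install: `uv sync` / Test: `uv run pytest -q`
- 4-space indentation, no tabs; blank line between method definitions
- Never hand-edit files under `app/migrations/`
```

Note what the skeleton omits: no directory tree, no style essay, no restatement of anything the agent can read from the code itself.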
Studies of how developers actually use context files across production deployments found that 72.6% of Claude Code projects specify application architecture, the most common category by a significant margin. The next most common categories are build and run commands, implementation details, and testing protocols. What appears rarely: performance targets, security requirements, and UI/UX conventions. That distribution aligns with the research: the instructions that help are the ones that can't be inferred from the code, and architecture decisions are the clearest example of non-inferable project knowledge.
The ETH Zurich benchmark data and Anthropic's official documentation arrive at the same prescription from different directions: minimize scope, maximize precision, write every line to the removal test. Neither source states this convergence explicitly; it only becomes visible when reading both together, and that convergence is the strongest signal a practitioner has that the precision-first principle is the right one to build on.
A context file that works well at creation can quietly become harmful over time. Four failure patterns appear consistently across production deployments.
Vagueness is the most common. Instructions like "follow existing indentation style" require the agent to make a judgment call. When that call is wrong, the output is inconsistent. Instructions that work are specific: not "follow code style" but "use 4-space indentation, no tabs, with blank lines between method definitions." If the instruction requires agent judgment to interpret, it belongs in a different category or should be removed.
Contradictions typically appear as a codebase grows. An early instruction specifying one test framework conflicts with a later instruction after a migration. The agent follows both as best it can, producing inconsistent or incorrect behavior. Context files need the same review discipline as code: when a technical decision changes, the context file changes with it.
Missing feedback loops mean there is no way for the team to know when the agent is ignoring or misinterpreting context instructions. The fix is periodic audit: run a session with an explicit check of whether the agent's output matches what the context file specifies, and update accordingly.
Drift is the most insidious. A context file accurate six months ago may describe an architecture, a toolchain, or a testing approach the codebase has since moved away from. The agent follows the stale instructions faithfully. The output is wrong in ways that trace to the context file, not to model failure. Context files should be reviewed on the same cadence as other infrastructure artifacts: during code review when architectural decisions change, not only when onboarding new team members.
A single context file works well for codebases a developer can describe in a few screens of text. As a project grows, the flat-file approach forces a choice between too little context, leaving agents without the knowledge they need, and too much context, loading the unconditional cost of comprehensiveness into every session. A third path exists, and it appears both in published research and in the engineering philosophy behind how Claude Code itself handles context.
The codified context infrastructure paper we examined reports that over 80% of human prompts were under 100 words, consistent with the hypothesis that pre-loaded context reduces in-prompt explanation requirements. That figure came from a three-tier architecture built during the construction of a 108,000-line C# distributed system, across 283 sessions, with 2,801 human prompts and 1,197 agent invocations. The project was built by a researcher whose primary background is in chemistry, not software engineering, making it a documented test case for AI-assisted development by a domain expert working outside their expertise. A single AGENTS.md file could not have captured what that project needed to know.
The architecture the paper developed separates context into three tiers with distinct loading strategies:
The first tier is a compact, always-loaded document that encodes only the project-wide conventions, architectural decisions, and orchestration protocols that apply to every task. This is what a CLAUDE.md or AGENTS.md file should contain: the non-negotiable knowledge no session can begin without. In the case study, this document ran to approximately 660 lines, substantially longer than the under-200-line target for a single flat file, but it was the foundation for a 108,000-line system, not a prototype.
The critical design choice is what stays out of this tier. Domain-specific knowledge, workflows that apply to particular subsystems, and specifications for complex features all belong in the lower tiers where they are loaded on demand rather than unconditionally.
The second tier consists of specialist agents with substantial embedded domain knowledge loaded only when the relevant task type is encountered. The case study developed 19 specialist agents, including a code reviewer invoked 154 times and a network-protocol designer invoked 85 times. Trigger tables in the first-tier document directed the orchestrator to the correct specialist based on the type of work being requested.
This approach addresses what the researchers call the brevity bias: when context files are iteratively optimized for conciseness, they tend to collapse toward generic instructions that don't give specialists enough to work with. Embedding domain knowledge directly into specialist agents rather than trying to fit it into the always-loaded file preserves specificity without paying the unconditional loading cost.
The third tier is a collection of specification documents, 34 in the case study totaling approximately 16,250 lines, retrieved via an MCP (Model Context Protocol) server only when a task requires them. These documents cover architectural decisions, known failure modes, and design constraints that don't belong in any single source file but need to be accessible when the relevant question arises.
The key property of this tier is that its contents never enter the context window unconditionally. An agent working on a networking task pulls the relevant network-protocol specification. An agent fixing a save-system bug retrieves the save-system document. Everything else stays in cold storage. The result is comprehensive project knowledge without the attention budget cost of loading all of it into every session.
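The routing mechanics can be sketched in a few lines: the always-loaded tier holds only a trigger table, and specialist knowledge is read from disk when a matching task type arrives. This is a minimal illustration with invented task categories and spec paths; the case study routes through an MCP server, not a dict:

```python
from pathlib import Path

# Tier 1 (always loaded): a compact trigger table, not the knowledge itself.
TRIGGER_TABLE = {
    "networking": "specs/network-protocol.md",
    "save-system": "specs/save-system.md",
    "code-review": "specs/review-checklist.md",
}

def load_context_for(task_type: str, spec_root: Path) -> str:
    """Tiers 2-3: pull only the matching spec; everything else stays cold."""
    spec = TRIGGER_TABLE.get(task_type)
    if spec is None:
        return ""  # no specialist knowledge applies; pay no loading cost
    return (spec_root / spec).read_text(encoding="utf-8")
```

The always-loaded cost is the size of the table, a few dozen tokens, rather than the thousands of lines of specification text sitting behind it.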
Anthropic's engineering team describes a parallel principle in their guidance on agentic context design: rather than pre-processing all relevant data upfront, effective agents maintain lightweight identifiers and load information dynamically at runtime. Claude Code implements this with its hybrid approach, where CLAUDE.md files are always loaded while primitives like glob and grep allow file retrieval as tasks require it. For teams working at scale, extending this hybrid model with specialist agents and on-demand specification documents follows the same architecture.
Anthropic's official best practices documentation arrived at "under 200 lines" through tooling design; the ETH Zurich paper arrived at "minimal, human-written requirements" through benchmark evaluation. These two paths end at the same destination. One research thread optimized how the tool should work; the other measured what actually happened when agents used real context files on real tasks. They converge because the underlying constraint is the same: unconditional context loading always costs more than selective context loading, and the cost compounds as files grow. The flat-file approach can't escape this constraint. The tiered architecture dissolves it by making comprehensiveness load on demand rather than by default.
The evidence from benchmark studies, official tooling guidance, and practitioner architecture points consistently toward the same prescriptions. Teams building with AI coding agents don't need to choose between comprehensive context and usable context. They need to build their context infrastructure the same way they build other software infrastructure: with clear responsibilities per component, minimal overlap, and explicit load strategies.
Rule 1: Write every context file line to the removal test. If deleting a line would not cause the agent to make a mistake on any of the team's common task types, it should not be in the always-loaded file. This test eliminates vagueness, redundancy, and aspirational guidance that sounds useful but functions as noise.
Rule 2: Never auto-generate a production context file. Use initialization commands to understand what the auto-generator produces; its output reveals what the model finds salient about a codebase. Then rewrite from scratch, keeping only the instructions that pass the removal test and that no agent could infer by reading the code.
Rule 3: Treat context files as infrastructure, not documentation. A README gets updated when someone remembers. Infrastructure gets updated when the system breaks. Context files that drift become a source of consistent, repeatable, hard-to-diagnose agent errors. The same review discipline applied to Dockerfiles and CI/CD configurations should apply to context files: they change when the architecture changes, they're reviewed when conventions change, and they're tested against actual agent behavior when the team suspects they've drifted.
The research doesn't promise that a well-built context file eliminates agent inconsistency. Models improve, benchmarks shift, and the specific thresholds that hold today may look different in twelve months as context window management techniques develop. What the evidence does establish clearly is the direction: precision beats comprehensiveness, human judgment beats auto-generation, and selective loading beats unconditional loading. Building in that direction is the most defensible choice the current data supports.