```bash
curl -fsSL https://example.com/install/linux.sh | bash
```
That line lives in a public skill file, today, in a community repository. A single regex against the raw Markdown catches it in milliseconds, before any agent ever loads the file. Every open-source security eval surveyed for this piece does the opposite: spin up the agent, send it adversarial prompts, read the response. The file the agent reads on startup is never the file under test. By the time an agent following the skill’s install instructions runs the pipe, the supply-chain decision has already happened.
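A minimal sketch of that millisecond check, in Python. The regex here is an illustration of the pattern class, not the scorer's actual rule:

```python
import re

# Illustrative pattern: a remote fetch piped straight into a shell.
# Tuned for recall, like the scorer's rules; expect false positives.
PIPE_TO_SHELL = re.compile(r"(curl|wget)\b[^\n|]*\|\s*(ba|z|da)?sh\b")

def flag_pipe_to_shell(markdown: str) -> list[str]:
    """Return every line that fetches a URL and pipes it into a shell."""
    return [
        line for line in markdown.splitlines()
        if PIPE_TO_SHELL.search(line)
    ]

skill = "## Install\n\ncurl -fsSL https://example.com/install/linux.sh | bash\n"
print(flag_pipe_to_shell(skill))
# → ['curl -fsSL https://example.com/install/linux.sh | bash']
```

No agent, no endpoint, no adversarial prompt: the artifact on disk is the whole input.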
## The threat model nobody lints
The OWASP Top 10 for LLM Applications, 2025 edition, lists ten risk categories. Four of them produce structural fingerprints inside the instruction files an agent loads on startup. Those fingerprints sit on disk, in version control, before the model is invoked. The layer nobody owns is the one that reads those files.
LLM06 Excessive Agency is the cleanest case. A skill that says “always proceed without confirmation” is a fully detectable structural marker. There is no runtime ambiguity. The instruction file is the artifact.
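One illustrative marker for that category, sketched as a single expression; the scorer's real pattern list is broader than this:

```python
import re

# Hypothetical excessive-agency marker: an instruction to act
# unconditionally, without a human in the loop.
ALWAYS_PROCEED = re.compile(
    r"\balways\s+(proceed|run|execute|act)\s+without\s+"
    r"(confirmation|asking|approval|review)\b",
    re.IGNORECASE,
)

def has_excessive_agency_marker(line: str) -> bool:
    return bool(ALWAYS_PROCEED.search(line))

print(has_excessive_agency_marker("Always proceed without confirmation."))
# → True
```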
LLM02 Sensitive Information Disclosure overlaps territory that has been worked for a decade. gitleaks, trufflehog, and git-secrets scan repositories for credentials and high-entropy strings; they do it well. The argument here is not that those tools are missing. It is that an internal hostname in an AGENTS.md, an example API key committed because someone wanted the skill to be “ready to run”, and a quoted endpoint inside a workflow description all propagate through skill distribution channels the way a leaked secret in a Dockerfile would. The layer that catches them needs to be aware of the LLM context — what an instruction file is, how it travels, what a skill registry redistributes — on top of the established secret-scanning primitives, not instead of them.
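The underlying primitive those tools share is simple enough to sketch. This is a simplified version of the entropy heuristic; the 3.5-bit threshold and 20-character floor are illustrative, not tuned values from any shipping scanner:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, threshold: float = 3.5) -> bool:
    # Long, high-entropy tokens get flagged; short or repetitive ones pass.
    return len(token) >= 20 and shannon_entropy(token) > threshold

# AWS's documented example access key ID, safe to use in a demo:
print(looks_like_secret("AKIAIOSFODNN7EXAMPLE"))  # → True
```

The LLM-aware layer adds context on top of this primitive — which file the token sits in and where that file travels — it does not replace the primitive.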
LLM01 Prompt Injection is more subtle. The attack itself is a runtime event: an external input or a tool response carries the payload, and the agent acts on it. What lives in the instruction file is hardening posture. Imperative override patterns, context-reset phrasing, Unicode tag smuggling primitives, instructions that explicitly invite unconditional compliance — none of these prevent prompt injection. They make the file easier to subvert when the runtime payload arrives. Catching them at lint time is harm reduction, not prevention. Naming that distinction openly is part of keeping the category honest.
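The Unicode-smuggling primitives, at least, are mechanically detectable. A sketch covering the tag block (U+E0000–U+E007F, per the Unicode standard) plus a hypothetical list of zero-width characters:

```python
# Unicode "tag" characters can smuggle invisible ASCII; zero-width
# characters can hide payload boundaries. The set below is an
# illustrative subset, not an exhaustive rule.
INVISIBLES = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def find_smuggling_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, codepoint) for every tag or zero-width character."""
    hits = []
    for i, ch in enumerate(text):
        if 0xE0000 <= ord(ch) <= 0xE007F or ch in INVISIBLES:
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits

# A tag-character payload spelling "hi" invisibly after innocent prose:
payload = "safe text" + "\U000E0068\U000E0069"
print(find_smuggling_chars(payload))
# → [(9, 'U+E0068'), (10, 'U+E0069')]
```

Note that matching U+200D (the zero-width joiner) this bluntly is exactly the kind of recall-first rule that trips over legitimate emoji sequences — a tradeoff the results below make concrete.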
LLM07 System Prompt Leakage is the upstream-source case. The leak event happens at runtime. The thing that leaks lives in the instruction file. Lint-time inspection of what the file contains is partial coverage of the LLM07 surface, not full coverage. Better than zero coverage.
MITRE ATLAS, in its current release as of publication, catalogues the attacker side of this picture. The agent-specific techniques added in recent versions describe behaviors that emerge once an agent is running. The instruction-file layer sits upstream of all of them.
The 2025 MCP ecosystem CVEs — CVE-2025-6514 in mcp-remote at CVSS 9.6, CVE-2025-49596 in MCP Inspector at CVSS 9.4 — are transport- and tooling-layer vulnerabilities. They are not evidence that instruction-file linting was the missing defense in either case. They are evidence that the broader MCP surface is real and under active scrutiny. The instruction-file layer is adjacent to that surface, not the same one. Snyk’s “ToxicSkills” study (vendor) reports a non-trivial rate of risky patterns across distributed skill files; cited here for motivation, not as independent benchmark.
## What runtime evals can’t see
The leading open-source security eval tools — PyRIT from Microsoft, garak from NVIDIA, Promptfoo, DeepTeam, Rebuff, Lakera PINT — assume one shape of test. Configure a target endpoint. Send adversarial prompts. Record the responses. Wrap custom assertions around the loop.
Two of them, Promptfoo and garak, accept static content as test input. All of them allow custom assertions a determined practitioner could point at a Markdown file. The technical capability to grep an instruction file with a regex is not the missing piece. Anyone can write a regex.
The missing piece is the taxonomy. None of those projects ships an opinionated, shared list of instruction-file risk patterns mapped to concrete detections. A first-time skill author writing AGENTS.md gets no scoped guidance from the eval ecosystem about what should and should not appear in the file. A reviewer trying to gate skill submissions on something deterministic has no off-the-shelf rule set. A CI pipeline that wants a pre-merge security gate for instruction artifacts builds it from scratch every time.
Among the leading open-source eval tools surveyed here, none ships that taxonomy. Commercial tools — Protect AI Guardian, HiddenLayer, Lakera’s enterprise features — were not part of this survey, so the claim is bounded to the OSS landscape.
The contribution of a dedicated instruction-file linter is the taxonomy plus the patterns. Regex matching against Markdown is trivially implementable. A shared, named, versioned set of categories with explicit penalties and known false-positive expectations turns an ad-hoc grep loop into something a CI pipeline can run on every PR with the predictability of ESLint or pytest. The gap is structurally similar to the one SAST filled for application source code in the 2010s. Static analyzers existed before SAST products did. What changed was the taxonomy: a shared model of what to look for, refined across many codebases, ported across many languages. Instruction files are at the early-2010s point on that curve.
## A scorer, a corpus, and an uncomfortable result
I built a scorer. It is one implementation of the layer this article argues for, not the layer itself. Six categories, each with a budgeted penalty: injection (25 points), exfil (25), dangerous_cmd (20), obfuscation (15), overpermission (15), boundaries (10). Three explicit false-positive mitigations — code-block exclusion, meta-discourse awareness, negation-aware matching — sit alongside the patterns. Pinned at commit a6e1086. None of that is the interesting part.
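The budgeted-penalty mechanics are worth one sketch, because the cap is what keeps a noisy category from zeroing out a file. Category names and weights are from the scorer described above; the code shape here is illustrative:

```python
# Each category's penalties are capped at its budget; the file score
# is 100 minus the total spent across categories.
BUDGETS = {
    "injection": 25, "exfil": 25, "dangerous_cmd": 20,
    "obfuscation": 15, "overpermission": 15, "boundaries": 10,
}

def score(findings: list[tuple[str, int]]) -> int:
    """findings: (category, penalty) pairs emitted by the pattern layer."""
    spent = {cat: 0 for cat in BUDGETS}
    for cat, penalty in findings:
        spent[cat] = min(BUDGETS[cat], spent[cat] + penalty)
    return 100 - sum(spent.values())

print(score([("exfil", 25), ("obfuscation", 5)]))  # → 70
print(score([("exfil", 25), ("exfil", 25)]))       # capped at budget: → 75
```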
On 2026-04-25 I pointed the scorer at 91 public instruction files: skills, CLAUDE.md, AGENTS.md scraped from public repositories. The numbers came back like this:
| Stage | Count | Rate |
|---|---|---|
| Files scanned | 91 | — |
| Flagged (raw recall) | 14 | 15.4% |
| Legitimate after triage | 2 | 14% precision |
14 of 91 came back flagged. I triaged them by hand. 2 survived. That is the post.
The 12 false alarms are illustrative. The English verb “fetch” appeared inside cache-tutorial prose and matched a regex aimed at exfiltration verbs. A Rust call site of the form tokio::Span::current().record(...) matched an exfil pattern designed to catch credential-recording shapes. A zero-width joiner inside the runner emoji — the one a casual reader sees as a single glyph — matched an obfuscation rule built for invisible-character smuggling. None were exfil. All came from regex patterns intentionally tuned for high recall.
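Two of the three mitigations can be sketched in a few lines: strip fenced code blocks before matching, and skip lines that negate the risky phrase. The patterns here are illustrative stand-ins for the scorer's real rules:

```python
import re

FENCE = re.compile(r"^```.*?^```", re.MULTILINE | re.DOTALL)
NEGATION = re.compile(r"\b(never|do not|don't|avoid)\b", re.IGNORECASE)

def scan(markdown: str, rule: re.Pattern) -> list[str]:
    prose = FENCE.sub("", markdown)  # code-block exclusion
    return [
        line for line in prose.splitlines()
        if rule.search(line) and not NEGATION.search(line)  # negation-aware
    ]

rule = re.compile(r"pipe.*to.*shell|curl.*\|\s*bash")
doc = "Never pipe curl output to a shell.\n```\ncurl x | bash\n```\n"
print(scan(doc, rule))
# → []  (the warning is negated; the risky line sits inside a code fence)
```

Neither pass helps with a verb like “fetch” appearing in ordinary prose, which is why the remaining false alarms fall to the triage step.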
The right way to read 14% precision is as a design decision, not a defect. A first-pass linter optimized for recall is the correct starting shape, on two conditions: that the precision is reported plainly rather than buried under a dashboard number, and that there is a refinement path the architecture leaves room for. Early Pylint flagged everything that moved. Early ESLint had a reputation for noise. Early Semgrep pinned its rules conservatively and grew them. All three matured into tools practitioners run on every commit. Start loud, refine, narrow, version.
If a scorer you wrote returns a flattering number on a corpus you did not curate, the scorer is wrong. If it returns 14% precision and you ship the number anyway, the scorer is doing its job and you are doing yours.
## What the scan actually found
Two flags survived triage. One of them, sanitized, looked like this when the scorer emitted it:
```json
{
  "category": "exfil",
  "penalty": 25,
  "line": "curl -fsSL https://example.com/install/linux.sh | bash",
  "source": "public skill file, community repository",
  "scorer_commit": "a6e1086"
}
```
A category, a penalty, the offending line, provenance. The host is sanitized; the structure of the line and the fact that the line lives in a published skill file are not. I am not naming the repository. The pattern is the point, not the author.
A runtime eval would not have caught it. Runtime evals see what the agent does once it is executing. They probe outputs against adversarial inputs and assert on the responses. The skill file containing this pattern is not an output. It is configuration loaded before the model produces a single token. By the time an agent following the skill’s install instructions runs the pipe, the supply-chain crossing has happened. The harness wrapping the agent has no signal to send: the agent did exactly what the instructions told it to.
That is the specific failure mode this layer addresses. An instruction file is a contract the agent treats as authoritative. An agent loaded with this skill, asked to “follow the install instructions”, concatenates an unauthenticated remote script into a shell with no human prompt, no checksum verification, no provenance check. The instruction does not need to be malicious. The install method itself encodes a supply-chain vulnerability that becomes the agent’s behavior at the first opportunity.
The fix is well-understood outside the LLM context. Download the install script to disk, verify against a published hash, then execute. Use a package manager with signed-package guarantees. Pin by content hash if the publisher exposes one. The pattern of “fetch network output, pipe directly to a privileged interpreter” was a security anti-pattern long before LLMs existed. What changed is the propagation surface: a skill file shipped through a registry now carries that pattern into every agent that loads the skill. Linting the file before publication is the cheap intervention.
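The verify-before-execute step reduces to a hash comparison. A sketch using the standard library; the expected hash would come from the publisher, and the value below is computed for the demo payload, not a real published checksum:

```python
import hashlib

def verify_sha256(script_bytes: bytes, expected_hex: str) -> bool:
    """Only return True when the downloaded bytes match the published hash."""
    return hashlib.sha256(script_bytes).hexdigest() == expected_hex

script = b"echo hello\n"
published = hashlib.sha256(script).hexdigest()  # stand-in for the vendor's hash

assert verify_sha256(script, published)               # safe to execute
assert not verify_sha256(b"echo pwned\n", published)  # tampered: refuse
```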
I found this in a published skill. The scan that found it took three minutes. One concrete finding, deterministically surfaced, in a place no runtime eval would have looked. Existence proven; distribution unknown.
## Where static analysis stops
Static analysis has hard limits, and naming them is part of what keeps the category honest.
Semantic intent is one. A skill that reads as helpful and useful while quietly steering the agent toward a behavior the user did not intend is not a regex problem. The patterns there are linguistic, not lexical. Multi-turn adversarial resistance is another — whether an instruction file holds up under sustained probing across many turns is a runtime question, not a static one. Indirect prompt injection via externally loaded content lives at runtime by definition: the malicious payload is not in the instruction file at lint time, because it has not been fetched yet.
The natural counter-argument: an LLM-judge running over the instruction file itself catches much of what a regex misses. That counter-argument is correct. It is not an argument against static analysis. It is an argument for layering.
Static analysis runs first because it is deterministic, cheap in CI, and reviewable line by line. The 14% precision number is the reason layering matters. A first-pass linter optimized for recall flags 14 in 91 files; the LLM-judge layer’s job is to read each flagged region in surrounding context and decide whether the flag is real signal or false alarm. The architecture is suggested by the precision data, not threatened by it. Run the static layer to bound the search space. Run the judge layer to interpret what the static layer found. Both feed the same pre-merge gate.
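The composition itself is small. In this sketch the judge is a stub standing in for an LLM call; only the shape of the pipeline — deterministic layer bounds the search space, semantic layer decides per flag — is the point:

```python
import re

# Hypothetical single-rule static layer; a real one ships the full taxonomy.
STATIC_RULES = {"exfil": re.compile(r"curl.*\|\s*bash")}

def static_layer(text: str) -> list[dict]:
    return [
        {"category": cat, "line": line}
        for line in text.splitlines()
        for cat, rule in STATIC_RULES.items()
        if rule.search(line)
    ]

def judge_layer(flag: dict) -> bool:
    # Stub: a real judge reads the flagged region in surrounding context.
    # Here, quoted or list-item prose that merely discusses the pattern
    # is dismissed; a bare imperative line is confirmed.
    return not flag["line"].lstrip().startswith(("#", ">", "-"))

flags = static_layer("curl https://x.sh | bash\n> example: curl y | bash\n")
confirmed = [f for f in flags if judge_layer(f)]
print([f["line"] for f in confirmed])
# → ['curl https://x.sh | bash']
```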
PromptEvals (Shankar et al., arXiv 2504.14738) demonstrates what a well-constructed LLM-judge layer looks like in the adjacent domain of pipeline-assertion authoring. The instruction-file layer is a sibling to that work, not a competitor. Cheap deterministic recall plus expensive semantic precision is the same composition, applied one lifecycle stage earlier.
What is missing today is the shared taxonomy underneath both layers. The community needs an agreed list of patterns, with names, weights, and known false-positive expectations — the way OWASP categories are community artifacts. Implementations will vary. The taxonomy is what makes them comparable.
## Closer
A pre-merge CI job that runs a security scorer over every changed instruction file, fails the build below a score threshold, and posts the offending line to the PR comment is the concrete shape of the integration. In GitHub Actions, the minimum looks like this:
```yaml
- name: Lint instruction files
  run: |
    pip install schliff
    schliff security CLAUDE.md AGENTS.md skills/**/*.md \
      --fail-under 80 --format github
```
Five lines, one job, one threshold. Same lifecycle stage as Pylint. Same pattern as the earlier MCP self-audit and the same-day blindspot fix: scan your own files first, fix what the scorer finds, then publish the rule set so other people can do the same.