The Merge and the Score
On April 17th, Cliff Hall — a maintainer in the Model Context Protocol organization — merged PR #3733 into modelcontextprotocol/servers. The review was “LGTM! 👍”, nineteen checks resolved cleanly, commit 756db5b landed on main. The PR added a root CLAUDE.md for the monorepo and fixed a typo (stremableHttp → streamableHttp) in src/everything/AGENTS.md. One hundred and five lines added, one deleted, twenty-one days open.
A few minutes later I ran the merged file through schliff, my own deterministic scorer for AI instruction files. It returned 59.2/100, grade D on the measured dimensions.
That number is the post. Not the merge — the gap between a file that ships and a file that scores well.
What Shipped
The repo holds seven reference MCP servers — four TypeScript, three Python — under src/. It had no root-level CLAUDE.md until this PR, despite already running a Claude Code GitHub Action on incoming PRs — the agent was operating without monorepo-wide context. Per-server files existed, but nothing described the shape above them. The new file covers project overview, a monorepo map, build commands for both stacks, test procedures, and release notes on server boundaries and publishing targets.
The scope was deliberately small. No behavioral rules, no invariants, no “never do X without discussion.” It was a structural document — where things live, how they build, what each server ships.
What Schliff Measured
============================================================
Schliff Structural Score: 59.2/100
[4/8 dimensions measured, 40% weight coverage]
============================================================
△ structure 60
— triggers n/a
— quality n/a
— edges n/a
△ efficiency 57
✗ composability 40
— runtime n/a
✓ clarity 100
============================================================
Four dimensions scored. Four reported as not applicable. That is schliff being explicit about confidence: at 40% weight coverage, the headline number is partial by design. The tool does not round up to make a file look stronger than the measurement supports.
Where the Score Is Right
Composability: 40. The file has no scope boundaries. It never tells Claude what it should not do across servers — no “never cross-edit without discussion,” no “defer to per-server CLAUDE.md for behavioral rules,” no handoff signal. For a seven-server monorepo where a careless refactor can touch unrelated packages, this is a real gap. Schliff flagged six specific issues here: no_scope_boundaries, no_io_contract, no_handoff_points, no_error_behavior, no_dependency_declarations, no_version_compat. Every one of them is something a future reader or future agent would benefit from.
Efficiency: 57. The text is dense — 518 words in 104 non-blank lines, zero hedges, zero filler — but the scorer also measures actionable lines and “why” signals, and counted zero of both. That is not quite right. The file contains at least two clear imperatives: Package manager: uv (not pip) and Not accepted: new server implementations. Schliff’s pattern for actionable lines did not match them. This is a blind spot worth naming rather than waving away — either the file needs more explicit do/do-not formatting, or the scorer needs a broader pattern. Probably both. A mixed file that gave Claude a few visible action rules and “why this way” notes would score better and genuinely help.
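The blind spot is easy to reproduce with a toy matcher. Neither regex below is schliff's actual pattern; the sketch only shows how an imperative-verb rule misses directives written as key: value declarations, while a broader rule catches both styles:

```python
import re

# Toy reproduction of the blind spot; neither regex is schliff's real pattern.
IMPERATIVE = re.compile(r"^(Always|Never|Do not|Don't|Use|Run|Avoid)\b", re.I)
DIRECTIVE = re.compile(r"^[A-Z][\w ]+:\s+\S")  # e.g. "Package manager: uv"

lines = [
    "Package manager: uv (not pip)",
    "Not accepted: new server implementations",
    "Never cross-edit server packages without discussion.",
]

imperative_hits = [ln for ln in lines if IMPERATIVE.match(ln)]
directive_hits = [ln for ln in lines if DIRECTIVE.match(ln)]
print(len(imperative_hits), len(directive_hits))  # 1 2: verb rule misses both directives
```

A scorer combining both patterns would have credited the file's two directives; a file restating them as imperatives would satisfy the narrow pattern. Hence "probably both."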
Structure: 60. No YAML frontmatter. Project CLAUDE.md files have not historically carried frontmatter, but the penalty has a reason anyway: frontmatter enables tooling — versioning, categorization, machine-readable metadata. Two lines (version: 1 and scope: monorepo-root) would cost thirty characters and make the file discoverable by any tool that walks the tree.
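The tooling argument, sketched: with even two frontmatter keys, any script that walks the tree can index CLAUDE.md files by declared scope. The parser below assumes a minimal three-dash-delimited block, not an established schema:

```python
from pathlib import Path

# Sketch of the tooling argument. A minimal "---"-delimited block is assumed;
# the key names come from the two lines suggested above, not a real standard.

def frontmatter(text: str) -> dict[str, str]:
    """Parse a minimal '---'-delimited frontmatter block, if present."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return {}  # unterminated block: treat as no frontmatter

def index_claude_files(root: Path) -> dict[str, str]:
    """Map each CLAUDE.md under root to its declared scope."""
    return {
        str(p): frontmatter(p.read_text()).get("scope", "unknown")
        for p in root.rglob("CLAUDE.md")
    }

sample = "---\nversion: 1\nscope: monorepo-root\n---\n# Project overview\n"
print(frontmatter(sample))  # {'version': '1', 'scope': 'monorepo-root'}
```

For a repo with per-server CLAUDE.md files plus a new root one, an index like this is the cheapest way to see which scopes exist and which are missing.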
Where the Score Is Structurally Unfair
The four n/a dimensions are the interesting ones.
Triggers asks when a skill should activate. A project CLAUDE.md has no triggers by design — it is always in scope when the agent is in the repo. There is nothing to score.
Quality and Edges require a companion eval suite — eval-suite.json, test cases, edge cases. That is a skill pattern, not a project-root pattern. Nobody ships a regression suite for a repo overview document.
Runtime measures whether the instruction enables automated scoring in CI. Enforcing that on a project-root document is premature.
Schliff handles these honestly: they are marked n/a, not scored as zero, and the weight coverage drops to 40%. The composite number is not pretending the missing dimensions were failures.
Same Tool, Different Targets
Schliff’s own SKILL.md scores 98.9/100 under the same scorer. That file is a skill — it declares triggers, it has an eval suite, it specifies handoff and edge behavior. The MCP CLAUDE.md is an artifact of a different kind. Same tool, two target shapes, two honest scores.
This is why the confidence display matters. A 98.9 at 95% coverage means the tool measured almost everything and found it strong. A 59.2 at 40% coverage means the tool measured a subset and reports the subset. Reading the number without the coverage is the mistake.
What the Gap to 80 Would Cost
Three additions would do most of the work:
A scope-boundaries section. Four to six lines. “Do not cross-edit server packages without discussion. Behavioral changes defer to per-server CLAUDE.md. Release timing is owned by maintainers, not agents.” That moves no_scope_boundaries, no_handoff_points, and no_error_behavior from red to green and lifts composability from 40 into the 70s.
Dependency declarations. Two lines. Requires: Node 20+, Python 3.10+. Build artifacts: npm, PyPI. This addresses no_dependency_declarations and no_version_compat.
Minimal frontmatter. Four lines. Version, scope, maintainers, last-reviewed date.
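Put together, the thirty-minute patch might look something like this; every key name, date, and rule below is illustrative, not the actual diff:

```markdown
---
version: 1
scope: monorepo-root
maintainers: modelcontextprotocol/servers maintainers
last-reviewed: YYYY-MM-DD
---

## Scope boundaries

- Do not cross-edit server packages without discussion.
- Behavioral changes defer to the per-server CLAUDE.md.
- Release timing is owned by maintainers, not agents.

## Dependencies

- Requires: Node 20+, Python 3.10+
- Build artifacts: npm, PyPI
```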
Combined, these take a thirty-minute editing pass and push the composite score into the low 80s — still at 40% weight coverage, because the measurable gaps are not the only gaps. More importantly, they make the file better for the Claude Code Action that reads it on every incoming PR.
Why I Published This
The first instinct after a merge in a high-visibility repo is to promote the win. The second — if you use your own tools — is to audit it. A 95/100 would have meant a broken test corpus. Fifty-nine-point-two is useful because it is not flattering.
If a scorer you wrote returns a kind number on work you shipped, you built the wrong scorer.
Score yours: pip install schliff and schliff score path/to/your/CLAUDE.md. The repo is here. Related reading: why skills are the new dotfiles and Hydra, the cross-model code review council.