
Citation faithfulness

Does the wiki say what its sources say? Two AI judges grade every citation; the headline is how often they agree.

The question

When the wiki cites a source, the source backs the claim. And cheap Haiku agrees with pricier Sonnet often enough that routine checks don't need Sonnet at all.

Judge agreement 89% · 103 claims

Faithfulness · does the source back the claim?

84% supported · 12% partial · 4% unsupported (206 verdicts: 103 claims × 2 judges)

Haiku: 85% supported · 103 claims

Sonnet: 83% supported · 103 claims

Legend: supported · partial · unsupported. Paraphrase counts; inference past the source doesn't.

Agreement · do the two judges return the same verdict?

89% · 92 agree · 11 disagree (of the 103 claims both judges scored)

Each judge pulls its own citing sentences, so only the 103 both happened to extract can be compared head-to-head.
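The head-to-head comparison can be sketched as an intersection over claim keys. This is a minimal illustration, not the pipeline's actual code: the `agreement` function, the dict-of-verdicts shape, and the toy claim keys are all assumptions made for the example.

```python
from typing import Dict, Optional

Verdict = str  # "supported" | "partial" | "unsupported"


def agreement(haiku: Dict[str, Verdict], sonnet: Dict[str, Verdict]) -> dict:
    """Compare verdicts only on claims that both judges extracted."""
    shared = sorted(haiku.keys() & sonnet.keys())
    agree = sum(1 for claim in shared if haiku[claim] == sonnet[claim])
    total = len(shared)
    rate: Optional[float] = agree / total if total else None
    return {
        "compared": total,
        "agree": agree,
        "disagree": total - agree,
        "rate": rate,
    }


# Toy data (hypothetical): "c" was extracted by Haiku only, "d" by Sonnet only,
# so neither can be compared head-to-head.
haiku = {"a": "supported", "b": "supported", "c": "partial"}
sonnet = {"a": "supported", "b": "partial", "d": "unsupported"}
result = agreement(haiku, sonnet)
print(result)
```

With the page's own counts, 92 agreeing claims out of 103 compared gives 92 / 103 ≈ 0.893, which rounds to the 89% headline.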

9 of 11 splits land on the partial / supported line: both judges read the source as on-topic but split on how much it backs the claim. Only 1 is a hard supported↔unsupported flip, and the remaining 1 is a partial / unsupported split. Each judge's reasoning is shown side by side below.

Your Agent Loves MCP argues that loading MCP tool definitions into context each session is expensive in tokens and non-composable, analogous to forcing a programmer to read a GUI manual before every task.

H unsupported → S supported

Haiku unsupported

The source summary states that MCP tool definitions loaded into context are 'token-expensive and non-composable,' but it does not contain the GUI analogy or the comparison to 'forcing a programmer to read a GUI manual before every task.' The claim's specific metaphorical framing is not supported by the provided source text.

Sonnet supported

The source summary explicitly states MCP is 'token-expensive, and non-composable' and describes 'MCP tool definitions loaded into context each session,' matching the claim. The GUI analogy is also present in the source title and summary.


Zod-based response validation addresses the external API surface: when a backend response deviates from the expected shape, runtime errors surface in production rather than at dev time.

H supported → S partial

Haiku supported

The source confirms that Zod validation catches unexpected backend response shapes and prevents runtime errors. The claim's core assertion—that Zod-based response validation addresses API surface issues by surfacing errors when responses deviate from expected shape—is directly supported by the source summary, which describes using Zod to catch unexpected backend response shapes.

Sonnet partial

The source says Zod validation catches unexpected backend response shapes 'at dev time before they cause runtime errors,' which is the opposite of the claim's assertion that 'runtime errors surface in production rather than at dev time.' The source partially supports the topic (Zod-based response validation addressing unexpected backend shapes) but contradicts the specific timing assertion.


Vectorize's Hindsight system uses biomimetic data structures and multi-strategy retrieval so agents accumulate and query memory across sessions, moving closer to the kind of persistent context that one-shot context windows cannot provide.

H supported → S partial

Haiku supported

The source explicitly describes Hindsight as using 'biomimetic data structures and multi-strategy retrieval' and states it lets 'AI agents learn and build mental models over time,' which directly supports the claim about accumulating and querying memory across sessions to provide persistent context beyond what context windows alone offer.

Sonnet partial

The source confirms biomimetic data structures and multi-strategy retrieval, and that agents 'learn and build mental models over time.' However, the source does not explicitly mention 'accumulating and querying memory across sessions' or contrast the system with 'one-shot context windows,' so those specific aspects of the claim go beyond what the source states.


Shell-level gains are also available: Christian Hofstede-Kuhn documents Readline bindings, history search, and script safety flags that pay dividends without any AI dependency.

H supported → S partial

Haiku supported

The source summary explicitly mentions 'Readline key bindings, history search' and 'script safety flags,' which directly matches the claim's reference to 'Readline bindings, history search, and script safety flags.' The source demonstrates these are documented by the material without AI dependency as stated in the claim.

Sonnet partial

The source covers Readline key bindings, history search, and script safety flags, matching three items in the claim. However, the source does not mention the author name 'Christian Hofstede-Kuhn' anywhere in the supplied text, so that specific attribution cannot be verified from the source alone.


Tuhin Nair adds that senior engineers lose influence when they speak in terms of complexity management rather than uncertainty reduction, a communication gap that slows organizational decision-making regardless of individual output.

H supported → S partial

Haiku supported

The source explicitly states that senior developers lose influence because they speak in terms of complexity management while the rest of the business is motivated by uncertainty reduction. The claim accurately paraphrases this core assertion and adds that this communication gap slows organizational decision-making, which is a reasonable inference from the source's framing of the problem.

Sonnet partial

The source supports the core claim that senior developers lose influence by speaking in terms of complexity management rather than uncertainty reduction. However, the source does not mention a communication gap that 'slows organizational decision-making regardless of individual output.' That specific assertion goes beyond what the source states.


Using Zod with a custom RxJS operator in Angular catches unexpected backend response shapes at dev time rather than at runtime, moving failures left without requiring changes to backend contracts.

H supported → S partial

Haiku supported

The source summary directly supports the claim. It states that Zod schema validation with a custom RxJS operator catches unexpected backend response shapes 'at dev time before they cause runtime errors,' which aligns with the claim's assertion about catching failures at dev time rather than runtime without requiring backend contract changes.

Sonnet partial

The source supports catching unexpected backend response shapes at dev time before runtime errors, and mentions Zod with a custom RxJS operator in Angular. However, the source does not mention 'moving failures left without requiring changes to backend contracts,' so that specific framing goes beyond what the source states.


The SAP npm supply chain attack, however, is a direct counterpoint: the same Claude Code and VS Code configs exploited as persistence vectors show that tools with deep system access are attractive targets, and supply chain hygiene is part of the tooling surface.

H partial → S supported

Haiku partial

The source confirms that the attack exploited Claude Code and VS Code configs as persistence vectors and involved npm packages (supply chain). However, the source does not explicitly state that tools with deep system access are 'attractive targets' or discuss supply chain hygiene as part of the tooling surface—those are interpretive inferences from the attack details rather than direct claims in the source.

Sonnet supported

The source explicitly states the attack 'abuses Claude Code and VS Code configs as persistence vectors,' directly supporting the claim that these tools were exploited as persistence vectors. The source also confirms it is a supply chain attack involving credential stealing, supporting the broader point about supply chain hygiene and tools with deep system access being targets.


Crafting Interpreters represents craft at its most foundational.

H partial → S unsupported

Haiku partial

The source describes Crafting Interpreters as a book with interpreter implementations and a build system, but does not explicitly characterize it as representing 'craft at its most foundational' or make claims about its foundational nature or meaning of craft.

Sonnet unsupported

The source describes the repository as containing the book text and interpreter implementations, but makes no claim about the work representing craft at its most foundational level. That evaluative assertion is not present in the source.


Optimal Workshop makes enterprise compliance an explicit selling point against UserTesting, citing SOC 2, GDPR, and data residency options as reasons a procurement team would choose it.

H partial → S supported

Haiku partial

The source confirms that Optimal Workshop's comparison page mentions enterprise compliance as a differentiator against UserTesting, but does not specify the particular compliance features cited (SOC 2, GDPR, data residency) or explicitly state that procurement teams would choose Optimal for these reasons. The claim goes beyond what the summary states.

Sonnet supported

The summary states Optimal Workshop argues its platform beats UserTesting by offering 'enterprise compliance' as a differentiator. However, the source does not explicitly mention SOC 2, GDPR, or data residency options; it only references 'enterprise compliance' broadly. The claim adds specific details not present in the source.


The Currents.dev piece on Playwright selector strategy addresses the root cause most within a team's direct control (Designing Playwright Tests That Survive UI Refactors).

H supported → S partial

Haiku supported

The source summary explicitly states that Playwright tests break due to coupling to CSS classes, DOM structure, and text content rather than semantic roles and attributes, and prescribes selector hierarchy and page-object patterns as solutions. This directly supports the claim that the piece addresses root causes within a team's control (selector strategy choices).

Sonnet partial

The source addresses selector strategy and its role in test breakage during UI refactors, but it does not mention Currents.dev, nor does it explicitly frame selector strategy as 'the root cause most within a team's direct control.' The source supports the general topic but not that specific framing.


Adrian Bece's breakdown covers the math in detail: the preferred parameter is derived from a linear equation relating viewport width to font size, and using rem units for the min and max values (rather than px) preserves browser-level text zoom, which is critical for accessibility.

H partial → S supported

Haiku partial

The source summary confirms that the reading covers the math behind preferred-value parameters and rem-based accessibility fixes, supporting those parts of the claim. However, the source does not explicitly state that the preferred parameter is 'derived from a linear equation relating viewport width to font size,' nor does it explicitly connect rem units to preserving browser-level text zoom. The summary mentions these topics but at a higher level of abstraction than the specific technical claim made.

Sonnet supported

The source summary explicitly states it covers 'the math behind preferred-value parameters' and 'rem-based accessibility fixes,' which directly supports both parts of the claim: the linear equation math and the rem units preserving browser text zoom for accessibility.

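The breakdown of the 11 splits can be checked by tallying the verdict pairs listed in this section. The sketch below copies those pairs in order. Treating a boundary as an unordered pair of verdicts is a choice made for this example, not something the page specifies.

```python
from collections import Counter

# (Haiku verdict, Sonnet verdict) for the 11 splits listed above, in page order.
splits = [
    ("unsupported", "supported"),
    ("supported", "partial"),
    ("supported", "partial"),
    ("supported", "partial"),
    ("supported", "partial"),
    ("supported", "partial"),
    ("partial", "supported"),
    ("partial", "unsupported"),
    ("partial", "supported"),
    ("supported", "partial"),
    ("partial", "supported"),
]

# Ignore direction: which judge was stricter does not change the boundary.
by_boundary = Counter(frozenset(pair) for pair in splits)

partial_supported = by_boundary[frozenset({"partial", "supported"})]
hard_flips = by_boundary[frozenset({"supported", "unsupported"})]
partial_unsupported = by_boundary[frozenset({"partial", "unsupported"})]
print(partial_supported, hard_flips, partial_unsupported)  # prints: 9 1 1
```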

The cost bet · what the agreement buys

Haiku: $0.14

Sonnet: $0.40

Haiku grades the same claims for $0.14 against Sonnet's $0.40 — about 2.8× cheaper. At 89% agreement, routine passes can run Haiku alone and route only the 11 splits to Sonnet and a human.
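The arithmetic behind the bet can be worked through from the displayed figures. This is a rough sketch: the dollar amounts are the rounded values shown above, so the ratio computed here may differ slightly from the quoted one. The escalation estimate also assumes Sonnet's cost scales linearly with the number of claims it grades, which the page does not state.

```python
haiku_cost = 0.14   # dollars per pass over the shared claims, as displayed
sonnet_cost = 0.40  # dollars per pass over the same claims, as displayed

claims = 103        # claims both judges scored
splits = 11         # claims where the verdicts differed

ratio = sonnet_cost / haiku_cost
saved_per_pass = sonnet_cost - haiku_cost

# Assumption: per-claim Sonnet cost is sonnet_cost / claims (linear scaling).
escalated = haiku_cost + sonnet_cost * splits / claims

print(f"Sonnet / Haiku cost ratio: {ratio:.2f}")
print(f"Saved by running Haiku alone: ${saved_per_pass:.2f}")
print(f"Haiku pass plus Sonnet on {splits} splits: ${escalated:.2f}")
```

Under those assumptions, a Haiku pass with Sonnet reserved for the splits stays well under half the cost of a full Sonnet pass.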

Per-article · 21 wiki concepts · disagreements first

developer-productivity

2 split

2 split

1 split

1 split

1 split

1 split

1 split

1 split

1 split

ai-agents

agree

developer-tools

agree

ai-assisted-coding

agree

continuous-integration

agree

agree

agree

agree

agree

ai-safety

agree

automation

agree

font-pairing

agree

kubernetes

agree

An AI builds the wiki. It reads each saved article, clusters the articles by topic, and writes one synthesis paragraph per cluster, citing the sources it drew from. The question that matters here is whether those citations hold: when the paragraph cites an article, does the article actually say what the paragraph claims?

So for every sentence in the wiki that links to a saved article, two AI judges read the sentence and the source and return one of three verdicts:

supported — the article backs the claim. Paraphrase counts; inference past what the source says doesn’t.
partial — the article is on the topic but backs only a weaker version, or doesn’t quite get there.
unsupported — the article doesn’t back the claim, contradicts it, or is about something else.
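The three-verdict rubric maps naturally onto a small data model. The sketch below is illustrative only: `Verdict`, `Judgment`, and `distribution` are hypothetical names, and the sample judgments are made up rather than taken from a real run.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    SUPPORTED = "supported"      # the article backs the claim; paraphrase counts
    PARTIAL = "partial"          # on topic, but backs only a weaker version
    UNSUPPORTED = "unsupported"  # no backing, contradiction, or off topic


@dataclass(frozen=True)
class Judgment:
    judge: str       # e.g. "haiku" or "sonnet"
    claim: str       # the citing sentence from the wiki
    verdict: Verdict
    reasoning: str   # the judge's explanation for the verdict


def distribution(judgments: List[Judgment]) -> Dict[Verdict, float]:
    """Share of each verdict across a list of judgments, as fractions."""
    total = len(judgments)
    if total == 0:
        return {}
    counts = Counter(j.verdict for j in judgments)
    return {v: counts[v] / total for v in Verdict}


# Made-up sample: four judgments over two claims.
sample = [
    Judgment("haiku", "claim 1", Verdict.SUPPORTED, "matches the summary"),
    Judgment("sonnet", "claim 1", Verdict.SUPPORTED, "matches the summary"),
    Judgment("haiku", "claim 2", Verdict.PARTIAL, "weaker version only"),
    Judgment("sonnet", "claim 2", Verdict.UNSUPPORTED, "not in the source"),
]
shares = distribution(sample)
print({v.value: share for v, share in shares.items()})
```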

The two judges aren’t the same size. Haiku is small and cheap; Sonnet is bigger and costs about three times as much. Grading the same claims with both turns the cost-quality tradeoff into a number: if they mostly agree, later passes can run Haiku alone and escalate only the disagreements to Sonnet and a human.

When they disagree, I read the source myself. Sometimes the citation is fine and a judge got it wrong. Sometimes the wiki overreached and the citation needs retagging or removal. And sometimes the claim sits in a gray zone the rubric hasn’t pinned down yet — those cases are worth the most, because they’re where the rubric grows.

All of this assumes the wiki cited the right article to begin with. That assumption gets its own check in Topic stability, which asks whether the topic tags the clustering relies on hold still over time. What each run costs lands in Ingest pipeline cost.

The judge prompt is versioned. Bump the version and every prior score is thrown out on the next pass, so verdicts written against two different rubrics never get averaged together.
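One way to keep verdicts from different rubrics apart is to stamp each score with the prompt version it was written against and filter on read. This is a sketch of that idea under assumed names (`PROMPT_VERSION`, `Score`, `current_scores`), not the actual storage scheme.

```python
from dataclasses import dataclass
from typing import List

PROMPT_VERSION = 3  # hypothetical current version of the judge prompt


@dataclass(frozen=True)
class Score:
    claim_id: str
    verdict: str
    prompt_version: int  # version of the judge prompt that produced this score


def current_scores(scores: List[Score], version: int = PROMPT_VERSION) -> List[Score]:
    """Drop every score written against an older (or different) prompt version."""
    return [s for s in scores if s.prompt_version == version]


stored = [
    Score("c1", "supported", 2),  # written against the previous rubric
    Score("c1", "partial", 3),
    Score("c2", "supported", 3),
]
kept = current_scores(stored)
print([(s.claim_id, s.verdict) for s in kept])
```

Filtering by exact version, rather than "version or newer", means a rolled-back prompt also invalidates scores from the abandoned version.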
