DETERMINISTIC OPTIMISM 🌞
nvk@primal.net
npub1az9x...m8y8
-... .. - -.-. --- .. -. strong opinions, loosely held.
NVK 5 days ago
# Writing 84 Tests for a Project With Zero Lines of Code

The llm-wiki project has 3,610 lines across 22 files. Every single one is a markdown file. There is no Python. No JavaScript. No compiled binary. The "source code" is English prose: instructions that Claude reads and follows to build knowledge wikis from web research.

So how do you write tests for a program that is, technically, a document?

I figured it out. 84 structural assertions, 11 intentionally broken wiki fixtures, 5 behavioral evals via Promptfoo, and a GitHub Actions pipeline. As far as I can tell, nobody has used Promptfoo to test a Claude Code plugin before. Here is what I learned.

## The three-layer problem

Traditional testing has a simple contract: given input X, the function returns Y. If it doesn't, the test fails. But when your "function" is an LLM reading markdown instructions, the contract dissolves. The same instruction file, given the same user request, might produce different article titles, different file structures, different cross-references. The output is correct within a range, not at a point.

Anthropic, OpenAI, and GitLab all converge on the same solution: split tests into layers by how much uncertainty you're willing to tolerate.

**Layer 1 is deterministic and free.** No LLM calls. You're checking that the wiki's file system is internally consistent. Does every directory have an `_index.md`? Does every raw source have the six required frontmatter fields? Does the `type: articles` file actually live in `raw/articles/` and not `raw/papers/`? These checks take seconds and cost nothing. I have 84 of them. They run on every push.

**Layer 2 is semantic and costs money.** You ask Claude to do something (ingest a URL, compile an article, route a command) and then grade whether it followed the instructions. Promptfoo handles this with three assertion types: trajectory assertions ("did it call WebSearch?"), llm-rubric assertions ("does the output have complete frontmatter?"
graded by a judge LLM), and custom JavaScript that checks the file system after the agent runs. Each eval costs about $0.50. I run five of them on PRs.

**Layer 3 is full workflows.** Research-to-article. Ingest-compile-lint. Retract-and-verify-cleanup. These use `claude -p` in headless mode, cost $10-20 per run, and execute weekly. I haven't built these yet. Layers 1 and 2 are live.

## The golden wiki

Every structural test needs something to test against. I built a golden wiki: a minimal but complete fixture with three raw sources, two compiled articles, proper cross-references, bidirectional See Also links, correct index files, and a valid log. Twenty files total. It passes every check.

Then I broke it eleven different ways. One copy per lint rule. `missing-index/` has a deleted `_index.md`. `bad-frontmatter/` has `type: invalid` instead of `type: articles`. `misplaced-file/` puts a concept article inside `wiki/references/`. `retracted-marker/` leaves a `<!--RETRACTED-SOURCE-->` comment that should have been cleaned up. Each broken copy triggers exactly one violation. The test asserts that the defect is present: negative testing.

A shell script called `generate-defect-fixtures.sh` creates all eleven from the golden wiki in under a second. Change the golden fixture, regenerate, and every negative test updates automatically.

## Promptfoo on a Claude plugin

Promptfoo has a provider called `anthropic:claude-agent-sdk` that can load local plugins. Point it at your plugin directory, whitelist the tools, set a budget cap, enable sandbox mode, and it runs your plugin through test cases defined in YAML.

The part that surprised me: the `skill-used` assertion type. You can assert that the agent invoked a specific skill: not just that the output mentions wiki commands, but that Claude actually triggered the wiki skill at the Claude Code level. Combined with trajectory assertions that verify which tools were called, you can check both what happened and how.
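For flavor, here is a hedged sketch of such a Promptfoo test case, written out from shell so it stays self-contained. The provider id and the `skill-used`/`llm-rubric` assertion types are the ones named above; every other key, value, and file name is an assumption to verify against the Promptfoo docs, not the project's actual config.

```shell
#!/usr/bin/env sh
# Sketch of a Promptfoo test case for the plugin. The provider id and the
# skill-used / llm-rubric assertion types come from the post; all other
# keys and values are assumptions - verify against the Promptfoo docs.
set -eu
cd "$(mktemp -d)"

cat > promptfooconfig.yaml <<'EOF'
providers:
  - anthropic:claude-agent-sdk

prompts:
  - "{{request}}"

tests:
  - vars:
      request: "Research the history of testing"
    assert:
      - type: skill-used     # did Claude actually trigger the wiki skill?
        value: wiki
      - type: llm-rubric     # graded by a judge LLM
        value: The request is routed to the research command.
EOF

echo "wrote promptfooconfig.yaml in $PWD"
```

You would then run `promptfoo eval` in that directory and let the judge grade each case.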
I test five behaviors: the fuzzy router dispatching "Research the history of testing" to the research command, a URL to ingest, a question to query, an ambiguous single word triggering clarification (negative control), and the plugin loading without errors. Each runs three times to measure variance.

## What I actually learned

The biggest surprise: Layer 1 catches almost everything. The expensive behavioral evals in Layer 2 are for confidence, not coverage. Index corruption, frontmatter drift, misplaced files, broken cross-references: these are the actual failure modes of a wiki management system, and they're all deterministic. You don't need an LLM to verify that a file exists in the right directory.

Anthropic's eval guide says "grade outcomes, not trajectories." For wiki operations, the outcome IS the file system state. Check the files, check the indexes, check the links. If the structure is correct, the agent followed the protocol. The trajectory, which tool calls it made and in what order, is interesting but secondary.

The test suite is in `tests/`. Clone it, run `./tests/test-structure.sh`, and watch 84 green checkmarks validate a project that contains zero lines of code.
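To make the Layer-1 idea concrete, here is a minimal sketch of two such structural checks. This is an illustration, not the actual `test-structure.sh`; the fixture it builds and the paths it checks are hypothetical.

```shell
#!/usr/bin/env sh
# Minimal sketch of two Layer-1 structural checks: every directory has
# an _index.md, and files declaring `type: articles` live in raw/articles/.
# Builds a tiny hypothetical fixture so the sketch is self-contained.
set -eu

WIKI="$(mktemp -d)"
mkdir -p "$WIKI/raw/articles"
printf '# index\n' > "$WIKI/raw/_index.md"
printf '# index\n' > "$WIKI/raw/articles/_index.md"
printf -- '---\ntype: articles\n---\nExample source.\n' \
  > "$WIKI/raw/articles/example.md"

fail=0

# Check 1: every directory under raw/ has an _index.md.
for dir in $(find "$WIKI/raw" -type d); do
  [ -f "$dir/_index.md" ] || { echo "FAIL: no _index.md in $dir"; fail=1; }
done

# Check 2: `type: articles` files must live under raw/articles/.
for f in $(grep -rl '^type: articles' "$WIKI/raw"); do
  case "$f" in
    "$WIKI"/raw/articles/*) ;;                   # correct location
    *) echo "FAIL: misplaced articles file: $f"; fail=1 ;;
  esac
done

[ "$fail" -eq 0 ] && echo "structural checks passed"
```

Delete the fixture's `_index.md` before running the checks and you have the `missing-index/` negative test in miniature.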
NVK 1 week ago
# Writing bitcoinquantum.space with llm-wiki.net

In April 2026 I wanted to assess whether the quantum threat to Bitcoin was real. The honest answer lived across fifteen papers, a dozen Delving Bitcoin threads, twenty Bitcoin Optech newsletters, a running testnet, some Liquid transactions, and whatever Avihu Levy had pushed to GitHub that morning. The work was real and scattered. No article summarized it honestly. Headlines were downstream of press releases. The primary sources were where the actual answer lived.

This is one of the things llm-wiki was built for. I used it. Three weeks later I published [bitcoinquantum.space](bitcoinquantum.space): three articles, ~15,000 words, 95+ sources cross-referenced, every claim verified. This is a writeup of how.

## The shape of the problem

Serious research has three failure modes:

1. **You can't find everything.** Sources scatter across formats and venues. You don't know what you're missing.
2. **You can't remember everything.** By paper #60 you've forgotten paper #4. You re-read. You contradict yourself.
3. **You can't update.** A new paper drops on publication day. Your conclusion is stale and your notes are already collapsed into prose you can't untangle.

Traditional knowledge management fixes (1) and partly (2). It fails at (3) because the maintenance burden compounds. @karpathy's framing, *"who does the maintenance?"*, is load-bearing because humans don't, not reliably, not for unsexy cross-reference updates nobody sees.

llm-wiki.net fixes (3) by making the entire artifact mechanically regeneratable from immutable raw sources. The only thing you maintain is the source pile.

## The pipeline, applied

**Raw sources, not notes.** Every paper, blog post, mailing list thread, and testnet report got dropped into `raw/` verbatim with a frontmatter header. No interpretation, no paraphrasing. If I don't have the primary source, I don't have it. `raw/` grew to 95+ entries.
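Adding one such entry can be sketched like this. The frontmatter field names below are hypothetical illustrations; the plugin defines its own required set.

```shell
#!/usr/bin/env sh
# Sketch of dropping one verbatim source into raw/. Frontmatter field
# names are hypothetical - the plugin defines its own required fields.
set -eu
cd "$(mktemp -d)"
mkdir -p raw/papers

cat > raw/papers/example-quantum-paper.md <<'EOF'
---
type: papers
title: Example paper on quantum threats to Bitcoin
url: https://example.org/paper
retrieved: 2026-04-02
---
Verbatim text of the source goes here. No interpretation, no paraphrasing.
EOF

echo "raw entries: $(find raw -type f -name '*.md' | wc -l | tr -d ' ')"
```

The point of the heredoc is the discipline: the body below the frontmatter stays untouched, so everything downstream can be regenerated from it.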
**Compile, don't write.** `/wiki:compile` reads the raw pile and synthesizes cross-referenced wiki articles, one per concept, person, and proposal. "SHRINCS." "Taproot script-path post-quantum proof." "The BIP 86 problem." "Quantum Safe Bitcoin." Each article carries a confidence level, citations, and bidirectional cross-references. The wiki is Claude's work; the sources are mine.

**Query to find gaps.** Once compiled, I stop reading papers and start asking questions. *"What's the relationship between Ruffing's Taproot proof and BIP 86?"* The wiki answers with citations, and in the process surfaces the gap: 70-90% of BIP 86 outputs can't use the escape hatch. That's a thread I wouldn't have pulled linearly. Query mode is where llm-wiki stops being a filing cabinet and starts being a research partner.

**Output, last.** The articles on bitcoin
NVK 1 week ago
llm-wiki.net v0.0.15 is out
NVK 3 weeks ago
So who's going to buy out Flickr and republish on Nostr? I'd tip in.
NVK 1 month ago
COLDCARD Mk5
NVK 1 month ago
You are addicted to other people's problems. Go outside, walk in the grass, and absorb some low entropy from the sun.
NVK 1 month ago
Honey badger doesn't care
NVK 2 months ago
Banana bread > Banana rice