Roadmap - WARDEN

WARDEN is built in phases designed to be independently useful. You get value before the whole system exists. Each phase delivers a working tool; later phases deepen and connect what came before. The status key used throughout this page:

Status	Meaning
Implemented and tested	Code ships, exercised by the test suite, runs on `warden demo`.
Scaffolded	Module and interface exist, the seam is wired end-to-end, but the implementation is a stub or requires optional native tooling to activate.

Phase 0: Spine

Goal. A queryable model of any Emscripten module: parse the binary, compute stable function identities, seed obvious symbols, and store everything in a versioned knowledge base.

Deliverables. warden init, warden ingest, warden funcs, warden show, warden coverage, warden set-name are all fully operational. The KB can round-trip a module and answer “what do we know about func[N]?” without any external tooling.

Status: implemented and tested.

Component	Source	What it does
WASM section parser	`src/warden/ingest/`	Reads type, import, function, export, code, name, element/table sections in pure Python.
JS-glue parser	`src/warden/ingest/`	Reads Emscripten’s export-index map, `dynCall` signatures, and `PROXY_TO_PTHREAD` shape from the `.js` glue file.
Identity fingerprinter	`src/warden/identity/fingerprint.py`	Computes exact body hash, structural skeleton, opcode-class histogram, call-target set, and type signature per function. The `stable_id` composite is what annotations are keyed to.
Knowledge base	`src/warden/kb/`	SQLite schema with `module_versions`, `functions`, `symbols`, and `diffs` tables; provenance/confidence/locked columns; `upsert_symbol` economy (human > oracle > agent).
CLI	`src/warden/cli.py`	`init`, `ingest`, `versions`, `coverage`, `funcs`, `show`, `set-name`, `verify`, `export`, and `demo`.
`warden demo`	`src/warden/samples.py`	Runs the entire pipeline end-to-end on generated sample modules with no network or native toolchain.
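The `upsert_symbol` economy in the table above (human > oracle > agent) can be illustrated with a small in-memory SQLite sketch. Everything here is an assumption made for illustration: the column layout, the `RANK` table, the helper `open_kb`, and the rule that ties within one provenance are broken by confidence. The real schema and tie-breaking policy in `src/warden/kb/` may differ.

```python
import sqlite3

# Hypothetical ranks; the text only states the ordering human > oracle > agent.
RANK = {"agent": 1, "oracle": 2, "human": 3}


def open_kb():
    """Create an in-memory table shaped loosely like a symbols table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE symbols ("
        " stable_id TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " provenance TEXT NOT NULL,"
        " confidence REAL NOT NULL,"
        " locked INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def upsert_symbol(conn, stable_id, name, provenance, confidence):
    """Write the symbol only if it outranks the stored one. Returns True on write."""
    row = conn.execute(
        "SELECT provenance, confidence, locked FROM symbols WHERE stable_id = ?",
        (stable_id,),
    ).fetchone()
    if row is not None:
        old_provenance, old_confidence, locked = row
        if locked:
            return False  # locked rows are never overwritten
        # Assumed policy: compare provenance rank first, then confidence.
        if (RANK[provenance], confidence) <= (RANK[old_provenance], old_confidence):
            return False
    conn.execute(
        "INSERT OR REPLACE INTO symbols"
        " (stable_id, name, provenance, confidence, locked) VALUES (?, ?, ?, ?, 0)",
        (stable_id, name, provenance, confidence),
    )
    return True


conn = open_kb()
print(upsert_symbol(conn, "f1", "malloc", "oracle", 0.9))       # written
print(upsert_symbol(conn, "f1", "alloc_thing", "agent", 0.99))  # rejected: agent < oracle
print(upsert_symbol(conn, "f1", "dlmalloc", "human", 1.0))      # written: human outranks oracle
```

Keeping the gate inside the write path means every caller, whether CLI, agent, or MCP tool, goes through the same ordering.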

Covered by tests/test_ingest.py, tests/test_fingerprint.py, tests/test_kb.py, tests/test_cli.py, and tests/test_pipeline.py. Fingerprint determinism is tested explicitly: same bytes in, same stable_id out.
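To make the determinism property concrete, here is a minimal sketch of a fingerprint that separates an exact hash from a relocation-tolerant identity. It is not the code in `fingerprint.py`: the decoded `(mnemonic, immediate)` instruction form, the choice to build the identity from the type signature plus the immediate-free skeleton, and the function name `fingerprint_sketch` are all assumptions for illustration.

```python
import hashlib
from collections import Counter


def fingerprint_sketch(instrs, type_sig):
    """instrs: list of (mnemonic, immediate) pairs; type_sig: e.g. "(i32)->i32"."""
    # Exact hash covers everything, including immediates such as call targets.
    exact_hash = hashlib.sha256(repr(instrs).encode()).hexdigest()
    # Structural skeleton drops immediates so index shifts do not change it.
    skeleton = [mnemonic for mnemonic, _ in instrs]
    # Coarse opcode-class histogram; here the class is just the mnemonic prefix.
    histogram = Counter(mnemonic.split(".")[0] for mnemonic in skeleton)
    # Assumed composite: type signature plus skeleton.
    canonical = type_sig + "|" + ",".join(skeleton)
    stable_id = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return {
        "exact_hash": exact_hash,
        "skeleton": skeleton,
        "histogram": dict(histogram),
        "stable_id": stable_id,
    }


body = [("local.get", 0), ("call", 12), ("i32.const", 7), ("i32.mul", None)]
shifted = [("local.get", 0), ("call", 15), ("i32.const", 7), ("i32.mul", None)]

a = fingerprint_sketch(body, "(i32)->i32")
b = fingerprint_sketch(list(body), "(i32)->i32")
c = fingerprint_sketch(shifted, "(i32)->i32")

print(a == b)                            # same input, same fingerprint
print(a["stable_id"] == c["stable_id"])  # call-index shift keeps the identity
print(a["exact_hash"] == c["exact_hash"])  # but changes the exact hash
```

No randomness, timestamps, or dictionary-order dependence enters the hashed string, which is the property a determinism test needs.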

Phase 1: Decompile and export bridge

Goal. Push recovered names from the KB back into existing RE tools so analysts keep their current workflow. Pull decompiler output back in to enrich proposals.

Deliverables. warden export --format ghidra emits a runnable Python rename script; warden export --format headers emits a C header; warden export --format pseudo emits readable per-function listings; warden export --format kb-text emits a git-diffable plain-text snapshot.

Status: exporters implemented and tested (including a built-in pure-Python decompiler); live Ghidra round-trip scaffolded.

The four export formats (headers, pseudo, kb-text, ghidra) in src/warden/export/text.py are implemented and covered by tests/test_cli.py. The ghidra format emits a valid Python snippet that calls getFunctionByWasmIndex (from the nneonneo/ghidra-wasm-plugin) and fn.setName(name, SourceType.USER_DEFINED) for every named function in the KB.

Built-in lifter. warden.lift is a pure-Python stack-machine lifter that renders readable pseudo-C without any native tooling. lift_function(module, func) returns a string; lift_module(module) lifts every function. warden export --format pseudo now emits real pseudo-C (not a mnemonic dump), and the dedicated CLI command warden lift <label> [--index N] [--out FILE] exposes the lifter directly. For example, parse_token lifts to:

i32 parse_token(i32 p0, i32 p1) { return ((p0 + p1) * 7); }
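The folding step behind an expression like the one above can be sketched as a symbolic stack: operands are pushed as strings and each binary opcode pops two of them and pushes a parenthesised expression. This is only an illustration of the stack-machine lifting idea, not the code in warden.lift. The name `lift_expression`, the decoded instruction form, and the instruction sequence assumed for parse_token are all hypothetical.

```python
def lift_expression(instrs):
    """Fold a straight-line stack program into one parenthesised expression."""
    binops = {"i32.add": "+", "i32.sub": "-", "i32.mul": "*"}
    stack = []
    for op, imm in instrs:
        if op == "local.get":
            stack.append(f"p{imm}")      # parameters render as p0, p1, ...
        elif op == "i32.const":
            stack.append(str(imm))
        elif op in binops:
            rhs = stack.pop()            # top of stack is the right operand
            lhs = stack.pop()
            stack.append(f"({lhs} {binops[op]} {rhs})")
        else:
            raise ValueError(f"unsupported opcode: {op}")
    if len(stack) != 1:
        raise ValueError("expected exactly one value left on the stack")
    return stack[0]


# Assumed body for parse_token; the real function's bytes are not shown on this page.
body = [
    ("local.get", 0),
    ("local.get", 1),
    ("i32.add", None),
    ("i32.const", 7),
    ("i32.mul", None),
]
print(f"i32 parse_token(i32 p0, i32 p1) {{ return {lift_expression(body)}; }}")
```

A full lifter also has to handle control flow, locals, memory access, and calls; the sketch covers only straight-line integer arithmetic.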

What is scaffolded: the script is generated correctly, but the live Ghidra round-trip (launching Ghidra headlessly, loading the plugin, running the script, and reading decompiled p-code back through pyghidra) is not automated. Activating it requires Ghidra and the plugin installed locally.

Phase 2: The Emscripten Oracle

Goal. Auto-identify 40–80% of any Emscripten module as known musl / libc++ / dlmalloc / Emscripten-runtime code, instantly, with real upstream names, so agent and human effort concentrates on the application-specific remainder.

Deliverables. warden oracle build and warden oracle identify. A corpus of labeled .wasm artifacts (emsdk × build-flag matrix) backed by the signature store; version inference from the distribution of Oracle matches.

Status: Oracle engine and MinHash-LSH index implemented and tested; full emsdk corpus farm scaffolded.

Component	Source	What it does
Signature extraction	`src/warden/oracle/corpus.py`	`extract_signatures` fingerprints every named defined function in a labeled module and classifies it by library (musl, libc++, dlmalloc, emscripten, wasi-libc, musl-pthread).
Signature store	`src/warden/oracle/signatures.py`	JSON-serialisable store; `load` / `save` / `extend` / `libraries()`.
Identification pass	`src/warden/oracle/match.py`	Fingerprints every defined function in the target, scores against each corpus signature using `similarity()`, and writes matches above threshold as `oracle`-provenance symbols into the KB.
MinHash-LSH index	`src/warden/oracle/index.py`	`SignatureIndex.build(store, bands=8)` builds a sublinear candidate index; `index.candidates(fp)` returns approximate neighbors; `identify_indexed(kb, version_id, store, threshold=0.82, write=True)` is a drop-in replacement for the linear `identify()` pass. CLI: `warden oracle identify <label> --store s --indexed`.
Version inference	`src/warden/oracle/`	`infer_version` reads the distribution of `emscripten_version` fields across matches and returns the plurality winner with a calibrated confidence score.
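For readers unfamiliar with MinHash-LSH, the candidate lookup named in the table can be sketched generically. This is not `SignatureIndex`: the class `BandIndex`, the feature strings, the 32-hash signature length, and the use of BLAKE2b are assumptions chosen for a compact example. Only the default of 8 bands mirrors the `bands=8` parameter mentioned above.

```python
import hashlib
from collections import defaultdict


def minhash(features, num_hashes=32):
    """One minimum per seeded hash function over a non-empty feature set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{feat}".encode(), digest_size=8).digest(),
                "big",
            )
            for feat in features
        ))
    return signature


class BandIndex:
    """Split each signature into bands; items sharing any band become candidates."""

    def __init__(self, bands=8):
        self.bands = bands
        self.buckets = defaultdict(set)

    def _keys(self, signature):
        rows = len(signature) // self.bands
        for band in range(self.bands):
            yield (band, tuple(signature[band * rows:(band + 1) * rows]))

    def add(self, name, features):
        for key in self._keys(minhash(features)):
            self.buckets[key].add(name)

    def candidates(self, features):
        found = set()
        for key in self._keys(minhash(features)):
            found |= self.buckets.get(key, set())
        return found


index = BandIndex(bands=8)
index.add("memcpy", {"op:i32.load", "op:i32.store", "op:loop", "call:0"})
index.add("strlen", {"op:i32.load8_u", "op:br_if", "op:i32.eqz"})

# An identical feature set always lands in the same buckets.
print(index.candidates({"op:i32.load", "op:i32.store", "op:loop", "call:0"}))
```

Candidates are only approximate neighbours, so a scoring pass over them (such as the `similarity()` threshold described above) is still needed. More bands with fewer rows each raise recall at the cost of more false candidates.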

What is scaffolded: scripts/corpus/ describes the containerised emsdk matrix build (each tagged release × -O0…-Oz, -pthread on/off, LTO, exceptions mode). That farm has not been run. The seed store shipped with the repo (src/warden/oracle/seed_signatures.json) is a small hand-crafted fixture used by tests. Running the real farm requires emsdk and produces the multi-thousand-signature corpus that gives Oracle the 40–80% identification rate.
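The version-inference idea from the table above, taking the plurality winner over the `emscripten_version` fields of the matches, can be sketched in a few lines. The function name `plurality_version` and the match dictionaries are hypothetical, the version strings are made-up inputs, and the raw vote share returned here is an uncalibrated stand-in for the calibrated confidence that `infer_version` is described as producing.

```python
from collections import Counter


def plurality_version(matches):
    """matches: list of dicts that may carry an "emscripten_version" key."""
    votes = Counter(
        m["emscripten_version"] for m in matches if m.get("emscripten_version")
    )
    if not votes:
        return None, 0.0
    version, count = votes.most_common(1)[0]
    # Vote share only; a real score would also account for sample size.
    return version, count / sum(votes.values())


# Illustrative inputs, not real Oracle output.
matches = [
    {"name": "memcpy", "emscripten_version": "3.1.50"},
    {"name": "strlen", "emscripten_version": "3.1.50"},
    {"name": "free", "emscripten_version": "3.1.45"},
    {"name": "app_helper"},  # no version recorded
]
print(plurality_version(matches))
```

With a small seed store the vote share is noisy, which is one reason the full corpus matters for this step.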

Phase 3: Diff and carry-over

Goal. Turn reverse engineering from Sisyphean to incremental. When a new .wasm ships, classify every function as unchanged / moved / modified / new / deleted, carry all annotations forward automatically for unchanged and moved, apply a confidence penalty for fuzzy matches, and emit a semantic changelog that separates genuine application deltas from runtime churn caused by an Emscripten version bump.

Deliverable. warden diff <from> <to> carries annotations forward and prints a human-readable changelog; the diff report is stored in the KB for time-travel queries.

Status: fully implemented and tested.

src/warden/diff/engine.py runs a three-pass matching algorithm:

Exact-body hash match

Functions with the same exact_hash are unchanged if the table index stayed, or moved if it shifted.

Stable-identity match

Functions with the same stable_id but a different body are treated as unchanged/moved too, because the stable identity intentionally tolerates relocations the exact hash would miss.

Greedy fuzzy match

Among the remainder, functions are paired by similarity().overall. Score ≥ 0.6 is classified modified; the rest become new or deleted.
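The three passes above can be sketched as one function over two index-keyed dictionaries. This is an illustration under assumptions, not the code in engine.py: the record shape, the name `classify`, and the toy Jaccard `similarity` over call sets are hypothetical. Only the pass order and the 0.6 threshold come from the description above.

```python
def jaccard(a, b):
    """Toy similarity over call-target sets; stands in for a richer score."""
    union = a["calls"] | b["calls"]
    return len(a["calls"] & b["calls"]) / len(union) if union else 1.0


def classify(old, new, similarity=jaccard, threshold=0.6):
    """Return (kind, old_index, new_index, score) tuples for every function."""
    results = []
    left_old, left_new = dict(old), dict(new)

    def match_on(key):
        # Passes 1 and 2: pair functions whose `key` field is identical.
        by_key = {}
        for idx, fn in left_new.items():
            by_key.setdefault(fn[key], []).append(idx)
        for o_idx in sorted(left_old):
            bucket = by_key.get(left_old[o_idx][key])
            if bucket:
                n_idx = bucket.pop(0)
                kind = "unchanged" if n_idx == o_idx else "moved"
                results.append((kind, o_idx, n_idx, 1.0))
                del left_old[o_idx]
                del left_new[n_idx]

    match_on("exact_hash")
    match_on("stable_id")

    # Pass 3: greedy pairing of the remainder, best score first.
    pairs = sorted(
        ((similarity(left_old[o], left_new[n]), o, n)
         for o in left_old for n in left_new),
        reverse=True,
    )
    used_old, used_new = set(), set()
    for score, o, n in pairs:
        if score < threshold:
            break
        if o in used_old or n in used_new:
            continue
        used_old.add(o)
        used_new.add(n)
        results.append(("modified", o, n, score))

    results += [("deleted", o, None, 0.0) for o in left_old if o not in used_old]
    results += [("new", None, n, 0.0) for n in left_new if n not in used_new]
    return results


old = {
    0: {"exact_hash": "a", "stable_id": "A", "calls": {1, 2}},
    1: {"exact_hash": "b", "stable_id": "B", "calls": {3}},
    2: {"exact_hash": "c", "stable_id": "C", "calls": {4, 5, 6}},
    3: {"exact_hash": "d", "stable_id": "D", "calls": {9}},
}
new = {
    0: {"exact_hash": "a", "stable_id": "A", "calls": {1, 2}},
    2: {"exact_hash": "b", "stable_id": "B", "calls": {3}},
    1: {"exact_hash": "c2", "stable_id": "C2", "calls": {4, 5, 6, 7}},
    3: {"exact_hash": "e", "stable_id": "E", "calls": {20}},
}
for row in classify(old, new):
    print(row)
```

Greedy pairing is simple and fast but not globally optimal; an assignment solver would be the alternative if mispairings became a problem on large modules.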

Annotation carry-over copies oracle / agent / human symbols to the new stable_id with a 0.7 confidence multiplier for fuzzy matches; diff-carry provenance is recorded. render_changelog separates runtime_churn from app_modified using the _RUNTIME_PREFIXES table. Covered by tests/test_pipeline.py and tests/test_cli.py.
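The carry-over rule can be sketched as a pure function over match results. The 0.7 multiplier and the diff-carry provenance label come from the paragraph above. The name `carry_over`, the dictionary shapes, and the choice to keep the original provenance in a separate `origin` field are assumptions for illustration.

```python
FUZZY_PENALTY = 0.7  # multiplier applied to fuzzy (modified) matches


def carry_over(symbols, matches):
    """symbols: old stable_id -> {"name", "provenance", "confidence"}.
    matches: list of (kind, old_stable_id, new_stable_id)."""
    carried = {}
    for kind, old_id, new_id in matches:
        symbol = symbols.get(old_id)
        if symbol is None or kind not in ("unchanged", "moved", "modified"):
            continue
        factor = FUZZY_PENALTY if kind == "modified" else 1.0
        carried[new_id] = {
            "name": symbol["name"],
            "provenance": "diff-carry",
            "origin": symbol["provenance"],  # assumed field, for traceability
            "confidence": round(symbol["confidence"] * factor, 4),
        }
    return carried


symbols = {
    "A": {"name": "malloc", "provenance": "oracle", "confidence": 0.95},
    "C": {"name": "parse_token", "provenance": "human", "confidence": 0.9},
}
matches = [("unchanged", "A", "A"), ("modified", "C", "C2"), ("deleted", "D", None)]
for stable_id, symbol in carry_over(symbols, matches).items():
    print(stable_id, symbol)
```

Lowering confidence on fuzzy matches lets later passes (agent or human) revisit exactly the names whose function bodies changed.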

Phase 4: Agent crew over MCP

Goal. A propose → verify → write-back loop that populates the KB with human effort spent only on what the agents cannot resolve confidently.

Deliverables. warden agent <label> running a multi-backend crew; warden mcp serving the KB as an MCP tool surface so any capable model can drive it from outside.

Status: agent loop, MCP server, and specialized concurrency + struct analyzers implemented and tested; full specialized crew wiring scaffolded.

Implemented:

Offline heuristic backend (src/warden/agents/backends.py): deterministic, zero-dependency; uses string xrefs and call-neighborhood context to produce proposals. Works with no API key.
OpenAI backend (src/warden/agents/backends.py): structured JSON output via the OpenAI Responses API, model gpt-5.3-codex by default. Auto-selected when OPENAI_API_KEY is set and openai is installed (pip install warden-re[agents]). codex and oai are provider aliases.
Anthropic backend (src/warden/agents/backends.py): structured JSON output via the Anthropic Messages API, model claude-opus-4-8. Auto-selected when ANTHROPIC_API_KEY is set and anthropic is installed, if OpenAI is not available. make_backend auto-detects.
Crew loop (src/warden/agents/crew.py): gather_facts seeds each call with hard evidence (type signature, call targets, string xrefs, opcodes) to constrain hallucination; verify_proposal is a cheap syntactic gate; run_agent_pass iterates bottom-up (fewest call targets first), skips already-confident symbols, gates through the verifier and KB economy.
MCP server (src/warden/mcp/server.py): FastMCP server exposing project reads, function facts, agent backend discovery, server-side agent runs, and economy-gated symbol proposals. Agent writes are economy-gated at the KB layer, so they cannot overwrite human or higher-confidence Oracle annotations. Activate with pip install warden-re[mcp] then warden mcp.
Concurrency analyzer (src/warden/analysis/concurrency.py): analyze_concurrency(module, kb, version_id) returns a ConcurrencyReport with .shared_memory, .atomic_sites, .pthread_markers, and .facts. Populates the previously-empty thread_model KB table via kb.add_thread_fact. Deterministic; zero external dependencies.
Struct analyzer (src/warden/analysis/structs.py): analyze_structs(...) returns a list of StructLayout values (each carrying .name, .fields as StructField(offset, size, type, name), and .source_function). Populates the structs KB table via kb.upsert_struct. CLI: warden analyze <label> runs both analyzers and persists all facts.
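The backend auto-selection order described in the list above (OpenAI when its key and package are present, otherwise Anthropic, otherwise the offline heuristic) can be sketched as follows. This is not `make_backend`: the name `pick_backend`, its injectable parameters, and the returned labels are assumptions, and the sketch only decides which backend to use without constructing one.

```python
import importlib.util
import os


def pick_backend(env=None, has_module=None):
    """Return "openai", "anthropic", or "offline" using the documented precedence."""
    env = os.environ if env is None else env
    if has_module is None:
        # A package counts as installed if Python can locate its import spec.
        def has_module(name):
            return importlib.util.find_spec(name) is not None
    if env.get("OPENAI_API_KEY") and has_module("openai"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY") and has_module("anthropic"):
        return "anthropic"
    return "offline"  # deterministic fallback that needs no API key


print(pick_backend())  # result depends on the local environment
print(pick_backend({"ANTHROPIC_API_KEY": "placeholder"}, lambda name: True))
```

Passing `env` and `has_module` explicitly keeps the precedence testable without touching real credentials or installed packages.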

What is scaffolded: the VISION describes six specialized agents (Oracle, Concurrency, Type/Struct, Naming/Summarization, Diff, and Verifier). The current crew is a single general loop; the concurrency and struct analyzers produce facts but are not yet wired as autonomous agents that re-drive the naming loop. A true bottom-up call-graph ordering pass that re-decompiles each function after its callees’ names are resolved is not yet implemented.
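One possible shape for the missing bottom-up ordering is a topological sort of the call graph so that callees are visited before their callers. The sketch below uses the standard library's `graphlib` (Python 3.9+). The function name `bottom_up_order` and the dictionary form of the call graph are assumptions, and it makes no claim about how WARDEN will implement this pass. Self-recursion is stripped, but mutual recursion would raise `graphlib.CycleError` and would need cycles collapsed first.

```python
from graphlib import TopologicalSorter


def bottom_up_order(call_graph):
    """call_graph: function name -> set of callee names. Callees come first."""
    graph = {
        fn: {c for c in callees if c != fn and c in call_graph}  # drop self-calls and imports
        for fn, callees in call_graph.items()
    }
    # TopologicalSorter treats the mapped values as predecessors, so callees
    # are emitted before the functions that call them.
    return list(TopologicalSorter(graph).static_order())


call_graph = {
    "main": {"parse", "emit"},
    "parse": {"read_byte"},
    "emit": {"read_byte", "emit", "imported_log"},
    "read_byte": set(),
}
print(bottom_up_order(call_graph))
```

Visiting leaves first means each function can be re-lifted with its callees already named, which is the benefit the paragraph above describes.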

Phase 5: Verifier

Goal. Make “understood” provable. Lift target functions via wasm2c/w2c2 to C, recompile the agent reconstruction the same way, differentially execute both over a fuzzer-generated corpus, and require I/O and memory match.

Deliverable. warden verify <wasm> reports determinism and differential-readiness. The verifier gate in the agent loop activates the behavioral check when the required tooling is present.

Status: determinism verification and mini-interpreter with differential execution implemented and tested; wasm2c differential harness scaffolded.

Implemented in src/warden/verify/harness.py:

verify_determinism re-ingests the same bytes twice and confirms every function’s stable_id is bit-identical across runs. This is the foundational guarantee that the entire carry-over mechanism depends on.
tooling_status probes PATH for wasm2c, w2c2, a C compiler, and wasm-validate; reports can_differential truthfully.
differential_plan returns the concrete shell steps for the wasm2c lift → recompile → differential execution pipeline, and whether the environment can run them. warden verify <wasm> surfaces this output.
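A PATH probe in the spirit of tooling_status can be written with `shutil.which`. The sketch below is an assumption-laden illustration, not the harness code: the name `probe_tooling`, the compiler candidates (`cc`, `gcc`, `clang`), the result keys, and the rule that a lifter plus a C compiler is enough for `can_differential` are all hypothetical.

```python
import shutil


def probe_tooling(which=shutil.which):
    """Report which native tools are on PATH without running any of them."""

    def first(*names):
        # Path of the first tool found, or None if none are installed.
        return next((path for path in map(which, names) if path), None)

    lifter = first("wasm2c", "w2c2")
    compiler = first("cc", "gcc", "clang")  # assumed candidate names
    validator = first("wasm-validate")
    return {
        "lifter": lifter,
        "compiler": compiler,
        "validator": validator,
        # Assumed rule: differential runs need a lifter and a C compiler.
        "can_differential": lifter is not None and compiler is not None,
    }


print(probe_tooling())  # result depends on the local machine
```

Reporting `None` for missing tools, rather than raising, lets the caller print a readiness summary and fall back to the pure-Python path.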

Mini interpreter. warden.interp is a zero-dependency interpreter for the integer subset of WebAssembly that makes behavioral equivalence runnable without any native toolchain.

execute_function(module, func, args, *, host=None, memory=None, fuel=100000) executes a single function and returns a list of integer results. Raises UnsupportedExecution for instructions outside the integer subset.
differential_execute(mod_a, fn_a, mod_b, fn_b, inputs) runs both functions over a list of argument tuples and returns a per-input list of dicts reporting whether the outputs matched. For example, it proves parse_token v1 and v2 are behaviorally equivalent (v2’s bounds-check result is dropped from the return), while flagging that internal_crc differs.
CLI: warden exec <label> <index> [args...] prints the result of executing a function by index directly from the KB.
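The idea behind the mini interpreter and differential execution can be shown with a toy evaluator for straight-line i32 arithmetic. It is far smaller than warden.interp and shares none of its API: the names `run` and `differential`, the exception classes, the decoded instruction form, and the two sample bodies are assumptions for illustration. The second body mimics the case above where an extra computed value is dropped before returning.

```python
MASK = 0xFFFFFFFF  # i32 arithmetic wraps modulo 2**32


class OutOfFuel(Exception):
    pass


class Unsupported(Exception):
    pass


def run(instrs, args, fuel=100000):
    """Execute straight-line i32 code; returns the final value stack."""
    stack = []
    for op, imm in instrs:
        fuel -= 1
        if fuel < 0:
            raise OutOfFuel("fuel exhausted")
        if op == "local.get":
            stack.append(args[imm] & MASK)
        elif op == "i32.const":
            stack.append(imm & MASK)
        elif op == "drop":
            stack.pop()
        elif op in ("i32.add", "i32.sub", "i32.mul"):
            b = stack.pop()
            a = stack.pop()
            value = a + b if op == "i32.add" else a - b if op == "i32.sub" else a * b
            stack.append(value & MASK)
        else:
            raise Unsupported(op)
    return stack


def differential(fn_a, fn_b, inputs):
    """Run both bodies on each argument tuple and record whether outputs agree."""
    report = []
    for args in inputs:
        out_a, out_b = run(fn_a, args), run(fn_b, args)
        report.append({"args": args, "a": out_a, "b": out_b, "match": out_a == out_b})
    return report


v1 = [("local.get", 0), ("local.get", 1), ("i32.add", None),
      ("i32.const", 7), ("i32.mul", None)]
# v2 computes an extra value and drops it, so observable output is unchanged.
v2 = v1 + [("local.get", 1), ("i32.const", 0), ("i32.sub", None), ("drop", None)]

inputs = [(1, 2), (0xFFFFFFFF, 1)]
print(all(row["match"] for row in differential(v1, v2, inputs)))
```

Agreement on sampled inputs is evidence of equivalence rather than proof, which is why the text above pairs this check with a fuzzer-generated corpus.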

What is scaffolded: the actual wasm2c differential harness (launching wasm2c, compiling the lifted C, running the fuzzer corpus, comparing outputs) is not automated. The verifier gate in crew.py:verify_proposal is currently the cheap syntactic check; the behavioral hook is the documented plug-in point for when a C toolchain is detected. SeeWasm symbolic checks and Wasabi/Frida dynamic tracing are not yet wired.

Phase 6: UX

Goal. A “RE-as-version-control” interface: diff view, confidence heatmap, time-travel query (“when did this function first appear?”), thread/memory map, one-click export to pseudocode or headers.

Deliverable. A usable interface for analysts who want WARDEN’s power without running CLI commands manually.

Status: static HTML report generator implemented and tested; full interactive UX scaffolded.

All the data a UX would consume is present in the KB today: versioned functions, per-function confidence and provenance, diff reports stored with kb.store_diff, and the kb-text export format designed to diff cleanly in git. The warden demo output already produces a human-readable coverage progression and changelog in the terminal.

Static HTML report. warden.report generates a self-contained HTML file (inline CSS, no server required) that captures an analysis session as a shareable artifact.

render_report(kb, version_id, module=None) returns the HTML as a string; write_report(kb, version_id, path, module=None) writes it to disk.
The report includes a coverage summary, a confidence heatmap of functions colored by provenance and confidence score, a thread/memory model section drawn from the concurrency analyzer’s facts, and the diff changelog.
CLI: warden report <label> [--out FILE].
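A confidence heatmap in a self-contained HTML file needs very little machinery, as the following sketch suggests. It is not the warden.report implementation: the names `heat_color` and `render_page`, the row dictionaries, and the red-to-green HSL mapping are assumptions for illustration.

```python
import html


def heat_color(confidence):
    """Map confidence in [0, 1] to a hue from red (0) to green (120)."""
    clamped = min(max(confidence, 0.0), 1.0)
    return f"hsl({round(120 * clamped)}, 70%, 80%)"


def render_page(title, functions):
    """functions: list of dicts with index, name, provenance, confidence."""
    rows = []
    for fn in functions:
        rows.append(
            '<tr style="background:{color}"><td>{index}</td><td>{name}</td>'
            "<td>{provenance}</td><td>{confidence:.2f}</td></tr>".format(
                color=heat_color(fn["confidence"]),
                index=fn["index"],
                name=html.escape(fn["name"]),          # names are untrusted input
                provenance=html.escape(fn["provenance"]),
                confidence=fn["confidence"],
            )
        )
    return (
        '<!doctype html><html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title>"
        "<style>table{border-collapse:collapse}td{padding:2px 8px}</style>"
        "</head><body><table>" + "".join(rows) + "</table></body></html>"
    )


page = render_page("demo", [
    {"index": 0, "name": "malloc", "provenance": "oracle", "confidence": 0.95},
    {"index": 1, "name": "func_1", "provenance": "agent", "confidence": 0.40},
])
print(len(page))
```

Inlining the CSS keeps the output a single file that can be attached to a ticket or committed alongside a kb-text snapshot.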

No browser UI, TUI, or IDE plugin has been built beyond the static report. This phase is the natural integration point for a React diff view over the KB (through the existing MCP surface or a future HTTP API), or a Ghidra panel that highlights confidence with color.

Status at a glance

Phase	Name	Status
0	Spine	Implemented and tested
1	Decompile and export bridge	Exporters + built-in pure-Python lifter implemented; live Ghidra round-trip scaffolded
2	The Emscripten Oracle	Engine + MinHash-LSH index implemented; emsdk corpus farm scaffolded
3	Diff and carry-over	Implemented and tested
4	Agent crew over MCP	Loop, MCP server, concurrency + struct analyzers implemented; specialized agent wiring scaffolded
5	Verifier	Determinism + mini-interpreter + differential execution implemented; wasm2c harness scaffolded
6	UX	Static HTML report implemented; full interactive UX scaffolded

How to contribute

The project is alpha. Every phase has a concrete gap where a focused contribution lands quickly.

Phase 0: Spine (best entry point)

Fix edge cases in the WASM section parser (src/warden/ingest/), add fingerprint properties to src/warden/identity/fingerprint.py, or improve the JS-glue parser to handle additional Emscripten output shapes. Every change can be validated against the existing test suite with pytest.

Phase 1: Export bridge

Add a Ghidra headless launch wrapper that runs the generated rename script via analyzeHeadless, or wire pyghidra to pull p-code back into the lifter output. The built-in lifter in src/warden/lift/ is a good starting point for extending coverage to floating-point and SIMD opcodes. Requires Ghidra and the nneonneo/ghidra-wasm-plugin installed locally.

Phase 2: Oracle corpus

Run the emsdk matrix build (scripts/corpus/) and contribute the resulting oracle.json as a versioned artifact. Improve classify_library to cover additional Emscripten runtime prefixes. Tune the MinHash-LSH band count in SignatureIndex.build for better precision/recall on cross-opt-level matching.

Phase 3: Diff

Improve the fuzzy similarity score in src/warden/identity/fingerprint.py (call-graph anchoring, dominator-tree comparison) to reduce false modified/new classifications on large modules. Add time-travel query helpers to KnowledgeBase (when_first_seen, evolution_of).

Phase 4: Agent crew

Wire the concurrency and struct analyzers (src/warden/analysis/) as autonomous crew agents that feed facts back into the naming loop. Add a true call-graph ordering pass that re-decompiles each function after its callees are resolved. Add more MCP tools (get_diff, search_symbols, export_kb_text).

Phase 5: Verifier

Automate the wasm2c → compile → fuzz → compare pipeline in harness.py. Wire tooling_status().can_differential to actually gate the crew loop and upgrade verify_proposal to invoke the behavioral check for proposals above a confidence threshold. Extend the mini-interpreter in src/warden/interp/ to cover floating-point and memory opcodes.

Phase 6: UX

Build a terminal diff view (rich Table comparing two versions) or an HTTP API over the KB so a browser frontend can consume it. The static HTML report (src/warden/report/) and the MCP server (warden mcp) are already the right programmatic surfaces to build on.

See the contributing guide for how to open a PR, and the limitations page for an honest accounting of current gaps. To understand the full architecture behind these phases, start with core concepts.
