Verification - WARDEN

Readable but wrong is the cardinal sin of reverse engineering. WARDEN treats verification not as a post-hoc sanity check but as a first-class pipeline phase. This page explains what “verified” means along three distinct axes, what runs unconditionally today, what activates when native tooling is present, and what is scaffolded for the future.

Three honest axes of “100%”

“100% reverse engineered” is a precise, three-part claim. Each axis addresses a different failure mode.

Axis	Claim	Failure mode it guards
Symbol coverage	Every defined function has a binding: a name, a provenance, and a confidence score. No `func_412` stays anonymous.	Dark spots that conceal unknown behavior.
Behavioral equivalence	Reconstructions are differentially executed against the original until their outputs and side effects match on a corpus of inputs.	Plausible-sounding names that describe the wrong behavior.
Change accountability	Across any two versions, every delta is mapped to a function and a semantic explanation. Nothing changes silently.	Missed security-relevant changes in an update.

Coverage and accountability run fully today. Behavioral equivalence is corpus-bounded evidence, not a formal proof, and the automated harness activates only when the right native toolchain is present.

Running the check

warden verify app_v1.wasm

The verify command reads the file, runs the determinism check, and prints a colored summary. It then calls differential_plan and reports whether the optional behavioral harness is ready:

42/42 functions fingerprint deterministically
differential equivalence ready: False  tooling={'wasm2c': False, 'w2c2': False, 'cc': True, 'wabt_validate': False, 'can_differential': False}

If any function’s stable_id is unstable, the exit line prints in red and the offending function indices are listed. verify takes no options beyond the .wasm path. The determinism check always runs; differential readiness is always reported.

What runs today: `verify_determinism`

verify_determinism(wasm_bytes: bytes) -> DeterminismReport is the zero-dependency check that runs on every call to warden verify. It requires only the Python standard library.

How it works

The function parses the same raw bytes twice via parse_module, calls fingerprint_function on each pair of defined functions, and compares the resulting stable_id strings byte-exactly. The returned DeterminismReport carries:

total: number of defined (non-import) functions examined.
stable: count whose stable_id matched across both parse runs.
unstable: list of function indices where the id differed.
ok: True when unstable is empty.
summary: a one-liner such as "42/42 functions fingerprint deterministically".

Why it matters for carry-over

The stable_id is WARDEN’s stable function identity: a composite of the structural fingerprint, call-neighborhood, and type signature that stays constant across rebuilds even when table indices shift. The entire annotation carry-over mechanism (the feature that makes RE incremental rather than Sisyphean) depends on every stable_id being reproducible from the binary bytes alone.

If a stable_id were non-deterministic, annotations written against one parse would silently fail to port to the next version. An unstable result from verify_determinism indicates a bug in the fingerprinting engine that must be fixed before any carry-over can be trusted.

Because the check operates on raw bytes with no external tooling, it runs identically in CI, in a sandboxed environment, and on a laptop with nothing native installed.

Differential execution (runs today)

The warden.interp mini-interpreter is the middle tier: it makes behavioral equivalence runnable right now, with zero native toolchain required, for the integer subset that covers the vast majority of Emscripten arithmetic and glue code.

What the interpreter covers

execute_function is a pure-Python stack-machine evaluator for the i32 integer subset. It models:

All i32 arithmetic and bitwise opcodes (add, sub, mul, and, or, xor, shl).
All i32 comparisons, signed and unsigned (eq, ne, lt_s, gt_s, le_s, ge_s, eqz).
Structured control: block, loop, if/else/end, br, br_if, return.
Local variable operations: local.get, local.set, local.tee.
Direct function calls (with recursive fuel tracking).
Linear memory loads and stores (i32.load, i32.store).

Anything outside that scope raises UnsupportedExecution. The harness catches this and records the pair as “undecided” rather than crashing, so a partially-modeled module still yields useful results for the functions that are covered. Execution is deterministic by construction: the result depends only on the module bytes, the function index, and the argument vector. No wall-clock, no RNG, no external state.

Running one function

warden exec <label> <index> [args...]

The exec command looks up the named project, parses its .wasm, locates function <index>, and executes it on the provided arguments.

warden exec app_v1 42 10 3
# → [91]

Programmatically:

from warden.ingest import parse_module
from warden.interp import execute_function

module = parse_module(wasm_bytes)
func = module.functions[42]
results = execute_function(module, func, [10, 3])
# → [91]

execute_function accepts optional keyword arguments:

Argument	Default	Purpose
`host`	`None`	Callback for imported function calls; absent → imports are no-ops
`memory`	Zeroed 64 KiB	Pre-seeded `bytearray`; observe stores after the call
`fuel`	`100000`	Instruction budget; exceeded → `UnsupportedExecution`

Differential execution

differential_execute runs two functions from two modules over the same input corpus and reports per-input agreement:

from warden.interp import differential_execute

rows = differential_execute(mod_v1, fn_v1, mod_v2, fn_v2, inputs=[
    [0, 0], [1, 2], [255, 1], [0x7FFFFFFF, 1],
])
# Each row: {"args": [...], "a": [...], "b": [...], "match": True|False}

The match field is True when both sides return identical result stacks. If either side raises UnsupportedExecution, that side records None and match is False (undecided, not wrong). Concrete example. parse_token v1 and v2 differ structurally (v2 adds a bounds check), but the bounds-check result is dropped before the return. differential_execute proves they are behaviorally equivalent across the full input corpus: every row shows match: True. internal_crc, by contrast, shows match: False on most inputs, which is a genuine behavioral change.

# parse_token v1 == v2: bounds check is dropped, behavior unchanged
{"args": [10, 3],          "a": [91],  "b": [91],  "match": True}
{"args": [0x7FFFFFFF, 1],  "a": [...], "b": [...],  "match": True}

# internal_crc v1 != v2: genuine behavioral divergence
{"args": [10, 3],  "a": [0x1234], "b": [0xABCD], "match": False}

differential_execute never mutates the modules or functions it receives, and each input gets a fresh zeroed memory. You can run it in a loop across a large corpus without side effects.

Three-tier verification model

With the interpreter in place, WARDEN now has three stacked tiers for behavioral equivalence:

Tier	Dependency	Coverage
`warden.interp` (mini-interpreter)	Pure Python, stdlib only	i32 integer subset; runs today on any machine
`wasm2c` / `w2c2` + `cc` (native harness)	WABT + C compiler	Full numeric and memory behavior; corpus-bounded
SeeWasm (symbolic cross-check)	SeeWasm (future)	Path-condition soundness for security-critical branches

The interpreter fills the gap between the zero-cost determinism check above and the native harness below. For the integer-heavy glue code Emscripten emits, it is often sufficient on its own.

The optional differential harness

The behavioral harness activates only when the right native toolchain is present. Nothing silently pretends a check ran. WARDEN reports readiness honestly.

Tooling detection: `tooling_status()`

tooling_status() uses shutil.which to probe PATH for four tools:

Field	Binary probed	Role
`wasm2c`	`wasm2c`	WABT’s WebAssembly-to-C transpiler
`w2c2`	`w2c2`	Alternative wasm-to-C transpiler (faster compile)
`cc`	`cc`, `clang`, or `gcc` (first found)	C compiler to build the lifted output
`wabt_validate`	`wasm-validate`	Structural validation of the `.wasm` before lifting

The can_differential property is True when (wasm2c or w2c2) and cc. That is the minimum required to execute the harness. wabt_validate is independent, useful for confirming the input is well-formed before spending time lifting it.

The plan: `differential_plan(wasm_path)`

differential_plan(wasm_path: str) -> dict calls tooling_status() and returns a dict describing what would run and whether it can:

{
    "tooling": { ... },        # ToolingStatus.as_dict()
    "ready": True | False,     # can_differential
    "steps": [ ... ],          # concrete shell commands
    "note": "...",             # human-readable install hint when not ready
}

When ready is False, the note field tells you exactly what to install. The concrete steps it describes, in order:

Transpile the original to C

wasm2c app_v1.wasm -o target.c

wasm2c (from WABT) produces a self-contained C file that is functionally equivalent to the original by construction.

Compile the lifted C to a native executable

cc -O2 target.c wasm-rt-impl.c -o target_ref

The WABT runtime shim wasm-rt-impl.c is included with WABT.

Compile the agent/LLM reconstruction the same way

The reconstruction is compiled identically to produce a second native executable.

Differential execution over a fuzzer corpus

Both executables are run over a fuzzer-generated input corpus. Outputs and observable memory side effects must agree across every input.

Activating the harness

brew install wabt

After installation, warden verify app_v1.wasm will report differential equivalence ready: True.

What behavioral equivalence actually claims

When the differential harness reports a function as verified, the precise claim is: corpus-bounded behavioral equivalence: the reconstruction agrees with the original on every input tried, to the depth the fuzzer reached. This is strong, automatable evidence. It is not a formal proof. The corpus is finite, and an adversarial input outside it could in principle expose a divergence. Formal equivalence checking via SMT-based symbolic execution over all inputs is a future direction for cryptographically critical functions, but it is not what the current harness provides.

For most reverse-engineering purposes (understanding behavior, tracking changes, writing accurate documentation), corpus-bounded evidence is sufficient and is considerably stronger than “the name sounds right.”

Future dynamic ground-truth hooks

Three additional sources of behavioral evidence are identified in the architecture vision and scaffolded as future work.

SeeWasm: symbolic cross-check

For specific claimed path conditions (“this branch fires when magic == 0xCAFE”), SeeWasm can confirm the condition symbolically rather than relying on the model’s assertion. This would plug in as an additional gate in the verifier after the differential harness, providing path-condition soundness for security-critical branches.

Wasabi / Frida / Chrome DevTools: dynamic ground truth

Wasabi is a WebAssembly instrumentation framework; Frida and Chrome DevTools expose runtime traces. Live execution traces confirm which element-table targets actually fire, which atomic operations guard which memory regions, and what values flow through indirect calls. This is precisely the information hardest to recover statically. These hooks would validate the thread model and dynCall attribution produced by the concurrency agent.

Oracle-as-oracle: free verification for matched functions

Any function matched by the Emscripten Oracle to a known upstream source (musl, dlmalloc, the Emscripten runtime) is verified by definition: its behavior is already documented in the public source. The reconstruction can be checked directly against that real source code. This is available today for Oracle-matched functions, at no additional toolchain cost.

Verification in the confidence economy

Every symbol written to the KB carries a provenance and a confidence score. The verifier controls whether an agent proposal is promoted:

Oracle matches (provenance="oracle") are inherently verified against compiled ground truth. They receive the highest confidence.
Agent proposals (provenance="agent") are gated by the verifier before write-back. Until behavioral equivalence is confirmed, they carry a confidence below 1.0 and are marked unverified.
Human names (provenance="human", locked=True) are sovereign. The verifier does not touch them and they cannot be overwritten by agent passes.

No unverified annotation is ever silently promoted to ground truth from which subsequent carry-overs or agent passes propagate. Verification failure is always visible in the KB.

Summary of checks

Check	Dependency	Runs today	Guarantee
`verify_determinism`	Pure Python, stdlib only	Yes	`stable_id` byte-stability, which guards all carry-over
`warden.interp` differential execution	Pure Python, stdlib only	Yes	Corpus-bounded behavioral equivalence for the i32 subset
Structural validation	`wasm-validate` (WABT)	When installed	Input is a well-formed WebAssembly module
Differential harness	`wasm2c` or `w2c2` + `cc`	When installed	Corpus-bounded behavioral equivalence (full instruction set)
Oracle source verification	Emscripten Oracle corpus	Today, Oracle-matched functions only	Exact behavioral match to upstream source
Symbolic cross-check	SeeWasm	Future	Path-condition soundness for critical branches
Dynamic trace validation	Wasabi / Frida / DevTools	Future	Runtime dynCall and atomic ground truth

Phase 4: agents

Where the proposals the verifier gates actually come from.

Phase 3: diff

How verified annotations carry forward when a new version ships.

​Three honest axes of “100%”

​Running the check

​What runs today: verify_determinism

​How it works

​Why it matters for carry-over

​Differential execution (runs today)

​What the interpreter covers

​Running one function

​Differential execution

​Three-tier verification model

​The optional differential harness

​Tooling detection: tooling_status()

​The plan: differential_plan(wasm_path)

​Activating the harness

​What behavioral equivalence actually claims

​Future dynamic ground-truth hooks

​Verification in the confidence economy

​Summary of checks

Phase 4: agents

Phase 3: diff

Three honest axes of “100%”

Running the check

What runs today: `verify_determinism`

How it works

Why it matters for carry-over

Differential execution (runs today)

What the interpreter covers

Running one function

Differential execution

Three-tier verification model

The optional differential harness

Tooling detection: `tooling_status()`

The plan: `differential_plan(wasm_path)`

Activating the harness

What behavioral equivalence actually claims

Future dynamic ground-truth hooks

Verification in the confidence economy

Summary of checks