Ingestion & normalization

Ingestion is Phase 0, the mandatory first step before the Oracle, agents, or diff can run. warden ingest reads the raw binary, recovers everything the file reveals for free, and writes the result into the knowledge base. By the time the command returns, every function already has a stable identity, and many have a name.

Overview

Parse the .wasm binary

A pure-Python WASM binary parser (ingest/wasm.py) reads the file section by section and produces a Module: a structured object containing every function, type, import, export, memory, element segment, data segment, and the name custom section.

Parse the Emscripten JS glue (optional)

If you pass --glue, the regex-driven JS glue parser (ingest/jsglue.py) extracts the Emscripten version string, dynCall_* signatures, exported symbol names, import bindings, and pthread / PROXY_TO_PTHREAD markers. None of these live inside the .wasm itself.

Fingerprint every function

identity.fingerprint_function() derives four complementary fingerprints from each Function and writes them into the functions table alongside a stable_id, the carry-over key used by every later phase. See core concepts for how the fingerprints work.

Seed names for free

Before the Oracle or agents ever run, project.ingest_into_kb() mines the name section, export table, and import table and writes initial symbols into the KB for every function where a name is recoverable at high confidence.

The WASM binary parser

ingest/wasm.py implements a complete pure-Python reader of the WebAssembly binary format. No native toolchain (no WABT, no Binaryen, no system libraries) is required.

What gets parsed

The parser walks every standard section in a single left-to-right pass:

Section	What is recovered
Type	Function type signatures such as `(i32, i32) -> i32`. Free in every module, including stripped ones.
Import	All imports, keyed by `module.field`. Imported functions occupy the low function-index slots.
Function	Type-index assignments for locally defined functions.
Memory	Limits (minimum/maximum pages), shared flag (threads), and memory64 flag.
Global	Mutable/immutable globals with their init expressions.
Export	All exported names and the function indices they point to.
Element	Table initializers; `ref.func` targets are resolved to indices where statically known.
Code	Per-function locals and body bytes; each body is immediately passed to the disassembler.
Data	Active and passive data segments; active segments carry a constant-expression offset so string extraction can compute absolute addresses.
Name (custom)	The `name` custom section (see below).
Other custom	Preserved verbatim in `module.custom_sections` for downstream use.

After the pass, export names are attached to their Function objects so the module.function_name() resolver can apply the name-section > export > import priority chain.
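The single left-to-right pass can be pictured with a minimal sketch. This is not the code in ingest/wasm.py; the helper names `read_uleb128` and `walk_sections` are hypothetical, and the sketch only locates each section rather than decoding its contents. It relies on the standard binary layout: a 4-byte magic, a 4-byte version, then a sequence of sections, each with a one-byte id and a LEB128-encoded payload size.

```python
import io

WASM_MAGIC = b"\x00asm"

# Section ids as defined by the WebAssembly core specification.
SECTION_NAMES = {
    0: "custom", 1: "type", 2: "import", 3: "function", 4: "table",
    5: "memory", 6: "global", 7: "export", 8: "start", 9: "element",
    10: "code", 11: "data", 12: "data_count",
}


def read_uleb128(stream: io.BytesIO) -> int:
    """Decode one unsigned LEB128 integer from the stream."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise ValueError("truncated LEB128 value")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result
        shift += 7


def walk_sections(data: bytes) -> list:
    """Return (section name, payload offset, payload size) for each section."""
    if data[:4] != WASM_MAGIC:
        raise ValueError("not a WebAssembly binary")
    version = int.from_bytes(data[4:8], "little")
    if version != 1:
        raise ValueError(f"unsupported version {version}")
    stream = io.BytesIO(data)
    stream.seek(8)
    sections = []
    while stream.tell() < len(data):
        section_id = stream.read(1)[0]
        size = read_uleb128(stream)
        offset = stream.tell()
        if offset + size > len(data):
            raise ValueError("section overruns the file")
        name = SECTION_NAMES.get(section_id, f"unknown({section_id})")
        sections.append((name, offset, size))
        stream.seek(offset + size)  # skip the payload; a real parser decodes it here
    return sections


# A tiny module: header plus one type section declaring the signature () -> ().
tiny = WASM_MAGIC + b"\x01\x00\x00\x00" + b"\x01\x04\x01\x60\x00\x00"
print(walk_sections(tiny))  # [('type', 10, 4)]
```

Because every section carries its own size, a reader can skip sections it does not understand, which is what makes preserving unknown custom sections verbatim straightforward.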

Name resolution priority

def function_name(self, func: Function) -> str | None:
    """Best available name: name section > export > import field."""
    if func.index in self.names.function_names:
        return self.names.function_names[func.index]
    if func.export_names:
        return func.export_names[0]
    if func.is_import and func.import_field:
        return f"{func.import_module}.{func.import_field}"
    return None

The name section is ground truth when present. Exports are next. Import field names are a fallback.
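To see the priority chain in action without a real module, here is a self-contained sketch. The `Function` dataclass is a stand-in with only the fields the resolver reads, and `resolve_name` is a hypothetical free-function copy of the method above; neither is the actual warden class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Function:
    """Stand-in holding only the attributes the resolver looks at."""
    index: int
    export_names: List[str] = field(default_factory=list)
    is_import: bool = False
    import_module: Optional[str] = None
    import_field: Optional[str] = None


def resolve_name(function_names: Dict[int, str], func: Function) -> Optional[str]:
    """Same priority chain: name section > export > import field."""
    if func.index in function_names:
        return function_names[func.index]
    if func.export_names:
        return func.export_names[0]
    if func.is_import and func.import_field:
        return f"{func.import_module}.{func.import_field}"
    return None


names = {7: "parse_header"}  # pretend name-section contents

named_and_exported = Function(index=7, export_names=["_parse"])
exported_only = Function(index=8, export_names=["_main", "main"])
imported = Function(index=0, is_import=True,
                    import_module="env", import_field="emscripten_resize_heap")
anonymous = Function(index=9)

print(resolve_name(names, named_and_exported))  # parse_header
print(resolve_name(names, exported_only))       # _main
print(resolve_name(names, imported))            # env.emscripten_resize_heap
print(resolve_name(names, anonymous))           # None
```

Note that a function with several export names resolves to the first one, and a function with no evidence at all resolves to None rather than a placeholder.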

The name section

ingest/names.py parses the WebAssembly name custom section. Emscripten keeps this section in debug and --profiling-funcs builds unless you explicitly strip it (e.g. with wasm-strip), and in those builds it is the richest source of high-fidelity names. Three subsections are parsed:

Module name (subsection 0): the module-level name string.
Function names (subsection 1): a map from function index to name. This is the primary source for function_names.
Local names (subsection 2): per-function local-variable names. Useful for the agent crew when constructing context.

Later subsections (labels, types, tables) are skipped but recorded in NameSection.present_subsections. A malformed subsection does not abort the parse. It is silently skipped and the rest of the section is still read.

In a production Emscripten build compiled with -O2 or higher and without --profiling-funcs, the name section is often absent or covers only a handful of exports. The parser handles both cases gracefully: when the section is absent, NameSection.present returns False and no name-section seeds are written; when it is partial, only the functions it covers receive name-section seeds.

Full opcode disassembly

ingest/opcodes.py contains the instruction decoder. It covers the MVP plus every extension that Emscripten actually emits:

Sign-extension ops

i32.extend8_s, i64.extend32_s, and friends.

Non-trapping float-to-int

The 0xFC prefix family (sub-opcodes 0–7).

Bulk memory

memory.init, data.drop, memory.copy, memory.fill, table.* (also 0xFC).

Reference types & tail calls

ref.null, ref.is_null, ref.func, return_call, return_call_indirect.

Threads / atomics

The full 0xFE prefix family: atomic.fence, notify/wait, and all load/store/rmw/cmpxchg variants. Central to the -pthread story.

SIMD

The 0xFD prefix family, with per-sub-opcode immediate layouts (memarg, lane byte, 16-byte shuffles, and no-immediate ops).

Each decoded Instruction records its offset and size inside the function body, its opcode (first byte) and sub_opcode for prefixed families, a mnemonic, an opcode klass (used for the histogram fingerprint), and the parsed immediates.

The degradation path: an opcode whose immediate layout the decoder does not model raises UnsupportedOpcode. The code section parser catches that exception and sets func.disasm_error instead of aborting:

try:
    func.instructions = _disassemble_body(code_bytes)
except (UnsupportedOpcode, IndexError, ValueError) as exc:
    func.disasm_error = str(exc)

A function with disasm_error set has instructions = None but still has its full raw body bytes. It is still byte-fingerprintable via exact_hash and contributes to the KB with whatever fingerprints can be derived. The KB notes the error so downstream analysis knows the disassembly is incomplete.

If you are targeting a module built with a very recent WASM proposal or a custom toolchain extension, you may see disasm_error on some functions. Those functions are not lost. They carry an exact body hash and a structural hash derived from the bytes we did decode, but the opcode-histogram and MinHash fingerprints will be partial. File an issue with the offending opcodes if you need them modeled.
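The decode-or-degrade pattern from this section can be tried end to end with a toy decoder. Everything below is a sketch under assumptions: the opcode tables cover only a handful of instructions, `FunctionRecord` and `disassemble_or_degrade` are hypothetical names, the input is bare instruction bytes (no locals declarations), and SHA-256 stands in for whatever digest exact_hash actually uses.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


class UnsupportedOpcode(Exception):
    """Raised when the decoder meets an opcode it does not model."""


@dataclass
class Instruction:
    offset: int
    size: int
    opcode: int
    sub_opcode: Optional[int]
    mnemonic: str
    immediates: Tuple[int, ...]


# Toy tables: opcode -> (mnemonic, number of unsigned LEB128 immediates).
PLAIN = {0x01: ("nop", 0), 0x0B: ("end", 0), 0x10: ("call", 1),
         0x20: ("local.get", 1), 0x6A: ("i32.add", 0)}
PREFIXED_FC = {0x0A: ("memory.copy", 2), 0x0B: ("memory.fill", 1)}


def read_uleb128(body: bytes, pos: int) -> Tuple[int, int]:
    result = 0
    shift = 0
    while True:
        byte = body[pos]  # IndexError signals a truncated body
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def disassemble(body: bytes) -> List[Instruction]:
    out = []
    pos = 0
    while pos < len(body):
        start = pos
        opcode = body[pos]
        pos += 1
        sub = None
        if opcode == 0xFC:  # prefixed family: the sub-opcode follows as LEB128
            sub, pos = read_uleb128(body, pos)
            entry = PREFIXED_FC.get(sub)
        else:
            entry = PLAIN.get(opcode)
        if entry is None:
            raise UnsupportedOpcode(f"unmodeled opcode 0x{opcode:02x} at offset {start}")
        mnemonic, n_imm = entry
        imms = []
        for _ in range(n_imm):
            value, pos = read_uleb128(body, pos)
            imms.append(value)
        out.append(Instruction(start, pos - start, opcode, sub, mnemonic, tuple(imms)))
    return out


@dataclass
class FunctionRecord:
    body: bytes
    instructions: Optional[List[Instruction]] = None
    disasm_error: Optional[str] = None

    @property
    def exact_hash(self) -> str:
        # Assumption: any stable digest of the raw bytes works for illustration.
        return hashlib.sha256(self.body).hexdigest()


def disassemble_or_degrade(body: bytes) -> FunctionRecord:
    func = FunctionRecord(body)
    try:
        func.instructions = disassemble(body)
    except (UnsupportedOpcode, IndexError, ValueError) as exc:
        func.disasm_error = str(exc)  # keep the function, note the failure
    return func


good = disassemble_or_degrade(bytes([0x20, 0x00, 0x20, 0x01, 0x6A, 0x0B]))
bad = disassemble_or_degrade(bytes([0x20, 0x00, 0xFB, 0x00, 0x0B]))
print([ins.mnemonic for ins in good.instructions])
print(bad.instructions, bad.disasm_error, bad.exact_hash[:12])
```

The second body contains 0xFB, which the toy tables do not cover, so the record ends up with no instruction list, an error string, and a body hash that is still usable for exact matching.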

The Emscripten JS glue parser

The JS file Emscripten emits alongside the .wasm is what jsglue.py calls the Rosetta stone of the module. Facts that do not exist inside the binary at all live only in the glue:

Fact	How it is extracted
Emscripten version	Four regex patterns covering `EMSCRIPTEN_VERSION`, `@emscripten/X.Y.Z`, `GENERATED_BY`, and inline package references.
`dynCall_*` signatures	All `dynCall_<sig>` identifiers found anywhere in the file; the signature suffix (e.g. `viii`, `iji`) describes the indirect-call ABI.
Exported symbols	`Module['_foo']` and `wasmExports['bar']` patterns, plus the `asm[...]` form from older glue.
Import bindings	Object literal member patterns like `_emscripten_memcpy_js: _emscripten_memcpy_js`.
pthread / threading	Presence of `PThread`, `pthread_create`, `_emscripten_proxy`, `PROXY_TO_PTHREAD`, `proxyToMainThread`, `spawnThread`, `worker.js`, `_emscripten_thread_init`.
`PROXY_TO_PTHREAD`	Checked explicitly; activates a separate flag on `GlueInfo`.
Memory growth	`ALLOW_MEMORY_GROWTH`, `_emscripten_resize_heap`, `growMemory`.

The parser is intentionally regex-driven and tolerant. Glue shape shifts across Emscripten versions and optimization levels. A partial, honest GlueInfo is preferable to a hard failure on an unfamiliar layout. If no version string is found, GlueInfo.notes records that explicitly. Minification is detected via a crude heuristic: if the average line length exceeds 200 characters, GlueInfo.minified = True. This signals that symbol extraction may be incomplete.

from warden.ingest import parse_glue_file

info = parse_glue_file("app.js")
print(info.emscripten_version)    # e.g. "3.1.55"
print(info.dyncall_signatures)    # e.g. ["ii", "iii", "viii"]
print(info.uses_pthreads)         # True / False
print(info.proxy_to_pthread)      # True / False

The glue is optional. If you only have the .wasm, ingestion still works. You lose the version string, dynCall signatures, and threading-model facts, but the binary is still fully parsed. Pass --glue whenever the glue file is available; it sharpens every downstream heuristic.
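A rough idea of what regex-driven, tolerant extraction looks like is sketched below. The patterns are illustrative assumptions, not the ones in ingest/jsglue.py, and `scan_glue` is a hypothetical helper that returns a plain dict rather than a GlueInfo. It covers dynCall signatures, the `Module['_foo']` export form, a few of the threading markers listed in the table, and the average-line-length minification heuristic.

```python
import re

# Illustrative patterns only; the real parser's regexes may differ.
DYNCALL_RE = re.compile(r"dynCall_([vijfdp]+)\b")
MODULE_EXPORT_RE = re.compile(r"""Module\[\s*['"](_[A-Za-z0-9_$]+)['"]\s*\]""")
PTHREAD_MARKERS = ("PThread", "pthread_create", "PROXY_TO_PTHREAD", "spawnThread")


def looks_minified(source: str, threshold: int = 200) -> bool:
    """Crude heuristic: average line length above the threshold."""
    lines = source.splitlines() or [""]
    return sum(len(line) for line in lines) / len(lines) > threshold


def scan_glue(source: str) -> dict:
    """Collect whatever facts the patterns find; missing facts stay empty."""
    return {
        "dyncall_signatures": sorted(set(DYNCALL_RE.findall(source))),
        "exports": sorted(set(MODULE_EXPORT_RE.findall(source))),
        "uses_pthreads": any(marker in source for marker in PTHREAD_MARKERS),
        "proxy_to_pthread": "PROXY_TO_PTHREAD" in source,
        "minified": looks_minified(source),
    }


sample = '''
var dynCall_viii = Module["dynCall_viii"] = createExportWrapper("dynCall_viii");
var dynCall_ii = Module["dynCall_ii"] = createExportWrapper("dynCall_ii");
var _main = Module["_main"] = createExportWrapper("main");
'''

info = scan_glue(sample)
print(info["dyncall_signatures"])  # ['ii', 'viii']
print(info["exports"])             # ['_main']
print(info["uses_pthreads"], info["minified"])  # False False
```

Each fact is extracted independently, so an unfamiliar glue layout degrades one field at a time instead of failing the whole parse.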

What ingest seeds for free

project.ingest_into_kb() calls _seed_symbol_for() for every function immediately after fingerprinting. The function mines the three tiers of free naming evidence and writes an initial Symbol into the KB if any evidence is present:

Evidence tier	Provenance	Confidence	Condition
Name section	`export`	0.90	Function index appears in `NameSection.function_names`
Export name	`export`	0.85	Function has at least one export name
Import field	`import`	0.80	Function is imported (`is_import = True`)

These seeds go through kb.upsert_symbol(), the same economy-gated write path used by the Oracle and agents, so they cannot be accidentally overwritten by lower-authority sources later. A debug build that retains its name section can reach substantial KB coverage before you run a single Oracle or agent pass.
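The three tiers in the table can be expressed as a small decision function. This is a sketch, not the body of _seed_symbol_for(): `Seed` and `seed_for` are hypothetical names, and it assumes that only the highest-ranking tier produces a seed and that import seeds use the same `module.field` form as function_name(). The provenance and confidence values are copied from the table.

```python
from typing import Dict, List, NamedTuple, Optional


class Seed(NamedTuple):
    name: str
    provenance: str
    confidence: float


def seed_for(index: int,
             name_section: Dict[int, str],
             export_names: List[str],
             is_import: bool = False,
             import_module: Optional[str] = None,
             import_field: Optional[str] = None) -> Optional[Seed]:
    """Pick the strongest free naming evidence for one function, if any."""
    if index in name_section:
        return Seed(name_section[index], "export", 0.90)
    if export_names:
        return Seed(export_names[0], "export", 0.85)
    if is_import and import_field:
        return Seed(f"{import_module}.{import_field}", "import", 0.80)
    return None  # no free evidence: left for the Oracle and agents


name_section = {12: "png_read_row"}

print(seed_for(12, name_section, ["_png_read_row"]))
print(seed_for(13, name_section, ["_main"]))
print(seed_for(0, name_section, [], True, "env", "abort"))
print(seed_for(14, name_section, []))  # None
```

A function that has both a name-section entry and an export name is seeded once, at the higher confidence.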

Using `warden ingest`

warden init                                          # create warden.db
warden ingest app_v1.wasm --glue app_v1.js --label v1

Flags:

<wasm>: path to the .wasm file (required).
--glue, -g: optional Emscripten .js glue file.
--label, -l: version label; defaults to the filename stem.
--notes: free-text note stored with this version in the KB.
--db: project database path (default warden.db, or WARDEN_DB env var).
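The two defaults in the flag list can be illustrated with a short sketch. The helper names are hypothetical, and the precedence shown for the database path (explicit flag, then WARDEN_DB, then warden.db) is an assumption; the list above only says that both the default and the environment variable exist.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_db_path(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Assumed precedence: --db flag, then WARDEN_DB, then 'warden.db'."""
    env = os.environ if env is None else env
    return cli_value or env.get("WARDEN_DB") or "warden.db"


def resolve_label(wasm_path: str, cli_value: Optional[str] = None) -> str:
    """--label wins; otherwise fall back to the filename stem."""
    return cli_value or Path(wasm_path).stem


print(resolve_label("builds/app_v1.wasm"))         # app_v1
print(resolve_label("builds/app_v1.wasm", "v1"))   # v1
print(resolve_db_path(env={}))                     # warden.db
print(resolve_db_path(env={"WARDEN_DB": "ci.db"})) # ci.db
```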

The command prints a summary table:

              Ingested 'v1'
 ─────────────────────────────────
  functions        1 847
    imported         312
    defined        1 535
  seeded symbols     427
  emscripten       3.1.55
  shared memory    yes

If pthread markers are found in the glue, a warning is printed, because the concurrency model affects several downstream analysis steps.

Using `ingest_into_kb` from Python

from warden.kb import KnowledgeBase
from warden.project import ingest_into_kb

kb = KnowledgeBase("warden.db")

result = ingest_into_kb(
    kb,
    "app_v1.wasm",
    label="v1",
    glue_path="app_v1.js",   # optional
    notes="initial analysis",
)

print(result.num_functions)    # total functions (imported + defined)
print(result.num_imported)     # functions from the import section
print(result.seeded_symbols)   # names written for free before Oracle/agents
print(result.emscripten_version)  # from glue, or None
print(result.shared_memory)    # True if shared-memory flag is set in the binary

kb.close()

IngestResult.glue_info holds the full GlueInfo object if a glue file was parsed. IngestResult.notes carries any parser warnings (e.g. missing version string, empty export list from a minified glue).

You can also parse a module without touching the KB:

from warden.ingest import parse_file

module = parse_file("app_v1.wasm")

print(module.version)              # should be 1 for all current modules
print(len(module.functions))       # all functions, imported and defined
print(module.shared_memory())      # True if threads are in use
print(module.function_name(module.functions[42]))  # best available name

# Inspect one function's disassembly
f = module.defined_functions[0]
if f.instructions is not None:
    for ins in f.instructions[:5]:
        print(ins.mnemonic, ins.immediates)
elif f.disasm_error:
    print("partial disassembly:", f.disasm_error)
    # f.body still contains the raw bytes

# Extract strings from active data segments
for addr, s in module.strings(minimum_length=4):
    print(f"0x{addr:08x}  {s!r}")
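For intuition about how absolute string addresses fall out of an active data segment, here is a minimal sketch. It is not the implementation behind module.strings(): `extract_strings` is a hypothetical helper, and it assumes a string is any run of printable ASCII of at least the minimum length, with the address computed as the segment's constant offset plus the position of the run inside the segment.

```python
import re
from typing import Iterator, Tuple


def extract_strings(segment: bytes, base_address: int,
                    minimum_length: int = 4) -> Iterator[Tuple[int, str]]:
    """Yield (absolute address, text) for printable-ASCII runs in one segment."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % minimum_length)
    for match in pattern.finditer(segment):
        yield base_address + match.start(), match.group().decode("ascii")


# Pretend this segment is active with a constant offset of 1024.
segment = b"\x00\x01hello\x00ab\x00wasm ok\x00"

for addr, text in extract_strings(segment, base_address=1024):
    print(f"0x{addr:08x}  {text!r}")
# 0x00000402  'hello'
# 0x0000040b  'wasm ok'
```

Passive segments have no offset expression, so a scheme like this can only report segment-relative positions for them.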

What comes next

Ingestion does not run the Oracle, agents, or diff. Those are separate commands you layer on top. The typical sequence after ingest:

Oracle identification

Match every defined function against a corpus of labeled Emscripten/musl/libc signatures to collapse runtime code instantly.

Agent crew

Propose names for the application-specific remainder, gated by the provenance economy.

Diff & carry-over

When a new version ships, ingest it and diff against the previous label.

Export

Emit a C header, pseudocode listing, git-diffable text, or a Ghidra rename script.
