The WARDEN knowledge base (KB) is a single portable SQLite file (one per project), opened with KnowledgeBase("project.db"). The schema is applied (idempotently via CREATE TABLE IF NOT EXISTS) on every open from src/warden/kb/schema.sql, loaded at runtime through importlib.resources. No migration runner is required. Two SQLite PRAGMAs are set at open time: PRAGMA journal_mode = WAL (concurrent readers alongside the single writer, which matters when warden mcp and a local agent run simultaneously) and PRAGMA foreign_keys = ON (version-scoped tables cascade-delete when a module_versions row is removed).
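The open-time behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not WARDEN's actual code: the open_kb helper and the one-table SCHEMA string are hypothetical stand-ins for KnowledgeBase and src/warden/kb/schema.sql.

```python
import os
import sqlite3
import tempfile

# Hypothetical one-table stand-in for the real schema.sql.
SCHEMA = """
CREATE TABLE IF NOT EXISTS meta (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
"""

def open_kb(path):
    """Open a project database, set both PRAGMAs, and apply the schema idempotently."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")   # readers run alongside the single writer
    conn.execute("PRAGMA foreign_keys = ON")    # enables ON DELETE CASCADE enforcement
    conn.executescript(SCHEMA)                  # safe to repeat: CREATE TABLE IF NOT EXISTS
    return conn

db_path = os.path.join(tempfile.mkdtemp(), "project.db")

# First open creates the table and stores a value.
first = open_kb(db_path)
first.execute("INSERT OR IGNORE INTO meta VALUES ('schema_version', '1')")
first.commit()
first.close()

# Second open re-applies the schema without touching existing data.
second = open_kb(db_path)
journal_mode = second.execute("PRAGMA journal_mode").fetchone()[0]
foreign_keys = second.execute("PRAGMA foreign_keys").fetchone()[0]
schema_version = second.execute(
    "SELECT value FROM meta WHERE key = 'schema_version'"
).fetchone()[0]
second.close()
```

WAL mode needs a file-backed database, which is why the sketch uses a temporary file rather than an in-memory connection.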
Status: alpha. The schema is functional and exercised end-to-end by warden demo. Additional columns (DWARF source info, wasm2c harness references) are tracked in the roadmap.

The linchpin design choice

Annotations are keyed on a function’s stable content identity, not on a per-version row-ID.
The functions table is the per-version appearance log: it records that function index 42 appeared in version v1 with a given stable_id. The symbols table is the durable annotation layer: it holds names, types, summaries, and evidence keyed on that same stable_id. When a vendor ships v2.wasm and every table index shifts, WARDEN re-ingests and re-fingerprints. A function that was index 42 in v1 might become index 51 in v2, but its stable_id (a composite content hash) stays the same. The annotation in symbols is still there, automatically. That is the carry-over mechanism that makes RE incremental. See core concepts for how stable_id is computed from structural, semantic, fuzzy, and exact fingerprints.
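To make the carry-over concrete, here is a small sketch using heavily simplified, assumed versions of the two tables (the real schema has more columns and uses version_id rather than a label column). The stable_id values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the real tables.
CREATE TABLE functions (version TEXT, func_index INTEGER, stable_id TEXT);
CREATE TABLE symbols   (stable_id TEXT PRIMARY KEY, name TEXT);
""")

# The same function body appears at index 42 in v1 and index 51 in v2.
conn.executemany(
    "INSERT INTO functions VALUES (?, ?, ?)",
    [("v1", 42, "a3f9"), ("v2", 51, "a3f9"), ("v2", 52, "ffff")],
)

# One annotation, written while analysing v1, keyed on stable_id only.
conn.execute("INSERT INTO symbols VALUES ('a3f9', 'parse_token')")

# Looking at v2: the annotation is found again through the stable_id join,
# while the genuinely new function (stable_id 'ffff') has no name yet.
rows = conn.execute("""
    SELECT f.func_index, s.name
    FROM functions AS f
    LEFT JOIN symbols AS s ON s.stable_id = f.stable_id
    WHERE f.version = 'v2'
    ORDER BY f.func_index
""").fetchall()
```

No row in symbols was touched when v2 was added, which is the point: the index shift from 42 to 51 is invisible to the annotation layer.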

Table reference

meta

Lightweight key/value store for project-level metadata. Accessed via KnowledgeBase.get_meta() and set_meta().
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| key | TEXT | PRIMARY KEY | Arbitrary metadata key. |
| value | TEXT | NOT NULL | Current keys: schema_version (set to "1" on init) and project. |

module_versions

One row per ingested .wasm file, representing a point in the target’s version history. The id here is the foreign key anchor for functions, thread_model, and diffs.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | Surrogate; referenced throughout. |
| label | TEXT | NOT NULL UNIQUE | Human-readable name, e.g. "v1". Set via --label on ingest. |
| wasm_path | TEXT | | Recorded path to the original .wasm. |
| glue_path | TEXT | | Path to the accompanying Emscripten JS glue, if any. |
| wasm_sha256 | TEXT | NOT NULL | SHA-256 of the raw bytes; guards against re-ingesting a changed file under the same label. |
| emscripten_version | TEXT | | Inferred from Oracle matches or the JS glue (e.g. "3.1.55"). |
| inferred_flags | TEXT | | JSON object: Oracle/glue-inferred build flags (-O2, -pthread, etc.). |
| glue_info | TEXT | | JSON: parsed GlueInfo from the Emscripten JS glue (export map, dynCall signatures, pthread shape). |
| num_functions | INTEGER | DEFAULT 0 | Total functions (imported + defined). |
| num_imported | INTEGER | DEFAULT 0 | Imported function count. |
| shared_memory | INTEGER | DEFAULT 0 | 1 if the module uses shared memory (pthreads). |
| ingested_at | TEXT | DEFAULT CURRENT_TIMESTAMP | UTC timestamp of ingest. |
| notes | TEXT | | Free-text annotation set via --notes on ingest. |
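The role of wasm_sha256 as a guard can be sketched as follows. This is an assumption-laden illustration: the check_label helper, the in-memory dict standing in for module_versions, and the choice to raise ValueError are all hypothetical, since the source does not describe how WARDEN reports the mismatch.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw module bytes, as stored in wasm_sha256."""
    return hashlib.sha256(data).hexdigest()

def check_label(recorded: dict, label: str, data: bytes) -> str:
    """Refuse to reuse a label for different bytes (hypothetical behaviour)."""
    digest = sha256_hex(data)
    previous = recorded.get(label)
    if previous is not None and previous != digest:
        raise ValueError(f"label {label!r} already recorded with a different SHA-256")
    recorded[label] = digest
    return digest

recorded = {}                                    # stand-in for module_versions rows
first = check_label(recorded, "v1", b"\0asm\x01\0\0\0")
again = check_label(recorded, "v1", b"\0asm\x01\0\0\0")   # same bytes: accepted

try:
    check_label(recorded, "v1", b"\0asm\x01\0\0\0changed")
    changed_rejected = False
except ValueError:
    changed_rejected = True
```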

functions

The per-version appearance log. Each row records one function as it appeared in one version, with all fingerprinting data needed for identity matching and Oracle lookup. This table does not hold human annotations. Those live in symbols.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | Referenced by oracle_matches. |
| version_id | INTEGER | NOT NULL FK → module_versions.id ON DELETE CASCADE | Which version this appearance belongs to. |
| func_index | INTEGER | NOT NULL | WebAssembly function index within this version. |
| stable_id | TEXT | NOT NULL | The carry-over key. Content identity hash. See idx_functions_stable. |
| exact_hash | TEXT | NOT NULL | Hash of the raw instruction bytes; changes on any edit. |
| structural_hash | TEXT | NOT NULL | CFG-shape hash; robust to constant and immediate changes. See idx_functions_struct. |
| minhash | TEXT | NOT NULL | JSON int array (MinHash signature for fuzzy/Jaccard similarity). |
| histogram | TEXT | NOT NULL | JSON {opcode_class: count} (opcode-class frequency histogram). |
| call_targets | TEXT | NOT NULL | JSON list of import names this function calls. |
| local_calls | INTEGER | DEFAULT 0 | Number of calls to other locally-defined functions. |
| type_signature | TEXT | | WebAssembly type signature (always present for defined functions). |
| instruction_count | INTEGER | DEFAULT 0 | Total instructions in the function body. |
| body_size | INTEGER | DEFAULT 0 | Raw byte size of the function body. |
| is_import | INTEGER | DEFAULT 0 | 1 for imported functions (no body; just a type stub). |
| raw_name | TEXT | | Name as found in the name section, export table, or import descriptor, if any. |
Unique constraint: (version_id, func_index). One row per function per version. Indices: idx_functions_stable on stable_id; idx_functions_version on version_id; idx_functions_struct on structural_hash.
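As an illustration of how a minhash column can support fuzzy matching, the sketch below applies the standard MinHash estimator: the fraction of positions where two equal-length signatures agree approximates the Jaccard similarity of the underlying sets. The estimated_jaccard helper and the four-element signatures are hypothetical; WARDEN's real signature length and comparison logic are not specified here.

```python
import json

def estimated_jaccard(sig_a_json: str, sig_b_json: str) -> float:
    """Estimate Jaccard similarity from two MinHash signatures stored as JSON int arrays."""
    sig_a = json.loads(sig_a_json)
    sig_b = json.loads(sig_b_json)
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    # Each position is one hash function; agreement means the same minimum was seen.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Toy signatures as they might be read from two functions rows.
v1_minhash = "[17, 204, 9, 88]"
v2_minhash = "[17, 204, 31, 88]"
similarity = estimated_jaccard(v1_minhash, v2_minhash)
```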

symbols

The durable annotation layer. Keyed on stable_id so that an annotation created against v1 is automatically present when the same function appears in v2, v3, and beyond. All writes go through KnowledgeBase.upsert_symbol(), which enforces the provenance/confidence economy (see below).
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| stable_id | TEXT | NOT NULL | Content identity. Indexed (idx_symbols_stable). |
| kind | TEXT | NOT NULL DEFAULT 'function' | 'function', 'global', 'struct', or 'type'. |
| name | TEXT | | Recovered or assigned name. |
| type_signature | TEXT | | Inferred or confirmed type. |
| summary | TEXT | | Free-text description of what the function does. |
| provenance | TEXT | NOT NULL | Who produced this annotation. See Provenance ranking. |
| confidence | REAL | NOT NULL DEFAULT 0.0 | 0.0–1.0. Drives the write-economy gate. |
| evidence | TEXT | | JSON list of {kind, detail} objects (the evidence trail). |
| source_ref | TEXT | | Upstream source URL or file/line (e.g. a musl commit URL). Set for Oracle-confirmed runtime code. |
| locked | INTEGER | DEFAULT 0 | 1 means human-verified; immutable to any automated write. |
| created_at | TEXT | DEFAULT CURRENT_TIMESTAMP | |
| updated_at | TEXT | DEFAULT CURRENT_TIMESTAMP | Updated on every accepted write. |
Unique constraint: (stable_id, kind). One annotation record per identity per kind.
Never insert directly into symbols. Always go through upsert_symbol(). A direct INSERT bypasses the economy gate and can silently clobber a human-verified, locked annotation.

structs

Recovered memory-region and struct layouts, stored as first-class structured data rather than free-text comments.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| name | TEXT | NOT NULL UNIQUE | Struct name, e.g. "pthread_mutex_t". |
| layout | TEXT | NOT NULL | JSON list of {offset, size, type, name} field descriptors. |
| provenance | TEXT | NOT NULL | Typically "agent" or "oracle". |
| confidence | REAL | NOT NULL DEFAULT 0.0 | |
| notes | TEXT | | Free-text notes. |
| updated_at | TEXT | DEFAULT CURRENT_TIMESTAMP | |
Written via KnowledgeBase.upsert_struct(), which uses ON CONFLICT(name) DO UPDATE.
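A minimal sketch of the ON CONFLICT(name) DO UPDATE pattern against a table shaped like structs. The free function upsert_struct below is a hypothetical stand-in for the KnowledgeBase method, and the set of columns it refreshes on conflict is an assumption. SQLite 3.24 or newer is required for this upsert syntax.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE structs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    name       TEXT NOT NULL UNIQUE,
    layout     TEXT NOT NULL,
    provenance TEXT NOT NULL,
    confidence REAL NOT NULL DEFAULT 0.0,
    notes      TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

def upsert_struct(conn, name, layout, provenance, confidence=0.0, notes=None):
    """Insert a struct layout, or update the existing row with the same name."""
    conn.execute(
        """
        INSERT INTO structs (name, layout, provenance, confidence, notes)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            layout     = excluded.layout,
            provenance = excluded.provenance,
            confidence = excluded.confidence,
            notes      = excluded.notes,
            updated_at = CURRENT_TIMESTAMP
        """,
        (name, json.dumps(layout), provenance, confidence, notes),
    )

# First write creates the row; the second refines the same struct in place.
upsert_struct(conn, "pthread_mutex_t",
              [{"offset": 0, "size": 4, "type": "i32", "name": "lock"}],
              "agent", 0.5)
upsert_struct(conn, "pthread_mutex_t",
              [{"offset": 0, "size": 4, "type": "i32", "name": "lock"},
               {"offset": 4, "size": 4, "type": "i32", "name": "waiters"}],
              "oracle", 0.9)

row_count = conn.execute("SELECT COUNT(*) FROM structs").fetchone()[0]
layout_json, provenance = conn.execute(
    "SELECT layout, provenance FROM structs WHERE name = 'pthread_mutex_t'"
).fetchone()
fields = json.loads(layout_json)
```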

thread_model

The concurrency spine: lock-to-guarded-data relationships, atomic sites, TLS references, and worker entry points. Version-scoped because the threading model may change between releases.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| version_id | INTEGER | FK → module_versions.id ON DELETE CASCADE | |
| kind | TEXT | NOT NULL | 'lock', 'atomic', 'tls', or 'worker-entry'. |
| site | TEXT | | Function stable_id or address where the site was observed. |
| guarded_data | TEXT | | Named memory region this lock protects (for kind='lock'). |
| detail | TEXT | | Additional free-text detail. |
| provenance | TEXT | NOT NULL DEFAULT 'agent' | |
| confidence | REAL | NOT NULL DEFAULT 0.0 | |
Written via KnowledgeBase.add_thread_fact(); retrieved via thread_facts(version_id).

oracle_matches

Records of individual Oracle identifications: a function fingerprint matched to a known upstream runtime or libc function. References functions.id (not stable_id), so each match is version-specific: the same logical function may match at a different func_index in each version, generating a separate row.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| function_id | INTEGER | NOT NULL FK → functions.id ON DELETE CASCADE | Which per-version function row was matched. |
| matched_name | TEXT | NOT NULL | The real upstream name, e.g. "__stdio_write". |
| library | TEXT | | Source library: musl, emscripten, libc++, compiler-rt, or dlmalloc. |
| emscripten_version | TEXT | | Emscripten version range where this signature was observed. |
| opt_level | TEXT | | Optimization level of the corpus artifact, e.g. "-O2". |
| score | REAL | NOT NULL | Match score (0.0–1.0). |
| source_ref | TEXT | | Upstream source URL or file/line from the corpus artifact. |
Unique constraint: (function_id, matched_name). An Oracle match is a separate record from the symbols annotation it triggers. warden oracle identify writes both: the match record here, and a symbols row (provenance "oracle") via upsert_symbol.

diffs

Stored cross-version diff reports (the semantic changelogs produced by warden diff). One row per ordered pair of versions. The report column holds the full JSON classification (unchanged, structurally-equivalent, fuzzy-matched, added, removed) and carry-over counts.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| from_version_id | INTEGER | FK → module_versions.id ON DELETE CASCADE | Earlier version. |
| to_version_id | INTEGER | FK → module_versions.id ON DELETE CASCADE | Later version. |
| report | TEXT | NOT NULL | JSON diff report (function classifications, carry-over counts, changelog). |
| created_at | TEXT | DEFAULT CURRENT_TIMESTAMP | |
Unique constraint: (from_version_id, to_version_id).

audit_log

Append-only trail of every symbol write attempt, whether accepted or rejected. Rows are never modified, only inserted. Used to answer “who changed this name and when?” and “why was this agent proposal rejected?”.
| Column | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | Monotonically increasing. |
| stable_id | TEXT | | The identity that was affected. May be NULL for non-symbol actions. |
| action | TEXT | NOT NULL | 'created', 'updated', or 'rejected'. |
| actor | TEXT | NOT NULL | Provenance of the writer ('human', 'oracle', 'agent', etc.). |
| detail | TEXT | | Human-readable reason string returned by may_overwrite(). |
| created_at | TEXT | DEFAULT CURRENT_TIMESTAMP | UTC timestamp. |
Retrieve recent entries with KnowledgeBase.audit_log(limit=100).
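The append-only pattern can be sketched like this. The log_attempt and recent_entries helpers are hypothetical, and returning newest-first is an assumption about what "recent" means; only INSERT and SELECT statements are ever issued against the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # lets each row be converted to a dict
conn.execute("""
CREATE TABLE audit_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    stable_id  TEXT,
    action     TEXT NOT NULL,
    actor      TEXT NOT NULL,
    detail     TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

def log_attempt(conn, stable_id, action, actor, detail):
    """Append one write attempt; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO audit_log (stable_id, action, actor, detail) VALUES (?, ?, ?, ?)",
        (stable_id, action, actor, detail),
    )

def recent_entries(conn, limit=100):
    """Return the newest entries first, as plain dicts."""
    rows = conn.execute(
        "SELECT * FROM audit_log ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
    return [dict(row) for row in rows]

# An accepted oracle write followed by a rejected agent proposal.
log_attempt(conn, "a3f9c0b2", "created", "oracle", "new symbol")
log_attempt(conn, "a3f9c0b2", "rejected", "agent", "agent may not overwrite oracle")

latest = recent_entries(conn, limit=1)[0]
history = [e["action"] for e in recent_entries(conn)]
```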

Provenance ranking

Every symbol write carries a provenance label. PROVENANCE_RANK in src/warden/kb/models.py assigns an integer priority that the may_overwrite() gate uses to decide whether a new write may displace an existing annotation.
PROVENANCE_RANK = {
    "human":       100,
    "oracle":       90,
    "export":       60,
    "import":       55,
    "string-xref":  50,
    "diff-carry":   40,
    "agent":        30,
}
Any provenance string not present in this dict defaults to rank 10. Human is sovereign at the top; raw agent proposals sit at the bottom. The ranking is also shown in the core concepts table.
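The default-rank rule amounts to a dictionary lookup with a fallback. In this sketch the provenance_rank helper and the UNKNOWN_RANK name are hypothetical, and "fuzzer" is an invented label used only to show the fallback.

```python
PROVENANCE_RANK = {
    "human":       100,
    "oracle":       90,
    "export":       60,
    "import":       55,
    "string-xref":  50,
    "diff-carry":   40,
    "agent":        30,
}

UNKNOWN_RANK = 10  # hypothetical name for the documented default

def provenance_rank(label: str) -> int:
    """Rank for a provenance label; labels missing from the dict get the default."""
    return PROVENANCE_RANK.get(label, UNKNOWN_RANK)

# An unlisted label sorts below every known provenance, including "agent".
ordered = sorted(["agent", "fuzzer", "human", "oracle"],
                 key=provenance_rank, reverse=True)
```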

The may_overwrite() economy

may_overwrite() in src/warden/kb/models.py is the gatekeeper called by every KnowledgeBase.upsert_symbol() invocation. It receives the new write’s provenance and confidence alongside the existing annotation’s provenance, confidence, and lock state, and returns (allowed: bool, reason: str). The rules, exactly as implemented:
1. Empty slot: always allow

If existing_provenance is None the slot is empty and any write succeeds. Reason: "new symbol".
2. Human write: always allow

If new_provenance == "human" the write wins unconditionally, regardless of what already exists. Reason: "human override". A subsequent lock_symbol() call then sets locked = 1 to protect it.
3. Locked symbol: reject any non-human write

If existing_locked is true and the writer is not "human", the write is rejected. Reason: "existing symbol is locked (human-verified)". The locked flag is set by warden set-name (which locks by default) and by KnowledgeBase.lock_symbol().
4. Agent write: narrow overwrite rights

An agent may only write if one of two conditions holds:
  • The existing annotation is also "agent" and new_confidence > existing_confidence (strictly higher). Reason: "higher-confidence agent write".
  • The existing provenance has a rank strictly below 30 (the agent rank), i.e. an unknown source. Reason: "outranks existing automated source".
An agent can never overwrite oracle, export, import, string-xref, diff-carry, or human annotations. The rejection reason names the existing provenance and its confidence.
5. Other automated sources: rank then confidence

For all other provenances (oracle, export, import, string-xref, diff-carry):
  • Higher rank wins outright.
  • At equal rank, new_confidence >= existing_confidence is required.
  • Otherwise the existing annotation takes precedence.
Rejected writes are not errors. They return (False, reason) and are appended to audit_log with action='rejected'. This lets you re-run the entire agent crew on every update without clobbering any verified human work or high-confidence Oracle annotations.
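The five rules can be restated as a short function. This is a reconstruction from the prose above, not the code in src/warden/kb/models.py: the argument order, the default arguments, and the reason strings for rule 5 and for rejections are assumptions, as is routing unknown provenances through rule 5.

```python
PROVENANCE_RANK = {
    "human": 100, "oracle": 90, "export": 60, "import": 55,
    "string-xref": 50, "diff-carry": 40, "agent": 30,
}

def rank(provenance):
    return PROVENANCE_RANK.get(provenance, 10)  # unknown labels default to 10

def may_overwrite(new_provenance, new_confidence,
                  existing_provenance=None, existing_confidence=0.0,
                  existing_locked=False):
    """Return (allowed, reason) for a proposed symbol write."""
    # Rule 1: empty slot.
    if existing_provenance is None:
        return True, "new symbol"
    # Rule 2: a human write always wins.
    if new_provenance == "human":
        return True, "human override"
    # Rule 3: locked symbols reject every non-human writer.
    if existing_locked:
        return False, "existing symbol is locked (human-verified)"
    # Rule 4: agents have narrow overwrite rights.
    if new_provenance == "agent":
        if existing_provenance == "agent" and new_confidence > existing_confidence:
            return True, "higher-confidence agent write"
        if rank(existing_provenance) < rank("agent"):
            return True, "outranks existing automated source"
        return False, (f"existing {existing_provenance} annotation "
                       f"(confidence {existing_confidence:.2f}) takes precedence")
    # Rule 5: other automated sources compare rank, then confidence.
    new_rank, old_rank = rank(new_provenance), rank(existing_provenance)
    if new_rank > old_rank:
        return True, "higher-rank provenance"            # assumed wording
    if new_rank == old_rank and new_confidence >= existing_confidence:
        return True, "equal rank, confidence not lower"  # assumed wording
    return False, "existing annotation takes precedence"  # assumed wording

# An agent proposal against an Oracle-confirmed name is turned away, not raised.
allowed, reason = may_overwrite("agent", 0.95, "oracle", 0.85)
```

Because rejection is an ordinary return value, a caller can log the reason and continue with the next symbol.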

Example economy decisions

| New write | Existing | Outcome |
| --- | --- | --- |
| agent, conf 0.80 | (empty) | Allowed: new symbol |
| agent, conf 0.80 | agent, conf 0.60 | Allowed: higher-confidence agent write |
| agent, conf 0.80 | agent, conf 0.90 | Rejected: lower confidence |
| agent, conf 0.95 | oracle, conf 0.85 | Rejected: agent may not overwrite oracle |
| oracle, conf 0.90 | agent, conf 0.95 | Allowed: oracle outranks agent |
| human, conf 1.00 | oracle, locked | Allowed: human override |
| oracle, conf 0.90 | human, locked | Rejected: symbol is locked |

KnowledgeBase API

KnowledgeBase (in src/warden/kb/database.py) is a pure-stdlib sqlite3-backed class that supports the context-manager protocol. Schema is applied idempotently on every open.
from warden.kb import KnowledgeBase

with KnowledgeBase("project.db") as kb:
    ...

Representative calls

version_id = kb.add_module_version(
    label="v1",
    wasm_sha256="abc123...",
    wasm_path="app_v1.wasm",
    emscripten_version="3.1.55",
    num_functions=512,
    num_imported=40,
    shared_memory=True,
)

v  = kb.get_version("v1")   # -> ModuleVersion | None
vs = kb.versions()           # -> list[ModuleVersion], ordered by id
latest = kb.latest_version() # -> ModuleVersion | None

from warden.kb import Symbol

written, reason = kb.upsert_symbol(Symbol(
    stable_id="a3f9c0b2...",
    kind="function",
    name="malloc",
    provenance="oracle",
    confidence=0.97,
    source_ref="https://github.com/emscripten-core/emscripten/blob/main/system/lib/dlmalloc.c#L...",
    evidence=[{"kind": "oracle-match", "detail": "score=0.97 lib=dlmalloc"}],
))
# written=True, reason='new symbol'
# written=False, reason='...' if a higher-authority annotation already exists

sym = kb.get_symbol("a3f9c0b2...", kind="function")  # -> Symbol | None

funcs = kb.functions_for_version(version_id, include_imports=False)
symbols = kb.symbols_for_stable_ids([f["stable_id"] for f in funcs])
# symbols: dict[stable_id -> Symbol]

c = kb.coverage(version_id)
print(f"{c.named}/{c.defined} ({c.coverage_pct}%)")
print(f"  oracle={c.oracle_named}  human={c.human_named}  agent={c.agent_named}")

coverage_pct is computed over defined functions only (imports are excluded from the denominator).

kb.record_oracle_match(
    function_id=func_row_id,
    matched_name="__stdio_write",
    library="musl",
    emscripten_version="3.1.55",
    opt_level="-O2",
    score=0.94,
    source_ref="https://git.musl-libc.org/...",
)

# Lock against any future automated write.
kb.lock_symbol("a3f9c0b2...", kind="function")

This applies the same lock that warden set-name <label> <index> <name> sets by default.

kb.store_diff(from_version_id=1, to_version_id=2, report=report_dict)
report = kb.get_diff(1, 2)    # -> dict | None

entries = kb.audit_log(limit=50)
# list of dicts: id, stable_id, action, actor, detail, created_at

The kb-text export format

warden export <label> --format kb-text emits a stable, deterministic text dump with one line per function in a version, sorted by func_index, showing that function's symbol annotation if it has one. Because the output uses fixed-width fields and a consistent sort, it diffs cleanly in git alongside version-controlled .wasm artifacts.
# WARDEN KB export (version_id=1)
index  stable_id          lk provenance  conf   name
    0  a3f9c0b2e1d4f7a9    oracle      0.97  malloc
    1  b7e2d3a1c4f5e9b0    agent       0.72  parse_token
    2  c1f4a8b3e6d2c7f0  L human       1.00  crypto_init
    3  d9b3e5a2f1c8d4e7    -           -     -
The lk column is L for locked (human-verified) symbols and a space otherwise. The provenance, conf, and name columns are - for functions that have no symbol record yet.
Commit the kb-text dump to version control after each warden diff run. The resulting git diff shows exactly which names were carried over, which were new, and which remain unnamed. It is a human-readable changelog of your RE progress.
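A rough sketch of how such a dump could be produced. The column widths are inferred from the sample rows above rather than taken from WARDEN's exporter, and format_row, export_kb_text, and the dict-shaped inputs are hypothetical.

```python
def format_row(func_index, stable_id, symbol=None):
    """Render one fixed-width line; symbol is None when no annotation exists."""
    if symbol is None:
        lock, provenance, conf, name = " ", "-", "-", "-"
    else:
        lock = "L" if symbol.get("locked") else " "
        provenance = symbol["provenance"]
        conf = f"{symbol['confidence']:.2f}"
        name = symbol.get("name") or "-"
    # Widths below are an assumption read off the sample output.
    return f"{func_index:>5}  {stable_id}  {lock} {provenance:<10}  {conf:<4}  {name}"

def export_kb_text(functions, symbols):
    """Sort by func_index so repeated exports of the same data are identical."""
    ordered = sorted(functions, key=lambda f: f["func_index"])
    return [format_row(f["func_index"], f["stable_id"], symbols.get(f["stable_id"]))
            for f in ordered]

# Input order is deliberately shuffled; the output order is not.
functions = [
    {"func_index": 2, "stable_id": "c1f4a8b3e6d2c7f0"},
    {"func_index": 0, "stable_id": "a3f9c0b2e1d4f7a9"},
    {"func_index": 3, "stable_id": "d9b3e5a2f1c8d4e7"},
]
symbols = {
    "a3f9c0b2e1d4f7a9": {"provenance": "oracle", "confidence": 0.97,
                         "name": "malloc", "locked": False},
    "c1f4a8b3e6d2c7f0": {"provenance": "human", "confidence": 1.0,
                         "name": "crypto_init", "locked": True},
}

lines = export_kb_text(functions, symbols)
print("\n".join(lines))
```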

Storage notes

Pure stdlib

KnowledgeBase depends only on the Python standard library: sqlite3 and json, plus importlib.resources to load the schema. No C extensions or native dependencies on the core path.

WAL mode

PRAGMA journal_mode = WAL allows concurrent readers alongside the single writer, so warden mcp and a local agent can run simultaneously without blocking each other.

Foreign-key cascade

ON DELETE CASCADE is set on all version-scoped tables. Deleting a module_versions row automatically cleans up functions, oracle_matches, thread_model, and diffs.
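The cascade can be observed with a trimmed-down, assumed version of three of the tables. Note that oracle_matches is cleaned up indirectly, through its foreign key to functions, and that SQLite enforces none of this unless PRAGMA foreign_keys = ON has been set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # cascades are ignored without this
conn.executescript("""
-- Trimmed-down stand-ins for the real tables.
CREATE TABLE module_versions (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE functions (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    version_id INTEGER NOT NULL REFERENCES module_versions(id) ON DELETE CASCADE,
    func_index INTEGER NOT NULL
);
CREATE TABLE oracle_matches (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    function_id  INTEGER NOT NULL REFERENCES functions(id) ON DELETE CASCADE,
    matched_name TEXT NOT NULL
);
INSERT INTO module_versions (label) VALUES ('v1');
INSERT INTO functions (version_id, func_index) VALUES (1, 42);
INSERT INTO oracle_matches (function_id, matched_name) VALUES (1, '__stdio_write');
""")

# Removing the version row removes its functions, and through them the matches.
conn.execute("DELETE FROM module_versions WHERE label = 'v1'")
conn.commit()

remaining = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("functions", "oracle_matches")
}
```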

JSON columns

Columns such as minhash, histogram, call_targets, inferred_flags, glue_info, evidence, layout, and report store structured data as JSON strings. KnowledgeBase handles serialization transparently; callers work with Python list and dict objects.

Core concepts

How stable_id is computed and why the provenance economy exists.

CLI reference

Commands that read and write the KB: ingest, oracle, agent, diff, export.

MCP reference

Serve KB data to external tools over the Model Context Protocol.
Last modified on June 7, 2026