KnowledgeBase("project.db"). The schema is applied (idempotently via CREATE TABLE IF NOT EXISTS) on every open from src/warden/kb/schema.sql, loaded at runtime through
importlib.resources. No migration runner is required.
Two SQLite PRAGMAs are set at open time: PRAGMA journal_mode = WAL (concurrent readers
alongside the single writer, which matters when warden mcp and a local agent run
simultaneously) and PRAGMA foreign_keys = ON (version-scoped tables cascade-delete when a
module_versions row is removed).
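The open sequence described above can be sketched with the standard library alone. This is a minimal illustration, not WARDEN's implementation: `open_kb` is a hypothetical helper, and the two-table `SCHEMA` string is a cut-down stand-in for schema.sql.

```python
import os
import sqlite3
import tempfile

# Cut-down stand-in for schema.sql (assumption: the real file defines more
# tables and columns than shown here).
SCHEMA = """
CREATE TABLE IF NOT EXISTS module_versions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS functions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    version_id INTEGER NOT NULL REFERENCES module_versions(id) ON DELETE CASCADE,
    func_index INTEGER NOT NULL,
    UNIQUE (version_id, func_index)
);
"""

def open_kb(path):
    """Hypothetical helper: connect, set both PRAGMAs, apply the schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)  # repeatable: every statement is IF NOT EXISTS
    return conn

path = os.path.join(tempfile.mkdtemp(), "project.db")
open_kb(path).close()
conn = open_kb(path)  # a second open re-applies the schema without error

journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]

# With foreign_keys = ON, deleting the version row removes its functions too.
conn.execute("INSERT INTO module_versions (label) VALUES ('v1')")
conn.execute("INSERT INTO functions (version_id, func_index) VALUES (1, 42)")
conn.execute("DELETE FROM module_versions WHERE label = 'v1'")
remaining = conn.execute("SELECT COUNT(*) FROM functions").fetchone()[0]
conn.commit()
conn.close()
print(journal_mode, remaining)
```

Note that SQLite leaves foreign-key enforcement off unless the PRAGMA is set on each connection, which is why it has to be issued at open time rather than once per database file.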
Status: alpha. The schema is functional and exercised end-to-end by
warden demo. Additional
columns (DWARF source info, wasm2c harness references) are tracked in the
roadmap.
The linchpin design choice
Annotations are keyed on a function’s stable content identity, not on a per-version row ID. The
functions table is the per-version appearance log: it records that function index 42
appeared in version v1 with a given stable_id. The symbols table is the durable
annotation layer: it holds names, types, summaries, and evidence keyed on that same
stable_id.
When a vendor ships v2.wasm and every table index shifts, WARDEN re-ingests and
re-fingerprints. A function that was index 42 in v1 might become index 51 in v2, but its
stable_id (a composite content hash) stays the same. The annotation in symbols is still
there, automatically. That is the carry-over mechanism that makes RE incremental.
See core concepts for how stable_id is computed from structural, semantic, fuzzy,
and exact fingerprints.
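The carry-over described above amounts to a join on stable_id. The sketch below uses simplified tables (a version_label text column instead of the real version_id foreign key) and made-up stable_id values and names, purely to show the mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE functions (
    version_label TEXT NOT NULL,   -- simplification of version_id
    func_index INTEGER NOT NULL,
    stable_id TEXT NOT NULL
);
CREATE TABLE symbols (
    stable_id TEXT NOT NULL,
    kind TEXT NOT NULL DEFAULT 'function',
    name TEXT,
    UNIQUE (stable_id, kind)
);
""")

# The same function appears at index 42 in v1 and index 51 in v2.
conn.executemany(
    "INSERT INTO functions VALUES (?, ?, ?)",
    [("v1", 42, "sid-abc"), ("v2", 51, "sid-abc"), ("v2", 52, "sid-new")],
)
# One annotation, written while analysing v1 (illustrative name).
conn.execute("INSERT INTO symbols (stable_id, name) VALUES ('sid-abc', 'parse_header')")

def names_for(version_label):
    """Return {func_index: name or None} for one version."""
    rows = conn.execute(
        """
        SELECT f.func_index, s.name
        FROM functions AS f
        LEFT JOIN symbols AS s
          ON s.stable_id = f.stable_id AND s.kind = 'function'
        WHERE f.version_label = ?
        ORDER BY f.func_index
        """,
        (version_label,),
    )
    return dict(rows.fetchall())

v1_names = names_for("v1")
v2_names = names_for("v2")
print(v1_names, v2_names)
```

No row in symbols mentions a version or an index, so nothing has to be copied or migrated when v2 is ingested; the function that is new in v2 simply has no annotation yet.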
Table reference
meta
Lightweight key/value store for project-level metadata. Accessed via KnowledgeBase.get_meta()
and set_meta().
| Column | Type | Constraints | Notes |
|---|---|---|---|
| key | TEXT | PRIMARY KEY | Arbitrary metadata key. |
| value | TEXT | NOT NULL | Current keys: schema_version (set to "1" on init) and project. |
module_versions
One row per ingested .wasm file, representing a point in the target’s version history. The id here is
the foreign key anchor for functions, thread_model, and diffs.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | Surrogate key; referenced throughout. |
| label | TEXT | NOT NULL UNIQUE | Human-readable name, e.g. "v1". Set via --label on ingest. |
| wasm_path | TEXT | | Recorded path to the original .wasm. |
| glue_path | TEXT | | Path to the accompanying Emscripten JS glue, if any. |
| wasm_sha256 | TEXT | NOT NULL | SHA-256 of the raw bytes; guards against re-ingesting a changed file under the same label. |
| emscripten_version | TEXT | | Inferred from Oracle matches or the JS glue (e.g. "3.1.55"). |
| inferred_flags | TEXT | | JSON object: Oracle/glue-inferred build flags (-O2, -pthread, etc.). |
| glue_info | TEXT | | JSON: parsed GlueInfo from the Emscripten JS glue (export map, dynCall signatures, pthread shape). |
| num_functions | INTEGER | DEFAULT 0 | Total functions (imported + defined). |
| num_imported | INTEGER | DEFAULT 0 | Imported function count. |
| shared_memory | INTEGER | DEFAULT 0 | 1 if the module uses shared memory (pthreads). |
| ingested_at | TEXT | DEFAULT CURRENT_TIMESTAMP | UTC timestamp of ingest. |
| notes | TEXT | | Free-text annotation set via --notes on ingest. |
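The wasm_sha256 guard can be pictured as follows. This is a sketch under assumptions: `check_reingest` is a hypothetical function, the in-memory dict stands in for the module_versions table, and raising an error on mismatch is one plausible policy rather than WARDEN's documented behaviour.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw module bytes, in the form stored in wasm_sha256."""
    return hashlib.sha256(data).hexdigest()

def check_reingest(known: dict, label: str, wasm_bytes: bytes) -> str:
    """Hypothetical guard. `known` maps label -> stored wasm_sha256.

    Returns "new" or "unchanged"; raises if the bytes differ from what was
    previously ingested under the same label.
    """
    digest = sha256_hex(wasm_bytes)
    stored = known.get(label)
    if stored is None:
        known[label] = digest
        return "new"
    if stored == digest:
        return "unchanged"
    raise ValueError(f"label {label!r} was already ingested with a different wasm_sha256")

known_labels = {}
original = b"\x00asm\x01\x00\x00\x00"   # WebAssembly magic number + version 1
patched = original + b"\x00\x01\x00"    # different bytes offered under the same label

first = check_reingest(known_labels, "v1", original)
second = check_reingest(known_labels, "v1", original)
try:
    check_reingest(known_labels, "v1", patched)
    third = "accepted"
except ValueError:
    third = "refused"
print(first, second, third)
```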
functions
The per-version appearance log. Each row records one function as it appeared in one version,
with all fingerprinting data needed for identity matching and Oracle lookup. This table does
not hold human annotations. Those live in symbols.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | Referenced by oracle_matches. |
| version_id | INTEGER | NOT NULL FK → module_versions.id ON DELETE CASCADE | Which version this appearance belongs to. |
| func_index | INTEGER | NOT NULL | WebAssembly function index within this version. |
| stable_id | TEXT | NOT NULL | The carry-over key. Content identity hash. See idx_functions_stable. |
| exact_hash | TEXT | NOT NULL | Hash of the raw instruction bytes; changes on any edit. |
| structural_hash | TEXT | NOT NULL | CFG-shape hash; robust to constant and immediate changes. See idx_functions_struct. |
| minhash | TEXT | NOT NULL | JSON int array (MinHash signature for fuzzy/Jaccard similarity). |
| histogram | TEXT | NOT NULL | JSON {opcode_class: count} (opcode-class frequency histogram). |
| call_targets | TEXT | NOT NULL | JSON list of import names this function calls. |
| local_calls | INTEGER | DEFAULT 0 | Number of calls to other locally-defined functions. |
| type_signature | TEXT | | WebAssembly type signature (always present for defined functions). |
| instruction_count | INTEGER | DEFAULT 0 | Total instructions in the function body. |
| body_size | INTEGER | DEFAULT 0 | Raw byte size of the function body. |
| is_import | INTEGER | DEFAULT 0 | 1 for imported functions (no body; just a type stub). |
| raw_name | TEXT | | Name as found in the name section, export table, or import descriptor, if any. |
Unique constraint: (version_id, func_index). One row per function per version.
Indices: idx_functions_stable on stable_id; idx_functions_version on version_id;
idx_functions_struct on structural_hash.
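To make the minhash column concrete: two MinHash signatures of equal length give a Jaccard estimate equal to the fraction of slots that agree. The sketch below applies that standard estimator to JSON-encoded signatures; the signature values are invented, and whether WARDEN scores fuzzy matches exactly this way is an assumption.

```python
import json

def estimate_jaccard(minhash_a: str, minhash_b: str) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    a = json.loads(minhash_a)
    b = json.loads(minhash_b)
    if not a or len(a) != len(b):
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Invented 8-slot signatures as they would be stored in the TEXT column.
sig_v1 = json.dumps([3, 17, 42, 8, 91, 5, 60, 12])
sig_v2 = json.dumps([3, 17, 40, 8, 91, 5, 61, 12])

similarity = estimate_jaccard(sig_v1, sig_v2)
print(similarity)
```

Six of the eight slots agree here, so the estimate is 0.75. Real signatures are usually much longer, which tightens the estimate.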
symbols
The durable annotation layer. Keyed on stable_id so that an annotation created against
v1 is automatically present when the same function appears in v2, v3, and beyond. All
writes go through KnowledgeBase.upsert_symbol(), which enforces the provenance/confidence
economy (see below).
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| stable_id | TEXT | NOT NULL | Content identity. Indexed (idx_symbols_stable). |
| kind | TEXT | NOT NULL DEFAULT 'function' | One of 'function', 'global', 'struct', 'type'. |
| name | TEXT | | Recovered or assigned name. |
| type_signature | TEXT | | Inferred or confirmed type. |
| summary | TEXT | | Free-text description of what the function does. |
| provenance | TEXT | NOT NULL | Who produced this annotation. See Provenance ranking. |
| confidence | REAL | NOT NULL DEFAULT 0.0 | 0.0–1.0. Drives the write-economy gate. |
| evidence | TEXT | | JSON list of {kind, detail} objects (the evidence trail). |
| source_ref | TEXT | | Upstream source URL or file/line (e.g. a musl commit URL). Set for Oracle-confirmed runtime code. |
| locked | INTEGER | DEFAULT 0 | 1 means human-verified; immutable to any automated write. |
| created_at | TEXT | DEFAULT CURRENT_TIMESTAMP | |
| updated_at | TEXT | DEFAULT CURRENT_TIMESTAMP | Updated on every accepted write. |
Unique constraint: (stable_id, kind). One annotation record per identity per kind.
structs
Recovered memory-region and struct layouts, stored as first-class structured data rather than
free-text comments.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| name | TEXT | NOT NULL UNIQUE | Struct name, e.g. "pthread_mutex_t". |
| layout | TEXT | NOT NULL | JSON list of {offset, size, type, name} field descriptors. |
| provenance | TEXT | NOT NULL | Typically "agent" or "oracle". |
| confidence | REAL | NOT NULL DEFAULT 0.0 | |
| notes | TEXT | | Free-text notes. |
| updated_at | TEXT | DEFAULT CURRENT_TIMESTAMP | |
Written via KnowledgeBase.upsert_struct(), which uses ON CONFLICT(name) DO UPDATE.
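The ON CONFLICT(name) DO UPDATE pattern looks roughly like this in plain sqlite3. The function below is a stand-in written for illustration: its parameter list, the struct name "demo_header", and the field layout are all assumptions, not the real method's signature or a real recovered layout.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE structs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE,
    layout TEXT NOT NULL,
    provenance TEXT NOT NULL,
    confidence REAL NOT NULL DEFAULT 0.0,
    notes TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

def upsert_struct_sketch(name, layout, provenance, confidence):
    """Insert a struct, or update the existing row that has the same name."""
    conn.execute(
        """
        INSERT INTO structs (name, layout, provenance, confidence)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            layout = excluded.layout,
            provenance = excluded.provenance,
            confidence = excluded.confidence,
            updated_at = CURRENT_TIMESTAMP
        """,
        (name, json.dumps(layout), provenance, confidence),
    )

# First guess at a layout, then a refined one under the same name.
upsert_struct_sketch(
    "demo_header",
    [{"offset": 0, "size": 4, "type": "i32", "name": "magic"}],
    "agent", 0.5,
)
upsert_struct_sketch(
    "demo_header",
    [{"offset": 0, "size": 4, "type": "i32", "name": "magic"},
     {"offset": 4, "size": 4, "type": "i32", "name": "length"}],
    "oracle", 0.9,
)

row_count = conn.execute("SELECT COUNT(*) FROM structs").fetchone()[0]
layout_json, provenance = conn.execute(
    "SELECT layout, provenance FROM structs WHERE name = 'demo_header'"
).fetchone()
fields = json.loads(layout_json)
print(row_count, provenance, len(fields))
```

The UNIQUE constraint on name is what makes the conflict clause fire; without it the second call would insert a second row.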
thread_model
The concurrency spine: lock-to-guarded-data relationships, atomic sites, TLS references, and
worker entry points. Version-scoped because the threading model may change between releases.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| version_id | INTEGER | FK → module_versions.id ON DELETE CASCADE | |
| kind | TEXT | NOT NULL | One of 'lock', 'atomic', 'tls', 'worker-entry'. |
| site | TEXT | | Function stable_id or address where the site was observed. |
| guarded_data | TEXT | | Named memory region this lock protects (for kind='lock'). |
| detail | TEXT | | Additional free-text detail. |
| provenance | TEXT | NOT NULL DEFAULT 'agent' | |
| confidence | REAL | NOT NULL DEFAULT 0.0 | |
Written via KnowledgeBase.add_thread_fact(); retrieved via thread_facts(version_id).
oracle_matches
Records of individual Oracle identifications: a function fingerprint matched to a known upstream
runtime or libc function. References functions.id (not stable_id), so each match is
version-specific: the same logical function may match at a different func_index in each
version, generating a separate row.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| function_id | INTEGER | NOT NULL FK → functions.id ON DELETE CASCADE | Which per-version function row was matched. |
| matched_name | TEXT | NOT NULL | The real upstream name, e.g. "__stdio_write". |
| library | TEXT | | Source library: musl, emscripten, libc++, compiler-rt, or dlmalloc. |
| emscripten_version | TEXT | | Emscripten version range where this signature was observed. |
| opt_level | TEXT | | Optimization level of the corpus artifact, e.g. "-O2". |
| score | REAL | NOT NULL | Match score (0.0–1.0). |
| source_ref | TEXT | | Upstream source URL or file/line from the corpus artifact. |
Unique constraint: (function_id, matched_name).
An Oracle match is a separate record from the symbols annotation it triggers. warden oracle identify writes both: the match record here, and a symbols row (provenance "oracle") via
upsert_symbol.
diffs
Stored cross-version diff reports (the semantic changelogs produced by warden diff). One row
per ordered pair of versions. The report column holds the full JSON classification (unchanged,
structurally-equivalent, fuzzy-matched, added, removed) and carry-over counts.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | |
| from_version_id | INTEGER | FK → module_versions.id ON DELETE CASCADE | Earlier version. |
| to_version_id | INTEGER | FK → module_versions.id ON DELETE CASCADE | Later version. |
| report | TEXT | NOT NULL | JSON diff report (function classifications, carry-over counts, changelog). |
| created_at | TEXT | DEFAULT CURRENT_TIMESTAMP | |
Unique constraint: (from_version_id, to_version_id).
audit_log
Append-only trail of every symbol write attempt, whether accepted or rejected. Rows are never
modified, only inserted. Used to answer “who changed this name and when?” and “why was this
agent proposal rejected?”.
| Column | Type | Constraints | Notes |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY AUTOINCREMENT | Monotonically increasing. |
| stable_id | TEXT | | The identity that was affected. May be NULL for non-symbol actions. |
| action | TEXT | NOT NULL | One of 'created', 'updated', 'rejected'. |
| actor | TEXT | NOT NULL | Provenance of the writer ('human', 'oracle', 'agent', etc.). |
| detail | TEXT | | Human-readable reason string returned by may_overwrite(). |
| created_at | TEXT | DEFAULT CURRENT_TIMESTAMP | UTC timestamp. |
Retrieved via KnowledgeBase.audit_log(limit=100).
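The two questions the audit trail is meant to answer map onto simple queries. The sketch below runs them against a hand-built table; the helper names, the sample rows, and the wording of the rejection detail are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    stable_id TEXT,
    action TEXT NOT NULL,
    actor TEXT NOT NULL,
    detail TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

# Rows are only ever appended; nothing below issues UPDATE or DELETE.
conn.executemany(
    "INSERT INTO audit_log (stable_id, action, actor, detail) VALUES (?, ?, ?, ?)",
    [
        ("sid-abc", "created", "oracle", "new symbol"),
        ("sid-abc", "rejected", "agent", "agent may not overwrite oracle (illustrative wording)"),
        ("sid-abc", "updated", "human", "human override"),
        ("sid-xyz", "created", "agent", "new symbol"),
    ],
)

def history(stable_id, limit=100):
    """Who touched this identity, newest first."""
    return conn.execute(
        "SELECT action, actor, detail FROM audit_log "
        "WHERE stable_id = ? ORDER BY id DESC LIMIT ?",
        (stable_id, limit),
    ).fetchall()

def rejections(limit=100):
    """Every write attempt that the gate turned down, oldest first."""
    return conn.execute(
        "SELECT stable_id, actor, detail FROM audit_log "
        "WHERE action = 'rejected' ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()

abc_history = history("sid-abc")
rejected = rejections()
print(abc_history[0], rejected)
```

Ordering by id rather than created_at keeps the result stable even when several writes share the same second.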
Provenance ranking
Every symbol write carries a provenance label. PROVENANCE_RANK in src/warden/kb/models.py
assigns an integer priority that the may_overwrite() gate uses to decide whether a new write
may displace an existing annotation.
10. Human is sovereign at the
top; raw agent proposals sit at the bottom. The ranking is also shown in the core
concepts table.
The may_overwrite() economy
may_overwrite() in src/warden/kb/models.py is the gatekeeper called by every
KnowledgeBase.upsert_symbol() invocation. It receives the new write’s provenance and
confidence alongside the existing annotation’s provenance, confidence, and lock state, and
returns (allowed: bool, reason: str).
The rules, exactly as implemented:
Empty slot: always allow
If existing_provenance is None, the slot is empty and any write succeeds. Reason:
"new symbol".
Human write: always allow
If new_provenance == "human", the write wins unconditionally, regardless of what already
exists. Reason: "human override". A subsequent lock_symbol() call then sets
locked = 1 to protect it.
Locked symbol: reject any non-human write
If existing_locked is true and the writer is not "human", the write is rejected.
Reason: "existing symbol is locked (human-verified)". The locked flag is set by
warden set-name (which locks by default) and by KnowledgeBase.lock_symbol().
Agent write: narrow overwrite rights
An agent may only write if one of two conditions holds:
- The existing annotation is also "agent" and new_confidence > existing_confidence (strictly higher). Reason: "higher-confidence agent write".
- The existing provenance has a rank strictly below 30 (the agent rank), i.e. an unknown source. Reason: "outranks existing automated source".
An agent may not overwrite oracle, export, import, string-xref, diff-carry, or
human annotations. The rejection reason names the existing provenance and its confidence.
Rejected writes return (False, reason) and are appended to audit_log
with action='rejected'. This lets you re-run the entire agent crew on every update without
clobbering any verified human work or high-confidence Oracle annotations.
Example economy decisions
| New write | Existing | Outcome |
|---|---|---|
| agent, conf 0.80 | (empty) | Allowed: new symbol |
| agent, conf 0.80 | agent, conf 0.60 | Allowed: higher-confidence agent write |
| agent, conf 0.80 | agent, conf 0.90 | Rejected: lower confidence |
| agent, conf 0.95 | oracle, conf 0.85 | Rejected: agent may not overwrite oracle |
| oracle, conf 0.90 | agent, conf 0.95 | Allowed: oracle outranks agent |
| human, conf 1.00 | oracle, locked | Allowed: human override |
| oracle, conf 0.90 | human, locked | Rejected: symbol is locked |
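The rules and the decision table above can be sketched as a single function. This is a reading of the prose, not the code in models.py. Several parts are assumptions: every rank value except the agent rank of 30 is invented, the rejection wording is made up, and the final branch (a non-agent automated writer wins only with a strictly higher rank) is inferred from the "oracle outranks agent" row rather than stated on this page.

```python
AGENT_RANK = 30  # the only rank value given in the text

# Hypothetical ranks for illustration; unknown labels fall back to 0.
SKETCH_RANK = {
    "human": 100,
    "oracle": 90,
    "export": 80,
    "import": 80,
    "string-xref": 60,
    "diff-carry": 50,
    "agent": AGENT_RANK,
}

def rank(provenance):
    return SKETCH_RANK.get(provenance, 0)

def may_overwrite_sketch(new_provenance, new_confidence,
                         existing_provenance, existing_confidence, existing_locked):
    """Return (allowed, reason) following the rule order described above."""
    if existing_provenance is None:
        return True, "new symbol"
    if new_provenance == "human":
        return True, "human override"
    if existing_locked:
        return False, "existing symbol is locked (human-verified)"
    if new_provenance == "agent":
        if existing_provenance == "agent":
            if new_confidence > existing_confidence:
                return True, "higher-confidence agent write"
            return False, f"existing agent annotation has confidence {existing_confidence:.2f}"
        if rank(existing_provenance) < AGENT_RANK:
            return True, "outranks existing automated source"
        return False, (f"agent may not overwrite {existing_provenance} "
                       f"(confidence {existing_confidence:.2f})")
    # Assumed rule for other automated writers: a strictly higher rank wins.
    if rank(new_provenance) > rank(existing_provenance):
        return True, f"{new_provenance} outranks {existing_provenance}"
    return False, f"{new_provenance} does not outrank {existing_provenance}"

# The seven rows of the example table, in order.
decisions = [
    may_overwrite_sketch("agent", 0.80, None, None, False),
    may_overwrite_sketch("agent", 0.80, "agent", 0.60, False),
    may_overwrite_sketch("agent", 0.80, "agent", 0.90, False),
    may_overwrite_sketch("agent", 0.95, "oracle", 0.85, False),
    may_overwrite_sketch("oracle", 0.90, "agent", 0.95, False),
    may_overwrite_sketch("human", 1.00, "oracle", 0.85, True),
    may_overwrite_sketch("oracle", 0.90, "human", 1.00, True),
]
for allowed, reason in decisions:
    print("Allowed" if allowed else "Rejected", "-", reason)
```

The order of the checks matters: the human override is tested before the lock, which is why a human write succeeds against a locked symbol while every other writer is stopped at the lock check.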
KnowledgeBase API
KnowledgeBase (in src/warden/kb/database.py) is a pure-stdlib sqlite3-backed class that
supports the context-manager protocol. Schema is applied idempotently on every open.
Representative calls
Module versions
Write and read a symbol
Bulk-fetch symbols for a version
Coverage statistics
coverage_pct is computed over defined functions only (imports excluded from the
denominator).
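As a rough illustration of that denominator rule, the query below counts only rows with is_import = 0. The numerator used here (defined functions whose symbols row has a name) is an assumption, as are the helper name and sample data; the tables are reduced to the columns the calculation needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE functions (
    version_id INTEGER NOT NULL,
    func_index INTEGER NOT NULL,
    stable_id TEXT NOT NULL,
    is_import INTEGER DEFAULT 0
);
CREATE TABLE symbols (
    stable_id TEXT NOT NULL,
    kind TEXT NOT NULL DEFAULT 'function',
    name TEXT
);
""")
conn.executemany(
    "INSERT INTO functions VALUES (?, ?, ?, ?)",
    [
        (1, 0, "sid-imp", 1),  # import: excluded from the denominator
        (1, 1, "sid-a", 0),
        (1, 2, "sid-b", 0),
        (1, 3, "sid-c", 0),
        (1, 4, "sid-d", 0),
    ],
)
conn.executemany(
    "INSERT INTO symbols (stable_id, name) VALUES (?, ?)",
    [("sid-imp", "env_abort"), ("sid-a", "main"), ("sid-b", "parse_header")],
)

def coverage_pct_sketch(version_id):
    """Percentage of defined (non-import) functions that have a named symbol."""
    defined, named = conn.execute(
        """
        SELECT COUNT(*), COUNT(s.name)
        FROM functions AS f
        LEFT JOIN symbols AS s
          ON s.stable_id = f.stable_id AND s.kind = 'function'
        WHERE f.version_id = ? AND f.is_import = 0
        """,
        (version_id,),
    ).fetchone()
    return 100.0 * named / defined if defined else 0.0

pct = coverage_pct_sketch(1)
print(pct)
```

Two of the four defined functions are named, so the result is 50.0; the named import does not raise the figure because it never enters the query.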
Oracle matches
Lock a symbol
The CLI equivalent is warden set-name <label> <index> <name> (which locks by default).
Diff storage
Audit log
The kb-text export format
warden export <label> --format kb-text emits a stable, deterministic text dump of every
symbol in a version, sorted by func_index. Because the output uses fixed-width fields and a
consistent sort, it diffs cleanly in git alongside version-controlled .wasm artifacts.
The lk column is L for locked (human-verified) symbols and a space otherwise. The
provenance and conf columns are - for functions that have no symbol record yet.
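A deterministic fixed-width dump of this kind can be sketched in a few lines. The column order and widths below are invented for illustration and do not reproduce the actual kb-text layout; only the sort by func_index, the L/space lock flag, and the - placeholders come from the description above.

```python
def format_kb_text_sketch(rows):
    """Render rows as fixed-width lines sorted by func_index.

    Each row is a dict with func_index, name, provenance, confidence, locked;
    functions without a symbol record carry None for the symbol fields.
    """
    lines = []
    for r in sorted(rows, key=lambda r: r["func_index"]):
        lk = "L" if r.get("locked") else " "
        prov = r.get("provenance") or "-"
        conf = "-" if r.get("confidence") is None else f"{r['confidence']:.2f}"
        name = r.get("name") or ""
        # Hypothetical widths: index 6, lock flag 1, provenance 12, confidence 4.
        lines.append(f"{r['func_index']:>6} {lk} {prov:<12} {conf:>4} {name}".rstrip())
    return "\n".join(lines) + "\n"

rows = [
    {"func_index": 51, "name": "parse_header", "provenance": "human",
     "confidence": 1.0, "locked": 1},
    {"func_index": 7, "name": "__stdio_write", "provenance": "oracle",
     "confidence": 0.9, "locked": 0},
    {"func_index": 52, "name": None, "provenance": None,
     "confidence": None, "locked": 0},
]

dump = format_kb_text_sketch(rows)
print(dump, end="")
```

Because the output depends only on the row contents and not on insertion order, two exports of the same version are byte-identical, which is what keeps git diffs limited to real annotation changes.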
Storage notes
Pure stdlib
KnowledgeBase imports only sqlite3 and json. No C extensions or native dependencies on
the core path.
WAL mode
PRAGMA journal_mode = WAL allows concurrent readers alongside the single writer, so
warden mcp and a local agent can run simultaneously without blocking each other.
Foreign-key cascade
ON DELETE CASCADE is set on all version-scoped tables. Deleting a module_versions row
automatically cleans up functions, oracle_matches, thread_model, and diffs.
JSON columns
Columns such as minhash, histogram, call_targets, inferred_flags, glue_info,
evidence, layout, and report store structured data as JSON strings.
KnowledgeBase handles serialization transparently; callers work with Python list and
dict objects.
Related pages
Core concepts
How stable_id is computed and why the provenance economy exists.
CLI reference
Commands that read and write the KB: ingest, oracle, agent, diff, export.
MCP reference
Serve KB data to external tools over the Model Context Protocol.