
Offline, Federated, Governed Intelligence Extraction

Proving a small-model federation can match — and in recall exceed — a frontier model, on real documents

Enki Systems · Brad Harris · 2026-06-23 · Working whitepaper: results are from live runs on the G1 production node, not simulations.


1. Thesis

A single large language model is the obvious way to read documents and extract intelligence. It is also the wrong way for a system that must be self-hostable, offline-capable, auditable, and trustworthy. A frontier model is a remote dependency, a black box, a refusal surface, and a single point of failure.

We claim — and demonstrate below on real documents — that the load-bearing work can be done entirely offline on small local models plus deterministic governance, and that a federation of diverse small models is not a downgrade but an upgrade: because the models find largely different things, their union has higher recall than any single model, and because validation is proven rather than judged, the result is more trustworthy than a model's opinion.

Three design commitments make this work:

  1. The substrate is deterministic. Collapse (entity resolution + merge) is an explicit governed equation, not a model call. Same documents in → same graph out.
  2. The model only transcribes; it never judges. Extraction asks "who said what" (a transcription task small models do well). Validation — faithfulness, attribution, contradiction — is proven deterministically or by geometry (embeddings), never by asking a model for a verdict.
  3. Isolation is physical, not procedural. Each dataset lives in its own database pool. Cross-referencing can only happen after data is inside a pool, and never across pools.

2. The library system: grab → hash → lock → dedup → isolate

A library is a governed dataset: a named, addressed collection of source documents about a focal subject. Libraries are how we keep "the public pool" and "a private research pool" from ever mixing.

2.1 Definition

libraries (migration 067) carries library_id, a unique base60_address, canonical_name, focal_term (the seed subject), library_subtype, tenant_id, and counters. library_members (PK (library_id, member_kind, member_id)) ties documents/entities/events to the library. Evidence: storage/postgres/migrations/067_library_and_bibliographic.sql.

2.2 Grab

Documents are acquired by core/discovery/acquirer.py (archive.org search with token-overlap validation, then direct .pdf URL) and ingested through core/documents/ingester.py persist_document/persist_page/persist_paragraph.

2.3 Hash

On ingest, content is hashed: content_hash = "sha256:" + sha256(raw_bytes), and a bare 64-hex content_sha256 is derived for the federation key. Both are stored on the documents row. Evidence (live, enki_covid): every document carries a content_sha256 (424db387…, 613e808c…, bc9d5ad3…).
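The two hash forms described above can be sketched with the standard library alone. This is a minimal illustration, not the code in core/documents/ingester.py; the function name content_hashes is hypothetical.

```python
import hashlib


def content_hashes(raw_bytes: bytes) -> tuple[str, str]:
    """Return (content_hash, content_sha256) for a document's raw bytes."""
    digest = hashlib.sha256(raw_bytes).hexdigest()  # 64 lowercase hex characters
    # Prefixed form for the documents row, bare form for the federation key.
    return "sha256:" + digest, digest


content_hash, content_sha256 = content_hashes(b"example document bytes")
```

Because the digest depends only on the bytes, re-ingesting an unchanged file always yields the same pair, which is what the dedup step relies on.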

2.4 Lock

Content-addressing makes a document tamper-evident: the content_sha256 is its identity, and federation envelopes are stored at a content-addressed path and signed with Ed25519 (enki_signer_pubkey, enki_signed_at; core/federation/extraction_publisher.py, peer_identity.py). Entities get a canonical_entity_key = sha256(name|type|strongest_id), which serves as the cross-node identity lock (core/federation/identity.py). Honest status: content-hash locking is live; cryptographic signing of pool documents (enki_signed_at) exists in code but is not yet activated on the current pools. Activating it is part of the fresh rebuild.
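A minimal sketch of the entity key, assuming the three fields are joined with a literal "|" and the digest is hex-encoded. The normalisation step (trimming and lowercasing) is an assumption added for illustration; the source only states sha256(name|type|strongest_id), and the real rules live in core/federation/identity.py.

```python
import hashlib


def canonical_entity_key(name: str, entity_type: str, strongest_id: str) -> str:
    """Derive a stable key from (name, type, strongest_id)."""
    # ASSUMPTION: trim and lowercase name and type so trivially different
    # spellings map to the same key. The identifier is only trimmed.
    fields = [name.strip().lower(), entity_type.strip().lower(), strongest_id.strip()]
    return hashlib.sha256("|".join(fields).encode("utf-8")).hexdigest()


# Hypothetical entity; the identifier string is a placeholder.
key_a = canonical_entity_key("Ada Example", "person", "id:0001")
key_b = canonical_entity_key("  ada example ", "Person", "id:0001")
```

Two nodes that apply the same derivation to the same entity arrive at the same key without exchanging any local entity_id, which is the property the cross-node lock depends on.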

2.5 Dedup ("do we already have this?")

documents.hash is a UNIQUE column; ingest is an INSERT … ON CONFLICT (hash) DO UPDATE — re-adding identical content upserts metadata, it never creates a second row. Entities dedup on canonical_entity_key. Evidence (live): documents_hash_key unique index present; enki_research shows 1,899 entities / 1,899 with canonical_entity_key = 100% coverage.
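The upsert behaviour can be reproduced in a few lines. The sketch below uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; SQLite (3.24 or later) accepts the same INSERT … ON CONFLICT … DO UPDATE form as PostgreSQL. The title column and the ingest helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY,"
    " hash TEXT UNIQUE NOT NULL,"  # the UNIQUE constraint is what enables dedup
    " title TEXT)"
)


def ingest(connection, content_hash, title):
    """Insert a document, or refresh its metadata if the hash already exists."""
    connection.execute(
        "INSERT INTO documents (hash, title) VALUES (?, ?) "
        "ON CONFLICT (hash) DO UPDATE SET title = excluded.title",
        (content_hash, title),
    )


ingest(conn, "sha256:aaa", "first title")
ingest(conn, "sha256:aaa", "corrected title")  # same content, new metadata
rows = conn.execute("SELECT hash, title FROM documents").fetchall()
```

After both calls the table still holds a single row, carrying the metadata from the second call.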

2.6 Isolate (the guarantee you asked us to prove)

Each pool is a separate PostgreSQL database selected by the ENKI_DB_DSN env var; a process binds to exactly one pool (storage/postgres/db.py, single ThreadedConnectionPool). There is no code path that opens two pools at once.

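The guarantee is stronger than code discipline: it is enforced by PostgreSQL itself. A PostgreSQL connection is bound to a single database, and a query can reach another database only through a bridge such as the dblink or postgres_fdw extension or a configured foreign server.

Live validation, 2026-06-23: every pool reports 0 dblink/postgres_fdw extensions and 0 foreign servers. A query inside enki_covid cannot reach enki. Cross-pool mixing is not "discouraged"; with no bridge installed, it is impossible.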

Independent ID spaces: entity_id = 1 resolves to a different real-world entity in each pool — enki_research → "Johanna Sjoberg" (person), enki → "United States" (country), enki_covid → (none). There is no shared namespace to leak across.

Cross-referencing therefore happens only after data is inside a pool, and only within that pool (core/gdu/entity_resolution.py, core/workers/collapse_worker.py operate on one connection). Federation moves data between pools only via signed, content-addressed envelopes that the receiving pool re-resolves locally on its own canonical_entity_key — it never joins across databases.
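A check of this kind can be sketched against the PostgreSQL system catalogs pg_extension and pg_foreign_server. This is not the content of scripts/validate_isolation.sql, which is not reproduced here; the function name and the stub cursor are hypothetical, and the stub exists only so the sketch runs without a database. In practice the cursor would come from a connection opened with the pool's ENKI_DB_DSN.

```python
BRIDGE_EXTENSIONS_SQL = (
    "SELECT count(*) FROM pg_extension "
    "WHERE extname IN ('dblink', 'postgres_fdw')"
)
FOREIGN_SERVERS_SQL = "SELECT count(*) FROM pg_foreign_server"


def isolation_report(cursor) -> dict:
    """Count cross-database bridges visible from the connected pool."""
    cursor.execute(BRIDGE_EXTENSIONS_SQL)
    bridge_extensions = cursor.fetchone()[0]
    cursor.execute(FOREIGN_SERVERS_SQL)
    foreign_servers = cursor.fetchone()[0]
    return {
        "bridge_extensions": bridge_extensions,
        "foreign_servers": foreign_servers,
        "isolated": bridge_extensions == 0 and foreign_servers == 0,
    }


class StubCursor:
    """Stand-in for a DB-API cursor, returning canned counts per query."""

    def __init__(self, counts):
        self._counts = dict(counts)
        self._last = None

    def execute(self, sql):
        self._last = self._counts[sql]

    def fetchone(self):
        return (self._last,)


report = isolation_report(
    StubCursor({BRIDGE_EXTENSIONS_SQL: 0, FOREIGN_SERVERS_SQL: 0})
)
```

The check has to be run once per pool, since each connection sees only its own database's catalogs.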


3. Extraction methodology (fully offline)

Pass 1  ANCHORS        deterministic high-recall net: names, orgs, dates, cue words
Pass 2  ANCHOR-FILL    local model (gemma4:e2b) TRANSCRIBES, guided by Pass-1 anchors,
                       one proposition per statement — no interpretation
        ── VALIDATION (deterministic, no model judgment) ──
        · faithfulness wall : quote must appear verbatim in the source window
        · attribution check : the named speaker must sit adjacent to the quote
        · boilerplate filter: drop classification stamps / email headers
        · garble filter     : drop OCR-mangled quotes
COLLAPSE               deterministic governed equation → canonical graph, base-60 addresses
AFTER-CALL  COMPARE    embed each claim (nomic-embed, local) → cluster by proposition →
                       opposite polarity in a cluster = CONTRADICTION, proven by two quotes
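The first two validation gates need no model at all. The sketch below is one way they might look, assuming whitespace is normalised before comparison and that "adjacent" means within a fixed character distance of the quote; both choices, the max_gap value, and the function names are assumptions rather than the behaviour of scripts/atomic_extract.py. The example window is invented.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def passes_faithfulness(quote: str, window: str) -> bool:
    """Faithfulness wall: the quote must appear verbatim in the source window."""
    return normalize(quote) in normalize(window)


def passes_attribution(speaker: str, quote: str, window: str, max_gap: int = 80) -> bool:
    """Attribution check: the speaker's name must sit within max_gap characters."""
    win, q = normalize(window), normalize(quote)
    start = win.find(q)
    if start < 0:
        return False  # a quote that is not in the window cannot be attributed
    before = win[max(0, start - max_gap):start].lower()
    after = win[start + len(q):start + len(q) + max_gap].lower()
    name = speaker.lower()
    return name in before or name in after


window = 'Witness A stated: "the samples were\nlogged daily." Counsel then moved on.'
quote = "the samples were logged daily."
faithful = passes_faithfulness(quote, window)
attributed = passes_attribution("Witness A", quote, window)
```

A statement that fails either gate is dropped before collapse, so a model that paraphrases or misattributes simply produces no finding.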

The key move (Brad's): the open-ended task "find everything significant in this passage" — the task that favours a frontier model — is decomposed into a cheap high-recall scaffold (Pass 1) plus many narrow fill-in questions (Pass 2). Small models excel at narrow questions. And the scaffold's base-60 addresses are the shared coordinate that lets federated nodes agree on what they are filling in.
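The decomposition can be illustrated with a toy scaffold: regular expressions collect candidate names, dates, and cue words, and each name becomes one narrow transcription question. The patterns, the cue-word list, and the prompt wording are all assumptions made for this sketch, and the passage is invented; the real Pass 1 is broader and also assigns base-60 addresses, which are omitted here.

```python
import re

CUE_WORDS = ("testified", "stated", "said", "denied", "wrote")
# Two or more capitalised words in a row, e.g. "Ada Example" (deliberately loose).
NAME_RE = re.compile(r"\b(?:[A-Z][a-z]+\.? )+[A-Z][a-z]+\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_anchors(text: str) -> dict:
    """Pass 1: a cheap, high-recall net of names, dates, and cue words."""
    return {
        "names": list(dict.fromkeys(NAME_RE.findall(text))),  # dedupe, keep order
        "dates": DATE_RE.findall(text),
        "cues": [w for w in CUE_WORDS if re.search(rf"\b{w}\b", text)],
    }


def narrow_questions(anchors: dict) -> list[str]:
    """Pass 2 input: one transcription-only question per anchored name."""
    return [
        f"Quote verbatim what {name} said in this passage, "
        "one statement per line. If nothing, return nothing."
        for name in anchors["names"]
    ]


passage = (
    "Ada Example testified on 2020-01-15 that the log was complete. "
    "Later, Bo Sample denied it."
)
anchors = extract_anchors(passage)
questions = narrow_questions(anchors)
```

Each question asks only for transcription, so whatever the model returns can be checked by the deterministic gates rather than trusted.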

Why no model judges: small safety-tuned models refuse to adjudicate politically-charged claims (empirically: gemma4:e2b returned empty/reflexive-reject when asked to judge Fauci/COVID claims, while transcribing the same content fine). So judgment was removed from the model's job. A contradiction is proven by juxtaposition — one source asserts P, another asserts ¬P about the same proposition — with both quotes shown. No opinion sits between the reader and the evidence.


4. Results (real documents, live runs)

4.1 Recall: small-model two-pass vs. frontier one-pass

Same document (ODNI declassified COVID release, doc 3 in enki_covid, 57 windows):

Approach                                              Verified findings
Offline local: gemma4:e2b, anchored two-pass          108
Claude sonnet-4-6: one open-ended pass (baseline)     52

The offline two-pass produced roughly twice as many verified findings (108 vs. 52). More striking: only 4 quotes overlapped; 100 were local-only and 42 Claude-only. The two approaches are largely disjoint (union ≈ 150). This is the central evidence for the federation thesis: diverse models are complementary, so combining them beats any single model. The "other" model is not a ceiling we approach; it is one more contributor whose blind spots a different model covers.
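An overlap breakdown of this kind is plain set arithmetic over quotes. The sketch below assumes two findings count as the same when their quotes match after lowercasing and whitespace normalisation; that matching rule and the function names are assumptions, and the quotes are invented toy data, not findings from the run.

```python
import re


def normalize(quote: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", quote).strip().lower()


def overlap_report(local_quotes, frontier_quotes) -> dict:
    """Split two extractors' findings into shared, exclusive, and union counts."""
    local = {normalize(q) for q in local_quotes}
    frontier = {normalize(q) for q in frontier_quotes}
    return {
        "local_only": len(local - frontier),
        "shared": len(local & frontier),
        "frontier_only": len(frontier - local),
        "union": len(local | frontier),
    }


local_findings = ["The logs were archived.", "Access  was revoked.", "The audit ran late."]
frontier_findings = ["the logs were archived.", "Funding was paused."]
report = overlap_report(local_findings, frontier_findings)
```

The union is always local_only + shared + frontier_only, so the smaller the shared count, the more a second extractor adds to what the first one found.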

4.2 Validation: contradictions proven offline, no judgment model

Comparison pass over 295 grounded statements (House Select Subcommittee final report + ODNI release), fully offline (local embeddings + deterministic polarity):

295 statements → 222 clusters → 13 proven contradictions, each backed by two attributed verbatim quotes. Representative results:

  • Tabak vs. Daszak: lab notebooks "would have been a requirement" vs. Daszak testified he was "not required to produce the lab notebooks." (cross-witness)
  • Daszak vs. Lauer: "we were locked out of the system" vs. "we never found any evidence that they had been locked out." (cross-witness)
  • Andersen vs. Sen. Rand Paul (CROSS-DOCUMENT): House testimony "we kept the possibilities … open-ended" vs. ODNI/Paul "a paper that said absolutely this did not come from a lab." This contradiction surfaced by comparing two different source documents: the federation payoff in one result.
  • Morens: "I deleted everything with [EcoHealth] people from my entire outlook…" vs. denied attempting to circumvent FOIA.
  • Baric: "a statistical difference … an increase in virulence … is it a gain-of-function phenotype? Absolutely" vs. "this is not a PPP."

No model rendered any of these verdicts. The machine clustered statements by subject (geometry) and flagged opposite polarity (deterministic); a human reads the two quotes and sees the conflict.
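The cluster-then-flag step can be sketched in pure Python. Everything here is an assumption made for illustration: the three-dimensional vectors stand in for nomic-embed output, the greedy threshold clustering stands in for whatever clustering scripts/compare_statements.py uses, and the denial-cue list is a toy. The statements are invented.

```python
import math
import re
from itertools import combinations

DENIAL_CUES = {"not", "never", "no", "denied"}


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def polarity(quote: str) -> int:
    """Deterministic polarity: -1 if the quote contains a denial cue, else +1."""
    words = re.findall(r"[a-z']+", quote.lower())
    return -1 if any(w in DENIAL_CUES for w in words) else 1


def cluster(statements, threshold=0.9):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for s in statements:
        for group in clusters:
            if cosine(group[0]["vec"], s["vec"]) >= threshold:
                group.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def find_contradictions(statements, threshold=0.9):
    """Return quote pairs that share a cluster but have opposite polarity."""
    pairs = []
    for group in cluster(statements, threshold):
        for a, b in combinations(group, 2):
            if polarity(a["quote"]) != polarity(b["quote"]):
                pairs.append((a["quote"], b["quote"]))
    return pairs


statements = [
    {"quote": "The logs were archived", "vec": [1.0, 0.0, 0.1]},
    {"quote": "The logs were never archived", "vec": [0.9, 0.1, 0.1]},
    {"quote": "The meeting was on Tuesday", "vec": [0.0, 1.0, 0.0]},
]
contradictions = find_contradictions(statements)
```

The output is a list of quote pairs, not a verdict; deciding whether a flagged pair is a real conflict is left to the reader, which matches the division of labour described above.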

4.3 Isolation (from §2.6)

0 cross-DB bridges, independent ID spaces — verified live. Datasets cannot contaminate each other.


5. Honest limitations

We report these because a result you can't see the failure modes of isn't a result.

  1. Precision cost of small models. The 108 local findings include noise the 52 Claude findings do not: gemma turns classification stamps and email boilerplate into "findings" (~35% of the sampled local-only findings). This is deterministically recoverable (the boilerplate/garble filters, now built, drop it) and is not a model-quality problem.
  2. Refinement, not magic, on contradictions. The first comparison pass produced 15 candidates including false positives (a "yes / yes" agreement mislabeled as opposed, and Q&A pairs). Tightening to explicit denial-verbs + shared-term requirement + same-exchange filtering cut these to 13 higher-precision pairs; ~2 weak ones (near-duplicate same-speaker quotes) remain and need a near-duplicate filter.
  3. Single-GPU contention. Generation (gemma) and embeddings (nomic) compete for one GPU; they must run in separate phases. In a federation this is naturally distributed.
  4. Locking partially activated. Content-hashing is live; Ed25519 signing of pool documents is not yet switched on (see §2.4).
  5. Federation independence requires model diversity. Running the same small model on N nodes gives correlated refusals, not an independent jury. Real cross-node validation needs different models per node.
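The boilerplate and garble filters mentioned in item 1 can be approximated with simple deterministic rules. The sketch below is an illustration under stated assumptions, not the filters as built: the stamp and header patterns, the alphabetic-ratio heuristic, and the 0.7 threshold are all hypothetical choices.

```python
import re

# Classification stamps are matched case-sensitively so that ordinary prose
# beginning with a word like "Secret" is not dropped.
STAMP_RE = re.compile(r"^(?:UNCLASSIFIED|CONFIDENTIAL|SECRET|TOP SECRET)\b")
HEADER_RE = re.compile(r"^(?:From|To|Cc|Sent|Subject):", re.IGNORECASE)


def is_boilerplate(quote: str) -> bool:
    """True for classification stamps and email header lines."""
    text = quote.strip()
    return bool(STAMP_RE.match(text) or HEADER_RE.match(text))


def is_garbled(quote: str, min_alpha_ratio: float = 0.7) -> bool:
    """True when too few non-space characters are letters (a rough OCR check)."""
    chars = [c for c in quote if not c.isspace()]
    if not chars:
        return True
    return sum(c.isalpha() for c in chars) / len(chars) < min_alpha_ratio


def keep(quote: str) -> bool:
    return not is_boilerplate(quote) and not is_garbled(quote)


candidates = [
    "UNCLASSIFIED//FOUO",
    "From: someone@example.org",
    "Th3 l0g$ w#r3 @rch!v3d ~~ |||",
    "The logs were archived.",
]
kept = [q for q in candidates if keep(q)]
```

Because both rules are pure functions of the quote text, they can be re-run over earlier results without calling a model again.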

6. Why this proves the architecture

  • Offline: every load-bearing step ran on local models or pure compute, with the Anthropic key removed. No egress.
  • Combination > monolith: 108 vs 52 with near-disjoint findings is direct evidence that a federation of diverse models out-recalls a single frontier model.
  • Trust by proof, not opinion: every contradiction is two real quotes with sources; the deterministic gates (verbatim wall, attribution, dedup) are model-agnostic and auditable.
  • Contamination is structurally impossible: separate databases, zero cross-DB bridges, independent ID spaces — proven live.

The frontier model is therefore optional — useful as one booster node's contribution to recall, never a requirement of the substrate. Enki can operate as a fully offline, federated, governed intelligence system, and the combination of many modest nodes is the source of its power.


Appendix — artifacts and reproduction

  • Offline extractor: scripts/atomic_extract.py (two-pass, deterministic gates)
  • Comparison/contradiction pass: scripts/compare_statements.py (embed-cluster + denial-opposition)
  • Redaction analysis: scripts/redaction_stats.py
  • Baseline (frontier): scripts/context_extractor.py / context_extractor_covid.py
  • Isolation validation: scripts/validate_isolation.sql
  • Live result files (G1): Claude baseline 52 findings; offline local 108 findings; 13 proven contradictions.
  • Models: extraction gemma4:e2b (local Ollama); embeddings nomic-embed-text (local); baseline claude-sonnet-4-6.
  • Pools: enki (main), enki_research (Epstein), enki_covid (COVID) — separate PostgreSQL databases on the G1 node.