# Offline, Federated, Governed Intelligence Extraction
### Proving a small-model federation can match — and in recall exceed — a frontier model, on real documents

**Enki Systems · Brad Harris · 2026-06-23**
*Working whitepaper — results are from live runs on the G1 production node, not simulations.*

---

## 1. Thesis

A single large language model is the obvious way to read documents and extract intelligence. It is also the wrong way for a system that must be **self-hostable, offline-capable, auditable, and trustworthy**. A frontier model is a remote dependency, a black box, a refusal surface, and a single point of failure.

We claim — and demonstrate below on real documents — that the load-bearing work can be done **entirely offline on small local models plus deterministic governance**, and that **a federation of diverse small models is not a downgrade but an upgrade**: because the models find largely *different* things, their union has higher recall than any single model, and because validation is *proven* rather than *judged*, the result is more trustworthy than a model's opinion.

Three design commitments make this work:

1. **The substrate is deterministic.** Collapse (entity resolution + merge) is an explicit governed equation, not a model call. Same documents in → same graph out.
2. **The model only transcribes; it never judges.** Extraction asks "who said what" (a transcription task small models do well). Validation — faithfulness, attribution, contradiction — is *proven* deterministically or by geometry (embeddings), never by asking a model for a verdict.
3. **Isolation is physical, not procedural.** Each dataset lives in its own database pool. Cross-referencing can only happen *after* data is inside a pool, and never across pools.

---

## 2. The library system: grab → hash → lock → dedup → isolate

A **library** is a governed dataset: a named, addressed collection of source documents about a focal subject. Libraries are how we keep "the public pool" and "a private research pool" from ever mixing.

### 2.1 Definition
`libraries` (migration 067) carries `library_id`, a unique `base60_address`, `canonical_name`, `focal_term` (the seed subject), `library_subtype`, `tenant_id`, and counters. `library_members` (PK `(library_id, member_kind, member_id)`) ties documents/entities/events to the library. *Evidence: `storage/postgres/migrations/067_library_and_bibliographic.sql`.*

### 2.2 Grab
Documents are acquired by `core/discovery/acquirer.py` (archive.org search with token-overlap validation, then direct `.pdf` URL) and ingested through `core/documents/ingester.py` `persist_document/persist_page/persist_paragraph`.

### 2.3 Hash
On ingest, content is hashed: `content_hash = "sha256:" + sha256(raw_bytes)`, and a bare 64-hex `content_sha256` is derived for the federation key. Both are stored on the `documents` row. *Evidence (live, `enki_covid`): every document carries a `content_sha256` (`424db387…`, `613e808c…`, `bc9d5ad3…`).*
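
The two hash fields described above can be derived in a few lines. This is a minimal sketch, not the ingester's code: the function name `content_hashes` is hypothetical, and only the `"sha256:"` prefix and the bare 64-hex form come from the description above.

```python
import hashlib


def content_hashes(raw_bytes: bytes) -> tuple[str, str]:
    """Return (content_hash, content_sha256) for a document's raw bytes."""
    digest = hashlib.sha256(raw_bytes).hexdigest()  # 64 lowercase hex characters
    return "sha256:" + digest, digest


# Hypothetical input; a real ingest would pass the bytes of the acquired file.
content_hash, content_sha256 = content_hashes(b"example document bytes")
```

Because both values are pure functions of the bytes, any node that holds the same file derives the same identity without coordination.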

### 2.4 Lock
Content-addressing makes a document tamper-evident: the `content_sha256` *is* its identity, and federation envelopes are stored at a content-addressed path and signed with Ed25519 (`enki_signer_pubkey`, `enki_signed_at`; `core/federation/extraction_publisher.py`, `peer_identity.py`). Entities get a `canonical_entity_key = sha256(name|type|strongest_id)`, which serves as the cross-node identity lock (`core/federation/identity.py`).
**Honest status:** content-hash locking is **live**; cryptographic *signing* of pool documents (`enki_signed_at`) exists in code but is **not yet activated** on the current pools. Activating it is part of the fresh rebuild.
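
As an illustration of the entity lock, the sketch below computes a key of the form `sha256(name|type|strongest_id)`. The normalization (trim, lowercase) and the literal `|` separator are assumptions made for the example; the actual rules live in `core/federation/identity.py` and may differ.

```python
import hashlib


def canonical_entity_key(name: str, entity_type: str, strongest_id: str) -> str:
    """Derive a stable key from an entity's name, type, and strongest identifier."""
    # Assumption: trim and lowercase name/type so cosmetic variants collide on one key.
    parts = [name.strip().lower(), entity_type.strip().lower(), strongest_id.strip()]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# "example-id-1" is a placeholder identifier, not a real record.
key_a = canonical_entity_key("Johanna Sjoberg", "person", "example-id-1")
key_b = canonical_entity_key("  johanna sjoberg ", "Person", "example-id-1")
```

Under these assumptions `key_a` and `key_b` are equal, which is the property that lets two nodes agree on an entity without sharing row IDs.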

### 2.5 Dedup ("do we already have this?")
`documents.hash` is a **UNIQUE** column, and ingest is an `INSERT … ON CONFLICT (hash) DO UPDATE`: re-adding identical content upserts metadata and never creates a second row. Entities dedup on `canonical_entity_key`. *Evidence (live): `documents_hash_key` unique index present; `enki_research` shows 1,899 entities / 1,899 with `canonical_entity_key` = 100% coverage.*
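
The upsert behaviour can be demonstrated end to end with an in-memory SQLite database standing in for PostgreSQL (SQLite 3.24+ accepts the same `ON CONFLICT … DO UPDATE` form). The three-column table and the `ingest` helper are simplified, hypothetical stand-ins for the real schema and ingester.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, hash TEXT UNIQUE, title TEXT)"
)


def ingest(raw_bytes: bytes, title: str) -> None:
    """Insert a document, or update its metadata if the same bytes already exist."""
    content_hash = "sha256:" + hashlib.sha256(raw_bytes).hexdigest()
    conn.execute(
        "INSERT INTO documents (hash, title) VALUES (?, ?) "
        "ON CONFLICT (hash) DO UPDATE SET title = excluded.title",
        (content_hash, title),
    )


ingest(b"same bytes", "first title")
ingest(b"same bytes", "corrected title")  # identical content: no second row

row_count = conn.execute("SELECT count(*) FROM documents").fetchone()[0]
stored_title = conn.execute("SELECT title FROM documents").fetchone()[0]
```

After both calls the table holds a single row carrying the later metadata, which is the "do we already have this?" answer expressed as a constraint rather than a lookup.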

### 2.6 Isolate (the central guarantee)
Each pool is a **separate PostgreSQL database** selected by the `ENKI_DB_DSN` env var; a process binds to exactly one pool (`storage/postgres/db.py`, single `ThreadedConnectionPool`). There is **no code path that opens two pools at once.**

The guarantee is stronger than code discipline — it is enforced by PostgreSQL itself:

> **Live validation, 2026-06-23:** every pool reports **0 `dblink`/`postgres_fdw` extensions and 0 foreign servers.** A query inside `enki_covid` *cannot* reach `enki`. Cross-pool mixing is not "discouraged"; it is impossible.
>
> **Independent ID spaces:** `entity_id = 1` resolves to a *different* real-world entity in each pool — `enki_research` → "Johanna Sjoberg" (person), `enki` → "United States" (country), `enki_covid` → (none). There is no shared namespace to leak across.

**Cross-referencing therefore happens only after data is inside a pool, and only within that pool** (`core/gdu/entity_resolution.py`, `core/workers/collapse_worker.py` operate on one connection). Federation moves data *between* pools only via signed, content-addressed envelopes that the receiving pool re-resolves locally on its own `canonical_entity_key` — it never joins across databases.
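
A check of the kind reported above can be run against any pool with two catalog queries. The sketch below is an assumption about how such a check might look, not the contents of `scripts/validate_isolation.sql`; it takes any DB-API cursor that uses `%s` placeholders (as psycopg2 does) and counts bridge extensions and foreign servers.

```python
from typing import Any

BRIDGE_EXTENSIONS = ("dblink", "postgres_fdw")


def isolation_report(cursor: Any) -> dict[str, int]:
    """Count cross-database bridges visible from the pool the cursor is bound to."""
    cursor.execute(
        "SELECT count(*) FROM pg_extension WHERE extname IN (%s, %s)",
        BRIDGE_EXTENSIONS,
    )
    bridge_extensions = cursor.fetchone()[0]
    cursor.execute("SELECT count(*) FROM pg_foreign_server")
    foreign_servers = cursor.fetchone()[0]
    return {"bridge_extensions": bridge_extensions, "foreign_servers": foreign_servers}


def is_isolated(report: dict[str, int]) -> bool:
    """A pool passes only if every bridge count is zero."""
    return all(count == 0 for count in report.values())
```

Running `is_isolated(isolation_report(cur))` once per pool, with `cur` opened on that pool's DSN, would reproduce the zero-bridge result as a boolean that can gate a deploy.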

---

## 3. Extraction methodology (fully offline)

```
Pass 1  ANCHORS        deterministic high-recall net: names, orgs, dates, cue words
Pass 2  ANCHOR-FILL    local model (gemma4:e2b) TRANSCRIBES, guided by Pass-1 anchors,
                       one proposition per statement — no interpretation
        ── VALIDATION (deterministic, no model judgment) ──
        · faithfulness wall : quote must appear verbatim in the source window
        · attribution check : the named speaker must sit adjacent to the quote
        · boilerplate filter: drop classification stamps / email headers
        · garble filter     : drop OCR-mangled quotes
COLLAPSE               deterministic governed equation → canonical graph, base-60 addresses
AFTER-CALL  COMPARE    embed each claim (nomic-embed, local) → cluster by proposition →
                       opposite polarity in a cluster = CONTRADICTION, proven by two quotes
```
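
The two quote-level gates in the validation block above can be sketched as plain string checks. Everything here is illustrative: the function names are hypothetical, whitespace is collapsed before matching (an assumption, so that line wraps do not defeat a verbatim match), and "adjacent" is approximated as "within `radius` characters of the quote".

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so wrapped lines still match."""
    return re.sub(r"\s+", " ", text).strip()


def passes_faithfulness(quote: str, window: str) -> bool:
    """Faithfulness wall: the quote must appear verbatim in the source window."""
    q = normalize(quote)
    return bool(q) and q in normalize(window)


def passes_attribution(speaker: str, quote: str, window: str, radius: int = 80) -> bool:
    """Attribution check: the speaker's name must sit near the quote."""
    win, q = normalize(window), normalize(quote)
    start = win.find(q) if q else -1
    if start < 0:
        return False
    end = start + len(q)
    context = win[max(0, start - radius):start] + win[end:end + radius]
    return speaker.lower() in context.lower()


# Hypothetical source window, wrapped across two lines as OCR output often is.
window = 'Witness A said "the records\nwere kept on site" in reply.'
faithful = passes_faithfulness("the records were kept on site", window)
attributed = passes_attribution("Witness A", "the records were kept on site", window)
```

Both gates are pure functions of the source text, so a rejected finding can always be explained by pointing at the window it failed against.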

The key move (Brad's): the open-ended task "find everything significant in this passage" — the task that *favours* a frontier model — is decomposed into a cheap high-recall *scaffold* (Pass 1) plus many *narrow* fill-in questions (Pass 2). Small models excel at narrow questions. And the scaffold's base-60 addresses are the shared coordinate that lets federated nodes agree on *what* they are filling in.

**Why no model judges:** small safety-tuned models *refuse* to adjudicate politically-charged claims (empirically: gemma4:e2b returned empty/reflexive-reject when asked to *judge* Fauci/COVID claims, while *transcribing* the same content fine). So judgment was removed from the model's job. A contradiction is *proven by juxtaposition* — one source asserts P, another asserts ¬P about the same proposition — with both quotes shown. No opinion sits between the reader and the evidence.
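
Proof by juxtaposition reduces to two mechanical tests: are two statements about the same proposition, and do they carry opposite polarity? The sketch below is a toy under stated assumptions: a pairwise cosine threshold stands in for the clustering step, the three-dimensional vectors are invented placeholders for real `nomic-embed` output, and the negation cue list is a hypothetical minimum.

```python
import math
import re
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Claim:
    speaker: str
    quote: str
    vector: list[float]  # embedding, assumed precomputed by a local model


NEGATION_CUES = {"not", "never", "no", "denied"}  # assumption: toy cue list


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def polarity(quote: str) -> int:
    """Return -1 if the quote contains a negation cue, else +1."""
    words = set(re.findall(r"[a-z']+", quote.lower()))
    return -1 if words & NEGATION_CUES else 1


def find_contradictions(claims: list[Claim], threshold: float = 0.8):
    """Pair up claims that are close in embedding space but opposite in polarity."""
    pairs = []
    for a, b in combinations(claims, 2):
        same_proposition = cosine(a.vector, b.vector) >= threshold
        if same_proposition and polarity(a.quote) != polarity(b.quote):
            pairs.append((a, b))
    return pairs


claims = [
    Claim("Daszak", "we were locked out of the system", [0.9, 0.1, 0.0]),
    Claim("Lauer", "we never found any evidence that they had been locked out", [0.8, 0.2, 0.1]),
    Claim("Tabak", "lab notebooks would have been a requirement", [0.0, 0.1, 0.9]),
]
contradictions = find_contradictions(claims)
```

The output is a list of quote pairs, not a verdict: the code never states which side is right, only that the two statements cannot both stand.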

---

## 4. Results (real documents, live runs)

### 4.1 Recall: small-model two-pass vs. frontier one-pass
Same document (ODNI declassified COVID release, doc 3 in `enki_covid`, 57 windows):

| | Verified findings |
|---|---|
| **Offline local** — gemma4:e2b, anchored two-pass | **108** |
| Claude sonnet-4-6 — one open-ended pass (baseline) | 52 |

The offline two-pass found about **2× the recall**, measured as verified findings (108 vs. 52). More striking: only **4 quotes overlapped**; 100 were local-only and 42 Claude-only. **The two approaches are largely disjoint** (union ≈ 150). This is the central evidence for the federation thesis: diverse models are *complementary*, so combining them beats any single model. The "other" model is not a ceiling we approach; it is one more contributor whose blind spots a different model covers.

### 4.2 Validation: contradictions proven offline, no judgment model
Comparison pass over 295 grounded statements (House Select Subcommittee final report + ODNI release), fully offline (local embeddings + deterministic polarity):

**295 statements → 222 clusters → 13 proven contradictions**, each backed by two attributed verbatim quotes. Representative results:

- **Tabak vs. Daszak** — lab notebooks *"would have been a requirement"* vs. Daszak testified he was *"not required to produce the lab notebooks."* (cross-witness)
- **Daszak vs. Lauer** — *"we were locked out of the system"* vs. *"we never found any evidence that they had been locked out."* (cross-witness)
- **Andersen vs. Sen. Rand Paul — CROSS-DOCUMENT** — House testimony *"we kept the possibilities … open-ended"* vs. ODNI/Paul *"a paper that said absolutely this did not come from a lab."* — a contradiction surfaced by comparing **two different source documents**: the federation payoff in one result.
- **Morens** — *"I deleted everything with [EcoHealth] people from my entire outlook…"* vs. denied attempting to circumvent FOIA.
- **Baric** — *"a statistical difference … an increase in virulence … is it a gain-of-function phenotype? Absolutely"* vs. *"this is not a PPP."*

No model rendered any of these verdicts. The machine clustered statements by subject (geometry) and flagged opposite polarity (deterministic); a human reads the two quotes and sees the conflict.

### 4.3 Isolation (from §2.6)
0 cross-DB bridges, independent ID spaces — verified live. Datasets cannot contaminate each other.

---

## 5. Honest limitations

We report these because a result whose failure modes you cannot see is not a result.

1. **Precision cost of small models.** The 108 local findings include noise that the 52 Claude findings do not: gemma turns classification stamps and email boilerplate into "findings" (~35% of the local-only findings sampled). This noise is *deterministically* recoverable (the boilerplate and garble filters, now built, drop it) and is not a model-quality problem.
2. **Refinement, not magic, on contradictions.** The first comparison pass produced 15 candidates including false positives (a "yes / yes" agreement mislabeled as opposed, and Q&A pairs). Tightening to explicit denial-verbs + shared-term requirement + same-exchange filtering cut these to 13 higher-precision pairs; ~2 weak ones (near-duplicate same-speaker quotes) remain and need a near-duplicate filter.
3. **Single-GPU contention.** Generation (gemma) and embeddings (nomic) compete for one GPU; they must run in separate phases. In a federation this is naturally distributed.
4. **Locking partially activated.** Content-hashing is live; Ed25519 *signing* of pool documents is not yet switched on (see §2.4).
5. **Federation independence requires model diversity.** Running the *same* small model on N nodes gives correlated refusals, not an independent jury. Real cross-node validation needs *different* models per node.
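
To make limitation 1 concrete, a boilerplate filter of the kind described can be a handful of anchored patterns. This is a hedged sketch only: the patterns and function names are hypothetical and are not taken from `scripts/atomic_extract.py`. Classification stamps are matched case-sensitively and as whole lines so that an ordinary sentence beginning with "Secret" survives.

```python
import re

BOILERPLATE_PATTERNS = [
    # Whole-line classification stamp, optionally with caveats such as //FOUO.
    re.compile(r"^\s*(TOP SECRET|SECRET|CONFIDENTIAL|UNCLASSIFIED)(//[A-Z/ ]+)?\s*$"),
    # Email header lines.
    re.compile(r"^\s*(From|To|Cc|Sent|Subject):", re.IGNORECASE),
]


def is_boilerplate(quote: str) -> bool:
    return any(pattern.search(quote) for pattern in BOILERPLATE_PATTERNS)


def drop_boilerplate(quotes: list[str]) -> list[str]:
    """Keep only quotes that match none of the boilerplate patterns."""
    return [q for q in quotes if not is_boilerplate(q)]


kept = drop_boilerplate([
    "UNCLASSIFIED//FOUO",
    "Subject: RE: meeting",
    "we were locked out of the system",
])
```

Because the filter is a fixed pattern list, every dropped "finding" can be traced to the rule that removed it, which keeps the precision fix as auditable as the extraction gates.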

---

## 6. Why this proves the architecture

- **Offline:** every load-bearing step ran on local models or pure compute, with the Anthropic key removed. No egress.
- **Combination > monolith:** 108 vs 52 with near-disjoint findings is direct evidence that a federation of diverse models out-recalls a single frontier model.
- **Trust by proof, not opinion:** every contradiction is two real quotes with sources; the deterministic gates (verbatim wall, attribution, dedup) are model-agnostic and auditable.
- **Contamination is structurally impossible:** separate databases, zero cross-DB bridges, independent ID spaces — proven live.

The frontier model is therefore *optional* — useful as one booster node's contribution to recall, never a requirement of the substrate. Enki can operate as a fully offline, federated, governed intelligence system, and the combination of many modest nodes is the source of its power.

---

## Appendix — artifacts and reproduction
- Offline extractor: `scripts/atomic_extract.py` (two-pass, deterministic gates)
- Comparison/contradiction pass: `scripts/compare_statements.py` (embed-cluster + denial-opposition)
- Redaction analysis: `scripts/redaction_stats.py`
- Baseline (frontier): `scripts/context_extractor.py` / `context_extractor_covid.py`
- Isolation validation: `scripts/validate_isolation.sql`
- Live result files (G1): Claude baseline 52 findings; offline local 108 findings; 13 proven contradictions.
- Models: extraction `gemma4:e2b` (local Ollama); embeddings `nomic-embed-text` (local); baseline `claude-sonnet-4-6`.
- Pools: `enki` (main), `enki_research` (Epstein), `enki_covid` (COVID) — separate PostgreSQL databases on the G1 node.