Substrate Specification
The structural foundation for knowledge, computation, and navigation. Any reader — human, agent, browser, shell, website — navigates the same structure.
One Primitive: Chunk
A chunk is a unit of meaning with identity.
chunk {
id: globally unique, system-generated
name: optional, unique within parent scope
spec: structural contract (system-enforced)
body: everything else — text, properties, data (reader-interpreted)
}
Two fields, cleanly separated by who reads them:
Spec is for the system. A structural contract enforced on writes. When a chunk is placed on this chunk as instance, the system validates conformance. Non-conforming writes are rejected.
Body is for the reader. A JSON object. All content
lives here — readable text (body.text), structured
properties, any data. The system does not distinguish "readable content"
from "structured fields." That distinction is a reader concern. FTS
indexes all string values in body.
A chunk can serve as:
- Content — carrying meaning in body, placed on identity chunks
- Identity / Entity — a named chunk that other chunks are placed on
- Archetype — a chunk whose spec defines a type contract for instances
- Connection — a chunk placed on multiple identity chunks, bridging them
These roles are not declared. They emerge from how the chunk is placed and referenced.
One Mechanism: Placement
A placement puts a chunk into a scope (another chunk).
placement {
chunk_id: the chunk being placed
scope_id: the chunk it's placed on
type: 'instance' or 'relates'
seq: optional integer — position in an ordered sequence
}
Instance means the chunk IS a member of the scope. "This prompt IS an event in this session."
Relates means the chunk is ABOUT the scope. "This note is about Alan Turing but isn't a defining part of him."
Seq is the ordering mechanism. Position is relative
to the scope — the same chunk can be item 1 in one scope and item 43 in
another. When a scope's spec says ordered: true, all
instance placements must have a seq value.
What placement creates
Identity. A chunk with other chunks placed on it IS a scope. No declaration needed — it emerges from the placement graph.
Hierarchy. A chunk placed on nothing is root-level. A chunk placed on another is nested within it. The hierarchy is structural (organizational), not protective. Anything can reference anything regardless of nesting. Peers enforce boundaries, hierarchy structures.
Connections. A chunk placed on multiple identity
chunks bridges them. The chunk at the intersection of
turing and cambridge IS the relationship
between Turing and Cambridge. The meaning of the connection is in the
chunk's body. No separate link/edge primitive needed.
Consolidation principle. If content doesn't create a new intersection of scopes, it belongs on the identity chunk itself (in its body), not as a separate chunk. Separate chunks exist when they bridge multiple identities.
Name resolution
Names are unique within their parent scope, like files in a directory. The system references by ID internally. Names are human-readable labels resolved contextually:
- Scoped to
{session}:promptresolves to the session's prompt concept - Scoped to
{email}:promptresolves to the email's prompt concept - When ambiguous, the view disambiguates:
session/promptvsemail/prompt
Since the system uses IDs, renaming is trivial — nothing structural depends on the name.
Spec Language
A chunk's spec defines the structural contract for its instances. The system enforces this on writes.
{
"ordered": true, // instances must have seq on placement
"accepts": ["prompt", "answer"], // instances must be instance of one of these
// (names resolved within this chunk's scope)
"required": ["tool", "exit"], // instances must have these body keys
"unique": ["name"], // these body keys must be unique across instances
"propagate": true // spec propagates to instances' children (composition)
}ordered — when true, every instance
placement on this scope must have a seq value. Auto-assigned on append
if omitted. Explicit seq for specific positioning.
accepts — instance chunks must
themselves be instance of one of the listed types. Names are resolved
within the scope — prompt means the prompt
chunk placed on this scope, not a global prompt. A chunk
must not be an instance of two types that both appear in the same
scope's accepts list — the system rejects this on write to
prevent type ambiguity.
required — instance chunks must have
these keys present in their body.
unique — these body keys must have
values that are unique across all instances of this scope.
propagate — when true, the spec is not
enforced on direct placements on this chunk. Instead, it propagates to
instances: anything placed on an instance of this archetype is checked
against this spec. Multiple propagating archetypes compose by union — if
a chunk is an instance of two archetypes that both have
propagate: true, the effective contract on that chunk's
children is the union of both specs. Each archetype's
accepts names are resolved within that archetype's own
scope.
The distinction: required describes what
instances ARE (self-enforcement, no propagate).
accepts describes what instances CONTAIN
(content contract, propagate). ordered can be either —
ordered direct members (no propagate, e.g. context) or
ordered content on instances (propagate, e.g. session).
Archetypes
An archetype is a chunk whose spec defines a type contract. There is no "archetype" flag — a chunk with a non-empty spec is an archetype by nature.
Archetypes enable typed interfaces. The shell can require "I need an
instance of session" — meaning a chunk placed on the
session archetype, conforming to its spec.
Type definitions use relates.
Type-defining chunks (like prompt on session)
are placed as relates, not instance. This
keeps them out of ordered content (no seq required) while remaining
resolvable for accepts name resolution. Content chunks use
instance with dual placement — on the scope (with seq) and
on the archetype (for type membership).
Dual placement enforcement. A chunk with dual
placement (scope + type) needs its type-membership placement visible
before its scope placement is enforced — otherwise the
accepts check can't find the type membership. The
substrate's two-pass write-then-validate (see Mutations and validation) makes
this work.
Mutations and validation
Mutations are atomic. A declaration — one or many chunks, with their placements — succeeds entirely or fails entirely. If any part of a declaration fails validation, none of it is recorded; a commit appears only when every write in the declaration passes. This atomicity is a property of the field itself, not of any storage layer beneath it. Whatever materialization holds the field delivers it.
Spec validation
The spec language defines what conformance means; the substrate enforces conformance at write time.
For a chunk being placed instance on a scope, the
effective contract is the union of:
- The scope's own non-propagating spec, if any.
- The propagating spec of every archetype the scope is itself
instanceof (transitively).acceptsnames from each archetype are resolved within that archetype's own scope.
The chunk must satisfy the union:
ordered— the placement carries aseq. Auto-assigned tomax(existing seq on scope) + 1when omitted.accepts— the chunk isinstanceof at least one of the listed type chunks. A chunk that isinstanceof two of the listed types in the same union contract is rejected (type ambiguity).required— the chunk's body carries every listed key.unique— for each listed body key, the chunk's value is unique across all chunks placedinstanceon this scope.
Validation runs against the post-write state of the declaration: all placements declared together are recorded before any spec is checked. This lets a chunk and its dual placement (on a scope and on its type) appear in the same declaration without ordering failure.
Name uniqueness
Within a scope, the names of all chunks placed instance
on it — when name is non-null — are unique. The same chunk may carry the
same name across multiple scopes; two different chunks may not collide
on a name within any single scope. Enforced at placement-write time.
Example: Session archetype
chunk: session
name: "session"
spec: { propagate: true, ordered: true, accepts: ["prompt", "answer", "tool-call", "tool-result", "context"] }
body: { text: "A sequence of agent interaction events" }
chunk: prompt (placed on session, relates)
name: "prompt"
body: { text: "A user message to an agent" }
chunk: answer (placed on session, relates)
name: "answer"
body: { text: "An agent response" }
chunk: tool-call (placed on session, relates)
name: "tool-call"
spec: { required: ["program"] }
body: { text: "An agent invoking a tool" }
chunk: tool-result (placed on session, relates)
name: "tool-result"
spec: { required: ["program"] }
body: { text: "The result of a tool invocation" }
chunk: context (placed on session, relates)
name: "context"
spec: { ordered: true }
body: { text: "The knowledge context for a turn" }
An actual session instance:
chunk: my-session (placed on session, instance)
name: "my-session"
body: { text: "Debugging the scope algorithm", started: "2026-04-04T10:00Z", agent: "claude" }
chunk: (placed on my-session, instance, seq: 1; placed on prompt, instance)
body: { text: "Why is the scope query returning duplicates?" }
chunk: (placed on my-session, instance, seq: 2; placed on tool-call, instance)
body: { text: "grep -n 'active' src/commands/scope.zig", program: "filesystem" }
chunk: (placed on my-session, instance, seq: 3; placed on tool-result, instance)
body: { text: "src/commands/scope.zig:42: if (active) ...", program: "filesystem" }
chunk: (placed on my-session, instance, seq: 4; placed on answer, instance)
body: { text: "The scope query wasn't filtering by active=1. Fixed." }
History
Commits
Every successful declaration produces a commit.
commit {
id: unique identifier
parent_id: previous commit (NULL for root, forms a DAG)
timestamp: when committed
message: optional description
}
Commits record changes to chunks and placements. The current state on a branch is resolved by walking from that branch's HEAD to root, taking the latest version of each chunk and placement in the ancestry.
Branches
A branch is a movable pointer to a commit.
branch {
name: unique identifier
head: commit id
}
The substrate is branch-aware: every read and every write specifies a branch. The substrate stores all branches and their HEADs. Which branch is "active" for a given consumer at a given moment is consumer-level state, not field state — the consumer chooses which branch to read or commit to.
Branches fork (a new pointer from any commit) and merge (a commit with two parents; conflict resolution is explicit, above the primitives).
Lossless
Nothing is destroyed. Removal is logical: a removed chunk drops out of current-state reads, as do placements involving it (whether the chunk is the scope or the chunk being placed). The version history retains everything as it was — time-travel reads see the past state intact.
Removal is itself a mutation, recorded in a commit like any other write.
Queries / Scoping
Scope
The primary read operation. Declarative, pure, deterministic — same query against the same field state returns the same result.
Select scopes. Get everything at the intersection. Add a scope to narrow. Remove to widen. Scoping a set of chunks (identity chunks / scopes) returns all chunks placed on every one of them. Connected scopes are computed from shared placements.
Traversal
Follow connections. From an identity chunk, find chunks placed on it. Those chunks may also be placed on other identity chunks — those are the connections. Navigate by reading the connecting chunks.
Traversal and scoping work together: scope to narrow the space, then explore connections within it.
Full-text search
A native substrate feature, not an external add-on. Entry point into the structure when the scope vocabulary isn't known yet. The index covers chunk names and string values in chunk bodies. Find chunks by keyword, discover their scopes, then navigate structurally.
The FTS index is maintained by the substrate and stays consistent with current state on each commit.
Derived data — summaries, embeddings, and beyond
Summaries, vector embeddings, and other derived data are not special-cased in the substrate. They are chunks, placed using the same primitives as everything else.
A summary is a chunk placed on the scope it summarizes
(relates) and on a derivation scope like
summaries/opus-4.6. An embedding is a chunk with the vector
in its body, placed on its source chunk and on
embeddings/text-embedding-3-large. Any derived data follows
the same pattern.
Staleness convention. Derived chunks record what they were derived from in their body:
{
"text": "Summary of the scope...",
"source_commit": "c_abc123",
"model": "opus-4.6"
}source_commit is the commit the source scope was at when
the derivation was generated. Staleness check: has the source scope been
mutated after source_commit? If yes, the derived chunk is
stale. This is a reader concern — the substrate stores the data, the
reader decides when to regenerate.
No summary_cache table. No vector table. One primitive
handles it all. The derivation scopes (summaries/...,
embeddings/...) are just scopes — queryable, scopeable,
navigable like anything else.
Temporal scoping
Reconstruct state at any commit. A scope read at
--at <commit-id> resolves all chunks and placements
as of that point in history.
Time travel is read-only. Mutations apply to the chosen branch's HEAD; reading the field at a past commit does not enable mutation from that point. To work from a past state, fork a new branch from that commit and mutate forward.
Negation
Everything in scope A except what's also in scope B. Set difference over placements.
Ingestion
The substrate accepts any content. A chunk's body is a JSON object of any size — a full document, a dataset, an API response. The structural tools (placement, scopes, archetypes, FTS) make it possible for an agent to break content down into the substrate's structure and integrate it. The substrate stores the result; the agent provides the intelligence.
Integration
External content is referenced, not stored. A chunk with body fields containing resolution parameters points outside the system. An integration contract chunk (itself an archetype) defines how to resolve references of that type.
A file reference is a chunk placed on the scopes where it's relevant. Its body contains resolution parameters — at minimum a path, plus anchoring information. File references don't mirror filesystem hierarchy — the placement structure reflects knowledge relationships, not disk layout.
Git as first integration driver. If the referenced file is git-tracked, the substrate commit and the git commit together pin the reference in time. An agent or caretaker can later reconcile: what git commits have touched this file since the reference was made? If the file changed, is the surrounding knowledge still aligned? The substrate stores the fact; the intelligence evaluates it.
Multiple integration types exist — files, API endpoints, audio, video. Each has its own archetype defining required fields (path + commit for git, endpoint + method for REST, etc.). Each has its own resolution logic. The pattern is uniform: a chunk whose body contains resolution parameters, whose archetype defines the contract.
Staleness detection is a reader concern — the database knows when the reference was established (commit), not whether the external world has changed.
Peers
A peer is any directory with a substrate database. Multiple peers compose by mounting — declared dependencies or ad-hoc mounts. The entry point is the containment boundary: the directory you start from is read-write, everything peered in is read-only. Nesting can only reduce access.
Peers are separate databases, not partitions of one. Each peer owns its data. The mounting mechanism is runtime — the peer protocol, conflict resolution, and exact mounting mechanics are open.
Logical schema
The field is composed of:
- Chunks — id (system-generated, globally unique), optional name, spec (JSON), body (JSON). Identity is the id; name is unique within scope.
- Placements — a chunk placed on a scope (another
chunk), typed
instanceorrelates, optionally carryingseqfor ordered scopes. - Commits — forming a DAG via parent links. Each commit groups one declaration of mutations and carries a timestamp and optional message.
- Branches — named pointers to commits.
- A full-text index over chunk names and string values in chunk bodies, native to the substrate.
These are the field's data shape. How they persist — versioned tables
for history, materialized current-state tables for read performance, the
SQL DDL, indexes, FTS hookup, transaction discipline — is db-level. See
pilot/db.md for the physical schema
and access patterns.
What This System Is
A substrate. Not a database for a specific application. Not a retrieval layer for AI. The structural foundation that knowledge, computation, file systems, and agent coordination sit on.
The query is the portal. The quality of the query determines the quality of the context determines the quality of the output. Queries are declarative, composable, pure.
Meaning is reader-determined. The structure tells the reader where to look. The reader discovers what it means. The system does not bake meaning in at write time.
Shape is system-enforced. Archetypes define structural contracts. The system rejects non-conforming instances. Meaning is for the reader. Shape is for the system.
No imposed hierarchy. Placement creates nesting, but nesting is organizational, not protective. Peers enforce boundaries.
Derived data is just data. Summaries, embeddings, and other derivations are chunks placed on derivation scopes. No special tables, no special treatment. The structure is transparent.
Lossless. Nothing is destroyed. Knowledge evolves through addition.
What's Open
- Scope limit and offset. Ordered scopes can grow
large. A
limitparameter (from seq tail) andoffsetfor pagination prevent overload. Default limit prevents accidental flooding. The scope result already includes counts — agents probe shape before pulling data. Deferred for the pilot but needed soon. - Spec language evolution. The four fields (ordered, accepts, required, unique) cover the known cases. The vocabulary may grow through use.
- Merge semantics. The structure supports merge commits. Conflict resolution strategy is above the primitives.
- Peer protocol. Separate databases mounted read-only. The mounting mechanism is runtime, not structural.
- Temporal validity. Event time vs system time is
expressible through body properties. Whether
valid_from/valid_todeserve first-class status depends on use. - Views. Saved scopes with display settings. Drift detection when knowledge changes under an approved view. Concept is clear; mechanics are unbuilt.