Architecture — Humboldt

Overview

Humboldt is an artificial researcher — an autonomous agent that investigates laws of protocolized and artificial systems. It runs in two modes:

  • CLI mode: operator-driven research sessions (investigate, deep-read, assess, synthesize)
  • Daemon mode: always-on Discord presence + autonomous background tasks

Research output is structured and versioned (git): YAML law files, project arc documents, lab notebook entries, reading notes. The daemon extends the research into the PI community in real time — posting new findings, responding to conversations, capturing ideas and references from Discord.


Persona Architecture

Humboldt's persona is assembled dynamically from six documents, not a monolithic prompt:

Document Role Loaded by
IDENTITY.md Who Humboldt is — lineage, mission, temperament, voice All Claude calls
LINEAGE.md Intellectual lineage — grows as deep reads complete and laws establish Rich context calls
MEMORY.md Narrative memory of the research journey Rich context calls
METHOD.md Epistemic standards — evidence provenance, confidence levels, falsification CLI research calls
BOOTSTRAP.md Session startup sequence + Decide-phase configuration CLI sessions
methods/M-000-ooda.md OS kernel — the OODA decision gate and research loop CLI sessions

presence.py assembles two tiers of context for Discord calls:

  • _slim_context() — IDENTITY excerpt + law names + latest notebook paragraph. Used for proactive channel posts and notebook announcements.
  • _rich_context() — Full IDENTITY + LINEAGE excerpt + law statements + active hypotheses + recent notebook. Used for @mention responses.

System Components

agent/retrieval.py — Corpus Interface

Primary mode: Direct Pinecone (default) - Embeds queries with Voyage AI voyage-3 - Queries the shared c3po Pinecone index - Namespaces: pdfs, substack, videos, bibliography, discord, discord_links, sig, transcripts, humboldt - Retrieval strategy varies by task (see table below)

Secondary mode: C3PO Worker API (fallback) - HTTP calls to the deployed c3po worker - Used for cross-checking or when direct Pinecone access is unavailable

The humboldt namespace holds Humboldt's own output — notebook entries, reading notes, law and hypothesis YAMLs — indexed by agent/ingest.py. Self-retrieval enables corpus-grounded responses about Humboldt's own prior work.

agent/synthesizer.py — Claude Interface

Wraps the Anthropic API for research synthesis tasks:

  • Hypothesis generation: given a topic, propose candidate laws and sub-questions
  • Evidence analysis: extract relevant evidence from retrieved chunks and rate quality
  • Law formulation: draft structured law statements with scope conditions and falsification criteria
  • Theory sketching: scan existing laws for unification opportunities

Uses claude-sonnet-4-6. Prompt caching on the system block (persona documents are large and reused across calls in a session).

agent/ingest.py — Self-Indexing Pipeline

Chunks and embeds Humboldt's own research output into the humboldt Pinecone namespace:

  • notebook/*.md — lab notebook entries, chunked by ## section
  • bibliography/notes/*.md — deep-read notes, chunked by section
  • bibliography/shallow-reads/*.md — shallow-read notes, chunked by section
  • research/c/*.yaml — Curiosity items (exploration phase)
  • research/h/*.yaml — Hypothesis items (sensemaking phase)
  • research/cl/*.yaml — Candidate Law items (valley phase)
  • research/f/*.yaml — Falsification Monitor items (retrospective phase)
  • research/ds/*.md — Deep Story arc files, chunked by section
  • inbox/discord-idea-*.md — community-captured ideas

Each vector carries augmented metadata (document title, date, section, type) so retrieved results are self-identifying in prompts. Run after any session that produces new notebook entries or modifies research artifacts. The daemon runs ingest_all() automatically after new notebook entries are detected.

agent/publish.py — Website Publishing Pipeline

Renders lab notebook entries to the PI website (humboldt-notebook.html):

  • Converts notebook markdown to HTML via python-markdown
  • Inserts new entries into the website file by anchor marker
  • Commits and pushes to the website repo

Run manually with humboldt publish or humboldt publish --dry-run. The daemon triggers this automatically after ingest_all() when new notebook entries are detected.

agent/references.py — Reference Management

Manages bibliography/references.yaml, a curated list of papers and links:

  • humboldt references list — show reference list by status
  • humboldt references sort — classify unsorted items (read / deep_read / discard) via Claude
  • humboldt references promote — manually promote inbox link captures to the reference list

agent/humboldt.py — CLI Orchestrator

Entry point for all CLI operations. Key commands:

python3 -m agent.humboldt investigate "<topic>"        # corpus retrieval + synthesis
python3 -m agent.humboldt hypothesize "<topic>"        # candidate law generation only
python3 -m agent.humboldt assess <law-id>              # evidence gathering for a law
python3 -m agent.humboldt deepread "<doc-name>"        # M-003 deep read from PDF
python3 -m agent.humboldt inventory                    # display law inventory
python3 -m agent.humboldt ingest                       # embed own docs → humboldt namespace
python3 -m agent.humboldt publish [--dry-run]          # render notebook → website
python3 -m agent.humboldt daemon run                   # start daemon
python3 -m agent.humboldt daemon restart               # hot-reload daemon (SIGUSR1)
python3 -m agent.humboldt daemon status                # PID + state summary
python3 -m agent.humboldt discord post [--draft]       # manual notebook post to Discord
python3 -m agent.humboldt discord sweep [--since DATE] # capture sweep over channel history
python3 -m agent.humboldt references list/sort/promote # reference management

Daemon Layer

The daemon (daemon/) is a long-running Discord bot that runs Humboldt's online presence and background tasks. It is always-on and event-driven, distinct from the operator-driven CLI sessions.

daemon/runner.py — Process Manager

Starts the HumboldtBot Discord client. After the bot exits, checks bot.reload_requested — if set, calls os.execv() to replace the process with updated code, preserving all state.

daemon/discord_client.py — Discord Bot

HumboldtBot(discord.Client) with four scheduled tasks and two event handlers:

Scheduled tasks:

Task Interval Purpose
task_notebook 30 min Watch for new notebook commits; post to #new-nature; trigger ingest + publish
task_feeds 12 h Poll RSS/Atom feeds; run relevance check (Haiku); save to inbox/; DM operator
task_conversation_review 24 h Synthesize recent Discord into notebook; promote inbox links to references
_new_nature_loop Adaptive Proactive #new-nature presence (see below)

Event handlers:

  • on_message: handles @mentions in channels (full rich-context response) and DM commands from the operator (!reload, !status)
  • on_ready: records startup time, writes daemon.pid, triggers _scan_missed_mentions

_new_nature_loop — adaptive presence:

Replaces a fixed-interval task. Checks #new-nature on an exponential backoff schedule based on time since last human message activity: 90s → 3min → 8min → 20min → 30min. Skips @mention messages (those are on_message's responsibility). Thread creation uses the most recent non-mention message as the anchor; falls back to channel post if anchor is older than 15 minutes.

_scan_missed_mentions:

On startup, scans for @mentions that arrived while offline and responds to any not already in responded_mention_ids. Omits "(catching up from while I was offline)" prefix on brief restarts (< 5 min offline).

Graceful shutdown and hot-reload:

  • close() override saves last_clean_shutdown to state and deletes daemon.pid
  • SIGUSR1 handler triggers _graceful_reload(), which sets reload_requested = True and calls close(); runner.py then os.execv()s the process
  • !reload DM from operator triggers the same path

daemon/presence.py — Content Generation

All Claude calls for Discord output. Two context tiers (_slim_context / _rich_context) and six generation functions:

  • generate_notebook_post — post announcing a new notebook entry (Haiku)
  • generate_new_nature_response — proactive channel response to new messages (Haiku)
  • generate_mention_response — @mention reply with full research context (Sonnet)
  • generate_person_notebook_entry — notebook entry about a recurring interlocutor (Sonnet)
  • check_feed_relevance — assess whether a feed item bears on active research (Haiku)
  • generate_conversation_review — daily synthesis of Discord into notebook (Sonnet)

daemon/capture.py — Idea and Reference Capture

After every batch of Discord messages, runs a lightweight Haiku extraction to identify: 1. Ideas or arguments that bear on active hypotheses or challenge current laws 2. External papers, articles, or URLs cited by participants

Captured items are saved to inbox/ as dated markdown files. Deduplicates URLs within a daemon session.

daemon/people.py — Interlocutor Memory

Tracks recurring Discord participants in daemon/people.json (gitignored). After NOTEBOOK_THRESHOLD (3) interactions with a person, flags that a notebook entry should be written about them. Used to personalize @mention responses with interaction history.

daemon/conversation_review.py — Daily Synthesis

Runs every 24 hours: 1. Reads recent #new-nature messages and writes a reflective notebook section (Sonnet) — what emerged, what challenged current thinking 2. Promotes unseen inbox link captures to bibliography/references.yaml as unsorted entries

daemon/feed_monitor.py — Feed Polling

Fetches RSS/Atom feeds configured in daemon/config.yaml. Returns items newer than last_feed_check. Each item is checked for relevance against active hypotheses; relevant items are saved to inbox/.

daemon/state.py — Persistent State

Single JSON file (daemon/state.json, gitignored) tracks everything the daemon needs across restarts:

Field Purpose
last_notebook_commit Git commit hash; detects new notebook entries
notebook_entries_posted Dates already announced to Discord
last_new_nature_message_id Discord cursor for the tick loop
last_new_nature_activity Timestamp of last human message (drives adaptive intervals)
last_feed_check Timestamp; feeds only return items after this
last_conversation_review Date of last daily synthesis pass
responded_mention_ids Message IDs already replied to (cap 500); prevents restart duplicates
last_startup ISO timestamp of most recent daemon startup
last_clean_shutdown ISO timestamp of last graceful shutdown; absence implies crash

Research Inventory

Research output is organized around the Double Freytag phase model (Rao, Tempo). Each phase produces a typed artifact. The DS file is the narrative arc container spanning all phases of a single inquiry.

research/
├── ds/       DS-NNN — Deep Story arc files (one per inquiry thread)
│             The arc container: tracks phase position, tempo, transition trigger,
│             and blocking behavior for each thread. Opened at the start of any
│             new inquiry; closed after the separation event artifact is published.
├── c/        C-NNN — Curiosity items (exploration phase)
│             Provocations, not proto-laws. Flows in continuously from inbox,
│             Discord, reading, and observation. The only rule: not a candidate law.
├── h/        H-NNN — Hypothesis items (sensemaking phase, post-cheap-trick)
│             Tracks the developing framing from first insight to working claim.
│             Created at the cheap trick transition; closed when promoted to CL.
├── cl/       CL-NNN — Candidate Law items (valley phase)
│             Evidence accumulating under an organizing insight. Has a named
│             transition_trigger: the specific condition that would close the valley
│             and open the heavy lift.
├── theories/ T-NNN — Theory items (heavy lift phase)
│             Synthesis committed; writing the publishable artifact. A T item
│             only exists when Humboldt is actively writing toward publication.
└── f/        F-NNN — Falsification Monitor items (retrospective phase)
              Created only after a separation event — a published artifact available
              for independent review. Currently empty: no separation events have occurred.

Phase-to-artifact mapping:

Phase Artifact Created when
Liminal Passage
Exploration C (Curiosity) Any provocation worth keeping
Sensemaking H (Hypothesis) Cheap trick fires; organizing insight crystallizes
Valley CL (Candidate Law) Evidence accumulating; arc in sustained investigation
Heavy Lift T (Theory) Writing toward a publishable separation event
Retrospective F (Falsification Monitor) After a published artifact enters external scrutiny

Transitions: Cheap Trick (exploration → sensemaking) and Separation Event (heavy lift → retrospective) are named. Other transitions are unnamed and triggered by readiness assessment recorded in transition_trigger field of the arc's DS file.

No confidence field. There are no "established" laws — only laws that have not yet been falsified or superseded. F items use status: active | superseded | refuted.


Behavior Inventory

Humboldt's research techniques are called behaviors — named, documented habits rather than recipes. The canonical inventory is behaviors/registry.yaml. Each behavior has a stable hash ID; M-0xx legacy IDs are cross-referenced as legacy_id fields.

Two classification axes: - Classification: supervised (operator-triggered), live (autonomous during a session), daemon (runs outside sessions) - State: stub (defined, not implemented), prototyping (in active development), production (stable)

Boot behaviors (deterministically triggered by the bootstrap sequence):

ID Name State
boot-000 Wakeup Sequence production
boot-001 OODA Decision Gate stub

Supervised behaviors (operator-triggered):

ID Name State Legacy
behavior-t5m Deep Read prototyping M-003
behavior-m7v Cross-Training stub M-014
behavior-h4v Field Trip stub M-007
behavior-n1s Visual Thinking stub M-009

Live behaviors (can run autonomously):

ID Name State Legacy
behavior-q2n Random Links production M-001
behavior-c7r Curiosity Browsing stub
behavior-f8p Canonical Domains stub M-002
behavior-j6d Bullshit Detector stub M-008
behavior-z8l Fermi Estimation stub M-010
behavior-y2g Dyson Design stub M-011
behavior-r4k Thought Experiments stub M-012
behavior-c9p Design Fictions stub M-013
behavior-s5j Open Source Exploration stub M-018

Daemon behaviors (Discord bot + scheduled tasks):

ID Name State
behavior-e2h Feed Intake prototyping
behavior-a8r Conversation Synthesis prototyping
behavior-o4t Idea/Link Capture prototyping
behavior-g7u Notebook Publish production
behavior-v3c Thread Farming prototyping

The full registry with descriptions, source files, and implementation notes is in behaviors/registry.yaml. The methods/ directory contains the detailed specification documents for each behavior; behavior IDs are the canonical reference, M-0xx names are historical.


Deep-Read Library

Source PDFs in bibliography/deep-reads/. Reading notes in bibliography/notes/. READING-HINTS.md is the pre-read index: each entry records the operator's reading hint before the read begins. All reads must use the actual PDF — never from training memory. This is enforced by M-003 procedure.

Candidates not yet in hand are tracked in bibliography/deep-read-hopper.md, with source of recommendation (deep read discovery, shallow read escalation, Discord, operator, web) and PDF status.

Completed reads: Simon (Sciences of the Artificial), Hamming (You and Your Research), von Humboldt (Cosmos Vol. 1), Rao (Tempo), Iverson (Notation as a Tool of Thought).


Inbox

inbox/ receives captured items from three sources: 1. Discord capture (daemon/capture.py) — ideas and references extracted from #new-nature 2. Feed monitor (daemon/feed_monitor.py) — relevant RSS/Atom items 3. Discord sweep (humboldt discord sweep) — historical batch capture

Inbox files are markdown with a structured header. Processed at the start of research sessions; promoted to references via humboldt references promote or the daily conversation review.


Data Flow

CLI research session

humboldt investigate "<topic>"
    │
    ├── assemble_context(): IDENTITY + METHOD + BOOTSTRAP + M-000 + inventory
    │
    ├── retrieval.py: embed topic → Pinecone (c3po + humboldt namespaces)
    │
    ├── synthesizer.py: Claude synthesis pass
    │   System: assembled persona (cached) + existing inventory
    │   User: retrieved chunks + research task
    │
    ├── write/update research/c|h|cl|theories|ds/ YAML/MD artifacts
    │
    └── git add research/ notebook/ && git commit && git push

Daemon notebook cycle

New notebook commit detected (task_notebook, every 30 min)
    │
    ├── presence.generate_notebook_post() → #new-nature channel post
    │
    ├── ingest.ingest_all() → humboldt Pinecone namespace updated
    │
    └── publish.publish() → humboldt-notebook.html → git push website repo

Daemon Discord presence cycle

New #new-nature messages (adaptive: 90s–30min)
    │
    ├── Skip @mention messages (handled by on_message)
    │
    ├── presence.generate_new_nature_response() → maybe post or open thread
    │
    └── capture.run_capture() → ideas/links → inbox/

Connection to C3PO

C3PO Humboldt
User Human researchers via web UI Autonomous agent + PI Discord community
Task Answer questions about protocols Discover and formalize laws of protocolized systems
Output Conversational response + citations Law inventory, project arcs, theory drafts, notebook
Corpus access Own query path Shared Pinecone index (direct) + own humboldt namespace
Persona Reference librarian Naturalist investigator
Deployment Cloudflare Worker Local CLI + always-on daemon

The shared Pinecone index means Humboldt benefits immediately from every new corpus ingestion done by c3po. The humboldt namespace is Humboldt-exclusive — c3po does not index it.


Security

Keys follow the Protocol Institute security policy (../admin/security.md): - All secrets in .env (gitignored, Dropbox-ignored) - Values sourced from ../protocol-institute/.env.keys - Keys registered in ../admin/keys.md - Humboldt reuses c3po keys (VOYAGE, PINECONE, ANTHROPIC) — no new key provisioning required - Additional key: DISCORD_BOT_TOKEN, DISCORD_GUILD_ID, DISCORD_NEW_NATURE_CHANNEL_ID, DISCORD_OPERATOR_USER_ID — registered in ../admin/keys.md