System manual

the engine behind the map — schema v2 + v3 plumbing, 2026-06-09

The funnel at a glance

TEJ (leads)        SCOUT (triage)          MAX (build)             machines
───────────        ──────────────          ───────────             ────────
knows the city ──► reads the site ───────► writes/points the  ──►  ingest.py  ─► events
add_venue_lead     GO: planned + doc       adapter, runs the       (+ series      │
add_source_lead    NO-GO: none + why       LIVE GATE,              derive,        ▼
connect_coverage                           build_source            cost parse)  RATER ─► event_scores
                                                                                  │
        STEWARD re-probes the fleet, re-derives drifted flags ◄───────────────────┘
                                                                   export.py ─► data.json ─► this site

Three independent axes, never conflated: discovery (how we learned a thing exists; the provenance columns), extraction (sources plus the coverage bridge; can we hear it?), and monitoring (do we want it?). An event's visibility is always derived from these, never stored.
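
The "derived, never stored" rule can be sketched as a pure function. This is a minimal illustration, assuming visibility is the conjunction given in the events table (source built, venue monitored, group monitored). The dataclasses are hypothetical stand-ins for the real models, and treating a missing group as passing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    status: str  # "none" | "planned" | "built"


@dataclass
class Target:  # stand-in for a venue or a group row
    monitored: bool


@dataclass
class Event:
    source: Source
    venue: Target
    group: Optional[Target] = None  # assumption: an event may have no group


def is_visible(event: Event) -> bool:
    """Recomputed on every read; nothing here is written back."""
    group_ok = event.group is None or event.group.monitored
    return event.source.status == "built" and event.venue.monitored and group_ok


shown = is_visible(Event(Source("built"), Target(True)))
hidden = is_visible(Event(Source("planned"), Target(True)))
```

Because the result is never persisted, flipping a source or venue flag changes what is shown on the next read without any backfill.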

The tables

SQLite (chicago.db), SQLAlchemy 2.0 models in models.py. Every status flag carries an honesty grade: LIVE (has a writer and is current) · FROZEN (set once at seed, never recomputed) · DEAD (no writer, don't trust) · DERIVED (recomputed from other rows). One more failure mode we learned the hard way is DRIFTED: the flag has a writer, but the writer never re-runs. The Steward exists to hunt those.

pipeline_sources — one row per scrapeable endpoint

| column | meaning | grade |
| --- | --- | --- |
| adapter | transport type (tribe_wp, squarespace, wix, ical, html_jsonld, bibliocommons, localist, do312, ticketmaster_api, …) — one adapter per platform, reused across hundreds of venues | LIVE |
| url | THE scrape endpoint. What the build gate probes is what ingest scrapes — by construction, since the 06-09 fix | LIVE |
| status | none → planned → built. Only the gated operators mint built; a fresh lead caps at planned | LIVE |
| health | stamped at build by the live gate: ok · no_events · http_error · parse_error (+ health_detail with the actual error) | LIVE |
| is_aggregator | one endpoint fanning out to many venues (parks feed) vs a single-owner calendar | LIVE |
| discovered_via/by | provenance, write-once (aggregator · websearch · llm_recall · manual). Added 06-09; older rows NULL | LIVE |
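
The status ladder and the "built rows are locked" rule might look roughly like this. It is a sketch only: the real operators live in operators.py and are not shown in this manual, so the function bodies, the dict-shaped source row, and the fallback to none for an unknown adapter are all assumptions.

```python
# Adapter names taken from the table above; the real list is longer.
KNOWN_ADAPTERS = {
    "tribe_wp", "squarespace", "wix", "ical", "html_jsonld",
    "bibliocommons", "localist", "do312", "ticketmaster_api",
}


def initial_status(adapter: str) -> str:
    """A fresh lead caps at planned; it can never be minted as built."""
    # Assumption: an unrecognised adapter stays at "none" until triage.
    return "planned" if adapter in KNOWN_ADAPTERS else "none"


def update_source_fields(source: dict, **fields) -> dict:
    """Patch attributes, but refuse adapter/url changes on a built source."""
    locked = {"adapter", "url"} & set(fields)
    if source["status"] == "built" and locked:
        raise ValueError(f"refusing to change {sorted(locked)} on a built source")
    source.update(fields)
    return source


src = {"status": "built", "adapter": "ical", "url": "https://example.org/cal.ics", "notes": ""}
update_source_fields(src, notes="checked by hand")  # allowed: not a gated field
```

Changing the endpoint of a built source would mean the stamped health no longer describes what ingest scrapes, which is why the sketch raises instead of overwriting.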

source_coverage — the M:N bridge (source ⇄ venue/group)

| column | meaning | grade |
| --- | --- | --- |
| is_primary | this source is the target's canonical feed (the dedup signal) | LIVE |
| adoption | adopted (curated) vs discovered (firehose found it — the vetting queue) | LIVE |
| target_page_url | the target's page ON that source (an Eventbrite /o/ page). Never a scrape endpoint | LIVE |
| routing_key | aggregator→target join key (park slug) ingest routes by — no name matching | LIVE |
| usefulness | an LLM guess at seed, never recomputed from yield | FROZEN |

The old single url column carried three meanings (target page / routing slug / scrape endpoint) and produced a silent-zero bug: a source could pass the gate yet ingest scraped a stale homepage. Split on 2026-06-09; the endpoint now lives only on the source row.

venues / groups — the where and the who

| column | meaning | grade |
| --- | --- | --- |
| status | candidate → verified. verify_target is the only writer, fired when a working source proves the place is real and hosting | LIVE |
| monitored | "do we want it?" — true iff a dedicated source with status=built AND health=ok feeds it, or it's verified. Re-derived by the Steward; the leads_inbox view lists everything monitored=0 | DERIVED |
| discovered_via/by | provenance, write-once | LIVE |
| hosts_events | LLM guess at seed; should one day derive from real yield | FROZEN |
| maybe_fake | no writer anywhere | DEAD |
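
The monitored rule above reduces to a small predicate. The sketch below assumes "dedicated" means a source that is not an aggregator. The dataclasses and the name derive_monitored are illustrative and are not the real decide_monitored operator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    status: str                  # "none" | "planned" | "built"
    health: str                  # "ok" | "no_events" | "http_error" | "parse_error"
    is_aggregator: bool = False


@dataclass
class Venue:
    status: str = "candidate"    # "candidate" | "verified"
    sources: List[Source] = field(default_factory=list)


def derive_monitored(venue: Venue) -> bool:
    """True iff a built+ok dedicated source feeds the venue, or it is verified."""
    has_working_dedicated = any(
        s.status == "built" and s.health == "ok" and not s.is_aggregator
        for s in venue.sources
    )
    return has_working_dedicated or venue.status == "verified"


fed = Venue(sources=[Source("built", "ok")])
rotting = Venue(sources=[Source("built", "http_error")])
```

Re-running a predicate like this over every venue is the kind of re-derivation that keeps a DERIVED flag from becoming a DRIFTED one.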

events — the payload

| column | meaning | grade |
| --- | --- | --- |
| dedupe_key | venue\|date\|name-slug — re-ingests bump last_seen, never duplicate | LIVE |
| cost / cost_min_cents / is_free | raw source text kept verbatim + deterministic parse at ingest (fact only — "library events are probably free" is persona-layer inference, never stored) | LIVE |
| series_id | stamped by series derivation after every ingest | DERIVED |
| lifecycle | active · past · gone (vanished from its source — likely cancelled). Never deleted | LIVE |
| visibility | not a column. is_visible(event) ⇔ source built ∧ venue monitored ∧ group monitored, computed every read | DERIVED |
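
To make the dedupe_key behaviour concrete, here is a small sketch using the standard-library sqlite3 module and SQLite's upsert clause. The slug rule, the first_seen column, the table layout, and the venue id "v1" are assumptions for illustration; only the venue|date|name-slug shape and the "bump last_seen" behaviour come from the table above.

```python
import re
import sqlite3


def slugify(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def dedupe_key(venue_id: str, date: str, name: str) -> str:
    return f"{venue_id}|{date}|{slugify(name)}"


def upsert_event(conn, venue_id, date, name, seen_at):
    """Insert a new event, or only bump last_seen when the key already exists."""
    key = dedupe_key(venue_id, date, name)
    conn.execute(
        """
        INSERT INTO events (dedupe_key, name, first_seen, last_seen)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(dedupe_key) DO UPDATE SET last_seen = excluded.last_seen
        """,
        (key, name, seen_at, seen_at),
    )
    return key


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (dedupe_key TEXT PRIMARY KEY, name TEXT,"
    " first_seen TEXT, last_seen TEXT)"
)
upsert_event(conn, "v1", "2026-06-12", "Footholds Vol. 7", "2026-06-09")
upsert_event(conn, "v1", "2026-06-12", "Footholds  Vol. 7!", "2026-06-10")  # re-ingest
rows = conn.execute("SELECT dedupe_key, first_seen, last_seen FROM events").fetchall()
```

Both ingests normalise to the same key, so the table holds one row whose last_seen moved forward while first_seen stayed put.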

series — recurring clusters, first-class

| column | meaning | grade |
| --- | --- | --- |
| id | venue\|normalized-name (episode/volume numbers stripped, so "Footholds Vol. 7" clusters with "Vol. 8") | DERIVED |
| cadence | median gap between distinct dates: ≤3d = run (a play — see once) · ~7d weekly · ~14d biweekly · ~28d monthly. Pure date math, no LLM | DERIVED |
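
Both derived columns are plain string and date arithmetic, so they can be sketched in a few lines. The regex, the tolerance windows around 7, 14 and 28 days, and the "one-off" and "irregular" labels are assumptions; the manual only gives the approximate gaps and the ≤3d rule.

```python
import re
from datetime import date
from statistics import median

# Assumption: these are the numbering patterns worth stripping.
_NUMBERING = re.compile(r"\b(?:vol(?:ume)?|ep(?:isode)?)\.?\s*\d+\b|#\s*\d+", re.IGNORECASE)


def series_id(venue_id: str, name: str) -> str:
    """venue|normalized-name with episode/volume numbers removed."""
    base = _NUMBERING.sub("", name)
    slug = re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")
    return f"{venue_id}|{slug}"


def cadence(dates) -> str:
    """Classify a series by the median gap between its distinct dates."""
    distinct = sorted(set(dates))
    if len(distinct) < 2:
        return "one-off"  # hypothetical label for a single date
    gap = median((b - a).days for a, b in zip(distinct, distinct[1:]))
    if gap <= 3:
        return "run"
    if 5 <= gap <= 9:
        return "weekly"
    if 12 <= gap <= 16:
        return "biweekly"
    if 25 <= gap <= 32:
        return "monthly"
    return "irregular"  # hypothetical catch-all


same_series = series_id("v1", "Footholds Vol. 7") == series_id("v1", "Footholds Vol. 8")
weekly = cadence([date(2026, 6, 1), date(2026, 6, 8), date(2026, 6, 15)])
```

Using the median rather than the mean keeps one skipped week from pushing a weekly series into another bucket.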

event_scores — append-only judgment

One row per (event, scoring run); newest wins on export; history is never deleted (the rows are the expensive asset). Six 1–10 dims + a weighted composite. Series inherit: the scorer rates a series once (recurrence noted in the prompt) and writes that row to every unscored instance — a 14-week knitting circle costs one LLM call. Re-weighting (weights.json) never requires re-scoring.
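
A short sketch of why re-weighting is free: the stored rows keep the raw dimension scores, and the composite is recomputed from them at export. The dimension names d1..d6, the row shape, and the normalised weighted mean are assumptions; the manual does not name the six dimensions or the exact formula.

```python
def composite(dims: dict, weights: dict) -> float:
    """Weighted mean of the stored dimension scores (assumed formula)."""
    total = sum(weights.values())
    return sum(dims[name] * w for name, w in weights.items()) / total


def latest_scores(rows: list) -> dict:
    """Newest scoring run wins per event; older rows are kept, just not chosen."""
    latest = {}
    for row in rows:
        current = latest.get(row["event_id"])
        if current is None or row["scored_at"] > current["scored_at"]:
            latest[row["event_id"]] = row
    return latest


rows = [
    {"event_id": "e1", "scored_at": "2026-06-01",
     "dims": {"d1": 5, "d2": 5, "d3": 5, "d4": 5, "d5": 5, "d6": 5}},
    {"event_id": "e1", "scored_at": "2026-06-09",
     "dims": {"d1": 8, "d2": 4, "d3": 6, "d4": 6, "d5": 6, "d6": 6}},
]
newest = latest_scores(rows)["e1"]

equal = {name: 1 for name in newest["dims"]}
tilted = dict(equal, d1=3)  # same stored row, different weights

before = composite(newest["dims"], equal)   # 36 / 6
after = composite(newest["dims"], tilted)   # 52 / 8
```

Changing the weights changes the composite without touching the stored rows, so no new LLM call is needed.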

The plumbing — operators are the only writes

The durable layer is mutated only through operators.py — idempotent, never drop, no raw SQL, no JSON seeds. The allowlist a role gets is its job description:

| operator | does | callers |
| --- | --- | --- |
| add_venue_lead / add_group_lead | mint a candidate (fills fields only on create) | Tej |
| add_source_lead | register an endpoint; known adapter → planned, never built | Tej, Scout |
| update_venue_fields / update_source_fields | patch attrs post-mint (a passed value deliberately overwrites). update_source_fields refuses adapter/url on a built source — that would falsify the build | any |
| probe_source_live | THE BUILD GATE — run the real adapter, classify ok/no_events/http_error/parse_error. Pure, writes nothing | Max, Steward |
| mark_source_built | flip to built, gated — always re-runs the probe and stamps health honestly (built-but-failing is visible, never silent) | Max; Steward re-stamps |
| build_source | atomic happy path: gate + coverage + verify_target (iff health=ok) + decide_monitored | Max |
| set_source_status | the honest downgrade (NO-GO at triage; rot at maintenance) — reason goes in notes | Scout, Max, Steward |
| connect_coverage | link source ⇄ target (idempotent on the triple) | Tej, Max |
| verify_target | candidate → verified; only fired on proof (a working source) | via build_source |
| decide_monitored | owns venue.monitored: built+ok dedicated source, or verified | via build_source; Steward |
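
The split between a pure probe and a gated write can be sketched as follows. This is one reading of the table, not the real operators.py: the exception-to-health mapping is an assumption, and so is the choice to flip status to built even when the probe fails (taken from "built-but-failing is visible, never silent").

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    health: str          # "ok" | "no_events" | "http_error" | "parse_error"
    detail: str = ""
    valid_events: int = 0


def probe_source_live(adapter: Callable[[str], list], url: str) -> ProbeResult:
    """Run the real adapter and classify the outcome. Pure: writes nothing."""
    try:
        events = adapter(url)
    except ConnectionError as exc:   # assumption: transport failures surface like this
        return ProbeResult("http_error", str(exc))
    except Exception as exc:         # assumption: anything else is a parse problem
        return ProbeResult("parse_error", str(exc))
    if not events:
        return ProbeResult("no_events")
    return ProbeResult("ok", valid_events=len(events))


@dataclass
class Source:
    adapter: Callable[[str], list]
    url: str
    status: str = "planned"
    health: Optional[str] = None
    health_detail: str = ""


def mark_source_built(source: Source) -> ProbeResult:
    """Always re-probe, then stamp whatever the probe actually found."""
    result = probe_source_live(source.adapter, source.url)
    source.status = "built"
    source.health = result.health
    source.health_detail = result.detail
    return result


good = Source(adapter=lambda url: [{"name": "show"}], url="https://example.org/feed")
mark_source_built(good)
```

Because the caller cannot pass a health value in, the only way to get health=ok on a row is for the adapter to return events from the stored url.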

A lead's life (the Empty Bottle worked example — real, 2026-06-09)

1  TEJ    add_venue_lead("Empty Bottle", …)  +  add_source_lead(platform_guess="squarespace")
          └ guessed wrong — and that's fine, because:
2  SCOUT  reads the site: the Squarespace is an empty shell; the real feed is
          Ticketmaster Discovery v2 (96 events). Corrects the source row, writes
          docs/sources/empty-bottle-….md, verdict GO (status=planned).
          (Same pass: Rainbo Club → honest NO-GO, "15 events ever, dead since 2024".)
3  MAX    ports scrape_ticketmaster into scrapers.py, smoke-tests live,
          build_source(…) → gate passes (77 valid) → built/ok, venue verified+monitored
4  INGEST ingest.py --source empty-bottle-… → 65 Event rows (cost parsed: $16.17, free, …)
          → series derived → "RATER HANDOFF: N unscored" printed
5  RATER  scores the new events (announced first — it spends money); series inherit
6  EXPORT export.py → data.json → the map & list you're using
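
Step 4 mentions the deterministic cost parse ("$16.17", "free"). A minimal sketch of such a parser is below, assuming dollar amounts and the literal word "free" are the only signals. The regexes and the None-for-unknown convention are assumptions, and real listings with thousands separators or ranges in other formats would need more rules.

```python
import re
from decimal import Decimal


def parse_cost(raw: str) -> dict:
    """Keep the raw text verbatim and extract only what it literally states."""
    text = raw.lower()
    cents = [
        int(Decimal(amount) * 100)
        for amount in re.findall(r"\$\s*(\d+(?:\.\d{1,2})?)", text)
    ]
    if re.search(r"\bfree\b", text):
        cents.append(0)
    if not cents:
        # No stated price: record nothing rather than infer one.
        return {"cost": raw, "cost_min_cents": None, "is_free": None}
    low = min(cents)
    return {"cost": raw, "cost_min_cents": low, "is_free": low == 0}


ticket = parse_cost("$16.17")
gratis = parse_cost("Free")
unknown = parse_cost("Donations welcome")
```

Returning None for an unstated price keeps the stored columns to facts, in line with the rule that "probably free" is inference and is never stored.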

The roles — a staff, not a script

A role = a charter (.claude/agents/<name>.md) + a scoped operator allowlist. The scoping is the safety: an AI employee's worst failure mode is confidently fabricating state, and here the fabrications are structurally impossible — Tej can't fake a built pipeline, Scout can't build, built is earned from a live probe, not a claim.

Tej — Head of Leads
Turns knowledge of the city into candidate rows. Industrial move: mining aggregator directories (Do312's venue index alone seeded 150 venues + 151 ready sources in one run).
operators: add_*_lead · connect_coverage

Scout — Source Researcher
The triage gate. Reads the site, sniffs the platform, writes the source doc, says GO or an honest NO-GO with the reason. Most sites aren't worth scraping; saying so is the deliverable.
operators: add_source_lead (→planned) · set_source_status

Max — Head of Scraping
Writes/points adapters, runs the live gate, flips sources to built. The only role that builds — and bounded: scrapes are hard-capped, long jobs stay in the main loop.
operators: mark_source_built · build_source · verify_target · connect_coverage

Rater — Head of Rating
Scores events (six universal dims), sanity-checks the ordering, tunes weights.json. Spends money, so runs are announced. Scores series once; instances inherit.
owns: event_scores · data/weights.json

Steward — Maintenance
The role that keeps things TRUE. Re-probes every built source, re-derives drifted flags, downgrades honestly. Report-first; --fix applies via operators. Born after an audit found 34/45 monitored venues drifted.
tool: steward.py · operators: probe · re-stamp · set_source_status · decide_monitored
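
The "allowlist is the job description" idea can be sketched as a lookup that runs before any operator does. The allowlists below are copied from the role cards above (with add_*_lead expanded and the Steward's probe and re-stamp read as probe_source_live and mark_source_built). The dispatch function and the toy operator registry are hypothetical and do not describe how the real charters are enforced.

```python
ALLOWLIST = {
    "tej": {"add_venue_lead", "add_group_lead", "add_source_lead", "connect_coverage"},
    "scout": {"add_source_lead", "set_source_status"},
    "max": {"mark_source_built", "build_source", "verify_target", "connect_coverage"},
    "steward": {"probe_source_live", "mark_source_built", "set_source_status",
                "decide_monitored"},
}


def dispatch(role: str, operator: str, operators: dict, *args, **kwargs):
    """Refuse the call outright if the operator is not on the role's allowlist."""
    if operator not in ALLOWLIST.get(role, frozenset()):
        raise PermissionError(f"{role} may not call {operator}")
    return operators[operator](*args, **kwargs)


# Toy operators standing in for the real ones.
operators = {
    "add_venue_lead": lambda name: {"name": name, "status": "candidate"},
    "build_source": lambda source_id: {"id": source_id, "status": "built"},
}

lead = dispatch("tej", "add_venue_lead", operators, "Empty Bottle")

try:
    dispatch("tej", "build_source", operators, "some-source")
    tej_built = True
except PermissionError:
    tej_built = False
```

The check happens before the operator body runs, so a role outside the allowlist never reaches the code that would write built.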

The missing manager is deliberately thin: leads_inbox (a SQL view of every un-built row) is the work queue, and coverage-per-hood is the KPI. Dispatch is a query, not an agent — for now.

Keeping it true

Repo pointers (the deeper layer)