Chicago · system manual


The funnel at a glance

TEJ (leads)        SCOUT (triage)          MAX (build)             machines
───────────        ──────────────          ───────────             ────────
knows the city ──► reads the site ───────► writes/points the  ──►  ingest.py  ─► events
add_venue_lead     GO: planned + doc       adapter, runs the       (+ series      │
add_source_lead    NO-GO: none + why       LIVE GATE,              derive,        ▼
connect_coverage                           build_source            cost parse)  RATER ─► event_scores
                                                                                  │
        STEWARD re-probes the fleet, re-derives drifted flags ◄───────────────────┘
                                                                   export.py ─► data.json ─► this site

Three independent axes, never conflated: discovery (how we learned a thing exists; the provenance columns), extraction (can we hear it? the sources plus the coverage bridge), and monitoring (do we want it?). An event's visibility is always derived from these, never stored.

The tables

SQLite (chicago.db), SQLAlchemy 2.0 models in models.py. Every status flag carries an honesty grade: LIVE (has a writer and is current) · FROZEN (set once at seed, never recomputed) · DEAD (no writer, don't trust) · DERIVED (recomputed from other rows). One more failure mode we learned the hard way is DRIFTED: the flag has a writer, but the writer never re-runs. The Steward exists to hunt those.

pipeline_sources — one row per scrapeable endpoint

column	meaning	grade
adapter	transport type (tribe_wp, squarespace, wix, ical, html_jsonld, bibliocommons, localist, do312, ticketmaster_api, …) — one adapter per platform, reused across hundreds of venues	LIVE
url	THE scrape endpoint. What the build gate probes is what ingest scrapes, by construction, since the 2026-06-09 fix	LIVE
status	none → planned → built. Only the gated operators mint `built`; a fresh lead caps at planned	LIVE
health	stamped at build by the live gate: ok · no_events · http_error · parse_error (+ health_detail with the actual error)	LIVE
is_aggregator	one endpoint fanning out to many venues (parks feed) vs a single-owner calendar	LIVE
discovered_via/by	provenance, write-once (aggregator · websearch · llm_recall · manual). Added 2026-06-09; older rows are NULL	LIVE
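The status ladder in the table above can be sketched as follows. This is a minimal illustration, not the code in operators.py: the `SourceRow` dataclass, the in-memory `rows` dict, and the parameter names are assumptions made for the example. The adapter names are the ones listed in the table.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Adapter names copied from the table above; the real registry is longer.
KNOWN_ADAPTERS = {
    "tribe_wp", "squarespace", "wix", "ical", "html_jsonld",
    "bibliocommons", "localist", "do312", "ticketmaster_api",
}


@dataclass
class SourceRow:  # hypothetical stand-in for the pipeline_sources model
    id: str
    url: str
    adapter: Optional[str] = None
    status: str = "none"                  # none -> planned -> built
    health: Optional[str] = None          # stamped only by the live gate
    discovered_via: Optional[str] = None  # write-once provenance


def add_source_lead(rows: Dict[str, SourceRow], source_id: str, url: str,
                    adapter: Optional[str] = None,
                    discovered_via: Optional[str] = None) -> SourceRow:
    """Register an endpoint. Idempotent: an existing row is returned untouched."""
    if source_id in rows:
        return rows[source_id]
    # A fresh lead caps at planned; only the gated operators mint "built".
    status = "planned" if adapter in KNOWN_ADAPTERS else "none"
    row = SourceRow(source_id, url, adapter, status, discovered_via=discovered_via)
    rows[source_id] = row
    return row
```

Calling it twice with the same id leaves the first row intact, which is the "fills fields only on create" behaviour in miniature.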

source_coverage — the M:N bridge (source ⇄ venue/group)

column	meaning	grade
is_primary	this source is the target's canonical feed (the dedup signal)	LIVE
adoption	adopted (curated) vs discovered (firehose found it — the vetting queue)	LIVE
target_page_url	the target's page ON that source (an Eventbrite /o/ page). Never a scrape endpoint	LIVE
routing_key	aggregator→target join key (park slug) ingest routes by — no name matching	LIVE
usefulness	an LLM guess at seed, never recomputed from yield	FROZEN

The old single url column carried three meanings (target page / routing slug / scrape endpoint) and produced a silent-zero bug: a source could pass the gate while ingest scraped a stale homepage instead. Split on 2026-06-09; the endpoint now lives only on the source row.

venues / groups — the where and the who

column	meaning	grade
status	candidate → verified. `verify_target` is the only writer, fired when a working source proves the place is real and hosting	LIVE
monitored	"do we want it?" — true iff a dedicated source with status=built AND health=ok feeds it, or it's verified. Re-derived by the Steward; the leads_inbox view lists everything monitored=0	DERIVED
discovered_via/by	provenance, write-once	LIVE
hosts_events	LLM guess at seed; should one day derive from real yield	FROZEN
maybe_fake	no writer anywhere	DEAD
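The monitored rule from the table can be written as a small pure function. This is a sketch under stated assumptions: the `Source` and `Venue` dataclasses are hypothetical, and "dedicated" is read here as "not an aggregator", which the source does not spell out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:  # hypothetical, trimmed to the fields the rule needs
    status: str                  # none | planned | built
    health: str                  # ok | no_events | http_error | parse_error
    is_aggregator: bool = False


@dataclass
class Venue:  # hypothetical
    name: str
    status: str = "candidate"    # candidate | verified
    sources: List[Source] = field(default_factory=list)


def derive_monitored(venue: Venue) -> bool:
    """True iff a dedicated built+ok source feeds the venue, or it is verified."""
    has_working_dedicated = any(
        s.status == "built" and s.health == "ok" and not s.is_aggregator
        for s in venue.sources
    )
    return has_working_dedicated or venue.status == "verified"
```

Because the function reads only other rows, re-running it is always safe. That property is what makes the flag DERIVED rather than FROZEN.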

events — the payload

column	meaning	grade
dedupe_key	`venue|date|name-slug`; re-ingests bump last_seen, never duplicate	LIVE
cost / cost_min_cents / is_free	raw source text kept verbatim + deterministic parse at ingest (fact only — "library events are probably free" is persona-layer inference, never stored)	LIVE
series_id	stamped by series derivation after every ingest	DERIVED
lifecycle	active · past · gone (vanished from its source — likely cancelled). Never deleted	LIVE
visibility	not a column. `is_visible(event) ⇔ source built ∧ venue monitored ∧ group monitored`, computed every read	DERIVED
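Two rows of the table lend themselves to a short sketch: the dedupe key and the derived visibility. The helper names (`slugify`, `upsert_event`) and the dict-based event store are assumptions for illustration; only the key shape and the three-way conjunction come from the table.

```python
import re
from datetime import date
from typing import Dict


def slugify(name: str) -> str:
    """Lowercase, collapse non-alphanumerics to single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def dedupe_key(venue_id: str, day: date, name: str) -> str:
    return f"{venue_id}|{day.isoformat()}|{slugify(name)}"


def upsert_event(events: Dict[str, dict], venue_id: str, day: date,
                 name: str, seen_on: date) -> str:
    """Re-ingesting the same event bumps last_seen instead of duplicating."""
    key = dedupe_key(venue_id, day, name)
    if key in events:
        events[key]["last_seen"] = seen_on
    else:
        events[key] = {"name": name, "first_seen": seen_on, "last_seen": seen_on}
    return key


def is_visible(source_status: str, venue_monitored: bool, group_monitored: bool) -> bool:
    """Computed on every read, never stored."""
    return source_status == "built" and venue_monitored and group_monitored
```

Keeping `is_visible` a function of current rows means a downgraded source hides its events immediately, with no flag to forget to update.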

series — recurring clusters, first-class

column	meaning	grade
id	`venue|normalized-name` (episode/volume numbers stripped, so "Footholds Vol. 7" clusters with "Vol. 8")	DERIVED
cadence	median gap between distinct dates: ≤3d = run (a play — see once) · ~7d weekly · ~14d biweekly · ~28d monthly. Pure date math, no LLM	DERIVED
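Series derivation is pure string and date math, so it fits in a few lines. The sketch below is an assumption-laden illustration: the numbering regex, the boundaries between the ~7d, ~14d and ~28d bands, and the `one-off` and `irregular` labels are all choices made for this example, not the engine's actual values. Only the ≤3d "run" threshold comes from the table.

```python
import re
from datetime import date
from statistics import median
from typing import Iterable

# Strips "#3", "Vol. 7", "Episode 12" and similar (pattern is illustrative).
_NUMBERING = re.compile(r"\s*(?:#\s*\d+|\b(?:vol|volume|ep|episode|part)\.?\s*\d+)", re.I)


def series_id(venue_id: str, name: str) -> str:
    stripped = _NUMBERING.sub("", name)
    slug = re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")
    return f"{venue_id}|{slug}"


def cadence(dates: Iterable[date]) -> str:
    """Classify by the median gap between distinct dates."""
    days = sorted(set(dates))
    if len(days) < 2:
        return "one-off"      # label assumed for this sketch
    gap = median((b - a).days for a, b in zip(days, days[1:]))
    if gap <= 3:
        return "run"          # a play: see it once
    if gap <= 10:             # band edges below are assumptions
        return "weekly"
    if gap <= 21:
        return "biweekly"
    if gap <= 35:
        return "monthly"
    return "irregular"        # label assumed for this sketch
```

Using the median rather than the mean keeps one skipped week from turning a weekly series into a biweekly one.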

event_scores — append-only judgment

One row per (event, scoring run); newest wins on export; history is never deleted (the rows are the expensive asset). Six 1–10 dims + a weighted composite. Series inherit: the scorer rates a series once (recurrence noted in the prompt) and writes that row to every unscored instance — a 14-week knitting circle costs one LLM call. Re-weighting (weights.json) never requires re-scoring.
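The append-only scoring model can be sketched as below. The row shape, the integer `run` counter, and the dimension names in the usage note are hypothetical; the source does not name the six dims or describe the layout of weights.json. The sketch shows three properties from the paragraph above: newest row wins, series scores are copied to unscored instances, and the composite is recomputed from stored dims so re-weighting needs no new LLM call.

```python
from typing import Dict, Iterable, List


def composite(dims: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of the stored dims; cheap to recompute when weights change."""
    total = sum(weights[d] for d in dims)
    return sum(dims[d] * weights[d] for d in dims) / total


def latest_scores(score_rows: Iterable[dict]) -> Dict[str, dict]:
    """Newest run wins per event; older rows stay in the table untouched."""
    newest: Dict[str, dict] = {}
    for row in score_rows:
        current = newest.get(row["event_id"])
        if current is None or row["run"] > current["run"]:
            newest[row["event_id"]] = row
    return newest


def inherit_series_score(score_rows: List[dict], series_dims: Dict[str, float],
                         instance_ids: Iterable[str], run: int) -> None:
    """Append the series' single rating to every instance with no score yet."""
    already_scored = {r["event_id"] for r in score_rows}
    for event_id in instance_ids:
        if event_id not in already_scored:
            score_rows.append({"event_id": event_id, "run": run, "dims": dict(series_dims)})
```

For example, with dims `{"a": 8, "b": 4}` and weights `{"a": 3, "b": 1}` the composite is 7.0; switching to equal weights gives 6.0 from the same stored row.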

The plumbing — operators are the only writes

The durable layer is mutated only through operators.py — idempotent, never drop, no raw SQL, no JSON seeds. The allowlist a role gets is its job description:

operator	does	callers
add_venue_lead / add_group_lead	mint a candidate (fills fields only on create)	Tej
add_source_lead	register an endpoint; known adapter → planned, never built	Tej, Scout
update_venue_fields / update_source_fields	patch attrs post-mint (a passed value deliberately overwrites). update_source_fields refuses adapter/url on a built source — that would falsify the build	any
probe_source_live	THE BUILD GATE — run the real adapter, classify ok/no_events/http_error/parse_error. Pure, writes nothing	Max, Steward
mark_source_built	flip to built, gated — always re-runs the probe and stamps health honestly (built-but-failing is visible, never silent)	Max; Steward re-stamps
build_source	atomic happy path: gate + coverage + verify_target (iff health=ok) + decide_monitored	Max
set_source_status	the honest downgrade (NO-GO at triage; rot at maintenance) — reason goes in notes	Scout, Max, Steward
connect_coverage	link source ⇄ target (idempotent on the triple)	Tej, Max
verify_target	candidate → verified; only fired on proof (a working source)	via build_source
decide_monitored	owns venue.monitored: built+ok dedicated source, or verified	via build_source; Steward
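The build gate in the table above has two halves: a pure probe and a gated write. A minimal sketch follows. The adapter is modelled as a plain callable, and the mapping of `ConnectionError` to http_error and `ValueError` to parse_error is an assumption for this example; the real adapters may signal failures differently.

```python
from typing import Callable, List, Tuple

Adapter = Callable[[str], List[dict]]  # hypothetical adapter signature


def probe_source_live(adapter: Adapter, url: str) -> Tuple[str, str]:
    """Run the real adapter and classify the result. Pure: writes nothing."""
    try:
        events = adapter(url)
    except ConnectionError as exc:   # assumed signal for transport failures
        return "http_error", str(exc)
    except ValueError as exc:        # assumed signal for unparseable payloads
        return "parse_error", str(exc)
    if not events:
        return "no_events", "adapter returned zero events"
    return "ok", f"{len(events)} events"


def mark_source_built(source: dict, adapter: Adapter) -> dict:
    """Flip to built, always re-running the probe and stamping health honestly."""
    health, detail = probe_source_live(adapter, source["url"])
    source.update(status="built", health=health, health_detail=detail)
    return source
```

Note that a failing probe still yields `status="built"` with a non-ok health: built-but-failing is recorded and visible, and it is the monitored derivation that keeps such a source from surfacing events.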

A lead's life (the Empty Bottle worked example — real, 2026-06-09)

1  TEJ    add_venue_lead("Empty Bottle", …)  +  add_source_lead(platform_guess="squarespace")
          └ guessed wrong — and that's fine, because:
2  SCOUT  reads the site: the Squarespace is an empty shell; the real feed is
          Ticketmaster Discovery v2 (96 events). Corrects the source row, writes
          docs/sources/empty-bottle-….md, verdict GO (status=planned).
          (Same pass: Rainbo Club → honest NO-GO, "15 events ever, dead since 2024".)
3  MAX    ports scrape_ticketmaster into scrapers.py, smoke-tests live,
          build_source(…) → gate passes (77 valid) → built/ok, venue verified+monitored
4  INGEST ingest.py --source empty-bottle-… → 65 Event rows (cost parsed: $16.17, free, …)
          → series derived → "RATER HANDOFF: N unscored" printed
5  RATER  scores the new events (announced first — it spends money); series inherit
6  EXPORT export.py → data.json → the map & list you're using
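Step 4's cost parse is deterministic, which makes it easy to illustrate. The sketch below is not the parser in ingest.py: requiring a literal `$` before a number and the "lowest price wins" rule are assumptions chosen for the example. Requiring the `$` is one simple way to keep text like "doors at 8" from being read as a price.

```python
import re
from typing import Optional, Tuple

_PRICE = re.compile(r"\$\s*(\d+)(?:\.(\d{1,2}))?")
_FREE = re.compile(r"\bfree\b", re.I)


def parse_cost(raw: str) -> Tuple[Optional[int], Optional[bool]]:
    """Return (cost_min_cents, is_free); (None, None) means no price fact stated."""
    cents = []
    for dollars, frac in _PRICE.findall(raw):
        # Integer arithmetic avoids float rounding on values like 16.17.
        cents.append(int(dollars) * 100 + (int(frac.ljust(2, "0")) if frac else 0))
    if cents:
        low = min(cents)
        return low, low == 0
    if _FREE.search(raw):
        return 0, True
    return None, None
```

The raw text would still be stored verbatim alongside the parsed values, so a parser fix can be replayed over old rows.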

The roles — a staff, not a script

A role = a charter (.claude/agents/<name>.md) + a scoped operator allowlist. The scoping is the safety: an AI employee's worst failure mode is confidently fabricating state, and here the fabrications are structurally impossible — Tej can't fake a built pipeline, Scout can't build, built is earned from a live probe, not a claim.
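The scoping idea can be sketched as a dispatcher that checks an allowlist before it touches an operator. The allowlists below are read off the callers column of the operator table and are deliberately partial. The `call_operator` function, the `registry` dict, and the exception name are hypothetical.

```python
from typing import Any, Callable, Dict, Set

# Partial allowlists, taken from the "callers" column of the operator table.
ALLOWLIST: Dict[str, Set[str]] = {
    "tej": {"add_venue_lead", "add_group_lead", "add_source_lead", "connect_coverage"},
    "scout": {"add_source_lead", "set_source_status"},
    "max": {"probe_source_live", "mark_source_built", "build_source",
            "set_source_status", "connect_coverage"},
}


class OperatorNotAllowed(PermissionError):
    """Raised when a role reaches for an operator outside its charter."""


def call_operator(role: str, operator: str,
                  registry: Dict[str, Callable[..., Any]],
                  *args: Any, **kwargs: Any) -> Any:
    if operator not in ALLOWLIST.get(role, set()):
        raise OperatorNotAllowed(f"{role} may not call {operator}")
    return registry[operator](*args, **kwargs)
```

With this shape, "Scout can't build" is enforced by a membership check rather than by trusting the agent's prompt.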

Tej — Head of Leads

Turns knowledge of the city into candidate rows. Industrial move: mining aggregator directories (Do312's venue index alone seeded 150 venues + 151 ready sources in one run).

operators: add_*_lead · connect_coverage

Scout — Source Researcher

The triage gate. Reads the site, sniffs the platform, writes the source doc, says GO or an honest NO-GO with the reason. Most sites aren't worth scraping; saying so is the deliverable.

operators: add_source_lead (→planned) · set_source_status

Max — Head of Scraping

Writes/points adapters, runs the live gate, flips sources to built. The only role that builds — and bounded: scrapes are hard-capped, long jobs stay in the main loop.

operators: mark_source_built · build_source · verify_target · connect_coverage

Rater — Head of Rating

Scores events (six universal dims), sanity-checks the ordering, tunes weights.json. Spends money, so runs are announced. Scores series once; instances inherit.

owns: event_scores · data/weights.json

Steward — Maintenance

The role that keeps things TRUE. Re-probes every built source, re-derives drifted flags, downgrades honestly. Report-first; --fix applies via operators. Born after an audit found 34/45 monitored venues drifted.

tool: steward.py · operators: probe_source_live · mark_source_built (re-stamp) · set_source_status · decide_monitored
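The Steward's report-first behaviour for flag drift might look like the sketch below. It is an illustration only: venues are plain dicts, `derive` is any function that recomputes the flag from other rows, and the in-place assignment stands in for what would really be a decide_monitored operator call.

```python
from typing import Callable, Dict, List, Tuple

DriftRow = Tuple[str, bool, bool]  # (venue name, stored flag, derived flag)


def steward_pass(venues: List[Dict], derive: Callable[[Dict], bool],
                 fix: bool = False) -> List[DriftRow]:
    """Report every venue whose stored flag disagrees with the derivation.

    With fix=False nothing is written; with fix=True the flag is corrected.
    """
    report: List[DriftRow] = []
    for venue in venues:
        expected = derive(venue)
        if venue["monitored"] != expected:
            report.append((venue["name"], venue["monitored"], expected))
            if fix:
                venue["monitored"] = expected  # stand-in for decide_monitored
    return report
```

Running it once without `fix` gives the audit; running it again with `fix=True` applies exactly the rows that were reported.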

The missing manager is deliberately thin: leads_inbox (a SQL view of every un-built row) is the work queue, and coverage-per-hood is the KPI. Dispatch is a query, not an agent — for now.

Keeping it true

Test suites — tests.py in the engine (in-memory DB, stub adapters exercising the real gate) and in thirdplace. Zero deps, no network. They caught three real bugs on their first runs (a built-but-failing source still monitored its venue; "#3" never clustering; "doors at 8" parsing as $8).
The Steward — probe drift + flag drift on a schedule.
The recall harness (thirdplace/evalrecall.py) — the denominator for "almost all events": manually sweep one neighborhood for one weekend, freeze the list, measure what the pipeline caught and which channel the misses live on (expected answer: Instagram — the next frontier).
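The recall measurement reduces to a set comparison against the frozen list. The sketch below assumes the ground truth is a list of (event key, channel) pairs and that the pipeline's catches are a set of keys. Both shapes are hypothetical and not taken from thirdplace/evalrecall.py.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def recall_report(ground_truth: List[Tuple[str, str]],
                  caught_keys: Set[str]) -> Tuple[float, Counter]:
    """Return (recall, misses per channel) for one frozen sweep."""
    if not ground_truth:
        return 0.0, Counter()
    caught = sum(1 for key, _ in ground_truth if key in caught_keys)
    misses = Counter(channel for key, channel in ground_truth if key not in caught_keys)
    return caught / len(ground_truth), misses
```

The per-channel miss count is the useful half: it says where the next adapter should go, not just how far short the pipeline fell.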

Repo pointers (the deeper layer)

reality/README.md — orientation; docs/vision.md — the thesis
event_wizard_chicago/CLAUDE.md — the model + run commands
docs/roles.md (role charters and their context) + docs/leads-pipeline.md (the operator contract and the flag audit)
docs/sources/<id>.md — one page per source: platform, endpoint, gotchas, verdicts, build log
docs/plans/2026-06-09-engine-v3-spec.md — the rebuild spec; docs/research/2026-06-09-e2e-pipeline-audit.md — the live audit that drove it