KEGG Assignment Pipeline

Assigning KEGG pathway IDs to ~49k pathway images via Gemini 3 Pro analysis of captions and PubMed metadata

Pipeline Overview

Signal 1: Jaccard → Signal 2: Caption LLM → Signal 3: Metadata LLM → Output: Final CSVs

Total images	48,779
Images with captions	34,062
Papers with metadata	42,362
KEGG pathways	239
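Signal 1 scores each image against each KEGG pathway by gene overlap. A minimal sketch of that scoring, assuming each image and each pathway is represented as a set of Entrez gene IDs; the function names, pathway keys, and toy gene sets below are hypothetical and not taken from the pipeline's scripts:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_kegg_match(image_genes: set, pathway_genes: dict):
    """Return (pathway_key, score) for the best-scoring pathway, or (None, 0.0)."""
    best_key, best_score = None, 0.0
    for key, genes in pathway_genes.items():
        score = jaccard(image_genes, genes)
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


# Toy data: gene IDs and pathway membership are invented for illustration.
toy_pathways = {"pathway_A": {1, 2, 3, 4}, "pathway_B": {3, 4, 5}}
match = best_kegg_match({3, 4, 5, 6}, toy_pathways)
```

An image whose gene set overlaps no pathway keeps a score of 0, which is the case excluded by the "Jaccard score > 0" coverage figure.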

Execution Plan

Run 1: Caption Assignment (Signal 2)

Send each figure caption, together with the list of 239 KEGG pathways, to Gemini 3 Pro. This is a figure-level signal and therefore the strongest basis for assignment.

LLM calls	34,062 (one per captioned image)
Model	Gemini 3 Pro (Vertex AI)
Throughput	~120–165/min sustained
Est. time	~4.7 hours at 120/min (~3.4 hours at 165/min)
Cost	Free (GCP credits)
Output	`intermediate/caption_assignments.json`

caffeinate -s python scripts/run_caption_assignment_gemini.py --full

Run 2: Metadata Assignment (Signal 3)

Send each paper's title, abstract, MeSH terms, and keywords, together with the list of 239 KEGG pathways, to Gemini 3 Pro. This is a paper-level signal, run for all papers and not just those whose figures lack captions. Papers are deduplicated by PMC ID.

LLM calls	42,362 (one per unique paper)
Model	Gemini 3 Pro (Vertex AI)
Throughput	~120–165/min sustained
Est. time	~5.9 hours at 120/min (~4.3 hours at 165/min)
Cost	Free (GCP credits)
Output	`intermediate/metadata_assignments.json`

caffeinate -s python scripts/run_metadata_assignment_gemini.py --full

Metadata assignment is paper-level: one paper may have multiple figures showing different pathways. Results fan out to all images from that paper. Running for all papers (not just captionless) gives an independent validation signal for captioned images too.
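The fan-out step can be sketched as follows. This assumes image filenames follow the `<PMC ID>__<figure>.jpg` pattern seen in the sample response (`PMC3056141__F1.jpg`); the function name and dictionary shapes are hypothetical:

```python
def fan_out(paper_assignments: dict, image_filenames: list) -> dict:
    """Copy each paper-level assignment to every image from that paper."""
    per_image = {}
    for filename in image_filenames:
        # Assumed naming convention: everything before "__" is the PMC ID.
        pmc_id = filename.split("__", 1)[0]
        if pmc_id in paper_assignments:
            per_image[filename] = paper_assignments[pmc_id]
    return per_image


paper_level = {"PMC3056141": {"kegg_id": "hsa04620", "confidence": 0.8}}
images = ["PMC3056141__F1.jpg", "PMC3056141__F2.jpg", "PMC0000000__F1.jpg"]
image_level = fan_out(paper_level, images)
```

Images whose paper has no metadata assignment are simply absent from the result, so the merge step can leave their metadata columns empty.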

Run 3: Build Final CSVs

Merge all signals into three deliverable CSVs. No API calls.

python scripts/build_edges.py && python scripts/build_nodes.py && python scripts/build_metadata.py

Output	`output/edges.csv`, `output/nodes.csv`, `output/metadata.csv`

Signal Coverage After Both Runs

Signal	Coverage	Images
Gemini annotation	100%	48,779
Metadata LLM (Signal 3)	99.9%	48,738
Jaccard score > 0	97.4%	47,526
Caption LLM (Signal 2)	69.8%	34,062

After both runs, 34,039 images have both caption and metadata signals, which allows cross-validation. 14,699 captionless images have metadata only, and 23 captioned images have a caption signal but no metadata (34,062 − 34,039). Only 18 images (~0.04%) have neither a caption nor PubMed metadata; these rely on the Jaccard score and Gemini label alone.
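The breakdown above can be checked with a few lines of arithmetic. The inputs are the counts reported in this document; the variable names are illustrative only:

```python
total_images = 48_779
captioned = 34_062        # images with a caption (Signal 2)
with_metadata = 48_738    # images whose paper has PubMed metadata (Signal 3)
both = 34_039             # images with caption and metadata

metadata_only = with_metadata - both   # captionless images with metadata
caption_only = captioned - both        # captioned images without metadata
neither = total_images - both - metadata_only - caption_only
neither_pct = round(100 * neither / total_images, 2)
```

The four groups partition the 48,779 images, so any change to one count should be reflected in the others.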

Model Comparison & Throughput Testing

Four models were tested on the caption assignment task (Gemini 2.0 Flash for throughput only). Gemini 3 Pro is the clear winner: best quality, free via credits, and fast enough with the native async client.

Model	RPM	Time for 34k calls	Cost	Quality	Reasoning field	Status
Gemini 3 Pro	120–165	~4.7 hrs	Free	Baseline (best)	Yes	Selected
GPT-4o-mini	~200	~3 hrs	~$20	70% agree w/ Pro	Yes	Backup
Gemini 2.0 Flash	716	~47 min	Free	Untested	Untested	Not used
Gemini 2.5 Flash	68	~8.3 hrs	Free	65% agree w/ Pro	Drops field	Rejected
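The time estimates in the table above follow directly from the measured requests per minute. A small sketch of that arithmetic, where `hours_for` is a hypothetical helper and the Gemini 3 Pro figure uses the lower bound of its 120–165 RPM range:

```python
CAPTION_CALLS = 34_062  # one call per captioned image


def hours_for(calls: int, rpm: float) -> float:
    """Wall-clock hours to make `calls` requests at a sustained `rpm`."""
    return calls / rpm / 60


measured_rpm = {
    "Gemini 3 Pro": 120,
    "GPT-4o-mini": 200,
    "Gemini 2.0 Flash": 716,
    "Gemini 2.5 Flash": 68,
}
estimates = {model: hours_for(CAPTION_CALLS, rpm) for model, rpm in measured_rpm.items()}
```

The same helper applied to the 42,362 metadata calls at 120 RPM gives the ~5.9 hour estimate for Run 2.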

Key finding: Gemini 3 Pro was initially measured at 30 RPM using a sync client. Switching to the native async client (`client.aio.models.generate_content`) with a concurrency of 200 yields 120–165 RPM, matching the throughput of the original 49k-image annotation run. The bottleneck was the test harness, not the model.
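The concurrency pattern behind that speed-up can be sketched with a semaphore that caps the number of in-flight requests. This is an illustration under assumptions, not the scripts' actual code: `run_bounded` and `fake_call` are hypothetical names, and `fake_call` stands in for an awaited `client.aio.models.generate_content(...)` request so the example runs without credentials:

```python
import asyncio


async def run_bounded(items, call, concurrency=200):
    """Await `call(item)` for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with semaphore:  # slot is freed as soon as this call finishes
            return await call(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(worker(item) for item in items))


async def fake_call(item):
    # Placeholder for the real model request; sleeps to simulate network latency.
    await asyncio.sleep(0.01)
    return item * 2


results = asyncio.run(run_bounded(range(10), fake_call, concurrency=3))
```

With a sync client each request waits for the previous one, so throughput is bounded by per-request latency; with this pattern it is bounded by the concurrency limit and the service quota instead.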

Script Robustness

Both assignment scripts are built for unattended overnight runs:

Semaphore per-call	Released between retries — stuck calls don't block other work
Atomic checkpoints	Write to .tmp then `os.replace()` — crash-safe
Batched processing	500 images per batch, checkpoint between batches
Graceful shutdown	Ctrl-C saves progress, resume with `--resume`
Circuit breaker	Pauses 2 min if >50% error rate over 50 calls
Task deadline	120s max per image (all retries) — no infinite hangs
Resume	`--resume` skips already-processed items, handles corrupted files
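Two of the mechanisms in the table, atomic checkpoints and the circuit breaker, can be sketched as below. The helper names, the JSON checkpoint layout, and the rolling-window implementation are assumptions made for illustration; only the `.tmp` plus `os.replace()` pattern and the 50-call / 50% thresholds come from the table:

```python
import json
import os
from collections import deque


def save_checkpoint(path, results):
    """Write to a sibling .tmp file, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(results, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the swap
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return saved results, or {} if the file is missing or corrupted."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):  # JSONDecodeError is a ValueError
        return {}


class CircuitBreaker:
    """Trips when more than `threshold` of the last `window` calls failed."""

    def __init__(self, window=50, threshold=0.5):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def tripped(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough history yet
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        return failure_rate > self.threshold

    def reset(self):
        self.outcomes.clear()
```

In a run loop, the caller would check `tripped()` after each batch of recorded outcomes, sleep for the 2-minute pause, and then call `reset()` before continuing. Because `os.replace()` swaps the file in a single step, a crash during a write leaves the previous checkpoint intact.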

Prompt 1: Caption → KEGG Assignment

Sent for each of the 34,062 images with a figure caption. The {caption} placeholder is replaced with the actual caption text (up to 1000 chars). The {kegg_list} placeholder is replaced with all 239 KEGG pathways.

You are a biomedical pathway expert. Given a figure caption from a scientific paper, identify which KEGG human signaling/metabolic pathway the figure most specifically depicts.

FIGURE CAPTION: {caption}

KEGG HUMAN PATHWAYS: {kegg_list} ← 239 pathways, shown below

INSTRUCTIONS:
1. Read the caption carefully. Identify the specific biological pathway or process shown.
2. Use your biomedical expertise to identify the KEGG pathway even if the exact pathway name is not stated in the caption. For example: if the caption mentions "metformin," recognize it acts through AMPK signaling (hsa04152). If it mentions "PERK/eIF2alpha/ATF4/CHOP," recognize this as the UPR branch within "Protein processing in endoplasmic reticulum" (hsa04141). Map from biological knowledge, not just keyword matching.
3. Match it to the single most specific KEGG pathway from the list above. Read the FULL list carefully — KEGG pathway names are sometimes non-obvious (e.g., "Protein processing in endoplasmic reticulum" covers the Unfolded Protein Response; "Glycerophospholipid metabolism" covers choline/phosphatidylcholine pathways).
4. Only respond with "none" if NONE of the biological processes in the caption can be mapped to ANY pathway in the KEGG list. If the caption describes a disease mechanism or drug effect that operates THROUGH a specific KEGG pathway, assign that pathway. Reserve "none" for: purely methodological figures, generic network visualizations, or captions too vague to identify any specific pathway.
5. Prefer specific pathways over broad ones. For example, if the caption clearly describes Wnt signaling, choose "Wnt signaling pathway" not "Pathways in cancer".
6. If the caption describes crosstalk or interaction between two specific pathways, choose the pathway that is the PRIMARY subject of the figure (usually the one in the title or first mentioned). If truly equal, choose the more specific of the two.

Respond in JSON only:
{"kegg_id": "hsaXXXXX or none", "kegg_name": "pathway name or none", "confidence": 0.0-1.0, "reasoning": "one sentence explaining your choice"}

CONFIDENCE SCALE (use strictly):
- 1.0: Caption explicitly names the KEGG pathway or its canonical synonym
- 0.8: Caption names key pathway components (>2 specific genes/proteins from the pathway)
- 0.6: Caption implies the pathway through drug/disease/mechanism inference
- 0.4: Caption is ambiguous between 2-3 possible KEGG pathways
- 0.2: Weak/indirect evidence only
- 0.0: No pathway information in caption (returning "none")
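Filling the two placeholders might look like the sketch below. The template here is an abbreviated stand-in for the full prompt, and the `ID: name` line format for `{kegg_list}` is an assumption, since the exact rendering of the list is not shown. Note that `str.format` would fail on this prompt because the JSON response schema contains literal braces, so the sketch uses `str.replace`:

```python
MAX_CAPTION_CHARS = 1000  # caption length limit stated for Prompt 1

# Abbreviated stand-in for the full Prompt 1 text.
TEMPLATE = (
    "FIGURE CAPTION: {caption}\n"
    "KEGG HUMAN PATHWAYS: {kegg_list}\n"
    'Respond in JSON only: {"kegg_id": "hsaXXXXX or none"}'
)


def build_caption_prompt(caption, pathways, template=TEMPLATE):
    """Fill {kegg_list} and {caption}; the caption is truncated to 1000 chars."""
    kegg_list = "\n".join(f"{kegg_id}: {name}" for kegg_id, name in pathways.items())
    # Insert the list first so placeholder-like text inside a caption is never expanded.
    prompt = template.replace("{kegg_list}", kegg_list)
    return prompt.replace("{caption}", caption[:MAX_CAPTION_CHARS])


example_pathways = {
    "hsa04152": "AMPK signaling pathway",
    "hsa04141": "Protein processing in endoplasmic reticulum",
}
example_prompt = build_caption_prompt("x" * 1500, example_pathways)
```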

Prompt 2: PubMed Metadata → KEGG Assignment

Sent for each of the 42,362 unique papers with PubMed metadata. Results fan out to all images from that paper. The abstract is truncated to 2,000 characters, and MeSH descriptors are comma-joined.

You are a biomedical pathway expert. A scientific paper contains pathway diagram figures, but the figure captions are not available. Using the paper's metadata below, identify which KEGG human signaling/metabolic pathway the paper most likely depicts in its figures.

PAPER TITLE: {title}

ABSTRACT (truncated): {abstract}

MeSH TERMS: {mesh_terms}

AUTHOR KEYWORDS: {keywords}

KEGG HUMAN PATHWAYS: {kegg_list} ← same 239 pathways

INSTRUCTIONS:
1. This is paper-level metadata, not figure-level. The paper may discuss multiple pathways.
2. Use your biomedical expertise to identify the KEGG pathway even if not named verbatim.
3. Identify the single most prominent pathway discussed in this paper.
4. Only respond with "none" if the paper doesn't focus on any specific KEGG pathway.
5. Prefer specific pathways over broad ones.
6. Note: this assignment is inherently lower confidence than caption-based assignment since we're matching paper-level info to figure-level pathways.

Respond in JSON only:
{"kegg_id": "hsaXXXXX or none", "kegg_name": "pathway name or none", "confidence": 0.0-1.0, "reasoning": "one sentence explaining your choice"}

CONFIDENCE SCALE (use strictly):
- 1.0: Paper title explicitly names the KEGG pathway
- 0.8: Abstract describes key pathway components in detail
- 0.6: MeSH terms or keywords strongly suggest a specific pathway
- 0.4: Multiple pathways discussed, one slightly more prominent
- 0.2: Weak/indirect evidence only
- 0.0: No pathway information (returning "none")
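Preparing the four metadata placeholders for one paper could be sketched as follows. The record keys (`title`, `abstract`, `mesh_terms`, `keywords`) are hypothetical, since the schema of `pubmed_metadata_full.json` is not shown here, and joining keywords with commas is an assumption that mirrors the stated handling of MeSH descriptors:

```python
MAX_ABSTRACT_CHARS = 2000  # abstract length limit stated for Prompt 2


def metadata_fields(paper: dict) -> dict:
    """Map one paper record to the text placeholders of Prompt 2."""
    return {
        "title": paper.get("title") or "",
        "abstract": (paper.get("abstract") or "")[:MAX_ABSTRACT_CHARS],
        "mesh_terms": ", ".join(paper.get("mesh_terms") or []),
        "keywords": ", ".join(paper.get("keywords") or []),  # assumed format
    }


example_fields = metadata_fields({
    "title": "TLR9 signaling in innate immunity",
    "abstract": "a" * 2500,
    "mesh_terms": ["Toll-Like Receptor 9", "Immunity, Innate"],
    "keywords": None,  # missing fields fall back to empty strings
})
```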

KEGG Reference List (fed to both prompts)

The full list of 239 human KEGG pathways that both prompts receive. This is the menu of options the LLM picks from.

Expected Output

Each LLM call returns one JSON object. Caption assignments are keyed by filename (image-level). Metadata assignments are keyed by PMC ID (paper-level) and are fanned out to all of the paper's images by `build_metadata.py`.

Per-image LLM response

"pmc_id": "PMC3056141", "filename": "PMC3056141__F1.jpg", "kegg_id": "hsa04620", "kegg_name": "Toll-like receptor signaling pathway", "confidence": 1.0, "reasoning": "The caption explicitly describes TLR9-dependent innate immune signaling, which is a key component of the Toll-like receptor signaling pathway."

Final `metadata.csv` columns

One row per image. The main deliverable — all three signals merged:

Column	Description
pmc_id	PMC article ID
filename	image filename
figure_number	extracted figure number (F1, F2, ...)
gemini_label	Gemini's free-form pathway name
gemini_related	related pathways (pipe-delimited)
n_genes	number of Entrez genes extracted
Signal 1: Jaccard
jaccard_kegg_id	best KEGG match by gene overlap
jaccard_kegg_name	pathway name
jaccard_score	Jaccard similarity (0-1)
Signal 2: Caption LLM
caption_kegg_id	KEGG ID from caption analysis
caption_kegg_name	pathway name
caption_confidence	0.0-1.0 (strict scale)
caption_reasoning	one-sentence explanation
Signal 3: Metadata LLM
metadata_kegg_id	KEGG ID from paper metadata
metadata_kegg_name	pathway name
metadata_confidence	0.0-1.0 (strict scale)
metadata_reasoning	one-sentence explanation
year	publication year
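Assembling the caption and metadata columns of one row might look like the sketch below. It is not the code in `build_metadata.py`: the helper names are hypothetical, the filename parsing assumes the `<PMC ID>__<figure>.jpg` pattern from the sample response, and the Gemini, Jaccard, and year columns are omitted for brevity:

```python
SIGNAL_FIELDS = ("kegg_id", "kegg_name", "confidence", "reasoning")


def split_filename(filename: str):
    """Split 'PMC3056141__F1.jpg' into ('PMC3056141', 'F1')."""
    stem = filename.rsplit(".", 1)[0]
    pmc_id, _, figure = stem.partition("__")
    return pmc_id, figure


def build_row(filename: str, caption_by_file: dict, metadata_by_pmc: dict) -> dict:
    """Merge the image-level and paper-level signals into one flat row."""
    pmc_id, figure = split_filename(filename)
    row = {"pmc_id": pmc_id, "filename": filename, "figure_number": figure}
    sources = {
        "caption": caption_by_file.get(filename),  # keyed by image filename
        "metadata": metadata_by_pmc.get(pmc_id),   # keyed by PMC ID
    }
    for prefix, source in sources.items():
        for field in SIGNAL_FIELDS:
            # A missing signal leaves its columns empty rather than dropping the row.
            row[f"{prefix}_{field}"] = (source or {}).get(field, "")
    return row


example_row = build_row(
    "PMC3056141__F1.jpg",
    {"PMC3056141__F1.jpg": {"kegg_id": "hsa04620", "confidence": 1.0}},
    {},
)
```

Rows built this way share one key set, so they can be written with `csv.DictWriter` using a fixed column order.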

File Locations

File	Location	Records
Gemini annotations	`~/work/gemini_cell_pathways/gemini3_annotations/full_49k_v2_2026-01-26_clean.jsonl`	48,974
Figure captions	`~/work/gemini_cell_pathways/data/pmc_caption_pathways_full.json`	34,062
PubMed metadata	`~/work/gemini_cell_pathways/data/pubmed_metadata_full.json`	42,362
KEGG pathways	`~/work/gemini_cell_pathways/data/kegg_human_pathways.json`	239
Jaccard scores	`~/work/gemini_cell_pathways/data/image_pathway_scores.csv`	48,779
Generated by this pipeline
Caption assignments	`~/work/gemini-pathways-final/intermediate/caption_assignments.json`	34,062
Metadata assignments	`~/work/gemini-pathways-final/intermediate/metadata_assignments.json`	42,362
Final edges	`~/work/gemini-pathways-final/output/edges.csv`	—
Final nodes	`~/work/gemini-pathways-final/output/nodes.csv`	—
Final metadata	`~/work/gemini-pathways-final/output/metadata.csv`	~48,779