KEGG Assignment Pipeline

Assigning KEGG pathway IDs to ~49k pathway images via Gemini 3 Pro analysis of captions and PubMed metadata

Pipeline Overview

Signal 1: Jaccard
Signal 2: Caption LLM
Signal 3: Metadata LLM
Output: Final CSVs

Total images: 48,779
Have captions: 34,062
Papers w/ metadata: 42,362
KEGG pathways: 239

Execution Plan

Run 1: Caption Assignment (Signal 2)

Send each figure caption + 239 KEGG pathways to Gemini 3 Pro. Figure-level signal — strongest assignment.

LLM calls: 34,062 (one per captioned image)
Model: Gemini 3 Pro (Vertex AI)
Throughput: ~120–165/min sustained
Est. time: ~4.7 hours (at ~120/min, the low end of the range)
Cost: Free (GCP credits)
Output: intermediate/caption_assignments.json
caffeinate -s python scripts/run_caption_assignment_gemini.py --full

Run 2: Metadata Assignment (Signal 3)

Send paper title + abstract + MeSH + keywords + 239 KEGG pathways to Gemini 3 Pro. Paper-level signal — run for all papers, not just captionless. Deduped by PMC ID.

LLM calls: 42,362 (one per unique paper)
Model: Gemini 3 Pro (Vertex AI)
Throughput: ~120–165/min sustained
Est. time: ~5.9 hours (at ~120/min, the low end of the range)
Cost: Free (GCP credits)
Output: intermediate/metadata_assignments.json
caffeinate -s python scripts/run_metadata_assignment_gemini.py --full
Metadata assignment is paper-level: one paper may have multiple figures showing different pathways. Results fan out to all images from that paper. Running for all papers (not just captionless) gives an independent validation signal for captioned images too.
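The fan-out step can be sketched as follows. This is a minimal illustration, not the code in build_metadata.py: it assumes image filenames follow the `<PMC ID>__<figure>.jpg` pattern seen in the example output (`PMC3056141__F1.jpg`), and the function names and the second PMC ID are hypothetical.

```python
def pmc_id_from_filename(filename: str) -> str:
    """Assumed convention: '<PMC ID>__<figure>.jpg', e.g. 'PMC3056141__F1.jpg'."""
    return filename.split("__", 1)[0]


def fan_out(paper_assignments: dict, filenames: list) -> dict:
    """Copy each paper-level assignment onto every image from that paper."""
    per_image = {}
    for name in filenames:
        assignment = paper_assignments.get(pmc_id_from_filename(name))
        if assignment is not None:  # images whose paper has no assignment are skipped
            per_image[name] = assignment
    return per_image


# Toy data: one paper with an assignment, one (hypothetical) paper without.
paper_assignments = {"PMC3056141": {"kegg_id": "hsa04620", "confidence": 0.8}}
filenames = ["PMC3056141__F1.jpg", "PMC3056141__F2.jpg", "PMC0000000__F1.jpg"]
per_image = fan_out(paper_assignments, filenames)
print(sorted(per_image))
```

Both figures from the same paper receive the identical assignment, which is why the metadata signal is weaker per figure than the caption signal.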

Run 3: Build Final CSVs

Merge all signals into three deliverable CSVs. No API calls.

python scripts/build_edges.py && python scripts/build_nodes.py && python scripts/build_metadata.py
Output: output/edges.csv, output/nodes.csv, output/metadata.csv

Signal Coverage After Both Runs

Gemini annotation: 100% (48,779)
Metadata LLM (Signal 3): 99.9% (48,738)
Jaccard score > 0: 97.4% (47,526)
Caption LLM (Signal 2): 69.8% (34,062)

After both runs: 34,039 images get both caption AND metadata signals (cross-validation). 14,699 captionless images get metadata only. Only 18 images (~0.04%) have no caption and no PubMed metadata — these get Jaccard + Gemini label only.
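The counts above are mutually consistent, which a few lines of inclusion-exclusion arithmetic confirm. The variable names are illustrative; the only derived figure not stated in the text is the number of captioned images whose paper lacks metadata.

```python
total_images = 48_779
with_caption = 34_062
with_metadata = 48_738
with_both = 34_039

# Captionless images that still receive the metadata signal.
metadata_only = with_metadata - with_both
# Captioned images whose paper has no PubMed metadata (derived, not stated above).
caption_only = with_caption - with_both
# Inclusion-exclusion: images covered by neither LLM signal.
with_neither = total_images - (with_caption + with_metadata - with_both)

print(metadata_only, caption_only, with_neither)  # 14699 23 18
print(f"{with_neither / total_images:.2%}")       # 0.04%
```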

Model Comparison & Throughput Testing

Tested 4 models on the caption assignment task. Gemini 3 Pro is the clear winner: best quality, free via credits, and fast enough with the native async client.

Model | RPM | 34k time | Cost | Quality | Reasoning | Status
Gemini 3 Pro | 120–165 | ~4.7 hrs | Free | Baseline (best) | Yes | Selected
GPT-4o-mini | ~200 | ~3 hrs | ~$20 | 70% agree w/ Pro | Yes | Backup
Gemini 2.0 Flash | 716 | ~47 min | Free | Untested | Untested | Not used
Gemini 2.5 Flash | 68 | ~8.3 hrs | Free | 65% agree w/ Pro | Drops field | Rejected
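The "34k time" figures follow directly from call count divided by requests per minute. A short check, using only numbers from this document (the helper name is illustrative), shows that the quoted Gemini 3 Pro estimates correspond to the 120 RPM end of the measured range:

```python
def hours_to_finish(calls: int, rpm: float) -> float:
    """Wall-clock hours for `calls` requests at a sustained `rpm`."""
    return calls / rpm / 60


CAPTION_CALLS = 34_062
METADATA_CALLS = 42_362

for label, calls in (("caption", CAPTION_CALLS), ("metadata", METADATA_CALLS)):
    fast = hours_to_finish(calls, 165)
    slow = hours_to_finish(calls, 120)
    print(f"{label}: {fast:.1f} to {slow:.1f} hours")
# caption: 3.4 to 4.7 hours
# metadata: 4.3 to 5.9 hours
```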
Key finding: Gemini 3 Pro was initially measured at 30 RPM using a sync client. Switching to the native async client (client.aio.models.generate_content) with concurrency 200 yields 120–165 RPM — matching the original 49k image annotation throughput. The bottleneck was the test harness, not the model.
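The concurrency pattern can be sketched with plain asyncio. This is a simplified stand-in, not the project's script: `call` is any coroutine function (in the real pipeline it would wrap client.aio.models.generate_content), `fake_call` is a hypothetical placeholder that makes no API request, and the retry count and backoff values are assumptions. The concurrency of 200 and the 120-second per-item deadline come from this document.

```python
import asyncio


async def call_with_retries(call, item, semaphore, retries=3, backoff_s=1.0, deadline_s=120.0):
    """Run call(item) with retries, all under one overall deadline."""

    async def attempts():
        for attempt in range(retries):
            try:
                async with semaphore:  # slot is held only while a request is in flight
                    return await call(item)
            except Exception:
                if attempt == retries - 1:
                    raise
            # Back off outside the semaphore so waiting tasks can use the slot.
            await asyncio.sleep(backoff_s * 2 ** attempt)

    return await asyncio.wait_for(attempts(), timeout=deadline_s)


async def run_all(call, items, concurrency=200):
    """Process all items with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [call_with_retries(call, item, semaphore) for item in items]
    # return_exceptions=True keeps one failed item from cancelling the rest.
    return await asyncio.gather(*tasks, return_exceptions=True)


async def fake_call(item):
    """Hypothetical stand-in for the model request."""
    await asyncio.sleep(0.01)
    return {"item": item, "kegg_id": "none"}


results = asyncio.run(run_all(fake_call, range(1000)))
print(len(results))
```

With a synchronous client each request blocks the next, so throughput is bounded by single-request latency; with the semaphore, up to 200 requests overlap.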

Script Robustness

Both assignment scripts are built for unattended overnight runs:

Semaphore per call: Released between retries, so stuck calls don't block other work
Atomic checkpoints: Write to .tmp, then os.replace(); crash-safe
Batched processing: 500 images per batch, checkpoint between batches
Graceful shutdown: Ctrl-C saves progress; resume with --resume
Circuit breaker: Pauses 2 min if >50% error rate over 50 calls
Task deadline: 120s max per image (all retries), so no infinite hangs
Resume: --resume skips already-processed items and handles corrupted files
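Two of the items above, atomic checkpoints with resume and the circuit breaker, can be sketched as follows. This is a minimal illustration under assumptions: the checkpoint is taken to be a JSON object keyed by item ID, a corrupted file is treated as empty, and all function and class names are hypothetical. The os.replace() call, the 50-call window, and the 50% threshold come from this document.

```python
import json
import os
import tempfile
from collections import deque


def save_checkpoint(path, results):
    """Write to a .tmp file, then swap it in with os.replace()."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(results, f)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes reach disk before the rename
    os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one


def load_checkpoint(path):
    """Return saved results, or {} if the file is missing or corrupted (assumed policy)."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


class CircuitBreaker:
    """Signals a pause when more than `threshold` of the last `window` calls failed."""

    def __init__(self, window=50, threshold=0.5):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def should_pause(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough history yet
        return self.outcomes.count(False) / len(self.outcomes) > self.threshold


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "caption_assignments.json")
    save_checkpoint(path, {"PMC3056141__F1.jpg": {"kegg_id": "hsa04620"}})
    done = load_checkpoint(path)
    todo = ["PMC3056141__F1.jpg", "PMC3056141__F2.jpg"]
    pending = [name for name in todo if name not in done]  # what --resume would process

print(pending)
```

When `should_pause()` returns True, the caller would sleep for the 2-minute pause described above before submitting further work.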

Prompt 1: Caption → KEGG Assignment

Sent for each of the 34,062 images with a figure caption. The {caption} placeholder is replaced with the actual caption text (up to 1000 chars). The {kegg_list} placeholder is replaced with all 239 KEGG pathways.

You are a biomedical pathway expert. Given a figure caption from a scientific paper, identify which KEGG human signaling/metabolic pathway the figure most specifically depicts.

FIGURE CAPTION: {caption}

KEGG HUMAN PATHWAYS: {kegg_list} ← 239 pathways, shown below

INSTRUCTIONS:
1. Read the caption carefully. Identify the specific biological pathway or process shown.
2. Use your biomedical expertise to identify the KEGG pathway even if the exact pathway name is not stated in the caption. For example: if the caption mentions "metformin," recognize it acts through AMPK signaling (hsa04152). If it mentions "PERK/eIF2alpha/ATF4/CHOP," recognize this as the UPR branch within "Protein processing in endoplasmic reticulum" (hsa04141). Map from biological knowledge, not just keyword matching.
3. Match it to the single most specific KEGG pathway from the list above. Read the FULL list carefully — KEGG pathway names are sometimes non-obvious (e.g., "Protein processing in endoplasmic reticulum" covers the Unfolded Protein Response; "Glycerophospholipid metabolism" covers choline/phosphatidylcholine pathways).
4. Only respond with "none" if NONE of the biological processes in the caption can be mapped to ANY pathway in the KEGG list. If the caption describes a disease mechanism or drug effect that operates THROUGH a specific KEGG pathway, assign that pathway. Reserve "none" for: purely methodological figures, generic network visualizations, or captions too vague to identify any specific pathway.
5. Prefer specific pathways over broad ones. For example, if the caption clearly describes Wnt signaling, choose "Wnt signaling pathway" not "Pathways in cancer".
6. If the caption describes crosstalk or interaction between two specific pathways, choose the pathway that is the PRIMARY subject of the figure (usually the one in the title or first mentioned). If truly equal, choose the more specific of the two.

Respond in JSON only:
{"kegg_id": "hsaXXXXX or none", "kegg_name": "pathway name or none", "confidence": 0.0-1.0, "reasoning": "one sentence explaining your choice"}

CONFIDENCE SCALE (use strictly):
- 1.0: Caption explicitly names the KEGG pathway or its canonical synonym
- 0.8: Caption names key pathway components (>2 specific genes/proteins from the pathway)
- 0.6: Caption implies the pathway through drug/disease/mechanism inference
- 0.4: Caption is ambiguous between 2-3 possible KEGG pathways
- 0.2: Weak/indirect evidence only
- 0.0: No pathway information in caption (returning "none")

Prompt 2: PubMed Metadata → KEGG Assignment

Sent for each of the 42,362 unique papers with PubMed metadata. Results fan out to all images from that paper. Abstract truncated to 2000 chars. MeSH descriptors comma-joined.

You are a biomedical pathway expert. A scientific paper contains pathway diagram figures, but the figure captions are not available. Using the paper's metadata below, identify which KEGG human signaling/metabolic pathway the paper most likely depicts in its figures.

PAPER TITLE: {title}

ABSTRACT (truncated): {abstract}

MeSH TERMS: {mesh_terms}

AUTHOR KEYWORDS: {keywords}

KEGG HUMAN PATHWAYS: {kegg_list} ← same 239 pathways

INSTRUCTIONS:
1. This is paper-level metadata, not figure-level. The paper may discuss multiple pathways.
2. Use your biomedical expertise to identify the KEGG pathway even if not named verbatim.
3. Identify the single most prominent pathway discussed in this paper.
4. Only respond with "none" if the paper doesn't focus on any specific KEGG pathway.
5. Prefer specific pathways over broad ones.
6. Note: this assignment is inherently lower confidence than caption-based assignment since we're matching paper-level info to figure-level pathways.

Respond in JSON only:
{"kegg_id": "hsaXXXXX or none", "kegg_name": "pathway name or none", "confidence": 0.0-1.0, "reasoning": "one sentence explaining your choice"}

CONFIDENCE SCALE (use strictly):
- 1.0: Paper title explicitly names the KEGG pathway
- 0.8: Abstract describes key pathway components in detail
- 0.6: MeSH terms or keywords strongly suggest a specific pathway
- 0.4: Multiple pathways discussed, one slightly more prominent
- 0.2: Weak/indirect evidence only
- 0.0: No pathway information (returning "none")
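Filling the placeholders can be sketched as below. The template is abbreviated, and the input field names, the keyword join, and the one-pathway-per-line KEGG formatting are assumptions; the 2000-character abstract limit and comma-joined MeSH terms come from this document. Because the prompt contains a literal JSON object in braces, str.format() would raise on it, so this sketch substitutes only `{name}` tokens via a regular expression.

```python
import re

# Abbreviated stand-in for the full prompt text.
TEMPLATE = (
    "PAPER TITLE: {title}\n"
    "ABSTRACT (truncated): {abstract}\n"
    "MeSH TERMS: {mesh_terms}\n"
    "AUTHOR KEYWORDS: {keywords}\n"
    "KEGG HUMAN PATHWAYS:\n{kegg_list}\n"
    'Respond in JSON only: {"kegg_id": "hsaXXXXX or none"}'
)


def fill(template, values):
    """Replace {name} tokens in one pass; braces around JSON are left untouched."""
    return re.sub(r"\{(\w+)\}", lambda m: values.get(m.group(1), m.group(0)), template)


def build_metadata_prompt(paper, kegg_pathways):
    return fill(TEMPLATE, {
        "title": paper.get("title", ""),
        "abstract": paper.get("abstract", "")[:2000],          # truncate to 2000 chars
        "mesh_terms": ", ".join(paper.get("mesh_terms", [])),  # comma-joined
        "keywords": ", ".join(paper.get("keywords", [])),      # join style is an assumption
        "kegg_list": "\n".join(f"{pid}: {name}" for pid, name in kegg_pathways.items()),
    })


paper = {
    "title": "TLR9 signaling",
    "abstract": "z" * 3000,
    "mesh_terms": ["Toll-Like Receptor 9", "Immunity, Innate"],
    "keywords": [],
}
prompt = build_metadata_prompt(paper, {"hsa04620": "Toll-like receptor signaling pathway"})
print(len(prompt))
```

The caption prompt can be filled the same way, with the caption cut to 1000 characters.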

KEGG Reference List (fed to both prompts)

The full list of 239 human KEGG pathways that both prompts receive. This is the menu of options the LLM picks from.

The list is stored in ~/work/gemini_cell_pathways/data/kegg_human_pathways.json (239 records).

Expected Output

Each LLM call returns one JSON object. Caption assignments are keyed by filename (image-level). Metadata assignments are keyed by PMC ID (paper-level, fanned out to all images in build_metadata.py).

Per-image LLM response

{
  "pmc_id": "PMC3056141",
  "filename": "PMC3056141__F1.jpg",
  "kegg_id": "hsa04620",
  "kegg_name": "Toll-like receptor signaling pathway",
  "confidence": 1.0,
  "reasoning": "The caption explicitly describes TLR9-dependent innate immune signaling, which is a key component of the Toll-like receptor signaling pathway."
}
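Since the model is asked for JSON only, each response can be validated before it is stored. The following is a sketch of one possible check, not the project's actual parser: the function name is hypothetical, and rejecting a response (returning None) on any malformed field is an assumed policy.

```python
import json


def parse_assignment(raw_text, valid_ids):
    """Return a cleaned assignment dict, or None if the response is unusable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    kegg_id = data.get("kegg_id")
    # The ID must be "none" or one of the pathways offered in the prompt.
    if not isinstance(kegg_id, str) or (kegg_id != "none" and kegg_id not in valid_ids):
        return None

    try:
        confidence = float(data.get("confidence"))
    except (TypeError, ValueError):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None

    return {
        "kegg_id": kegg_id,
        "kegg_name": data.get("kegg_name", "none"),
        "confidence": confidence,
        "reasoning": data.get("reasoning", ""),
    }


valid_ids = {"hsa04620", "hsa04152", "hsa04141"}  # small subset for illustration
raw = '{"kegg_id": "hsa04620", "kegg_name": "Toll-like receptor signaling pathway", "confidence": 1.0, "reasoning": "Caption names TLR9 signaling."}'
parsed = parse_assignment(raw, valid_ids)
print(parsed["kegg_id"], parsed["confidence"])
```

Checking the ID against the offered list guards against the model inventing a pathway ID that is not among the 239 options.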

Final metadata.csv columns

One row per image. The main deliverable — all three signals merged:

pmc_id: PMC article ID
filename: image filename
figure_number: extracted figure number (F1, F2, ...)
gemini_label: Gemini's free-form pathway name
gemini_related: related pathways (pipe-delimited)
n_genes: number of Entrez genes extracted

--- Signal 1: Jaccard ---
jaccard_kegg_id: best KEGG match by gene overlap
jaccard_kegg_name: pathway name
jaccard_score: Jaccard similarity (0-1)

--- Signal 2: Caption LLM ---
caption_kegg_id: KEGG ID from caption analysis
caption_kegg_name: pathway name
caption_confidence: 0.0-1.0 (strict scale)
caption_reasoning: one-sentence explanation

--- Signal 3: Metadata LLM ---
metadata_kegg_id: KEGG ID from paper metadata
metadata_kegg_name: pathway name
metadata_confidence: 0.0-1.0 (strict scale)
metadata_reasoning: one-sentence explanation

year: publication year
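For reference, the Jaccard similarity behind Signal 1 is the size of the intersection of two gene sets divided by the size of their union. The sketch below shows the standard formula and a best-match selection; how the pipeline actually computes image_pathway_scores.csv is not described here, and the gene IDs and pathway names in the example are toy values, not real KEGG memberships.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_pathway(image_genes: set, pathway_genes: dict):
    """Return (pathway_id, score) for the highest-scoring pathway."""
    scores = {pid: jaccard(image_genes, genes) for pid, genes in pathway_genes.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Toy example with made-up integer gene IDs.
image_genes = {1, 2, 3, 4}
pathway_genes = {
    "pathway_a": {1, 2, 3, 5, 6},  # 3 shared of 6 total -> 0.5
    "pathway_b": {4, 7, 8},        # 1 shared of 6 total -> ~0.17
}
print(best_pathway(image_genes, pathway_genes))  # ('pathway_a', 0.5)
```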

File Locations

File | Location | Records
Gemini annotations | ~/work/gemini_cell_pathways/gemini3_annotations/full_49k_v2_2026-01-26_clean.jsonl | 48,974
Figure captions | ~/work/gemini_cell_pathways/data/pmc_caption_pathways_full.json | 34,062
PubMed metadata | ~/work/gemini_cell_pathways/data/pubmed_metadata_full.json | 42,362
KEGG pathways | ~/work/gemini_cell_pathways/data/kegg_human_pathways.json | 239
Jaccard scores | ~/work/gemini_cell_pathways/data/image_pathway_scores.csv | 48,779

Generated by this pipeline
Caption assignments | ~/work/gemini-pathways-final/intermediate/caption_assignments.json | 34,062
Metadata assignments | ~/work/gemini-pathways-final/intermediate/metadata_assignments.json | 42,362
Final edges | ~/work/gemini-pathways-final/output/edges.csv
Final nodes | ~/work/gemini-pathways-final/output/nodes.csv
Final metadata | ~/work/gemini-pathways-final/output/metadata.csv | ~48,779