Pipeline Overview
42,362 papers with metadata
Execution Plan
Run 1: Caption Assignment (Signal 2)
Send each figure caption + 239 KEGG pathways to Gemini 3 Pro. Figure-level signal — strongest assignment.
| LLM calls | 34,062 (one per captioned image) |
| Model | Gemini 3 Pro (Vertex AI) |
| Throughput | ~120–165/min sustained |
| Est. time | ~4.7 hours |
| Cost | Free (GCP credits) |
| Output | intermediate/caption_assignments.json |
caffeinate -s python scripts/run_caption_assignment_gemini.py --full
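The time estimate in the table follows from the call count and the sustained throughput. A minimal sketch of that arithmetic is below; the function name is illustrative and not part of the pipeline scripts.

```python
def estimated_hours(n_calls: int, calls_per_min: float) -> float:
    """Wall-clock hours needed to issue n_calls at a sustained per-minute rate."""
    return n_calls / calls_per_min / 60


# Caption run: 34,062 calls at the 120-165 calls/min range quoted in the table.
slow = estimated_hours(34_062, 120)  # lower-bound throughput
fast = estimated_hours(34_062, 165)  # upper-bound throughput
print(f"{fast:.1f} to {slow:.1f} hours")
```

The quoted ~4.7 hours corresponds to the lower bound of 120 calls/min, so it is the conservative end of the range.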
Run 2: Metadata Assignment (Signal 3)
Send paper title, abstract, MeSH terms, keywords, and the 239 KEGG pathways to Gemini 3 Pro. This is a paper-level signal, run for all papers rather than only those with captionless figures. Calls are deduplicated by PMC ID.
| LLM calls | 42,362 (one per unique paper) |
| Model | Gemini 3 Pro (Vertex AI) |
| Throughput | ~120–165/min sustained |
| Est. time | ~5.9 hours |
| Cost | Free (GCP credits) |
| Output | intermediate/metadata_assignments.json |
caffeinate -s python scripts/run_metadata_assignment_gemini.py --full
Metadata assignment is paper-level: one paper may have multiple figures showing different pathways. Results fan out to all images from that paper. Running for all papers (not just captionless) gives an independent validation signal for captioned images too.
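The fan-out step can be sketched as follows. This is an illustration rather than the code in build_metadata.py: it assumes the PMC ID is the part of the filename before the double underscore (as in "PMC3056141__F1.jpg"), and the function name is hypothetical.

```python
def fan_out(metadata_by_pmc: dict, filenames: list[str]) -> dict:
    """Copy each paper-level assignment onto every image from that paper."""
    per_image = {}
    for name in filenames:
        # Assumed convention: "PMC3056141__F1.jpg" -> "PMC3056141"
        pmc_id = name.split("__", 1)[0]
        if pmc_id in metadata_by_pmc:
            per_image[name] = metadata_by_pmc[pmc_id]
    return per_image


paper_level = {"PMC3056141": {"kegg_id": "hsa04620", "confidence": 0.8}}
images = ["PMC3056141__F1.jpg", "PMC3056141__F2.jpg", "PMC0000000__F1.jpg"]
print(fan_out(paper_level, images))
```

Images whose paper has no metadata assignment are simply absent from the result, which matches the idea that they fall back to the other signals.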
Run 3: Build Final CSVs
Merge all signals into three deliverable CSVs. No API calls.
python scripts/build_edges.py && python scripts/build_nodes.py && python scripts/build_metadata.py
| Output | output/edges.csv, output/nodes.csv, output/metadata.csv |
Signal Coverage After Both Runs
| Metadata LLM (Signal 3) | 48,738 images |
| Caption LLM (Signal 2) | 34,062 images |
After both runs: 34,039 images get both caption AND metadata signals (cross-validation). 14,699 captionless images get metadata only. Only 18 images (~0.04%) have no caption and no PubMed metadata — these get Jaccard + Gemini label only.
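The breakdown above is a partition of the image set by which LLM signals are available. A small sketch of how such counts could be derived, assuming hypothetical sets of image filenames (the real scripts may compute this differently):

```python
from collections import Counter


def coverage(images, has_caption, has_metadata) -> Counter:
    """Count images by which LLM signals they can receive."""
    counts = Counter()
    for img in images:
        cap, meta = img in has_caption, img in has_metadata
        if cap and meta:
            counts["both"] += 1
        elif meta:
            counts["metadata_only"] += 1
        elif cap:
            counts["caption_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy example with four images.
print(coverage(["a", "b", "c", "d"], has_caption={"a", "b"}, has_metadata={"a", "c"}))
```

As a consistency check on the reported figures, 34,039 (both) plus 14,699 (metadata only) gives the 48,738 images covered by Signal 3.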
Model Comparison & Throughput Testing
Tested 4 models on the caption assignment task. Gemini 3 Pro is the clear winner: best quality, free via credits, and fast enough with the native async client.
| Model | RPM | 34k time | Cost | Quality | Reasoning | Status |
| Gemini 3 Pro | 120–165 | ~4.7 hrs | Free | Baseline (best) | Yes | Selected |
| GPT-4o-mini | ~200 | ~3 hrs | ~$20 | 70% agree w/ Pro | Yes | Backup |
| Gemini 2.0 Flash | 716 | ~47 min | Free | Untested | Untested | Not used |
| Gemini 2.5 Flash | 68 | ~8.3 hrs | Free | 65% agree w/ Pro | Drops field | Rejected |
Key finding: Gemini 3 Pro was initially measured at 30 RPM using a sync client. Switching to the native async client (client.aio.models.generate_content) with concurrency 200 yields 120–165 RPM — matching the original 49k image annotation throughput. The bottleneck was the test harness, not the model.
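The pattern behind the speedup is bounded concurrency over an async client. The sketch below shows only that pattern with the standard library: fake_llm_call is a stand-in for the real request (the source names client.aio.models.generate_content), and the function names are illustrative.

```python
import asyncio


async def fake_llm_call(item: str) -> str:
    """Stand-in for the real async request; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return f"result:{item}"


async def run_all(items, concurrency: int = 200):
    """Run one call per item with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with sem:  # blocks here once the cap is reached
            return await fake_llm_call(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(worker(i) for i in items))


results = asyncio.run(run_all([f"img{i}" for i in range(20)], concurrency=5))
print(len(results))
```

A sync client issues one request at a time, so throughput is bounded by per-call latency; with many requests in flight the same latency is overlapped, which is consistent with the jump from 30 RPM to 120–165 RPM described above.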
Script Robustness
Both assignment scripts are built for unattended overnight runs:
| Semaphore per-call | Released between retries — stuck calls don't block other work |
| Atomic checkpoints | Write to .tmp then os.replace() — crash-safe |
| Batched processing | 500 images per batch, checkpoint between batches |
| Graceful shutdown | Ctrl-C saves progress, resume with --resume |
| Circuit breaker | Pauses 2 min if >50% error rate over 50 calls |
| Task deadline | 120s max per image (all retries) — no infinite hangs |
| Resume | --resume skips already-processed items, handles corrupted files |
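The atomic checkpoint and resume rows can be illustrated with a short sketch. It assumes the checkpoint is a single JSON object keyed by item ID; the function names are hypothetical and the real scripts may differ in detail.

```python
import json
import os


def save_checkpoint(path: str, results: dict) -> None:
    """Write to a .tmp file, then swap it into place with os.replace()."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(results, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the swap
    os.replace(tmp, path)  # readers see either the old or the new file, never a partial one


def load_checkpoint(path: str) -> dict:
    """Return saved results, or an empty dict if the file is missing or corrupted."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def pending(all_ids: list[str], done: dict) -> list[str]:
    """Items still to process on resume: everything not already in the checkpoint."""
    return [i for i in all_ids if i not in done]
```

Because the swap happens only after a complete write, a crash mid-write leaves the previous checkpoint untouched, and at most one batch of work is repeated on resume.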
Prompt 1: Caption → KEGG Assignment
Sent for each of the 34,062 images with a figure caption. The {caption} placeholder is replaced with the actual caption text (up to 1000 chars). The {kegg_list} placeholder is replaced with all 239 KEGG pathways.
You are a biomedical pathway expert. Given a figure caption from a scientific paper,
identify which KEGG human signaling/metabolic pathway the figure most specifically depicts.
FIGURE CAPTION:
{caption}
KEGG HUMAN PATHWAYS:
{kegg_list} ← 239 pathways, shown below
INSTRUCTIONS:
1. Read the caption carefully. Identify the specific biological pathway or process shown.
2. Use your biomedical expertise to identify the KEGG pathway even if the exact pathway
name is not stated in the caption. For example: if the caption mentions "metformin,"
recognize it acts through AMPK signaling (hsa04152). If it mentions "PERK/eIF2alpha/ATF4/CHOP,"
recognize this as the UPR branch within "Protein processing in endoplasmic reticulum" (hsa04141).
Map from biological knowledge, not just keyword matching.
3. Match it to the single most specific KEGG pathway from the list above. Read the FULL list
carefully — KEGG pathway names are sometimes non-obvious (e.g., "Protein processing in
endoplasmic reticulum" covers the Unfolded Protein Response; "Glycerophospholipid metabolism"
covers choline/phosphatidylcholine pathways).
4. Only respond with "none" if NONE of the biological processes in the caption can be mapped
to ANY pathway in the KEGG list. If the caption describes a disease mechanism or drug effect
that operates THROUGH a specific KEGG pathway, assign that pathway. Reserve "none" for:
purely methodological figures, generic network visualizations, or captions too vague to
identify any specific pathway.
5. Prefer specific pathways over broad ones. For example, if the caption clearly describes
Wnt signaling, choose "Wnt signaling pathway" not "Pathways in cancer".
6. If the caption describes crosstalk or interaction between two specific pathways, choose
the pathway that is the PRIMARY subject of the figure (usually the one in the title or
first mentioned). If truly equal, choose the more specific of the two.
Respond in JSON only:
{"kegg_id": "hsaXXXXX or none", "kegg_name": "pathway name or none", "confidence": 0.0-1.0, "reasoning": "one sentence explaining your choice"}
CONFIDENCE SCALE (use strictly):
- 1.0: Caption explicitly names the KEGG pathway or its canonical synonym
- 0.8: Caption names key pathway components (>2 specific genes/proteins from the pathway)
- 0.6: Caption implies the pathway through drug/disease/mechanism inference
- 0.4: Caption is ambiguous between 2-3 possible KEGG pathways
- 0.2: Weak/indirect evidence only
- 0.0: No pathway information in caption (returning "none")
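Filling the two placeholders could look like the sketch below. The template here is heavily shortened, the "hsaXXXXX: name" line format for the pathway list is an assumption, and the function name is hypothetical. Note that str.replace is used rather than str.format, because the prompt contains literal JSON braces that str.format would try to interpret.

```python
MAX_CAPTION_CHARS = 1000  # captions are truncated to 1000 characters

# Shortened stand-in for the full prompt text.
TEMPLATE = (
    "FIGURE CAPTION:\n{caption}\n\n"
    "KEGG HUMAN PATHWAYS:\n{kegg_list}\n\n"
    'Respond in JSON only:\n{"kegg_id": "hsaXXXXX or none"}'
)


def build_caption_prompt(caption: str, pathways: dict) -> str:
    """Substitute the truncated caption and the pathway list into the template."""
    # Assumed list format: one "ID: name" pair per line.
    kegg_list = "\n".join(f"{kid}: {name}" for kid, name in pathways.items())
    return (
        TEMPLATE
        .replace("{kegg_list}", kegg_list)
        .replace("{caption}", caption[:MAX_CAPTION_CHARS])
    )


print(build_caption_prompt("Wnt/beta-catenin signaling in colon cancer.",
                           {"hsa04310": "Wnt signaling pathway"}))
```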
Prompt 2: PubMed Metadata → KEGG Assignment
Sent for each of the 42,362 unique papers with PubMed metadata. Results fan out to all images from that paper. Abstract truncated to 2000 chars. MeSH descriptors comma-joined.
You are a biomedical pathway expert. A scientific paper contains pathway diagram figures,
but the figure captions are not available. Using the paper's metadata below, identify
which KEGG human signaling/metabolic pathway the paper most likely depicts in its figures.
PAPER TITLE: {title}
ABSTRACT (truncated): {abstract}
MeSH TERMS: {mesh_terms}
AUTHOR KEYWORDS: {keywords}
KEGG HUMAN PATHWAYS:
{kegg_list} ← same 239 pathways
INSTRUCTIONS:
1. This is paper-level metadata, not figure-level. The paper may discuss multiple pathways.
2. Use your biomedical expertise to identify the KEGG pathway even if not named verbatim.
3. Identify the single most prominent pathway discussed in this paper.
4. Only respond with "none" if the paper doesn't focus on any specific KEGG pathway.
5. Prefer specific pathways over broad ones.
6. Note: this assignment is inherently lower confidence than caption-based assignment
since we're matching paper-level info to figure-level pathways.
Respond in JSON only:
{"kegg_id": "hsaXXXXX or none", "kegg_name": "pathway name or none", "confidence": 0.0-1.0, "reasoning": "one sentence explaining your choice"}
CONFIDENCE SCALE (use strictly):
- 1.0: Paper title explicitly names the KEGG pathway
- 0.8: Abstract describes key pathway components in detail
- 0.6: MeSH terms or keywords strongly suggest a specific pathway
- 0.4: Multiple pathways discussed, one slightly more prominent
- 0.2: Weak/indirect evidence only
- 0.0: No pathway information (returning "none")
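Since both prompts request the same JSON shape, responses can be checked before they are stored. The following is a hedged sketch of such a check, not the validation the scripts actually perform; the function name and the decision to return None on any failure are assumptions.

```python
import json


def parse_assignment(raw: str, valid_ids: set[str]):
    """Return the parsed response dict, or None if it fails basic checks."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None

    kegg_id = obj.get("kegg_id")
    conf = obj.get("confidence")

    # The ID must be "none" or one of the pathways offered in the prompt.
    if kegg_id != "none" and kegg_id not in valid_ids:
        return None
    # Confidence must be a real number on the 0.0-1.0 scale (bools excluded).
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0.0 <= conf <= 1.0:
        return None
    # Reject responses that drop the reasoning field.
    if not obj.get("reasoning"):
        return None
    return obj
```

Checking the ID against the supplied list guards against pathway IDs the model invents, and the reasoning check catches the dropped-field behavior noted for Gemini 2.5 Flash in the model comparison.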
KEGG Reference List (fed to both prompts)
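Both prompts receive the same full list of 239 human KEGG pathways. This is the menu of options the LLM picks from. The list itself is not reproduced here; the File Locations table gives its source as data/kegg_human_pathways.json.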
Expected Output
Each LLM call returns one JSON object. Caption assignments are keyed by filename (image-level). Metadata assignments are keyed by PMC ID (paper-level, fanned out to all images in build_metadata.py).
Per-image LLM response
{
  "pmc_id": "PMC3056141",
  "filename": "PMC3056141__F1.jpg",
  "kegg_id": "hsa04620",
  "kegg_name": "Toll-like receptor signaling pathway",
  "confidence": 1.0,
  "reasoning": "The caption explicitly describes TLR9-dependent innate immune signaling, which is a key component of the Toll-like receptor signaling pathway."
}
Final metadata.csv columns
One row per image. The main deliverable — all three signals merged:
pmc_id — PMC article ID
filename — image filename
figure_number — extracted figure number (F1, F2, ...)
gemini_label — Gemini's free-form pathway name
gemini_related — related pathways (pipe-delimited)
n_genes — number of Entrez genes extracted
--- Signal 1: Jaccard ---
jaccard_kegg_id — best KEGG match by gene overlap
jaccard_kegg_name — pathway name
jaccard_score — Jaccard similarity (0-1)
--- Signal 2: Caption LLM ---
caption_kegg_id — KEGG ID from caption analysis
caption_kegg_name — pathway name
caption_confidence— 0.0-1.0 (strict scale)
caption_reasoning — one-sentence explanation
--- Signal 3: Metadata LLM ---
metadata_kegg_id — KEGG ID from paper metadata
metadata_kegg_name — pathway name
metadata_confidence— 0.0-1.0 (strict scale)
metadata_reasoning — one-sentence explanation
year — publication year
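For reference, the Jaccard score in Signal 1 is the size of the intersection of two gene sets divided by the size of their union. The sketch below shows that formula and a best-match lookup; the function names and the toy gene IDs are illustrative, and the pipeline's own scoring code may handle ties or empty sets differently.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(figure_genes: set, pathway_genes: dict):
    """Return (kegg_id, score) for the pathway with the highest gene overlap."""
    scored = ((kid, jaccard(figure_genes, genes)) for kid, genes in pathway_genes.items())
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


# Toy example with made-up gene IDs.
pathways = {"hsaA": {2, 3, 4}, "hsaB": {9}}
print(best_match({1, 2, 3}, pathways))  # ('hsaA', 0.5)
```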
File Locations
| File | Location | Records |
| Gemini annotations | ~/work/gemini_cell_pathways/gemini3_annotations/full_49k_v2_2026-01-26_clean.jsonl | 48,974 |
| Figure captions | ~/work/gemini_cell_pathways/data/pmc_caption_pathways_full.json | 34,062 |
| PubMed metadata | ~/work/gemini_cell_pathways/data/pubmed_metadata_full.json | 42,362 |
| KEGG pathways | ~/work/gemini_cell_pathways/data/kegg_human_pathways.json | 239 |
| Jaccard scores | ~/work/gemini_cell_pathways/data/image_pathway_scores.csv | 48,779 |
| Generated by this pipeline |
| Caption assignments | ~/work/gemini-pathways-final/intermediate/caption_assignments.json | 34,062 |
| Metadata assignments | ~/work/gemini-pathways-final/intermediate/metadata_assignments.json | 42,362 |
| Final edges | ~/work/gemini-pathways-final/output/edges.csv | — |
| Final nodes | ~/work/gemini-pathways-final/output/nodes.csv | — |
| Final metadata | ~/work/gemini-pathways-final/output/metadata.csv | ~48,779 |