Improving LLM-based KEGG pathway assignment from figure captions. 10-image audit across 4 model configurations.
We assign KEGG pathway IDs to ~34k pathway images by sending their PMC figure captions to an LLM along with the full 239-pathway KEGG list. A 10-image audit of the original prompt (GPT-4o-mini) revealed 4 false negatives out of 10 — the model was returning "none" for captions that clearly described KEGG pathways.
We rewrote the prompt to fix the three root causes (poor list scanning, an over-aggressive "none" trigger, no biological inference), then tested the new v2 prompt on GPT-4o-mini, GPT-5.2, and Gemini 3 Pro.
The v1 prompt was straightforward: hand the LLM a caption and the KEGG list, ask for the best match.
```
You are a biomedical pathway expert. Given a figure caption from a scientific
paper, identify which KEGG human signaling/metabolic pathway the figure most
specifically depicts.

FIGURE CAPTION:
{caption}

KEGG HUMAN PATHWAYS:
{kegg_list}   # 239 pathways

INSTRUCTIONS:
1. Read the caption carefully. Identify the specific biological pathway or process shown.
2. Match it to the single most specific KEGG pathway from the list above.
3. If the caption describes something not in the KEGG list (e.g., a disease
   overview, drug mechanism, or general cell biology without a specific
   pathway), respond with "none".
4. Prefer specific pathways over broad ones.

Respond in JSON: {"kegg_id": ..., "kegg_name": ..., "confidence": 0.0-1.0, "reasoning": ...}
```
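For orientation, here is a minimal sketch of the per-image call, assuming the `openai` Python SDK; `PROMPT_V1` stands in for the template above, and the function name `assign_pathway` is illustrative rather than the production code:

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT_V1 = "..."  # the v1 template shown above, elided here

def assign_pathway(caption: str, kegg_list: str, model: str = "gpt-4o-mini") -> dict:
    """Classify one figure caption against the full 239-pathway KEGG list."""
    prompt = PROMPT_V1.format(caption=caption, kegg_list=kegg_list)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```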
Problems found:
- Poor list scanning: the model missed list entries whose KEGG names don't match the caption's wording (e.g., the UPR lives under "Protein processing in endoplasmic reticulum").
- Aggressive "none" trigger: instruction 3 gave the model an easy exit, and it took it even for captions that clearly described a listed pathway.
- No biological inference: nothing asked the model to map a drug or phenomenon to its canonical pathway (metformin → AMPK).
Six targeted changes, all within a single prompt (no multi-pass). The ones referenced in the analysis below: conceptual matching examples (Change 1), a list-scanning nudge (Change 2), tightened "none" criteria (Change 3), and a crosstalk instruction (Change 4), plus an anchored confidence scale.
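The verbatim v2 prompt isn't reproduced in this post; the following is a loose reconstruction of those additions as prompt fragments, paraphrased from the analysis below, and should not be read as the production prompt:

```python
# Illustrative reconstruction of the v2 additions; paraphrased from the changes
# described in this post, NOT the verbatim production prompt.
PROMPT_V2_ADDITIONS = """
- Match concepts, not keywords: a metformin caption implies AMPK signaling
  (hsa04152); PERK/eIF2a/ATF4/CHOP implies Protein processing in endoplasmic
  reticulum (hsa04141).
- Scan the full pathway list before answering; KEGG names can be non-obvious.
- Answer "none" ONLY when no listed pathway plausibly covers the caption, not
  merely because the caption's wording is absent from the list.
- For pathway crosstalk, pick the pathway that is the figure's primary subject,
  not the interactor.
- Anchor confidence: 1.0 = pathway named verbatim; 0.8 = components named;
  0.6 = inference required; 0.0 = none.
"""
```

Results on the 10-image audit: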
| # | Image | Expected | v1 Mini | v2 Mini | v2 GPT-5.2 | v2 Gemini 3 |
|---|---|---|---|---|---|---|
| 1 | cGAS-STING / DDR | hsa04623 | none | hsa03440 | hsa04623 | hsa04623 |
| 2 | Metformin / AMPK | hsa04152 | none | hsa04152 | hsa04152 | hsa04152 |
| 3 | PERK / UPR | hsa04141 | none | hsa04141 | hsa04141 | hsa04141 |
| 4 | DNA-sensing (truncated) | hsa04623 | hsa04620 | hsa04620 | hsa04623 | hsa04620 |
| 5 | PLK1 / proteasome | ambiguous | hsa04120 | hsa04120 | hsa04120 | hsa04110 |
| 6 | Hippo / TGF-β crosstalk | ambiguous | hsa04350 | hsa04350 | hsa04390 | hsa04390 |
| 7 | HIF-1 (verbatim) | hsa04066 | hsa04066 | hsa04066 | hsa04066 | hsa04066 |
| 8 | Wnt (verbatim) | hsa04310 | hsa04310 | hsa04310 | hsa04310 | hsa04310 |
| 9 | Choline / glycerophospholipid | hsa00564 | none | hsa00564 | hsa00564 | hsa00564 |
| 10 | Vague (36 chars) | none | none | none | none | none |
Legend: FIXED = correct where v1 was wrong; OK = correct in both; WRONG = false negative; CHANGED = different but not ideal; AMBIG = genuinely ambiguous.
The v1 prompt produced 0.9 for everything — correct, wrong, and ambiguous alike. The anchored scale in v2 produces meaningful signal:
| Image | v1 Mini | v2 Mini | v2 GPT-5.2 | v2 Gemini 3 |
|---|---|---|---|---|
| cGAS-STING (multi-panel) | 0.9 | 0.8 | 0.8 | 0.8 |
| Metformin (inference) | 0.9 | 0.8 | 0.6 | 0.6 |
| PERK/UPR (components) | 0.9 | 0.8 | 0.8 | 0.9 |
| Wnt (verbatim) | 1.0 | 1.0 | 1.0 | 1.0 |
| HIF-1 (verbatim) | 1.0 | 1.0 | 1.0 | 1.0 |
| Vague caption | 0.0 | 0.0 | 0.0 | 0.0 |
GPT-5.2 and Gemini 3 correctly use 0.6 for inference-required cases (metformin → AMPK), while mini stays at 0.8. All three reserve 1.0 for verbatim matches only.
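Downstream, these anchors can gate what gets auto-accepted. A sketch; the tier names come from the anchored scale, while the threshold values are illustrative:

```python
def evidence_tier(confidence: float) -> str:
    """Bucket an anchored confidence score into its evidence tier."""
    if confidence >= 0.95:
        return "verbatim"    # pathway named outright in the caption
    if confidence >= 0.75:
        return "components"  # pathway components named; pathway itself inferred
    if confidence >= 0.5:
        return "inference"   # biological inference (e.g., metformin -> AMPK)
    return "none"            # no usable pathway signal

# e.g., auto-accept "verbatim"/"components" hits, spot-check "inference" hits
```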
v1 mini: "none" (0.9) — "The figure caption describes mechanisms related to DNA damage response and immunotherapy, which do not correspond to a specific KEGG pathway."
v2 mini: hsa03440 Homologous recombination (0.8) — latched onto DDR/PARPi theme instead of cGAS-STING.
GPT-5.2: hsa04623 (0.8) — "The caption highlights PARP inhibitor-induced PD-L1 upregulation via the cGAS-STING axis."
Gemini 3: hsa04623 (0.8) — "The caption explicitly names the cGAS-STING pathway as a key mechanism."
Analysis: Multi-panel caption covering 6 mechanisms. cGAS-STING is panel D. Both frontier models found it; mini picked the broader DDR theme. Truncation at 1000 chars also hides context.
v1 mini: "none" (0.9) — "does not directly correspond to a specific KEGG pathway listed."
v2 mini: hsa04152 (0.8) — "metformin's protective action... is known to involve the AMPK signaling pathway."
GPT-5.2: hsa04152 (0.6) — correctly flags as inference-based.
Gemini 3: hsa04152 (0.6) — "metformin primarily functions by activating the AMPK signaling pathway."
Analysis: The caption never says "AMPK" — it says "metformin." The conceptual matching instruction (Change 1) directly fixed this. The 0.6 confidence from frontier models is well-calibrated: it's an inference, not a verbatim match.
v1 mini: "none" (0.9) — "PERK/eIF2alpha/ATF4/CHOP... is not listed in the provided KEGG pathways."
v2 mini: hsa04141 (0.8) — "PERK/eIF2α/ATF4/CHOP pathway... is a key component of the Unfolded Protein Response."
GPT-5.2: hsa04141 (0.8)
Gemini 3: hsa04141 (0.9)
Analysis: Classic list-scanning failure. "Protein processing in endoplasmic reticulum" is a non-obvious name for UPR. The explicit examples in Change 1 (mentioning PERK → hsa04141) and the list-scanning nudge in Change 2 both helped.
Image 4: DNA-sensing (truncated caption)
Both mini runs + Gemini 3: hsa04620 (Toll-like receptor signaling) — latched onto panel (a), which is fully described in the truncated text.
GPT-5.2: hsa04623 (0.8) — "the caption centers on DNA-triggered innate immune signaling, explicitly describing cytosolic DNA recognition."
Analysis: Truncation at 1000 chars is the root cause. Panel (a) TLR9 is fully described; panel (b) cytosolic DNA-sensing is cut off mid-sentence. Only GPT-5.2 inferred the overall figure theme from the title ("DNA-mediated activation") rather than the first panel's detail.
Image 5: PLK1 / proteasome
Mini + GPT-5.2: hsa04120 (Ubiquitin mediated proteolysis) — keyword-matched "proteasome degradation."
Gemini 3: hsa04110 (Cell cycle) — PLK1 is a canonical cell-cycle kinase; its proteasomal degradation is a cell-cycle regulatory step.
Analysis: Both are defensible. The 56-char caption is simply too short to disambiguate. Gemini 3 made the deeper biological call (PLK1 as a G2/M checkpoint regulator).
Image 6: Hippo / TGF-β crosstalk
Mini (both): hsa04350 (TGF-beta) — picked the second pathway mentioned.
GPT-5.2 + Gemini 3: hsa04390 (Hippo) — chose the primary subject (the caption's compound noun is Rap2-Hippo-Yap1; TGF-β is the interactor).
Analysis: The crosstalk instruction (Change 4) was designed for this case. Frontier models applied it correctly; mini didn't. KEGG's Hippo pathway includes TGF-β as upstream input, making Hippo the better contextual choice.
Image 7: HIF-1 (verbatim)
All models: hsa04066 (1.0). Caption explicitly names HIF-1. Ideal case.
Image 8: Wnt (verbatim)
All models: hsa04310 (1.0). Textbook Wnt description. Ideal case.
v1 mini: "none" (0.9) — "does not directly correspond to a specific KEGG pathway listed." This despite the Gemini label being a near-verbatim match to the KEGG name.
All v2: hsa00564 (0.8-0.9) — correctly identified choline/PC biosynthesis as glycerophospholipid metabolism.
Analysis: The clearest list-scanning failure in the set. The tightened "none" criteria (Change 3) and list-scanning nudge (Change 2) both contributed to the fix.
All models: "none" (0.0). Caption provides zero pathway information. Correct behavior preserved — the tightened "none" criteria didn't cause false positives.
| Model | 10-image cost | Est. 34k full run | Accuracy (10-img) |
|---|---|---|---|
| GPT-4o-mini (v2 prompt) | $0.006 | ~$20 | 7/10 (1 wrong, 2 ambig) |
| GPT-5.2 (v2 prompt) | $0.10 | ~$340 | 8/10 (0 wrong, 2 ambig) |
| Gemini 3 Pro (v2 prompt) | ~$0.08 | ~$270 | 8/10 (0 wrong, 2 ambig) |
v2 mini gets 80% of the improvement at 1/17th the cost (~$20 vs ~$340 projected). Frontier models (GPT-5.2, Gemini 3) agree everywhere except the truncated caption (#4) and one genuinely ambiguous case (#5).
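The full-run numbers are straight per-image extrapolation from the audit:

```python
AUDIT_IMAGES, FULL_RUN_IMAGES = 10, 34_000

for model, audit_cost in [("GPT-4o-mini", 0.006), ("GPT-5.2", 0.10), ("Gemini 3 Pro", 0.08)]:
    estimate = audit_cost / AUDIT_IMAGES * FULL_RUN_IMAGES
    print(f"{model}: ~${estimate:,.0f} projected")  # $20 / $340 / $272
```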
Prompt > model for most failures. 3 of 4 false negatives were fixed by prompt changes alone on the same cheap model (GPT-4o-mini). The prompt changes that mattered most: conceptual matching examples and tightened "none" criteria.
Frontier models handle edge cases. Both frontier models recovered the multi-panel DDR caption (#1), identifying a specific pathway named in one sub-panel among six where mini latched onto the broader theme; only GPT-5.2 also recovered the truncated DNA-sensing caption (#4).
Caption truncation is a real bottleneck. Two of the hardest cases (#1, #4) involve captions truncated at 1000 chars. Re-extracting at 2000 chars would help but is a separate pipeline step.
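If captions are re-extracted, cutting on a sentence boundary instead of mid-sentence is a cheap extra guard; a sketch, with the 2000-char limit suggested above:

```python
def truncate_caption(caption: str, limit: int = 2000) -> str:
    """Truncate at the last sentence boundary before `limit`, so a trailing
    panel description is dropped whole rather than cut off mid-sentence."""
    if len(caption) <= limit:
        return caption
    cut = caption[:limit]
    end = cut.rfind(". ")
    return cut[: end + 1] if end > 0 else cut
```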
Confidence is now meaningful. The anchored scale produces 1.0 (verbatim), 0.8 (components), 0.6 (inference), and 0.0 (none) — each tier maps to a distinct evidence quality.