Improving LLM-based KEGG pathway assignment from figure captions. 10-image audit across 4 model configurations.
We assign KEGG pathway IDs to ~34k pathway images by sending their PMC figure captions to an LLM along with the full 239-pathway KEGG list. A 10-image audit of the original prompt (GPT-4o-mini) revealed 4 false negatives out of 10 — the model was returning "none" for captions that clearly described KEGG pathways.
We rewrote the prompt to fix the three root causes (poor list scanning, an over-aggressive "none" trigger, no biological inference), then tested the new v2 prompt on GPT-4o-mini, GPT-5.2, and Gemini 3 Pro.
The v1 prompt was straightforward: hand the LLM a caption and the KEGG list, ask for the best match.
```
You are a biomedical pathway expert. Given a figure caption from a scientific
paper, identify which KEGG human signaling/metabolic pathway the figure most
specifically depicts.

FIGURE CAPTION:
{caption}

KEGG HUMAN PATHWAYS:
{kegg_list}   # 239 pathways

INSTRUCTIONS:
1. Read the caption carefully. Identify the specific biological pathway or process shown.
2. Match it to the single most specific KEGG pathway from the list above.
3. If the caption describes something not in the KEGG list (e.g., a disease
   overview, drug mechanism, or general cell biology without a specific
   pathway), respond with "none".
4. Prefer specific pathways over broad ones.

Respond in JSON: {"kegg_id": ..., "kegg_name": ..., "confidence": 0.0-1.0, "reasoning": ...}
```
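For orientation, here is a minimal sketch of the per-image call, assuming the `openai` Python SDK; `PROMPT_V1` stands in for the template above, and the function name `assign_pathway` is illustrative rather than the production code:

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT_V1 = "..."  # the v1 template shown above, elided here

def assign_pathway(caption: str, kegg_list: str, model: str = "gpt-4o-mini") -> dict:
    """Classify one figure caption against the full 239-pathway KEGG list."""
    prompt = PROMPT_V1.format(caption=caption, kegg_list=kegg_list)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```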
Problems found:
- Poor list scanning: the model missed list entries whose KEGG names don't match the caption's wording (e.g., the UPR lives under "Protein processing in endoplasmic reticulum").
- Aggressive "none" trigger: instruction 3 gave the model an easy exit, and it took it even for captions that clearly described a listed pathway.
- No biological inference: nothing asked the model to map a drug or phenomenon to its canonical pathway (metformin → AMPK).
Six targeted changes, all within a single prompt (no multi-pass). The ones referenced in the analysis below: conceptual matching examples (Change 1), a list-scanning nudge (Change 2), tightened "none" criteria (Change 3), and a crosstalk instruction (Change 4), plus an anchored confidence scale.
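The verbatim v2 prompt isn't reproduced in this post; the following is a loose reconstruction of those additions as prompt fragments, paraphrased from the analysis below, and should not be read as the production prompt:

```python
# Illustrative reconstruction of the v2 additions; paraphrased from the changes
# described in this post, NOT the verbatim production prompt.
PROMPT_V2_ADDITIONS = """
- Match concepts, not keywords: a metformin caption implies AMPK signaling
  (hsa04152); PERK/eIF2a/ATF4/CHOP implies Protein processing in endoplasmic
  reticulum (hsa04141).
- Scan the full pathway list before answering; KEGG names can be non-obvious.
- Answer "none" ONLY when no listed pathway plausibly covers the caption, not
  merely because the caption's wording is absent from the list.
- For pathway crosstalk, pick the pathway that is the figure's primary subject,
  not the interactor.
- Anchor confidence: 1.0 = pathway named verbatim; 0.8 = components named;
  0.6 = inference required; 0.0 = none.
"""
```

Results on the 10-image audit: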
| # | Image | Expected | v1 Mini | v2 Mini | v2 GPT-5.2 | v2 Gemini 3 |
|---|---|---|---|---|---|---|
| 1 | cGAS-STING / DDR | hsa04623 | none | hsa03440 | hsa04623 | hsa04623 |
| 2 | Metformin / AMPK | hsa04152 | none | hsa04152 | hsa04152 | hsa04152 |
| 3 | PERK / UPR | hsa04141 | none | hsa04141 | hsa04141 | hsa04141 |
| 4 | DNA-sensing (truncated) | hsa04623 | hsa04620 | hsa04620 | hsa04623 | hsa04620 |
| 5 | PLK1 / proteasome | ambiguous | hsa04120 | hsa04120 | hsa04120 | hsa04110 |
| 6 | Hippo / TGF-β crosstalk | ambiguous | hsa04350 | hsa04350 | hsa04390 | hsa04390 |
| 7 | HIF-1 (verbatim) | hsa04066 | hsa04066 | hsa04066 | hsa04066 | hsa04066 |
| 8 | Wnt (verbatim) | hsa04310 | hsa04310 | hsa04310 | hsa04310 | hsa04310 |
| 9 | Choline / glycerophospholipid | hsa00564 | none | hsa00564 | hsa00564 | hsa00564 |
| 10 | Vague (36 chars) | none | none | none | none | none |
Legend: FIXED = correct where v1 was wrong; OK = correct in both; WRONG = false negative; CHANGED = different but not ideal; AMBIG = genuinely ambiguous.
The v1 prompt produced 0.9 for everything — correct, wrong, and ambiguous alike. The anchored scale in v2 produces meaningful signal:
| Image | v1 Mini | v2 Mini | v2 GPT-5.2 | v2 Gemini 3 |
|---|---|---|---|---|
| cGAS-STING (multi-panel) | 0.9 | 0.8 | 0.8 | 0.8 |
| Metformin (inference) | 0.9 | 0.8 | 0.6 | 0.6 |
| PERK/UPR (components) | 0.9 | 0.8 | 0.8 | 0.9 |
| Wnt (verbatim) | 1.0 | 1.0 | 1.0 | 1.0 |
| HIF-1 (verbatim) | 1.0 | 1.0 | 1.0 | 1.0 |
| Vague caption | 0.0 | 0.0 | 0.0 | 0.0 |
GPT-5.2 and Gemini 3 correctly use 0.6 for inference-required cases (metformin → AMPK), while mini stays at 0.8. All three reserve 1.0 for verbatim matches only.
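Downstream, these anchors can gate what gets auto-accepted. A sketch; the tier names come from the anchored scale, while the threshold values are illustrative:

```python
def evidence_tier(confidence: float) -> str:
    """Bucket an anchored confidence score into its evidence tier."""
    if confidence >= 0.95:
        return "verbatim"    # pathway named outright in the caption
    if confidence >= 0.75:
        return "components"  # pathway components named; pathway itself inferred
    if confidence >= 0.5:
        return "inference"   # biological inference (e.g., metformin -> AMPK)
    return "none"            # no usable pathway signal

# e.g., auto-accept "verbatim"/"components" hits, spot-check "inference" hits
```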
v1 mini: "none" (0.9) — "The figure caption describes mechanisms related to DNA damage response and immunotherapy, which do not correspond to a specific KEGG pathway."
v2 mini: hsa03440 Homologous recombination (0.8) — latched onto DDR/PARPi theme instead of cGAS-STING.
GPT-5.2: hsa04623 (0.8) — "The caption highlights PARP inhibitor-induced PD-L1 upregulation via the cGAS-STING axis."
Gemini 3: hsa04623 (0.8) — "The caption explicitly names the cGAS-STING pathway as a key mechanism."
Analysis: Multi-panel caption covering 6 mechanisms. cGAS-STING is panel D. Both frontier models found it; mini picked the broader DDR theme. Truncation at 1000 chars also hides context.
v1 mini: "none" (0.9) — "does not directly correspond to a specific KEGG pathway listed."
v2 mini: hsa04152 (0.8) — "metformin's protective action... is known to involve the AMPK signaling pathway."
GPT-5.2: hsa04152 (0.6) — correctly flags as inference-based.
Gemini 3: hsa04152 (0.6) — "metformin primarily functions by activating the AMPK signaling pathway."
Analysis: The caption never says "AMPK" — it says "metformin." The conceptual matching instruction (Change 1) directly fixed this. The 0.6 confidence from frontier models is well-calibrated: it's an inference, not a verbatim match.
v1 mini: "none" (0.9) — "PERK/eIF2alpha/ATF4/CHOP... is not listed in the provided KEGG pathways."
v2 mini: hsa04141 (0.8) — "PERK/eIF2α/ATF4/CHOP pathway... is a key component of the Unfolded Protein Response."
GPT-5.2: hsa04141 (0.8)
Gemini 3: hsa04141 (0.9)
Analysis: Classic list-scanning failure. "Protein processing in endoplasmic reticulum" is a non-obvious name for UPR. The explicit examples in Change 1 (mentioning PERK → hsa04141) and the list-scanning nudge in Change 2 both helped.
Image 4: DNA-sensing (truncated caption)
Both mini runs + Gemini 3: hsa04620 (Toll-like receptor signaling) — latched onto panel (a), which is fully described in the truncated text.
GPT-5.2: hsa04623 (0.8) — "the caption centers on DNA-triggered innate immune signaling, explicitly describing cytosolic DNA recognition."
Analysis: Truncation at 1000 chars is the root cause. Panel (a) TLR9 is fully described; panel (b) cytosolic DNA-sensing is cut off mid-sentence. Only GPT-5.2 inferred the overall figure theme from the title ("DNA-mediated activation") rather than the first panel's detail.
Image 5: PLK1 / proteasome
Mini + GPT-5.2: hsa04120 (Ubiquitin mediated proteolysis) — keyword-matched "proteasome degradation."
Gemini 3: hsa04110 (Cell cycle) — PLK1 is a canonical cell-cycle kinase; its proteasomal degradation is a cell-cycle regulatory step.
Analysis: Both are defensible. The 56-char caption is simply too short to disambiguate. Gemini 3 made the deeper biological call (PLK1 as a G2/M checkpoint regulator).
Image 6: Hippo / TGF-β crosstalk
Mini (both): hsa04350 (TGF-beta) — picked the second pathway mentioned.
GPT-5.2 + Gemini 3: hsa04390 (Hippo) — chose the primary subject (the caption's compound noun is Rap2-Hippo-Yap1; TGF-β is the interactor).
Analysis: The crosstalk instruction (Change 4) was designed for this case. Frontier models applied it correctly; mini didn't. KEGG's Hippo pathway includes TGF-β as upstream input, making Hippo the better contextual choice.
Image 7: HIF-1 (verbatim)
All models: hsa04066 (1.0). Caption explicitly names HIF-1. Ideal case.
Image 8: Wnt (verbatim)
All models: hsa04310 (1.0). Textbook Wnt description. Ideal case.
v1 mini: "none" (0.9) — "does not directly correspond to a specific KEGG pathway listed." This despite the Gemini label being a near-verbatim match to the KEGG name.
All v2: hsa00564 (0.8-0.9) — correctly identified choline/PC biosynthesis as glycerophospholipid metabolism.
Analysis: The clearest list-scanning failure in the set. The tightened "none" criteria (Change 3) and list-scanning nudge (Change 2) both contributed to the fix.
All models: "none" (0.0). Caption provides zero pathway information. Correct behavior preserved — the tightened "none" criteria didn't cause false positives.
| Model | 10-image cost | Est. 34k full run | Accuracy (10-img) |
|---|---|---|---|
| GPT-4o-mini (v2 prompt) | $0.006 | ~$20 | 7/10 (1 wrong, 2 ambig) |
| GPT-5.2 (v2 prompt) | $0.10 | ~$340 | 8/10 (0 wrong, 2 ambig) |
| Gemini 3 Pro (v2 prompt) | ~$0.08 | ~$270 | 8/10 (0 wrong, 2 ambig) |
v2 mini gets 80% of the improvement at 1/17th the cost (~$20 vs ~$340 projected). Frontier models (GPT-5.2, Gemini 3) agree everywhere except the truncated caption (#4) and one genuinely ambiguous case (#5).
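The full-run numbers are straight per-image extrapolation from the audit:

```python
AUDIT_IMAGES, FULL_RUN_IMAGES = 10, 34_000

for model, audit_cost in [("GPT-4o-mini", 0.006), ("GPT-5.2", 0.10), ("Gemini 3 Pro", 0.08)]:
    estimate = audit_cost / AUDIT_IMAGES * FULL_RUN_IMAGES
    print(f"{model}: ~${estimate:,.0f} projected")  # $20 / $340 / $272
```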
Prompt > model for most failures. 3 of 4 false negatives were fixed by prompt changes alone on the same cheap model (GPT-4o-mini). The prompt changes that mattered most: conceptual matching examples and tightened "none" criteria.
Frontier models handle edge cases. Both frontier models recovered the multi-panel DDR caption (#1), identifying a specific pathway named in one sub-panel among six where mini latched onto the broader theme; only GPT-5.2 also recovered the truncated DNA-sensing caption (#4).
Caption truncation is a real bottleneck. Two of the hardest cases (#1, #4) involve captions truncated at 1000 chars. Re-extracting at 2000 chars would help but is a separate pipeline step.
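If captions are re-extracted, cutting on a sentence boundary instead of mid-sentence is a cheap extra guard; a sketch, with the 2000-char limit suggested above:

```python
def truncate_caption(caption: str, limit: int = 2000) -> str:
    """Truncate at the last sentence boundary before `limit`, so a trailing
    panel description is dropped whole rather than cut off mid-sentence."""
    if len(caption) <= limit:
        return caption
    cut = caption[:limit]
    end = cut.rfind(". ")
    return cut[: end + 1] if end > 0 else cut
```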
Confidence is now meaningful. The anchored scale produces 1.0 (verbatim), 0.8 (components), 0.6 (inference), and 0.0 (none) — each tier maps to a distinct evidence quality.