Caption Validation Pilot: 50 images x 3 models

Key Findings

Overall Picture

All three models agree that Gemini's pathway labels are biologically correct in the vast majority of cases (96-100% "yes"). The main issue is label granularity, not correctness: Gemini tends to assign category-level labels (e.g., "Cancer signaling," "Autophagy") rather than the specific pathway described in the caption.

Precision Is the Discriminator

GPT-4o-mini is the most lenient overall: it rates 36% of labels "exact" and nearly all others "too_broad," almost never calling a label wrong.
GPT-4o is stricter on correctness: it rates ~42% as "exact" but flags 3 entries as "wrong" (2 bio=no, 1 bio=unclear), catching nuances in cases where GPT-4o-mini accepted the label.
GPT-5 is the most demanding on precision: only ~24% "exact," with the majority rated "too_broad" and several "too_narrow." It provides the most detailed and context-rich reasoning.
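The per-model percentages above are shares of each precision rating over the pilot images. A minimal sketch of that tally follows, assuming each model's ratings are available as a flat list of strings; the function name and the toy input are hypothetical and are not pilot data.

```python
from collections import Counter


def precision_shares(ratings):
    """Return {rating: fraction of labels} for one model's precision ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    # An empty input yields an empty dict, so no division by zero occurs.
    return {rating: n / total for rating, n in counts.items()}


# Toy input for illustration only (NOT results from the pilot).
toy_ratings = ["exact", "too_broad", "too_broad", "too_narrow"]
shares = precision_shares(toy_ratings)
print(shares)
```

Running the same function over each model's 50 ratings would give directly comparable "exact" / "too_broad" / "too_narrow" / "wrong" shares.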

Model Disagreements

TLR vs. RIG-I (PMC8716696): GPT-4o flags the Gemini label "Toll-like receptor signaling" as wrong since the caption describes RIG-I/MDA5 signaling. GPT-4o-mini and GPT-5 keep bio=yes but note the label is too broad/narrow. GPT-4o appears most accurate here.
Methionine vs. Transsulfuration (PMC8007787): GPT-4o says the label "Methionine Metabolism" is wrong, calling the described pathway "Reverse Transsulfuration." The other models call it too_broad but biologically correct. A judgment call, but GPT-4o's position is defensible.
Insulin signaling (PMC7470804): The caption is too short to determine the pathway, and all three models give low confidence. GPT-4o and GPT-5 both mark bio=unclear; only GPT-4o-mini defaults to bio=yes, but with wrong precision.
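Cases like the three above can be surfaced automatically by comparing the bio verdicts across models. The following is a rough sketch, assuming the verdicts are stored as a nested dictionary keyed by image ID and then by model name; that layout, the function name, and the placeholder IDs are assumptions made for illustration.

```python
def find_disagreements(verdicts):
    """Return sorted image IDs whose bio verdicts differ across models.

    verdicts: {image_id: {model_name: "yes" | "no" | "unclear"}}
    """
    return sorted(
        image_id
        for image_id, by_model in verdicts.items()
        if len(set(by_model.values())) > 1
    )


# Placeholder records for illustration only (NOT pilot data).
toy_verdicts = {
    "img_a": {"gpt-4o-mini": "yes", "gpt-4o": "yes", "gpt-5": "yes"},
    "img_b": {"gpt-4o-mini": "yes", "gpt-4o": "no", "gpt-5": "yes"},
}
disagreements = find_disagreements(toy_verdicts)
print(disagreements)
```

The flagged IDs are the natural candidates for manual review, since they are where a single-model run would be most likely to mislead.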

Cost-Quality Tradeoff

GPT-4o-mini is 20x cheaper than GPT-4o and 45x cheaper than GPT-5. For a binary "is the label biologically correct?" check, GPT-4o-mini performs nearly identically to the larger models. For nuanced precision grading and detailed reasoning, however, GPT-5 provides the richest output. Recommendation: use GPT-4o for the full 49k-image run, since it balances cost, accuracy, and the ability to catch genuine errors (bio=no).
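The cost ratios can be turned into a quick relative estimate for the full run. The sketch below uses only the 20x and 45x ratios stated above; the absolute per-image price is not given here, so it is left as a parameter (`mini_cost_per_image`, a hypothetical name) that defaults to one "GPT-4o-mini unit."

```python
# Relative per-image cost, normalized so that GPT-4o-mini = 1 (ratios from the text).
RELATIVE_COST = {"gpt-4o-mini": 1.0, "gpt-4o": 20.0, "gpt-5": 45.0}


def run_cost(model, n_images, mini_cost_per_image=1.0):
    """Estimate total run cost in units of GPT-4o-mini's per-image cost."""
    return RELATIVE_COST[model] * n_images * mini_cost_per_image


n_images = 49_000
for model in RELATIVE_COST:
    print(model, run_cost(model, n_images))

# Implied by the two stated ratios: GPT-5 costs 45 / 20 = 2.25x as much as GPT-4o.
gpt5_vs_gpt4o = run_cost("gpt-5", n_images) / run_cost("gpt-4o", n_images)
print(gpt5_vs_gpt4o)
```

Supplying a real per-image price for GPT-4o-mini converts these unit counts into a budget figure.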

What This Means for the Pipeline

Gemini's pathway labels are reliable for biological domain identification but not for precise pathway assignment.
The "too_broad" finding aligns with the hub pathway problem observed in Jaccard scoring: both signals tend toward generic pathway categories.
Caption-based LLM validation adds real value by catching the ~4-6% of labels that are genuinely wrong or misleading.
Next step: run GPT-4o validation on the full 34k captioned images.