Automated extraction and validation of signaling pathways from scientific figures
We use Google Gemini to extract gene/protein interaction networks from published pathway diagrams. The model identifies nodes (genes, chemicals, phenotypes), edges (interactions), and pathway classifications from ~49,000 images sourced from PubMed Central.
The dashboards below show our validation work: mapping extracted pathway names to canonical databases (KEGG, WikiPathways) and measuring how well individual images match reference gene sets.
Explore how 8,349 extracted pathway names map to KEGG and WikiPathways. Compare pure Jaccard scoring with the combined scoring method.
Deep dive into the Hippo signaling pathway. Per-image gene overlap analysis against the KEGG and WikiPathways Hippo references.
Test Hippo images against ALL pathways. Validates that Hippo-labeled images actually match Hippo best, rather than hub pathways like PI3K-Akt.
Extends validation to 118 pathways. For each pathway, measures how often images match their labeled pathway vs. other pathways.
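The per-pathway check described above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the function names (`best_match`, `label_match_rates`), the alphabetical tie-breaking rule, and the toy gene sets are all assumptions made for the example, and the gene sets are not real KEGG references.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def best_match(genes, references):
    """Name of the reference pathway with the highest Jaccard overlap.
    Ties go to the alphabetically first name (an assumption)."""
    return max(sorted(references), key=lambda name: jaccard(genes, references[name]))

def label_match_rates(images, references):
    """For each label, the fraction of its images whose best match is that label."""
    totals, hits = Counter(), Counter()
    for label, genes in images:
        totals[label] += 1
        if best_match(genes, references) == label:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Toy data for illustration only.
references = {
    "Hippo": {"YAP1", "TEAD1", "LATS1", "LATS2", "MST1"},
    "PI3K-Akt": {"PIK3CA", "AKT1", "PTEN", "MTOR", "YAP1"},
}
images = [
    ("Hippo", {"YAP1", "LATS1", "MST1"}),
    ("Hippo", {"AKT1", "PTEN", "YAP1"}),   # labeled Hippo, but looks like PI3K-Akt
    ("PI3K-Akt", {"AKT1", "MTOR"}),
]
rates = label_match_rates(images, references)
```

A low rate for a label suggests either mislabeled images or a reference set that is dominated by genes shared with other pathways.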
Validates Gemini labels against original figure captions from PMC articles. 62% of images have extractable caption data; 24% have caption-confirmed labels.
Prompt engineering for LLM-based KEGG assignment from captions. 10-image audit across GPT-4o-mini, GPT-5.2, and Gemini 3 Pro. Reduced false negatives from 4/10 to 0/10.
Full dataset overview: 48.8k images, 1.4M edges, signal coverage (99.9%), agreement rates, confidence tiers, top genes, and pathway distributions across all 249 KEGG pathways.
Placebo test: the Jaccard score against each image's real best-match pathway is 22.8x higher than the score against a randomly assigned KEGG pathway. Validates that gene extraction captures meaningful pathway signal.
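One way such a placebo comparison could be computed is sketched below. The exact procedure behind the 22.8x figure is not described here, so the averaging over images, the uniform random draw, and the names `placebo_ratio` and `seed` are assumptions for illustration.

```python
import random
from statistics import mean

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def placebo_ratio(image_genes, references, seed=0):
    """Mean best-match Jaccard divided by mean Jaccard against a
    randomly drawn reference pathway (the placebo)."""
    rng = random.Random(seed)          # seeded so the result is reproducible
    names = sorted(references)
    best, placebo = [], []
    for genes in image_genes:
        scores = {name: jaccard(genes, references[name]) for name in names}
        best.append(max(scores.values()))
        placebo.append(scores[rng.choice(names)])
    placebo_mean = mean(placebo)
    if placebo_mean == 0:
        return float("inf")            # placebo never overlapped at all
    return mean(best) / placebo_mean
```

Because the best match is never worse than a random draw, the ratio is at least 1; how far it sits above 1 is what indicates real pathway signal.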
Documentation of all 4 signals (Gemini vision, Jaccard, captions, PubMed metadata), data sources, file locations, and parsing methods (LLM vs regex vs API).
Jaccard similarity measures the gene overlap between an extracted gene set A and a reference gene set B: |A ∩ B| / |A ∪ B|. Simple but effective for measuring pathway similarity.
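The overlap formula above translates directly into set operations. The sketch below is illustrative only; upper-casing gene symbols before comparison and returning 0.0 for two empty sets are assumptions, not documented behavior of the pipeline.

```python
def jaccard(extracted, reference):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene symbols."""
    a = {gene.upper() for gene in extracted}   # assumed normalization
    b = {gene.upper() for gene in reference}
    union = a | b
    if not union:
        return 0.0                             # assumed convention for empty input
    return len(a & b) / len(union)

# Two shared genes out of five distinct genes overall.
score = jaccard({"YAP1", "TEAD1", "LATS1"}, {"YAP1", "LATS1", "MST1", "SAV1"})
```

Here `score` is 2 / 5 = 0.4: YAP1 and LATS1 are shared, and the union contains five genes.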
Pure Jaccard has a hub pathway problem: large pathways such as PI3K-Akt or the cancer pathways match almost everything. We use a combined score, 0.7 × Jaccard + 0.3 × StringMatch, to balance gene overlap with pathway name similarity.
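A rough sketch of the weighted score follows. How StringMatch is computed is not specified here, so `difflib.SequenceMatcher.ratio()` on lower-cased names stands in for it as an assumption; the function names are likewise hypothetical.

```python
from difflib import SequenceMatcher

def string_match(name_a, name_b):
    """Name similarity in [0, 1]; a stand-in for StringMatch (assumption)."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def combined_score(jaccard_value, extracted_name, reference_name,
                   w_jaccard=0.7, w_name=0.3):
    """0.7 × Jaccard + 0.3 × StringMatch, with the weights given above."""
    return w_jaccard * jaccard_value + w_name * string_match(extracted_name, reference_name)

# A hub pathway with slightly higher gene overlap versus the pathway whose
# name resembles the extracted label. The Jaccard values are made up.
hippo = combined_score(0.28, "Hippo signaling", "Hippo signaling pathway")
hub = combined_score(0.30, "Hippo signaling", "PI3K-Akt signaling pathway")
```

In this toy case the name term lifts the Hippo reference above the hub pathway even though the hub has the higher raw Jaccard value, which is the behavior the combined score is meant to provide.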