Gemini Pathway Analysis

Automated extraction and validation of signaling pathways from scientific figures

Project Overview

We use Google Gemini to extract gene/protein interaction networks from published pathway diagrams. The model identifies nodes (genes, chemicals, phenotypes), edges (interactions), and pathway classifications from ~49,000 images sourced from PubMed Central.

The dashboards below show our validation work: mapping extracted pathway names to canonical databases (KEGG, WikiPathways) and measuring how well individual images match reference gene sets.

Images Processed: 49k
Unique Pathways: 8,349
KEGG References: 239
WikiPathways Refs: 984

Analysis Pipeline

Image Extraction

Gemini processes pathway images, extracting gene names, interactions, and pathway classifications

Pathway Alignment

Map extracted pathway names to canonical KEGG/WikiPathways IDs using gene overlap (Jaccard) and string matching

Per-Image Validation

Test individual images against reference gene sets to measure extraction quality

Caption Validation

Cross-reference Gemini labels with original PMC figure captions to validate pathway identification

Interactive Dashboards

Pathway Alignment Dashboard (Primary)

Explore how 8,349 extracted pathway names map to KEGG and WikiPathways. Compare pure Jaccard vs combined scoring methods.

Hippo Pathway Case Study

Deep dive into the Hippo signaling pathway: per-image gene overlap analysis against the KEGG and WikiPathways Hippo references.

Hippo Cross-Pathway Validation

Tests Hippo-labeled images against all reference pathways. Validates that these images match Hippo best, rather than a hub pathway such as PI3K-Akt.

Full Pathway Validation

Extends validation to 118 pathways. For each pathway, measures how often images match their labeled pathway vs. other pathways.
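A minimal sketch of this kind of per-pathway check, assuming each image carries a pathway label and an extracted gene set, and that "match" means the reference with the highest Jaccard score. The function name `match_rates`, the toy gene sets, and the tie-breaking rule are illustrative assumptions, not the dashboard's actual implementation.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity between two gene collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_rates(images, references):
    """Fraction of images per label whose best-scoring reference is that label.

    images: list of (labeled_pathway, extracted_gene_set) pairs
    references: dict mapping pathway name -> reference gene set
    Ties go to the first reference in dict order (an arbitrary choice here).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, genes in images:
        best = max(references, key=lambda name: jaccard(genes, references[name]))
        totals[label] += 1
        hits[label] += int(best == label)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical toy data, for illustration only
toy_references = {
    "Hippo": {"YAP1", "LATS1", "MST1"},
    "PI3K-Akt": {"AKT1", "PIK3CA", "PTEN", "MTOR"},
}
toy_images = [
    ("Hippo", {"YAP1", "LATS1"}),
    ("Hippo", {"AKT1", "PTEN"}),
    ("PI3K-Akt", {"AKT1", "MTOR"}),
]
rates = match_rates(toy_images, toy_references)
print(rates)
```

In this toy data, one of the two Hippo-labeled images scores higher against PI3K-Akt, so the Hippo rate is 0.5 while the PI3K-Akt rate is 1.0.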

PMC Caption Audit

Validates Gemini labels against original figure captions from PMC articles. 62% of images have extractable caption data; 24% have caption-confirmed labels.

Caption → KEGG Prompt Eval

Prompt engineering for LLM-based KEGG assignment from captions. 10-image audit across GPT-4o-mini, GPT-5.2, and Gemini 3 Pro. Reduced false negatives from 4/10 to 0/10.

Summary Statistics

Full dataset overview: 48.8k images, 1.4M edges, signal coverage (99.9%), agreement rates, confidence tiers, top genes, and pathway distributions across all 249 KEGG pathways.

Random Jaccard Baseline (New)

Placebo test: real best-match Jaccard is 22.8x higher than a randomly-assigned KEGG pathway. Validates that gene extraction captures meaningful pathway signal.
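The placebo comparison can be sketched roughly as follows. This assumes the baseline draws one reference pathway uniformly at random per image and compares the mean best-match Jaccard with the mean random-match Jaccard; the exact sampling and aggregation behind the 22.8x figure are not specified here, and `best_vs_random` plus the toy data are hypothetical.

```python
import random

def jaccard(a, b):
    """Jaccard similarity between two gene collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_vs_random(image_gene_sets, references, seed=0):
    """Return (mean best-match Jaccard, mean randomly-assigned Jaccard).

    Assumes at least one image and at least one reference pathway.
    """
    rng = random.Random(seed)  # fixed seed so the placebo draw is reproducible
    names = sorted(references)
    best_scores, random_scores = [], []
    for genes in image_gene_sets:
        scores = {name: jaccard(genes, references[name]) for name in names}
        best_scores.append(max(scores.values()))
        random_scores.append(scores[rng.choice(names)])
    n = len(image_gene_sets)
    return sum(best_scores) / n, sum(random_scores) / n

# Hypothetical toy data, for illustration only
toy_references = {
    "Hippo": {"YAP1", "LATS1", "MST1"},
    "PI3K-Akt": {"AKT1", "PIK3CA", "PTEN", "MTOR"},
    "Wnt": {"CTNNB1", "APC", "GSK3B"},
}
toy_images = [{"YAP1", "LATS1"}, {"AKT1", "PTEN"}, {"CTNNB1", "APC", "AXIN1"}]

mean_best, mean_random = best_vs_random(toy_images, toy_references)
ratio = mean_best / mean_random if mean_random > 0 else float("inf")
print(mean_best, mean_random, ratio)
```

Because the best match is the maximum over all references, the best-match mean can never fall below the random mean; the informative quantity is how large the gap is.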

Metadata Pipeline

Documentation of all 4 signals (Gemini vision, Jaccard, captions, PubMed metadata), data sources, file locations, and parsing methods (LLM vs regex vs API).

Methodology Notes

Jaccard Similarity

Gene overlap between extracted and reference gene sets: |A ∩ B| / |A ∪ B|. Simple but effective for measuring pathway similarity.
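As a small illustration of the formula, here is a sketch in Python. The gene sets are hypothetical examples rather than real KEGG or WikiPathways contents, and returning 0.0 for two empty sets is a convention chosen for this sketch.

```python
def jaccard(extracted, reference):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two gene collections."""
    a, b = set(extracted), set(reference)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets score 0
    return len(a & b) / len(union)

# Hypothetical gene sets, for illustration only
image_genes = {"YAP1", "TAZ", "LATS1", "MST1"}
reference_genes = {"YAP1", "LATS1", "LATS2", "MST1", "MST2", "SAV1"}

score = jaccard(image_genes, reference_genes)
print(score)  # 3 shared genes out of 7 distinct genes
```

Note how an image showing only part of a pathway is penalized by the genes it omits: the union grows with every reference gene the figure leaves out.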

Why Combined Scoring?

Pure Jaccard has a hub-pathway problem: large pathways such as PI3K-Akt or the cancer pathways match almost everything. We score each candidate with 0.7 × Jaccard + 0.3 × StringMatch to balance gene overlap against pathway-name similarity.
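The weighting can be sketched as below. The 0.7 and 0.3 weights come from the text above; the string-similarity measure is an assumption, since the StringMatch function is not specified here. This sketch substitutes the ratio from Python's standard `difflib.SequenceMatcher` on lowercased names, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Jaccard similarity between two gene collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def string_match(name_a, name_b):
    """Stand-in name similarity in [0, 1]; the real StringMatch may differ."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def combined_score(extracted_genes, reference_genes,
                   extracted_name, reference_name,
                   w_jaccard=0.7, w_string=0.3):
    """Weighted sum of gene overlap and pathway-name similarity."""
    return (w_jaccard * jaccard(extracted_genes, reference_genes)
            + w_string * string_match(extracted_name, reference_name))

# Hypothetical example: identical names, partial gene overlap
score = combined_score(
    {"YAP1", "LATS1"}, {"YAP1", "LATS1", "MST1", "SAV1"},
    "Hippo signaling pathway", "Hippo signaling pathway",
)
print(score)  # 0.7 * 0.5 + 0.3 * 1.0
```

The name term gives a correctly named pathway a head start that a hub pathway can only overcome with substantially higher gene overlap.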

Known Limitations

Images often show partial pathway views, leading to low Jaccard scores
Gene naming varies: "ERK" vs. "MAPK1/3", "AKT" vs. "AKT1/2/3"
Some real pathways missing from KEGG/WikiPathways (e.g., cGAS-STING)
KEGG and WikiPathways have different curation philosophies (broad vs. focused)
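One possible mitigation for the gene-naming limitation is to expand common aliases before computing overlap. The sketch below is an assumption about how that could look, not a description of the current pipeline; the alias table contains only the two examples listed above, and `normalize_genes` is a hypothetical helper.

```python
# Alias table built from the two examples above; a real table would need
# curation (for instance from an official gene nomenclature resource).
ALIASES = {
    "ERK": {"MAPK1", "MAPK3"},
    "AKT": {"AKT1", "AKT2", "AKT3"},
}

def normalize_genes(genes, aliases=ALIASES):
    """Uppercase gene labels and expand known family aliases to member symbols."""
    normalized = set()
    for gene in genes:
        key = gene.strip().upper()
        normalized |= aliases.get(key, {key})
    return normalized

extracted = normalize_genes({"erk", "Akt", "YAP1"})
print(sorted(extracted))
```

Expanding a family alias into several symbols enlarges the extracted set, which changes both the intersection and the union, so scores before and after normalization are not directly comparable.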