Metadata Pipeline

Data sources, parsing methods, and signal architecture for pathway image labeling

Overview

Each pathway image can draw on four independent signals for KEGG pathway assignment. These signals differ in availability, method (LLM, regex, or computation), and reliability.

Pipeline flow: 49k Images (Dropbox) → Signal 1 (Gemini Vision) → Signal 2 (Jaccard Overlap) → Signal 3 (Caption → LLM) → Signal 4 (Metadata → LLM) → KEGG ID

Gemini annotations: 48,974
Captions available: ~22k
PubMed records: 42,402
KEGG pathways: 239

Signal 1: Gemini Vision Annotations

LLM: Gemini 3 Pro via Vertex AI

The primary data source. Gemini looks at each pathway diagram image and extracts a structured annotation: pathway name, gene/protein nodes, interaction edges, and figure metadata.

gemini3_annotations/full_49k_v2_2026-01-26_clean.jsonl

48,974 images — 318 MB JSONL

Schema

{
  "figure_metadata": {
    "canonical_pathway": "Cell cycle",       // Free-form label
    "related_pathways": ["Apoptosis", "p53 signaling"],
    "pathway_description": "...",
    "filename": "PMC100005__F11.jpg"
  },
  "nodes": [
    { "label": "CDK2", "node_type": "gene", "entrez_id": 1017 }
  ],
  "edges": [
    { "source": "p53", "target": "CDK2", "interaction_type": "inhibition" }
  ]
}

Method: Async batch processing (~117 img/min, 500 concurrent). Images are fetched from Dropbox via the API because Smart Sync leaves only 0-byte placeholders on local disk.
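A minimal sketch of bounded-concurrency batch processing, assuming Python's asyncio. The annotate_image coroutine is a hypothetical stand-in for the real Gemini request, which is not shown in this document; only the limit of 500 concurrent requests comes from the text above.

```python
import asyncio

async def annotate_image(filename: str) -> dict:
    """Hypothetical stand-in for one Gemini call; replace with the real request."""
    await asyncio.sleep(0)  # placeholder for network latency
    return {"figure_metadata": {"filename": filename}, "nodes": [], "edges": []}

async def annotate_all(filenames: list[str], max_concurrent: int = 500) -> list[dict]:
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(name: str) -> dict:
        async with semaphore:
            return await annotate_image(name)

    # gather returns results in the same order as the input filenames.
    return await asyncio.gather(*(bounded(name) for name in filenames))

results = asyncio.run(annotate_all(["PMC100005__F11.jpg", "PMC4831128__F6.jpg"]))
```

A real run would also need retries and error handling per image, which this sketch leaves out.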

Key output: canonical_pathway is Gemini's free-text pathway label (e.g. "Hippo signaling", "TrkA signaling"). These are not KEGG IDs — they're 8,349 unique free-form names that need mapping.

Limitation: Free-form labels have variant spellings. "PI3K/AKT signaling", "PI3K-AKT pathway", and "Akt signaling" all refer to hsa04151 (PI3K-Akt signaling pathway).
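To illustrate the variant-spelling problem, here is a rough sketch of string normalization applied to the three labels above. The normalize_label and group_labels helpers and the suffix list are illustrative assumptions, not the method behind normalized_pathway_groups.json.

```python
import re
from collections import defaultdict

# Generic trailing words to drop; this list is an illustrative assumption.
GENERIC_SUFFIXES = ("signaling pathway", "signalling pathway",
                    "signaling", "signalling", "pathway")

def normalize_label(label: str) -> str:
    text = label.lower().strip()
    text = re.sub(r"[/\-_]+", " ", text)      # unify separators
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    for suffix in GENERIC_SUFFIXES:
        if text.endswith(" " + suffix):
            text = text[: -len(suffix)].strip()
            break
    return text

def group_labels(labels):
    groups = defaultdict(list)
    for label in labels:
        groups[normalize_label(label)].append(label)
    return dict(groups)

groups = group_labels(["PI3K/AKT signaling", "PI3K-AKT pathway", "Akt signaling"])
# groups == {"pi3k akt": ["PI3K/AKT signaling", "PI3K-AKT pathway"],
#            "akt": ["Akt signaling"]}
```

The first two labels merge, but "Akt signaling" ends up in its own group. String cleanup alone does not resolve every variant, so some curated or model-based mapping is still needed.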

Signal 2: Jaccard Gene Overlap

Computation: set intersection on Entrez gene IDs

For each image, compare its extracted gene set (from Gemini nodes) against every KEGG pathway's reference gene set.

data/image_pathway_scores.csv

48,779 images scored — 6.8 MB CSV

How it works

Jaccard similarity: |image_genes ∩ pathway_genes| / |image_genes ∪ pathway_genes|

All 239 KEGG pathways are ranked per image. Top 2 matches stored.
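The scoring and top-2 ranking can be sketched as follows. The pathway names and gene IDs are toy placeholders, not real KEGG membership, and top_matches is a hypothetical helper name.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_matches(image_genes: set[int], pathways: dict[str, set[int]], k: int = 2):
    # Score the image against every reference pathway, then keep the best k.
    scored = [(pid, jaccard(image_genes, genes)) for pid, genes in pathways.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy reference sets; the IDs are placeholders, not real Entrez/KEGG data.
pathways = {
    "pathway_A": {1, 2, 3, 4},
    "pathway_B": {3, 4, 5, 6, 7, 8},
    "pathway_C": {9, 10},
}
image_genes = {1, 2, 3}
ranked = top_matches(image_genes, pathways)
# ranked == [("pathway_A", 0.75), ("pathway_B", 0.125)]
```

With three image genes, a single shared gene already moves the score noticeably, which is the sparsity effect described under the known issues.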

Reference data

File	Records	Description
`kegg_human_pathways.json`	239	KEGG pathway names + Entrez gene sets (10-1000+ genes each)
`wikipathways_human.gmt`	984	WikiPathways gene sets (GMT format, alternative reference)
`ncbi_gene_cache.json`	9,500	Gene symbol → Entrez ID cache (avoids repeated NCBI lookups)

Known issues

Hub pathway problem: Large pathways (PI3K-AKT = 354 genes, Cancer pathways = 500+) match almost everything because they share genes with many smaller pathways.
Gene sparsity: ~50% of images have <10 Entrez genes. With so few genes, Jaccard scores become near-random.
Gene family ambiguity: "ERK" maps to MAPK1 and MAPK3. "AKT" maps to AKT1/2/3. Gemini sometimes reports the family name rather than a specific member, so the label cannot be resolved to a single Entrez ID.
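One possible way to handle family names is to expand them to their member symbols before the Entrez lookup. This is a sketch only: the mapping holds just the two families named above, and expand_symbols is a hypothetical helper, not part of the existing pipeline.

```python
# Family name -> member gene symbols, taken from the examples in the text above.
GENE_FAMILIES = {
    "ERK": ["MAPK1", "MAPK3"],
    "AKT": ["AKT1", "AKT2", "AKT3"],
}

def expand_symbols(labels: list[str]) -> list[str]:
    """Replace known family names with their members; keep other labels as-is."""
    expanded: list[str] = []
    for label in labels:
        for symbol in GENE_FAMILIES.get(label.upper(), [label]):
            if symbol not in expanded:  # de-duplicate while preserving order
                expanded.append(symbol)
    return expanded

symbols = expand_symbols(["ERK", "CDK2", "Akt"])
# symbols == ["MAPK1", "MAPK3", "CDK2", "AKT1", "AKT2", "AKT3"]
```

Expansion enlarges the image gene set, which changes the Jaccard denominator, so it trades one source of noise for another and would need checking against the validation data.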

Signal 3: PMC Figure Captions

API extraction + regex matching (broken) → LLM replacement (in progress)

data/pmc_caption_pathways_full.json

~34k entries — 42 MB JSON

Step 1: Caption extraction (API)

Script: scripts/pmc_caption_extractor_full.py

For each PMC ID, fetch the full article XML via NCBI EFetch API, then parse <fig> elements to extract figure captions. Rate limited to 3-10 req/sec.
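A minimal sketch of the caption-parsing step using only the standard library. It is not the code in pmc_caption_extractor_full.py: the function names are hypothetical, the sample XML is made up for illustration, and passing the numeric part of the PMC ID with db=pmc is an assumption about how the request is built.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmc_id: str) -> str:
    # Assumption: the numeric part of the PMC ID is sent as `id` with db=pmc.
    params = {"db": "pmc", "id": pmc_id.removeprefix("PMC"), "retmode": "xml"}
    return EFETCH_URL + "?" + urlencode(params)

def extract_captions(article_xml: str, max_chars: int = 1000) -> dict[str, str]:
    """Map each <fig> id to its caption text, truncated to max_chars."""
    root = ET.fromstring(article_xml)
    captions = {}
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        if caption is None:
            continue  # figure without a caption element
        # Join text across inline markup and collapse whitespace.
        text = " ".join("".join(caption.itertext()).split())
        captions[fig.get("id", "")] = text[:max_chars]
    return captions

# Made-up sample in the shape of PMC article XML.
sample = """<article><body>
<fig id="F6"><label>Figure 6</label>
<caption><p>Bile acid <italic>signaling</italic> in hepatocytes.</p></caption></fig>
<fig id="F7"><label>Figure 7</label></fig>
</body></article>"""
captions = extract_captions(sample)
# captions == {"F6": "Bile acid signaling in hepatocytes."}
```

A real fetch loop would download each URL and pause between requests to stay within the 3-10 req/sec limit.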

Caption availability: 64% with caption, 30% no caption, 6% restricted

Captions are truncated at 1000 characters during extraction. ~36.5% of captions hit this limit.

Step 2a: Regex matching (original, deprecated, broken)

The original extractor ran 11 regex patterns over each caption to find pathway mentions, then used Python's SequenceMatcher (similarity ≥ 0.4) to match against KEGG names.

Problem: Regex picks up junk. "this signaling" matches "Thiamine metabolism". "general cell biology" triggers pathway patterns. These regex-based matches in kegg_match and caption_pathways fields are unreliable and being replaced.

// Example of broken regex match
{
  "caption_pathways": ["this signaling pathway"],
  "kegg_match": {
    "kegg_id": "hsa00730",
    "kegg_name": "Thiamine metabolism",  // Wrong!
    "score": 0.42
  }
}
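The fuzzy-matching step can be reproduced in a few lines to see why a 0.4 threshold is permissive. This sketch assumes the matcher compares lowercased strings, which the source does not confirm; best_fuzzy_match is a hypothetical name and the three KEGG names are a small sample of the 239.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(phrase: str, kegg_names: list[str], threshold: float = 0.4):
    """Return (name, score) for the closest name, or None if below the threshold."""
    best_name, best_score = None, 0.0
    for name in kegg_names:
        score = SequenceMatcher(None, phrase.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else None

kegg_names = ["Thiamine metabolism", "Cell cycle", "Apoptosis"]
match = best_fuzzy_match("this signaling pathway", kegg_names)
exact = best_fuzzy_match("cell cycle", kegg_names)
# exact == ("Cell cycle", 1.0)
```

SequenceMatcher scores shared character runs, not meaning, so any phrase with enough overlapping letters can clear a low threshold regardless of biology.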

Step 2b: LLM assignment (current, validated; LLM)

Script: scripts/assign_kegg_from_caption.py

Sends the stored caption (up to 1000 characters) together with the list of 239 KEGG pathways to an LLM. The model uses biological reasoning (not keyword matching) to identify the best KEGG pathway.
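The validated prompt is not reproduced in this document. As a rough illustration of the input shape only, a prompt could be assembled along these lines; build_caption_prompt and its wording are hypothetical, and the two-entry pathway dict stands in for the full 239-entry list.

```python
def build_caption_prompt(caption: str, pathways: dict[str, str]) -> str:
    """Combine a caption with the KEGG pathway list (hypothetical wording)."""
    options = "\n".join(
        f"{kegg_id}: {name}" for kegg_id, name in sorted(pathways.items())
    )
    return (
        "Choose the single KEGG pathway that best matches this figure caption.\n"
        "Answer as JSON with keys kegg_id, kegg_name, confidence, reasoning.\n\n"
        f"Caption:\n{caption}\n\n"
        f"KEGG pathways:\n{options}\n"
    )

# Two-entry stand-in for the 239-pathway reference list.
pathways = {"hsa04110": "Cell cycle", "hsa00120": "Primary bile acid biosynthesis"}
prompt = build_caption_prompt("The figure shows bile acid...", pathways)
```

Listing the IDs next to the names lets the model return an ID that can be checked against the reference list afterwards.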

Prompt validated on 10-image audit — see prompt eval report.

Schema (per caption entry)

{
  "pmc_id": "PMC4831128",
  "filename": "PMC4831128__F6.jpg",
  "gemini_pathway": "Lipid metabolism regulation",
  "status": "ok",
  "caption": "The figure shows bile acid...",  // max 1000 chars
  "caption_pathways": [...],         // regex matches (deprecated)
  "kegg_match": {...}               // SequenceMatcher (deprecated)
}

LLM assignment schema (output)

// data/kegg_assignments/*.json
{
  "filename": "PMC4831128__F6.jpg",
  "mode": "caption",
  "gemini_pathway": "Lipid metabolism regulation",
  "kegg_id": "hsa00120",
  "kegg_name": "Primary bile acid biosynthesis",
  "confidence": 0.9,
  "reasoning": "Caption mentions cholesterol and bile acid genes"
}
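Because the output comes from an LLM, each record is worth checking before it is used. A sketch of such a check, assuming confidence lies in [0, 1] and that mode is either "caption" or "metadata" (both inferred from the example and the script flag, not confirmed); validate_assignment is a hypothetical helper.

```python
REQUIRED_FIELDS = ("filename", "mode", "kegg_id", "kegg_name", "confidence", "reasoning")

def validate_assignment(record: dict, known_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in record]
    # Assumption: the only modes are "caption" and "metadata".
    if record.get("mode") not in ("caption", "metadata"):
        problems.append("mode must be 'caption' or 'metadata'")
    if record.get("kegg_id") not in known_ids:
        problems.append("kegg_id not in reference list")
    conf = record.get("confidence")
    # Assumption: confidence is a number between 0 and 1.
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    return problems

record = {
    "filename": "PMC4831128__F6.jpg",
    "mode": "caption",
    "gemini_pathway": "Lipid metabolism regulation",
    "kegg_id": "hsa00120",
    "kegg_name": "Primary bile acid biosynthesis",
    "confidence": 0.9,
    "reasoning": "Caption mentions cholesterol and bile acid genes",
}
problems = validate_assignment(record, known_ids={"hsa00120"})
# problems == []
```

The kegg_id check catches IDs that the model invented or mistyped, which string output alone would not reveal.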

Signal 4: PubMed Metadata

API extraction + LLM assignment

For images without captions (~30% of the dataset), we fall back to paper-level metadata: title, abstract, MeSH terms, and author keywords.

data/pubmed_metadata_full.json

42,402 records — 131 MB JSON

Extraction pipeline (API)

Script: scripts/fetch_pubmed_metadata.py (519 LOC, stdlib only)

PMC → PMID: NCBI ID Converter API (batch=200)
PMID → PubMed XML: NCBI EFetch API
XML → structured fields: Python xml.etree.ElementTree parsing

All parsing is deterministic XML extraction — no LLM or regex at this stage.
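The XML-to-fields step can be sketched with xml.etree.ElementTree as follows. This is not the code in fetch_pubmed_metadata.py: parse_pubmed_article is a hypothetical name and the sample record is made up. The Owner attribute is copied as-is, whereas the schema in this document shows "Author", so the actual script may normalize that value.

```python
import xml.etree.ElementTree as ET

def parse_pubmed_article(article: ET.Element) -> dict:
    """Extract a subset of the schema fields from one <PubmedArticle> element."""
    title_el = article.find(".//ArticleTitle")
    abstract_parts = article.findall(".//Abstract/AbstractText")
    mesh = []
    for heading in article.findall(".//MeshHeading"):
        mesh.append({
            "descriptor": heading.findtext("DescriptorName", default=""),
            "qualifiers": [q.text or "" for q in heading.findall("QualifierName")],
        })
    keywords = []
    for kw_list in article.findall(".//KeywordList"):
        owner = kw_list.get("Owner", "")  # copied verbatim from the XML
        for kw in kw_list.findall("Keyword"):
            keywords.append({"term": "".join(kw.itertext()).strip(), "owner": owner})
    return {
        "pmid": article.findtext(".//MedlineCitation/PMID", default=""),
        "title": "".join(title_el.itertext()).strip() if title_el is not None else "",
        "abstract": " ".join("".join(p.itertext()).strip() for p in abstract_parts),
        # Structured abstracts carry a Label attribute on each section.
        "abstract_structured": any(p.get("Label") for p in abstract_parts),
        "mesh_headings": mesh,
        "keywords": keywords,
    }

# Made-up sample in the shape of PubMed EFetch XML.
sample = """<PubmedArticle><MedlineCitation><PMID>27148032</PMID>
<Article><ArticleTitle>Bile acid signaling in liver...</ArticleTitle>
<Abstract><AbstractText>Example abstract.</AbstractText></Abstract></Article>
<MeshHeadingList><MeshHeading><DescriptorName>Signal Transduction</DescriptorName>
<QualifierName>genetics</QualifierName></MeshHeading></MeshHeadingList>
<KeywordList Owner="NOTNLM"><Keyword>bile acid</Keyword></KeywordList>
</MedlineCitation></PubmedArticle>"""
parsed = parse_pubmed_article(ET.fromstring(sample))
```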

Field coverage (200-article pilot)

Field	Coverage	Notes
Title	98.5%	Almost always available
Abstract	98.5%	Some structured (BACKGROUND/METHODS/...)
MeSH headings	70.6%	NLM-curated subject terms + qualifiers
Author keywords	49.5%	Author-provided, less standardized
Publication types	100%	"Journal Article", "Review", etc.

Key finding: For restricted images (no caption available), 92% still get MeSH terms. This makes metadata the primary fallback signal.

Schema

{
  "pmc_id": "PMC4831128",
  "pmid": "27148032",
  "title": "Bile acid signaling in liver...",
  "abstract": "...",                       // truncated to 2000 chars in prompt
  "abstract_structured": false,
  "mesh_headings": [
    { "descriptor": "Signal Transduction", "qualifiers": ["genetics"] }
  ],
  "keywords": [
    { "term": "bile acid", "owner": "Author" }
  ],
  "is_retracted": false,
  "year": "2016"
}

LLM assignment (metadata mode; LLM)

Script: scripts/assign_kegg_from_caption.py --mode metadata

Same script, different prompt. Sends title + abstract (2000 chars) + MeSH terms + keywords + KEGG list to an LLM. The model identifies the most prominent KEGG pathway discussed in the paper.
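As an illustration of how the metadata fields might be flattened into a prompt, including the 2000-character abstract cut, consider the sketch below. The layout and the build_metadata_prompt name are hypothetical; the actual metadata prompt is not shown in this document.

```python
def build_metadata_prompt(record: dict, pathways: dict[str, str],
                          max_abstract: int = 2000) -> str:
    """Flatten one PubMed record plus the KEGG list into prompt text (hypothetical layout)."""
    mesh = "; ".join(
        h["descriptor"] + (" (" + ", ".join(h["qualifiers"]) + ")" if h["qualifiers"] else "")
        for h in record.get("mesh_headings", [])
    )
    keywords = "; ".join(k["term"] for k in record.get("keywords", []))
    options = "\n".join(f"{kid}: {name}" for kid, name in sorted(pathways.items()))
    return (
        f"Title: {record.get('title', '')}\n"
        f"Abstract: {record.get('abstract', '')[:max_abstract]}\n"  # truncate long abstracts
        f"MeSH: {mesh}\n"
        f"Keywords: {keywords}\n\n"
        f"KEGG pathways:\n{options}\n"
    )

record = {
    "title": "Bile acid signaling in liver...",
    "abstract": "x" * 3000,  # placeholder text longer than the limit
    "mesh_headings": [{"descriptor": "Signal Transduction", "qualifiers": ["genetics"]}],
    "keywords": [{"term": "bile acid", "owner": "Author"}],
}
prompt = build_metadata_prompt(record, {"hsa00120": "Primary bile acid biosynthesis"})
```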

Important caveat: This is paper-level metadata, not figure-level. A paper may have 5 figures showing different pathways, but we only get one paper-level assignment. Confidence is inherently lower than caption-based assignment.

Supporting Data & Caches

Caches & ID mappings

File	Records	Purpose
`pmc_to_pmid_map.json`	42.4k	PMC ↔ PMID + DOI mapping
`pmc_fetch_status_cache.json`	varies	Avoids re-fetching restricted/errored PMCs
`ncbi_gene_cache.json`	9,500	Gene symbol → Entrez ID (avoids NCBI API calls)
`pmc_publication_years.json`	42.4k	PMC ID → publication year
`pubmed_fetch_checkpoint.json`	—	Resume point for interrupted PubMed fetches
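A resume checkpoint of this kind can be as simple as a JSON file of finished IDs. The sketch below assumes a {"done": [...]} layout, which is hypothetical; the real format of pubmed_fetch_checkpoint.json is not documented here.

```python
import json
import os
import tempfile

def load_checkpoint(path: str) -> set[str]:
    """Return the set of already-processed IDs, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh)["done"])  # hypothetical layout: {"done": [...]}

def save_checkpoint(path: str, done: set[str]) -> None:
    # Write to a temp file and rename, so an interrupted write
    # cannot leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"done": sorted(done)}, fh)
    os.replace(tmp, path)

def pending_ids(all_ids: list[str], done: set[str]) -> list[str]:
    return [i for i in all_ids if i not in done]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    save_checkpoint(path, {"PMC1"})
    restored = load_checkpoint(path)
    todo = pending_ids(["PMC1", "PMC2"], restored)
# restored == {"PMC1"}; todo == ["PMC2"]
```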

Validation & alignment data

File	Description
`pathway_alignment.json`	Aggregated pathway-level stats (12 MB)
`full_pathway_validation.json`	118-pathway Jaccard validation results
`normalized_pathway_groups.json`	Groups variant Gemini labels into canonical names (2 MB)
`hippo_image_jaccard.json`	Hippo pathway deep-dive validation
`pmc_audit_showcase.json`	Curated caption audit examples

LLM assignment output files

File	Description
`kegg_assignments/eval_200_results.json`	200-image stratified eval (v1 prompt, GPT-4o-mini)
`kegg_assignments/eval_sample_200.json`	The 200 test images with metadata
`kegg_assignments/test_set_both_gpt-4o-mini.json`	Caption + metadata mode comparison
`kegg_assignments/prompt_v2_test_10.json`	v2 prompt test (GPT-4o-mini)
`kegg_assignments/prompt_v2_gpt52_test_10.json`	v2 prompt test (GPT-5.2)
`kegg_assignments/prompt_v2_gemini3_test_10.json`	v2 prompt test (Gemini 3 Pro)

Parsing Methods at a Glance

Step	Method	Tool	Status
Image → pathway label + genes	LLM	Gemini 3 Pro (Vertex AI)	Complete (49k)
PMC ID → figure caption text	API	NCBI EFetch + XML parsing	Complete (34k)
Caption → pathway phrase (old)	Regex	11 patterns + SequenceMatcher	Deprecated
Caption → KEGG ID (new)	LLM	GPT-4o-mini / GPT-5.2 / Gemini 3	Validated (10-img), pending full run
PMC ID → PMID → PubMed XML	API	NCBI ID Converter + EFetch	Complete (42.4k)
PubMed XML → title/abstract/MeSH	XML Parse	Python ElementTree (stdlib)	Complete (42.4k)
Metadata → KEGG ID	LLM	GPT-4o-mini (metadata mode)	Pending (~14.7k captionless)
Gene set → KEGG score	Computation	Jaccard similarity (Python)	Complete (48.8k)
Gene symbol → Entrez ID	API + cache	NCBI Gene API + local cache	Complete (9.5k cached)

Pipeline Status

Step	Status	Cost
Full 34k caption LLM assignment	Ready to run (v2 prompt validated)	~$20 (mini) / ~$340 (5.2)
14.7k metadata LLM assignment	Ready to run (metadata prompt exists)	~$10 (mini)
Multi-signal consolidation	Pending — needs decision on weighting	—
Caption re-extraction at 2000 chars	Optional improvement, low priority	API time only
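The consolidation step is still undecided. Purely to show the shape a weighted vote could take, the sketch below sums weight × confidence per KEGG ID. The signal names, the equal placeholder weights, and the consolidate helper are all assumptions, not a proposed weighting.

```python
from collections import defaultdict

# Placeholder weights for illustration only; no weighting has been decided.
SIGNAL_WEIGHTS = {"caption": 1.0, "metadata": 1.0, "jaccard": 1.0, "gemini": 1.0}

def consolidate(votes, weights=SIGNAL_WEIGHTS):
    """votes: iterable of (signal, kegg_id, confidence). Returns (kegg_id, total) or None."""
    totals = defaultdict(float)
    for signal, kegg_id, confidence in votes:
        totals[kegg_id] += weights.get(signal, 0.0) * confidence
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])

votes = [
    ("caption", "hsa00120", 0.9),
    ("metadata", "hsa00120", 0.5),
    ("jaccard", "hsa04151", 0.2),
]
winner = consolidate(votes)
# hsa00120 wins here because two signals agree on it.
```

Any real scheme would have to account for the fact that the Jaccard score and the LLM confidences are not on a comparable scale.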