Metadata Pipeline

Data sources, parsing methods, and signal architecture for pathway image labeling

Overview

Each pathway image can draw on up to four independent signals for KEGG pathway assignment. The signals differ in availability, method (LLM, regex, or computation), and reliability.

Flow: 49k images (Dropbox) → four signals → KEGG ID
Signal 1: Gemini Vision
Signal 2: Jaccard Overlap
Signal 3: Caption → LLM
Signal 4: Metadata → LLM

Counts:
48,974 Gemini annotations
~22k captions available
42,402 PubMed records
239 KEGG pathways

Signal 1: Gemini Vision Annotations

LLM: Gemini 3 Pro via Vertex AI

The primary data source. Gemini looks at each pathway diagram image and extracts a structured annotation: pathway name, gene/protein nodes, interaction edges, and figure metadata.

gemini3_annotations/full_49k_v2_2026-01-26_clean.jsonl

48,974 images — 318 MB JSONL

Schema

{
  "figure_metadata": {
    "canonical_pathway": "Cell cycle",       // Free-form label
    "related_pathways": ["Apoptosis", "p53 signaling"],
    "pathway_description": "...",
    "filename": "PMC100005__F11.jpg"
  },
  "nodes": [
    { "label": "CDK2", "node_type": "gene", "entrez_id": 1017 }
  ],
  "edges": [
    { "source": "p53", "target": "CDK2", "interaction_type": "inhibition" }
  ]
}
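
As a rough illustration of how a record with this schema might be consumed downstream, the sketch below parses one JSONL line and pulls out the pathway label and the set of Entrez IDs. The function name `parse_annotation` and the returned dictionary layout are assumptions made for this example, not part of the pipeline's code.

```python
import json


def parse_annotation(line: str) -> dict:
    """Parse one JSONL line into the fields used by later signals."""
    record = json.loads(line)
    meta = record.get("figure_metadata", {})
    # Keep only nodes that carry an Entrez ID; a bare label cannot be
    # compared against KEGG reference gene sets.
    entrez_ids = {
        int(node["entrez_id"])
        for node in record.get("nodes", [])
        if node.get("entrez_id") is not None
    }
    return {
        "filename": meta.get("filename"),
        "canonical_pathway": meta.get("canonical_pathway"),
        "entrez_ids": entrez_ids,
    }


# One record shaped like the schema above (comments removed so it is valid JSON).
sample_line = json.dumps({
    "figure_metadata": {
        "canonical_pathway": "Cell cycle",
        "related_pathways": ["Apoptosis", "p53 signaling"],
        "pathway_description": "...",
        "filename": "PMC100005__F11.jpg",
    },
    "nodes": [
        {"label": "CDK2", "node_type": "gene", "entrez_id": 1017},
        {"label": "unknown complex", "node_type": "complex", "entrez_id": None},
    ],
    "edges": [],
})
parsed = parse_annotation(sample_line)
```

The second node in the sample is hypothetical and only shows that nodes without an Entrez ID are skipped.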

Method: Async batch processing (~117 img/min, 500 concurrent). Images fetched from Dropbox via API (Smart Sync = 0-byte placeholders locally).
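
The bounded-concurrency pattern described here can be sketched with `asyncio.Semaphore`. This is a minimal sketch, not the pipeline's actual code: `annotate_image` is a hypothetical placeholder for the Dropbox fetch plus the Gemini call, and only the limit of 500 comes from the text above.

```python
import asyncio

MAX_CONCURRENT = 500  # concurrency figure quoted above


async def annotate_image(filename: str) -> dict:
    """Hypothetical placeholder for 'fetch from Dropbox, then call Gemini'."""
    await asyncio.sleep(0)  # stands in for network I/O
    return {"filename": filename, "status": "ok"}


async def annotate_all(filenames: list[str], limit: int = MAX_CONCURRENT) -> list[dict]:
    """Annotate every image with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(name: str) -> dict:
        async with semaphore:
            return await annotate_image(name)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(name) for name in filenames))


results = asyncio.run(
    annotate_all(["PMC100005__F11.jpg", "PMC4831128__F6.jpg"], limit=2)
)
```

A real run would also need retries and checkpointing, which are left out here.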

Key output: canonical_pathway is Gemini's free-text pathway label (e.g. "Hippo signaling", "TrkA signaling"). These are not KEGG IDs — they're 8,349 unique free-form names that need mapping.

Limitation: Free-form labels have variant spellings. "PI3K/AKT signaling", "PI3K-AKT pathway", and "Akt signaling" all refer to hsa04151 (PI3K-Akt signaling pathway).
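
To show why simple string cleanup only partly solves this, the sketch below builds a grouping key by lowercasing, unifying separators, and dropping generic words. The function `label_key` and its word list are assumptions for illustration; they are not how normalized_pathway_groups.json was produced.

```python
import re


def label_key(label: str) -> str:
    """Collapse case, separators, and generic suffix words into a grouping key."""
    text = label.lower()
    text = re.sub(r"[/\-_]+", " ", text)  # treat "/", "-" and "_" as spaces
    text = re.sub(r"\b(signaling|signalling|pathway)\b", " ", text)
    return " ".join(text.split())  # squeeze repeated whitespace


labels = ["PI3K/AKT signaling", "PI3K-AKT pathway", "Akt signaling"]
keys = [label_key(label) for label in labels]
# keys == ["pi3k akt", "pi3k akt", "akt"]
```

The first two variants collapse to the same key, but "Akt signaling" does not, so purely lexical normalization still leaves labels that need a curated mapping or a model to resolve.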

Signal 2: Jaccard Gene Overlap

Computation: set intersection on Entrez gene IDs

For each image, compare its extracted gene set (from Gemini nodes) against every KEGG pathway's reference gene set.

data/image_pathway_scores.csv

48,779 images scored — 6.8 MB CSV

How it works

Jaccard similarity: |image_genes ∩ pathway_genes| / |image_genes ∪ pathway_genes|

All 239 KEGG pathways are ranked per image. Top 2 matches stored.
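
A minimal sketch of the scoring and top-2 ranking follows. The gene sets are toy values chosen for easy arithmetic and do not reflect real KEGG membership; the function names and the tie-breaking rule (by pathway ID) are assumptions.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_matches(image_genes: set[int], pathways: dict[str, set[int]], k: int = 2):
    """Score every pathway against the image and return the k best (id, score) pairs."""
    scored = [(pid, jaccard(image_genes, genes)) for pid, genes in pathways.items()]
    # Highest score first; ties broken by pathway ID for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Toy reference sets, NOT real KEGG gene lists.
toy_pathways = {
    "hsa04110": {1017, 1019, 7157, 595},
    "hsa04115": {7157, 1026, 581},
    "hsa04151": {207, 5290, 5728},
}
image_genes = {1017, 7157, 595}
ranked = top_matches(image_genes, toy_pathways)
# ranked == [("hsa04110", 0.75), ("hsa04115", 0.2)]
```

Because the union sits in the denominator, a small image gene set compared against a pathway with hundreds of genes yields a low score even when every image gene belongs to that pathway.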

Reference data

File | Records | Description
kegg_human_pathways.json | 239 | KEGG pathway names + Entrez gene sets (10-1000+ genes each)
wikipathways_human.gmt | 984 | WikiPathways gene sets (GMT format, alternative reference)
ncbi_gene_cache.json | 9,500 | Gene symbol → Entrez ID cache (avoids repeated NCBI lookups)

Known issues

Signal 3: PMC Figure Captions

API extraction + regex matching (broken) → LLM replacement in progress

data/pmc_caption_pathways_full.json

~34k entries — 42 MB JSON

Step 1: Caption extraction (API)

Script: scripts/pmc_caption_extractor_full.py

For each PMC ID, fetch the full article XML via NCBI EFetch API, then parse <fig> elements to extract figure captions. Rate limited to 3-10 req/sec.
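
The parsing half of this step might look like the sketch below, which reads `<fig>` elements from article XML with the standard library. It is an assumption-laden illustration rather than the contents of pmc_caption_extractor_full.py: the network fetch and rate limiting are omitted, the sample XML is invented, and the 1000-character cap mirrors the extraction limit used in this pipeline.

```python
import xml.etree.ElementTree as ET

MAX_CAPTION_CHARS = 1000  # extraction limit used in this pipeline


def extract_captions(article_xml: str, limit: int = MAX_CAPTION_CHARS) -> dict[str, str]:
    """Map each figure id to its caption text, truncated to `limit` characters."""
    root = ET.fromstring(article_xml)
    captions = {}
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        if caption is None:
            continue  # figure without a caption element
        # itertext() flattens inline markup such as <italic>; split/join squeezes whitespace.
        text = " ".join("".join(caption.itertext()).split())
        captions[fig.get("id", "")] = text[:limit]
    return captions


# Invented sample resembling PMC article XML.
sample_xml = """<article><body>
<fig id="F1"><label>Figure 1</label>
  <caption><p>Hippo signaling <italic>in vivo</italic>.</p></caption></fig>
<fig id="F2"><label>Figure 2</label></fig>
</body></article>"""
captions = extract_captions(sample_xml)
# captions == {"F1": "Hippo signaling in vivo."}
```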

Caption availability

64% with caption
30% no caption
6% restricted

Captions are truncated at 1000 characters during extraction. ~36.5% of captions hit this limit.

Step 2a: Regex matching (original, deprecated; broken)

The original extractor ran 11 regex patterns over each caption to find pathway mentions, then used Python's SequenceMatcher (similarity ≥ 0.4) to match against KEGG names.

Problem: The regex patterns pick up junk. The phrase "this signaling" gets matched to "Thiamine metabolism", and generic text such as "general cell biology" triggers the pathway patterns. The regex-based values in the kegg_match and caption_pathways fields are therefore unreliable and are being replaced.

// Example of broken regex match
{
  "caption_pathways": ["this signaling pathway"],
  "kegg_match": {
    "kegg_id": "hsa00730",
    "kegg_name": "Thiamine metabolism",  // Wrong!
    "score": 0.42
  }
}
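
For context on why this approach is fragile, here is a minimal sketch of thresholded fuzzy matching with `difflib.SequenceMatcher`. The helper `best_kegg_match` is an assumption for illustration and is not the deprecated extractor itself; only the 0.4 threshold comes from the description above.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.4  # similarity cutoff quoted above


def best_kegg_match(phrase: str, kegg_names: list[str], threshold: float = THRESHOLD):
    """Return (name, score) for the most similar KEGG name, or None below the cutoff."""
    scored = [
        (SequenceMatcher(None, phrase.lower(), name.lower()).ratio(), name)
        for name in kegg_names
    ]
    score, name = max(scored)
    return (name, score) if score >= threshold else None


kegg_names = ["Hippo signaling pathway", "Thiamine metabolism"]
good = best_kegg_match("hippo signaling", kegg_names)
none = best_kegg_match("zzz", kegg_names)
```

The ratio measures shared character runs only, so with a cutoff as low as 0.4 any phrase that happens to share enough letters with a pathway name can clear the bar, regardless of biological meaning.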

Step 2b: LLM assignment (current, validated)

Script: scripts/assign_kegg_from_caption.py

Sends the stored caption (up to 1000 characters) and the list of 239 KEGG pathways to an LLM. The model uses biological reasoning (not keyword matching) to identify the best KEGG pathway.

Prompt validated on a 10-image audit; see the prompt eval report.
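
To make the shape of the request concrete, the sketch below assembles a caption-mode prompt from a caption and a KEGG ID-to-name mapping. The wording is hypothetical and is not the validated v2 prompt; `build_caption_prompt` is a name invented for this example.

```python
def build_caption_prompt(caption: str, kegg_pathways: dict[str, str]) -> str:
    """Assemble a caption-mode prompt (illustrative wording only)."""
    pathway_lines = "\n".join(
        f"{kegg_id}: {name}" for kegg_id, name in sorted(kegg_pathways.items())
    )
    return (
        "Assign the figure described below to the single best KEGG pathway.\n"
        "Use biological reasoning, not keyword matching.\n\n"
        f"Figure caption:\n{caption}\n\n"
        f"Candidate KEGG pathways:\n{pathway_lines}\n\n"
        "Reply with a JSON object containing kegg_id, kegg_name, "
        "confidence (0 to 1), and reasoning."
    )


prompt = build_caption_prompt(
    "The figure shows bile acid...",
    {"hsa00120": "Primary bile acid biosynthesis", "hsa00730": "Thiamine metabolism"},
)
```

Listing the candidates as `id: name` pairs lets the reply be checked against a closed set of IDs.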

Schema (per caption entry)

{
  "pmc_id": "PMC4831128",
  "filename": "PMC4831128__F6.jpg",
  "gemini_pathway": "Lipid metabolism regulation",
  "status": "ok",
  "caption": "The figure shows bile acid...",  // max 1000 chars
  "caption_pathways": [...],         // regex matches (deprecated)
  "kegg_match": {...}               // SequenceMatcher (deprecated)
}

LLM assignment schema (output)

// data/kegg_assignments/*.json
{
  "filename": "PMC4831128__F6.jpg",
  "mode": "caption",
  "gemini_pathway": "Lipid metabolism regulation",
  "kegg_id": "hsa00120",
  "kegg_name": "Primary bile acid biosynthesis",
  "confidence": 0.9,
  "reasoning": "Caption mentions cholesterol and bile acid genes"
}
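
Since the model returns free text, it can help to validate each reply before writing it to data/kegg_assignments/. The sketch below is one possible check, assuming the reply is a JSON object with the fields shown above; `parse_assignment` and its rejection rules are assumptions, not the script's actual behavior.

```python
import json
from typing import Optional


def parse_assignment(raw: str, valid_ids: set[str]) -> Optional[dict]:
    """Return the parsed assignment, or None if it is malformed or out of vocabulary."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Reject IDs that are not in the supplied KEGG list (e.g. hallucinated ones).
    if data.get("kegg_id") not in valid_ids:
        return None
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        return None
    return data


valid_ids = {"hsa00120", "hsa00730"}
ok = parse_assignment(
    '{"kegg_id": "hsa00120", "kegg_name": "Primary bile acid biosynthesis", '
    '"confidence": 0.9, "reasoning": "Caption mentions cholesterol and bile acid genes"}',
    valid_ids,
)
bad_id = parse_assignment('{"kegg_id": "hsa99999", "confidence": 0.9}', valid_ids)
not_json = parse_assignment("I think it is bile acid biosynthesis.", valid_ids)
```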

Signal 4: PubMed Metadata

API extraction + LLM assignment

For images without captions (~30% of the dataset), we fall back to paper-level metadata: title, abstract, MeSH terms, and author keywords.

data/pubmed_metadata_full.json

42,402 records — 131 MB JSON

Extraction pipeline (API)

Script: scripts/fetch_pubmed_metadata.py (519 LOC, stdlib only)

  1. PMC → PMID: NCBI ID Converter API (batch=200)
  2. PMID → PubMed XML: NCBI EFetch API
  3. XML → structured fields: Python xml.etree.ElementTree parsing

All parsing is deterministic XML extraction — no LLM or regex at this stage.
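
The batching in step 1 and the XML parsing in step 3 can be sketched with the standard library alone. This is a minimal illustration, not the contents of fetch_pubmed_metadata.py: the helper names are invented, the sample XML is made up (including its PMID), and only a subset of fields is extracted.

```python
import xml.etree.ElementTree as ET


def batched(ids: list[str], size: int = 200):
    """Yield consecutive slices of at most `size` IDs (batch size quoted above)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def parse_pubmed_article(article: ET.Element) -> dict:
    """Extract title, abstract, and MeSH headings from one <PubmedArticle> element."""
    citation = article.find("MedlineCitation")
    title_el = citation.find("Article/ArticleTitle")
    abstract_parts = citation.findall("Article/Abstract/AbstractText")
    mesh = [
        {
            "descriptor": heading.findtext("DescriptorName"),
            "qualifiers": [q.text for q in heading.findall("QualifierName")],
        }
        for heading in citation.findall("MeshHeadingList/MeshHeading")
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "title": "".join(title_el.itertext()) if title_el is not None else "",
        "abstract": " ".join("".join(part.itertext()) for part in abstract_parts),
        # Structured abstracts label their sections (BACKGROUND, METHODS, ...).
        "abstract_structured": any(part.get("Label") for part in abstract_parts),
        "mesh_headings": mesh,
    }


# Invented sample; the PMID is a placeholder.
sample = ET.fromstring(
    "<PubmedArticle><MedlineCitation><PMID>00000000</PMID>"
    "<Article><ArticleTitle>Example title</ArticleTitle>"
    "<Abstract><AbstractText Label='BACKGROUND'>Some text.</AbstractText></Abstract>"
    "</Article><MeshHeadingList><MeshHeading>"
    "<DescriptorName>Signal Transduction</DescriptorName>"
    "<QualifierName>genetics</QualifierName>"
    "</MeshHeading></MeshHeadingList></MedlineCitation></PubmedArticle>"
)
record = parse_pubmed_article(sample)
```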

Field coverage (200-article pilot)

Field | Coverage | Notes
Title | 98.5% | Almost always available
Abstract | 98.5% | Some structured (BACKGROUND/METHODS/...)
MeSH headings | 70.6% | NLM-curated subject terms + qualifiers
Author keywords | 49.5% | Author-provided, less standardized
Publication types | 100% | "Journal Article", "Review", etc.

Key finding: For restricted images (no caption available), 92% still get MeSH terms. This makes metadata the primary fallback signal.

Schema

{
  "pmc_id": "PMC4831128",
  "pmid": "27148032",
  "title": "Bile acid signaling in liver...",
  "abstract": "...",                       // truncated to 2000 chars in prompt
  "abstract_structured": false,
  "mesh_headings": [
    { "descriptor": "Signal Transduction", "qualifiers": ["genetics"] }
  ],
  "keywords": [
    { "term": "bile acid", "owner": "Author" }
  ],
  "is_retracted": false,
  "year": "2016"
}

LLM assignment (metadata mode)

Script: scripts/assign_kegg_from_caption.py --mode metadata

Same script, different prompt. Sends title + abstract (2000 chars) + MeSH terms + keywords + KEGG list to an LLM. The model identifies the most prominent KEGG pathway discussed in the paper.
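
As an illustration of how a metadata record could be flattened into the prompt's input, the sketch below joins the title, the abstract cut to 2000 characters, MeSH terms, and keywords. The function `metadata_block` and its output layout are assumptions; the actual metadata prompt is not reproduced here.

```python
ABSTRACT_LIMIT = 2000  # abstract length sent to the model, as stated above


def metadata_block(record: dict, abstract_limit: int = ABSTRACT_LIMIT) -> str:
    """Flatten one metadata record into labeled lines for the prompt."""
    title = record.get("title", "")
    abstract = record.get("abstract", "")[:abstract_limit]
    mesh_terms = []
    for heading in record.get("mesh_headings", []):
        term = heading["descriptor"]
        if heading.get("qualifiers"):
            term += " (" + ", ".join(heading["qualifiers"]) + ")"
        mesh_terms.append(term)
    keywords = [kw["term"] for kw in record.get("keywords", [])]
    return "\n".join([
        f"Title: {title}",
        f"Abstract: {abstract}",
        f"MeSH terms: {'; '.join(mesh_terms) or 'none'}",
        f"Author keywords: {'; '.join(keywords) or 'none'}",
    ])


# Invented record following the schema above.
example = {
    "title": "Example title",
    "abstract": "x" * 3000,
    "mesh_headings": [{"descriptor": "Signal Transduction", "qualifiers": ["genetics"]}],
    "keywords": [{"term": "bile acid", "owner": "Author"}],
}
block_lines = metadata_block(example).split("\n")
```

Writing "none" for missing MeSH terms or keywords keeps the layout stable for the roughly 30% and 50% of papers that lack them in the pilot.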

Important caveat: This is paper-level metadata, not figure-level. A paper may have 5 figures showing different pathways, but we only get one paper-level assignment. Confidence is inherently lower than caption-based assignment.

Supporting Data & Caches

Caches & ID mappings

File | Records | Purpose
pmc_to_pmid_map.json | 42.4k | PMC ↔ PMID + DOI mapping
pmc_fetch_status_cache.json | varies | Avoids re-fetching restricted/errored PMCs
ncbi_gene_cache.json | 9,500 | Gene symbol → Entrez ID (avoids NCBI API calls)
pmc_publication_years.json | 42.4k | PMC ID → publication year
pubmed_fetch_checkpoint.json | - | Resume point for interrupted PubMed fetches

Validation & alignment data

File | Description
pathway_alignment.json | Aggregated pathway-level stats (12 MB)
full_pathway_validation.json | 118-pathway Jaccard validation results
normalized_pathway_groups.json | Groups variant Gemini labels into canonical names (2 MB)
hippo_image_jaccard.json | Hippo pathway deep-dive validation
pmc_audit_showcase.json | Curated caption audit examples

LLM assignment output files

File | Description
kegg_assignments/eval_200_results.json | 200-image stratified eval (v1 prompt, GPT-4o-mini)
kegg_assignments/eval_sample_200.json | The 200 test images with metadata
kegg_assignments/test_set_both_gpt-4o-mini.json | Caption + metadata mode comparison
kegg_assignments/prompt_v2_test_10.json | v2 prompt test (GPT-4o-mini)
kegg_assignments/prompt_v2_gpt52_test_10.json | v2 prompt test (GPT-5.2)
kegg_assignments/prompt_v2_gemini3_test_10.json | v2 prompt test (Gemini 3 Pro)

Parsing Methods at a Glance

Step | Method | Tool | Status
Image → pathway label + genes | LLM | Gemini 3 Pro (Vertex AI) | Complete (49k)
PMC ID → figure caption text | API | NCBI EFetch + XML parsing | Complete (34k)
Caption → pathway phrase (old) | Regex | 11 patterns + SequenceMatcher | Deprecated
Caption → KEGG ID (new) | LLM | GPT-4o-mini / GPT-5.2 / Gemini 3 | Validated (10-img), pending full run
PMC ID → PMID → PubMed XML | API | NCBI ID Converter + EFetch | Complete (42.4k)
PubMed XML → title/abstract/MeSH | XML parse | Python ElementTree (stdlib) | Complete (42.4k)
Metadata → KEGG ID | LLM | GPT-4o-mini (metadata mode) | Pending (~14.7k captionless)
Gene set → KEGG score | Computation | Jaccard similarity (Python) | Complete (48.8k)
Gene symbol → Entrez ID | API + cache | NCBI Gene API + local cache | Complete (9.5k cached)

Pipeline Status

Step | Status | Cost
Full 34k caption LLM assignment | Ready to run (v2 prompt validated) | ~$20 (mini) / ~$340 (5.2)
14.7k metadata LLM assignment | Ready to run (metadata prompt exists) | ~$10 (mini)
Multi-signal consolidation | Pending; needs decision on weighting | -
Caption re-extraction at 2000 chars | Optional improvement, low priority | API time only