Data sources, parsing methods, and signal architecture for pathway image labeling
Each pathway image can draw on 4 independent signals for KEGG pathway assignment. These signals vary in availability, method (LLM vs regex vs computation), and reliability.
LLM Gemini 3 Pro via Vertex AI
The primary data source. Gemini looks at each pathway diagram image and extracts a structured annotation: pathway name, gene/protein nodes, interaction edges, and figure metadata.
48,974 images — 318 MB JSONL
{
"figure_metadata": {
"canonical_pathway": "Cell cycle", // Free-form label
"related_pathways": ["Apoptosis", "p53 signaling"],
"pathway_description": "...",
"filename": "PMC100005__F11.jpg"
},
"nodes": [
{ "label": "CDK2", "node_type": "gene", "entrez_id": 1017 }
],
"edges": [
{ "source": "p53", "target": "CDK2", "interaction_type": "inhibition" }
]
}
Method: Async batch processing (~117 img/min, 500 concurrent). Images are fetched from Dropbox via its API, because Dropbox Smart Sync leaves only 0-byte placeholder files on the local disk.
Key output: canonical_pathway is Gemini's free-text pathway label (e.g. "Hippo signaling", "TrkA signaling"). These are not KEGG IDs — they're 8,349 unique free-form names that need mapping.
Limitation: Free-form labels have variant spellings. "PI3K/AKT signaling", "PI3K-AKT pathway", "Akt signaling" all mean hsa04151.
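The bounded-concurrency batch method described above can be sketched with `asyncio`. This is a minimal illustration, not the pipeline's actual code: `annotate_image` is a hypothetical placeholder for the Dropbox download plus the Gemini call, and only the limit of 500 comes from the text.

```python
import asyncio

MAX_CONCURRENT = 500  # concurrency figure quoted in the Method note


async def annotate_image(filename: str) -> dict:
    """Hypothetical stand-in for: download from Dropbox, then call Gemini."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"figure_metadata": {"filename": filename}, "nodes": [], "edges": []}


async def annotate_all(filenames: list[str], limit: int = MAX_CONCURRENT) -> list[dict]:
    semaphore = asyncio.Semaphore(limit)

    async def bounded(name: str) -> dict:
        async with semaphore:  # at most `limit` requests in flight at once
            return await annotate_image(name)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(name) for name in filenames))


results = asyncio.run(annotate_all(["PMC100005__F11.jpg", "PMC4831128__F6.jpg"]))
```

A semaphore keeps the number of in-flight requests fixed while letting the event loop start the next image as soon as any slot frees up.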
Computation Set intersection on Entrez gene IDs
For each image, compare its extracted gene set (from Gemini nodes) against every KEGG pathway's reference gene set.
48,779 images scored — 6.8 MB CSV
Jaccard similarity: |image_genes ∩ pathway_genes| / |image_genes ∪ pathway_genes|
All 239 KEGG pathways are ranked per image. Top 2 matches stored.
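As a rough sketch of the scoring step, the snippet below computes Jaccard similarity against every reference set and keeps the top two. The pathway keys and gene sets are toy values chosen for illustration, and the function names are assumptions rather than names from the pipeline.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_matches(image_genes: set[int], pathways: dict[str, set[int]], k: int = 2):
    """Score every pathway and return the k best as (pathway_id, score)."""
    scores = [(pid, jaccard(image_genes, genes)) for pid, genes in pathways.items()]
    # Highest score first; ties broken by pathway id for reproducibility
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:k]


# Toy reference sets (not real KEGG membership)
pathways = {
    "toy_A": {1017, 1019, 7157, 595},
    "toy_B": {7157, 4193, 1026},
    "toy_C": {207, 5290},
}
image_genes = {1017, 7157, 1026}
ranked = top_matches(image_genes, pathways)
# ranked == [("toy_B", 0.5), ("toy_A", 0.4)]
```

Because the union sits in the denominator, large reference pathways are penalized when an image shows only a handful of genes, which is worth keeping in mind when interpreting low scores.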
| File | Records | Description |
|---|---|---|
| kegg_human_pathways.json | 239 | KEGG pathway names + Entrez gene sets (10-1000+ genes each) |
| wikipathways_human.gmt | 984 | WikiPathways gene sets (GMT format, alternative reference) |
| ncbi_gene_cache.json | 9,500 | Gene symbol → Entrez ID cache (avoids repeated NCBI lookups) |
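For readers unfamiliar with GMT, each line is tab-separated: a set name, a description, then the member genes. A minimal parser might look like the following; the function name and the sample lines are illustrative assumptions, and the real file's description column and identifier type should be checked before relying on it.

```python
def parse_gmt(text: str) -> dict[str, set[str]]:
    """Parse GMT text into {set_name: {gene, ...}}."""
    gene_sets: dict[str, set[str]] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:  # need a name, a description, and at least one gene
            continue
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}  # drop empty trailing fields
    return gene_sets


# Toy GMT content (not taken from wikipathways_human.gmt)
sample = (
    "Toy pathway A\thttp://example.org/a\t1017\t7157\n"
    "Toy pathway B\thttp://example.org/b\t207\n"
)
gene_sets = parse_gmt(sample)
```

Genes are kept as strings here; converting them to integers would be needed before comparing against integer Entrez IDs such as those in the Gemini node output.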
API extraction + Regex matching (broken; LLM replacement in progress)
~34k entries — 42 MB JSON
Script: scripts/pmc_caption_extractor_full.py
For each PMC ID, fetch the full article XML via the NCBI EFetch API, then parse the <fig> elements to extract figure captions. Requests are rate limited to 3-10 per second (NCBI's limits are 3 per second without an API key and 10 with one).
Captions are truncated at 1000 characters during extraction. ~36.5% of captions hit this limit.
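The caption extraction and truncation can be sketched with the standard library. This is an assumption-laden illustration rather than the contents of `pmc_caption_extractor_full.py`: the function name is made up, the sample XML is a toy fragment, and the network fetch is omitted.

```python
import xml.etree.ElementTree as ET

MAX_CAPTION_CHARS = 1000  # truncation length stated above


def extract_captions(article_xml: str, limit: int = MAX_CAPTION_CHARS) -> dict[str, str]:
    """Map each <fig> id to its caption text, truncated to `limit` characters."""
    root = ET.fromstring(article_xml)
    captions: dict[str, str] = {}
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        if caption is None:
            continue
        # itertext() flattens inline markup such as <italic>; split/join
        # collapses runs of whitespace left over from the XML layout
        text = " ".join("".join(caption.itertext()).split())
        captions[fig.get("id", "")] = text[:limit]
    return captions


sample = """<article><body>
  <fig id="F6"><label>Figure 6</label>
    <caption><p>Bile acid <italic>signaling</italic> in hepatocytes.</p></caption>
  </fig>
</body></article>"""
captions = extract_captions(sample)
# captions == {"F6": "Bile acid signaling in hepatocytes."}
```

Passing a larger `limit` is all that a re-extraction at 2000 characters would require in a design like this one.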
The original extractor ran 11 regex patterns over each caption to find pathway mentions, then used Python's SequenceMatcher (similarity ≥ 0.4) to match against KEGG names.
Problem: Regex picks up junk, and fuzzy matching then maps that junk onto real pathways. "this signaling" matches "Thiamine metabolism". "general cell biology" triggers pathway patterns. The regex-based matches in the kegg_match and caption_pathways fields are unreliable and are being replaced.
// Example of broken regex match
{
"caption_pathways": ["this signaling pathway"],
"kegg_match": {
"kegg_id": "hsa00730",
"kegg_name": "Thiamine metabolism", // Wrong!
"score": 0.42
}
}
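To see why character-level matching misfires, the sketch below reproduces the general approach with `difflib.SequenceMatcher` and the 0.4 threshold. The helper name and the three-name candidate list are illustrative assumptions; the original extractor's exact preprocessing is not shown in this document.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.4  # similarity cutoff used by the original extractor


def best_fuzzy_match(phrase: str, names: list[str], threshold: float = THRESHOLD):
    """Return (name, score) for the most similar name, or None below threshold."""
    scored = [
        (SequenceMatcher(None, phrase.lower(), name.lower()).ratio(), name)
        for name in names
    ]
    score, name = max(scored)
    return (name, score) if score >= threshold else None


# Small illustrative subset of pathway names
names = ["Thiamine metabolism", "Cell cycle", "Hippo signaling pathway"]
match = best_fuzzy_match("this signaling pathway", names)
# The generic phrase is confidently "matched" to Hippo signaling pathway,
# purely because the strings share the suffix " signaling pathway".
```

The matcher has no notion of which words carry biological meaning, so a generic phrase can clear a low threshold against any name that shares common words or letter runs.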
Script: scripts/assign_kegg_from_caption.py
Sends the full caption + 239 KEGG pathway list to an LLM. The model uses biological reasoning (not keyword matching) to identify the best KEGG pathway.
Prompt validated on 10-image audit — see prompt eval report.
{
"pmc_id": "PMC4831128",
"filename": "PMC4831128__F6.jpg",
"gemini_pathway": "Lipid metabolism regulation",
"status": "ok",
"caption": "The figure shows bile acid...", // max 1000 chars
"caption_pathways": [...], // regex matches (deprecated)
"kegg_match": {...} // SequenceMatcher (deprecated)
}
// data/kegg_assignments/*.json
{
"filename": "PMC4831128__F6.jpg",
"mode": "caption",
"gemini_pathway": "Lipid metabolism regulation",
"kegg_id": "hsa00120",
"kegg_name": "Primary bile acid biosynthesis",
"confidence": 0.9,
"reasoning": "Caption mentions cholesterol and bile acid genes"
}
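Since the assignment comes from an LLM, a light sanity check on each record may be worth running before consolidation. The sketch below is a suggestion, not part of `assign_kegg_from_caption.py`: the function name is hypothetical, the [0, 1] confidence range is inferred from the example value, and how a "no match" answer is encoded is not specified here, so it would be flagged as a problem.

```python
import re

KEGG_ID_PATTERN = re.compile(r"^hsa\d{5}$")  # e.g. hsa00120


def validate_assignment(record: dict, known_ids: set[str]) -> list[str]:
    """Return a list of problems found in one assignment record (empty if none)."""
    problems = []
    kegg_id = record.get("kegg_id")
    if not isinstance(kegg_id, str) or not KEGG_ID_PATTERN.match(kegg_id):
        problems.append("kegg_id is not of the form hsaNNNNN")
    elif kegg_id not in known_ids:
        problems.append("kegg_id is not in the reference pathway list")
    confidence = record.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence is outside [0, 1]")  # assumed range
    return problems


record = {
    "filename": "PMC4831128__F6.jpg",
    "kegg_id": "hsa00120",
    "kegg_name": "Primary bile acid biosynthesis",
    "confidence": 0.9,
}
problems = validate_assignment(record, known_ids={"hsa00120", "hsa04151"})
```

In practice `known_ids` would be the 239 IDs loaded from the KEGG reference file.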
API extraction + LLM assignment
For images without captions (~30% of the dataset), we fall back to paper-level metadata: title, abstract, MeSH terms, and author keywords.
42,402 records — 131 MB JSON
Script: scripts/fetch_pubmed_metadata.py (519 LOC, stdlib only)
xml.etree.ElementTree parsing
All parsing is deterministic XML extraction; no LLM or regex is used at this stage.
| Field | Coverage | Notes |
|---|---|---|
| Title | 98.5% | Almost always available |
| Abstract | 98.5% | Some structured (BACKGROUND/METHODS/...) |
| MeSH headings | 70.6% | NLM-curated subject terms + qualifiers |
| Author keywords | 49.5% | Author-provided, less standardized |
| Publication types | 100% | "Journal Article", "Review", etc. |
Key finding: For restricted images (no caption available), 92% still get MeSH terms. This makes metadata the primary fallback signal.
{
"pmc_id": "PMC4831128",
"pmid": "27148032",
"title": "Bile acid signaling in liver...",
"abstract": "...", // truncated to 2000 chars in prompt
"abstract_structured": false,
"mesh_headings": [
{ "descriptor": "Signal Transduction", "qualifiers": ["genetics"] }
],
"keywords": [
{ "term": "bile acid", "owner": "Author" }
],
"is_retracted": false,
"year": "2016"
}
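A record of roughly this shape could be produced from PubMed XML with `xml.etree.ElementTree` along the following lines. This is a sketch, not an excerpt from `fetch_pubmed_metadata.py`: the function names and the sample XML are assumptions, and fields such as year, retraction status, and keyword owner are left out for brevity.

```python
import xml.etree.ElementTree as ET


def text_of(element) -> str:
    """Flattened, whitespace-normalized text of an element ('' if missing)."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_pubmed_article(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    abstract_parts = root.findall(".//Abstract/AbstractText")
    return {
        "pmid": text_of(root.find(".//MedlineCitation/PMID")),
        "title": text_of(root.find(".//ArticleTitle")),
        "abstract": " ".join(text_of(part) for part in abstract_parts),
        # Structured abstracts carry a Label attribute (BACKGROUND, METHODS, ...)
        "abstract_structured": any(part.get("Label") for part in abstract_parts),
        "mesh_headings": [
            {
                "descriptor": text_of(heading.find("DescriptorName")),
                "qualifiers": [text_of(q) for q in heading.findall("QualifierName")],
            }
            for heading in root.iter("MeshHeading")
        ],
        "keywords": [text_of(k) for k in root.iter("Keyword")],
    }


# Toy record in the PubMed XML layout (values are made up)
sample = """<PubmedArticle><MedlineCitation>
  <PMID>12345</PMID>
  <Article>
    <ArticleTitle>Toy title</ArticleTitle>
    <Abstract><AbstractText Label="BACKGROUND">Toy abstract.</AbstractText></Abstract>
  </Article>
  <MeshHeadingList><MeshHeading>
    <DescriptorName>Signal Transduction</DescriptorName>
    <QualifierName>genetics</QualifierName>
  </MeshHeading></MeshHeadingList>
  <KeywordList Owner="NOTNLM"><Keyword>bile acid</Keyword></KeywordList>
</MedlineCitation></PubmedArticle>"""
parsed = parse_pubmed_article(sample)
```

Missing elements simply yield empty strings or empty lists, which matches the partial coverage reported in the field table.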
Script: scripts/assign_kegg_from_caption.py --mode metadata
Same script, different prompt. Sends title + abstract (2000 chars) + MeSH terms + keywords + KEGG list to an LLM. The model identifies the most prominent KEGG pathway discussed in the paper.
Important caveat: This is paper-level metadata, not figure-level. A paper may have 5 figures showing different pathways, but we only get one paper-level assignment. Confidence is inherently lower than caption-based assignment.
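The metadata-mode input described above could be assembled roughly as follows. The prompt wording, the function name, and the line layout are hypothetical; only the ingredients (title, abstract cut at 2000 characters, MeSH terms, keywords, KEGG list) come from the text.

```python
ABSTRACT_LIMIT = 2000  # abstract truncation length stated above


def build_metadata_prompt(record: dict, kegg_pathways: dict[str, str]) -> str:
    """Assemble a plain-text prompt from one paper-level metadata record."""
    mesh_terms = []
    for heading in record.get("mesh_headings", []):
        qualifiers = ", ".join(heading.get("qualifiers", []))
        term = heading["descriptor"]
        mesh_terms.append(f"{term} ({qualifiers})" if qualifiers else term)
    keywords = [k["term"] for k in record.get("keywords", [])]
    kegg_lines = [f"{pid}\t{name}" for pid, name in sorted(kegg_pathways.items())]
    return "\n".join([
        # Hypothetical instruction text, not the validated prompt
        "Pick the single KEGG pathway most prominent in this paper.",
        f"Title: {record.get('title', '')}",
        f"Abstract: {record.get('abstract', '')[:ABSTRACT_LIMIT]}",
        f"MeSH: {'; '.join(mesh_terms)}",
        f"Keywords: {'; '.join(keywords)}",
        "KEGG pathways:",
        *kegg_lines,
    ])


record = {
    "title": "Bile acid signaling in liver...",
    "abstract": "x" * 3000,  # deliberately longer than the limit
    "mesh_headings": [{"descriptor": "Signal Transduction", "qualifiers": ["genetics"]}],
    "keywords": [{"term": "bile acid", "owner": "Author"}],
}
prompt = build_metadata_prompt(record, {"hsa00120": "Primary bile acid biosynthesis"})
```

Records with no MeSH headings or keywords still produce a valid prompt; the corresponding lines are just empty.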
| File | Records | Purpose |
|---|---|---|
| pmc_to_pmid_map.json | 42.4k | PMC ↔ PMID + DOI mapping |
| pmc_fetch_status_cache.json | varies | Avoids re-fetching restricted/errored PMCs |
| ncbi_gene_cache.json | 9,500 | Gene symbol → Entrez ID (avoids NCBI API calls) |
| pmc_publication_years.json | 42.4k | PMC ID → publication year |
| pubmed_fetch_checkpoint.json | n/a | Resume point for interrupted PubMed fetches |
| File | Description |
|---|---|
| pathway_alignment.json | Aggregated pathway-level stats (12 MB) |
| full_pathway_validation.json | 118-pathway Jaccard validation results |
| normalized_pathway_groups.json | Groups variant Gemini labels into canonical names (2 MB) |
| hippo_image_jaccard.json | Hippo pathway deep-dive validation |
| pmc_audit_showcase.json | Curated caption audit examples |
| File | Description |
|---|---|
| kegg_assignments/eval_200_results.json | 200-image stratified eval (v1 prompt, GPT-4o-mini) |
| kegg_assignments/eval_sample_200.json | The 200 test images with metadata |
| kegg_assignments/test_set_both_gpt-4o-mini.json | Caption + metadata mode comparison |
| kegg_assignments/prompt_v2_test_10.json | v2 prompt test (GPT-4o-mini) |
| kegg_assignments/prompt_v2_gpt52_test_10.json | v2 prompt test (GPT-5.2) |
| kegg_assignments/prompt_v2_gemini3_test_10.json | v2 prompt test (Gemini 3 Pro) |
| Step | Method | Tool | Status |
|---|---|---|---|
| Image → pathway label + genes | LLM | Gemini 3 Pro (Vertex AI) | Complete (49k) |
| PMC ID → figure caption text | API | NCBI EFetch + XML parsing | Complete (34k) |
| Caption → pathway phrase (old) | Regex | 11 patterns + SequenceMatcher | Deprecated |
| Caption → KEGG ID (new) | LLM | GPT-4o-mini / GPT-5.2 / Gemini 3 | Validated (10-img), pending full run |
| PMC ID → PMID → PubMed XML | API | NCBI ID Converter + EFetch | Complete (42.4k) |
| PubMed XML → title/abstract/MeSH | XML Parse | Python ElementTree (stdlib) | Complete (42.4k) |
| Metadata → KEGG ID | LLM | GPT-4o-mini (metadata mode) | Pending (~14.7k captionless) |
| Gene set → KEGG score | Computation | Jaccard similarity (Python) | Complete (48.8k) |
| Gene symbol → Entrez ID | API + cache | NCBI Gene API + local cache | Complete (9.5k cached) |
| Step | Status | Cost |
|---|---|---|
| Full 34k caption LLM assignment | Ready to run (v2 prompt validated) | ~$20 (mini) / ~$340 (5.2) |
| 14.7k metadata LLM assignment | Ready to run (metadata prompt exists) | ~$10 (mini) |
| Multi-signal consolidation | Pending — needs decision on weighting | — |
| Caption re-extraction at 2000 chars | Optional improvement, low priority | API time only |