Summary Statistics

48,779 pathway diagram images → 1.38M edges, 1.36M nodes, 3 KEGG assignment signals

Dataset Size

Images: 48,779
Edges: 1.38M
Nodes: 1.36M
KEGG Pathways Hit: 249

Data Samples

metadata.csv (20 columns)

One row per image. Each row carries KEGG assignments from three independent signals (Jaccard, caption, metadata), plus Gemini's own label and related pathways.

| pmc_id | filename | figure_number | gemini_label | gemini_related_pathways | n_genes | jaccard_kegg_id | jaccard_kegg_name | jaccard_score | caption_kegg_id | caption_kegg_name | caption_confidence | caption_reasoning | caption_jaccard | metadata_kegg_id | metadata_kegg_name | metadata_confidence | metadata_reasoning | metadata_jaccard | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PMC5403593 | PMC5403593__F2.jpg | F2 | Cell cycle | MAPK/ERK signaling\|Hippo signaling\|PI3K-AKT signaling | 2 | hsa04392 | Hippo signaling pathway - multiple species | 0.03333 | | | | | | hsa04014 | Ras signaling pathway | 0.6 | The paper's keywords explicitly list KRAS, RAS… | 0.00000 | 2017 |
| PMC5795421 | PMC5795421__F2.jpg | F2 | MicroRNA biogenesis | RNA interference\|Gene silencing\|Post-transcriptional regulation | 5 | hsa03013 | Nucleocytoplasmic transport | 0.00893 | | | | | | hsa03250 | Viral life cycle - HIV-1 | 0.6 | The paper discusses therapies to achieve an HI… | 0.00000 | 2017 |
| PMC6599049 | PMC6599049__F6.jpg | F6 | cGAS-STING signaling | Antiviral innate immunity\|Mitochondrial signaling | 4 | hsa04623 | Cytosolic DNA-sensing pathway | 0.04819 | hsa04623 | Cytosolic DNA-sensing pathway | 0.9 | The caption explicitly describes the mechanism… | 0.04819 | hsa04623 | Cytosolic DNA-sensing pathway | 0.9 | The paper title and abstract explicitly attrib… | 0.04819 | 2019 |
| PMC4509066 | PMC4509066__F2.jpg | F2 | Apoptosis | Calcium signaling\|MAPK signaling\|G protein signaling | 11 | hsa04215 | Apoptosis - multiple species | 0.20000 | | | | | | hsa04210 | Apoptosis | 0.8 | The abstract explicitly states that the 'most… | 0.05755 | 2015 |
| PMC3309942 | PMC3309942__F1.jpg | F1 | RANK signaling | NF-kB signaling\|Autophagy\|MAPK signaling | 6 | hsa04064 | NF-kappa B signaling pathway | 0.02778 | hsa04064 | NF-kappa B signaling pathway | 1.0 | The caption explicitly describes the role of p… | 0.02778 | hsa04064 | NF-kappa B signaling pathway | 0.8 | The abstract details the role of p62 as a scaf… | 0.02778 | 2012 |
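A minimal sketch of reading these rows, assuming metadata.csv is a standard comma-delimited file with the header shown above, that gemini_related_pathways uses "|" as its in-cell separator (as the samples suggest), and that an empty cell means the signal made no assignment. The helper name `read_metadata` and the inline two-row sample (a subset of the 20 columns) are illustrative only.

```python
import csv
import io

# Illustrative subset of columns; values copied from two of the sample rows.
# Assumption: the caption fields of the first row are empty.
SAMPLE = """pmc_id,figure_number,gemini_label,gemini_related_pathways,caption_kegg_id,metadata_kegg_id
PMC5403593,F2,Cell cycle,MAPK/ERK signaling|Hippo signaling|PI3K-AKT signaling,,hsa04014
PMC6599049,F6,cGAS-STING signaling,Antiviral innate immunity|Mitochondrial signaling,hsa04623,hsa04623
"""

def read_metadata(handle):
    """Parse metadata rows; split related pathways and map empty cells to None."""
    rows = []
    for row in csv.DictReader(handle):
        related = row["gemini_related_pathways"]
        row["gemini_related_pathways"] = related.split("|") if related else []
        for key in ("caption_kegg_id", "metadata_kegg_id"):
            row[key] = row[key] or None  # empty string: no assignment from this signal
        rows.append(row)
    return rows

rows = read_metadata(io.StringIO(SAMPLE))
for r in rows:
    print(r["pmc_id"], r["gemini_related_pathways"], r["caption_kegg_id"])
```

To read the real file, pass an open file handle instead of the `io.StringIO` wrapper.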

edges.csv (7 columns)

One row per directed interaction extracted from a figure.

| pmc_id | filename | figure_number | source | target | interaction | uncertain |
|---|---|---|---|---|---|---|
| PMC8096095 | PMC8096095__F5.jpg | F5 | CASP3 | PARP1 | inhibition | False |
| PMC8096095 | PMC8096095__F5.jpg | F5 | CASP3 | Apoptosis | activation | False |
| PMC8096095 | PMC8096095__F5.jpg | F5 | DNA Damage | ATM | activation | False |
| PMC8096095 | PMC8096095__F5.jpg | F5 | ATM | TP53 | activation | False |
| PMC8096095 | PMC8096095__F5.jpg | F5 | TP53 | Apoptosis | activation | True |
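Because each row is a directed interaction, the rows of one figure can be assembled into a small graph. The sketch below does this for the five sample edges and shows how dropping uncertain edges changes what is reachable. It assumes that `uncertain = True` flags a low-confidence extraction; the function names are hypothetical.

```python
from collections import defaultdict

# Sample rows from edges.csv: (source, target, interaction, uncertain)
EDGES = [
    ("CASP3", "PARP1", "inhibition", False),
    ("CASP3", "Apoptosis", "activation", False),
    ("DNA Damage", "ATM", "activation", False),
    ("ATM", "TP53", "activation", False),
    ("TP53", "Apoptosis", "activation", True),
]

def build_graph(edges, include_uncertain=True):
    """Map each source node to its list of (target, interaction) pairs."""
    graph = defaultdict(list)
    for source, target, interaction, uncertain in edges:
        if uncertain and not include_uncertain:
            continue
        graph[source].append((target, interaction))
    return dict(graph)

def reachable(graph, start):
    """Return all nodes reachable from start by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target, _ in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

full = build_graph(EDGES)
confident = build_graph(EDGES, include_uncertain=False)
print(sorted(reachable(full, "DNA Damage")))       # ['ATM', 'Apoptosis', 'TP53']
print(sorted(reachable(confident, "DNA Damage")))  # ['ATM', 'TP53']
```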

nodes.csv (8 columns)

One row per node extracted from a figure. Genes mapped to Entrez IDs where possible.

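| pmc_id | filename | figure_number | label | node_type | entrez_id | is_family_representative | notes |
|---|---|---|---|---|---|---|---|
| PMC5403593 | PMC5403593__F2.jpg | F2 | YAP1 | gene | 10413 | False | |
| PMC5403593 | PMC5403593__F2.jpg | F2 | ERK | gene | | True | MAPK family |
| PMC5403593 | PMC5403593__F2.jpg | F2 | PI3K | complex | | True | |
| PMC5403593 | PMC5403593__F2.jpg | F2 | beta-catenin | gene | 1499 | False | CTNNB1 |
| PMC5403593 | PMC5403593__F2.jpg | F2 | G0 | phenotype | | False | Quiescence |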
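Since Entrez IDs are present only "where possible", downstream code has to separate mapped gene nodes from unmapped ones. A minimal sketch over the five sample nodes, assuming an empty entrez_id cell means no mapping was found; both function names are hypothetical.

```python
# Sample rows from nodes.csv: (label, node_type, entrez_id, is_family_representative)
NODES = [
    ("YAP1", "gene", "10413", False),
    ("ERK", "gene", "", True),
    ("PI3K", "complex", "", True),
    ("beta-catenin", "gene", "1499", False),
    ("G0", "phenotype", "", False),
]

def mapped_gene_ids(nodes):
    """Collect Entrez IDs of gene nodes that carry one."""
    return {int(entrez) for _, node_type, entrez, _ in nodes
            if node_type == "gene" and entrez}

def unmapped_genes(nodes):
    """List labels of gene nodes without an Entrez ID (e.g. family names)."""
    return [label for label, node_type, entrez, _ in nodes
            if node_type == "gene" and not entrez]

print(sorted(mapped_gene_ids(NODES)))  # [1499, 10413]
print(unmapped_genes(NODES))           # ['ERK']
```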

Per-Image Stats

| | Nodes/Image | Edges/Image |
|---|---|---|
| mean | 27.9 | 28.3 |
| std | 16.1 | 16.0 |
| min | 0 | 0 |
| 25% | 16 | 16 |
| 50% | 24 | 25 |
| 75% | 35 | 37 |
| max | 135 | 194 |
[Figure: Images by year, and average nodes and edges per image over time. Left: image volume by publication year. Center/right: average graph complexity over time.]
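The minimum of 0 implies that some images have no extracted nodes or edges, so per-image counts cannot come from nodes.csv or edges.csv alone: images absent from those files must be counted as zero. A minimal sketch of that computation, assuming filename identifies an image across the three files; the function name and toy filenames are hypothetical.

```python
from collections import Counter
from statistics import mean, median

def per_image_counts(image_files, row_files):
    """Count rows per image, keeping images that have no rows at all."""
    counts = Counter(row_files)
    return [counts.get(name, 0) for name in image_files]

# Toy data: three images in metadata.csv, node rows for only two of them.
image_files = ["img_a.jpg", "img_b.jpg", "img_c.jpg"]
node_row_files = ["img_a.jpg", "img_a.jpg", "img_a.jpg", "img_b.jpg"]

nodes_per_image = per_image_counts(image_files, node_row_files)
print(nodes_per_image)                                                      # [3, 1, 0]
print(min(nodes_per_image), median(nodes_per_image), max(nodes_per_image))  # 0 1 3
print(round(mean(nodes_per_image), 2))                                      # 1.33
```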

Signal Coverage

Three independent KEGG pathway assignment signals:

Jaccard (best-match): Jaccard overlap between the genes in an image and the gene set of every KEGG pathway; the highest-scoring pathway is assigned

Caption KEGG: LLM assignment from the figure caption

Metadata KEGG: LLM assignment from the paper title, abstract, and MeSH terms
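The Jaccard signal can be sketched as follows, assuming the standard Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| over gene identifiers and that an image with no overlapping pathway receives no assignment. The pathway IDs and gene sets here are toy values, not real KEGG content, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(figure_genes, pathways):
    """Return (pathway_id, score) with the highest Jaccard index, or (None, 0.0)."""
    best_id, best_score = None, 0.0
    for pathway_id, genes in pathways.items():
        score = jaccard(figure_genes, genes)
        if score > best_score:
            best_id, best_score = pathway_id, score
    return best_id, best_score

# Toy gene sets keyed by made-up IDs; real KEGG pathways are much larger.
PATHWAYS = {
    "toy_A": {1, 2, 3, 4, 5, 6},
    "toy_B": {3, 4, 7, 8},
}
figure = {3, 4, 9}
print(best_match(figure, PATHWAYS))  # ('toy_B', 0.4)
```

Both toy pathways share two genes with the figure, but toy_B wins because its union with the figure is smaller (5 genes versus 7).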

| Signal | Images | Coverage |
|---|---|---|
| Jaccard (best-match) | 47,526 | 97.4% |
| Caption KEGG | 32,299 | 66.2% |
| Caption Jaccard | 29,662 | 60.8% |
| Metadata KEGG | 46,577 | 95.5% |
| Metadata Jaccard | 44,200 | 90.6% |
| Caption OR Metadata | 47,719 | 97.8% |
| Any signal | 48,747 | 99.9% |
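Coverage here is the share of the 48,779 images for which a signal produced an assignment. A minimal sketch, assuming an empty cell in metadata.csv means "no assignment"; the function name and the four toy rows are illustrative.

```python
def coverage(rows, columns):
    """Return (count, share) of rows with a non-empty value in any given column."""
    hit = sum(1 for row in rows if any(row.get(col) for col in columns))
    return hit, hit / len(rows)

ROWS = [
    {"jaccard_kegg_id": "hsa04392", "caption_kegg_id": "", "metadata_kegg_id": "hsa04014"},
    {"jaccard_kegg_id": "hsa04623", "caption_kegg_id": "hsa04623", "metadata_kegg_id": "hsa04623"},
    {"jaccard_kegg_id": "", "caption_kegg_id": "", "metadata_kegg_id": ""},
    {"jaccard_kegg_id": "hsa04064", "caption_kegg_id": "hsa04064", "metadata_kegg_id": ""},
]

print(coverage(ROWS, ["caption_kegg_id"]))                      # (2, 0.5)
print(coverage(ROWS, ["caption_kegg_id", "metadata_kegg_id"]))  # (3, 0.75)
print(coverage(ROWS, ["jaccard_kegg_id", "caption_kegg_id", "metadata_kegg_id"]))  # (3, 0.75)

# The reported rates follow the same arithmetic, e.g. Jaccard (best-match):
print(round(47526 / 48779 * 100, 1))  # 97.4
```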

Signal Agreement

Pairwise agreement on KEGG pathway assignment (n = 31,157 images with both caption and metadata):

| Pair | Agree | Rate |
|---|---|---|
| Caption = Metadata | 17,666 | 56.7% |
| Caption = Best Jaccard | 9,799 | 31.5% |
| Metadata = Best Jaccard | 7,584 | 24.3% |
| All 3 agree | 6,687 | 21.5% |
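All four rates share the same denominator (the images with both a caption and a metadata assignment), which the reported counts bear out: 17,666 / 31,157 ≈ 56.7%. A minimal sketch of that tally, assuming agreement means identical KEGG pathway IDs; the shortened column names, the function name, and the toy rows are hypothetical.

```python
def pairwise_agreement(rows):
    """Count pairwise matches among rows that have both LLM assignments."""
    subset = [r for r in rows if r["caption"] and r["metadata"]]
    counts = {
        "caption=metadata": sum(r["caption"] == r["metadata"] for r in subset),
        "caption=jaccard": sum(r["caption"] == r["jaccard"] for r in subset),
        "metadata=jaccard": sum(r["metadata"] == r["jaccard"] for r in subset),
        "all three": sum(r["caption"] == r["metadata"] == r["jaccard"] for r in subset),
    }
    return len(subset), counts

ROWS = [
    {"caption": "hsa04623", "metadata": "hsa04623", "jaccard": "hsa04623"},
    {"caption": "hsa04064", "metadata": "hsa04064", "jaccard": "hsa04210"},
    {"caption": "hsa04210", "metadata": "hsa04010", "jaccard": "hsa04210"},
    {"caption": None, "metadata": "hsa04014", "jaccard": "hsa04392"},  # excluded: no caption
]

n, counts = pairwise_agreement(ROWS)
print(n)  # 3
for pair, agree in counts.items():
    print(pair, agree, f"{agree / n:.1%}")
```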

Agreement by Confidence Tier

Caption–Metadata agreement stratified by confidence. At 1.0/1.0, agreement reaches 88.9%.

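| Caption Conf | Metadata Conf | n | Agree | Rate |
|---|---|---|---|---|
| 1.0 | 1.0 | 5,390 | 4,794 | 88.9% |
| 1.0 | 0.8 | 6,415 | 4,540 | 70.8% |
| 1.0 | 0.6 | 2,459 | 815 | 33.1% |
| 0.8 | 1.0 | 822 | 414 | 50.4% |
| 0.8 | 0.8 | 4,001 | 2,390 | 59.7% |
| 0.8 | 0.6 | 2,826 | 1,052 | 37.2% |
| 0.6 | 1.0 | 158 | 55 | 34.8% |
| 0.6 | 0.8 | 592 | 241 | 40.7% |
| 0.6 | 0.6 | 825 | 351 | 42.5% |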
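The stratification amounts to grouping images by their (caption confidence, metadata confidence) pair and computing the agreement rate within each group. A minimal sketch under that reading, with hypothetical names and toy rows:

```python
from collections import defaultdict

def agreement_by_tier(rows):
    """Group by (caption_conf, metadata_conf); return {tier: (n, agree, rate)}."""
    tiers = defaultdict(lambda: [0, 0])  # tier -> [n, agree]
    for caption_id, caption_conf, metadata_id, metadata_conf in rows:
        cell = tiers[(caption_conf, metadata_conf)]
        cell[0] += 1
        cell[1] += caption_id == metadata_id  # True counts as 1
    return {tier: (n, agree, agree / n) for tier, (n, agree) in tiers.items()}

# Toy rows: (caption_kegg_id, caption_confidence, metadata_kegg_id, metadata_confidence)
ROWS = [
    ("hsa04623", 1.0, "hsa04623", 1.0),
    ("hsa04064", 1.0, "hsa04064", 1.0),
    ("hsa04210", 1.0, "hsa04010", 0.6),
    ("hsa04151", 0.8, "hsa04151", 0.8),
    ("hsa04310", 0.8, "hsa04390", 0.8),
]

result = agreement_by_tier(ROWS)
for tier in sorted(result, reverse=True):
    print(tier, result[tier])
```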

Jaccard Score Distributions

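| | Best-match | Caption | Metadata |
|---|---|---|---|
| count | 47,684 | 29,662 | 44,200 |
| mean | 0.093 | 0.070 | 0.053 |
| std | 0.084 | 0.088 | 0.071 |
| 25% | 0.042 | 0.019 | 0.011 |
| 50% | 0.068 | 0.043 | 0.031 |
| 75% | 0.114 | 0.084 | 0.067 |
| max | 0.957 | 0.957 | 0.938 |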

Best-match is highest by construction, since it is the maximum Jaccard score over all pathways. Caption-assigned pathways score higher than metadata-assigned ones (median 0.043 vs. 0.031), suggesting captions are more figure-specific.

[Figure: Jaccard score distributions. Jaccard score histograms for best-match, caption-assigned, and metadata-assigned pathways.]

Top 25 Genes

The PI3K/AKT/mTOR axis dominates. AKT and Akt are counted separately because Gemini's labels are case-sensitive; combined, they would total 10,270 appearances and rank first.

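| # | Gene | Appearances |
|---|---|---|
| 1 | PI3K | 6,797 |
| 2 | AKT | 5,980 |
| 3 | mTOR | 4,991 |
| 4 | Akt | 4,290 |
| 5 | ERK | 4,277 |
| 6 | PTEN | 3,627 |
| 7 | JNK | 3,601 |
| 8 | IL-6 | 3,277 |
| 9 | STAT3 | 3,134 |
| 10 | p53 | 3,115 |
| 11 | Ras | 3,056 |
| 12 | EGFR | 2,939 |
| 13 | MEK | 2,842 |
| 14 | NF-kB | 2,771 |
| 15 | ERK1 | 2,663 |
| 16 | ERK2 | 2,655 |
| 17 | TLR4 | 2,636 |
| 18 | TRAF6 | 2,360 |
| 19 | MAPK | 2,310 |
| 20 | PKC | 2,289 |
| 21 | MyD88 | 2,282 |
| 22 | PDK1 | 2,199 |
| 23 | beta-catenin | 2,164 |
| 24 | Wnt | 2,059 |
| 25 | APC | 2,000 |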
[Figure: Top 10 genes over time. Top 10 most frequently extracted genes by publication year.]
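Because labels such as AKT and Akt are counted separately, a case-insensitive merge changes the ranking. The sketch below illustrates one possible normalization (it is not described as part of the dataset pipeline), using the top five counts from the table; the function name is hypothetical.

```python
from collections import Counter

# Top five labels and their appearance counts from the table above.
COUNTS = {"PI3K": 6797, "AKT": 5980, "mTOR": 4991, "Akt": 4290, "ERK": 4277}

def merge_case_variants(counts):
    """Sum counts of labels that differ only by letter case."""
    merged = Counter()
    display = {}  # upper-cased key -> most frequent original spelling
    for label, n in counts.items():
        key = label.upper()
        merged[key] += n
        if key not in display or n > counts[display[key]]:
            display[key] = label
    return [(display[key], n) for key, n in merged.most_common()]

print(merge_case_variants(COUNTS))
# [('AKT', 10270), ('PI3K', 6797), ('mTOR', 4991), ('ERK', 4277)]
```

Case folding alone does not merge aliases or family members such as ERK, ERK1, and ERK2; that would need a curated synonym map.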

Top 40 Pathways by Signal

Sorted by total count across all three signals. Jaccard favors smaller, specific pathways (ErbB, Prolactin) where gene overlap is high; LLM signals favor broader well-known pathways (PI3K-Akt, MAPK, JAK-STAT).

| Pathway | Caption | Metadata | Jaccard |
|---|---|---|---|
| Toll-like receptor signaling | 1,497 | 1,807 | 2,182 |
| Wnt signaling | 1,884 | 2,303 | 1,074 |
| PI3K-Akt signaling | 1,777 | 2,224 | 44 |
| TGF-beta signaling | 1,270 | 1,699 | 1,019 |
| NF-kappa B signaling | 1,035 | 1,437 | 1,041 |
| ErbB signaling | 453 | 699 | 2,160 |
| MAPK signaling | 1,233 | 1,668 | 97 |
| Apoptosis | 960 | 1,548 | 254 |
| p53 signaling | 473 | 669 | 1,474 |
| JAK-STAT signaling | 1,064 | 1,243 | 300 |
| Apoptosis - multiple species | 4 | 2 | 2,475 |
| mTOR signaling | 762 | 1,162 | 546 |
| Hedgehog signaling | 426 | 620 | 927 |
| Cell cycle | 620 | 920 | 391 |
| VEGF signaling | 214 | 383 | 1,289 |
| Cytosolic DNA-sensing | 380 | 409 | 1,001 |
| RIG-I-like receptor signaling | 414 | 469 | 842 |
| HIF-1 signaling | 390 | 533 | 800 |
| Protein processing in ER | 569 | 735 | 380 |
| Adipocytokine signaling | 232 | 406 | 1,001 |
| Complement & coagulation cascades | 359 | 611 | 608 |
| Autophagy - animal | 548 | 901 | 65 |
| Hippo signaling | 588 | 707 | 219 |
| Insulin signaling | 468 | 758 | 284 |
| NOD-like receptor signaling | 592 | 684 | 68 |
| AMPK signaling | 372 | 417 | 549 |
| Prolactin signaling | 28 | 41 | 1,241 |
| Notch signaling | 361 | 449 | 487 |
| IL-17 signaling | 110 | 188 | 987 |
| Ferroptosis | 338 | 384 | 547 |
| Th17 cell differentiation | 91 | 156 | 980 |
| T cell receptor signaling | 359 | 582 | 253 |
| Pluripotency signaling | 269 | 578 | 339 |
| Glycolysis / Gluconeogenesis | 354 | 559 | 211 |
| Autophagy - other | 7 | 7 | 1,095 |
| TNF signaling | 258 | 319 | 426 |
| Adherens junction | 65 | 93 | 838 |
| PPAR signaling | 262 | 460 | 223 |
| Fc epsilon RI signaling | 40 | 67 | 794 |
| Regulation of actin cytoskeleton | 306 | 513 | 45 |
[Figure: Top 10 pathways over time. Top 10 pathways (by metadata signal) over publication year.]
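One way to see why Jaccard counts skew toward smaller pathways: since |A ∩ B| ≤ min(|A|, |B|) and |A ∪ B| ≥ max(|A|, |B|), the Jaccard index can never exceed the ratio of the smaller set size to the larger. A figure with a handful of genes is therefore capped at a low score against a large pathway, even when every gene matches. The sketch below computes that cap; the set sizes are hypothetical, not actual KEGG pathway sizes.

```python
def jaccard_upper_bound(figure_size, pathway_size):
    """Largest Jaccard index possible for two sets of the given sizes."""
    small, large = sorted((figure_size, pathway_size))
    return small / large if large else 0.0

# A 10-gene figure fully contained in pathways of two hypothetical sizes.
print(jaccard_upper_bound(10, 80))             # 0.125
print(round(jaccard_upper_bound(10, 350), 3))  # 0.029
```

Under best-match selection, the smaller pathway wins whenever both contain all of the figure's genes, which is consistent with the low Jaccard counts for broad pathways such as PI3K-Akt and MAPK.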

Full Distribution — All 249 Pathways

249 KEGG pathways have at least one figure assigned by any signal:

| Signal | Pathways with ≥1 figure |
|---|---|
| Caption | 229 |
| Metadata | 240 |
| Jaccard | 237 |
[Figure: Full KEGG pathway distribution. All 249 KEGG pathways with figure counts from each assignment signal.]
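The total of 249 is larger than any single signal's count because it is the size of the union of the three per-signal pathway sets. A minimal sketch of that tally, with hypothetical names, shortened column keys, and toy rows:

```python
def pathways_hit(rows):
    """Return per-signal distinct pathway counts and the size of their union."""
    per_signal = {"caption": set(), "metadata": set(), "jaccard": set()}
    for row in rows:
        for signal, ids in per_signal.items():
            if row.get(signal):
                ids.add(row[signal])
    union = set().union(*per_signal.values())
    return {signal: len(ids) for signal, ids in per_signal.items()}, len(union)

ROWS = [
    {"caption": "hsa04623", "metadata": "hsa04623", "jaccard": "hsa04623"},
    {"caption": None, "metadata": "hsa04014", "jaccard": "hsa04392"},
    {"caption": "hsa04064", "metadata": "hsa04064", "jaccard": "hsa04064"},
    {"caption": None, "metadata": "hsa04210", "jaccard": "hsa04215"},
]

per_signal, total = pathways_hit(ROWS)
print(per_signal)  # {'caption': 2, 'metadata': 4, 'jaccard': 4}
print(total)       # 6
```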