Comparison of real best-match Jaccard scores with a random KEGG pathway assignment per image (n = 47,537 images with extracted genes)
Best-Match vs Random (n = )
Caption KEGG vs Random (n = , images with caption assignment)
Metadata KEGG vs Random (n = , images with metadata assignment)
Paired tests (n = 47,537): does the real best-match Jaccard significantly exceed the random baseline?
| Test | Statistic | p-value | Interpretation |
|---|---|---|---|
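
As an illustration of what a paired comparison of this kind involves, the sketch below runs a one-sided sign test on per-image (real, random) score pairs. The sign test is chosen here only because it needs nothing beyond the standard library; it is not necessarily one of the tests reported in the table, and the function name `paired_sign_test` and the toy scores are hypothetical.

```python
from math import comb

def paired_sign_test(real, rand):
    """One-sided paired sign test (illustrative, not the report's own test).

    H0: real > rand and real < rand are equally likely for a pair.
    H1: real exceeds rand more often than not.
    Returns (wins, p_value); ties are dropped before testing.
    """
    if len(real) != len(rand):
        raise ValueError("real and rand must be paired (same length)")
    wins = sum(r > b for r, b in zip(real, rand))
    losses = sum(r < b for r, b in zip(real, rand))
    n = wins + losses  # informative pairs only
    if n == 0:
        return wins, 1.0
    # Exact binomial upper tail: P(X >= wins) with X ~ Binomial(n, 0.5)
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return wins, tail / 2 ** n

# Toy scores (hypothetical): five images, one tie
real_scores = [0.4, 0.3, 0.5, 0.2, 0.6]
rand_scores = [0.0, 0.1, 0.0, 0.2, 0.05]
wins, p_value = paired_sign_test(real_scores, rand_scores)
```

With tens of thousands of pairs, a library routine such as a Wilcoxon signed-rank or paired t-test would normally be preferred over this exact enumeration.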
Method: For each image, a random KEGG pathway (one of 239) was drawn (seed = 42), and the Jaccard similarity between the image's extracted gene set and that random pathway's gene set was computed. The best-match Jaccard uses the KEGG pathway that maximizes overlap with the image's gene set. The caption and metadata Jaccard scores use the LLM-assigned KEGG pathway.
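
The procedure in the Method paragraph can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the pathway IDs, gene names, and function names are hypothetical, "maximizes overlap" is assumed to mean the highest Jaccard score, and the random draw is assumed to be a single uniform choice from a seeded generator.

```python
import random

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(image_genes, pathways):
    """Return (pathway_id, score) for the pathway with the highest Jaccard."""
    scored = ((pid, jaccard(image_genes, genes)) for pid, genes in pathways.items())
    return max(scored, key=lambda pair: pair[1])

def random_baseline(image_genes, pathways, rng):
    """Return (pathway_id, score) for one randomly drawn pathway."""
    pid = rng.choice(sorted(pathways))  # sorted for a reproducible draw order
    return pid, jaccard(image_genes, pathways[pid])

# Toy data (hypothetical): three pathways instead of 239
pathways = {
    "pathway_A": {"G1", "G2", "G3", "G4"},
    "pathway_B": {"G3", "G5", "G6", "G7"},
    "pathway_C": {"G8", "G9", "G10", "G11"},
}
image_genes = {"G1", "G2", "G3"}

rng = random.Random(42)  # seed value taken from the Method text
best_id, best_score = best_match(image_genes, pathways)
rand_id, rand_score = random_baseline(image_genes, pathways, rng)
```

Because the best match is the maximum over all pathways, its score can never fall below the random draw's score for the same image.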
Interpretation: The 22.8x ratio of real best-match to random Jaccard indicates that the real gene-overlap signal is far above chance. The random baseline is near zero for most images because a randomly chosen KEGG pathway shares few genes with any given image's extracted gene set.
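
One way such a ratio could be computed is as the mean real score divided by the mean random score, reported together with the share of images whose random score is exactly zero. The text does not state how the 22.8x figure was derived, so the ratio-of-means definition, the function name `summarize_against_random`, and the toy scores below are all assumptions.

```python
from statistics import fmean

def summarize_against_random(real, rand):
    """Return (ratio_of_means, share_of_zero_random_scores).

    ratio_of_means is mean(real) / mean(rand), or None when the random
    mean is zero and the ratio is undefined. This definition is an
    assumption, not the documented source of the reported ratio.
    """
    if len(real) != len(rand) or not real:
        raise ValueError("real and rand must be non-empty and the same length")
    rand_mean = fmean(rand)
    ratio = fmean(real) / rand_mean if rand_mean > 0 else None
    zero_share = sum(score == 0 for score in rand) / len(rand)
    return ratio, zero_share

# Toy scores (hypothetical)
ratio, zero_share = summarize_against_random(
    [0.5, 0.25, 0.75],
    [0.0, 0.125, 0.0],
)
```

A ratio of means is sensitive to a near-zero denominator, which is why the share of zero random scores is worth reporting alongside it.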