GWAS Spillover Pipeline - Results Summary

Generated: December 5, 2025

Pipeline Status

Step Description Status Runtime
01 GWAS Cleaning Complete ~1 min
02 Patent Preparation Complete ~2 min
03 Panel Filtering Complete ~3 min
04 Spillover Creation Complete ~2 min
05 Panel Creation Complete 31.5 min
06 Spillover Integration Complete 4.6 min
07 Stata Export Pending -

Step 5: Panel Creation Results

  • Panel rows: 148,124,655
  • Unique gene-disease pairs: 7,053,555
  • Unique genes: 14,368
  • Unique diseases: 2,435
  • Years covered: 2000-2020 (21 years)
  • Total patents in panel: 49,724,209
  • Mean patents per pair-year: 0.336
  • Extensive margin (any patents): 15.91%
  • Memory usage: 5.24 GB

Step 6: Spillover Integration Results

Treatment Coverage in Final Panel

Spillover Level Rows Percentage
Direct GWAS 247,485 0.2%
1-hop spillover 1,764,021 1.2%
2-hop spillover 5,830,755 3.9%
3-hop spillover 7,007,658 4.7%
No treatment 133,274,736 90.0%

Spillover Pair Counts (pre-panel)

  • Direct GWAS pairs: 35,008
  • 1-hop pairs: 189,088
  • 2-hop pairs: 814,121
  • 3-hop pairs: 1,024,077
  • Total treated pairs: 2,062,294

Patent Count Analysis (Before vs After 99% Filtering)

Metric Before Filtering After Filtering Removed
Rows 140,520,525 49,724,209 90,796,316 (64.6%)
Unique Patents 163,928 92,611 71,317 (43.5%)

Probability Threshold Analysis

The pipeline uses a 99% BioBERT probability threshold to filter patent-gene/disease matches.

Probability Distribution (gene_prob column)

  • Mean: 0.9532
  • Median: 0.9908
  • Std: 0.0932
  • Range: 0.3362 - 1.0000

Threshold Sensitivity

Threshold Rows Kept % Kept Unique Patents
50% 137,332,498 97.7% 163,434
70% 131,869,342 93.8% 160,943
80% 127,101,726 90.5% 159,130
90% 119,073,479 84.7% 155,823
95% 110,657,709 78.7% 152,174
99% 72,016,306 51.2% 128,067
99.9% 1,486,096 1.1% 7,986

Characteristics by Confidence Level

  High Confidence (>=99%) Low Confidence (<99%)
Rows 72,016,306 65,725,744
GWAS match rate 0.46% 0.34%
Median patent year 2014 2015
Unique genes 15,366 16,050

Conclusion: The 99% threshold keeps 51.2% of observations and 128,067 unique patents. This is a reasonable trade-off between data quality and coverage.


Panel Diagnostics Results

Basic Panel Structure

  • Observations: 148,124,655
  • Unique gene-disease pairs: 7,053,555
  • Unique genes: 14,368
  • Unique diseases: 2,435
  • Years: 21 (2000-2020)
  • Observations per pair: 21.0
  • Panel balance: Yes (all pairs have exactly 21 observations)

Treatment Coverage

  • Pairs ever treated: 11,785 (0.2%)
  • Pairs never treated: 7,041,770 (99.8%)

Treatment Timing

  • Earliest treatment year: 2005
  • Median treatment year: 2017
  • Latest treatment year: 2020

Fixed Effects Group Sizes

FE Type Groups Mean Obs/Group Singletons
Gene-Year 301,728 490.9 4,767 (1.6%)
Disease-Year 51,135 2,896.7 0 (0.0%)
Gene-Disease (pairs) 7,053,555 21.0 0 (0.0%)

Output Files

File Size Description
gwas_cleaned.tsv 2.3 MB Cleaned GWAS catalog
patent_with_gwas.parquet 82 MB Patents matched to GWAS
processed_for_spillovers.parquet 36 MB Filtered patents (99% threshold)
spillovers_pre_panel.parquet 5.6 MB Spillover pairs by hop level
full_panel_pre_spillovers.parquet 200 MB Balanced panel before spillovers
full_panel_with_spillovers.parquet 252 MB Final panel with treatment indicators