GWAS Extension: Orange Book & Patent Valuation

Extension Plan: Valuing GWAS Spillovers Using Orange Book & KPSS Data

Created: December 28, 2025

📊 Current Status: See Orange Book Progress Report for implementation status, learnings from Days 1-4, and what data we have/don’t have.

Executive Summary

This document outlines two extensions to the GWAS spillover project that move beyond counting patents to valuing them. Both approaches follow Azoulay et al. (2019, Restud) by focusing on economically important patents rather than treating all patents equally.

Current approach: Measure GWAS treatment effects on patent counts at the gene-disease (G-D) level

Proposed extensions:

Orange Book approach: Focus only on FDA-approved drug patents (the most valuable patents)
KPSS approach: Weight patents by their market value (dollar-denominated outcomes)

Context: What We Currently Have

The GWAS Spillover Panel

Our current analysis uses a balanced panel with:

148M observations (7M gene-disease pairs × 21 years, 2000-2020)
Outcome variable: Patent counts at the G-D-year level
Treatment: GWAS discoveries, classified as:
- Direct GWAS hits (35K pairs)
- 1-hop spillovers (189K pairs)
- 2-hop spillovers (814K pairs)
- 3-hop spillovers (1.02M pairs)
Spillover definition: Network distance in KEGG pathways graph

Current Identification

We run a difference-in-differences (DID) specification:

patents_gdt = β × post_GWAS_gdt + γ_gt + δ_dt + ε_gdt

Where:

patents_gdt = patent count for gene g, disease d, year t
post_GWAS_gdt = indicator for whether G-D pair has been “treated” by GWAS
γ_gt = gene-year fixed effects
δ_dt = disease-year fixed effects

β captures the causal effect of GWAS discoveries on related patenting activity

What Azoulay et al. (2019) Did

The Paper: “Public R&D Investments and Private-Sector Patenting”

Citation: Azoulay, Pierre, Joshua S. Graff Zivin, Danielle Li, and Bhaven N. Sampat. “Public R&D investments and private-sector patenting: evidence from NIH funding shocks.” Review of Economic Studies 86.1 (2019): 117-152.

Section 5.4: Orange Book Patents (page 146)

Key insight: Not all patents are equally valuable. Patents that protect FDA-approved drugs represent the commercial pinnacle of pharmaceutical innovation.

What they did:

Obtained FDA’s “Orange Book” - official list of approved drugs with associated patents
Matched Orange Book patents to their NIH grant-funded research
Re-ran their analysis using Orange Book patents as the outcome
Found that NIH grants have larger effects on high-value patents than on average patents

Why this matters for us: If GWAS discoveries have differential effects on high-value vs. average patents, we’re missing important heterogeneity by lumping all patents together.

Extension 1: Orange Book Approach

Concept

Build a parallel analysis focusing exclusively on FDA-approved drug patents. These are the patents that made it through clinical trials and reached market - representing the upper tail of innovation value.

Methodology

Get Orange Book data at the patent level
Merge with our current BioBERT patent-gene-disease mappings
Create sparse panel: Same G-D-year structure, but outcome = Orange Book patent counts
Run identical DID specification:

OB_patents_gdt = β_OB × post_GWAS_gdt + γ_gt + δ_dt + ε_gdt

Compare β_OB to β_all:
- If β_OB > β_all: GWAS has larger effects on high-value patents
- Interpretation: GWAS discoveries guide R&D toward commercially successful outcomes

Expected Panel Characteristics

Much sparser than current panel (Orange Book has ~3,000 patents vs. our 92K)
More zeros: Most G-D-year cells will have zero Orange Book patents
Higher value per patent: Each patent represents a marketed drug

Data Source: FDA Orange Book

Official source: FDA Orange Book

Available formats:

Download page: https://www.fda.gov/drugs/drug-approvals-and-databases/orange-book-data-files
Direct download: https://www.fda.gov/media/76860/download (full database, updated monthly)

Data structure:

products.txt - approved drug products
patent.txt - patent numbers and expiration dates
exclusivity.txt - market exclusivity information

Key fields we need:

Patent numbers (to merge with USPTO data)
Approval dates
Active ingredients (to link to genes/diseases)

Merge strategy:

Extract patent numbers from Orange Book
Match to our Full_Patent.parquet using patent IDs
Keep only patents that appear in both datasets
These patents already have BioBERT gene-disease extractions

Implementation Steps

Step 1: Download and Process Orange Book (1-2 hours)

# Download from FDA
# Parse patent.txt and products.txt
# Extract unique patent numbers
# Create mapping: patent_id -> approval_year, drug_name, active_ingredient

Step 2: Merge with Our Patent Data (1 hour)

# Load Full_Patent.parquet (our BioBERT extractions)
# Filter to only Orange Book patents
# Keep gene-disease-patent-year structure
# Apply same 99% probability threshold

Step 3: Build Orange Book Panel (2 hours)

# Create balanced panel: all G-D pairs × all years (2000-2020)
# Aggregate Orange Book patents by G-D-year
# Most cells will be zero (sparse)
# Add same spillover treatment indicators as main analysis

Step 4: Run DID Regressions (1 hour)

# Same specification as main analysis
# Outcome: OB_patent_count instead of patent_count
# Compare coefficients

Pricing the Effects (Azoulay’s Method)

Azoulay et al. price their effects using average patent values from prior literature.

Our approach:

Baseline: Use average Orange Book patent value from literature
- Harhoff et al. (1999): Patent values are log-normally distributed
- Average pharmaceutical patent value: ~$10-20M (outdated, need 2020s estimate)
Direct pricing:
- Each Orange Book patent protects a marketed drug
- Could manually lookup first-year sales for subset of drugs
- Use as benchmark for “value per Orange Book patent”
Calculate economic value:
- If β_OB = 0.5 (GWAS increases Orange Book patents by 0.5 per G-D pair)
- And average value = $15M per Orange Book patent
- Then value created = 0.5 × $15M = $7.5M per treated G-D pair

Expected Challenges

Sparsity: Orange Book has ~3,000 patents, we have 7M G-D pairs
- Most pairs will have zero Orange Book patents in all years
- Need Poisson or negative binomial models (not OLS)
- Statistical power will be limited
Merge rate: Not all Orange Book patents will match our data
- Some may pre-date our panel (before 2000)
- Some may not mention genes/diseases in patent text
- Expected match rate: 30-50%?
Timing: Orange Book lists patents by approval date, not grant date
- Need to decide: use approval year or grant year?
- Probably use approval year (economically relevant timing)

Extension 2: KPSS Patent Valuation Approach

Concept

Instead of counting patents, weight each patent by its estimated market value. This gives us a dollar-denominated outcome: the change in innovation value, not just innovation quantity.

The KPSS Data

Source: Kogan, Leonid, Dimitris Papanikolaou, Amit Seru, and Noah Stoffman. “Technological innovation, resource allocation, and growth.” Quarterly Journal of Economics 132.2 (2017): 665-712.

What it provides: Market value of each patent granted to publicly traded US firms (1926-2010, extended to 2018)

How values are calculated:

Use stock market reactions around patent grants
3-day event window around grant announcement
Aggregate market cap change = implied value of patent
Accounts for investor expectations about commercial value

Data location:

Original paper: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
Extended data (through 2018): Available from Noah Stoffman’s website
File format: Patent number → dollar value

Coverage: ~3-4 million patents from publicly traded firms

Methodology

Merge KPSS values with our patent-gene-disease data
Create value-weighted panel:
- Instead of counting patents: sum patent values by G-D-year
- Outcome: patent_value_gdt = Σ(KPSS value) for all patents mentioning G-D in year t
Run DID with dollar outcomes:

patent_value_gdt = β_value × post_GWAS_gdt + γ_gt + δ_dt + ε_gdt

Rescale by median G-D mentions per patent:
- Current median: ~110 mentions per patent (from patent-analysis page)
- Interpretation: β_value measures dollars per G-D mention
- Multiply by median to get “dollars per patent”

Data Sources

Primary: KPSS GitHub Repository

URL: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
Files needed:
- patent_values.csv - patent number → dollar value
- Documentation on methodology
Coverage: 1926-2010 (original), extended versions available through 2018

Extended Data (2018+)

Check Noah Stoffman’s website: https://kelley.iu.edu/nstoffma/
NBER Patent Data Project may have merged KPSS values
Kogan et al. may have unpublished updates

Alternative: Replicate KPSS Methodology

If KPSS data doesn’t cover our time period (2000-2020), we could replicate:

Get patent grant dates from USPTO
Get stock prices from CRSP for assignee firms
Calculate 3-day abnormal returns around grant dates
Multiply by market cap to get patent value

Feasibility: Medium difficulty, requires:

CRSP stock price data (available via university subscription)
Patent-firm matching (from USPTO)
Event study code (standard in finance)

Implementation Steps

Step 1: Download KPSS Data (1 hour)

# Clone GitHub repo
# Load patent_values.csv
# Check coverage: how many of our 92K patents have KPSS values?
# Check time coverage: does it extend to 2020?

Step 2: Merge with Our Data (2 hours)

# Load Full_Patent.parquet
# Merge KPSS values by patent_id
# Check match rate
# For unmatched patents: value = 0 (conservative) or drop?

Step 3: Build Value-Weighted Panel (2 hours)

# Aggregate KPSS values by G-D-year
# Create panel: G-D pairs × years
# Outcome = total dollar value of patents mentioning G-D in year t
# Add spillover treatment indicators

Step 4: Run DID Regressions (1 hour)

# Same specification as main analysis
# Outcome: patent_value instead of patent_count
# Interpret β as "dollars of innovation value created"

Step 5: Rescaling (30 minutes)

# Current median mentions per patent: 110 (from patent-analysis)
# If β_value = $1M (increase in total value per G-D pair)
# Then value per patent = $1M / 110 = $9,091 per patent
# Compare to literature benchmarks

Expected Challenges

Coverage: KPSS only includes public firms
- Our 92K patents may have low match rate with KPSS
- Expected match rate: 20-40%?
- Missing: patents from private firms, universities, individuals
- Bias: Public firm patents may be systematically different
Time period: KPSS data may not extend to 2020
- Original: 1926-2010
- Extended: possibly to 2018
- Our panel: 2000-2020
- May need to replicate methodology for recent years
Zero values: Many patents have zero estimated value in KPSS
- Stock market doesn’t react to every patent
- Zero value ≠ worthless, could mean:
  - Patent granted on non-trading day
  - Market already anticipated the patent
  - Patent has long-term value not captured in 3-day window
Skewness: Patent values are extremely right-skewed
- Mean » Median
- A few blockbuster patents dominate
- Need robust regression methods (winsorizing, quantile regression)

Extensions to KPSS: Private Firm Patents?

Question: Has anyone extended KPSS to non-public firms?

Possible sources to investigate:

Venture capital data:
- VentureXpert, Pitchbook have funding rounds
- Could estimate patent value from VC valuations?
- Requires matching patents → startups → VC deals
Acquisition data:
- When private firms are acquired, deal values become public
- Could allocate acquisition value across firm’s patent portfolio
- Sources: SDC Platinum, CapitalIQ
NPE (Non-Practicing Entity) data:
- Patent trolls buy and monetize patents
- Licensing/litigation settlements provide value signals
- Data from Stanford NPE database, RPX
Literature search:
- Google Scholar: “patent value” + “private firms” + post-2017
- Check citations to KPSS (2017) for methodological extensions
- NBER working papers on innovation valuation

Recommended: Start with public firm data (KPSS), acknowledge limitation, suggest private firm extension as future work.

Data Acquisition Plan

Orange Book Data

Timeline: 1-2 hours

Steps:

Download from FDA: https://www.fda.gov/media/76860/download
Unzip and inspect files:
- patent.txt
- products.txt
- exclusivity.txt
Parse into pandas DataFrame
Extract patent numbers and approval dates
Merge with our patent IDs (from Full_Patent.parquet)

Expected output:

~3,000 Orange Book patents
~500-1,500 matched with our BioBERT extractions (rough estimate)

KPSS Data

Timeline: 2-3 hours

Steps:

Clone GitHub repo: git clone https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
Locate patent value file (likely CSV or Stata format)
Check documentation for:
- Time coverage
- Data structure
- Known limitations
Load into pandas
Merge with our patent IDs
Calculate match rate and coverage

Expected output:

3-4M KPSS patent values (original data)
Unknown match rate with our 92K patents (need to check)
Possible coverage gaps after 2010

If KPSS data is outdated:

Option A: Use it for 2000-2010 subset of our panel
Option B: Replicate methodology for 2011-2020 (requires CRSP data)
Option C: Acknowledge limitation, treat missing = zero value

Implementation Timeline

Phase 1: Data Acquisition (1 week)

Day 1-2: Orange Book

Download FDA Orange Book data
Parse patent and product files
Document data structure
Create patent ID → drug name mapping

Day 3-4: KPSS

Clone KPSS GitHub repo
Load patent value data
Check time coverage and match rate with our patents
Document any coverage gaps

Day 5: Integration

Merge Orange Book patents with Full_Patent.parquet
Merge KPSS values with Full_Patent.parquet
Calculate descriptive statistics for both datasets
Document match rates and coverage

Phase 2: Panel Construction (1 week)

Day 1-2: Orange Book Panel

Create balanced G-D-year panel
Aggregate Orange Book patent counts by G-D-year
Add spillover treatment indicators
Calculate summary statistics

Day 3-4: KPSS Value Panel

Create balanced G-D-year panel
Aggregate KPSS patent values by G-D-year
Add spillover treatment indicators
Calculate summary statistics

Day 5: Validation

Compare panels to original main analysis
Check for data quality issues
Document any anomalies

Phase 3: Analysis (1 week)

Day 1-2: Orange Book Regressions

Run DID specification with Orange Book counts
Compare coefficients to main analysis
Test robustness (Poisson, negative binomial)
Create event study plots

Day 3-4: KPSS Value Regressions

Run DID specification with KPSS values
Rescale by median G-D mentions per patent
Test robustness (winsorizing, quantile regression)
Create event study plots

Day 5: Synthesis

Compare all three approaches:
- Main analysis (all patents)
- Orange Book (high-value patents)
- KPSS (dollar-weighted patents)
Calculate economic magnitudes
Draft results summary

Phase 4: Documentation (3 days)

Day 1: Methods

Write up Orange Book methodology
Write up KPSS methodology
Document data sources and merge procedures

Day 2: Results

Create results tables
Create figures (event studies, comparisons)
Write results summary

Day 3: Integration

Update main paper draft with new sections
Create appendix with robustness checks
Prepare slides for professor

Expected Results & Interpretation

Orange Book Results

Hypothesis: β_OB > β_all (larger effects on FDA-approved drug patents)

Interpretation if confirmed:

GWAS discoveries don’t just increase patent quantity
They specifically guide R&D toward commercially successful outcomes
Knowledge spillovers translate to market-ready products
Social value of GWAS > what patent counts alone suggest

Economic magnitude:

If β_OB = 0.3 (30% increase in Orange Book patents per treated G-D pair)
Average Orange Book patent value = $15M (literature estimate)
Value created = 0.3 × $15M = $4.5M per treated G-D pair
With 35K direct GWAS pairs: total value = 0.3 × 35K × $15M = $157.5B

KPSS Results

Hypothesis: β_value shows economically significant innovation value creation

Interpretation:

GWAS discoveries increase the market value of innovation, not just quantity
Dollar estimates are directly policy-relevant
Can compare to cost of GWAS studies (typically ~$1-5M each)
Calculate ROI of public genomics research

Economic magnitude:

If β_value = $500K (increase in patent value per treated G-D pair)
Median G-D mentions per patent = 110
Implied value per patent = $500K / 110 = $4,545
Compare to KPSS mean (~$1-2M) and median (~$50K)

Comparison Across Approaches

Approach	Outcome	Interpretation	Strengths	Limitations
Main (current)	Patent counts	Innovation quantity	Full coverage, clean	Treats all patents equally
Orange Book	FDA patent counts	Commercial success	High-value focus	Very sparse, small sample
KPSS	Patent dollar value	Market value	Economic magnitude	Public firms only, possibly outdated

Robustness check: If all three show consistent positive effects, very strong evidence that GWAS creates real economic value.

Key Questions to Resolve

Methodological

Orange Book timing: Use patent grant year or drug approval year?
- Grant year: consistent with main analysis
- Approval year: economically relevant (market entry)
- Recommendation: Try both, probably use approval year
KPSS zeros: How to handle patents with zero estimated value?
- Include as zeros (conservative)
- Drop (selection bias)
- Recommendation: Include as zeros, test robustness
Rescaling: What denominator for “per patent” calculation?
- Median mentions per patent (current approach)
- Mean mentions per patent
- Recommendation: Use median (robust to outliers)

Data

KPSS coverage: Does it extend to 2020?
- If not, subset analysis to 2000-2010?
- Or attempt to replicate for 2011-2020?
- Next step: Check GitHub repo and Stoffman’s website
Orange Book match rate: How many OB patents have BioBERT extractions?
- Expected: 30-50%?
- Next step: Run merge and check
Private firm extensions: Can we extend KPSS to non-public firms?
- Next step: Literature search + contact KPSS authors?

Success Criteria

Minimum viable product:

Orange Book panel constructed and merged
KPSS values merged (even if partial coverage)
DID regressions run for both approaches
Results compared to main analysis
Economic magnitudes calculated

Stretch goals:

Extend KPSS to recent years (2011-2020)
Find private firm patent valuations
Calculate formal ROI of GWAS research
Compare to Azoulay et al. estimates for NIH grants

References

Core Papers

Azoulay et al. (2019): Public R&D Investments and Private-Sector Patenting

Review of Economic Studies 86(1): 117-152
Section 5.4 on Orange Book patents (page 146)
Shows NIH grants have larger effects on FDA-approved drug patents

Kogan et al. (2017): Technological Innovation, Resource Allocation, and Growth

Quarterly Journal of Economics 132(2): 665-712
Creates patent value estimates using stock market reactions
Data: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data

Data Sources

FDA Orange Book:

Homepage: https://www.fda.gov/drugs/drug-approvals-and-databases/approved-drug-products-FDA-approved-drugs-orange-book
Download: https://www.fda.gov/media/76860/download
Updated: Monthly

KPSS Patent Values:

GitHub: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
Contact: Noah Stoffman (Kelley School of Business, Indiana University)

Methodological References

Patent valuation literature:

Harhoff et al. (1999): Citation frequency and patent value. Review of Economics and Statistics
Trajtenberg (1990): Economic Analysis of Product Innovation. Harvard University Press
Hall, Jaffe, Trajtenberg (2005): Market value and patent citations. RAND Journal

Next Steps (Immediate)

This Week

Download Orange Book data (2 hours)
- Parse patent and product files
- Document structure
- Check match rate with our patents
Download KPSS data (2 hours)
- Clone GitHub repo
- Load patent values
- Check coverage (time period and match rate)
- Contact authors if needed for updated data
Preliminary merges (2 hours)
- Merge Orange Book patents → Full_Patent.parquet
- Merge KPSS values → Full_Patent.parquet
- Calculate match rates
- Document coverage statistics
Update professor (1 hour)
- Report match rates and coverage
- Flag any data availability issues
- Confirm methodology before proceeding to panel construction

Next Week

Build Orange Book panel
Build KPSS value panel
Run preliminary regressions
Draft results summary

Questions for Professor

Before proceeding, confirm:

Orange Book timing: Patent grant year or drug approval year for treatment timing?
KPSS coverage: If data only goes to 2010, subset our analysis to 2000-2010 for comparability?
Sparsity: Orange Book panel will be very sparse. Okay to use Poisson/NB models instead of OLS?
Comparison: Present all three approaches (count, Orange Book, KPSS) as complementary robustness checks, or separate analyses?
Azoulay pricing: Use their literature-based patent values, or develop our own estimates?

Implementation Status

📊 See: Orange Book Progress Report for detailed status including:

What we’ve accomplished (Days 1-4 complete)
Schemas discovered (tilde-delimited format, composite keys)
Current struggles (network restrictions, KPSS time coverage uncertainty)
What data we have vs. don’t have (no actual files yet, need manual download)
Next steps once data is available

This plan will be updated as we acquire data and learn about coverage/match rates.