Orange Book & KPSS Extension: Progress Report
Orange Book & KPSS Extension: Progress Report
Last Updated: December 28, 2025
Executive Summary
This document tracks progress on extending the GWAS spillover project to value patents rather than just count them. We’re implementing two approaches inspired by Azoulay et al. (2019):
- Orange Book Approach: Focus on FDA-approved drug patents (highest-value patents)
- KPSS Approach: Weight patents by market value (dollar-denominated outcomes)
Current Status: Days 1-4 complete - data exploration framework ready, awaiting manual data downloads
Key Achievement: Discovered Orange Book uses tilde-delimited format with composite keys, ready to merge with USPTO data
Critical Unknown: KPSS time coverage may not extend to our 2000-2020 analysis period
Background: The Original Plan
Motivation
Our current GWAS spillover analysis measures patent quantity (counts). But not all patents are equally valuable. Some protect FDA-approved blockbuster drugs worth billions, while others gather dust.
Azoulay et al. (2019) showed that NIH funding has larger effects on high-value patents than on average patents. We want to test if GWAS discoveries have similar heterogeneous effects.
Two Valuation Approaches
| Approach | What It Measures | Data Source | Expected Outcome |
|---|---|---|---|
| Orange Book | Count of FDA-approved drug patents | FDA Orange Book | β_OB > β_all (larger effects on high-value patents) |
| KPSS | Dollar value of patents (market-based) | Kogan et al. (2017) | Direct economic magnitude in dollars |
Why both?
- Orange Book: Very high-value patents, but extremely sparse (~3K patents vs. 92K in our data)
- KPSS: Broader coverage (3-4M patents), but only public firms, possible time gap
What We’ve Accomplished (Days 1-4)
✅ Infrastructure Built
- Exploration Scripts Created
/data/orange_book/explore_orange_book.py- Analyzes FDA Orange Book structure/data/kpss/explore_kpss.py- Analyzes KPSS patent values- Both scripts auto-detect delimiters, document schemas, check FK/PK relationships
- Documentation Framework
/data/orange_book/README.md- Orange Book data sources and schemas/data/kpss/README.md- KPSS data sources and challenges/data/DATA_ACQUISITION_STATUS.md- Overall status and next steps/data/ONTOLOGY_AND_SCHEMA_QUESTIONS.md- Detailed schema analysis plan
- Website Integration
- Created Orange Book plan page with full methodology
- Added to research page navigation
- This progress report page
✅ Key Discoveries: Orange Book Schema
Critical Finding: FDA Orange Book uses tilde-delimited format (~ separator)
File Structure
Three core files:
patent.txt- Patent numbers and expiration datesproducts.txt- Drug names, active ingredients, approval datesexclusivity.txt- Market exclusivity periods
Primary Key: Composite Key Structure
All three files link via: Appl_Type + Appl_No + Product_No
Products (drug details, ingredients, approval dates)
↓ [Appl_Type + Appl_No + Product_No]
Patents (patent numbers, expiration dates)
↓ [Appl_Type + Appl_No + Product_No]
Exclusivity (exclusivity periods)
Schema Details
patent.txt columns:
Appl_Type~Appl_No~Product_No~Patent_No~Patent_Expire_Date_Text~
Drug_Substance_Flag~Drug_Product_Flag~Patent_Use_Code~Delist_Flag~Submission_Date
Key fields:
Patent_No: USPTO patent number (7-digit format, e.g., “7625884”) ✅ Matches our data!Appl_Type: Application type (N = NDA, A = ANDA)Patent_Expire_Date_Text: Patent expiration dateDrug_Substance_Flag: Y = chemical compound patentDrug_Product_Flag: Y = formulation patent
products.txt columns:
Ingredient~DF;Route~Trade_Name~Applicant~Strength~Appl_Type~Appl_No~
Product_No~TE_Code~Approval_Date~RLD~RS~Type~Applicant_Full_Name
Key fields:
Ingredient: Active ingredient name ⭐ This maps to genes!- Examples: “BUDESONIDE”, “MINOCYCLINE HYDROCHLORIDE”
- Generic/chemical names (good for mapping)
Trade_Name: Brand nameApproval_Date: FDA approval date (economically relevant timing)Applicant_Full_Name: Company name
exclusivity.txt columns:
Appl_Type~Appl_No~Product_No~Exclusivity_Code~Exclusivity_Date
Patent Number Format
✅ CONFIRMED: Orange Book patent numbers match USPTO 7-digit format
- Example:
7625884,8455524 - No “US” prefix, no formatting issues observed
- Should merge directly with our
Full_Patent.parquetdata
Data We Have vs. Don’t Have
✅ What We Have
Code & Infrastructure:
- Exploration scripts ready to run
- Patent number standardization functions
- Merge strategy planned
- Schema documentation complete
- Composite key handling logic
Existing Data Assets (from main GWAS project):
Full_Patent.parquet: 92,611 unique patents with BioBERT extractions- Gene-disease-patent mappings at 99% probability threshold
- Panel structure: 7M gene-disease pairs × 21 years (2000-2020)
- GWAS treatment indicators (direct hits, 1-hop, 2-hop, 3-hop spillovers)
Understanding:
- Orange Book schema fully documented
- FK/PK relationships identified
- Patent number format compatibility confirmed
- Merge strategy designed
❌ What We DON’T Have
Critical Missing Data:
- Actual Orange Book data files 🔴
- Need to download from: https://www.fda.gov/media/76860/download
- ~5-10 MB download
- Cannot download programmatically due to network restrictions
- User must manually download
- Actual KPSS patent value data 🔴
- Need to download from: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
- Check Noah Stoffman’s website for extended data (2011-2018+)
- User must manually download
- KPSS time coverage confirmation 🚨 CRITICAL
- Original KPSS data: 1926-2010
- Our analysis needs: 2000-2020
- Gap: Unknown if data extends to 2011-2020
- If not available: Lose half our panel OR need to replicate methodology
Empirical Unknowns (need data to answer):
- Orange Book match rate with our 92K patents (expected: 3-5%, ~1,500-3,000 patents)
- KPSS match rate with our patents (expected: 20-40%, ~18K-37K patents)
- Orange Book date format and completeness
- KPSS value distribution (zeros, outliers, skewness)
- Active ingredient naming conventions in Orange Book
Current Struggles & Challenges
1. Network Restrictions Prevent Automated Downloads
Problem: Cannot use wget, curl, or web scraping to download data
Impact:
- Delays Days 1-4 completion
- Cannot run exploration scripts until data available
- Cannot validate schema assumptions
Solution: User must manually download files
Status: ⏸️ Blocked - awaiting manual downloads
2. KPSS Time Coverage Uncertainty 🚨 CRITICAL
Problem: Original KPSS data covers 1926-2010, but our panel needs 2000-2020
Impact:
- If data stops at 2010: We lose half our panel (2011-2020)
- If extended data exists but stops at 2018: Still missing 2019-2020
- If no extended data: Must replicate methodology (requires CRSP, 1-2 weeks work)
What we need to check immediately:
- Does KPSS GitHub have extended data?
- Does Noah Stoffman’s website have 2011-2018+ data?
- Has anyone published replication packages with recent data?
Contingency options:
- Option A: Restrict analysis to 2000-2010 (11 years, lose recent GWAS discoveries)
- Option B: Find extended KPSS data (ideal if it exists)
- Option C: Replicate KPSS methodology for 2011-2020 (requires CRSP access, significant work)
- Option D: Abandon KPSS approach, focus only on Orange Book (simpler but less informative)
Status: 🔴 Critical unknown - must check when downloading data
3. Orange Book Match Rate Uncertainty
Problem: Unknown how many of our 92K patents will be in Orange Book
Expected match rate: 3-5% (Orange Book has ~3,000 patents, we have 92K)
Why low match rate is expected:
- Orange Book only includes patents for FDA-approved drugs
- Most patents never reach FDA approval
- This is the POINT - we’re focusing on high-value patents
Potential issues:
- Very sparse panel (most G-D-year cells will have zero Orange Book patents)
- Low statistical power
- Need Poisson or negative binomial regression (not OLS)
Mitigation:
- This is expected and acceptable for Orange Book approach
- Sparsity is informative (shows how rare FDA-approved patents are)
- Can still test if GWAS has larger effects on these rare, valuable patents
Status: ⏸️ Expected challenge, need actual data to confirm
4. Active Ingredient → Gene Mapping Complexity
Problem: Orange Book has drug names (e.g., “BUDESONIDE”), we need genes
Challenge: Must map drug active ingredients to gene targets
Mapping strategy:
- Best case: If Orange Book has UNII codes
- UNII → DrugBank → Gene targets
- Clean, structured mapping
- Good case: Generic names available (confirmed in schema)
- Generic name → DrugBank → Gene targets
- May have spelling variations
- Backup case: Use disease indication instead of gene target
- Drug → Disease (from Orange Book indications)
- Match to our disease entities
- Less direct but still valid
External databases needed:
- DrugBank: Drug → Gene target mapping (best resource)
- DGIdb: Drug-Gene Interaction database (alternative)
- ChEMBL: Chemical → Target mapping (if needed)
Status: ⏸️ Planned approach ready, need actual data to test
5. Combination Products Handling
Problem: Some drugs have multiple active ingredients (e.g., aspirin + caffeine)
Unknown: How are combination products represented in Orange Book?
Possible formats:
- Separate rows per ingredient (creates duplicates when counting)
- Concatenated string: “aspirin; caffeine” (need to parse)
- Separate columns: Ingredient_1, Ingredient_2 (wide format)
Implication:
- Combination drugs map to multiple genes (many-to-many)
- Need to decide: Include all genes or just first listed?
Status: ⏸️ Will discover when running exploration script
6. Public Firm Bias in KPSS
Problem: KPSS only covers patents from publicly traded firms
Missing coverage:
- University patents (important for basic research!)
- Private company patents (biotech startups)
- Individual inventor patents
- Foreign firm patents (if not traded on US exchanges)
Expected match rate: Only 20-40% of our patents
Selection bias:
- Public firms may have:
- Larger R&D budgets
- More commercially oriented patents
- Higher average patent quality
- Different response to GWAS discoveries
Implication: KPSS estimates only apply to public firm patenting
Mitigation:
- Acknowledge limitation in writeup
- Compare Orange Book (all firms) vs. KPSS (public only)
- Suggest private firm valuation as future work
- Could investigate VC funding data as alternative
Status: ⚠️ Known limitation, acceptable for first pass
7. Date Field Ambiguity
Problem: Multiple relevant dates for each patent
Date options:
- Patent grant date (from USPTO)
- When patent was granted
- Consistent with main analysis
- Drug approval date (from Orange Book)
- When drug reached market
- Economically relevant (revenue starts)
- Can be years after patent grant
- Patent expiration date (from Orange Book)
- When patent protection ends
- Less relevant for our timing
Critical decision needed: Which date to use for panel timing?
Recommendation:
- Approval date for Orange Book analysis
- It’s the “high value” moment (FDA approval)
- More economically meaningful than grant date
- Compare both as robustness check
Status: ⏸️ Decision planned, will test both approaches
Detailed Schema Documentation
Orange Book: Composite Key Structure
Primary Key: (Appl_Type, Appl_No, Product_No)
Relationships:
- One application can have multiple products (different strengths/formulations)
- One product can have multiple patents (compound, formulation, use patents)
- Many-to-many: Patents ↔ Products
Cardinality tests needed:
# Expected: One application → Many patents (formulation, use, process)
patents_per_app = patent_df.groupby('Appl_No')['Patent_No'].nunique()
print(f"Mean patents per application: {patents_per_app.mean():.2f}")
# Expected: One patent → Multiple products (different indications)
products_per_patent = merged.groupby('Patent_No')['Product_No'].nunique()
print(f"Mean products per patent: {products_per_patent.mean():.2f}")
Implications for our analysis:
- Cannot simply count “distinct patents” (may double-count)
- Need to aggregate at patent level, then merge to products
- Or aggregate at product level, then count unique patents per G-D pair
KPSS: Expected Schema
Primary Key: Patent_No (expected, need to verify)
Expected columns:
patent_no: USPTO patent numbervalue: Dollar value (market-based estimate)year: Grant yearpermno: CRSP firm identifier (optional)
Value distribution characteristics:
- Log-normal distribution:
- Mean » Median (blockbuster patents skew distribution)
- Need to handle outliers (winsorize at 95th or 99th percentile)
- Zero inflation:
- 30-50% of patents may have value = $0
- Interpretation: Market didn’t react (not necessarily worthless)
- Need Tobit regression or similar to handle censoring
- Negative values (possible):
- “Bad news” patent (signals overinvestment)
- Unexpected patent in mature technology
- Need to check if present and how to handle
Statistical approach for skewed data:
- Log transformation:
log(1 + value) - Winsorization: Cap at 95th or 99th percentile
- Quantile regression: Estimate effects at different percentiles
- Report both mean and median effects
Next Steps: What Happens When We Get Data
Immediate (1-2 hours after download)
Step 1: Run exploration scripts
cd /home/user/cguth7.github.io/data/orange_book
python explore_orange_book.py > exploration_output.txt
cd ../kpss
python explore_kpss.py > exploration_output.txt
Step 2: Check critical unknowns
- ✅ or ❌ KPSS extends to 2015+? (If ❌ → major problem)
- ✅ or ❌ Orange Book has clean ingredient names? (If ❌ → harder mapping)
- ✅ or ❌ Patent numbers standardizable? (If ❌ → manual matching)
Step 3: Calculate match rates
# Load our patent data
our_patents = pd.read_parquet('Full_Patent.parquet')
# Orange Book match
ob_merge = our_patents.merge(orange_book, on='patent_id', how='inner')
print(f"Orange Book match rate: {len(ob_merge) / len(our_patents):.1%}")
print(f"Matched patents: {len(ob_merge):,}")
# KPSS match
kpss_merge = our_patents.merge(kpss, on='patent_id', how='left')
print(f"KPSS match rate: {kpss_merge['value'].notna().sum() / len(our_patents):.1%}")
print(f"Matched patents: {kpss_merge['value'].notna().sum():,}")
Short-term (Day 5 of plan)
Step 4: Build Orange Book panel
- Create balanced G-D-year panel (same structure as main analysis)
- Aggregate Orange Book patent counts by G-D-year
- Most cells will be zero (sparse panel)
- Add spillover treatment indicators
Step 5: Build KPSS value panel
- Create G-D-year panel
- Aggregate KPSS patent values by G-D-year (sum of values)
- Handle zeros and outliers
- Add spillover treatment indicators
Step 6: Run preliminary DID regressions
# Orange Book approach
reghdfe ob_patent_count post_gwas, absorb(gene_year disease_year) cluster(gene disease)
# KPSS approach
reghdfe patent_value post_gwas, absorb(gene_year disease_year) cluster(gene disease)
Step 7: Compare coefficients
- β_all (main analysis, all patents)
- β_OB (Orange Book, FDA-approved patents)
- β_value (KPSS, dollar-weighted)
Expected result: β_OB > β_all (GWAS has larger effects on high-value patents)
Red Flags: When to STOP and Discuss
🚨 Stop and discuss with user if:
- KPSS data ends before 2015
- Missing too much of our panel
- Need to find extended data, replicate methodology, or abandon KPSS
- Orange Book match rate < 1%
- Suggests patent number format mismatch
- Or our patents aren’t FDA-approved drugs (verify expected)
- KPSS match rate < 10%
- Lower than expected (expected: 20-40%)
- May not have enough statistical power
- Consider focusing on Orange Book only
- Orange Book has no active ingredient field
- Makes gene mapping very difficult
- Need to rely on manual curation or disease indication
Questions for User
Before proceeding, we need answers to:
- Data Access:
- Can you manually download Orange Book data? (https://www.fda.gov/media/76860/download)
- Can you check KPSS GitHub and Stoffman’s website for extended data?
- Do you have access to CRSP? (If we need to replicate KPSS for 2011-2020)
- Methodological Choices:
- Orange Book timing: Patent grant year or drug approval year?
- KPSS coverage: If data stops at 2010, use 2000-2010 subset or replicate?
- KPSS zeros: Include as zeros (conservative) or drop?
- Scope:
- If KPSS data isn’t available for 2011-2020:
- Option A: Use 2000-2010 only (quick, limited scope)
- Option B: Replicate methodology (2 weeks, full coverage)
- Which do you prefer?
- If KPSS data isn’t available for 2011-2020:
- Extensions:
- Should we investigate private firm patent valuations? (VC data, acquisition prices)
- Or accept public-firm-only limitation for first pass?
Timeline: Once Data Is Available
Day 1 (2-4 hours):
- Run exploration scripts
- Document actual schemas
- Check KPSS time coverage (critical!)
- Calculate match rates
Day 2 (2-3 hours):
- Build Orange Book panel
- Handle combination products
- Test preliminary merges
Day 3 (2-3 hours):
- Build KPSS value panel
- Handle outliers and zeros
- Test preliminary merges
Day 4 (2-3 hours):
- Run DID regressions for both approaches
- Compare coefficients to main analysis
- Create event study plots
Day 5 (2-3 hours):
- Calculate economic magnitudes
- Rescale KPSS effects (dollars per patent)
- Draft results summary
Total: ~5 days of work once data is available
Success Criteria
Minimum viable product:
- ✅ Orange Book panel constructed
- ✅ KPSS values merged (even if partial coverage)
- ✅ DID regressions run for both approaches
- ✅ Results compared to main analysis
- ✅ Economic magnitudes calculated
Would be great:
- 🎯 KPSS extends to 2018+
- 🎯 Match rates > 20% for both datasets
- 🎯 Clean ingredient → gene mapping
- 🎯 β_OB > β_all (larger effects on high-value patents)
Stretch goals:
- 🚀 Extend KPSS to 2020
- 🚀 Find private firm patent valuations
- 🚀 Calculate formal ROI of GWAS research
- 🚀 Compare to Azoulay et al. estimates
References & Resources
Papers
Azoulay et al. (2019): “Public R&D Investments and Private-Sector Patenting”
- Review of Economic Studies 86(1): 117-152
- Section 5.4 on Orange Book patents (page 146)
- Shows NIH grants have larger effects on FDA-approved drug patents
- Link to plan page for full methodology
Kogan et al. (2017): “Technological Innovation, Resource Allocation, and Growth”
- Quarterly Journal of Economics 132(2): 665-712
- Creates patent value estimates using stock market reactions
- GitHub: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
Data Sources
FDA Orange Book:
- Homepage: https://www.fda.gov/drugs/drug-approvals-and-databases/approved-drug-products-FDA-approved-drugs-orange-book
- Download: https://www.fda.gov/media/76860/download
- Updated: Monthly
- Format: Tilde-delimited text files
KPSS Patent Values:
- GitHub: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data
- Extended data: Check Noah Stoffman’s website (https://kelley.iu.edu/nstoffma/)
- Original coverage: 1926-2010
- Extended: Possibly to 2018+
Drug-Gene Mapping Databases
DrugBank (best resource):
- URL: https://www.drugbank.ca/
- Free academic version available
- Drug → Gene target mapping
DGIdb (alternative):
- URL: http://www.dgidb.org/
- Free, open-source
- Drug-Gene Interaction database
ChEMBL (if needed):
- URL: https://www.ebi.ac.uk/chembl/
- Chemical → Target mapping
Summary: Where We Stand
✅ Completed (Days 1-4):
- Exploration scripts ready
- Orange Book schema fully documented
- Composite key structure identified
- Patent number format compatibility confirmed
- Merge strategy designed
- Website integration complete
🔴 Blocked On:
- Manual data downloads (Orange Book, KPSS)
- KPSS time coverage unknown (critical!)
⏳ Next Actions:
- User downloads Orange Book data
- User downloads KPSS data (check for extended versions!)
- Run exploration scripts
- Assess feasibility based on actual coverage
- Build panels and run regressions
📊 Expected Outcome:
- Orange Book: ~1,500-3,000 matched patents (3-5% match rate)
- KPSS: ~18K-37K matched patents (20-40% match rate)
- Statistical power sufficient for both approaches
- Evidence that GWAS has larger effects on high-value patents
See also:
- Original Orange Book Plan - Full methodology and implementation plan
- Patent Analysis - Current patent data quality analysis
- GWAS Results - Main analysis results
This progress report documents Days 1-4 of the Orange Book extension plan. Once data is available, we’ll update with actual match rates, panel statistics, and regression results.