Orange Book & KPSS Data Acquisition Status

Last Updated: December 28, 2025
Status: Awaiting manual data downloads

Summary

We’re working on Days 1-4 of the Orange Book extension plan, which involves acquiring and exploring two key datasets:

FDA Orange Book - Patents for FDA-approved drugs (~3K patents)
KPSS Patent Values - Market-based patent valuations (3-4M patents)

Both datasets require manual downloads due to network restrictions. Exploration scripts are ready to run once data is available.

What We’ve Prepared

✅ Ready to Use

Orange Book Exploration
- Location: /data/orange_book/
- README documenting data sources and key questions
- Python exploration script: explore_orange_book.py
- Will analyze: data structure, FK/PK relationships, patent formats, ontologies
KPSS Exploration
- Location: /data/kpss/
- README documenting data sources and challenges
- Python exploration script: explore_kpss.py
- Will analyze: value distributions, time coverage, match rates
Site Integration
- Added Orange Book plan to research page: /research/index.html
- Plan accessible at: /orange-book-plan

What We Need From You

🔴 Critical Data Downloads

1. Orange Book Data

Download URL: https://www.fda.gov/media/76860/download

Steps:

Click the link above (or go to https://www.fda.gov/drugs/drug-approvals-and-databases/orange-book-data-files)
Download the ZIP file (should be ~5-10 MB)
Save as: /data/orange_book/orange_book.zip
Run: python data/orange_book/explore_orange_book.py

What we’ll get:

patent.txt - Patent numbers, expiration dates, approval numbers
products.txt - Drug names, active ingredients, approval dates
exclusivity.txt - Market exclusivity info
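
As a minimal sketch of the first exploration step, the snippet below reads every .txt member of the ZIP into a list of row dictionaries and reports row counts and column names. It assumes the files are tilde-delimited with a header row, which should be confirmed on the actual download. `load_orange_book` and `summarize` are hypothetical helper names, not functions from explore_orange_book.py.

```python
import csv
import io
import zipfile


def load_orange_book(zip_source, delimiter="~"):
    """Read each .txt member of the Orange Book ZIP into a list of dict rows.

    zip_source may be a path or a file-like object. The tilde delimiter and
    the header row are assumptions to verify against the real files.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".txt"):
                continue  # skip anything that is not a data table
            with archive.open(name) as raw:
                text = io.TextIOWrapper(
                    raw, encoding="utf-8", errors="replace", newline=""
                )
                # QUOTE_NONE keeps stray quote characters in drug names literal
                reader = csv.DictReader(
                    text, delimiter=delimiter, quoting=csv.QUOTE_NONE
                )
                tables[name] = list(reader)
    return tables


def summarize(tables):
    """Return {file name: (row count, column names)} for a quick schema check."""
    summary = {}
    for name, rows in tables.items():
        columns = list(rows[0].keys()) if rows else []
        summary[name] = (len(rows), columns)
    return summary
```

Calling `summarize(load_orange_book("data/orange_book/orange_book.zip"))` would give a first look at the schema before any merging. The sketch uses only the standard library so it can run before the pandas environment is set up.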

Key questions to answer:

What’s the patent number format? (Need to match our USPTO IDs)
How many patents total? (~3,000 expected)
What are the FK relationships? (How do patents link to products?)
How are active ingredients named? (Need to map to genes)
What date fields exist? (Grant date vs. approval date)

2. KPSS Patent Values

Download URL: https://github.com/KPSS2017/Technological-Innovation-Resource-Allocation-and-Growth-Extended-Data

Steps:

Go to the GitHub repo
Look for patent value data files (likely CSV or Stata format)
Download to: /data/kpss/
ALSO CHECK: Noah Stoffman’s website for extended data (2011-2018): https://kelley.iu.edu/nstoffma/
Run: python data/kpss/explore_kpss.py

What we’ll get:

Patent numbers
Dollar values (market-based estimates)
Grant years
Possibly: firm identifiers, confidence intervals

CRITICAL QUESTIONS to answer:

⚠️ Does the data extend to 2020? (Original paper: 1926-2010)
If not, is there extended data available? (Check Stoffman’s site)
What % of our 92K patents will match? (KPSS only has public firms)
How many zero values? (Market didn’t react to patent)
What’s the value distribution? (Mean, median, outliers)

Critical Unknowns (Need Data to Answer)

Orange Book Ontology Questions

Patent Number Format
- Do they use “US1234567” or just “1234567”?
- Will they match our patent IDs in Full_Patent.parquet?
- Need to standardize before merging
Foreign Key Structure
- Primary key in products.txt: Appl_No? Drug_Name?
- How many products per patent? (One-to-many?)
- How many patents per product? (Many-to-many?)
Active Ingredient Naming
- Generic names or brand names or both?
- Are there ontology codes? (RxNorm, UNII, ChEMBL)
- How to map ingredients → genes? (Will need drug-target database)
Temporal Coverage
- What’s the earliest approval date?
- Are all currently approved drugs included?
- Historical approvals that were later withdrawn?
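
The foreign key questions above can be checked with a small cardinality count once patent.txt is loaded. This is a sketch only: the column names `Appl_No`, `Product_No`, and `Patent_No` are assumptions about the file layout, and `link_cardinality` is a hypothetical helper.

```python
from collections import defaultdict


def link_cardinality(rows, product_keys=("Appl_No", "Product_No"),
                     patent_key="Patent_No"):
    """Count how patents and products link to each other.

    rows is an iterable of dicts (one per patent listing). The default
    column names are assumptions and may need adjusting to the real schema.
    """
    patents_by_product = defaultdict(set)
    products_by_patent = defaultdict(set)
    for row in rows:
        product = tuple(row[key] for key in product_keys)
        patent = row[patent_key]
        patents_by_product[product].add(patent)
        products_by_patent[patent].add(product)
    return {
        "products": len(patents_by_product),
        "patents": len(products_by_patent),
        "max_patents_per_product": max(
            (len(v) for v in patents_by_product.values()), default=0
        ),
        "max_products_per_patent": max(
            (len(v) for v in products_by_patent.values()), default=0
        ),
    }
```

If both maxima come back greater than 1, the relationship is many-to-many and any merge onto our patent table will need an explicit aggregation rule.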

KPSS Coverage Questions

Time Period 🔴 MOST CRITICAL
- Original data: 1926-2010
- Our analysis needs: 2000-2020
- Gap: 2011-2020 coverage unknown
- If data stops at 2010: We lose half our panel!
- Action: Check for extended data immediately
Match Rate
- KPSS has 3-4M patents total
- Our data has 92K unique patents
- But KPSS only covers public firms
- What % of pharma/biotech patents are from public firms?
- Expected: 20-40% match rate (need to verify)
Selection Bias
- Public firms may be systematically different:
  - Larger, more R&D-intensive
  - More commercially oriented
  - Higher quality patents
- If GWAS affects public and private firms differently, estimates from the public-firm sample will be biased
- Need to discuss: Is this acceptable?
Zero Values
- Many patents have value = $0 (market didn’t react)
- How many zeros?
- Include as zeros or exclude from analysis?
- May need to use Tobit regression or similar
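
A rough sketch of the coverage check we would run first on the KPSS file is below. It takes (grant year, value) pairs, so it does not depend on the actual column names, which we do not know yet. `kpss_coverage` is a hypothetical helper, and the 2000-2020 defaults come from our panel window.

```python
import statistics


def kpss_coverage(records, panel_start=2000, panel_end=2020):
    """Summarize time coverage and value distribution.

    records is an iterable of (grant_year, value) pairs extracted from the
    KPSS file; how to extract them depends on the real column names.
    """
    records = list(records)
    years = [year for year, _ in records]
    values = [value for _, value in records]
    covered = set(years)
    # Panel years with no KPSS patents at all
    missing = [y for y in range(panel_start, panel_end + 1) if y not in covered]
    return {
        "first_year": min(years),
        "last_year": max(years),
        "missing_panel_years": missing,
        "zero_share": sum(1 for v in values if v == 0) / len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
```

A non-empty `missing_panel_years` list answers the most critical question directly, and `zero_share` plus the mean/median gap indicate how much the zero and skewness issues will matter.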

Likely Data Issues We’ll Encounter

Orange Book

Low Match Rate
- ~3,000 OB patents vs. 92,000 in our data
- Match rate: about 3% at most (3,000 / 92,000 ≈ 3.3%), lower if some OB patents fall outside our data
- Most of our patents won’t be in Orange Book
- Solution: This is fine! OB is meant to be sparse (only FDA-approved drugs)
Patent Number Formatting
- May need to strip prefixes, add prefixes, or reformat
- Common issue when merging USPTO data
- Solution: Try multiple format variations
Multiple Patents Per Drug
- One drug can have 10+ patents (formulation, use, process)
- How to aggregate? Sum? Max? First patent only?
- Decision needed: Discuss with professor
Active Ingredient → Gene Mapping
- OB has drug names, we need genes
- Will require external database (e.g., DrugBank, DGIdb)
- May not have perfect coverage
- Backup plan: Use disease indication instead of target gene
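
For the formatting issue, one option is to normalize both sides to bare digits before merging rather than trying format variations one by one. The sketch below assumes utility patents only. It returns None for anything else (reissue, design, plant numbers) so those can be reviewed by hand. `normalize_patent_number` is a hypothetical helper, and the input formats it handles are guesses until we see the data.

```python
import re


def normalize_patent_number(raw):
    """Reduce a utility patent number to bare digits.

    Example: 'US 7,625,884 B2' becomes '7625884'. Returns None when the
    input does not look like a plain utility patent number.
    """
    text = str(raw).upper().strip()
    text = re.sub(r"^US", "", text)        # drop country prefix
    text = re.sub(r"[\s,./-]", "", text)   # drop separators
    text = re.sub(r"[A-Z]\d?$", "", text)  # drop trailing kind code, e.g. B2
    match = re.fullmatch(r"0*(\d+)", text)  # strip leading zeros
    return match.group(1) if match else None
```

Applying the same function to the Orange Book patent column and to the IDs in Full_Patent.parquet would make the merge key comparable on both sides, whatever format each source uses.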

KPSS

Time Coverage Gap 🔴
- If data ends at 2010, we have a problem
- Options:
  - Use 2000-2010 subset only (lose 10 years)
  - Find extended data (2011-2018 exists?)
  - Replicate methodology ourselves (need CRSP access)
- Critical: Check this FIRST when downloading
Public Firm Bias
- Only 20-40% of patents may match (public firms only)
- Missing: Universities (important!), private firms, individuals
- Implication: Estimates only apply to public firm patenting
- Mitigation: Discuss limitation in paper, suggest as future work
Extreme Skewness
- Patent values are log-normally distributed
- Mean ≫ Median (due to blockbusters)
- A few huge values will dominate
- Solutions:
  - Winsorize at 95th or 99th percentile
  - Log transformation
  - Quantile regression
  - Report both mean and median effects
Many Zeros
- 30-50% of patents may have value = 0
- Interpretation unclear (truly worthless or measurement error?)
- Solutions:
  - Include as zeros (conservative)
  - Drop and note selection (less conservative)
  - Tobit regression (accounts for censoring)
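
Two of the skewness fixes listed above are simple enough to sketch now. The winsorization here uses a nearest-rank cutoff, which is a simplification chosen for clarity and not necessarily what we will use in the final analysis. The log transform uses log(1 + value) so that zero-valued patents stay in the sample. Both function names are hypothetical.

```python
import math


def winsorize_upper(values, quantile=0.99):
    """Cap values above the given quantile (nearest-rank cutoff)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(quantile * len(ordered)))
    cap = ordered[rank - 1]
    return [min(value, cap) for value in values]


def log_transform(values):
    """Apply log(1 + value) so zeros map to zero instead of being dropped."""
    return [math.log1p(value) for value in values]
```

Running the regressions on raw, winsorized, and log-transformed values side by side would show how much the blockbuster patents drive the results.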

Once We Have the Data: Next Steps

Immediate (Days 1-4 of Plan)

Run Exploration Scripts

cd data/orange_book
python explore_orange_book.py > exploration_output.txt

cd ../kpss
python explore_kpss.py > exploration_output.txt

Answer Key Questions
- Document all schemas
- Identify FK/PK relationships
- Document ontologies
- Check time coverage (KPSS especially!)
- Calculate descriptive stats
Create Data Documentation
- Schema diagrams
- FK relationship maps
- Format conversions needed
- Known data issues
- Match rate estimates

Short-Term (Day 5 of Plan)

Test Merges

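import pandas as pd

# Load our patent data. `orange_book` and `kpss` are assumed to be DataFrames
# already loaded by the exploration steps, with IDs standardized to `patent_id`.
our_patents = pd.read_parquet('Full_Patent.parquet')

# Try merging with Orange Book. Deduplicate first: a patent listed for several
# products would otherwise be counted more than once and inflate the rate.
ob_merge = our_patents.merge(orange_book.drop_duplicates('patent_id'), on='patent_id', how='inner')
print(f"Orange Book match rate: {len(ob_merge) / len(our_patents):.1%}")

# Try merging with KPSS (left join keeps every one of our patents)
kpss_merge = our_patents.merge(kpss.drop_duplicates('patent_id'), on='patent_id', how='left')
print(f"KPSS match rate: {kpss_merge['value'].notna().sum() / len(our_patents):.1%}")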

Report to User
- Match rates
- Coverage statistics
- Data quality issues
- Feasibility assessment
- Recommendations for proceeding

Red Flags to Watch For

🚨 STOP and discuss if we find:

KPSS data ends before 2015
- Means we’re missing too much of our panel
- Need to either:
  - Find extended data
  - Replicate methodology
  - Abandon KPSS approach
Orange Book match rate < 1%
- Suggests patent number format mismatch
- Or most of our patents do not cover FDA-approved drugs (expected, but verify)
- May need alternative matching strategy
Orange Book has no active ingredient field
- Makes gene mapping very difficult
- May need to rely on drug name → PubChem → gene lookups
- Much messier
KPSS match rate < 10%
- Lower than expected
- May not have enough statistical power
- Consider focusing on Orange Book only
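
The four stop conditions above could be encoded as a single check that runs at the end of the exploration scripts. This is only a sketch: `red_flags` is a hypothetical function, its inputs would come from the coverage and match-rate steps, and the thresholds are the ones listed above.

```python
def red_flags(kpss_last_year, ob_match_rate, kpss_match_rate,
              ob_has_ingredient_field):
    """Return the list of stop-and-discuss conditions that were triggered.

    Match rates are fractions (0.03 means 3%).
    """
    flags = []
    if kpss_last_year < 2015:
        flags.append("KPSS data ends before 2015")
    if ob_match_rate < 0.01:
        flags.append("Orange Book match rate < 1%")
    if not ob_has_ingredient_field:
        flags.append("Orange Book has no active ingredient field")
    if kpss_match_rate < 0.10:
        flags.append("KPSS match rate < 10%")
    return flags
```

An empty list means we can proceed. Anything else should go into the report before further work.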

Questions for You

Before proceeding, we should discuss:

CRSP Access: Do you have access to CRSP stock price data?
- Needed if we have to replicate KPSS for 2011-2020
- Standard at universities but need to confirm
Drug-Target Databases: Do you have access to DrugBank or DGIdb?
- Needed to map Orange Book active ingredients → genes
- Some are free, some require subscription
Time Budget: If KPSS data isn’t available for 2011-2020:
- Option A: Use 2000-2010 only (quick, limited scope)
- Option B: Replicate methodology (2 weeks work, full coverage)
- Your preference?
Private Firm Patents: Should we investigate alternatives to KPSS for non-public firms?
- Venture capital data?
- Acquisition prices?
- Or accept the public-firm-only limitation?

What We CAN Do Without Data

Even without the actual files, we can:

✅ Review Literature
- Read Azoulay et al. (2019) Section 5.4 in detail
- Read KPSS (2017) methodology sections
- Look for papers citing KPSS that use extended data
✅ Prepare Merge Code
- Write flexible merging scripts that handle different formats
- Create patent number standardization functions
- Prepare for multiple FK/PK scenarios
✅ Plan Panel Construction
- Design Orange Book panel structure
- Design KPSS value-weighted panel structure
- Plan for sparsity (Poisson regression, etc.)
✅ Contact Data Providers
- Email Noah Stoffman asking about extended KPSS data
- Check NBER patent data project for merged datasets
- Look for replication packages from papers using KPSS post-2017

Summary: Where We Stand

Prepared:

✅ Exploration scripts ready
✅ Documentation frameworks created
✅ Key questions identified
✅ Plan integrated into website

Blocked on:

🔴 Manual data downloads (Orange Book, KPSS)
🔴 KPSS time coverage unknown (critical!)

Next Actions:

You download Orange Book data
You download KPSS data (check for extended versions!)
We run exploration scripts
We assess feasibility
We decide how to proceed based on actual coverage

Estimated Time Once Data Available:

Exploration: 2-4 hours
Documentation: 2-3 hours
Test merges: 1-2 hours
Total: ~1 day of work

Let me know once you have the data files, and we can run the exploration immediately!