Focus: Understanding foreign keys, primary keys, and “weird quirks” in Orange Book and KPSS data
Question: What uniquely identifies each row?
Hypotheses to test:
patent.txt primary key:
- patent_no alone (one row per patent)
- (appl_no, patent_no) combo (multiple applications can reference same patent)
- (appl_no, patent_no, patent_use_code) (one patent, multiple uses)

products.txt primary key:
- appl_no alone (one application = one product)
- (appl_no, product_no) (one application, multiple products/strengths)
- (appl_no, product_no, strength) (multiple strengths per product)

How to test:
# Check for duplicates
patent_df.groupby(['patent_no']).size().max() # If > 1, not a PK
patent_df.groupby(['appl_no', 'patent_no']).size().max() # Test combo key
# Check uniqueness
len(patent_df) == patent_df[['patent_no']].drop_duplicates().shape[0]
Why this matters:
Question: How do patents link to products?
Expected FK: appl_no in both files
Relationship types to investigate:
Tests:
# Count patents per application
patents_per_app = patent_df.groupby('appl_no')['patent_no'].nunique()
print(f"Mean patents per app: {patents_per_app.mean():.2f}")
print(f"Max patents per app: {patents_per_app.max()}")
# Count products per patent (requires merge)
merged = patent_df.merge(products_df, on='appl_no')
products_per_patent = merged.groupby('patent_no')['product_no'].nunique()
print(f"Mean products per patent: {products_per_patent.mean():.2f}")
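Beyond counting, the expected FK itself can be validated with `merge(indicator=True)`. This is a sketch on toy frames standing in for patent_df / products_df (the appl_no values are made up); `left_only` rows flag patents whose application has no product record.

```python
# Sketch: check appl_no foreign-key coverage with an outer merge.
# Toy data; real column contents will differ.
import pandas as pd

patent_df = pd.DataFrame({'appl_no': ['N001', 'N001', 'N002', 'N999'],
                          'patent_no': ['7000001', '7000002', '7000003', '7000004']})
products_df = pd.DataFrame({'appl_no': ['N001', 'N002', 'N003'],
                            'product_no': ['001', '001', '001']})

check = patent_df.merge(products_df, on='appl_no', how='outer', indicator=True)
print(check['_merge'].value_counts())
# left_only  = patents whose application has no product row (orphan FK)
# right_only = products with no listed patents (plausible: unpatented drugs)
orphans = check.loc[check['_merge'] == 'left_only', 'appl_no'].unique()
print(f"Orphan applications in patent.txt: {list(orphans)}")
```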
Implications:
Question: How are patent numbers formatted, and will they match our data?
Possible formats:
- 7123456
- US7123456
- 7,123,456
- RE45123 (reissue), D234567 (design)

Our data format (from USPTO):
Standardization needed:
def standardize_patent_no(patent_str):
    """Convert various formats to standard numeric format"""
    # Remove common prefixes
    patent = str(patent_str).upper().strip()
    patent = patent.replace('US', '').replace(',', '').replace(' ', '')
    # Extract numeric part
    # Handle RE, D, PP prefixes if present
    if patent.startswith(('RE', 'PP')):
        return f"{patent[:2]}{patent[2:].zfill(7)}"  # Keep prefix
    elif patent.startswith('D'):
        return f"D{patent[1:].zfill(7)}"  # Design patent
    else:
        return patent.zfill(7)  # Regular utility patent

# Apply to both datasets before merge
orange_book['patent_std'] = orange_book['patent_no'].apply(standardize_patent_no)
our_data['patent_std'] = our_data['patent_id'].apply(standardize_patent_no)

# Then merge on standardized column
merged = orange_book.merge(our_data, on='patent_std')
Edge cases to check:
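A few spot checks on standardize_patent_no surface the edge cases worth worrying about; the function is repeated here so the snippet runs standalone, and the inputs are illustrative:

```python
# Spot-check edge cases for standardize_patent_no (copy of the version above).
def standardize_patent_no(patent_str):
    patent = str(patent_str).upper().strip()
    patent = patent.replace('US', '').replace(',', '').replace(' ', '')
    if patent.startswith(('RE', 'PP')):
        return f"{patent[:2]}{patent[2:].zfill(7)}"
    elif patent.startswith('D'):
        return f"D{patent[1:].zfill(7)}"
    else:
        return patent.zfill(7)

for case in ['US7123456', '7,123,456', 're45123', 'D234567', ' 7123456 ', None]:
    print(repr(case), '->', standardize_patent_no(case))
# Note: None/NaN pass through as the zero-filled string 'NONE'/'NAN',
# so missing values should be dropped or flagged before applying.
```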
Question: How are drugs/ingredients named, and how do we map to genes?
Possible naming systems:
What to check:
# Sample ingredient names
print("Sample ingredient names:")
print(products_df['active_ingredient'].head(20))
# Check for structured identifiers
print("\nColumn names (looking for codes):")
print([c for c in products_df.columns
       if any(term in c.lower() for term in ['code', 'id', 'unii', 'rxnorm'])])
# Check name format
print("\nName characteristics:")
print(f"Max length: {products_df['active_ingredient'].str.len().max()}")
print(f"Contains semicolons (combos): {products_df['active_ingredient'].str.contains(';', na=False).sum()}")
print(f"Contains 'AND' (combos): {products_df['active_ingredient'].str.contains(' AND ', na=False).sum()}")
Mapping strategy (Ingredient → Gene):
External databases needed:
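The mapping step itself can be sketched as normalize-then-join against an external target table (e.g. built from DrugBank/ChEMBL exports). Everything here is illustrative: the lookup contents, the gene assignments, and the column names are assumptions, not the actual mapping.

```python
# Sketch: ingredient -> gene via a normalized join against a hypothetical
# external lookup table. Gene assignments below are placeholders.
import pandas as pd

target_lookup = pd.DataFrame({
    'ingredient': ['ATORVASTATIN CALCIUM', 'METFORMIN HYDROCHLORIDE'],
    'gene': ['HMGCR', 'PRKAB1'],  # hypothetical target genes
})

products = pd.DataFrame({'active_ingredient': ['Atorvastatin Calcium ',
                                               'METFORMIN HYDROCHLORIDE']})
# Normalize case/whitespace before joining -- raw names rarely match exactly
products['ingredient'] = products['active_ingredient'].str.upper().str.strip()
mapped = products.merge(target_lookup, on='ingredient', how='left')
print(mapped[['ingredient', 'gene']])
```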
Question: What date fields exist and in what format?
Expected date fields:
- approval_date - When FDA approved the drug
- patent_expire_date - When patent protection ends
- patent_grant_date - When USPTO granted the patent (maybe not in OB)

Possible formats:
- YYYY-MM-DD (ISO standard)
- MM/DD/YYYY (US format)
- YYYYMMDD (numeric)
- Jan 1, 2020 (text)

Tests:
def detect_date_format(date_series):
    """Figure out what date format is used"""
    sample = date_series.dropna().head(20)
    print(f"Sample values: {sample.tolist()}")
    # Try parsing with different formats
    formats_to_try = [
        '%Y-%m-%d',
        '%m/%d/%Y',
        '%Y%m%d',
        '%d-%b-%Y',
    ]
    for fmt in formats_to_try:
        try:
            # errors='coerce' turns non-matching values into NaT instead of raising
            parsed = pd.to_datetime(sample, format=fmt, errors='coerce')
            success_rate = parsed.notna().sum() / len(sample)
            print(f"Format {fmt}: {success_rate:.1%} success")
            if success_rate > 0.8:
                print(f"  → Likely format: {fmt}")
        except (ValueError, TypeError):
            pass
detect_date_format(products_df['approval_date'])
Critical decision: Which date to use for panel timing?
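The stakes of that choice can be sketched on toy dates: approval_date times market entry, while patent_expire_date times loss of exclusivity. The column names and dates below are illustrative, not from the actual files.

```python
# Sketch: how the two timing choices assign panel years (toy data).
import pandas as pd

df = pd.DataFrame({
    'approval_date': pd.to_datetime(['2003-05-01', '2010-11-15']),
    'patent_expire_date': pd.to_datetime(['2017-05-01', '2024-11-15']),
})
df['entry_year'] = df['approval_date'].dt.year        # panel entry: FDA approval
df['exit_year'] = df['patent_expire_date'].dt.year    # panel exit: patent expiry
df['protected_years'] = df['exit_year'] - df['entry_year']
print(df[['entry_year', 'exit_year', 'protected_years']])
```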
Question: How are combination drugs (multiple active ingredients) handled?
Example: A pill with both aspirin and caffeine
Possible representations:
How to check:
# Check for duplicated appl_no (suggests multiple ingredients)
dupe_apps = products_df[products_df.duplicated('appl_no', keep=False)]
print(f"Applications with multiple rows: {dupe_apps['appl_no'].nunique()}")
# Check for delimiters in ingredient field
has_semicolon = products_df['active_ingredient'].str.contains(';', na=False).sum()
has_and = products_df['active_ingredient'].str.contains(' AND ', na=False).sum()
has_plus = products_df['active_ingredient'].str.contains(r'\+', na=False).sum()
print(f"Ingredients with semicolon: {has_semicolon}")
print(f"Ingredients with AND: {has_and}")
print(f"Ingredients with plus: {has_plus}")
# Check for ingredient_2, ingredient_3 columns
print("Ingredient columns:", [c for c in products_df.columns if 'ingredient' in c.lower()])
Implication for gene mapping:
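Whichever representation the file uses, combinations need to become one row per (appl_no, ingredient) before gene mapping. A minimal sketch, assuming delimiter-in-one-field storage and made-up example rows:

```python
# Sketch: explode combination drugs into one row per ingredient,
# splitting on the delimiters detected above (';' and ' AND ').
import pandas as pd

products_df = pd.DataFrame({
    'appl_no': ['N001', 'N002'],
    'active_ingredient': ['ASPIRIN; CAFFEINE', 'IBUPROFEN'],
})
long = (products_df
        .assign(ingredient=products_df['active_ingredient']
                .str.split(r';| AND ', regex=True))
        .explode('ingredient'))
long['ingredient'] = long['ingredient'].str.strip()
print(long[['appl_no', 'ingredient']])
```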
Question: What uniquely identifies each row?
Expected format: One row per patent
Tests:
# Check for duplicate patents
len(kpss_df) == kpss_df['patent_no'].nunique()
# If duplicates exist, why?
dupes = kpss_df[kpss_df.duplicated('patent_no', keep=False)]
print("Duplicate patents:")
print(dupes.head())
# Possible reasons for duplicates:
# - Multiple valuations (different event windows?)
# - Patent assigned to multiple firms (reissue/transfer?)
# - Errors in data
Possible PK combinations:
- patent_no alone (expected)
- (patent_no, permno) - if patent assigned to multiple firms
- (patent_no, grant_year) - if multiple grants (unlikely)

Question: Same as Orange Book, but KPSS may differ
Likely format: Numeric only (since from USPTO)
But check for:
Standardization:
# Same as Orange Book, but verify format
sample_patents = kpss_df['patent_no'].head(100)
print("Sample patent numbers:")
print(sample_patents.tolist())
# Check data type
print(f"Data type: {kpss_df['patent_no'].dtype}")
# If it's float (bad sign - Excel corrupted it)
if kpss_df['patent_no'].dtype == 'float64':
    print("WARNING: Patents stored as float (Excel corruption?)")
    # astype(int) raises on NaN -- drop or fill missing patent numbers first
    kpss_df['patent_no'] = kpss_df['patent_no'].astype(int).astype(str)
Question: What are the quirks of the value distribution?
Things to check:
print(f"Mean value: {kpss_df['value'].mean():,.2f}")
print(f"Median value: {kpss_df['value'].median():,.2f}")
# If mean is like 50, units are probably millions
# If mean is like 50000, units are probably dollars
negative_count = (kpss_df['value'] < 0).sum()
print(f"Negative values: {negative_count} ({negative_count/len(kpss_df)*100:.1f}%)")
# If many negatives, may need to handle separately
exact_zeros = (kpss_df['value'] == 0).sum()
missing = kpss_df['value'].isna().sum()
print(f"Exact zeros: {exact_zeros} (market didn't react)")
print(f"Missing: {missing} (no data)")
# Zeros = measured but no effect
# Missing = not measured
top_values = kpss_df.nlargest(10, 'value')
print("Top 10 patents:")
print(top_values[['patent_no', 'value']])
# Check if top values are reasonable (billions possible for pharma)
# Check if there's a year column
if 'year' in kpss_df.columns:
    print("Value by year (check for inflation adjustment):")
    print(kpss_df.groupby('year')['value'].agg(['mean', 'median']))
# If mean is constant over time → real dollars
# If mean grows over time → nominal dollars (need to adjust)
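If the values turn out to be nominal, deflating them with a price index is straightforward. A minimal sketch; the CPI levels and KPSS values here are illustrative, not the actual series:

```python
# Sketch: deflate nominal values to constant (2010) dollars with a price index.
import pandas as pd

kpss = pd.DataFrame({'year': [2000, 2010], 'value': [50.0, 80.0]})
cpi = pd.Series({2000: 172.2, 2010: 218.1})  # illustrative index levels
base = cpi[2010]                             # choose a base year
kpss['value_real'] = kpss['value'] * base / kpss['year'].map(cpi)
print(kpss)
```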
Question: How are firms identified, and does it matter?
Possible identifiers:
permno - CRSP permanent company numbergvkey - Compustat identifiercusip - Committee on Uniform Securities Identification ProceduresWhy we care:
Tests:
# Check firm identifier columns
firm_cols = [c for c in kpss_df.columns
             if any(term in c.lower()
                    for term in ['firm', 'company', 'permno', 'gvkey', 'cusip'])]
print(f"Firm identifier columns: {firm_cols}")

# How many unique firms?
if 'permno' in kpss_df.columns:
    print(f"Unique firms: {kpss_df['permno'].nunique()}")

    # Patents per firm
    patents_per_firm = kpss_df.groupby('permno').size()
    print(f"Mean patents per firm: {patents_per_firm.mean():.1f}")
    print(f"Median patents per firm: {patents_per_firm.median():.1f}")
Question: Does data extend to our analysis period? (CRITICAL!)
Key checks:
if 'year' in kpss_df.columns:
    print(f"Latest year: {kpss_df['year'].max()}")
    if kpss_df['year'].max() < 2015:
        print("⚠️ CRITICAL: Data doesn't extend to 2015!")
        print("   This is a MAJOR LIMITATION for our 2000-2020 analysis")

    # Coverage by year -- watch for a drop-off in the most recent years
    year_coverage = kpss_df['year'].value_counts().sort_index()
    recent_years = year_coverage.tail(5)
    if recent_years.mean() < year_coverage.mean() * 0.5:
        print("⚠️ WARNING: Coverage drops in recent years (data may be incomplete)")
Overlap with our panel:
our_start, our_end = 2000, 2020
overlap_start = max(our_start, kpss_df['year'].min())
overlap_end = min(our_end, kpss_df['year'].max())
overlap_years = overlap_end - overlap_start + 1
print(f"\nOur analysis period: {our_start}-{our_end} ({our_end-our_start+1} years)")
print(f"KPSS data period: {kpss_df['year'].min()}-{kpss_df['year'].max()}")
print(f"Overlap: {overlap_start}-{overlap_end} ({overlap_years} years)")
print(f"Coverage: {overlap_years / (our_end-our_start+1) * 100:.0f}% of our period")
if overlap_years < 15:
    print("\n🚨 WARNING: Less than 15 years of overlap!")
    print("   Consider:")
    print("   1. Finding extended KPSS data")
    print("   2. Replicating methodology for missing years")
    print("   3. Restricting analysis to overlap period only")
Once we have the data:
Critical questions to answer:
Let me know once you have the data files!