Gene Research Panel Data Analysis

Dataset: full_serialized_final_panel_with_openalex_counts.dta

Generated: 2025-12-12 16:20

Total Observations: 949,824
Unique Genes: 19,690
Months (2020-2023): 48
Total Papers: 13,684,585
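The headline counts can be cross-checked with simple arithmetic. The sketch below uses only the four figures reported above; the variable names are illustrative and are not column names from the dataset.

```python
# Headline figures copied from the report summary.
total_observations = 949_824
unique_genes = 19_690
months = 48
total_papers = 13_684_585

# Mean papers per gene-month implied by the totals.
mean_papers = total_papers / total_observations

# Size of a fully balanced panel (every gene observed in every month).
balanced_rows = unique_genes * months
extra_rows = total_observations - balanced_rows

print(f"mean papers per gene-month: {mean_papers:.1f}")
print(f"balanced panel rows: {balanced_rows:,}")
print(f"rows beyond a balanced panel: {extra_rows:,}")
```

The implied mean of about 14.4 papers per gene-month agrees with the mean quoted in Section 1. The observation count is 4,704 rows larger than a balanced 19,690 x 48 panel; the report does not state the reason, so it is worth confirming against the data.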

1. Variable Distributions

Figure 1: Distributions of key variables (log-transformed where skewness > 2)

Key Finding: Most variables are highly right-skewed. The median gene-month has only 2 papers, while the mean is 14.4, indicating a small number of genes receive disproportionate attention.
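The skewness rule in the Figure 1 caption and the gap between median and mean can be illustrated with a minimal sketch. It assumes a moment-based skewness estimate and a log(1 + x) transform (so that zero counts stay defined); the report does not say which skewness estimator or logarithm was actually used, and the function names and sample counts are hypothetical.

```python
import math
import statistics


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant series: treat as unskewed
    return m3 / m2 ** 1.5


def log_transform_if_skewed(values, threshold=2.0):
    """Return (values, applied); log1p is applied only above the threshold."""
    # Assumption: log1p is used so that zero paper counts remain valid.
    if sample_skewness(values) > threshold:
        return [math.log1p(v) for v in values], True
    return list(values), False


# Hypothetical paper counts for ten gene-months (not from the dataset).
papers = [0, 1, 1, 2, 2, 2, 3, 3, 4, 500]

print("median:", statistics.median(papers))
print("mean:", statistics.mean(papers))
transformed, applied = log_transform_if_skewed(papers)
print("log-transformed:", applied)
```

A single heavily studied gene-month is enough to pull the mean far above the median, which is the pattern the key finding describes.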

2. Time Series Analysis

Figure 2: Research activity over time with AlphaFold 2 release marked (July 2021)

3. Pre vs Post AlphaFold Comparison

Figure 3: Box plots comparing key metrics before and after AlphaFold 2 release
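A pre/post comparison needs a rule for assigning each panel month to a period. The sketch below is one possible rule, assuming the panel runs from January 2020 through December 2023 and that the release month (July 2021) counts as post-release; the report does not state how the release month itself was classified, and the names used are hypothetical.

```python
from datetime import date

# Assumption: the post period starts with the release month (July 2021).
ALPHAFOLD_RELEASE = date(2021, 7, 1)


def is_post_alphafold(year, month):
    """True if the calendar month falls in or after the release month."""
    return date(year, month, 1) >= ALPHAFOLD_RELEASE


# All 48 panel months, January 2020 through December 2023.
panel_months = [(y, m) for y in range(2020, 2024) for m in range(1, 13)]

pre = [ym for ym in panel_months if not is_post_alphafold(*ym)]
post = [ym for ym in panel_months if is_post_alphafold(*ym)]

print(len(pre), "pre-release months;", len(post), "post-release months")
```

Under this rule the two periods are unequal in length (18 months before, 30 after), which matters when comparing totals rather than per-month averages.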

4. Correlation Analysis

Figure 4: Correlation matrix of key numeric variables

Key Finding: Total papers and papers by newcomers are very strongly correlated (r=0.98), consistent with newcomers producing most of the research output. Disease associations correlate moderately with paper counts.
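When reading a correlation between a total and one of its components, it helps to remember that part of the relationship is built in: newcomer papers are counted inside total papers. The sketch below computes Pearson's r by hand on hypothetical counts (not from the dataset) to show how a large component correlates strongly with the total even when the two components are not positively related to each other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts (not from the dataset).
newcomer_papers = [1, 2, 3, 4, 5, 6, 7, 8]
veteran_papers = [2, 1, 2, 1, 2, 1, 2, 1]
total_papers = [n + v for n, v in zip(newcomer_papers, veteran_papers)]

r_total_newcomer = pearson_r(total_papers, newcomer_papers)
r_newcomer_veteran = pearson_r(newcomer_papers, veteran_papers)
print(f"r(total, newcomer)   = {r_total_newcomer:.2f}")
print(f"r(newcomer, veteran) = {r_newcomer_veteran:.2f}")
```

In this toy example the total tracks the newcomer series closely simply because newcomers contribute most of the variation, so a high r between the two is expected whenever newcomers make up most of the output.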

5. Protein Existence Analysis

Figure 5: Research activity and structure quality by protein evidence level

6. Outlier Analysis

Figure 6: Distribution analysis with outliers and key genes identified

7. Newcomer vs Veteran Analysis

Figure 7: Comparison of research output by author experience level

Key Finding: Newcomers consistently produce ~75% of papers across the time period. The newcomer fraction increased slightly after AlphaFold 2 release.
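The newcomer fraction can be computed as a pooled share, newcomer papers divided by total papers within each period. The sketch below uses hypothetical monthly aggregates and hypothetical field names (`newcomer_papers`, `total_papers`); it also assumes the release month counts as post-release, which the report does not specify.

```python
def newcomer_share(rows):
    """Pooled share: newcomer papers divided by total papers."""
    newcomer = sum(r["newcomer_papers"] for r in rows)
    total = sum(r["total_papers"] for r in rows)
    return newcomer / total if total else float("nan")


# Hypothetical monthly aggregates (not from the dataset).
monthly = [
    {"month": "2021-05", "newcomer_papers": 70, "total_papers": 100},
    {"month": "2021-06", "newcomer_papers": 80, "total_papers": 100},
    {"month": "2021-07", "newcomer_papers": 78, "total_papers": 100},
    {"month": "2021-08", "newcomer_papers": 76, "total_papers": 100},
]

RELEASE_MONTH = "2021-07"  # assumption: release month counts as "post"

# ISO "YYYY-MM" strings sort chronologically, so string comparison works.
pre = [r for r in monthly if r["month"] < RELEASE_MONTH]
post = [r for r in monthly if r["month"] >= RELEASE_MONTH]

print(f"pre share:  {newcomer_share(pre):.2f}")
print(f"post share: {newcomer_share(post):.2f}")
```

Pooling the counts before dividing weights busy months more heavily than averaging the monthly ratios would; either choice is defensible, but the two can give slightly different answers.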

8. Gene Group Analysis

Figure 8: Analysis of gene functional groups

9. Bucket Analysis

Figure 9: Distribution of observations across meaningful buckets

10. Institution Type Analysis

Figure 10: Research output by institution type (Healthcare, Education, Company)

11. Inequality Analysis

Figure 11: Cumulative distribution and Lorenz curve showing research inequality

Key Finding: The Gini coefficient of 0.848 indicates high inequality in research attention. The top 1% of gene-months account for a disproportionate share of all publications.
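For reference, a Gini coefficient, Lorenz curve, and top-share figure can be computed from per-observation paper counts as sketched below. This is a minimal illustration using the standard rank-weighted formula on sorted values; it is not necessarily how the report's 0.848 was calculated, and the sample counts and function names are hypothetical.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def lorenz_curve(values):
    """Points (population share, cumulative value share); assumes a positive total."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of observations."""
    xs = sorted(values, reverse=True)
    k = max(1, math.ceil(len(xs) * fraction))
    return sum(xs[:k]) / sum(xs)


# Hypothetical paper counts per gene-month (not from the dataset).
papers = [0, 0, 1, 1, 2, 2, 3, 5, 8, 120]
print(f"Gini: {gini(papers):.3f}")
print(f"Top 10% share: {top_share(papers, 0.10):.3f}")
```

Calling `top_share(counts, 0.01)` on the real gene-month counts would give the exact share behind the "top 1%" statement above.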

Summary Statistics

See full summary statistics table.