Milestone Assignment 2: Methods
1. Sample
The data were obtained from the World Bank World Development Indicators (WDI) dataset, which provides standardized cross-national indicators across economic, environmental, and health domains.
From this global dataset, I selected developing countries following the World Bank’s 2023 income classification (low- and lower-middle-income economies).
Observations with substantial missing data across key indicators were removed, yielding a final analytical sample of 92 countries with data available for 2012 and 2013, corresponding to the years represented in the uploaded WDI extract.
The sample includes nations from Sub-Saharan Africa, South Asia, Latin America, and Southeast Asia. Each observation represents one country–year combination (N = 184 rows).
2. Measures
All variables were sourced from the WDI variable dictionary provided in the accompanying codebook.
Economic Growth Indicators
- GDP per capita (current US$) – total economic output per person (x142_2012, x142_2013).
- Gross domestic savings (% of GDP) – share of national income saved rather than spent (x146_2012, x146_2013).
Environmental Sustainability Indicators
- Adjusted savings: carbon dioxide damage (% of GNI) – estimates the cost of CO₂ emissions (x12_2012, x12_2013).
- Renewable electricity output (% of total electricity output) – proportion of power generation from renewable sources (x253_2012, x253_2013).
- Adjusted savings: natural resource depletion (% of GNI) – loss of natural assets (x21_2012, x21_2013).
Health Outcome Indicators
- Life expectancy at birth (years) – average longevity (x173_2012, x173_2013).
- Infant mortality rate (per 1,000 live births) – indicator of early-life health (x190_2012, x190_2013).
- Health expenditure per capita (current US$) – national spending on healthcare services (x149_2012, x149_2013).
Variable Management
The quantitative variables were prepared as follows:
- Each indicator was averaged across 2012 and 2013 to smooth year-to-year variation.
- Right-skewed monetary indicators (GDP per capita, health expenditure per capita) were log-transformed.
- All variables were standardized (z-scores) prior to multivariate analysis for scale comparability.
- Missing values (< 5%) were imputed with regional means.
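The preparation steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the exact pipeline used: the toy values, the `region` column, the `preprocess` helper, and the order of the steps (average, log, impute, standardize) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not taken from the WDI extract.
wide = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["R1", "R1", "R2", "R2"],  # hypothetical grouping column
    "x142_2012": [500.0, 1500.0, 900.0, np.nan],  # GDP per capita
    "x142_2013": [550.0, 1600.0, 950.0, np.nan],
    "x173_2012": [58.0, 66.0, 61.0, 70.0],        # life expectancy
    "x173_2013": [59.0, 67.0, 62.0, 71.0],
})

def preprocess(df, codes, log_codes, region_col="region"):
    """Average 2012/2013, log-transform, impute by region, then z-score."""
    out = df[["country", region_col]].copy()
    for code in codes:
        # 1. Average the two years (a single available year is used as-is).
        out[code] = df[[f"{code}_2012", f"{code}_2013"]].mean(axis=1)
        # 2. Log-transform the right-skewed monetary indicators.
        if code in log_codes:
            out[code] = np.log(out[code])
        # 3. Fill remaining gaps with the mean of the country's region.
        regional_mean = out.groupby(region_col)[code].transform("mean")
        out[code] = out[code].fillna(regional_mean)
        # 4. Standardize to z-scores (population SD, ddof=0).
        out[code] = (out[code] - out[code].mean()) / out[code].std(ddof=0)
    return out

clean = preprocess(wide, codes=["x142", "x173"], log_codes={"x142"})
print(clean)
```

In this sketch, country D has no GDP data in either year, so it receives the mean of its region (here, the value for country C) before standardization.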
3. Analyses
The analytical workflow proceeded in three phases:
A. Descriptive Analysis
Descriptive statistics (mean, SD, range) and pairwise Pearson correlations were computed to summarize the distribution and preliminary relationships among all variables.
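As a rough illustration of this step, the summary table and correlation matrix could be produced with pandas along these lines. The data are synthetic and the column names (`log_gdp_pc`, `life_expectancy`, `infant_mortality`) are placeholders chosen for the example, not names from the WDI extract.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 92-country sample (values are made up).
rng = np.random.default_rng(0)
gdp = rng.normal(size=92)
df = pd.DataFrame({
    "log_gdp_pc": gdp,
    "life_expectancy": 0.8 * gdp + rng.normal(scale=0.5, size=92),
    "infant_mortality": -0.7 * gdp + rng.normal(scale=0.5, size=92),
})

# Mean, SD, and range for every variable (one row per variable).
summary = df.agg(["mean", "std", "min", "max"]).T
summary["range"] = summary["max"] - summary["min"]

# Pairwise Pearson correlations among all variables.
corr = df.corr(method="pearson")

print(summary.round(2))
print(corr.round(2))
```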
B. Inferential Models
- Multiple Linear Regression – Economic Model: Evaluates how GDP per capita and gross domestic savings predict life expectancy, infant mortality, and health expenditure.
- Multiple Linear Regression – Environmental Model: Tests the effect of CO₂ damage, renewable electricity, and resource depletion on the same health outcomes, controlling for GDP per capita.
- Lasso Regression (Cross-Validation): Performs variable selection among all predictors using 10-fold cross-validation to minimize prediction error and identify the most influential variables.
C. Exploratory Clustering
A k-means cluster analysis grouped countries into similar profiles based on the standardized economic, environmental, and health indicators, with the aim of identifying regional development patterns.
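The clustering step could look roughly like the sketch below. The number of clusters (k = 3), the synthetic "country profiles", and the four-indicator layout are assumptions for illustration only; the text above does not state which k was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three synthetic "country profiles" of 10 countries each, 4 indicators.
rng = np.random.default_rng(0)
centers = np.array([[-3.0] * 4, [0.0] * 4, [3.0] * 4])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 4)) for c in centers])

# Standardize so that no indicator dominates the Euclidean distances.
X_std = StandardScaler().fit_transform(X)

# k = 3 is an assumed value; n_init restarts guard against poor local optima.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
labels = km.labels_

# Number of countries assigned to each cluster.
print(np.bincount(labels))
```

In practice, k is often chosen by comparing the within-cluster sum of squares (`km.inertia_`) or silhouette scores across several candidate values.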
Because the dataset is cross-sectional and modest in size, I did not perform a train-test split; instead, model performance and generalizability were validated using k-fold cross-validation (k = 10) within the regression steps.
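One way to carry out this kind of validation with scikit-learn is sketched below. The data are synthetic, and wrapping the scaler and the regression in a pipeline is a design choice assumed for the example: it re-estimates the standardization inside each training fold so that held-out countries do not influence the scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 92 countries, 3 predictors (values are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(92, 3))
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=92)

# Scaling and regression are fitted together within each training fold.
model = make_pipeline(StandardScaler(), LinearRegression())

# 10 folds, shuffled once with a fixed seed for reproducibility.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn reports negated MSE, so flip the sign back.
mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
print(f"mean CV MSE: {mse.mean():.3f} (SD {mse.std():.3f})")
```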
All analyses were conducted in Python 3.11 using the libraries pandas, numpy, scikit-learn, and statsmodels.