Milestone Assignment 2: Methods

1. Sample

The data were obtained from the World Bank World Development Indicators (WDI) dataset, which provides standardized cross-national indicators across economic, environmental, and health domains.
From this global dataset, I selected developing countries following the World Bank’s 2023 income classification (low- and lower-middle-income economies).

Countries with substantial missing data on the key indicators were removed, yielding a final analytical sample of 92 countries with data available for 2012 and 2013, the two years covered by the uploaded WDI extract.

The sample includes nations from Sub-Saharan Africa, South Asia, Latin America, and Southeast Asia. Each observation represents one country–year combination (92 countries × 2 years, N = 184 rows).

2. Measures

All variables were drawn from the WDI extract and are defined in the variable dictionary of the accompanying codebook. The dataset column names for each year are given in parentheses.

Economic Growth Indicators

  • GDP per capita (current US$) – total economic output per person (x142_2012, x142_2013).

  • Gross domestic savings (% of GDP) – share of national income saved rather than spent (x146_2012, x146_2013).

Environmental Sustainability Indicators

  • Adjusted savings: carbon dioxide damage (% of GNI) – estimates the cost of CO₂ emissions (x12_2012, x12_2013).

  • Renewable electricity output (% of total electricity output) – proportion of power generation from renewable sources (x253_2012, x253_2013).

  • Adjusted savings: natural resource depletion (% of GNI) – loss of natural assets (x21_2012, x21_2013).

Health Outcome Indicators

  • Life expectancy at birth (years) – average longevity (x173_2012, x173_2013).

  • Infant mortality rate (per 1,000 live births) – indicator of early-life health (x190_2012, x190_2013).

  • Health expenditure per capita (current US$) – national spending on healthcare services (x149_2012, x149_2013).

Variable Management

The variables were prepared as follows:

  • All quantitative variables were averaged across 2012 and 2013 to smooth year-to-year variation.

  • Right-skewed monetary indicators (GDP per capita, health expenditure per capita) were log-transformed.

  • All variables were standardized (z-scores) prior to multivariate analysis for scale comparability.

  • Missing values (< 5%) were imputed with regional means.
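The preparation steps above can be sketched in pandas as follows. This is a minimal illustration, not the exact script used: the toy values, the `region` column, and the `prepare` helper are assumptions, and the order of operations (average, impute, log-transform, standardize) is one plausible reading of the list.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WDI extract; all values and the `region` column are made up.
raw = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "region": ["SSA", "SSA", "SSA", "SA", "SA", "SA"],
    "x142_2012": [800.0, 1200.0, 950.0, 1500.0, 2100.0, 1800.0],  # GDP per capita
    "x142_2013": [850.0, np.nan, 1000.0, 1600.0, 2200.0, 1900.0],
    "x173_2012": [58.0, 61.0, np.nan, 66.0, 68.0, np.nan],        # life expectancy
    "x173_2013": [58.5, 61.4, np.nan, 66.3, 68.4, 67.0],
})

def prepare(df, codes, log_codes):
    """Hypothetical helper: average years, impute, log-transform, z-score."""
    out = df[["country", "region"]].copy()
    for code in codes:
        # 1. Average 2012 and 2013; a single available year is used as-is.
        out[code] = df[[f"{code}_2012", f"{code}_2013"]].mean(axis=1)
        # 2. Fill remaining gaps with the mean of the country's region.
        regional_mean = out.groupby("region")[code].transform("mean")
        out[code] = out[code].fillna(regional_mean)
        # 3. Log-transform right-skewed monetary indicators.
        if code in log_codes:
            out[code] = np.log(out[code])
        # 4. Standardize to z-scores (population SD, ddof=0).
        out[code] = (out[code] - out[code].mean()) / out[code].std(ddof=0)
    return out

clean = prepare(raw, codes=["x142", "x173"], log_codes={"x142"})
print(clean.round(2))
```

In this toy data, country C has no life expectancy value in either year, so it receives the mean of the other SSA countries before standardization.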

3. Analyses

The analytical workflow proceeded in three phases:

A. Descriptive Analysis

Descriptive statistics (mean, SD, range) and pairwise Pearson correlations were computed to summarize the distribution and preliminary relationships among all variables.
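As a rough sketch of this step, the summary table and correlation matrix could be produced with pandas as shown here. The data are synthetic and the column names (`log_gdp_pc`, `life_expectancy`, `infant_mortality`) are placeholders rather than the actual WDI codes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 92  # number of countries in the analytical sample

# Synthetic values for illustration only.
gdp = rng.normal(7.0, 0.8, n)
df = pd.DataFrame({
    "log_gdp_pc": gdp,
    "life_expectancy": 40 + 3.5 * gdp + rng.normal(0, 3, n),
    "infant_mortality": 120 - 11 * gdp + rng.normal(0, 8, n),
})

# Mean, SD, and range for each variable (one row per variable).
summary = df.agg(["mean", "std", "min", "max"]).T
summary["range"] = summary["max"] - summary["min"]

# Pairwise Pearson correlations among all variables.
corr = df.corr(method="pearson")

print(summary.round(2))
print(corr.round(2))
```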

B. Inferential Models

  1. Multiple Linear Regression – Economic Model:
    Evaluates how GDP per capita and gross domestic savings predict life expectancy, infant mortality, and health expenditure.

  2. Multiple Linear Regression – Environmental Model:
    Tests the association of CO₂ damage, renewable electricity, and resource depletion with the same health outcomes, controlling for GDP per capita.

  3. Lasso Regression (Cross-Validation):
    Performs variable selection among all predictors using 10-fold cross-validation to minimize prediction error and identify the most influential variables.
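The lasso step might look like the sketch below, which uses scikit-learn's `LassoCV` with 10 folds. The data are synthetic, the predictor names are placeholders, and the outcome is constructed so that only two predictors matter. Predictors are standardized first so that the penalty treats them on a common scale.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 92
names = ["log_gdp_pc", "savings", "co2_damage", "renewable_elec", "resource_depletion"]

# Synthetic predictors and outcome (illustration only).
X = rng.normal(size=(n, len(names)))
y = 65 + 4.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(0, 1.0, n)

# LassoCV picks the penalty strength (alpha) with the lowest 10-fold CV error.
model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
# Predictors whose coefficients were not shrunk to exactly zero.
selected = [name for name, coef in zip(names, lasso.coef_) if coef != 0]

print("chosen alpha:", lasso.alpha_)
print("selected predictors:", selected)
```

Predictors with a coefficient of exactly zero are dropped by the lasso; the remaining ones are the candidates for "most influential" in the sense used above.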

C. Exploratory Clustering

A k-means cluster analysis was used to group countries into similar profiles based on the standardized economic, environmental, and health indicators, with the aim of revealing regional development patterns.
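A minimal sketch of the clustering step with scikit-learn is given below. The indicator values are synthetic, and the number of clusters (`k = 2`) is a hypothetical choice, since the number of clusters is not stated here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: two loose groups of 46 countries, three indicators each
# (log GDP per capita, life expectancy, infant mortality).
low = rng.normal([6.5, 58.0, 60.0], [0.3, 3.0, 8.0], size=(46, 3))
high = rng.normal([8.0, 70.0, 25.0], [0.3, 3.0, 8.0], size=(46, 3))
X = np.vstack([low, high])

# Standardize so no indicator dominates the Euclidean distance.
Z = StandardScaler().fit_transform(X)

k = 2  # hypothetical number of clusters
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
labels = km.labels_

print("cluster sizes:", np.bincount(labels))
print("cluster centers (z-scores):")
print(km.cluster_centers_.round(2))
```

In practice, the number of clusters would usually be chosen by comparing several values of `k`, for example with an elbow plot of `km.inertia_`.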

Because the dataset is cross-sectional and modest in size, I did not perform a train-test split; instead, model performance and generalizability were validated using k-fold cross-validation (k = 10) within the regression steps.
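The validation approach can be illustrated as follows, using a linear regression scored with 10-fold cross-validation in scikit-learn. The data are synthetic, the column order is a placeholder, and the use of RMSE as the error metric is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
n = 92

# Synthetic predictors: CO2 damage, renewable electricity, resource depletion,
# and log GDP per capita as the control (illustration only).
X = rng.normal(size=(n, 4))
y = 65 + 3.0 * X[:, 3] - 1.0 * X[:, 0] + rng.normal(0, 2.0, n)

# Ten folds: each country is held out exactly once.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)

# scikit-learn returns negated MSE, so flip the sign before taking the root.
rmse = np.sqrt(-scores)
print("mean RMSE:", rmse.mean().round(2), "SD across folds:", rmse.std().round(2))
```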

All analyses were conducted in Python 3.11 using the libraries pandas, numpy, scikit-learn, and statsmodels.
