Lasso Regression: Identifying Key Predictors of Internet Use

1. Program


# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# ====== Load & Prepare Data ======
df = pd.read_csv("gapminder.csv")

# Choose predictors (keep only those actually present in the file)
cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate"]
cols = [c for c in cols if c in df.columns]

# Convert to numeric and drop rows with missing values
for c in cols + ["internetuserate"]:
    df[c] = pd.to_numeric(df[c], errors="coerce")
df = df.dropna(subset=cols + ["internetuserate"])

X = df[cols]
y = df["internetuserate"]

# ====== Lasso with k-fold CV ======
# Standardize predictors so the L1 penalty treats them on a common scale,
# then let LassoCV choose the regularization strength by 10-fold CV.
lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42))
])
lasso_pipe.fit(X, y)

lasso = lasso_pipe.named_steps["lasso"]

print("Optimal alpha (λ):", lasso.alpha_)
print("\nCoefficients:")
for name, coef in zip(cols, lasso.coef_):
    print(f"{name:20s} {coef:10.4f}")

# ====== Plot Cross-Validation Error Path ======
# Mean CV mean squared error across the grid of alpha values tried by LassoCV
plt.figure(figsize=(8, 5))
plt.plot(lasso.alphas_, lasso.mse_path_.mean(axis=1), marker='o')
plt.xlabel("Alpha (λ)")
plt.ylabel("Mean Squared Error (CV)")
plt.title("Lasso Cross-Validation Path")
plt.grid(True)
plt.show()




2. Output

Optimal alpha (λ): 0.1277895476243271

Coefficients:
incomeperperson        13.2012
lifeexpectancy         10.7189
urbanrate               3.5663
employrate             -5.4950
femaleemployrate        5.6419



3. Interpretation

Model overview.
A Lasso regression using 10-fold cross-validation was conducted to predict Internet Use Rate (quantitative response) based on five explanatory variables:

  • Income per person

  • Life expectancy

  • Urbanization rate

  • Employment rate

  • Female employment rate

The optimal regularization parameter was λ = 0.128, selected automatically by LassoCV as the value that minimizes the cross-validated mean squared error.
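As a quick check (a minimal sketch, not part of the original output, assuming the fitted lasso object from the program above is still in scope), the selected alpha can be matched against the grid value that minimizes the mean CV error stored in mse_path_:

import numpy as np

# Mean CV mean squared error for each candidate alpha tried by LassoCV
mean_cv_mse = lasso.mse_path_.mean(axis=1)

# The alpha with the smallest mean CV MSE should equal lasso.alpha_
best_idx = int(np.argmin(mean_cv_mse))
print("alpha at minimum CV MSE:", lasso.alphas_[best_idx])
print("lasso.alpha_           :", lasso.alpha_)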

Variable selection and shrinkage.
Lasso adds an L1 penalty on the absolute size of the coefficients, shrinking them toward zero (and dropping weak predictors entirely by setting their coefficients to exactly zero) to avoid overfitting. In this model, all five predictors retained non-zero coefficients, indicating that each contributes to explaining variation in Internet use. Because the predictors were standardized in the pipeline, the coefficient magnitudes are directly comparable and reflect relative importance (a coefficient-path sketch follows the list below):

  • Income per person (β = 13.20) and life expectancy (β = 10.72) are the strongest positive predictors — countries with higher income and longer life expectancy tend to have higher Internet usage.

  • Urbanization (β = 3.57) and female employment rate (β = 5.64) show smaller positive effects.

  • Employment rate (β = –5.50) was the only negative coefficient, suggesting that in some contexts, higher employment (possibly in low-income or rural sectors) does not necessarily translate to greater Internet access.
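To visualize the shrinkage described above, the coefficients can be traced over a range of penalty strengths. The sketch below uses scikit-learn's lasso_path on the standardized predictors; it assumes cols, X, and y from the program above are in scope, and Xs is a name introduced here for illustration:

from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Standardize predictors so the path is on the same scale as the fitted pipeline
Xs = StandardScaler().fit_transform(X)

# Coefficient values along a decreasing grid of alpha values
alphas, coefs, _ = lasso_path(Xs, y)

plt.figure(figsize=(8, 5))
for name, path in zip(cols, coefs):
    plt.plot(alphas, path, label=name)
plt.xscale("log")
plt.xlabel("Alpha (λ)")
plt.ylabel("Standardized coefficient")
plt.title("Lasso Coefficient Shrinkage Path")
plt.legend()
plt.grid(True)
plt.show()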

Interpretation summary.
The Lasso regression highlights that economic prosperity and health (income and life expectancy) remain the dominant predictors of Internet use, while urbanization and gender-inclusive employment provide additional, smaller contributions. The moderate penalty (λ ≈ 0.13) achieved a good balance between simplicity and predictive accuracy, shrinking coefficients toward zero without excluding variables entirely.

Model rationale.
Because the Gapminder dataset includes only around 180 countries, I did not split the data into separate training and test sets. Instead, I relied on k-fold cross-validation within the Lasso procedure to ensure internal robustness, as sketched below.
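If an explicit estimate of out-of-sample performance is still wanted without a held-out test set, the whole pipeline can itself be cross-validated. This is a minimal sketch assuming lasso_pipe, X, and y from the program above; the shuffled 10-fold split and R² scoring are choices made here for illustration, not part of the original analysis:

from sklearn.model_selection import KFold, cross_val_score

# Score the full pipeline (scaling + LassoCV) on folds it has not seen
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(lasso_pipe, X, y, cv=cv, scoring="r2")

print("Cross-validated R² per fold:", scores.round(3))
print("Mean R²: %.3f (SD %.3f)" % (scores.mean(), scores.std()))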



