Lasso Regression: Identifying Key Predictors of Internet Use
1. Program
# ====== Setup ======
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# ====== Load & Prepare Data ======
df = pd.read_csv("gapminder.csv")

# Choose predictors (keep only those present in the file)
cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate"]
cols = [c for c in cols if c in df.columns]

# Coerce to numeric; blanks and non-numeric entries become NaN
for c in cols + ["internetuserate"]:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Keep only countries with complete data on all variables
df = df.dropna(subset=cols + ["internetuserate"])
X = df[cols]
y = df["internetuserate"]

# ====== Lasso with k-fold CV ======
# Standardize predictors so the L1 penalty treats them on a common scale
lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42))
])
lasso_pipe.fit(X, y)
lasso = lasso_pipe.named_steps["lasso"]

print("Optimal alpha (λ):", lasso.alpha_)
print("\nCoefficients:")
for name, coef in zip(cols, lasso.coef_):
    print(f"{name:20s} {coef:10.4f}")

# ====== Plot Cross-Validated MSE for Each Alpha ======
# mse_path_ has shape (n_alphas, n_folds); average over the folds
plt.figure(figsize=(8, 5))
plt.plot(lasso.alphas_, lasso.mse_path_.mean(axis=1), marker='o')
plt.xlabel("Alpha (λ)")
plt.ylabel("Mean Squared Error (CV)")
plt.title("Lasso Cross-Validation Path")
plt.grid(True)
plt.show()
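The plot above shows cross-validated error against λ, not the coefficients themselves. A coefficient path could be sketched roughly as follows, using scikit-learn's `lasso_path`. The data here is synthetic stand-in data, and `X_demo`, `y_demo`, and the helper `plot_coefficient_path` are illustrative names that are not part of the original program.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five predictors (assumption: not the Gapminder data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, n_informative=3,
                                 noise=10.0, random_state=0)

# lasso_path fits no intercept, so standardize X and center y first
X_std = StandardScaler().fit_transform(X_demo)
y_centered = y_demo - y_demo.mean()

# alphas come back in decreasing order; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X_std, y_centered)


def plot_coefficient_path(alphas, coefs, names):
    """Plot one line per predictor showing its coefficient at each alpha."""
    import matplotlib.pyplot as plt  # imported lazily; only needed when plotting

    plt.figure(figsize=(8, 5))
    for name, path in zip(names, coefs):
        plt.plot(alphas, path, label=name)
    plt.xscale("log")
    plt.xlabel("Alpha (λ)")
    plt.ylabel("Standardized coefficient")
    plt.title("Lasso Coefficient Path")
    plt.legend()
    plt.grid(True)
    plt.show()
```

Calling `plot_coefficient_path(alphas, coefs, ["x1", "x2", "x3", "x4", "x5"])` would draw the path. With the real data, the standardized `X` and the `cols` list from the program could be passed in instead.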
3. Interpretation
Model overview.
A Lasso regression using 10-fold cross-validation was conducted to predict Internet Use Rate (quantitative response) based on five explanatory variables:
- Income per person
- Life expectancy
- Urbanization rate
- Employment rate
- Female employment rate
The optimal regularization parameter was λ = 0.128, selected automatically to minimize cross-validated prediction error.
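As a rough illustration of what "selected automatically" means, the sketch below checks that the λ reported by `LassoCV` is the grid value with the lowest mean squared error averaged over the folds. It runs on synthetic stand-in data, and `X_demo`, `y_demo`, and `alpha_from_path` are illustrative names, not part of the original program.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Gapminder data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0,
                                 random_state=0)
X_std = StandardScaler().fit_transform(X_demo)

model = LassoCV(cv=10, random_state=42).fit(X_std, y_demo)

# mse_path_ has shape (n_alphas, n_folds); average across the 10 folds
mean_mse = model.mse_path_.mean(axis=1)

# The alpha with the smallest mean CV error is the one LassoCV reports
best_idx = int(np.argmin(mean_mse))
alpha_from_path = model.alphas_[best_idx]

print("alpha from CV path:", alpha_from_path)
print("alpha reported    :", model.alpha_)
```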
Variable selection and shrinkage.
Lasso penalizes the sum of the absolute values of the coefficients, which shrinks them toward zero and limits overfitting. In this model, all five predictors retained non-zero coefficients, indicating that each contributes to explaining variation in Internet use. Because the predictors were standardized before fitting, the coefficient magnitudes can be compared directly as a rough guide to relative importance:
- Income per person (β = 13.20) and life expectancy (β = 10.72) are the strongest positive predictors: countries with higher income and longer life expectancy tend to have higher Internet usage.
- Urbanization (β = 3.57) and female employment rate (β = 5.64) show smaller positive effects.
- Employment rate (β = -5.50) was the only negative coefficient, suggesting that in some contexts, higher employment (possibly in low-income or rural sectors) does not necessarily translate to greater Internet access. This sign should be read with caution, because employment rate and female employment rate may overlap heavily, and correlated predictors can take opposite signs.
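The coefficients listed above are on the standardized scale, so each one describes the change in Internet use rate per one standard deviation of the predictor. A minimal sketch of converting such coefficients back to original units is shown below. It uses synthetic stand-in data, and the names `coef_raw` and `intercept_raw` are illustrative, not part of the original program.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with very different column scales (assumption)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0,
                                 random_state=0)
X_demo = X_demo * np.array([1000.0, 10.0, 20.0, 5.0, 8.0]) \
         + np.array([5000.0, 70.0, 55.0, 60.0, 48.0])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42)),
])
pipe.fit(X_demo, y_demo)

scaler = pipe.named_steps["scaler"]
lasso = pipe.named_steps["lasso"]

# Standardized coefficient: change in y per one standard deviation of x
coef_std = lasso.coef_

# Original-unit coefficient: change in y per one raw unit of x
coef_raw = coef_std / scaler.scale_
intercept_raw = lasso.intercept_ - np.sum(coef_std * scaler.mean_ / scaler.scale_)

# Predictions from the raw-unit equation match the pipeline's predictions
pred_raw = X_demo @ coef_raw + intercept_raw
print(np.allclose(pred_raw, pipe.predict(X_demo)))
```

The standardized values are the ones to compare across predictors. The raw-unit values are easier to read for a single predictor, for example as the change in Internet use rate per additional year of life expectancy.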
Interpretation summary.
The Lasso regression highlights that economic prosperity and health (income and life expectancy) remain the dominant predictors of Internet use, while urbanization and female employment provide additional, smaller contributions. The selected penalty (λ ≈ 0.13) was mild: it shrank the coefficients toward zero but did not set any of them exactly to zero, so no variable was dropped from the model.
Model rationale.
Because the Gapminder dataset includes only around 180 countries, I did not split the data into training and test sets. Instead, I relied on the 10-fold cross-validation built into the Lasso procedure to select λ and to limit overfitting.
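One possible way to get an out-of-sample performance estimate without a dedicated test set is to wrap the whole pipeline in an outer cross-validation loop, so that λ is re-selected inside each outer training fold. The sketch below does this on synthetic stand-in data. The 5-fold outer split and the names `outer_cv` and `scores` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Gapminder data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0,
                                 random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42)),
])

# Outer loop: each fold refits the scaler and re-selects alpha on its
# training part only, then scores on the held-out part
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=outer_cv, scoring="r2")

print("R^2 per outer fold:", np.round(scores, 3))
print("Mean R^2:", scores.mean())
```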