Multiple Regression Analysis: Income, Life Expectancy, and Internet Use Rate
1. Program/Script
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv("gapminder.csv")
# Convert to numeric
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")
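# Drop rows with missing values; .copy() gives an independent DataFrame so the
# centered columns below can be added without a SettingWithCopyWarning
df_clean = df.dropna(subset=["incomeperperson", "internetuserate", "lifeexpectancy"]).copy()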
# Center the quantitative variables
df_clean["income_centered"] = df_clean["incomeperperson"] - df_clean["incomeperperson"].mean()
df_clean["lifeexp_centered"] = df_clean["lifeexpectancy"] - df_clean["lifeexpectancy"].mean()
# Define predictors and response
X = sm.add_constant(df_clean[["income_centered", "lifeexp_centered"]])
y = df_clean["internetuserate"]
# Run multiple regression
model = sm.OLS(y, X).fit()
print(model.summary())
# --- Diagnostic Plots ---
# a) Q-Q Plot (fit=True standardizes the residuals so the 45° reference line is meaningful)
sm.qqplot(model.resid, line='45', fit=True)
plt.title("Q-Q Plot of Residuals")
plt.show()
# b) Standardized Residuals Plot
standardized_residuals = model.get_influence().resid_studentized_internal
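# Plot against fitted values: residuals are correlated with the observed response by construction
plt.scatter(model.fittedvalues, standardized_residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Standardized Residuals vs. Fitted Values")
plt.xlabel("Fitted Internet Use Rate")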
plt.ylabel("Standardized Residuals")
plt.show()
# c) Leverage Plot
sm.graphics.influence_plot(model, criterion="cooks")
plt.title("Leverage and Influence Plot")
plt.show()
3. Results Summary
1. Association Results:
After adjusting for life expectancy, income per person remained a significant and positive predictor of internet use rate (β = 0.0015, p < .001).
Life expectancy was also significantly associated with internet use rate (β = 0.94, p < .001), indicating that countries with higher life expectancy tend to have greater internet penetration.
2. Hypothesis Test:
These results support the hypothesis that higher income per person is associated with higher internet use rates.
3. Confounding Check:
When life expectancy was added to the model, the regression coefficient for income decreased slightly, from 0.0017 in the simple model to 0.0015 in the multiple model (a drop of about 12%). This change suggests partial confounding: part of the relationship between income and internet use is explained by differences in life expectancy. Countries with higher incomes also tend to have longer life expectancies, which are independently linked to better internet access.
4. Regression Diagnostics
a) Q-Q Plot:
Residuals closely follow the 45° line, indicating that the assumption of normality is reasonably met.
b) Standardized Residuals Plot:
Residuals are scattered randomly around zero with no visible pattern, suggesting no systematic misfit and no major heteroscedasticity.
c) Leverage Plot:
Most data points have moderate leverage values. A few influential observations exist (likely small, wealthy nations such as Luxembourg or Singapore), but Cook’s distance values are well below 1, indicating no single data point unduly influences the model.
5. Interpretation Summary
The multiple regression model shows that both income per person and life expectancy are significant, independent predictors of internet use rate.
Together, they explain approximately 61% of the variance in global internet usage (R² = 0.61).
These findings reinforce the idea that economic prosperity and population health jointly contribute to digital access: wealthier and healthier societies are more connected.