Multiple Regression Analysis: Income, Life Expectancy, and Internet Use Rate

1. Program/Script


import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Load dataset
df = pd.read_csv("gapminder.csv")

# Convert to numeric; non-numeric entries (e.g. blanks) become NaN
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")

# Drop missing values; .copy() avoids a SettingWithCopyWarning when columns are added later
df_clean = df.dropna(subset=["incomeperperson", "internetuserate", "lifeexpectancy"]).copy()


# Center the quantitative predictors (subtract each variable's mean)
df_clean["income_centered"] = df_clean["incomeperperson"] - df_clean["incomeperperson"].mean()
df_clean["lifeexp_centered"] = df_clean["lifeexpectancy"] - df_clean["lifeexpectancy"].mean()


# Define predictors (with an intercept term) and response
X = sm.add_constant(df_clean[["income_centered", "lifeexp_centered"]])
y = df_clean["internetuserate"]

# Run multiple regression
model = sm.OLS(y, X).fit()
print(model.summary())


# --- Diagnostic Plots ---
# a) Q-Q Plot
# fit=True standardizes the residuals so that the 45-degree reference line is meaningful
sm.qqplot(model.resid, fit=True, line='45')
plt.title("Q-Q Plot of Residuals")
plt.show()


# b) Standardized Residuals Plot
# Plotted against fitted values, because residuals correlate with observed values by construction
standardized_residuals = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, standardized_residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Standardized Residuals vs. Fitted Values")
plt.xlabel("Fitted Internet Use Rate")
plt.ylabel("Standardized Residuals")
plt.show()


# c) Leverage Plot
sm.graphics.influence_plot(model, criterion="cooks")
plt.title("Leverage and Influence Plot")
plt.show()



2. Output

                            OLS Regression Results                            
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.762
Model:                            OLS   Adj. R-squared:                  0.759
Method:                 Least Squares   F-statistic:                     272.4
Date:                Wed, 22 Oct 2025   Prob (F-statistic):           9.69e-54
Time:                        15:05:03   Log-Likelihood:                -694.10
No. Observations:                 173   AIC:                             1394.
Df Residuals:                     170   BIC:                             1404.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               33.8650      1.026     33.019      0.000      31.840      35.890
income_centered      0.0014      0.000     11.363      0.000       0.001       0.002
lifeexp_centered     1.2606      0.133      9.443      0.000       0.997       1.524
==============================================================================
Omnibus:                        4.382   Durbin-Watson:                   2.172
Prob(Omnibus):                  0.112   Jarque-Bera (JB):                4.450
Skew:                           0.383   Prob(JB):                        0.108
Kurtosis:                       2.829   Cond. No.                     1.06e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.06e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
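
The coefficient table above implies a fitted equation that can be applied by hand. The sketch below is an illustration, not part of the original script: it plugs in the reported (rounded) coefficients, and the helper name `predict_internet_use` and the example inputs are hypothetical. Because both predictors are centered, the intercept (33.865) is the predicted internet use rate for a country at the sample mean of both income and life expectancy.

```python
# Coefficients copied from the OLS summary above (rounded as printed there).
CONST = 33.8650
B_INCOME = 0.0014    # change in internet use rate per unit of income_centered
B_LIFEEXP = 1.2606   # change in internet use rate per year of lifeexp_centered


def predict_internet_use(income_centered, lifeexp_centered):
    """Predicted internet use rate from the fitted equation (hypothetical helper)."""
    return CONST + B_INCOME * income_centered + B_LIFEEXP * lifeexp_centered


# A country exactly at both sample means: the prediction equals the intercept.
at_means = predict_internet_use(0.0, 0.0)

# Hypothetical country: income 1,000 above the mean, life expectancy 5 years above the mean.
above_means = predict_internet_use(1000.0, 5.0)

print(round(at_means, 3), round(above_means, 3))
```

Since the income coefficient is printed to only two significant digits, predictions for incomes far from the mean should be read as approximate.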



3. Results Summary

1. Association Results:
After adjusting for life expectancy, income per person remained a significant positive predictor of internet use rate (β = 0.0014, p < .001).
Life expectancy was also significantly associated with internet use rate (β = 1.26, p < .001): at a given income level, each additional year of life expectancy is associated with an internet use rate about 1.26 units higher, so countries with higher life expectancy tend to have greater internet penetration.

2. Hypothesis Test:
These results support the hypothesis that higher income per person is associated with higher internet use rates.

3. Confounding Check:
When life expectancy was added to the model, the regression coefficient for income decreased slightly (from 0.0017 in the simple model to 0.0014 in the multiple model). This change suggests partial confounding: part of the relationship between income and internet use is explained by differences in life expectancy. Countries with higher incomes also tend to have longer life expectancies, which are independently linked to higher internet use.
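
The confounding check described above amounts to fitting the model twice, once with income alone and once with life expectancy added, and comparing the income coefficient. The sketch below shows the mechanics on synthetic data (not the Gapminder file) and uses `numpy.linalg.lstsq` in place of statsmodels so it runs on its own; the variable names, the helper `ols_coefs`, and the generating coefficients are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500

# Synthetic stand-ins (NOT the Gapminder data): life expectancy is built to
# correlate with income, and the outcome depends on both.
income = rng.normal(size=n)
lifeexp = 0.6 * income + 0.8 * rng.normal(size=n)
internet = 2.0 * income + 1.5 * lifeexp + rng.normal(size=n)


def ols_coefs(y, *predictors):
    """Least-squares coefficients [intercept, b1, b2, ...] (hypothetical helper)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


b_simple = ols_coefs(internet, income)[1]             # income only
b_multiple = ols_coefs(internet, income, lifeexp)[1]  # income, adjusted for lifeexp

print(f"income coefficient, simple model:   {b_simple:.2f}")
print(f"income coefficient, multiple model: {b_multiple:.2f}")
```

In this synthetic setup the unadjusted coefficient absorbs part of the life expectancy effect, so it is larger than the adjusted one. With the blog's data, the same comparison would presumably be made by fitting `sm.OLS` on `income_centered` alone and then on both predictors.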


4. Regression Diagnostics

a) Q-Q Plot:
Residuals closely follow the 45° line, indicating that the assumption of normality is reasonably met.

b) Standardized Residuals Plot:
Residuals are randomly distributed around zero, suggesting good model fit and no major heteroscedasticity.

c) Leverage Plot:
Most data points have moderate leverage values. A few influential observations exist (likely small, wealthy nations such as Luxembourg or Singapore), but Cook’s distance values are well below 1, indicating no single data point unduly influences the model.
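
To make the leverage and Cook's distance criteria above concrete, here is a minimal sketch that computes both from their textbook formulas. It uses synthetic data and plain numpy, so the numbers are not those of the model in this post; the data-generating values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility
n, p = 100, 3  # observations; parameters (intercept + 2 predictors)

# Synthetic design matrix and response (NOT the Gapminder data).
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds each point's leverage.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Residuals and mean squared error of the least-squares fit.
resid = y - H @ y
mse = resid @ resid / (n - p)

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
cooks_d = resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

print("max leverage:", leverage.max())
print("max Cook's D:", cooks_d.max())
print("points with D > 1:", int((cooks_d > 1).sum()))
```

For a fitted statsmodels model, the same quantities should be available from `model.get_influence()` via its `hat_matrix_diag` and `cooks_distance` attributes, which would allow the "well below 1" statement to be checked numerically rather than read off the plot.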


5. Interpretation Summary

The multiple regression model shows that both income per person and life expectancy are significant, independent predictors of internet use rate.
Together, they explain approximately 76% of the variance in internet use rate across countries (R² = 0.76).
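
As a quick consistency check on the fit statistics, the adjusted R² and the F-statistic can be recomputed from the R-squared, the number of observations, and the number of predictors printed in the OLS summary in section 2. The sketch below uses only those reported values; the variable names are illustrative.

```python
# Values copied from the OLS summary in section 2.
r_squared = 0.762
n_obs = 173
df_model = 2
df_resid = n_obs - df_model - 1  # 170, matching "Df Residuals"

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic expressed through R-squared.
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

print(round(adj_r_squared, 3))
print(round(f_stat, 1))
```

The recomputed adjusted R² agrees with the reported 0.759, and the F-statistic comes out at about 272, close to the reported 272.4; the small gap comes from using R² rounded to three decimals.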

These findings reinforce the idea that economic prosperity and population health jointly contribute to digital access — wealthier and healthier societies are more connected.






 
