Logistic Regression: Income, Life Expectancy, and Internet Use (Binary Outcome)
1. Program/Script
import pandas as pd
import statsmodels.api as sm
import numpy as np
# Load dataset
df = pd.read_csv("gapminder.csv")
# Convert variables to numeric
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")
# Drop rows with missing values; .copy() avoids SettingWithCopyWarning when new columns are added below
df_clean = df.dropna(subset=["incomeperperson", "internetuserate", "lifeexpectancy"]).copy()
# --- Step 1: Create a binary response variable ---
# Categorize Internet Use Rate (High = 1 if > median, Low = 0 otherwise)
median_internet = df_clean["internetuserate"].median()
df_clean["high_internet_use"] = (df_clean["internetuserate"] > median_internet).astype(int)
# --- Step 2: Explanatory variables ---
df_clean["income_centered"] = df_clean["incomeperperson"] - df_clean["incomeperperson"].mean()
df_clean["lifeexp_centered"] = df_clean["lifeexpectancy"] - df_clean["lifeexpectancy"].mean()
# --- Step 3: Logistic Regression Model ---
X = sm.add_constant(df_clean[["income_centered", "lifeexp_centered"]])
y = df_clean["high_internet_use"]
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())
# --- Step 4: Calculate Odds Ratios and 95% CI ---
params = logit_model.params
conf = logit_model.conf_int()
conf["OR"] = params
conf.columns = ["2.5%", "97.5%", "OR"]
odds_ratios = np.exp(conf)
print("\nOdds Ratios with 95% CI:\n", odds_ratios)
3. Results Summary
a. Findings
After adjusting for life expectancy, income per person remained a significant predictor of high internet use.
- Each one-unit increase in income per person multiplied the odds of high internet use by 1.00009 (p < .001). The odds ratio is this close to 1 because it describes a change of a single unit of income.
- Life expectancy was also significantly associated with the outcome; each additional year of life expectancy was associated with 13.6% higher odds of high internet use (OR=1.136, 95% CI=1.05–1.22, p=.001).
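Because an odds ratio for a single unit of income is hard to read, it can help to restate it for a larger increment. The sketch below is plain arithmetic on the odds ratios reported above. The 1,000-unit increment and the helper names `rescale_odds_ratio` and `percent_change_in_odds` are illustrative choices, not part of the original analysis, and the result is approximate because 1.00009 is itself a rounded value.

```python
import math

def rescale_odds_ratio(odds_ratio, increment):
    """Odds ratio for a change of `increment` units, given the per-unit odds ratio."""
    # OR = exp(beta), so a change of k units gives exp(k * beta) = OR ** k
    beta = math.log(odds_ratio)
    return math.exp(increment * beta)

def percent_change_in_odds(odds_ratio):
    """Express an odds ratio as a percent change in the odds."""
    return (odds_ratio - 1.0) * 100.0

# Reported per-unit odds ratio for income, restated per 1,000 units (assumed increment)
income_or_per_1000 = rescale_odds_ratio(1.00009, 1000)
# Reported odds ratio for one additional year of life expectancy
lifeexp_pct = percent_change_in_odds(1.136)

print(round(income_or_per_1000, 3))  # 1.094
print(round(lifeexp_pct, 1))         # 13.6
```

Under these assumptions, a 1,000-unit difference in income per person corresponds to roughly 9% higher odds of high internet use, which is easier to compare with the 13.6% per year of life expectancy.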
b. Hypothesis Support
The results support the hypothesis that higher income per person is positively associated with higher likelihood of internet adoption.
c. Confounding Discussion
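When life expectancy was added to the model, the income coefficient decreased slightly (from 0.0001 to 0.00009). This suggests that life expectancy accounts for a small part of the crude association between income and internet use. Strictly speaking, an effect of income that operates through better population health would be mediation rather than confounding, and this model alone cannot distinguish the two.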
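The size of that shift can be expressed as a percent change in the coefficient. The sketch below uses the two rounded coefficients quoted above; the helper name `percent_change` is hypothetical, and the 10% figure often cited as a rule of thumb for a noteworthy change is a convention, not a result of this analysis.

```python
def percent_change(crude_coef, adjusted_coef):
    """Percent drop in a coefficient after adding a covariate to the model."""
    return (crude_coef - adjusted_coef) / crude_coef * 100.0

# Rounded income coefficients quoted in the text:
# crude (income only) and adjusted (income + life expectancy)
change = percent_change(0.0001, 0.00009)
print(round(change, 1))  # 10.0
```

Because both inputs are rounded to one significant digit, the 10% figure is only approximate; the unrounded coefficients from the two fitted models would give a more reliable value.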
4. Interpretation Summary
The logistic regression model demonstrates that both income per person and life expectancy are strong, independent predictors of whether a country has high internet usage.
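The overall model fit is good (Pseudo R² = 0.48, p < .001). The Pseudo R² reported by statsmodels is McFadden's measure, which compares the model's log-likelihood with that of an intercept-only model. It is not literally the share of variance explained, as R² is in linear regression, but a value of 0.48 indicates that these two socioeconomic factors improve the fit substantially.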
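For reference, McFadden's Pseudo R² is one minus the ratio of the fitted model's log-likelihood to the log-likelihood of an intercept-only model. The sketch below shows that calculation with made-up log-likelihood values chosen only to land on 0.48; they are not the values from this analysis. In statsmodels, the fitted result exposes the corresponding quantities as `llf`, `llnull`, and `prsquared`.

```python
def mcfadden_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - llf / llnull

def likelihood_ratio_stat(llf, llnull):
    """Likelihood-ratio statistic comparing the fitted model with the null model."""
    return 2.0 * (llf - llnull)

# Hypothetical log-likelihoods (assumed for illustration only)
llnull = -100.0  # intercept-only model
llf = -52.0      # model with income and life expectancy

print(round(mcfadden_r2(llf, llnull), 2))  # 0.48
print(likelihood_ratio_stat(llf, llnull))  # 96.0
```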
These findings highlight that economic wealth and population health are jointly associated with greater digital inclusion worldwide.