Logistic Regression: Income, Life Expectancy, and Internet Use (Binary Outcome)

1. Program/Script


import pandas as pd
import statsmodels.api as sm
import numpy as np

# Load dataset
df = pd.read_csv("gapminder.csv")

# Convert variables to numeric (unparseable entries become NaN)
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")

# Drop missing values; .copy() avoids SettingWithCopyWarning when adding columns below
df_clean = df.dropna(subset=["incomeperperson", "internetuserate", "lifeexpectancy"]).copy()

# --- Step 1: Create a binary response variable ---
# Categorize Internet Use Rate (High = 1 if > median, Low = 0 otherwise)
median_internet = df_clean["internetuserate"].median()
df_clean["high_internet_use"] = (df_clean["internetuserate"] > median_internet).astype(int)

# --- Step 2: Explanatory variables ---
# Center each predictor at its sample mean so the intercept refers to an average country
df_clean["income_centered"] = df_clean["incomeperperson"] - df_clean["incomeperperson"].mean()
df_clean["lifeexp_centered"] = df_clean["lifeexpectancy"] - df_clean["lifeexpectancy"].mean()

# --- Step 3: Logistic Regression Model ---
X = sm.add_constant(df_clean[["income_centered", "lifeexp_centered"]])
y = df_clean["high_internet_use"]

logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())

# --- Step 4: Calculate Odds Ratios and 95% CI ---
# Exponentiate the coefficients and their confidence limits to get odds ratios
params = logit_model.params
conf = logit_model.conf_int()
conf["OR"] = params
conf.columns = ["2.5%", "97.5%", "OR"]
odds_ratios = np.exp(conf)
print("\nOdds Ratios with 95% CI:\n", odds_ratios)


2. Output

Optimization terminated successfully.
         Current function value: 0.278462
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:      high_internet_use   No. Observations:                  173
Model:                          Logit   Df Residuals:                      170
Method:                           MLE   Df Model:                            2
Date:                Wed, 22 Oct 2025   Pseudo R-squ.:                  0.5983
Time:                        15:38:44   Log-Likelihood:                -48.174
converged:                       True   LL-Null:                       -119.91
Covariance Type:            nonrobust   LLR p-value:                 6.994e-32
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const                1.6627      0.762      2.183      0.029       0.170       3.156
income_centered      0.0005      0.000      3.407      0.001       0.000       0.001
lifeexp_centered     0.2165      0.059      3.663      0.000       0.101       0.332
====================================================================================

Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Odds Ratios with 95% CI:
                       2.5%      97.5%        OR
const             1.185065  23.469333  5.273773
income_centered   1.000206   1.000765  1.000486
lifeexp_centered  1.105906   1.394325  1.241770
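
As a quick sanity check on the table above, the odds ratios are just the exponentiated coefficients. The sketch below uses the rounded coefficients printed in the summary, so it matches the table only to about three decimals. The variable names are illustrative and are not part of the original script.

```python
import math

# Rounded coefficients copied from the summary table above.
coef_const = 1.6627
coef_lifeexp = 0.2165

# Odds ratio = exp(coefficient).
or_lifeexp = math.exp(coef_lifeexp)

# Both predictors are mean-centered, so exp(const) is the odds of high
# internet use for a country at the sample-mean income and life expectancy.
odds_at_mean = math.exp(coef_const)

# Convert odds to a probability: p = odds / (1 + odds).
prob_at_mean = odds_at_mean / (1 + odds_at_mean)

print(f"OR (life expectancy): {or_lifeexp:.3f}")
print(f"Odds at the mean:     {odds_at_mean:.3f}")
print(f"Probability at mean:  {prob_at_mean:.3f}")
```

Read this way, the intercept says that a country with average income and average life expectancy has a predicted probability of roughly 0.84 of falling in the high internet use group.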

3. Results Summary

a. Findings
After adjusting for life expectancy, income per person remained a significant predictor of high internet use.

  • Each one-unit increase in income per person was associated with 1.0005× higher odds of having high internet use (OR=1.0005, 95% CI=1.0002–1.0008, p=.001).

  • Life expectancy was also significantly associated; each additional year of life expectancy was associated with 24.2% higher odds of high internet use (OR=1.242, 95% CI=1.11–1.39, p<.001).
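
Because income is measured in single currency units, its per-unit odds ratio is necessarily very close to 1, which makes it hard to read. A rescaled odds ratio can help. The sketch below starts from the per-unit odds ratio in the output table and assumes a 1,000-unit step as a convenient reporting scale; that step size is an illustrative choice, not something taken from the analysis.

```python
import math

# Per-unit odds ratio for income_centered, copied from the output table.
or_per_unit = 1.000486

# Assumed reporting step (hypothetical choice for readability).
step = 1_000

# OR for a `step`-unit increase: exp(step * beta), where beta = log(OR per unit).
beta = math.log(or_per_unit)
or_per_step = math.exp(step * beta)

print(f"OR per {step:,} units of income: {or_per_step:.2f}")
```

Under these assumptions, a 1,000-unit difference in income per person corresponds to roughly 1.6 times the odds of high internet use, holding life expectancy fixed.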


b. Hypothesis Support
The results support the hypothesis that higher income per person is positively associated with higher likelihood of internet adoption.


c. Confounding Discussion
When life expectancy was added to the model, the income coefficient decreased slightly but remained significant (0.0005 in the adjusted model shown in the output above), suggesting partial confounding: part of the association between income and internet use is shared with population health (life expectancy).
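
One common way to quantify this kind of confounding is the relative change in a coefficient between the unadjusted and adjusted models. The helper below is a hypothetical sketch: the function name is made up for illustration, and the values passed to it are placeholders, not results from this analysis. A change of more than about 10% is a frequently cited rule of thumb for meaningful confounding, though it is only a heuristic.

```python
def percent_change_in_coefficient(unadjusted: float, adjusted: float) -> float:
    """Return the percent change in a coefficient after adjustment.

    Positive values mean the coefficient shrank toward zero when the
    additional covariate was added to the model.
    """
    if unadjusted == 0:
        raise ValueError("unadjusted coefficient must be non-zero")
    return (unadjusted - adjusted) / unadjusted * 100


# Placeholder coefficients, chosen only to illustrate the calculation.
example_change = percent_change_in_coefficient(unadjusted=0.50, adjusted=0.40)
print(f"Change in estimate: {example_change:.1f}%")
```

To apply it here, the unadjusted coefficient would come from a model with income as the only predictor, and the adjusted coefficient from the model reported above.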


4. Interpretation Summary

The logistic regression model shows that both income per person and life expectancy are significant, independent predictors of whether a country has high internet usage.
The overall model fit is good (McFadden pseudo R² = 0.60, LLR p < .001), indicating a large improvement in log-likelihood over an intercept-only model. Note that statsmodels flagged possible quasi-separation (about 14% of observations can be perfectly predicted), so the coefficient estimates should be read with some caution.
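
The pseudo R² that statsmodels reports is McFadden's, and it can be reproduced from the two log-likelihoods in the summary output. The sketch below uses the rounded values printed there, so the last digit may differ slightly from the summary. The variable names are illustrative.

```python
# Log-likelihoods copied from the summary output above.
ll_model = -48.174   # fitted model (Log-Likelihood)
ll_null = -119.91    # intercept-only model (LL-Null)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic comparing the fitted and intercept-only models.
llr_stat = 2 * (ll_model - ll_null)

print(f"McFadden pseudo R-squared: {pseudo_r2:.3f}")
print(f"Likelihood-ratio statistic: {llr_stat:.3f}")
```

Unlike R² in linear regression, this quantity measures relative improvement in log-likelihood rather than a share of variance explained, so it is best compared across models fitted to the same data.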

These findings highlight that economic wealth and population health jointly enhance digital inclusion worldwide.
