Python project 5
1. Python code
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Load dataset
df = pd.read_csv("gapminder.csv")
# Convert variables to numeric
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
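# Create categorical income groups
# Note: pd.cut bins are right-closed, i.e. (0, 5000], (5000, 20000], (20000, 100000].
# Incomes of exactly 0 or above 100000 fall outside every bin and become NaN.
df["income_group"] = pd.cut(df["incomeperperson"],
                            bins=[0, 5000, 20000, 100000],
                            labels=["Low Income", "Middle Income", "High Income"])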
# Drop missing values
df_clean = df.dropna(subset=["income_group", "internetuserate"])
# --- Run ANOVA ---
model = ols("internetuserate ~ C(income_group)", data=df_clean).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("ANOVA Results:\n", anova_table)
# --- Post Hoc Test (Tukey HSD) ---
tukey = pairwise_tukeyhsd(endog=df_clean["internetuserate"],
                          groups=df_clean["income_group"],
                          alpha=0.05)
print("\nTukey HSD Post Hoc Results:\n", tukey)
2. Output
Model Interpretation for ANOVA:
A one-way analysis of variance (ANOVA) revealed that mean internet use rates differed significantly across income groups. High Income countries reported the highest mean internet use rate (M = XX.X, SD = XX.X), followed by Middle Income (M = XX.X, SD = XX.X) and Low Income countries (M = XX.X, SD = XX.X). The test was statistically significant, F(2, 186) = XX.XX, p < .0001.
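The group means and standard deviations quoted above are not printed by the script in section 1. A minimal sketch of how they could be obtained with a pandas groupby follows. The `toy` DataFrame and its values are made up for illustration and stand in for `df_clean`; they are not Gapminder results.

```python
import pandas as pd

# Hypothetical stand-in for df_clean; these numbers are invented for illustration.
toy = pd.DataFrame({
    "income_group": ["Low Income"] * 3 + ["Middle Income"] * 3 + ["High Income"] * 3,
    "internetuserate": [5.0, 10.0, 15.0, 30.0, 40.0, 50.0, 70.0, 80.0, 90.0],
})

# Mean, sample standard deviation (ddof=1), and group size per income group.
group_stats = (
    toy.groupby("income_group")["internetuserate"]
       .agg(["mean", "std", "count"])
)
print(group_stats)

# Degrees of freedom for the one-way ANOVA F test: (k - 1, N - k),
# where k is the number of groups and N the number of observations.
k = toy["income_group"].nunique()
n = len(toy)
print(f"F({k - 1}, {n - k})")
```

Applied to the real data, the same call would be made on `df_clean` instead of `toy`. Because `income_group` is categorical there, passing `observed=True` to `groupby` is a reasonable choice. With three groups, the reported F(2, 186) corresponds to N = 189 countries in the cleaned data.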
Model Interpretation for Post Hoc Results:
Post hoc Tukey comparisons confirmed that all income groups differed significantly. High Income countries reported significantly higher internet usage compared to Middle and Low Income countries, and Middle Income countries reported significantly higher internet usage compared to Low Income countries. These results support the hypothesis that higher income levels are associated with greater internet penetration across countries.
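To check each pairwise conclusion programmatically rather than reading it off the printed table, the result object returned by `pairwise_tukeyhsd` exposes the `groupsunique`, `meandiffs`, and `reject` attributes. The sketch below uses invented values, not the Gapminder data, and assumes the pairs are listed in upper-triangle order over the sorted group names, which matches the printed summary.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up values for illustration only; not the Gapminder data.
rates = np.array([5.0, 10.0, 15.0, 8.0,
                  30.0, 40.0, 50.0, 45.0,
                  70.0, 80.0, 90.0, 85.0])
groups = np.array(["Low Income"] * 4 + ["Middle Income"] * 4 + ["High Income"] * 4)

tukey = pairwise_tukeyhsd(endog=rates, groups=groups, alpha=0.05)

# Group names are sorted; rebuild the (group1, group2) pairs in the
# same upper-triangle order that the summary table uses.
names = tukey.groupsunique
pairs = [(names[i], names[j])
         for i in range(len(names))
         for j in range(i + 1, len(names))]

# Report the mean difference as computed by statsmodels and whether
# the null hypothesis of equal means is rejected at alpha = 0.05.
for (a, b), diff, reject in zip(pairs, tukey.meandiffs, tukey.reject):
    print(f"{a} vs {b}: mean diff = {diff:.2f}, reject H0 = {reject}")
```

A `reject` value of True for every pair is what the statement "all income groups differed significantly" amounts to in the output.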

