Chi-Square Results
1. Program code
import pandas as pd
from scipy.stats import chi2_contingency
# Load dataset
df = pd.read_csv("gapminder.csv")
# Convert variables to numeric
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
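# Create categorical groups.
# pd.cut bins are right-closed by default: (0, 5000], (5000, 20000], (20000, 100000].
# Values outside the bins (including exactly 0) become NaN.
df["income_group"] = pd.cut(df["incomeperperson"],
                            bins=[0, 5000, 20000, 100000],
                            labels=["Low Income", "Middle Income", "High Income"])
df["internet_group"] = pd.cut(df["internetuserate"],
                              bins=[0, 30, 70, 100],
                              labels=["Low Internet Use", "Medium Internet Use", "High Internet Use"])
# Drop rows where either group is missing (non-numeric or out-of-range values)
df_clean = df.dropna(subset=["income_group", "internet_group"])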
# Create contingency table
contingency_table = pd.crosstab(df_clean["income_group"], df_clean["internet_group"])
print("Contingency Table:\n", contingency_table)
# Run Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("\nChi-Square Test Results")
print("Chi2:", chi2, " | df:", dof, " | p-value:", p)
print("\nExpected Frequencies:\n", expected)
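For readers who want to see where the expected frequencies and the Chi2 value come from, here is a minimal sketch of the arithmetic in plain Python. The 3x3 table below is made up for illustration and is not the gapminder contingency table; the variable names are also only illustrative. No continuity correction is applied, which matches the usual treatment of tables with more than one degree of freedom.

```python
# Illustrative 3x3 table of counts (made-up numbers, NOT the gapminder data).
observed = [
    [20, 10, 0],
    [10, 20, 10],
    [0, 10, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print("Expected:", expected)
print("Chi2:", round(chi2, 2), "| df:", dof)
```

Each expected cell depends only on its row and column totals, so a large gap between an observed and an expected count is what pushes the statistic up.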
The Chi-Square Test of Independence revealed a significant association between income group and internet use group, χ²(4, N=183) = 175.44, p < .0001.
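A significant p-value says the association is unlikely to be due to chance, but not how strong it is. One common way to gauge strength is Cramér's V, which is not part of the original analysis; the sketch below is an assumption-laden add-on that uses only the chi-square value and N reported above, plus the 3x3 table shape implied by the three income groups and three internet-use groups.

```python
import math

# Values reported in the text above.
chi2 = 175.44
n = 183
n_rows, n_cols = 3, 3  # three income groups x three internet-use groups

# Cramér's V = sqrt(chi2 / (N * (min(rows, cols) - 1)))
cramers_v = math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))
print("Cramér's V:", round(cramers_v, 2))
```

With these inputs V comes out at about 0.69 on a 0 to 1 scale, which by common rules of thumb would be read as a strong association.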
- Observed vs Expected:
  - Low Income countries had far more Low Internet Use cases than expected under independence (88 vs ~55), and virtually no High Internet Use cases (0 vs ~19 expected).
  - High Income countries had far more High Internet Use cases than expected (24 vs ~5 expected).
  - Middle Income countries leaned more toward Medium Internet Use than expected.
- Interpretation: This indicates that income level and internet penetration are not independent; rather, they are strongly associated. Low-income countries are disproportionately in the Low Internet Use group, while high-income countries are disproportionately in the High Internet Use group.
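To put the observed-versus-expected gaps on a common scale, one option is the Pearson residual, (O - E) / sqrt(E), for each cell. The sketch below is not part of the original analysis; it uses only the three cells quoted above, and the expected counts are the rounded approximations from the text, so the residuals are approximate too. The helper name `pearson_residual` is hypothetical.

```python
import math

# (observed, approximate expected) pairs quoted in the text above.
# The expected values are rounded, so the results are approximate.
cells = {
    ("Low Income", "Low Internet Use"): (88, 55),
    ("Low Income", "High Internet Use"): (0, 19),
    ("High Income", "High Internet Use"): (24, 5),
}

def pearson_residual(observed, expected):
    """Return (O - E) / sqrt(E): positive means more cases than expected."""
    return (observed - expected) / math.sqrt(expected)

residuals = {cell: pearson_residual(o, e) for cell, (o, e) in cells.items()}

for (income, internet), r in residuals.items():
    print(f"{income} / {internet}: {r:+.2f}")
```

Residuals well beyond roughly 2 in absolute value are often taken as the cells driving the overall result; all three quoted cells fall in that range, with High Income / High Internet Use the largest.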