Python project 2
1. The Script
import pandas as pd
# Load dataset
df = pd.read_csv("gapminder.csv")
# Select variables
vars_of_interest = ["incomeperperson", "internetuserate", "lifeexpectancy"]
# Convert to numeric, handle missing/invalid
df[vars_of_interest] = df[vars_of_interest].apply(pd.to_numeric, errors="coerce")
# --- Data Management Decisions ---
# 1. Drop rows where all three variables are missing
df = df.dropna(subset=vars_of_interest, how="all")
# 2. Create categorical bins. pd.cut intervals are right-closed, e.g. (0, 5000],
#    so a value of exactly 0 or above the top edge is assigned NaN.
df["income_group"] = pd.cut(df["incomeperperson"],
                            bins=[0, 5000, 20000, 100000],
                            labels=["Low Income", "Middle Income", "High Income"])
df["internet_group"] = pd.cut(df["internetuserate"],
                              bins=[0, 30, 70, 100],
                              labels=["Low Internet Use", "Medium Internet Use", "High Internet Use"])
df["lifeexp_group"] = pd.cut(df["lifeexpectancy"],
                             bins=[0, 60, 75, 90],
                             labels=["Low Life Expectancy", "Medium Life Expectancy", "High Life Expectancy"])
# --- Frequency distributions ---
print("Income Groups Frequency")
print(df["income_group"].value_counts(dropna=False))
print("\nInternet Use Groups Frequency")
print(df["internet_group"].value_counts(dropna=False))
print("\nLife Expectancy Groups Frequency")
print(df["lifeexp_group"].value_counts(dropna=False))
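To see what the three data management steps in the script do, here is a minimal sketch on a toy table. The values are invented for illustration and are not Gapminder data; only the column names and the income bins are taken from the script.

```python
import pandas as pd

# Toy data invented for illustration; these are not Gapminder values.
toy = pd.DataFrame({
    "incomeperperson": ["0", "5000", "5000.01", "150000", " ", None],
    "lifeexpectancy": ["55", "60", "75", "80", "70", None],
})
cols = ["incomeperperson", "lifeexpectancy"]

# Non-numeric entries such as " " become NaN instead of raising an error.
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")

# how="all" drops only the last row, where every selected column is NaN.
toy = toy.dropna(subset=cols, how="all")

# Default pd.cut intervals are open on the left and closed on the right:
# (0, 5000], (5000, 20000], (20000, 100000].
toy["income_group"] = pd.cut(
    toy["incomeperperson"],
    bins=[0, 5000, 20000, 100000],
    labels=["Low Income", "Middle Income", "High Income"],
)

print(toy)
print(toy["income_group"].value_counts(dropna=False))
```

In this toy table the income of exactly 0 and the income of 150000 both end up as NaN, alongside the blank entry, so the NaN row of a frequency table can contain more than truly missing data.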
2. Output
Income Groups Frequency
- Low Income (≤ $5,000): 115 countries
- Middle Income ($5,000–$20,000): 45 countries
- High Income ($20,000–$100,000): 29 countries
- Missing (NaN): 19 countries
Internet Use Groups Frequency
- Low Internet Use (0–30%): 93 countries
- Medium Internet Use (30–70%): 66 countries
- High Internet Use (70–100%): 33 countries
- Missing (NaN): 16 countries
Life Expectancy Groups Frequency
- Low Life Expectancy (≤ 60 years): 38 countries
- Medium Life Expectancy (60–75 years): 88 countries
- High Life Expectancy (75–90 years): 65 countries
- Missing (NaN): 17 countries
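Because pd.cut assigns NaN both to values that were missing in the source and to values that fall outside the bin edges, the "Missing (NaN)" rows can mix the two. The sketch below shows one way to separate them. The helper name nan_breakdown and the toy rates are hypothetical and not part of the original script.

```python
import pandas as pd

def nan_breakdown(values: pd.Series, bins: list) -> dict:
    """Split the NaN count of a binned column into its two sources.

    Hypothetical helper for illustration only.
    """
    binned = pd.cut(values, bins=bins)
    missing = int(values.isna().sum())        # NaN before binning
    unbinned = int(binned.isna().sum())       # NaN after binning
    return {"missing_in_source": missing, "outside_bins": unbinned - missing}

# Toy internet-use rates invented for illustration; 0.0 sits on the open left edge.
toy_rates = pd.Series([0.0, 12.5, 45.0, 99.9, None])
print(nan_breakdown(toy_rates, [0, 30, 70, 100]))
```

If outside_bins is nonzero for a real column, passing include_lowest=True to pd.cut or widening the outer bin edges would keep those countries in a category, at the cost of changing the counts reported here.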
3. Summary of Frequency Distributions
- Income Groups: More than half of the countries (115 of 208) fall into the Low Income group, 45 are Middle Income, and only 29 are High Income; 19 cases are missing. The distribution is heavily skewed toward low incomes, reflecting wide variation in economic development.
- Internet Use Groups: Most countries are in the Low (93) or Medium (66) Internet Use groups, 159 combined, while only 33 have High Internet Use; 16 cases are missing. This points to a significant global digital divide.
- Life Expectancy Groups: Most countries fall into the Medium (88) or High (65) life expectancy categories, but 38 remain in the Low group, suggesting persistent global health disparities; 17 values are missing.
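As a quick check on the summary, the reported counts can be turned into totals and percentage shares. This is a minimal sketch that only re-uses the numbers from the frequency tables; the short row and column names are my own shorthand.

```python
import pandas as pd

# Counts copied from the frequency tables reported in this post.
# Row and column names are shorthand chosen for this sketch.
counts = pd.DataFrame(
    {
        "income": [115, 45, 29, 19],
        "internet": [93, 66, 33, 16],
        "life_expectancy": [38, 88, 65, 17],
    },
    index=["low", "medium", "high", "missing"],
)

# Each variable should cover the same number of rows.
totals = counts.sum()

# Share of each category in percent, rounded to one decimal place.
shares = (counts / totals * 100).round(1)

print(totals)
print(shares)
```

All three columns add up to the same total, which is what you would expect when the three frequency tables are computed on the same set of rows.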