Python project 2

1. The Script

import pandas as pd

# Load dataset
df = pd.read_csv("gapminder.csv")

# Select variables
vars_of_interest = ["incomeperperson", "internetuserate", "lifeexpectancy"]

# Convert to numeric, handle missing/invalid
df[vars_of_interest] = df[vars_of_interest].apply(pd.to_numeric, errors="coerce")

# --- Data Management Decisions ---

# 1. Drop rows where all three variables are missing
df = df.dropna(subset=vars_of_interest, how="all")

# 2. Create categorical bins
df["income_group"] = pd.cut(df["incomeperperson"],
                            bins=[0, 5000, 20000, 100000],
                            labels=["Low Income", "Middle Income", "High Income"])

df["internet_group"] = pd.cut(df["internetuserate"],
                              bins=[0, 30, 70, 100],
                              labels=["Low Internet Use", "Medium Internet Use", "High Internet Use"])

df["lifeexp_group"] = pd.cut(df["lifeexpectancy"],
                             bins=[0, 60, 75, 90],
                             labels=["Low Life Expectancy", "Medium Life Expectancy", "High Life Expectancy"])

# --- Frequency distributions ---
print("Income Groups Frequency")
print(df["income_group"].value_counts(dropna=False))

print("\nInternet Use Groups Frequency")
print(df["internet_group"].value_counts(dropna=False))

print("\nLife Expectancy Groups Frequency")
print(df["lifeexp_group"].value_counts(dropna=False))
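A note on the bin edges used above: by default `pd.cut` builds right-closed intervals, so a value exactly equal to the lowest edge (0) or above the highest edge (for example, an income over 100,000) is assigned NaN and ends up counted with the missing cases. The sketch below illustrates this on made-up values (not Gapminder data) and shows one possible adjustment, which is an assumption and not part of the original script: `include_lowest=True` plus an open-ended top bin.

```python
import pandas as pd

# Toy values chosen to sit on or outside the bin edges (not Gapminder data).
income = pd.Series([0, 5000, 5000.01, 20000, 100000, 105000, None])
labels = ["Low Income", "Middle Income", "High Income"]

# Same bins as the script: intervals are (0, 5000], (5000, 20000], (20000, 100000].
default = pd.cut(income, bins=[0, 5000, 20000, 100000], labels=labels)

# Hypothetical alternative: keep 0 in the lowest bin and leave the top bin open.
adjusted = pd.cut(income, bins=[0, 5000, 20000, float("inf")],
                  labels=labels, include_lowest=True)

comparison = pd.DataFrame({"income": income, "default": default, "adjusted": adjusted})
print(comparison)
```

With the default settings, the NaN row of a frequency table can therefore mix truly missing values with values that fall outside the bins, so it is worth checking the minimum and maximum of each variable before choosing the edges.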


2. The Output

  • Income Groups Frequency

    • Low Income (up to $5,000): 115 countries

    • Middle Income (over $5,000, up to $20,000): 45 countries

    • High Income (over $20,000): 29 countries

    • Missing (NaN): 19 countries

  • Internet Use Groups Frequency

    • Low Internet Use (up to 30%): 93 countries

    • Medium Internet Use (over 30%, up to 70%): 66 countries

    • High Internet Use (over 70%, up to 100%): 33 countries

    • Missing (NaN): 16 countries

  • Life Expectancy Groups Frequency

    • Low Life Expectancy (up to 60 years): 38 countries

    • Medium Life Expectancy (over 60, up to 75 years): 88 countries

    • High Life Expectancy (over 75, up to 90 years): 65 countries

    • Missing (NaN): 17 countries
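The listing above is arranged from the lowest to the highest category, whereas `value_counts` sorts by frequency by default, so the raw printout can come out in a different order (for life expectancy, the Medium group would be listed first). A minimal sketch of one way to print counts in category order, using made-up values and shortened labels that are assumptions for illustration:

```python
import pandas as pd

# Toy life-expectancy values (not Gapminder data), including one missing value.
lifeexp = pd.Series([55, 65, 70, 80, 72, None])

groups = pd.cut(lifeexp, bins=[0, 60, 75, 90],
                labels=["Low", "Medium", "High"])

# Count the non-missing groups, then order the rows by category instead of by frequency.
counts = groups.value_counts().sort_index()

# Report the missing cases separately.
n_missing = int(groups.isna().sum())

print(counts)
print("Missing (NaN):", n_missing)
```

Counting the missing cases separately keeps the category index free of NaN, which makes the ordering step straightforward.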



3. Summary of Frequency Distributions

  • Income Groups: The majority of countries fall into the Low Income group (115), while only 29 countries are classified as High Income. There are 19 missing cases. This reflects wide variation in economic development.

  • Internet Use Groups: Most countries are in the Low and Medium Internet Use groups (159 combined), while only 33 countries have High Internet Use and 16 cases are missing. This points to a substantial global digital divide.

  • Life Expectancy Groups: Most countries fall in the Medium (88) or High (65) life expectancy categories, but 38 are still in the Low group, suggesting persistent global health disparities. There are 17 missing values.
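To put the counts in this summary on a common scale, they can be converted to percentages. The sketch below reuses the counts reported in this post; the dictionary layout and the variable names (`reported`, `totals`, `shares`) are assumptions made for illustration.

```python
import pandas as pd

# Counts copied from the output reported in this post.
reported = {
    "income": {"Low": 115, "Middle": 45, "High": 29, "Missing": 19},
    "internet": {"Low": 93, "Medium": 66, "High": 33, "Missing": 16},
    "lifeexp": {"Low": 38, "Medium": 88, "High": 65, "Missing": 17},
}

totals = {}
shares = {}
for name, counts in reported.items():
    s = pd.Series(counts)
    totals[name] = int(s.sum())
    # Percentage of all rows, missing cases included in the denominator.
    shares[name] = (s / s.sum() * 100).round(1)
    print(f"{name} (n = {totals[name]})")
    print(shares[name].to_string())
    print()
```

All three tables should account for the same number of rows, which serves as a quick consistency check on the reported counts.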

