Sanjoy

Posts

Showing posts from October, 2025

Economic Growth, Environmental Sustainability, and Health Outcomes in Developing Countries: Preliminary Statistical Results Using World Bank Data

October 25, 2025

1. Overview of Analyses

This analysis builds on the methods described earlier, using the World Bank World Development Indicators dataset (2012–2013) to explore how economic growth and environmental sustainability relate to national health outcomes in developing countries (N = 92). All variables were standardized, and key indicators included:

Economic: GDP per capita, gross domestic savings
Environmental: CO₂ damage, renewable electricity output, natural resource depletion
Health: life expectancy, infant mortality rate, and health expenditure per capita

Three sets of analyses were conducted:

Descriptive statistics to summarize distributions.
Multiple regression models to examine economic and environmental effects on health outcomes.
K-means clustering to identify country groups with similar development profiles.

2. Preliminary Statistical Findings

A. Economic Predictors and Health Outcomes

The multiple regression results indicate a strong...
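The overview says all variables were standardized but does not say how. A minimal sketch, assuming ordinary z-scores (subtract the column mean, divide by the column standard deviation), is shown here. The column names and values are hypothetical and are not taken from the WDI extract.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; not taken from the WDI extract.
df = pd.DataFrame({
    "gdp_per_capita": [800.0, 1500.0, 2300.0, 3100.0, 4200.0],
    "life_expectancy": [55.0, 61.0, 64.0, 68.0, 71.0],
})

# z-score each column: subtract the column mean, divide by the column SD.
# ddof=0 (population SD) is the convention sklearn's StandardScaler uses.
z = (df - df.mean()) / df.std(ddof=0)

print(z.round(3))
```

After this step every column has mean 0 and standard deviation 1, so regression coefficients and cluster distances are expressed in comparable units.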

Milestone Assignment 2: Methods

October 25, 2025

1. Sample

The data were obtained from the World Bank World Development Indicators (WDI) dataset, which provides standardized cross-national indicators across economic, environmental, and health domains. From this global dataset, I selected developing countries following the World Bank’s 2023 income classification (low- and lower-middle-income economies). Observations with substantial missing data across key indicators were removed, yielding a final analytical sample of 92 countries with data available for 2012 and 2013, corresponding to the years represented in the uploaded WDI extract. The sample includes nations from Sub-Saharan Africa, South Asia, Latin America, and Southeast Asia. Each observation represents one country–year combination (N = 184 rows).

2. Measures

All variables were sourced from the WDI variable dictionary provided in the accompanying codebook.

Economic Growth Indicators

GDP per capita (current US$) – total economic output per person ( x142_2012 , x142_201...
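The sample description implies a reshape from one row per country (with a separate column per year) to one row per country–year. A minimal sketch of that reshape, assuming year-suffixed columns, could look like the following. The column names `gdp_2012`, `lifeexp_2012`, and so on are hypothetical stand-ins rather than the codebook's actual variable codes, and the values are made up.

```python
import pandas as pd

# Hypothetical wide-format extract: one row per country, one column per variable-year.
wide = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp_2012": [1000.0, 2500.0, 800.0],
    "gdp_2013": [1100.0, 2600.0, 850.0],
    "lifeexp_2012": [60.0, 68.0, 55.0],
    "lifeexp_2013": [61.0, 68.5, 56.0],
})

# Reshape to long format: one row per country-year combination.
panel = pd.wide_to_long(
    wide, stubnames=["gdp", "lifeexp"], i="country", j="year", sep="_"
).reset_index()

print(panel.sort_values(["country", "year"]))
```

With three countries and two years this yields six rows, in the same way that 92 countries over 2012 and 2013 yield the 184 rows reported above.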

Exploring the Relationship Between Economic Prosperity, Health, and Internet Adoption Across Countries

October 25, 2025

1. Project Title: Exploring the Relationship Between Economic Prosperity, Health, and Internet Adoption Across Countries

2. Research Question

How are a country’s income per person and life expectancy associated with its internet use rate? Specifically, do higher income and longer life expectancy predict greater internet adoption across nations?

3. Motivation / Rationale

Access to the internet has become a key indicator of technological advancement, educational access, and economic opportunity in the modern world. However, large disparities still exist globally. Understanding what drives these differences, especially whether economic well-being and population health influence digital inclusion, can reveal important insights into global inequality and development. By using publicly available Gapminder data, I aim to identify the strongest predictors of internet use and examine how economic and health indicators interact to shape technological access across...

K-Means Cluster Analysis: Grouping Countries by Development and Internet Use

October 22, 2025

1. Program

# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# ====== Load and Prepare Data ======
df = pd.read_csv("gapminder.csv")

# Choose clustering variables (quantitative)
cols = ["incomeperperson", "lifeexpectancy", "urbanrate",
        "employrate", "femaleemployrate", "internetuserate"]
cols = [c for c in cols if c in df.columns]

# Convert to numeric and drop missing
for c in cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")
df_clean = df.dropna(subset=cols)

# ====== Standardize Variables ======
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clean[cols])

# ====== Determine Optimal k via Elbow & Silhouette ======
inertia = []
silhouette = []
K = range(2, 7)
for k in K:
    kmeans = KMeans(n_c...
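The excerpt cuts off inside the loop over k. As a self-contained illustration of the elbow and silhouette idea, and not a reconstruction of the post's actual code, the following sketch runs the same kind of loop on synthetic data with three well-separated groups. The data, the `n_init` value, and the rule of picking the k with the highest silhouette score are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated groups in two dimensions (illustration only).
rng = np.random.default_rng(42)
centers = np.array([[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

inertia = []      # within-cluster sum of squares, used for the elbow plot
silhouette = []   # mean silhouette coefficient for each k
K = range(2, 7)
for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertia.append(km.inertia_)
    silhouette.append(silhouette_score(X_scaled, labels))

# One simple selection rule: take the k with the highest silhouette score.
best_k = list(K)[int(np.argmax(silhouette))]
print("best k by silhouette:", best_k)
```

Inertia always tends to fall as k grows, so the elbow plot shows where the gain flattens, while the silhouette score gives a single number per k that can be compared directly.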

Lasso Regression: Identifying Key Predictors of Internet Use

October 22, 2025

1. Program

# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

# ====== Load & Prepare Data ======
df = pd.read_csv("gapminder.csv")

# Choose predictors
cols = ["incomeperperson", "lifeexpectancy", "urbanrate",
        "employrate", "femaleemployrate"]
cols = [c for c in cols if c in df.columns]

for c in cols + ["internetuserate"]:
    df[c] = pd.to_numeric(df[c], errors="coerce")
df = df.dropna(subset=cols + ["internetuserate"])

X = df[cols]
y = df["internetuserate"]

# ====== Lasso with k-fold CV ======
lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42))
])
lasso_pipe.fit(X, y)
lasso = lasso_pipe.named_steps[...
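To show what a scaler-plus-LassoCV pipeline returns, here is a small sketch on synthetic data in which only two of five predictors actually drive the response. The feature names `f0` to `f4`, the true coefficients, and the noise level are invented for the example and say nothing about the Gapminder results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f0 and f1 influence y; f2..f4 are pure noise (illustration only).
rng = np.random.default_rng(0)
n = 200
cols = ["f0", "f1", "f2", "f3", "f4"]
X = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)
y = 3.0 * X["f0"] + 2.0 * X["f1"] + rng.normal(scale=0.5, size=n)

# Standardize inside the pipeline so scaling is refit on each CV training fold.
lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42)),
])
lasso_pipe.fit(X, y)

# Pull the fitted Lasso step out of the pipeline and rank coefficients by magnitude.
lasso = lasso_pipe.named_steps["lasso"]
coefs = pd.Series(lasso.coef_, index=cols).sort_values(key=np.abs, ascending=False)
print("chosen alpha:", lasso.alpha_)
print(coefs)
```

Because the predictors are standardized, coefficient magnitudes can be compared directly, and predictors shrunk to exactly zero are the ones the model dropped.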

Random Forest: Predicting High Internet Use

October 22, 2025

1. Program

# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt

# ====== Load & prepare data (Gapminder) ======
df = pd.read_csv("gapminder.csv")  # replace with your path

# Choose predictors (add more if available in your file)
num_cols = ["incomeperperson", "lifeexpectancy", "urbanrate",
            "employrate", "femaleemployrate"]
num_cols = [c for c in num_cols if c in df.columns]  # keep only columns that exist

# Convert numerics ...
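The imports above suggest a random forest evaluated with permutation importance. A compact sketch of that combination on synthetic data follows. The data-generating rule, the split size, and the hyperparameters are assumptions chosen for the example, not the post's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the class depends on column 0 only; columns 1 and 2 are noise.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))

# Permutation importance: shuffle one column at a time on held-out data
# and measure how much the score drops.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
print("test accuracy:", round(acc, 3))
print("mean importance per column:", perm.importances_mean.round(3))
```

Computing the importance on the test split rather than the training split keeps the ranking from rewarding features the forest merely memorized.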

Classification Tree: Predicting High Internet Use

October 22, 2025

1. Program

# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_auc_score, roc_curve
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt

# ====== Load & prepare data (Gapminder) ======
df = pd.read_csv("gapminder.csv")  # put your file path here

# Keep the features we’ll try first (add more if you have them)
num_cols = ["incomeperperson", "lifeexpectancy"]  # you can add: "urbanrate", "employrate", etc. if present

# Convert to numeric
for c in num_cols + ["internetuserate"]:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Binary response: High Internet Use (1) if...
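A shallow classification tree is easiest to read as text rules. The sketch below fits one on synthetic data and prints it with `export_text`. The feature names `feature_a` and `feature_b`, the threshold of 0.5, and the depth limit are hypothetical choices for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the class is 1 exactly when feature_a exceeds 0.5; feature_b is noise.
rng = np.random.default_rng(2)
n = 300
X = rng.uniform(0.0, 1.0, size=(n, 2))
y = (X[:, 0] > 0.5).astype(int)

# A depth limit keeps the tree small enough to interpret.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# export_text renders the fitted splits as indented if/else rules.
rules = export_text(tree, feature_names=["feature_a", "feature_b"])
print(rules)
print("training accuracy:", tree.score(X, y))
```

On real data the training accuracy is optimistic, which is why the post's imports include `train_test_split` and `cross_val_score` for held-out evaluation.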

Logistic Regression: Income, Life Expectancy, and Internet Use (Binary Outcome)

October 22, 2025

1. Program/Script

import pandas as pd
import statsmodels.api as sm
import numpy as np

# Load dataset
df = pd.read_csv("gapminder.csv")

# Convert variables to numeric
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")

# Drop missing values (.copy() avoids a possible SettingWithCopyWarning
# when new columns are assigned below)
df_clean = df.dropna(subset=["incomeperperson", "internetuserate", "lifeexpectancy"]).copy()

# --- Step 1: Create a binary response variable ---
# Categorize Internet Use Rate (High = 1 if > median, Low = 0 otherwise)
median_internet = df_clean["internetuserate"].median()
df_clean["high_internet_use"] = (df_clean["internetuserate"] > median_internet).astype(int)

# --- Step 2: Explanatory variables ---
df_clean["...
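The median split in Step 1 uses a strict comparison, so a country sitting exactly at the median is coded as Low. The toy example below makes that visible. The five values are made up for illustration and are not Gapminder data.

```python
import pandas as pd

# Made-up internet use rates (percent); illustration only.
df = pd.DataFrame({"internetuserate": [10.0, 20.0, 30.0, 40.0, 50.0]})

median_internet = df["internetuserate"].median()

# Strict ">" means the observation equal to the median is coded 0 (Low).
df["high_internet_use"] = (df["internetuserate"] > median_internet).astype(int)

print(median_internet)                   # 30.0
print(df["high_internet_use"].tolist())  # [0, 0, 0, 1, 1]
```

With an odd number of observations or tied values the two groups are therefore not exactly equal in size, which is worth checking with `value_counts()` before fitting the model.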

Multiple Regression Analysis: Income, Life Expectancy, and Internet Use Rate

October 22, 2025

1. Program/Script

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load dataset
df = pd.read_csv("gapminder.csv")

# Convert to numeric
df["incomeperperson"] = pd.to_numeric(df["incomeperperson"], errors="coerce")
df["internetuserate"] = pd.to_numeric(df["internetuserate"], errors="coerce")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")

# Drop missing values (.copy() avoids a possible SettingWithCopyWarning
# when new columns are assigned below)
df_clean = df.dropna(subset=["incomeperperson", "internetuserate", "lifeexpectancy"]).copy()

# Center the quantitative variables
df_clean["income_centered"] = df_clean["incomeperperson"] - df_clean["incomeperperson"].mean()
df_clean["lifeexp_centered"] = df_clean["lifeexpectancy"] - df_clean["lifeexpectancy"].mean()

# Define predictors and response
X...
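Centering the predictors changes how the intercept reads but not the slopes. The sketch below checks this with a plain least-squares fit in NumPy on synthetic data. It uses `np.linalg.lstsq` in place of the statsmodels call in the post, and the variable values and true coefficients are invented for the example.

```python
import numpy as np

# Synthetic predictors and response (illustration only).
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(loc=5000.0, scale=1500.0, size=n)   # stand-in for an income-like variable
x2 = rng.normal(loc=70.0, scale=8.0, size=n)        # stand-in for a life-expectancy-like variable
y = 5.0 + 0.004 * x1 + 0.5 * x2 + rng.normal(scale=3.0, size=n)

def ols(predictors, response):
    """Return [intercept, slope_1, slope_2, ...] from an ordinary least-squares fit."""
    X = np.column_stack([np.ones(len(response))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return beta

beta_raw = ols([x1, x2], y)
beta_centered = ols([x1 - x1.mean(), x2 - x2.mean()], y)

print("raw:     ", beta_raw)
print("centered:", beta_centered)
print("mean of y:", y.mean())
```

The slopes match across the two fits, and the centered intercept equals the mean of the response, which is the predicted value for a country with average values on both predictors.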