Random Forest: Predicting High Internet Use
1. Program
# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
# ====== Load & prepare data (Gapminder) ======
df = pd.read_csv("gapminder.csv") # replace with your path
# Choose predictors (add more if available in your file)
num_cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate"]
num_cols = [c for c in num_cols if c in df.columns] # keep only columns that exist
# Convert numerics
for c in num_cols + ["internetuserate"]:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")
# Binary response: High Internet Use (1) if > median, else 0
# Drop rows with a missing response first so they are not mislabeled as 0
df = df.dropna(subset=["internetuserate"])
median_internet = df["internetuserate"].median()
df["high_internet_use"] = (df["internetuserate"] > median_internet).astype(int)
# Optional: include categorical predictors if you have them
cat_cols = [] # e.g., ["region"] if present; will be one-hot encoded
# Drop rows missing all predictors (rows missing only some predictors are imputed later)
df = df.dropna(subset=num_cols, how="all")
X = df[num_cols + cat_cols]
y = df["high_internet_use"]
# ====== Train / Test split ======
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# ====== Preprocess ======
num_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
cat_pipe = Pipeline(steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]) if cat_cols else None
transformers = [("num", num_pipe, num_cols)]
if cat_cols:
    transformers.append(("cat", cat_pipe, cat_cols))
pre = ColumnTransformer(transformers=transformers)
# ====== Random Forest model ======
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_leaf=3,
    class_weight="balanced",
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
pipe = Pipeline(steps=[("pre", pre), ("model", rf)])
# ====== Fit ======
pipe.fit(X_train, y_train)
# ====== Evaluation ======
y_pred = pipe.predict(X_test)
# For AUC we need probabilities:
y_prob = pipe.predict_proba(X_test)[:, 1]
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC AUC:", roc_auc_score(y_test, y_prob))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))
# Cross-validated accuracy (optional)
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy", n_jobs=-1)
print("CV Accuracy (mean ± SD): %.3f ± %.3f" % (cv_scores.mean(), cv_scores.std()))
# OOB accuracy: the forest inside the pipeline was trained with oob_score=True,
# so its out-of-bag estimate is available directly from the fitted model:
print("OOB Accuracy:", pipe.named_steps["model"].oob_score_)
# ====== Feature Importances ======
# Tree-based importances from the fitted model in the pipeline:
rf_model = pipe.named_steps["model"]
# Get feature names after preprocessing:
feature_names = []
if num_cols:
    feature_names += num_cols
if cat_cols:
    # names from OneHotEncoder:
    cat_names = pipe.named_steps["pre"].named_transformers_["cat"].named_steps["ohe"].get_feature_names_out(cat_cols)
    feature_names += cat_names.tolist()
importances = rf_model.feature_importances_
imp_table = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)
print("\nRandom Forest Feature Importances:\n", imp_table)
# Permutation importance (robust, model-agnostic). Because the pipeline handles
# preprocessing internally, these importances are reported for the original input
# columns rather than the one-hot-encoded features:
perm = permutation_importance(pipe, X_test, y_test, n_repeats=20, scoring="roc_auc", n_jobs=-1, random_state=42)
perm_imp = pd.DataFrame({
    "feature": X_test.columns,
    "perm_importance_mean": perm.importances_mean,
    "perm_importance_std": perm.importances_std
}).sort_values("perm_importance_mean", ascending=False)
print("\nPermutation Importances (AUC):\n", perm_imp)
# ====== (Optional) ROC curve and simple n_estimators sweep ======
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve (AUC = %.3f)" % roc_auc_score(y_test, y_prob))
plt.show()
# Simple sweep to show how accuracy changes with number of trees
tree_counts = [50, 100, 200, 300, 500, 800]
accs = []
for n in tree_counts:
    rf_tmp = RandomForestClassifier(
        n_estimators=n, max_depth=None, min_samples_leaf=3,
        class_weight="balanced", random_state=42, n_jobs=-1
    )
    tmp_pipe = Pipeline(steps=[("pre", pre), ("model", rf_tmp)])
    tmp_pipe.fit(X_train, y_train)
    accs.append(accuracy_score(y_test, tmp_pipe.predict(X_test)))
print("\nAccuracy by n_estimators:")
for n, a in zip(tree_counts, accs):
    print(f"{n:>4} trees: {a:.3f}")
3. Interpretation
Model performance.
The random forest predicts whether a country’s internet use is above the median with 75.5% test accuracy and strong discrimination (ROC AUC = 0.887). Cross-validated accuracy (0.842 ± 0.041) and out-of-bag accuracy (0.872) indicate that the model generalizes well beyond this single train/test split.
Confusion matrix.
Out of 53 countries in the test set, the model correctly classified 23 of 29 low-internet countries and 17 of 24 high-internet countries, a reasonably balanced performance across the two classes (see the per-class precision and recall in the classification report).
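Those per-class counts can be read straight off the confusion matrix printed by the program. A minimal sketch, assuming the y_test and y_pred arrays from the program above are still in scope:
# Per-class counts and recall from the confusion matrix
# (assumes y_test and y_pred from the program above are in scope)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)   # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()             # binary layout: [[tn, fp], [fn, tp]]
print(f"Low-internet countries correct:  {tn}/{tn + fp} (recall = {tn / (tn + fp):.2f})")
print(f"High-internet countries correct: {tp}/{fn + tp} (recall = {tp / (fn + tp):.2f})")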
Which variables matter most?
Both importance methods agree that income per person is by far the strongest predictor, with life expectancy and urbanization rate next (their relative order differs slightly between the two methods). Employment rates (overall and female) contribute modestly; see the plotting sketch after the list below.
- Gini importances: income (0.395) > urbanization (0.252) ≈ life expectancy (0.246).
- Permutation (AUC) importances: income (0.124) > life expectancy (0.097) > urbanization (0.030).
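The two rankings are easier to compare visually. A minimal plotting sketch, assuming the imp_table and perm_imp DataFrames from the program above are in scope and that only numeric predictors were used (the default cat_cols = [] case), so both tables share the same feature names:
# Side-by-side comparison of Gini and permutation importances
# (assumes imp_table and perm_imp from the program above are in scope)
import matplotlib.pyplot as plt
comparison = (
    imp_table.merge(perm_imp, on="feature")   # align the two rankings by feature name
             .set_index("feature")
             .sort_values("importance")
)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
comparison["importance"].plot.barh(ax=axes[0], title="Gini importance")
comparison["perm_importance_mean"].plot.barh(
    ax=axes[1], xerr=comparison["perm_importance_std"], title="Permutation importance (AUC)"
)
plt.tight_layout()
plt.show()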
Takeaway.
The forest captures nonlinear cutoffs and interactions: higher income strongly increases the likelihood of high internet use, and greater life expectancy and higher urbanization further tilt countries toward high adoption. These results reinforce the earlier findings from the linear regression and decision tree analyses while using a different methodology.
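To see those nonlinear cutoffs directly, partial dependence plots can be drawn from the fitted pipeline. A minimal sketch, assuming the pipe and X_test objects from the program above are in scope and scikit-learn ≥ 1.0 is installed (for PartialDependenceDisplay):
# Partial dependence of the predicted probability of high internet use
# on income per person and life expectancy (assumes pipe and X_test from above)
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt
features = [c for c in ["incomeperperson", "lifeexpectancy"] if c in X_test.columns]
PartialDependenceDisplay.from_estimator(pipe, X_test, features=features)
plt.tight_layout()
plt.show()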