Random Forest: Predicting High Internet Use
1. Program
# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
# ====== Load & prepare data (Gapminder) ======
df = pd.read_csv("gapminder.csv") # replace with your path
# Choose predictors (add more if available in your file)
num_cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate"]
num_cols = [c for c in num_cols if c in df.columns] # keep only columns that exist
# Convert numerics
for c in num_cols + ["internetuserate"]:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")
# Binary response: High Internet Use (1) if > median, else 0
# Drop rows with a missing response first so they are not mislabeled as 0
df = df.dropna(subset=["internetuserate"])
median_internet = df["internetuserate"].median()
df["high_internet_use"] = (df["internetuserate"] > median_internet).astype(int)
# Optional: include categorical predictors if you have them
cat_cols = [] # e.g., ["region"] if present; will be one-hot encoded
# Drop rows missing all predictors (rows missing only some predictors are imputed later)
df = df.dropna(subset=num_cols, how="all")
X = df[num_cols + cat_cols]
y = df["high_internet_use"]
# ====== Train / Test split ======
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# ====== Preprocess ======
num_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
cat_pipe = Pipeline(steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]) if cat_cols else None
transformers = [("num", num_pipe, num_cols)]
if cat_cols:
    transformers.append(("cat", cat_pipe, cat_cols))
pre = ColumnTransformer(transformers=transformers)
# ====== Random Forest model ======
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_leaf=3,
    class_weight="balanced",
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
pipe = Pipeline(steps=[("pre", pre), ("model", rf)])
# ====== Fit ======
pipe.fit(X_train, y_train)
# ====== Evaluation ======
y_pred = pipe.predict(X_test)
# For AUC we need probabilities:
y_prob = pipe.predict_proba(X_test)[:, 1]
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC AUC:", roc_auc_score(y_test, y_prob))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))
# Cross-validated accuracy (optional)
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy", n_jobs=-1)
print("CV Accuracy (mean ± SD): %.3f ± %.3f" % (cv_scores.mean(), cv_scores.std()))
# OOB accuracy: the forest inside the pipeline was trained with oob_score=True,
# so its out-of-bag estimate is available directly from the fitted model:
print("OOB Accuracy:", pipe.named_steps["model"].oob_score_)
# ====== Feature Importances ======
# Tree-based importances from the fitted model in the pipeline:
rf_model = pipe.named_steps["model"]
# Get feature names after preprocessing:
feature_names = []
if num_cols:
    feature_names += num_cols
if cat_cols:
    # names from OneHotEncoder:
    cat_names = pipe.named_steps["pre"].named_transformers_["cat"].named_steps["ohe"].get_feature_names_out(cat_cols)
    feature_names += cat_names.tolist()
importances = rf_model.feature_importances_
imp_table = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)
print("\nRandom Forest Feature Importances:\n", imp_table)
# Permutation importance (robust, model-agnostic). Because the pipeline handles
# preprocessing internally, these importances are reported for the original input
# columns rather than the one-hot-encoded features:
perm = permutation_importance(pipe, X_test, y_test, n_repeats=20, scoring="roc_auc", n_jobs=-1, random_state=42)
perm_imp = pd.DataFrame({
    "feature": X_test.columns,
    "perm_importance_mean": perm.importances_mean,
    "perm_importance_std": perm.importances_std
}).sort_values("perm_importance_mean", ascending=False)
print("\nPermutation Importances (AUC):\n", perm_imp)
# ====== (Optional) ROC curve and simple n_estimators sweep ======
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve (AUC = %.3f)" % roc_auc_score(y_test, y_prob))
plt.show()
# Simple sweep to show how accuracy changes with number of trees
tree_counts = [50, 100, 200, 300, 500, 800]
accs = []
for n in tree_counts:
    rf_tmp = RandomForestClassifier(
        n_estimators=n, max_depth=None, min_samples_leaf=3,
        class_weight="balanced", random_state=42, n_jobs=-1
    )
    tmp_pipe = Pipeline(steps=[("pre", pre), ("model", rf_tmp)])
    tmp_pipe.fit(X_train, y_train)
    accs.append(accuracy_score(y_test, tmp_pipe.predict(X_test)))
print("\nAccuracy by n_estimators:")
for n, a in zip(tree_counts, accs):
    print(f"{n:>4} trees: {a:.3f}")
3. Interpretation
Model performance.
The random forest predicts whether a country’s internet use is above the median with 75.5% test accuracy and strong discrimination (ROC AUC = 0.887). Cross-validated accuracy (0.842 ± 0.041) and out-of-bag accuracy (0.872) indicate that the model generalizes well beyond this single train/test split.
Confusion matrix.
Out of 53 countries in the test set, the model correctly classified 23 of 29 low-internet countries and 17 of 24 high-internet countries, a reasonably balanced performance across the two classes (see the per-class precision and recall in the classification report).
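Those per-class counts can be read straight off the confusion matrix printed by the program. A minimal sketch, assuming the y_test and y_pred arrays from the program above are still in scope:
# Per-class counts and recall from the confusion matrix
# (assumes y_test and y_pred from the program above are in scope)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)   # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()             # binary layout: [[tn, fp], [fn, tp]]
print(f"Low-internet countries correct:  {tn}/{tn + fp} (recall = {tn / (tn + fp):.2f})")
print(f"High-internet countries correct: {tp}/{fn + tp} (recall = {tp / (fn + tp):.2f})")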
Which variables matter most?
Both importance methods agree that income per person is by far the strongest predictor, with life expectancy and urbanization rate next (their relative order differs slightly between the two methods). Employment rates (overall and female) contribute modestly; see the plotting sketch after the list below.
- Gini importances: income (0.395) > urbanization (0.252) ≈ life expectancy (0.246).
- Permutation (AUC) importances: income (0.124) > life expectancy (0.097) > urbanization (0.030).
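The two rankings are easier to compare visually. A minimal plotting sketch, assuming the imp_table and perm_imp DataFrames from the program above are in scope and that only numeric predictors were used (the default cat_cols = [] case), so both tables share the same feature names:
# Side-by-side comparison of Gini and permutation importances
# (assumes imp_table and perm_imp from the program above are in scope)
import matplotlib.pyplot as plt
comparison = (
    imp_table.merge(perm_imp, on="feature")   # align the two rankings by feature name
             .set_index("feature")
             .sort_values("importance")
)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
comparison["importance"].plot.barh(ax=axes[0], title="Gini importance")
comparison["perm_importance_mean"].plot.barh(
    ax=axes[1], xerr=comparison["perm_importance_std"], title="Permutation importance (AUC)"
)
plt.tight_layout()
plt.show()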
Takeaway.
The forest captures nonlinear cutoffs and interactions: higher income strongly increases the likelihood of high internet use, and greater life expectancy and higher urbanization further tilt countries toward high adoption. These results reinforce the earlier findings from the linear regression and decision tree analyses while using a different methodology.
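To see those nonlinear cutoffs directly, partial dependence plots can be drawn from the fitted pipeline. A minimal sketch, assuming the pipe and X_test objects from the program above are in scope and scikit-learn ≥ 1.0 is installed (for PartialDependenceDisplay):
# Partial dependence of the predicted probability of high internet use
# on income per person and life expectancy (assumes pipe and X_test from above)
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt
features = [c for c in ["incomeperperson", "lifeexpectancy"] if c in X_test.columns]
PartialDependenceDisplay.from_estimator(pipe, X_test, features=features)
plt.tight_layout()
plt.show()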