Random Forest: Predicting High Internet Use

 1. Program


# ====== Setup ======
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ====== Load & prepare data (Gapminder) ======

df = pd.read_csv("gapminder.csv")  # replace with your path


# Choose predictors (add more if available in your file)

num_cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate"]

num_cols = [c for c in num_cols if c in df.columns]  # keep only columns that exist


# Convert numerics

for c in num_cols + ["internetuserate"]:

    if c in df.columns:

        df[c] = pd.to_numeric(df[c], errors="coerce")


# Binary response: High Internet Use (1) if > median, else 0

median_internet = df["internetuserate"].median()

df["high_internet_use"] = (df["internetuserate"] > median_internet).astype(int)


# Optional: include categorical predictors if you have them

cat_cols = []  # e.g., ["region"] if present; will be one-hot encoded


# Drop rows missing response or all predictors

keep_cols = num_cols + cat_cols + ["high_internet_use"]

df = df.dropna(subset=["high_internet_use"])

df = df.dropna(subset=num_cols, how="all")


X = df[num_cols + cat_cols]

y = df["high_internet_use"]


# ====== Train / Test split ======
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# ====== Preprocess ======
num_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
cat_pipe = Pipeline(steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]) if cat_cols else None

transformers = [("num", num_pipe, num_cols)]
if cat_cols:
    transformers.append(("cat", cat_pipe, cat_cols))

pre = ColumnTransformer(transformers=transformers)

# ====== Random Forest model ======
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_leaf=3,
    class_weight="balanced",
    oob_score=True,
    random_state=42,
    n_jobs=-1
)

pipe = Pipeline(steps=[("pre", pre), ("model", rf)])

# ====== Fit ======
pipe.fit(X_train, y_train)

# ====== Evaluation ======
y_pred = pipe.predict(X_test)
y_prob = pipe.predict_proba(X_test)[:, 1]  # positive-class probabilities, needed for AUC

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC AUC:", roc_auc_score(y_test, y_prob))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))

# Cross-validated accuracy (optional)
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy", n_jobs=-1)
print("CV Accuracy (mean ± SD): %.3f ± %.3f" % (cv_scores.mean(), cv_scores.std()))

# ====== Out-of-bag accuracy ======
# oob_score=True was set on the forest, so the model fitted inside the
# pipeline already carries an out-of-bag accuracy estimate
print("OOB Accuracy:", pipe.named_steps["model"].oob_score_)

# ====== Feature Importances ======
# Impurity-based (Gini) importances from the fitted model in the pipeline:
rf_model = pipe.named_steps["model"]

# Feature names after preprocessing: numeric columns first, then one-hot columns (if any)
feature_names = list(num_cols)
if cat_cols:
    ohe = pipe.named_steps["pre"].named_transformers_["cat"].named_steps["ohe"]
    feature_names += ohe.get_feature_names_out(cat_cols).tolist()

importances = rf_model.feature_importances_
imp_table = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values(
    "importance", ascending=False
)
print("\nRandom Forest Feature Importances:\n", imp_table)

# Permutation importance (robust, model-agnostic). Because the whole pipeline is
# passed, permutation happens on the original input columns, so the labels come
# from X_test.columns rather than the one-hot expanded names.
perm = permutation_importance(
    pipe, X_test, y_test, n_repeats=20, scoring="roc_auc", n_jobs=-1, random_state=42
)
perm_imp = pd.DataFrame({
    "feature": X_test.columns,
    "perm_importance_mean": perm.importances_mean,
    "perm_importance_std": perm.importances_std
}).sort_values("perm_importance_mean", ascending=False)
print("\nPermutation Importances (AUC):\n", perm_imp)

# ====== (Optional) ROC curve and simple n_estimators sweep ======
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve (AUC = %.3f)" % roc_auc_score(y_test, y_prob))
plt.show()

# Simple sweep to show how test accuracy changes with the number of trees
tree_counts = [50, 100, 200, 300, 500, 800]
accs = []
for n in tree_counts:
    rf_tmp = RandomForestClassifier(
        n_estimators=n, max_depth=None, min_samples_leaf=3,
        class_weight="balanced", random_state=42, n_jobs=-1
    )
    tmp_pipe = Pipeline(steps=[("pre", pre), ("model", rf_tmp)])
    tmp_pipe.fit(X_train, y_train)
    accs.append(accuracy_score(y_test, tmp_pipe.predict(X_test)))

print("\nAccuracy by n_estimators:")
for n, a in zip(tree_counts, accs):
    print(f"{n:>4} trees: {a:.3f}")


2. Output

Test Accuracy: 0.7547169811320755
Test ROC AUC: 0.8864942528735632

Confusion Matrix:
 [[23  6]
 [ 7 17]]

Classification Report:
               precision    recall  f1-score   support

           0      0.767     0.793     0.780        29
           1      0.739     0.708     0.723        24

    accuracy                          0.755        53
   macro avg      0.753     0.751     0.752        53
weighted avg      0.754     0.755     0.754        53

CV Accuracy (mean ± SD): 0.842 ± 0.041
OOB Accuracy: 0.8717948717948718

Random Forest Feature Importances:
             feature  importance
0   incomeperperson    0.395162
2         urbanrate    0.251860
1    lifeexpectancy    0.245844
4  femaleemployrate    0.055571
3        employrate    0.051563

Permutation Importances (AUC):
             feature  perm_importance_mean  perm_importance_std
0   incomeperperson              0.124497             0.047366
1    lifeexpectancy              0.097486             0.034053
2         urbanrate              0.030172             0.037480
4  femaleemployrate              0.016954             0.008687
3        employrate              0.011925             0.005435



3. Interpretation

Model performance.
The random forest predicts whether a country's internet use is above the median with 75.5% accuracy and strong discrimination (ROC AUC = 0.886). Cross-validated accuracy (0.842 ± 0.041) and OOB accuracy (0.872) suggest the model generalizes well rather than merely fitting one particular train/test split.

Confusion matrix.
Out of 53 countries in the test set, the model correctly classified 23/29 low-internet countries and 17/24 high-internet countries, reflecting balanced performance across classes (see precision/recall scores for each class).
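
A quick arithmetic check ties the confusion matrix to the per-class recall values in the classification report (the numbers below are copied from the output above):

import numpy as np

# Confusion matrix from the output above: rows = actual class, columns = predicted class
cm = np.array([[23, 6],
               [7, 17]])

recall_low = cm[0, 0] / cm[0].sum()    # 23 / 29 ≈ 0.793 (recall for class 0)
recall_high = cm[1, 1] / cm[1].sum()   # 17 / 24 ≈ 0.708 (recall for class 1)
accuracy = np.trace(cm) / cm.sum()     # (23 + 17) / 53 ≈ 0.755
print(recall_low, recall_high, accuracy)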

Which variables matter most?
Both importance methods agree that income per person is the strongest predictor, followed by life expectancy and urbanization rate. Employment rates (overall and female) contribute modestly. The two rankings are lined up side by side in the sketch after this list.

  • Gini importances: income (0.395) > urbanization (0.252) ≈ life expectancy (0.246).

  • Permutation (AUC) importances: income (0.124) > life expectancy (0.097) > urbanization (0.030).
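
Because the program already builds both tables, a small optional sketch can combine the two rankings into one frame; it reuses imp_table, perm_imp, and plt from the program above and is not part of the original run:

# Merge impurity-based and permutation importances into a single comparison table
comparison = imp_table.merge(perm_imp, on="feature").sort_values("perm_importance_mean", ascending=False)
print(comparison)

# Horizontal bar chart of permutation importances with their standard deviations
comparison.plot.barh(x="feature", y="perm_importance_mean", xerr="perm_importance_std", legend=False)
plt.xlabel("Mean drop in ROC AUC when the feature is permuted")
plt.tight_layout()
plt.show()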

Takeaway.
The forest captures nonlinear cutoffs and interactions: higher income strongly increases the likelihood of high internet use, and greater life expectancy and higher urbanization further tilt countries toward high adoption. These results reinforce the earlier findings from the linear models and single decision trees in previous posts, using a different methodology.
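
One optional way to make those nonlinear cutoffs visible (not part of the program or output above, and assuming the fitted pipe and X_test are still in memory) is a partial dependence plot for the two strongest predictors:

from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the predicted probability of high internet use on the
# two strongest predictors, computed from the fitted pipeline
PartialDependenceDisplay.from_estimator(pipe, X_test, features=["incomeperperson", "lifeexpectancy"])
plt.tight_layout()
plt.show()

If the income effect behaves the way the importances suggest, the income curve should climb steeply at low-to-middle incomes and then level off, which is exactly the kind of threshold pattern a single linear coefficient cannot represent.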



