Classification Tree: Predicting High Internet Use
1. Program
# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import (
accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
# ====== Load & prepare data (Gapminder) ======
df = pd.read_csv("gapminder.csv") # put your file path here
# Keep the features we’ll try first (add more if you have them)
num_cols = ["incomeperperson", "lifeexpectancy"] # you can add: "urbanrate", "employrate", etc. if present
# Convert to numeric
for c in num_cols + ["internetuserate"]:
df[c] = pd.to_numeric(df[c], errors="coerce")
# Binary response: High Internet Use (1) if above median, else 0
median_internet = df["internetuserate"].median()
df["high_internet_use"] = (df["internetuserate"] > median_internet).astype(int)
# Drop rows missing all predictors or response
df = df.dropna(subset=num_cols + ["high_internet_use"])
X = df[num_cols]
y = df["high_internet_use"]
# ====== Train / Test split ======
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# ====== Preprocess + Model pipeline ======
num_pipe = Pipeline(steps=[
("imputer", SimpleImputer(strategy="median"))
])
pre = ColumnTransformer(transformers=[
("num", num_pipe, num_cols)
])
clf = DecisionTreeClassifier(
criterion="gini",
max_depth=3, # keeps the tree interpretable
min_samples_leaf=5,
class_weight="balanced",
random_state=42
)
pipe = Pipeline(steps=[("pre", pre), ("model", clf)])
# ====== Fit ======
pipe.fit(X_train, y_train)
# ====== Evaluation ======
y_pred = pipe.predict(X_test)
y_prob = pipe.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))
# Cross-validated accuracy (optional but nice to report)
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("CV Accuracy (mean ± SD): %.3f ± %.3f" % (cv_scores.mean(), cv_scores.std()))
# ====== Tree text (rules) ======
# Refit a simple tree on imputed data only to print rule text with feature names
imputed = num_pipe.fit_transform(X) # numpy array
tree_for_text = DecisionTreeClassifier(
criterion="gini", max_depth=3, min_samples_leaf=5, class_weight="balanced", random_state=42
).fit(imputed, y)
print("\nDecision Rules:\n")
print(export_text(tree_for_text, feature_names=num_cols))
# ====== Optional: Visuals (include images in your blog if allowed) ======
plt.figure(figsize=(12,6))
plot_tree(tree_for_text, feature_names=num_cols, class_names=["Low","High"], filled=True, rounded=True)
plt.title("Classification Tree for High Internet Use")
plt.tight_layout()
plt.show()
# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve (AUC = %.3f)" % roc_auc_score(y_test, y_prob))
plt.show()
3. Interpretation
The classification tree model achieved an accuracy of 86.4% and an ROC AUC of 0.923, indicating excellent discrimination between countries with high vs low internet use.
-
Precision (0.89) and recall (0.80) for the high internet group show that the model identifies high-usage countries reliably while maintaining a low false-positive rate.
-
Cross-validated accuracy (85.8% ± 4.7%) suggests consistent performance across samples.
Key decision rules:
-
Countries with income per person ≤ $2172 were overwhelmingly classified as low internet use.
-
For higher-income countries (> $2172), life expectancy helped refine predictions—those with life expectancy > 65 years were more likely to exhibit high internet use.
-
Beyond $8000 income per person, nearly all countries were predicted to have high internet adoption.
Interpretation Summary:
The classification tree clearly demonstrates that both economic prosperity and population health are strong predictors of digital inclusion. Income serves as the primary branching factor, while life expectancy captures secondary differences among moderately wealthy nations. The model confirms that higher income and longer life expectancy jointly predict greater internet usage rates, consistent with global development patterns.
Comments
Post a Comment