Classification Tree: Predicting High Internet Use

1. Program

# ====== Setup ======

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
)

from sklearn.impute import SimpleImputer

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

import matplotlib.pyplot as plt

# ====== Load & prepare data (Gapminder) ======

df = pd.read_csv("gapminder.csv") # put your file path here

# Keep the features we’ll try first (add more if you have them)

num_cols = ["incomeperperson", "lifeexpectancy"] # you can add: "urbanrate", "employrate", etc. if present

# Convert to numeric (values that cannot be parsed become NaN)

for c in num_cols + ["internetuserate"]:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Binary response: High Internet Use (1) if above median, else 0

median_internet = df["internetuserate"].median()

df["high_internet_use"] = (df["internetuserate"] > median_internet).astype(int)

# Drop rows missing any predictor or the response. The raw rate is listed here
# because NaN > median is False, so the 0/1 label itself is never missing.

df = df.dropna(subset=num_cols + ["internetuserate"])

X = df[num_cols]

y = df["high_internet_use"]

# ====== Train / Test split ======

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# ====== Preprocess + Model pipeline ======

# Median imputation is a safeguard: rows with missing predictors were already
# dropped above, so it has nothing to fill on this data.

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

pre = ColumnTransformer(transformers=[
    ("num", num_pipe, num_cols)
])

clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=3, # keeps the tree interpretable
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42
)

pipe = Pipeline(steps=[("pre", pre), ("model", clf)])

# ====== Fit ======

pipe.fit(X_train, y_train)

# ====== Evaluation ======

y_pred = pipe.predict(X_test)

y_prob = pipe.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))

print("ROC AUC:", roc_auc_score(y_test, y_prob))

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))

# Cross-validated accuracy (optional but nice to report)

cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print("CV Accuracy (mean ± SD): %.3f ± %.3f" % (cv_scores.mean(), cv_scores.std()))

# ====== Tree text (rules) ======

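# Refit a simple tree on imputed data only to print rule text with feature names.
# Note: this tree is fit on all rows (X, y), not just the training split, so
# its splits can differ from those of the pipeline's tree evaluated above.

imputed = num_pipe.fit_transform(X) # numpy array

tree_for_text = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=5, class_weight="balanced", random_state=42
).fit(imputed, y)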

print("\nDecision Rules:\n")

print(export_text(tree_for_text, feature_names=num_cols))

# ====== Optional: Visuals (include images in your blog if allowed) ======

plt.figure(figsize=(12,6))

plot_tree(tree_for_text, feature_names=num_cols, class_names=["Low","High"], filled=True, rounded=True)

plt.title("Classification Tree for High Internet Use")

plt.tight_layout()

plt.show()

# ROC curve

fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure()

plt.plot(fpr, tpr)

plt.plot([0,1], [0,1], linestyle="--")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

plt.title("ROC Curve (AUC = %.3f)" % roc_auc_score(y_test, y_prob))

plt.show()
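
One thing to check in the labeling step: a comparison such as df["internetuserate"] > median_internet evaluates to False wherever the rate is missing, so a missing rate can end up labeled 0 (low) instead of being excluded. The sketch below shows the same rule in plain Python with made-up toy values (not Gapminder data); pandas comparisons on float columns treat NaN the same way. The names rates and threshold are illustrative only.

```python
import math

# Toy values, not Gapminder data: the third rate is missing (NaN)
rates = [10.0, 80.0, math.nan]
threshold = 45.0  # stands in for the median

# NaN > threshold is False, so the missing rate is silently labeled 0
naive_labels = [int(r > threshold) for r in rates]

# Safer: drop missing rates first, then label what remains
clean_rates = [r for r in rates if not math.isnan(r)]
clean_labels = [int(r > threshold) for r in clean_rates]

print(naive_labels)  # [0, 1, 0]
print(clean_labels)  # [0, 1]
```

Dropping rows whose response is missing before (or together with) the predictor check keeps unknown countries out of the "low" class.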

2. Output

Accuracy: 0.8636

ROC AUC: 0.9229

Confusion Matrix:

[[22  2]
 [ 4 16]]

Classification Report:

              precision    recall  f1-score   support

           0      0.846     0.917     0.880        24
           1      0.889     0.800     0.842        20

    accuracy                          0.864        44
   macro avg      0.868     0.858     0.861        44
weighted avg      0.866     0.864     0.863        44

CV Accuracy (mean ± SD): 0.858 ± 0.047

Decision Rules:

|--- incomeperperson <= 2172.45
|   |--- lifeexpectancy <= 74.20
|   |   |--- lifeexpectancy <= 68.39
|   |   |   |--- class: 0
|   |   |--- lifeexpectancy > 68.39
|   |   |   |--- class: 0
|   |--- lifeexpectancy > 74.20
|   |   |--- class: 0
|--- incomeperperson > 2172.45
|   |--- lifeexpectancy <= 73.23
|   |   |--- lifeexpectancy <= 64.86
|   |   |   |--- class: 0
|   |   |--- lifeexpectancy > 64.86
|   |   |   |--- class: 1
|   |--- lifeexpectancy > 73.23
|   |   |--- incomeperperson <= 8165.50
|   |   |   |--- class: 1
|   |   |--- incomeperperson > 8165.50
|   |   |   |--- class: 1

3. Interpretation

The classification tree achieved 86.4% accuracy and an ROC AUC of 0.923 on the 44-country test set, indicating strong discrimination between countries with high and low internet use.

For the high internet use group, precision was 0.89 (16 of the 18 countries predicted high were truly high) and recall was 0.80 (16 of the 20 truly high countries were identified, so 4 were missed). Only 2 of the 24 low-use countries were misclassified as high, a false-positive rate of about 8%.
Cross-validated accuracy (85.8% ± 4.7% over 5 folds) is close to the test accuracy, which suggests consistent performance across samples.
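
As a quick check, the reported metrics can be recomputed by hand from the confusion matrix in the Output section. This is a minimal sketch that assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, in the order 0 then 1); the variable names are illustrative.

```python
# Confusion matrix from the Output section: rows = actual, columns = predicted
tn, fp = 22, 2   # actual low:  22 predicted low, 2 predicted high
fn, tp = 4, 16   # actual high: 4 predicted low, 16 predicted high

total = tn + fp + fn + tp
accuracy = (tp + tn) / total           # share of all countries classified correctly
precision_high = tp / (tp + fp)        # of countries predicted high, share truly high
recall_high = tp / (tp + fn)           # of truly high countries, share found
false_positive_rate = fp / (fp + tn)   # of truly low countries, share predicted high

print(f"accuracy            = {accuracy:.3f}")             # 0.864
print(f"precision (high)    = {precision_high:.3f}")       # 0.889
print(f"recall (high)       = {recall_high:.3f}")          # 0.800
print(f"false positive rate = {false_positive_rate:.3f}")  # 0.083
```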

Key decision rules:

Countries with income per person ≤ $2,172 were all classified as low internet use; the life expectancy splits on this branch do not change the predicted class.
For higher-income countries (> $2,172), life expectancy refined the prediction: those with life expectancy above about 64.9 years were classified as high internet use, and those at or below it as low.
The split at $8,165 income per person (among countries with life expectancy above 73.2 years) does not change the prediction: both sides are classified as high internet use.
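
The printed decision rules can also be read as a small function. The sketch below hand-transcribes the thresholds from the Decision Rules output and merges leaves that predict the same class; the function name and the example inputs are hypothetical, not taken from the dataset.

```python
def predict_high_internet_use(incomeperperson: float, lifeexpectancy: float) -> int:
    """Return 1 (high) or 0 (low) by following the printed decision rules."""
    if incomeperperson <= 2172.45:
        # Every leaf on the low-income branch is class 0
        return 0
    if lifeexpectancy <= 64.86:
        # Higher income but low life expectancy is still class 0
        return 0
    # Life expectancy > 64.86: the 73.23 and 8165.50 splits all end in class 1
    return 1


# Hypothetical inputs chosen to land in different leaves
print(predict_high_internet_use(1500.0, 70.0))   # 0
print(predict_high_internet_use(5000.0, 60.0))   # 0
print(predict_high_internet_use(5000.0, 70.0))   # 1
print(predict_high_internet_use(20000.0, 80.0))  # 1
```

Written this way, the tree reduces to two conditions: income above $2,172.45 and life expectancy above 64.86 years.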

Interpretation Summary:
The classification tree suggests that both economic prosperity and population health are associated with digital inclusion. Income is the primary split, while life expectancy separates outcomes among countries above the income threshold. In this sample, higher income and longer life expectancy jointly predict higher internet use, consistent with global development patterns.

Sanjoy
