K-Means Cluster Analysis: Grouping Countries by Development and Internet Use
1. Program
# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
# ====== Load and Prepare Data ======
df = pd.read_csv("gapminder.csv")
# Choose clustering variables (quantitative)
cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate", "internetuserate"]
cols = [c for c in cols if c in df.columns]
# Convert to numeric and drop missing
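for c in cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")
# .copy() makes df_clean an independent frame, so adding the "cluster"
# column later does not trigger a SettingWithCopyWarning
df_clean = df.dropna(subset=cols).copy()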
# ====== Standardize Variables ======
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clean[cols])
# ====== Determine Optimal k via Elbow & Silhouette ======
inertia = []
silhouette = []
K = range(2, 7)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette.append(silhouette_score(X_scaled, kmeans.labels_))
plt.figure(figsize=(8,4))
plt.plot(K, inertia, marker='o')
plt.title("Elbow Method")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.show()
plt.figure(figsize=(8,4))
plt.plot(K, silhouette, marker='o')
plt.title("Silhouette Scores")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.show()
# ====== Run Final K-Means (choose k=3 based on elbow/silhouette) ======
kmeans_final = KMeans(n_clusters=3, random_state=42, n_init=10)
df_clean["cluster"] = kmeans_final.fit_predict(X_scaled)
# ====== Cluster Centers ======
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans_final.cluster_centers_),
    columns=cols
)
print("Cluster Centers:\n", centers)
print("\nCluster Sizes:\n", df_clean["cluster"].value_counts())
print("\nSilhouette Score:", silhouette_score(X_scaled, df_clean["cluster"]))
# ====== Visualization ======
plt.figure(figsize=(8,6))
sns.scatterplot(
    x=df_clean["incomeperperson"],
    y=df_clean["internetuserate"],
    hue=df_clean["cluster"],
    palette="Set2"
)
plt.title("K-Means Clusters by Income and Internet Use")
plt.xlabel("Income per Person")
plt.ylabel("Internet Use Rate")
plt.legend(title="Cluster")
plt.show()
3. Interpretation
Model overview.
A K-Means cluster analysis was run on six standardized quantitative variables: income per person, life expectancy, urbanization rate, employment rate, female employment rate, and internet use rate. Based on the elbow and silhouette plots for k = 2 through 6, k = 3 clusters provided the best balance between compactness and separation (silhouette = 0.47).
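The silhouette comparison can also be done programmatically rather than by reading the plot. The following is a minimal sketch, not part of the original analysis: it uses synthetic data from `make_blobs` as a stand-in for the standardized country variables, and the helper name `pick_k_by_silhouette` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the country data (NOT Gapminder values):
# three well-separated groups in six dimensions.
centers = np.array([[-8.0] * 6, [0.0] * 6, [8.0] * 6])
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)


def pick_k_by_silhouette(X, k_values, random_state=42):
    """Fit K-Means for each k and return the k with the highest mean silhouette."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = pick_k_by_silhouette(X_scaled, range(2, 7))
print("Best k by silhouette:", best_k)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
```

On real data the highest silhouette score and the elbow in the inertia curve do not always agree, so the automatic pick is best treated as one input alongside the plots.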
Cluster interpretation.
| Cluster | Characteristics | Typical Members | Internet Use Rate |
|---|---|---|---|
| Cluster 0 – Low Development | Countries with low income, short life expectancy, low urbanization | Primarily developing nations | Low (avg ≈ 22%) |
| Cluster 1 – Moderate Development | Middle-income countries with improving health and urbanization | Transitional economies | Moderate (avg ≈ 63%) |
| Cluster 2 – High Development | Wealthy countries with long life expectancy and high urbanization | Industrialized nations | High (avg ≈ 90%) |
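K-Means assigns cluster numbers arbitrarily, so a "low / moderate / high" reading of clusters 0, 1, and 2 depends on checking the per-cluster means. One way to make the numbering line up with development level is to relabel clusters by ascending mean income before profiling them. The sketch below illustrates the idea on made-up toy values (not Gapminder data); the helper `relabel_by_mean` and the column `cluster_ordered` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data (made-up values, NOT from Gapminder).
toy = pd.DataFrame({
    "incomeperperson": [500.0, 800.0, 30000.0, 35000.0, 6000.0, 7000.0],
    "internetuserate": [5.0, 10.0, 85.0, 90.0, 40.0, 50.0],
    "cluster": [2, 2, 0, 0, 1, 1],  # arbitrary IDs, as K-Means might assign them
})


def relabel_by_mean(df, label_col, order_col):
    """Renumber clusters 0..k-1 in ascending order of the mean of order_col."""
    means = df.groupby(label_col)[order_col].mean().sort_values()
    mapping = {int(old): new for new, old in enumerate(means.index)}
    return df[label_col].map(mapping), mapping


toy["cluster_ordered"], mapping = relabel_by_mean(toy, "cluster", "incomeperperson")

# Per-cluster averages in original units, now ordered from low to high income.
profile = toy.groupby("cluster_ordered")[["incomeperperson", "internetuserate"]].mean()
print(mapping)
print(profile)
```

Applied to the analysis above, the same pattern would use `df_clean` with its `cluster` column in place of `toy`.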
Visualization insight.
The scatter plot of income per person versus internet use rate shows clear segmentation: as income increases, countries fall into distinct tiers of digital access.
Rationale for not splitting data.
The Gapminder dataset includes ~180 countries, each a unique national profile rather than an independent random sample from a larger population. Because the goal was exploratory clustering rather than prediction, splitting the data into training and test sets would not provide additional benefit.