K-Means Cluster Analysis: Grouping Countries by Development and Internet Use

October 22, 2025

1. Program

# ====== Setup ======
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# ====== Load and Prepare Data ======
df = pd.read_csv("gapminder.csv")

# Choose clustering variables (quantitative); keep only those present in the file
cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate", "internetuserate"]
cols = [c for c in cols if c in df.columns]

# Convert to numeric (non-numeric entries become NaN), then drop rows with missing values
for c in cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# .copy() gives an independent frame, so adding the cluster column later
# does not trigger a SettingWithCopyWarning
df_clean = df.dropna(subset=cols).copy()

# ====== Standardize Variables ======
# K-Means is distance-based, so each variable is rescaled to mean 0 and variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clean[cols])

# ====== Determine Optimal k via Elbow & Silhouette ======
inertia = []
silhouette = []
K = range(2, 7)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)  # within-cluster sum of squares
    silhouette.append(silhouette_score(X_scaled, kmeans.labels_))

plt.figure(figsize=(8, 4))
plt.plot(K, inertia, marker='o')
plt.title("Elbow Method")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.show()

plt.figure(figsize=(8, 4))
plt.plot(K, silhouette, marker='o')
plt.title("Silhouette Scores")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.show()

# ====== Run Final K-Means (choose k=3 based on elbow/silhouette) ======
kmeans_final = KMeans(n_clusters=3, random_state=42, n_init=10)
df_clean["cluster"] = kmeans_final.fit_predict(X_scaled)

# ====== Cluster Centers ======
# Map the centers back from standardized units to the original units
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans_final.cluster_centers_),
    columns=cols
)

print("Cluster Centers:\n", centers)
print("\nCluster Sizes:\n", df_clean["cluster"].value_counts())
print("\nSilhouette Score:", silhouette_score(X_scaled, df_clean["cluster"]))

# ====== Visualization ======
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x=df_clean["incomeperperson"],
    y=df_clean["internetuserate"],
    hue=df_clean["cluster"],
    palette="Set2"
)
plt.title("K-Means Clusters by Income and Internet Use")
plt.xlabel("Income per Person")
plt.ylabel("Internet Use Rate")
plt.legend(title="Cluster")
plt.show()

2. Output

Cluster Centers:
   incomeperperson  lifeexpectancy  urbanrate  employrate  femaleemployrate  internetuserate
0          1950.48           62.71       43.8       67.42              32.9             21.7
1         11238.13           74.85       70.3       64.28              45.7             63.2
2         42692.97           80.41       81.2       59.15              48.9             90.5

Cluster Sizes:
0    68
1    63
2    49
Name: cluster, dtype: int64

Silhouette Score: 0.47

3. Interpretation

Model overview.
A K-Means clustering analysis was performed using six quantitative variables: income per person, life expectancy, urbanization rate, overall employment, female employment, and internet use rate. Based on the elbow and silhouette methods, k = 3 clusters provided the best balance between compactness and separation (silhouette = 0.47).
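The silhouette part of that selection can be sketched in isolation. The example below is only an illustration under stated assumptions: it runs on synthetic data from scikit-learn's make_blobs (three well-separated groups in six dimensions), not on the Gapminder file, and the names true_centers, X_demo, scores, and best_k are hypothetical. It reuses the same KMeans settings as the program and picks the k with the highest mean silhouette.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for six indicators (NOT Gapminder data):
# three well-separated group centers in a six-dimensional space.
true_centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
    [-8.0, 8.0, -8.0, 8.0, -8.0, 8.0],
])
X_demo, _ = make_blobs(n_samples=180, centers=true_centers,
                       cluster_std=1.0, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Mean silhouette for each candidate k, with the same KMeans settings as the post.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

# The candidate with the highest mean silhouette.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data the silhouette peak and the elbow do not always agree, so the final choice of k usually remains a judgment call, as it was here.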

Cluster interpretation.

Cluster	Description	Socioeconomic Profile	Internet Use
Cluster 0 – Low Development	Countries with low income, short life expectancy, low urbanization	Primarily developing nations	Low (avg ≈ 22%)
Cluster 1 – Moderate Development	Middle-income countries with improving health and urbanization	Transitional economies	Moderate (avg ≈ 63%)
Cluster 2 – High Development	Wealthy countries with long life expectancy and high urbanization	Industrialized nations	High (avg ≈ 90%)
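K-Means assigns cluster numbers arbitrarily, so the fact that 0, 1, and 2 line up with low, moderate, and high development is specific to this run. A minimal sketch of one way to attach the descriptive names by rule instead of by eye is to rank the clusters by mean income. The two columns below are copied from the Cluster Centers output; tier_names and label_map are hypothetical names introduced for illustration.

```python
import pandas as pd

# Two columns of the cluster centers, copied from the Output section.
centers = pd.DataFrame({
    "incomeperperson": [1950.48, 11238.13, 42692.97],
    "internetuserate": [21.7, 63.2, 90.5],
})

# Descriptive names from the interpretation table, ordered from poorest to richest.
tier_names = ["Low Development", "Moderate Development", "High Development"]

# Cluster ids sorted by mean income, poorest first.
order = centers["incomeperperson"].sort_values().index

# Map each cluster id to its tier name.
label_map = {int(cluster_id): name for cluster_id, name in zip(order, tier_names)}
print(label_map)
```

With the mapping in hand, something like df_clean["cluster"].map(label_map) would give a readable label column that stays correct even if a rerun numbers the clusters differently.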

Visualization insight.
The scatter plot of income per person vs internet use rate shows clear segmentation: as income increases, countries cluster into distinct tiers of digital access.
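One way to read these tiers is through the reported cluster centers: the income gaps between them are multiplicative rather than additive. The short sketch below uses only the three mean incomes from the Output section (not the raw country data) and compares them as ratios and on a log10 scale; the variable names are hypothetical.

```python
import numpy as np

# Mean income per person for clusters 0, 1, 2, copied from the Cluster Centers output.
center_income = np.array([1950.48, 11238.13, 42692.97])

# Ratio of each tier's mean income to the tier below it.
step_ratios = center_income[1:] / center_income[:-1]

# The same centers on a log10 scale, where equal ratios become equal distances.
log_income = np.log10(center_income)
log_gaps = np.diff(log_income)

print(step_ratios)
print(log_gaps)
```

Since each tier's mean income is several times that of the tier below, a logarithmic x-axis (for example plt.xscale("log") before plt.show()) may spread the low-income countries out and make the tiers easier to see than on a linear axis.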

Rationale for not splitting data.
The Gapminder dataset covers about 180 countries, each a unique national profile rather than an independent random draw from a larger population. Because the goal was exploratory clustering rather than prediction, the data were not split into training and test sets; a held-out set would have added little here.
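Without a test set, one lightweight sanity check is whether the clusters are stable across different random initializations. This was not part of the analysis above; the sketch below is an assumption about how such a check could look, run on synthetic make_blobs data rather than Gapminder, with hypothetical names (X_demo, reference, ari_scores). It compares each rerun with a reference run using the adjusted Rand index, which equals 1.0 when two partitions are identical regardless of how the clusters are numbered.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (NOT Gapminder): three well-separated groups, six features.
true_centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
    [-8.0, 8.0, -8.0, 8.0, -8.0, 8.0],
])
X_demo, _ = make_blobs(n_samples=180, centers=true_centers,
                       cluster_std=1.0, random_state=0)

# Reference partition, using the same settings as the final model in the post.
reference = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)

# Refit with different seeds and compare each partition with the reference.
ari_scores = []
for seed in range(5):
    labels = KMeans(n_clusters=3, random_state=seed, n_init=10).fit_predict(X_demo)
    ari_scores.append(adjusted_rand_score(reference, labels))

print(min(ari_scores))
```

Scores close to 1.0 across seeds suggest the grouping does not depend on the starting centroids; clearly lower scores would be a reason to treat the cluster boundaries with more caution.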

Sanjoy
