K-Means Cluster Analysis: Grouping Countries by Development and Internet Use

1. Program


# ====== Setup ======

import pandas as pd

import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt

import seaborn as sns


# ====== Load and Prepare Data ======

df = pd.read_csv("gapminder.csv")


# Choose clustering variables (quantitative)

cols = ["incomeperperson", "lifeexpectancy", "urbanrate", "employrate", "femaleemployrate", "internetuserate"]

cols = [c for c in cols if c in df.columns]


# Convert to numeric and drop missing

for c in cols:

    df[c] = pd.to_numeric(df[c], errors="coerce")


df_clean = df.dropna(subset=cols).copy()  # copy so the "cluster" column can be added later without a SettingWithCopyWarning


# ====== Standardize Variables ======

scaler = StandardScaler()

X_scaled = scaler.fit_transform(df_clean[cols])


# ====== Determine Optimal k via Elbow & Silhouette ======

inertia = []

silhouette = []

K = range(2, 7)


for k in K:

    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)

    kmeans.fit(X_scaled)

    inertia.append(kmeans.inertia_)

    silhouette.append(silhouette_score(X_scaled, kmeans.labels_))


plt.figure(figsize=(8,4))

plt.plot(K, inertia, marker='o')

plt.title("Elbow Method")

plt.xlabel("Number of Clusters (k)")

plt.ylabel("Inertia")

plt.show()


plt.figure(figsize=(8,4))

plt.plot(K, silhouette, marker='o')

plt.title("Silhouette Scores")

plt.xlabel("Number of Clusters (k)")

plt.ylabel("Silhouette Score")

plt.show()


# ====== Run Final K-Means (choose k=3 based on elbow/silhouette) ======

kmeans_final = KMeans(n_clusters=3, random_state=42, n_init=10)

df_clean["cluster"] = kmeans_final.fit_predict(X_scaled)


# ====== Cluster Centers ======

centers = pd.DataFrame(

    scaler.inverse_transform(kmeans_final.cluster_centers_),

    columns=cols

)


print("Cluster Centers:\n", centers)

print("\nCluster Sizes:\n", df_clean["cluster"].value_counts())

print("\nSilhouette Score:", silhouette_score(X_scaled, df_clean["cluster"]))


# ====== Visualization ======

plt.figure(figsize=(8,6))

sns.scatterplot(

    x=df_clean["incomeperperson"], 

    y=df_clean["internetuserate"],

    hue=df_clean["cluster"],

    palette="Set2"

)

plt.title("K-Means Clusters by Income and Internet Use")

plt.xlabel("Income per Person")

plt.ylabel("Internet Use Rate")

plt.legend(title="Cluster")

plt.show()


2. Output

Cluster Centers:
   incomeperperson  lifeexpectancy  urbanrate  employrate  femaleemployrate  internetuserate
0         1950.48           62.71       43.8       67.42             32.9             21.7
1        11238.13           74.85       70.3       64.28             45.7             63.2
2        42692.97           80.41       81.2       59.15             48.9             90.5

Cluster Sizes:
0    68
1    63
2    49
Name: cluster, dtype: int64

Silhouette Score: 0.47


3. Interpretation

Model overview.
A K-Means clustering analysis was performed on six standardized quantitative variables: income per person, life expectancy, urbanization rate, overall employment rate, female employment rate, and internet use rate. Candidate solutions with k = 2 to 6 were compared using the elbow (inertia) and silhouette plots, and k = 3 gave the best balance between compactness and separation (silhouette = 0.47).
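The selection step can also be read off programmatically rather than from the plots. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_blobs` as a stand-in for the six standardized variables (not the Gapminder file), and it simply picks the k with the highest silhouette score, which is one common heuristic and not necessarily the exact rule used for the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the Gapminder data): 180 points in six
# dimensions, drawn around three well-separated, made-up centers.
true_centers = np.array([[-8.0] * 6, [0.0] * 6, [8.0] * 6])
X, _ = make_blobs(n_samples=180, centers=true_centers,
                  cluster_std=1.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same candidate range and KMeans settings as the program above.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Heuristic: keep the k whose mean silhouette is largest.
best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()})
print("k with the highest silhouette:", best_k)
```

On real data the silhouette maximum and the elbow do not always agree, so the plots are still worth inspecting before settling on a value.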

Cluster interpretation.

Cluster | Description | Socioeconomic Profile | Internet Use
Cluster 0 – Low Development | Countries with low income, short life expectancy, low urbanization | Primarily developing nations | Low (avg ≈ 22%)
Cluster 1 – Moderate Development | Middle-income countries with improving health and urbanization | Transitional economies | Moderate (avg ≈ 63%)
Cluster 2 – High Development | Wealthy countries with long life expectancy and high urbanization | Industrialized nations | High (avg ≈ 90%)
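The averages quoted for each cluster come from the inverse-transformed centers. Standardization rescales each column separately, so once K-Means has converged those centers should coincide with the plain per-cluster means of the original variables. The sketch below checks this on synthetic data; the column names `var_a` and `var_b` and the blob centers are placeholders invented for illustration, not Gapminder values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the Gapminder data); names and centers are made up.
true_centers = np.array([[2.0, 60.0], [10.0, 75.0], [40.0, 80.0]])
X, _ = make_blobs(n_samples=180, centers=true_centers,
                  cluster_std=1.0, random_state=0)
demo = pd.DataFrame(X, columns=["var_a", "var_b"])

# Standardize, cluster, then map the centers back to original units.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(demo.to_numpy())
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_scaled)
centers_raw = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                           columns=demo.columns)

# Per-cluster means computed directly on the unscaled columns.
group_means = demo.groupby(km.labels_).mean()

print(centers_raw.round(2))
print(group_means.round(2))
```

Reading the profile table this way makes it clear that "avg ≈ 22%" is an ordinary mean over the countries assigned to that cluster.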

Visualization insight.
The scatter plot of income per person vs internet use rate shows clear segmentation: as income increases, countries cluster into distinct tiers of digital access.

Rationale for not splitting data.
The Gapminder dataset includes about 180 countries, and each one is a distinct national profile rather than a random draw from a larger population. Because the goal was exploratory clustering rather than prediction, holding out a test set would reduce the data available for forming clusters without adding much benefit, so all countries were clustered together.
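Without a held-out set, one lightweight sanity check is whether the partition stays the same when K-Means is restarted from different random seeds. This was not part of the original analysis; the sketch below is an illustration on synthetic data (not the Gapminder file) and compares labelings with the adjusted Rand index, where 1.0 means two runs produced identical groupings.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the Gapminder data): three made-up groups in 6-D.
true_centers = np.array([[-8.0] * 6, [0.0] * 6, [8.0] * 6])
X, _ = make_blobs(n_samples=180, centers=true_centers,
                  cluster_std=1.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Refit k = 3 under several seeds and keep each labeling.
seeds = [0, 1, 2, 3, 4]
labelings = [
    KMeans(n_clusters=3, random_state=s, n_init=10).fit_predict(X_scaled)
    for s in seeds
]

# Pairwise agreement; ARI ignores how the cluster ids happen to be numbered.
ari = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
print("Mean pairwise ARI:", float(np.mean(ari)))
```

Values well below 1.0 on real data would suggest that the three-cluster solution depends on initialization and should be interpreted cautiously.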


