Describing the Dataset

 

1. Sample

The data I used come from the Gapminder dataset, which compiles publicly available data from organizations such as the United Nations, World Bank, and World Health Organization. The dataset includes information on over 200 countries and regions, covering social, economic, health, and environmental indicators. For this project, I limited my sample to countries with non-missing values for income per person, internet use rate, and life expectancy, which left approximately 180 valid observations.


2. Data Collection Procedure

Gapminder itself does not conduct primary surveys but instead compiles and harmonizes data from authoritative international sources:

  • Income per person (GDP per capita): World Bank and national accounts.

  • Internet use rate (% of population): International Telecommunication Union (ITU).

  • Life expectancy (years): United Nations Population Division and WHO.

These data are updated regularly and standardized into a single dataset for global comparisons. The values are reported annually, and I used the most recent snapshot provided in the Gapminder dataset shared for this course.


3. Measures and Data Management

For my research question — “Is income per person associated with internet use, and is this relationship moderated by life expectancy?” — I selected the following variables:

  • Income per Person (incomeperperson): Continuous measure of GDP per capita (in US dollars). I later categorized this into three groups: Low Income (up to $5,000), Middle Income (above $5,000 and up to $20,000), and High Income (above $20,000).

  • Internet Use Rate (internetuserate): Continuous variable measuring the percentage of individuals in a country who use the internet. For descriptive analysis, I categorized this into three groups: Low Internet Use (0–30%), Medium Internet Use (above 30% and up to 70%), and High Internet Use (above 70% and up to 100%).

  • Life Expectancy (lifeexpectancy): Continuous variable measuring the average life expectancy at birth in years. I created categories for moderation analysis: Low (up to 60 years), Medium (above 60 and up to 75 years), and High (above 75 and up to 90 years).
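The groupings above could be produced with pandas along the following lines. This is a minimal sketch, not the exact code behind this post: the sample values are made up for illustration, the new column names (income_group, internet_group, life_group) are hypothetical, and I assume each upper cut point belongs to the lower group, so a country at exactly $5,000 counts as Low Income.

```python
import pandas as pd

# Illustrative values only; these are not rows from the Gapminder data.
df = pd.DataFrame({
    "incomeperperson": [800.0, 5000.0, 12000.0, 20000.0, 35000.0],
    "internetuserate": [5.0, 30.0, 45.5, 70.0, 90.0],
    "lifeexpectancy": [55.0, 60.0, 68.2, 75.0, 81.0],
})

# pd.cut uses right-closed intervals by default, so each bin includes
# its upper edge. include_lowest=True also keeps a value of exactly 0.
df["income_group"] = pd.cut(
    df["incomeperperson"],
    bins=[0, 5000, 20000, float("inf")],
    labels=["Low Income", "Middle Income", "High Income"],
    include_lowest=True,
)

df["internet_group"] = pd.cut(
    df["internetuserate"],
    bins=[0, 30, 70, 100],
    labels=["Low Internet Use", "Medium Internet Use", "High Internet Use"],
    include_lowest=True,
)

df["life_group"] = pd.cut(
    df["lifeexpectancy"],
    bins=[0, 60, 75, 90],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Frequency table for one of the new categorical variables.
print(df["income_group"].value_counts(sort=False))
```

Because the variables are continuous, using shared cut points like this avoids leaving gaps between groups (for example, a value of 30.5% lands in Medium Internet Use).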

Data Management Steps:

  • Converted the three variables from text to numeric so that they could be used in calculations.

  • Dropped rows with missing values in the selected variables.

  • Created categorical groupings (binning) for clearer frequency and Chi-Square analysis.

  • Retained continuous forms of the variables for correlation and ANOVA testing.
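The first two steps can be sketched in pandas as shown below. This is only an illustration under stated assumptions: the rows are stand-ins rather than real Gapminder records, the names raw, clean, and cols are hypothetical, and I assume missing values arrive as blank or whitespace-only strings.

```python
import pandas as pd

# Stand-in rows; the real data would be loaded from the course file.
raw = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incomeperperson": ["1200.5", " ", "30500.0", "8000"],
    "internetuserate": ["12.3", "40.1", "", "55.0"],
    "lifeexpectancy": ["58.2", "70.4", "80.1", "73.9"],
})

cols = ["incomeperperson", "internetuserate", "lifeexpectancy"]

clean = raw.copy()
for col in cols:
    # errors="coerce" turns anything that cannot be parsed as a number
    # (such as blank strings) into NaN instead of raising an error.
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Keep only the rows that have all three variables present.
clean = clean.dropna(subset=cols).reset_index(drop=True)

print(len(clean), "valid observations")
```

The continuous columns in clean stay numeric, so the same frame can feed both the binned frequency tables and the tests that need the original values.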
