Life Expectancy Analysis

Life Expectancy at Birth

I used World Bank data on life expectancy and related parameters (mortality, diseases, etc.) for every country and every year from 2000 to 2015. The data is quite messy. For example, the GDP column mixes several different units, which makes it unusable, and the column names are inconsistently formatted.

There are no analytical conclusions at the end, as all the findings are quite obvious. The sole goal of this project is to apply my knowledge of data cleaning and visualization, and to try to write a good-looking, reader-friendly notebook.

The data from the World Bank contains the following variables:

- Country
- Year – from 2000 to 2015
- Status – developed / developing
- Life Expectancy at Birth – abbreviated as LEB for convenience
- Adult Mortality, Infant Deaths, Under-Five Deaths (all per 1000 people), Alcohol, Hepatitis B, Measles, BMI, Polio, Diphtheria, HIV/AIDS – medical statistics for each country each year
- Population, GDP* – could be useful, but are outside the scope of this project.
- Income composition of resources – how efficiently the resources are used (supposed to have the highest positive correlation with the LEB variable)
- Schooling - years of schooling

* If we take a look at the GDP column, we'll find that the values are inconsistent. Some of them (regardless of the Year variable) are in billions, others are in tens of billions, hundreds of millions, etc.
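
A rough way to spot this kind of unit mismatch is to look for order-of-magnitude jumps between consecutive years within a country. The sketch below uses made-up numbers and a hypothetical threshold of a factor of ten; it is only an illustration and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up GDP values for one hypothetical country; the 2002 value is
# recorded in a different unit than its neighbours.
gdp = pd.DataFrame({
    'country': ['A'] * 4,
    'year': [2000, 2001, 2002, 2003],
    'gdp': [4200.0, 4350.0, 45.1, 4600.0],
})

# Absolute change in log10(GDP) relative to the previous year, per country
gdp = gdp.sort_values(['country', 'year'])
log_gdp = np.log10(gdp['gdp'])
gdp['log10_jump'] = log_gdp.groupby(gdp['country']).diff().abs()

# Flag year-over-year changes of roughly a factor of ten or more
suspicious = gdp[gdp['log10_jump'] >= 1]
print(suspicious[['country', 'year', 'gdp', 'log10_jump']])
```

Both the drop into 2002 and the recovery in 2003 are flagged, so a mis-scaled value shows up as a pair of adjacent jumps.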

Steps of the project

  • Clean up the dataset.

  • Analyze changes in LEB over 15 years.

    • LEB changes over 15 years in developing and developed nations
    • Show correlation table and scatter plots with the LEB variable.
  • Report what was done throughout the project

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
df = pd.read_csv('Life Expectancy Data.csv')

Processing

Take a look at the columns

# Display all the columns
pd.set_option('display.max_columns', len(df.columns))
df.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')
df.shape
(2938, 22)

The column names need to be fixed: some have leading or trailing whitespace, and the capitalization is inconsistent.

Fixing column names (using snake case)

# Strip the whitespaces in the column names
df.columns = [x.strip() for x in df.columns]

# Lowercase the names and replace spaces, hyphens, and slashes with underscores
df.columns = [x.lower() for x in df.columns]
df.columns = [x.replace(' ', '_').replace('-', '_').replace('/', '_') for x in df.columns]

# Displaying the column names once again
df.columns
Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv_aids', 'gdp', 'population', 'thinness__1_19_years',
       'thinness_5_9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
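
Note that the double space in ' thinness  1-19 years' becomes a double underscore in `thinness__1_19_years`. A minimal sketch of an alternative, assuming a hypothetical helper named `to_snake_case` (not part of the original notebook), collapses any run of separators into a single underscore with a regular expression:

```python
import re

def to_snake_case(name):
    """Normalize a raw column name to snake case (hypothetical helper)."""
    name = name.strip().lower()
    # Replace every run of non-alphanumeric characters with one underscore
    name = re.sub(r'[^0-9a-z]+', '_', name)
    return name.strip('_')

raw = ['Life expectancy ', ' BMI ', ' thinness  1-19 years', ' HIV/AIDS']
print([to_snake_case(c) for c in raw])
# → ['life_expectancy', 'bmi', 'thinness_1_19_years', 'hiv_aids']
```

If this helper were used, the later `drop` call would need `thinness_1_19_years` with a single underscore.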

Drop unnecessary columns

df = df.drop(columns=['percentage_expenditure', 'total_expenditure', 'thinness__1_19_years',
       'thinness_5_9_years', 'gdp', 'population'])

Deal with missing values

# Looking for NaN values
leb_nan = df['life_expectancy'].isna()
df.loc[leb_nan]
      country                year  status      life_expectancy  adult_mortality  infant_deaths  alcohol  hepatitis_b  measles   bmi  under_five_deaths  polio  diphtheria  hiv_aids  income_composition_of_resources  schooling
 624  Cook Islands           2013  Developing              NaN              NaN              0     0.01         98.0        0  82.8                  0   98.0        98.0       0.1                              NaN        NaN
 769  Dominica               2013  Developing              NaN              NaN              0     0.01         96.0        0  58.4                  0   96.0        96.0       0.1                            0.721       12.7
1650  Marshall Islands       2013  Developing              NaN              NaN              0     0.01          8.0        0  81.6                  0   79.0        79.0       0.1                              NaN        0.0
1715  Monaco                 2013  Developing              NaN              NaN              0     0.01         99.0        0   NaN                  0   99.0        99.0       0.1                              NaN        NaN
1812  Nauru                  2013  Developing              NaN              NaN              0     0.01         87.0        0  87.3                  0   87.0        87.0       0.1                              NaN        9.6
1909  Niue                   2013  Developing              NaN              NaN              0     0.01         99.0        0  77.3                  0   99.0        99.0       0.1                              NaN        NaN
1958  Palau                  2013  Developing              NaN              NaN              0      NaN         99.0        0  83.3                  0   99.0        99.0       0.1                            0.779       14.2
2167  Saint Kitts and Nevis  2013  Developing              NaN              NaN              0     8.54         97.0        0   5.2                  0   96.0        96.0       0.1                            0.749       13.4
2216  San Marino             2013  Developing              NaN              NaN              0     0.01         69.0        0   NaN                  0   69.0        69.0       0.1                              NaN       15.1
2713  Tuvalu                 2013  Developing              NaN              NaN              0     0.01          9.0        0  79.3                  0    9.0         9.0       0.1                              NaN        0.0

We’ll delete all the rows with missing values in the Life Expectancy column, because this variable is crucial for the analysis.

df = df.dropna(subset=['life_expectancy'])

For the rest of the columns, we'll impute the NaN values separately for each Status group (whether a country is developed or developing according to the World Bank), using the IterativeImputer from scikit-learn.

df.isna().sum()
country                              0
year                                 0
status                               0
life_expectancy                      0
adult_mortality                      0
infant_deaths                        0
alcohol                            193
hepatitis_b                        553
measles                              0
bmi                                 32
under_five_deaths                    0
polio                               19
diphtheria                          19
hiv_aids                             0
income_composition_of_resources    160
schooling                          160
dtype: int64

Create two datasets for the two types of nations (based on the Status column) to process them separately (and to concatenate them later).

developed = df.loc[df.status == 'Developed']
developing = df.loc[df.status == 'Developing']

# In each dataset, select the columns that contain missing values
developed_miss = developed.loc[:, developed.isnull().any()]
developing_miss = developing.loc[:, developing.isnull().any()]

# Drop the columns with missing values
developed = developed.drop(columns=developed_miss.columns)
developing = developing.drop(columns=developing_miss.columns)

# Fill in the missing values with iterative (multivariate) imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=len(developed_miss), random_state=0)
imp.fit(developed_miss)
developed_miss_filled = pd.DataFrame(np.round(imp.transform(developed_miss),1), columns=developed_miss.columns)

imp = IterativeImputer(max_iter=len(developing_miss), random_state=0)
imp.fit(developing_miss)
developing_miss_filled = pd.DataFrame(np.round(imp.transform(developing_miss),1), columns=developing_miss.columns)

# Make sure the two yet-to-be-merged datasets have the same length
print(len(developed_miss_filled), len(developed))
print(len(developing_miss_filled), len(developing))
512 512
2416 2416
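
For comparison, a much simpler baseline than `IterativeImputer` is to fill each gap with the mean of its status group. The sketch below uses a toy frame with made-up values; it only illustrates the mechanics and is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, only to illustrate the mechanics
toy = pd.DataFrame({
    'status': ['Developed', 'Developed', 'Developing', 'Developing', 'Developing'],
    'alcohol': [10.0, np.nan, 2.0, 4.0, np.nan],
})

# Mean of each status group, broadcast back to the original row order
group_mean = toy.groupby('status')['alcohol'].transform('mean')
toy['alcohol_filled'] = toy['alcohol'].fillna(group_mean)
print(toy)
```

Because `transform` keeps the original index, no splitting and re-merging of the frame is needed for this baseline.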

Concatenate/merge all four DataFrames

# developed and developed_miss_filled
developed[developed_miss_filled.columns] = developed_miss_filled.values

# developing and developing_miss_filled
developing[developing_miss_filled.columns] = developing_miss_filled.values

# developed and developing
countries = pd.concat([developed, developing])
countries = countries.sort_values(by='country').reset_index(drop=True)

# Taking a look at the final dataset
countries.sample(10)
 countryyearstatuslife_expectancyadult_mortalityinfant_deathsalcoholhepatitis_bmeaslesbmiunder_five_deathspoliodiphtheriahiv_aidsincome_composition_of_resourcesschooling
846Equatorial Guinea2000Developing52.7336.03018.3441.034.01.94.537.50.00.0
1776Myanmar2001Developing62.5239.072251914.19877.073.00.40.471.90.47.6
1719Mongolia2009Developing66.9235.01845.9296.095.00.14.697.00.713.8
2540Syrian Arab Republic2002Developing72.8135.0953845.31186.084.00.11.28.00.610.2
2013Peru2012Developing74.9129.09053.61194.095.00.15.195.00.713.4
714Democratic People’s Republic of Korea2011Developing69.4153.0803.81099.094.00.13.494.00.59.9
704Democratic People’s Republic of Korea2001Developing66.6177.016025.72198.062.00.12.570.00.510.2
2160Saint Lucia2000Developing71.6183.00036.807.07.00.411.715.30.012.8
2028Philippines2012Developing68.1217.056153623.77188.088.00.15.088.00.711.6
2760United Kingdom of Great Britain and Northern I…2006Developed79.382.0476461.3492.092.00.111.687.40.815.8
# Making sure there are no NaN values left
countries.isna().sum()
country                            0
year                               0
status                             0
life_expectancy                    0
adult_mortality                    0
infant_deaths                      0
measles                            0
bmi                                0
under_five_deaths                  0
polio                              0
diphtheria                         0
hiv_aids                           0
alcohol                            0
hepatitis_b                        0
income_composition_of_resources    0
schooling                          0
dtype: int64
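
The merge above assigns the imputed columns with `.values`. That matters because the imputed frames have a fresh 0..n-1 index while `developed` and `developing` keep the row labels of the original table. A small sketch with made-up frames shows the difference between label-based and positional assignment:

```python
import pandas as pd

# Made-up frames: same length, different index labels
left = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 11, 12])
right = pd.DataFrame({'b': [0.1, 0.2, 0.3]})  # default index 0, 1, 2

# Label-based assignment aligns on the index; no labels match, so all NaN
aligned = left.copy()
aligned['b'] = right['b']

# Positional assignment ignores the index and copies values row by row
positional = left.copy()
positional['b'] = right['b'].values

print(aligned['b'].isna().all())
print(positional['b'].tolist())
```

Positional assignment is only safe when both frames are in the same row order, which is why the length check before the merge is a useful sanity test.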
# Convert status to a categorical type, then look at the data types
countries['status'] = pd.Categorical(countries['status'], list(countries['status'].unique()))
countries.dtypes
country                              object
year                                  int64
status                             category
life_expectancy                     float64
adult_mortality                     float64
infant_deaths                         int64
measles                               int64
bmi                                 float64
under_five_deaths                     int64
polio                               float64
diphtheria                          float64
hiv_aids                            float64
alcohol                             float64
hepatitis_b                         float64
income_composition_of_resources     float64
schooling                           float64
dtype: object

Analysis

LEB changes over 15 years in developing and developed nations.

years = list(countries.year.unique())

leb_by_year_developed = developed.groupby(['year']).life_expectancy.mean()
leb_by_year_developing = developing.groupby(['year']).life_expectancy.mean()


plt.plot(leb_by_year_developed, label='Developed')
plt.plot(leb_by_year_developing, label='Developing')

plt.legend(loc=4)
plt.xticks(years, rotation=45)
plt.title('Life Expectancy for Developed and Developing Countries (2000-2015)', fontsize=14)
plt.ylabel('Life Expectancy at Birth')
plt.show()

Life expectancy for developed and developing countries
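
The same yearly means can also be computed from a single frame by grouping on both year and status. This is a sketch on made-up values, assuming a frame shaped like `countries`:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001, 2001],
    'status': ['Developed', 'Developing', 'Developing',
               'Developed', 'Developing', 'Developing'],
    'life_expectancy': [78.0, 60.0, 64.0, 79.0, 61.0, 65.0],
})

# One row per year, one column per status
leb_by_year = (
    toy.groupby(['year', 'status'])['life_expectancy']
       .mean()
       .unstack('status')
)
print(leb_by_year)
```

Calling `leb_by_year.plot()` on the result would then draw one line per status.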

Correlation of life expectancy with other variables.

def leb_stats_corr(i, y_col):
    """Draw a scatter plot of life expectancy vs y_col in subplot i of a 2x2 grid."""
    plt.subplot(2, 2, i)
    ax = sns.scatterplot(data = countries, x = 'life_expectancy', y = y_col, hue = 'status', s = 20)
    y_string = y_col.replace('_', ' ').title()
    plt.title(f'Life Expectancy vs {y_string}', fontsize=14)
    return ax


plt.rcParams['figure.figsize'] = [11, 7]

# Scatterplot life_expectancy and bmi
leb_stats_corr(1, 'bmi')

# Scatterplot life_expectancy and alcohol
leb_stats_corr(2, 'alcohol')

# Scatterplot life_expectancy and income_composition_of_resources
leb_stats_corr(3, 'income_composition_of_resources')

# Scatterplot life expectancy and schooling
leb_stats_corr(4, 'schooling')

plt.subplots_adjust(wspace=0.25, hspace=0.25, top=1.3, bottom=0.2)
plt.show()

Life expectancy vs other variables

Based on these graphs, we see a positive correlation between life expectancy and both Income Composition of Resources and Schooling.
There is a weaker positive correlation with alcohol and BMI, but it doesn't mean that people who consume more alcohol or have a higher BMI tend to live longer in general.

# Calculating Pearson correlation for these variables

cols_for_corr = ['life_expectancy', 'bmi', 'alcohol', 'income_composition_of_resources', 'schooling']
countries[cols_for_corr].corr().iloc[0]
life_expectancy                    1.000000
bmi                                0.576485
alcohol                            0.419462
income_composition_of_resources    0.722415
schooling                          0.754713
Name: life_expectancy, dtype: float64
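
For reference, the Pearson coefficient reported by `.corr()` is the covariance of two variables divided by the product of their standard deviations. The sketch below recomputes it by hand on made-up numbers (not taken from the dataset) and compares it with the pandas result:

```python
import numpy as np
import pandas as pd

# Made-up numbers, not taken from the dataset
x = pd.Series([50.0, 60.0, 70.0, 80.0])
y = pd.Series([4.0, 9.0, 11.0, 16.0])

# Pearson r: sum of co-deviations over the product of deviation norms
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # Series.corr defaults to method='pearson'
print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson only measures linear association, so a curved relationship can produce a lower coefficient than the scatter plot suggests.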

Distribution of other variables every year

Here, we'll build a function to display the distribution of data for a given year.
Due to the huge difference in some statistics between developed and developing countries, a log transformation can optionally be applied.

def dist_by_year(year, col_list, log_needed):
    """
    year – plot data for a given year (int)
    col_list - list of columns to plot (list of strings)
    log_needed - list of Boolean values; every value's index corresponds to col_list index
    """
    this_year = countries[countries['year'] == year]

    # Pair every column with its log flag instead of indexing both lists by position
    for col, use_log in zip(col_list, log_needed):
        x_col_string = col.replace('_', ' ').title()

        if not use_log:
            sns.displot(x=this_year[col], hue=this_year['status'], kind="kde")
            plt.title(f"{x_col_string} Rate for Developing and Developed Countries in {year}")
            plt.xlabel(f"{x_col_string} Rate")
        else:
            sns.displot(x=np.log(this_year[col]), hue=this_year['status'], kind="kde")
            plt.title(f"{x_col_string} Rate for Developing and Developed Countries in {year} – log scaled")
            plt.xlabel(f"{x_col_string} Rate (log scaled)")

        plt.ylabel("")
        plt.show()
        plt.clf()
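
The optional log transform in `dist_by_year` calls `np.log` directly. Count columns such as `infant_deaths` can contain zeros, and the logarithm of zero is negative infinity. A possible alternative, shown here only as a sketch on made-up counts, is `np.log1p`, which computes log(1 + x) and keeps zeros finite:

```python
import numpy as np
import pandas as pd

# Made-up counts, including a zero
deaths = pd.Series([0, 1, 9, 99])

with np.errstate(divide='ignore'):   # silence the divide-by-zero warning
    plain = np.log(deaths)           # log(0) is -inf

shifted = np.log1p(deaths)           # log(1 + x); log1p(0) is 0

print(np.isinf(plain).sum(), np.isinf(shifted).sum())
```

Switching to `log1p` changes the axis values slightly, so the axis label would need to say so.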

Test the function

i. Adult mortality, infant deaths, and under-five deaths (log transform the last two) for the years 2000 and 2015.

years = [2000, 2015]
for year in years:
    dist_by_year(year, ['adult_mortality', 'infant_deaths', 'under_five_deaths'], [False, True, True])

Adult mortality rate in 2000

Infant death rate in 2000, log

Under five death rate in 2000, log

Adult mortality rate in 2015

Infant death rate in 2015, log

Under five death rate in 2015, log

Measles rate in 2015, log

Polio rate in 2015, log

Diphtheria rate in 2015

Hepatitis B rate in 2015

Conclusion

What was done throughout the process.

  1. Fixed names of the columns.
    
    df.columns = [x.strip() for x in df.columns]
    df.columns = [x.lower() for x in df.columns]
    df.columns = [x.replace(' ', '_') for x in df.columns]
    
  2. I selected columns with missing values.
    
    developed_miss = developed.loc[:, developed.isnull().any()]
    
  3. First I displayed the number of missing values in every column.
    
    df.isna().sum()
    
  4. Then I split the dataset in two in order to fill the NaN values with the most appropriate ones (because every medical parameter is drastically different between developed and developing countries).
    
    developed = df.loc[df.status == 'Developed']
    
  5. And then split each in two again to provide the IterativeImputer from the sklearn library with only the columns that have missing values.
    
    developed_miss = developed.loc[:, developed.isnull().any()]
    
  6. ‘Concatenated’ two DataFrames with different columns and indices, but the same length, by adding columns from one DataFrame to the other with a simple assignment.
    
    developed[developed_miss_filled.columns] = developed_miss_filled.values
    

    Then I concatenated the developed and developing DataFrames, and got the original table back, but with the missing values imputed, without the dropped columns (such as GDP), and without the rows with NaN in the Life Expectancy column (due to its importance for the analysis).

    
    countries = pd.concat([developed, developing])
    countries = countries.sort_values(by='country').reset_index(drop=True)
    
  7. I made a few graphs of life expectancy growth from 2000 to 2015 and of the correlation between life expectancy and alcohol consumption, BMI, years of schooling, and income composition of resources. All these parameters have a positive correlation with life expectancy, though that doesn't tell us anything about causation.

  8. Finally, I showed the distribution of other statistics. First, I tested the function with the mortality variables (adults, infants, and children under five) for the first and the last years in the dataset. In order to display infant and child mortality on the graph, I applied a logarithmic transformation.
    Second, I tried out the function with the diseases which still affect children all over the world (measles, polio, diphtheria, hepatitis B).

For steps 7 and 8, I wrote custom functions to visualize all the required data efficiently.
The first function takes in a counter for subplots and a column name to build a scatterplot.
The second function takes in a year, a list of columns, and a list of Booleans of the same length that tells whether a log transformation should be applied to each plot.

This post is licensed under CC BY 4.0 by the author.