Life Expectancy at Birth
I used data from the World Bank on life expectancy along with other parameters (mortality, diseases, etc.) for every country and every year from 2000 to 2015. The data is quite messy. For example, the GDP column mixes several different units, which makes it unusable as is, and the column names are inconsistently formatted.
There are no analytical conclusions at the end, as all the findings are quite obvious. The sole goal of this project is to apply my knowledge in data cleaning and visualization, as well as try to write a good-looking and reader-friendly notebook.
The data from the World Bank contains the following variables:
- Country
- Year: from 2000 to 2015
- Status: developed / developing
- Life Expectancy at Birth: abbreviated as LEB for convenience
- 'Adult Mortality', 'Infant deaths', 'Under Five Deaths' (all per 1000 people), 'Alcohol', 'Hepatitis B', 'Measles', 'BMI', 'Polio', 'Diphtheria', 'HIV/AIDS': medical statistics for each country and year
- Population, GDP*: could be useful, but are outside the scope of this project
- Income composition of resources: how efficiently resources are used (expected to have the highest positive correlation with the LEB variable)
- Schooling: years of schooling
* If we take a look at the GDP column, we’ll find that the values are inconsistent. Some of them (regardless of the Year variable) are in units of billions, others in tens of billions, hundreds of millions, etc.
Steps of the project
| import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
|
| df = pd.read_csv('Life Expectancy Data.csv')
|
Processing
Take a look at the columns
| # Display all the columns
pd.set_option('display.max_columns', len(df.columns))
df.columns
|
| Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
' thinness 1-19 years', ' thinness 5-9 years',
'Income composition of resources', 'Schooling'],
dtype='object')
|
The column names need to be fixed: some have leading or trailing whitespace, and the capitalization is inconsistent.
Fixing column names (using snake case)
| # Strip leading and trailing whitespace from the column names
df.columns = [x.strip() for x in df.columns]
# Lowercase the names, then replace spaces, hyphens, and slashes with underscores
df.columns = [x.lower() for x in df.columns]
df.columns = [x.replace(' ', '_').replace('-', '_').replace('/', '_') for x in df.columns]
# Display the column names once again
df.columns
|
| Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure',
'diphtheria', 'hiv_aids', 'gdp', 'population', 'thinness__1_19_years',
'thinness_5_9_years', 'income_composition_of_resources', 'schooling'],
dtype='object')
|
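The cleaned output still contains 'thinness__1_19_years' with a double underscore, because each separator character is replaced one by one. As a hedged sketch of an alternative, the hypothetical helper `to_snake_case` below (not part of the original notebook) collapses any run of separators into a single underscore. The raw labels in the example are assumptions modeled on the column names printed above.

```python
import re


def to_snake_case(name):
    """Convert a raw column label to snake_case (hypothetical helper)."""
    # Trim the ends, lowercase, then turn every run of characters that are
    # not lowercase letters or digits into a single underscore.
    cleaned = re.sub(r'[^0-9a-z]+', '_', name.strip().lower())
    # Remove any underscore left over at either end.
    return cleaned.strip('_')


# Example labels (assumed to resemble the raw column names).
raw_names = [' thinness  1-19 years', ' HIV/AIDS', 'Life expectancy ', 'under-five deaths ']
clean_names = [to_snake_case(n) for n in raw_names]
print(clean_names)
```

Note that the rest of this notebook refers to 'thinness__1_19_years', so adopting such a helper would mean updating that name wherever it is used.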
Drop unnecessary columns
| df = df.drop(columns=['percentage_expenditure', 'total_expenditure', 'thinness__1_19_years',
'thinness_5_9_years', 'gdp', 'population'])
|
Deal with missing values
| # Looking for NaN values
leb_nan = df['life_expectancy'].isna()
df.loc[leb_nan]
|
index | country | year | status | life_expectancy | adult_mortality | infant_deaths | alcohol | hepatitis_b | measles | bmi | under_five_deaths | polio | diphtheria | hiv_aids | income_composition_of_resources | schooling |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
624 | Cook Islands | 2013 | Developing | NaN | NaN | 0 | 0.01 | 98.0 | 0 | 82.8 | 0 | 98.0 | 98.0 | 0.1 | NaN | NaN |
769 | Dominica | 2013 | Developing | NaN | NaN | 0 | 0.01 | 96.0 | 0 | 58.4 | 0 | 96.0 | 96.0 | 0.1 | 0.721 | 12.7 |
1650 | Marshall Islands | 2013 | Developing | NaN | NaN | 0 | 0.01 | 8.0 | 0 | 81.6 | 0 | 79.0 | 79.0 | 0.1 | NaN | 0.0 |
1715 | Monaco | 2013 | Developing | NaN | NaN | 0 | 0.01 | 99.0 | 0 | NaN | 0 | 99.0 | 99.0 | 0.1 | NaN | NaN |
1812 | Nauru | 2013 | Developing | NaN | NaN | 0 | 0.01 | 87.0 | 0 | 87.3 | 0 | 87.0 | 87.0 | 0.1 | NaN | 9.6 |
1909 | Niue | 2013 | Developing | NaN | NaN | 0 | 0.01 | 99.0 | 0 | 77.3 | 0 | 99.0 | 99.0 | 0.1 | NaN | NaN |
1958 | Palau | 2013 | Developing | NaN | NaN | 0 | NaN | 99.0 | 0 | 83.3 | 0 | 99.0 | 99.0 | 0.1 | 0.779 | 14.2 |
2167 | Saint Kitts and Nevis | 2013 | Developing | NaN | NaN | 0 | 8.54 | 97.0 | 0 | 5.2 | 0 | 96.0 | 96.0 | 0.1 | 0.749 | 13.4 |
2216 | San Marino | 2013 | Developing | NaN | NaN | 0 | 0.01 | 69.0 | 0 | NaN | 0 | 69.0 | 69.0 | 0.1 | NaN | 15.1 |
2713 | Tuvalu | 2013 | Developing | NaN | NaN | 0 | 0.01 | 9.0 | 0 | 79.3 | 0 | 9.0 | 9.0 | 0.1 | NaN | 0.0 |
We’ll delete all the rows with missing values in the Life Expectancy column, because this variable is crucial for the analysis.
| df = df.dropna(subset=['life_expectancy'])
|
For the rest of the columns, we’ll fill the NaN values by imputation, handled separately for each value of the Status parameter (whether a country is developed or developing according to the World Bank). The number of missing values remaining in each column:
| country 0
year 0
status 0
life_expectancy 0
adult_mortality 0
infant_deaths 0
alcohol 193
hepatitis_b 553
measles 0
bmi 32
under_five_deaths 0
polio 19
diphtheria 19
hiv_aids 0
income_composition_of_resources 160
schooling 160
dtype: int64
|
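The simplest way to fill gaps per Status group is to use each group's own mean. The sketch below shows that idea on a small made-up frame (the `toy` data and the `alcohol_filled` column are illustrative assumptions); the notebook's own code further down uses scikit-learn's IterativeImputer instead.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up).
toy = pd.DataFrame({
    'status': ['Developed', 'Developed', 'Developed',
               'Developing', 'Developing', 'Developing'],
    'alcohol': [10.0, np.nan, 12.0, 2.0, 4.0, np.nan],
})

# Mean of each status group, broadcast back to the original row order.
# NaN values are skipped when the mean is computed.
group_means = toy.groupby('status')['alcohol'].transform('mean')

# Fill only the gaps; observed values are left untouched.
toy['alcohol_filled'] = toy['alcohol'].fillna(group_means)
print(toy)
```

Group means are easy to reason about, but they ignore relationships between columns, which is what a model-based imputer tries to exploit.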
Create two datasets for the two types of nations (based on the Status column) to analyze them separately (and to concatenate them later).
| developed = df.loc[df.status == 'Developed']
developing = df.loc[df.status == 'Developing']
# In each dataset, select the columns that contain missing values
developed_miss = developed.loc[:, developed.isnull().any()]
developing_miss = developing.loc[:, developing.isnull().any()]
# Drop those columns from the main datasets for now
developed = developed.drop(columns=developed_miss.columns)
developing = developing.drop(columns=developing_miss.columns)
# Fill in the missing values with iterative (multivariate) imputation
from sklearn.experimental import enable_iterative_imputer  # required to enable IterativeImputer
from sklearn.impute import IterativeImputer
# max_iter is only an upper bound; the imputer stops earlier once it converges
imp = IterativeImputer(max_iter=len(developed_miss), random_state=0)
imp.fit(developed_miss)
developed_miss_filled = pd.DataFrame(np.round(imp.transform(developed_miss),1), columns=developed_miss.columns)
imp = IterativeImputer(max_iter=len(developing_miss), random_state=0)
imp.fit(developing_miss)
developing_miss_filled = pd.DataFrame(np.round(imp.transform(developing_miss),1), columns=developing_miss.columns)
# Just make sure the two yet-to-be-merged datasets have the same length
print(len(developed_miss_filled), len(developed))
print(len(developing_miss_filled), len(developing))
|
Concatenate/merge all four DataFrames
| # developed and developed_miss_filled
developed[developed_miss_filled.columns] = developed_miss_filled.values
# developing and developing_miss_filled
developing[developing_miss_filled.columns] = developing_miss_filled.values
# developed and developing
countries = pd.concat([developed, developing])
countries = countries.sort_values(by='country').reset_index(drop=True)
# Taking a look at the final dataset
countries.sample(10)
|
index | country | year | status | life_expectancy | adult_mortality | infant_deaths | measles | bmi | under_five_deaths | polio | diphtheria | hiv_aids | alcohol | hepatitis_b | income_composition_of_resources | schooling |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
846 | Equatorial Guinea | 2000 | Developing | 52.7 | 336.0 | 3 | 0 | 18.3 | 4 | 41.0 | 34.0 | 1.9 | 4.5 | 37.5 | 0.0 | 0.0 |
1776 | Myanmar | 2001 | Developing | 62.5 | 239.0 | 72 | 2519 | 14.1 | 98 | 77.0 | 73.0 | 0.4 | 0.4 | 71.9 | 0.4 | 7.6 |
1719 | Mongolia | 2009 | Developing | 66.9 | 235.0 | 1 | 8 | 45.9 | 2 | 96.0 | 95.0 | 0.1 | 4.6 | 97.0 | 0.7 | 13.8 |
2540 | Syrian Arab Republic | 2002 | Developing | 72.8 | 135.0 | 9 | 538 | 45.3 | 11 | 86.0 | 84.0 | 0.1 | 1.2 | 8.0 | 0.6 | 10.2 |
2013 | Peru | 2012 | Developing | 74.9 | 129.0 | 9 | 0 | 53.6 | 11 | 94.0 | 95.0 | 0.1 | 5.1 | 95.0 | 0.7 | 13.4 |
714 | Democratic People’s Republic of Korea | 2011 | Developing | 69.4 | 153.0 | 8 | 0 | 3.8 | 10 | 99.0 | 94.0 | 0.1 | 3.4 | 94.0 | 0.5 | 9.9 |
704 | Democratic People’s Republic of Korea | 2001 | Developing | 66.6 | 177.0 | 16 | 0 | 25.7 | 21 | 98.0 | 62.0 | 0.1 | 2.5 | 70.0 | 0.5 | 10.2 |
2160 | Saint Lucia | 2000 | Developing | 71.6 | 183.0 | 0 | 0 | 36.8 | 0 | 7.0 | 7.0 | 0.4 | 11.7 | 15.3 | 0.0 | 12.8 |
2028 | Philippines | 2012 | Developing | 68.1 | 217.0 | 56 | 1536 | 23.7 | 71 | 88.0 | 88.0 | 0.1 | 5.0 | 88.0 | 0.7 | 11.6 |
2760 | United Kingdom of Great Britain and Northern I… | 2006 | Developed | 79.3 | 82.0 | 4 | 764 | 61.3 | 4 | 92.0 | 92.0 | 0.1 | 11.6 | 87.4 | 0.8 | 15.8 |
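The merge above assigns `.values` rather than the DataFrame itself. A likely reason is that pandas aligns on the index during column assignment, and the imputed frames were rebuilt with a fresh 0..n-1 index. The sketch below illustrates the difference on made-up data; the names `kept`, `filled`, `aligned`, `positional`, and `relabeled` are hypothetical.

```python
import pandas as pd

# Toy frames (made-up values): `kept` retains its original row labels,
# while `filled` got a fresh 0..n-1 index when it was rebuilt.
kept = pd.DataFrame({'life_expectancy': [81.0, 79.5, 82.3]}, index=[16, 17, 18])
filled = pd.DataFrame({'alcohol': [9.9, 10.1, 10.4]})

# Plain column assignment aligns on the index, so no label matches here
# and the new column ends up entirely NaN.
aligned = kept.copy()
aligned['alcohol'] = filled['alcohol']

# Assigning the raw array ignores labels and relies on row order alone.
positional = kept.copy()
positional['alcohol'] = filled['alcohol'].values

# Alternative: give `filled` the same labels first, then join on them.
relabeled = filled.set_axis(kept.index, axis=0)
joined = kept.join(relabeled)
print(joined)
```

Positional assignment is only safe while both frames keep the same row order, which is why the length check before the merge is worth having.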
| # Making sure there are no NaN values left
countries.isna().sum()
|
| country 0
year 0
status 0
life_expectancy 0
adult_mortality 0
infant_deaths 0
measles 0
bmi 0
under_five_deaths 0
polio 0
diphtheria 0
hiv_aids 0
alcohol 0
hepatitis_b 0
income_composition_of_resources 0
schooling 0
dtype: int64
|
| # Convert status to a categorical type, then look at the data types
countries['status'] = pd.Categorical(countries['status'], list(countries['status'].unique()))
countries.dtypes
|
| country object
year int64
status category
life_expectancy float64
adult_mortality float64
infant_deaths int64
measles int64
bmi float64
under_five_deaths int64
polio float64
diphtheria float64
hiv_aids float64
alcohol float64
hepatitis_b float64
income_composition_of_resources float64
schooling float64
dtype: object
|
Analysis
Change in LEB from 2000 to 2015 in developing and developed nations.
| years = list(countries.year.unique())
leb_by_year_developed = developed.groupby(['year']).life_expectancy.mean()
leb_by_year_developing = developing.groupby(['year']).life_expectancy.mean()
plt.plot(leb_by_year_developed, label='Developed')
plt.plot(leb_by_year_developing, label='Developing')
plt.legend(loc=4)
plt.xticks(years, rotation=45)
plt.title('Life Expectancy for Developed and Developing Countries (2000-2015)', fontsize=14)
plt.ylabel('Life Expectancy at Birth')
plt.show()
|
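The same yearly averages can also be computed from a single table by grouping on both year and status. This is only a sketch on made-up numbers (the `toy` frame and `leb_by_year` are illustrative), not the code used for the plot above.

```python
import pandas as pd

# Toy data with the same column names as the cleaned table (values are made up).
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2015, 2015, 2015],
    'status': ['Developed', 'Developing', 'Developing',
               'Developed', 'Developing', 'Developing'],
    'life_expectancy': [76.0, 60.0, 64.0, 80.0, 68.0, 70.0],
})

# Mean life expectancy per (year, status), reshaped so that each status
# becomes a column and each year a row.
leb_by_year = (
    toy.groupby(['year', 'status'])['life_expectancy']
    .mean()
    .unstack('status')
)
print(leb_by_year)
```

Calling `leb_by_year.plot()` on such a table draws one line per status, which avoids keeping two separate DataFrames around just for plotting.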
Correlation of life expectancy with other variables.
| def led_stats_corr(i, y_col):
    plt.subplot(2, 2, i)
    ax = sns.scatterplot(data = countries, x = 'life_expectancy', y = y_col, hue = 'status', s = 20)
    y_string = y_col.replace('_', ' ').title()
    plt.title(f'Life Expectancy vs {y_string}', fontsize=14)
    return ax
plt.rcParams['figure.figsize'] = [11, 7]
# Scatterplot life_expectancy and bmi
led_stats_corr(1, 'bmi')
# Scatterplot life_expectancy and alcohol
led_stats_corr(2, 'alcohol')
# Scatterplot life_expectancy and income_composition_of_resources
led_stats_corr(3, 'income_composition_of_resources')
# Scatterplot life expectancy and schooling
led_stats_corr(4, 'schooling')
plt.subplots_adjust(wspace=0.25, hspace=0.25, top=1.3, bottom=0.2)
plt.show()
|
Based on these graphs, we see a positive correlation of life expectancy with Income Composition of Resources and Schooling.
There is a weaker positive correlation with alcohol and BMI, but it doesn’t mean that people who consume more alcohol or have a higher BMI tend to live longer in general.
| # Calculating Pearson correlation for these variables
cols_for_corr = ['life_expectancy', 'bmi', 'alcohol', 'income_composition_of_resources', 'schooling']
countries[cols_for_corr].corr().iloc[0]
|
| life_expectancy 1.000000
bmi 0.576485
alcohol 0.419462
income_composition_of_resources 0.722415
schooling 0.754713
Name: life_expectancy, dtype: float64
|
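One reason a pooled correlation should be read with care is that it can be driven by differences between groups rather than by a relationship within them. The sketch below uses deliberately constructed, made-up numbers (not values from the dataset) in which the pooled Pearson correlation is strongly positive while the correlation inside each status group is negative.

```python
import pandas as pd

# Constructed toy data: developed countries have both higher life expectancy
# and higher alcohol consumption, but within each group the two move in
# opposite directions.
toy = pd.DataFrame({
    'status': ['Developing'] * 4 + ['Developed'] * 4,
    'life_expectancy': [50, 52, 54, 56, 78, 80, 82, 84],
    'alcohol': [3, 2, 3, 2, 10, 9, 10, 9],
})

# Pearson correlation over all rows together.
pooled = toy['life_expectancy'].corr(toy['alcohol'])

# Pearson correlation computed separately inside each status group.
within = {
    status: group['life_expectancy'].corr(group['alcohol'])
    for status, group in toy.groupby('status')
}

print(f'pooled: {pooled:.2f}')
for status, r in within.items():
    print(f'{status}: {r:.2f}')
```

Whether the real data behaves this way is not shown here; the point is only that checking the correlation per status is a cheap way to see how much of the pooled figure comes from the developed/developing split.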
Distribution of other variables every year
Here, we’ll build a function to display the distribution of data for a given year.
Because some statistics differ hugely between developed and developing countries, a log transformation can optionally be applied.
| def dist_by_year(year, col_list, log_needed):
    """
    year - plot data for a given year (int)
    col_list - list of columns to plot (list of strings)
    log_needed - list of Boolean values; every value's index corresponds to col_list index
    """
    this_year = countries[countries['year'] == year]
    subplot_num = len(col_list)
    for i in range(1, subplot_num+1):
        x_col_string = col_list[i-1].replace('_', ' ').title()
        if not log_needed[i-1]:
            sns.displot(x=this_year[col_list[i-1]], hue=this_year['status'], kind="kde")
            plt.title(f"{x_col_string} Rate for Developing and Developed Countries in {year}")
            plt.xlabel(f"{x_col_string} Rate")
        else:
            sns.displot(x=np.log(this_year[col_list[i-1]]), hue=this_year.status, kind="kde")
            plt.title(f"{x_col_string} Rate for Developing and Developed Countries in {year} – log scaled")
            plt.xlabel(f"{x_col_string} Rate (log scaled)")
        plt.ylabel("")
        plt.show()
        plt.clf()
|
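One caveat with a plain log transform is that columns such as infant_deaths contain zeros (see the sample rows above), and the logarithm of zero is negative infinity. A common workaround, shown here as a sketch on made-up numbers rather than as the notebook's method, is `np.log1p`, which computes log(1 + x) and maps zero to zero.

```python
import numpy as np

# Made-up counts that include a zero, as the infant_deaths column does.
infant_deaths = np.array([0, 1, 9, 72, 1500], dtype=float)

# Plain log: the zero becomes -inf (the warning is silenced for the demo).
with np.errstate(divide='ignore'):
    plain_log = np.log(infant_deaths)

# log1p computes log(1 + x), so zero stays finite and maps to 0.
shifted_log = np.log1p(infant_deaths)

print(plain_log)
print(shifted_log)
```

Swapping `np.log` for `np.log1p` inside `dist_by_year` would change the x-axis values slightly, so the axis label would need to say so.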
Test the function
| years = [2000, 2015]
for year in years:
    dist_by_year(year, ['adult_mortality', 'infant_deaths', 'under_five_deaths'], [False, True, True])
|
Conclusion
What was done throughout the process.
- Fixed names of the columns.
| df.columns = [x.strip() for x in df.columns]
df.columns = [x.lower() for x in df.columns]
df.columns = [x.replace(' ', '_').replace('-', '_').replace('/', '_') for x in df.columns]
|
- I selected columns with missing values.
| developed_miss = developed.loc[:, developed.isnull().any()]
|
- First, I displayed the number of missing values in every column.
- Then I split the dataset in two in order to fill the NaN values with the most appropriate values (because every medical parameter differs drastically between developed and developing countries).
| developed = df.loc[df.status == 'Developed']
|
- And then split each by two again to provide the IterativeImputer from the sklearn library only with the columns with missing values.
| developed_miss = developed.loc[:, developed.isnull().any()]
|
- ‘Concatenated’ two DataFrames with different columns and indices but the same length, by adding the columns from one DataFrame to the other with a simple assignment.
| developed[developed_miss_filled.columns] = developed_miss_filled.values
|
Then I concatenated the developed and developing DataFrames and got the original table back, but with the missing values filled in, without the dropped columns (such as GDP), and without the rows with NaN in the Life Expectancy column (due to its importance for the analysis).
| countries = pd.concat([developed, developing])
countries = countries.sort_values(by='country').reset_index(drop=True)
|
I made a few graphs on life expectancy growth in 2000-2015 and on the correlation of life expectancy with alcohol consumption, BMI, years of schooling, and income composition of resources. All these parameters correlate positively with life expectancy, though this says nothing about causation.
- Finally, I showed the distribution of other statistics. First, I tested the function with the mortality variables (adults, infants, and children under five) for the first and the last years in the data set. In order to display infant and child mortality on the graph, I applied a logarithmic transformation.
Second, I tried out the function with the diseases which still affect children all over the world (measles, polio, diphtheria, hepatitis B).
For steps 7 and 8, I wrote custom functions to visualize all the required data efficiently.
The first function takes in a counter for subplots and a column name to build a scatterplot.
The second function takes in a year, a list of columns, and a list of Booleans of the same length that tells whether a log transformation should be applied to each plot.