Home U.S. Cities (Data Exploration)
Post
Cancel

U.S. Cities (Data Exploration)

Introduction

This is the second part of the U.S. Cities project. In the first part, I created a dataset of metropolitan areas in the United States (called “cities” for simplicity). The dataset contains 14 variables. I will explore the cities data through visualization and testing, and draw some conclusions.

Loading Data

1
2
3
4
5
6
7
# Import libraries used in this project
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
import plotly.express as px
import scikit_posthocs as sp
1
2
3
# Load the US Cities dataset
cities = pd.read_csv('us_cities.csv')
cities.head(10)
CityStateRegionSizePopulationAvgRentMedianRentUnempRateAvgIncomeCostOfLivingPriceParityCommuteTimeMedianAQIWalkScoreBikeScoreTransitScoreLatitudeLongitude
0New YorkNew YorkNortheastLarge20140470.032722323.03.885136.0128.0114.5836.750.088.069.36.940.6943-73.9249
1Los AngelesCaliforniaWestLarge13200998.028571925.03.975821.0140.6113.8230.770.068.658.76.234.1141-118.4068
2ChicagoIllinoisMidwestLarge9618502.019751364.04.271992.0100.1105.4231.850.077.272.25.141.8375-87.6866
3DallasTexasSouthLarge7637387.017541440.03.266727.098.5103.8528.651.046.049.32.832.7935-96.7667
4HoustonTexasSouthLarge7122240.016201216.03.964837.095.899.7430.057.047.548.62.829.7860-95.3885
5WashingtonDistrict of ColumbiaSouthLarge6385162.024211740.02.880822.0120.1111.3434.845.076.769.55.538.9047-77.0163
6PhiladelphiaPennsylvaniaNortheastLarge6245051.016631314.03.472379.0103.499.2129.848.074.866.75.340.0077-75.1339
7MiamiFloridaSouthLarge6138333.032011670.01.973522.0110.1109.9229.645.076.664.05.225.7840-80.2101
8AtlantaGeorgiaSouthLarge6089815.019911489.02.663219.0100.399.1232.148.047.741.72.533.7628-84.4220
9BostonMassachusettsNortheastLarge4941632.034192368.02.992290.0132.6109.6931.244.082.869.45.042.3188-71.0852

Now we have all variables of the same data type and all rows with at least half of the values not missing. For the rest of this project, I will cover only a fraction of what can be done, as the possibilities for visualizing this data are endless. I will focus on some of the variables that I found most interesting.

Map

Display our final data on the map with Plotly Express. Note

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
fig = px.scatter_geo(cities,
                    lat='Latitude',
                    lon='Longitude',
                    size='Population',
                    color='Size',
                    hover_data=cities.columns[:13],
                    size_max=30,
                    width=1000,
                    height=600)


fig.update_layout(
    title_text = 'US Cities',
    geo_scope='usa',
)
fig.show()

Map

All Variables Heat Map

First, look at the correlation between all variables in the data set with a heat map.

1
2
3
4
5
6
7
8
9
# Full names of variables for better readability
columns_full_names = ['Population', 'Average Rent', 'Median Rent', 'Unemployment Rate', 'Average Income', 
                      'Cost of Living', 'Price Parity', 'Commute Time', 'Median AQI', 'Walk Score', 'Bike Score', 'Transit Score']

# Set theme for all the future graphs
sns.set_theme(style="white", rc={'figure.figsize':(9,6)})

sns.heatmap(cities.corr(), cmap="coolwarm", annot=True, vmin=-1, vmax=1, annot_kws={"size": 9}, xticklabels=columns_full_names, yticklabels=columns_full_names)
plt.show()

Heatmap

From this map, we can see that:

  • There are a number of fairly obvious observations. High cost of living is accompanied by high rental prices and high incomes (and vice versa). It makes sense since rental prices contribute a lot to cost of living. Also, cities with good walkability are good at public transportation and bike construction.
  • Another obvious observation is that cities with large populations are, on average, more expensive, have higher incomes, and have worse air quality.
  • A correlation between air quality (note that this is the only metric where the higher means the worse) and commute time can indicate either that air quality is affected by traffic congestion and transportation emissions, or that both parameters are caused by high population.
  • The unemployment rate has a near-zero correlation with all other variables. However, the only significant correlation is with average income, meaning that higher unemployment rates tend to be associated with lower average incomes.
  • The median rent has a stronger correlation with other factors than the average rent, meaning it could be a more important factor in determining a city’s livability.

Summary Statistics by Size

Use the df.describe() method to get summary statistics about the DataFrame. But to better understand the situation, it’s better to divide our table into three by metropolitan area size.

1
2
3
# Summary statistics for large cities
large_cities = cities.loc[cities.Size == 'Large']
large_cities.describe()
PopulationAvgRentMedianRentUnempRateAvgIncomeCostOfLivingPriceParityCommuteTimeMedianAQIWalkScoreBikeScoreTransitScoreLatitudeLongitude
count5.600000e+0156.00000056.00000055.00000056.0000055.00000056.00000056.00000056.00000051.00000051.00000056.00000056.00000056.000000
mean3.377466e+061833.9642861376.8035713.04000067278.37500105.452727100.76089326.88392948.32142952.92941254.5705883.80178636.992770-93.794825
std3.296962e+06608.939554467.8294720.78377115070.6734118.6444507.3284373.4947649.30905216.64233512.7573551.5144755.28125518.151173
min1.008654e+061130.000000830.0000001.90000050384.0000086.70000090.62000021.20000029.00000025.60000030.7000000.10000021.329400-157.846000
25%1.394931e+061382.0000001020.5000002.55000058955.2500093.45000095.43500024.47500044.00000041.15000043.9000002.60000033.715150-106.376525
50%2.274416e+061717.0000001287.5000002.90000064239.00000100.10000098.85000026.45000046.00000049.50000054.0000003.60000037.642650-87.241850
75%4.112082e+061979.0000001533.7500003.40000070710.50000108.550000105.38250029.00000051.00000065.70000064.8500005.02500040.715125-80.675250
max2.014047e+073422.0000003000.0000006.200000136338.00000178.600000119.83000036.70000084.00000088.70000083.5000006.90000047.621100-71.085200
1
2
3
# Summary statistics for mid-sized cities
mid_cities = cities.loc[cities.Size == 'Mid-Sized']
mid_cities.describe()
PopulationAvgRentMedianRentUnempRateAvgIncomeCostOfLivingPriceParityCommuteTimeMedianAQIWalkScoreBikeScoreTransitScoreLatitudeLongitude
count129.000000129.000000129.000000125.000000129.000000122.000000129.000000126.000000129.00000023.00000023.000000129.000000129.000000129.000000
mean503160.4728681521.2480621068.0310083.44640057256.81395397.60737795.56658923.33492141.19379838.36521745.0478262.33488437.386640-92.338268
std206013.220412479.419620342.1032361.32929412252.26604313.0528375.5341282.8941416.7881767.5539048.9688281.1244905.66697816.017321
min256728.000000781.000000663.0000001.60000034503.00000083.90000085.88000017.30000018.00000021.40000029.7000000.40000025.997500-149.109100
25%328883.0000001217.000000854.0000002.50000050616.00000088.62500091.87000021.42500039.00000033.25000040.3000001.50000033.364500-98.246700
50%433353.0000001420.000000959.0000003.10000054623.00000093.25000093.96000022.90000042.00000039.10000043.7000002.10000037.689500-87.191100
75%649903.0000001784.0000001188.0000003.90000061547.000000102.07500098.96000024.80000044.00000043.90000051.9000003.00000041.311300-81.322300
max978529.0000003350.0000002602.0000008.800000127391.000000163.900000111.81000033.30000067.00000049.70000065.5000005.10000061.150800-70.271500
1
2
3
# Summary statistics for small cities
small_cities = cities.loc[cities.Size == 'Small']
small_cities.describe()
PopulationAvgRentMedianRentUnempRateAvgIncomeCostOfLivingPriceParityCommuteTimeMedianAQIWalkScoreBikeScoreTransitScoreLatitudeLongitude
count159.000000159.000000153.000000155.000000159.000000158.000000158.000000107.00000117.0000000.00.0159.000000159.000000159.000000
mean152830.0314471350.805031903.1830073.32903253693.76100693.77278592.70658221.3121537.324786NaNNaN1.50880538.796727-93.214609
std44979.209688405.689107192.7093341.3688079761.3483628.8040794.6951043.610128.521604NaNNaN1.1040425.43836314.843885
min58639.000000657.000000648.0000001.70000041012.00000082.60000085.48000015.7000010.000000NaNNaN0.00000026.894100-147.653300
25%121836.5000001057.000000772.0000002.40000047868.00000088.02500089.23250019.0000036.000000NaNNaN0.70000034.673200-100.609900
50%150309.0000001293.000000852.0000003.10000051965.00000091.05000092.02500020.7000039.000000NaNNaN1.40000039.188600-88.970300
75%181630.0000001590.0000001011.0000003.85000056899.00000097.27500094.87500023.4500042.000000NaNNaN2.30000042.471750-82.215950
max249843.0000002898.0000001959.00000013.700000125455.000000149.600000112.07000039.4000067.000000NaNNaN4.70000064.835300-68.790600

From tables above, we can see that regional price parity in cities with population more than one million is almost equal to the national average of 100. Meanwhile, living cost in mid-sized and small cities is 95.6% and 92.7% of the U.S. average.
Just as we saw on the heat map, larger cities have worse air quality and longer commutes, but have higher incomes and better transportation.

Commute Time vs Transit Score by Size

The heatmap shows some correlation (0.3) between the median commute time and the transit score. It doesn’t necessarily means that the better the public transportation in a city, the longer it takes to get to work, but this correlation seems counterintuitive. Let’s explore it further by plotting it and using the price parity and size as third and forth variables.
The most likely explanation is that both longer commutes and good public transportation are attributes of large cities with high costs of living.

1
2
3
4
5
6
7
palette = sns.color_palette("coolwarm", as_cmap=True)
sns.scatterplot(data=cities, x='CommuteTime', y='TransitScore', hue='PriceParity', palette=palette, 
                size='Size', size_order=['Large', 'Mid-Sized', 'Small'], sizes=[80, 40, 20])
plt.title('Median Commute Time vs Transit Score', fontsize=16)
plt.xlabel('Median Travel Time to Work')
plt.ylabel('Transit Score')
plt.show()

Commute Time vs Transit Score by Size

The larger and more expensive cities occupy the upper right corner of the graph, while smaller and cheaper cities are associated with poorer public transportation and shorter commutes.

Average Income and Rent by Size and Region

1
2
3
4
5
6
7
ax = sns.scatterplot(data=cities, x='AvgIncome', y='AvgRent', hue='Region', 
                size='Size', size_order=['Large', 'Mid-Sized', 'Small'], sizes=[80, 40, 20])
ax.set_xlim(left=25000)
plt.title('Average Incomes vs Average Rental Prices', fontsize=16)
plt.xlabel('Average Income')
plt.ylabel('Average Rental Price')
plt.show()

Average Income and Rent by Size and Region

From the graph, we can see that the South has lower incomes than the Midwest (with a number of outliers), but more expensive rents. The western cities have higher rents, but also higher incomes. Let’s run the Tukey’s test to see if the average incomes and rental prices are significantly different between the US regions.

  • A value of 0 indicates that there is no significant difference in average income between the corresponding regions.
  • A value of 1 indicates that there is a significant difference in average income between the corresponding regions.
1
2
# Differences in the means of average incomes
sp.posthoc_tukey_hsd(x=cities.AvgIncome, g=cities.Region)
MidwestNortheastSouthWest
Midwest1100
Northeast1110
South0111
West0011

The table shows some interesting results. There’s no significant difference in average incomes in the Midwest-South and Midwest-West pairs, but there is a difference in West-South. The Northeast and South incomes are the most “different” from the rest of the group. From the scatterplot, we can conclude that the Northeast has the highest, while the South has the lowest. The Western cities are second in terms of income, and the Midwest is third.

Let’s check this assumption by looking at the actual means.

1
2
3
4
# Income means across regions
region_list = ['Midwest', 'Northeast', 'South', 'West']
for region in region_list:
    print(f"Average income in {region} is {cities[cities['Region'] == region]['AvgIncome'].mean()}")
1
2
3
4
Average income in Midwest is 56058.39759036145
Average income in Northeast is 62188.065217391304
Average income in South is 54122.794117647056
Average income in West is 60972.50632911392
1
2
# Differences in means of average rental prices
sp.posthoc_tukey_hsd(x=cities.AvgRent, g=cities.Region)
MidwestNortheastSouthWest
Midwest1111
Northeast1101
South1011
West1111
1
2
for region in ['Midwest', 'Northeast', 'South', 'West']:
    print(f"Average rental in {region} is {cities[cities['Region'] == region]['AvgRent'].mean()}")
1
2
3
4
Average rental in Midwest is 1135.3975903614457
Average rental in Northeast is 1548.5652173913043
Average rental in South is 1476.2426470588234
Average rental in West is 1866.8354430379748

The only pair with no significant difference is Northeast-South. From these two tables and the scatterplot, we can see that a region with the highest rents is not the region with the highest incomes, and the region with the lowest rents is not the region with the lowest incomes. This could be explained by outliers in one of these two variables (for example, a city in the West with very high rents but moderate incomes).

Median Rental Prices by Region and Size

Explore the rental prices by region further with regard to the size.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
fig, axes = plt.subplots(1, 4, figsize=(16, 6))

sns.set_palette("Set3")
ax1, ax2, ax3, ax4 = axes
ylim = (600, 2800)
fontsize = 12

for ax, region in zip(axes, region_list):
    sns.boxplot(y='MedianRent', x='Size', data=cities[cities['Region'] == region], ax=ax,
                showfliers=False, palette="Set3")
    
    ax.set_ylim(ylim)
    ax.set_title(f"Median Rental Price in {region}", fontsize=fontsize)
    ax.xaxis.set_label_position('top')
    ax.set_xlabel('')
    ax.set_ylabel('')
    
    # Adjust tick label font size and add grid lines
    ax.tick_params(axis='both', which='major', labelsize=fontsize)
    ax.grid(True, linestyle='--', axis='y')

plt.tight_layout()
plt.show()

Median Rental Prices by Region and Size

Large cities are more expensive to rent, but there are important differences between regions. For example, large cities in the Midwest are still cheaper than small cities in the Northeast. Midsize Western cities can be as expensive or more than large Southern and Northeastern cities.

Unemployment Rate by Region and Size

Cities in the West have the highest unemployment rate, while the Northeast is the only region where small cities have a higher unemployment rate than medium-sized cities.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
fig, ax = plt.subplots()
sns.set_palette("Set2")

with sns.axes_style("whitegrid"):
    sns.barplot(y='UnempRate', x='Region', hue='Size', data=cities)
    ax.set_ylabel('Unemployment Rate')

    # To show percetage
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.grid(True, linestyle='-', axis='y')

plt.title('Unemployment by Region and City Size', fontsize=14)
plt.show()

Unemployment by Region and Size

Commute Times by Region and Size

The bigger a city, the longer it takes to get to work. But there are some differences between regions.

1
2
3
4
5
6
7
8
9
10
11
12
fig, ax = plt.subplots(figsize=(12, 7))

sns.boxplot(y='CommuteTime', x='Size', hue='Region', data=cities, showfliers=False, ax=ax)

ax.set_ylabel('Median Commute Time')
ax.set_title('Median Commute Time in US Cities by City Size', fontsize=14)

ax.legend(loc='upper right')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

Median Commute Time by Region and Size

Median AQI by Size

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
fig, ax = plt.subplots(figsize=(8, 6))
sns.set_palette("Set1")

sns.kdeplot(data=small_cities['MedianAQI'], shade=True, label='Small Cities')
sns.kdeplot(data=mid_cities['MedianAQI'], shade=True, label='Mid-Sized Cities')
sns.kdeplot(data=large_cities['MedianAQI'], shade=True, label='Large Cities')

plt.xlabel('Air Quality Index')
plt.ylabel('')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('Air Quality in United States Cities by City Size')
plt.legend()

plt.show()

Air Quality Index Distribution Plot

Obviously, the smaller the city, the cleaner the air. The index from 0 to 50 is considered good, from 51 to 100 – moderate. In the long run, there is no city with unhealthy air. Still, some large cities are in the upper half of what is considered moderate. No large city has an AQI below 20, which makes sense since “large cities” in the US are really large.

1
cities[cities['MedianAQI'] >= 75][['City', 'Population', 'MedianAQI']]
CityPopulationMedianAQI
10Phoenix4845832.077.0
12Riverside4599839.084.0

Conclusion

After collecting and cleaning the data, I used exploratory data analysis. There are many ways to explore this data, and I chose to do the following:

  • Map of cities by size with all variable information in the toolkit;
  • Heatmap to see the correlation between all variables and summary statistics tables;
  • Scatterplot of correlation between commute times and transit score (by size);
  • Scatterplot of correlation between average incomes and rents (by size and region);
  • Boxplot of median rental prices (by region and size);
  • Bar chart of unemployment rates (by region and size);
  • Box plot of commute times (by region and size);
  • Distribution plot of median air quality index (by size).

Some of the things we found during the analysis:

  • The median rental price can be a better predictor of a city’s livability than the average rental price – because it has a stronger correlation with other variables;
  • The data confirms all the chronic problems of metro areas with large populations: poor air quality, longer commutes, high cost of living. But they have better public transportation and are more pedestrian friendly;
  • Rents in the South are not significantly lower than in the Northeast, but incomes are much lower. Conversely, the Midwest has the same incomes as the South, but rents are much lower.
  • Despite lower incomes, cities in the Midwest and South have lower unemployment rates;
  • The Midwest also has the shortest commute times of any size category.
This post is licensed under CC BY 4.0 by the author.