
Personal Spending

Intro

I take a look at the way I spend money and draw some insights from it. This could help me handle my finances in a more deliberate way.

Data

The data comes from my personal expense tracking: I manually added each purchase to a Google Sheets table between July 30 and September 27, 2022.

Spending is divided into three categories: Meals (what I bought for lunches and dinners), Other (other necessary purchases), and Extra (all non-essential items such as coffee or desserts). After every week in the Date column there are Spent/Avg and Rest/Error rows, which show the weekly total and average spend, how much money remains from the amount allocated for that period, and how much money is missing from the calculations.
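For illustration, a simplified fragment of the sheet (showing only the Meals columns, with values taken from the first rows of the data below) looks roughly like this:

Date          Main Meals       Total Price
2022-07-30    Bread, Milk, …   30+60+…
2022-07-31    Yogurt, Bread    98+76
Spent / Avg   (weekly total / average)
Rest / Error  (leftover from the allocation / unaccounted amount)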

I deliberately kept this table in a format that is convenient for Excel, not for Python. Because of that, it requires a lot of pre-processing before it is suitable for analysis with pandas.

Steps

In this project I will:

  1. Clean the data.
  2. Summarize my spending for two months.
  3. Question: Which items did I spend the most on?
  4. Question: How do the three categories (Meals, Other, Extra) differ?
  5. Break down the Extra category and compare it to the other two.

Explore Data

Import libraries and load the dataset

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
# Keep the header row and skip rows 1-138 (the entries before July 30)
detailed = pd.read_excel('spending.xlsx', sheet_name='Spent (Apr 1 - Sep 23) +', header=0, skiprows=lambda x: x in range(1, 139))
detailed.head()
| | Date | Out | - | + | Main Meals | Est. Price | Total Price | Other | Est. Price.1 | Total Price.1 | Extra | Est. Price.2 | Total Price.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-07-30 00:00:00 | 30.0 | NaN | NaN | Bread, Milk, Paper, Coffee, Butter, Chocolate | 1100.0 | 30+60+250+630+170+138 | Napkins | 30.0 | 30 | Junk food | 0.0 | 300 |
| 1 | 2022-07-31 00:00:00 | 245.0 | NaN | NaN | Yogurt, Bread | 200.0 | 98+76 | Toothpaste | 250.0 | 245 | - | 0.0 | 0 |
| 2 | 2022-08-01 00:00:00 | NaN | NaN | NaN | Pasta, tomatoes, mushrooms, spinach, cheese | 635.0 | 80+100+96+155+204 | Eggs, Crispbread, Bread | 352.0 | 145+76+76+16 | Coffee with cherry pies | 1000.0 | 890 |
| 3 | 2022-08-02 00:00:00 | NaN | NaN | NaN | Beans, quinoa | 150.0 | 40+90 | Syrnyky, Milk, Tea, Apples | 300.0 | 82 + 80 + 45 + 43 | - | 0.0 | 0 |
| 4 | 2022-08-03 00:00:00 | NaN | 7.0 | NaN | Pizza | 600.0 | 295 | Train tickets, Fare | 2000.0 | 1748+680 | - | 0.0 | 0 |
detailed.shape
(78, 13)
detailed.dtypes
Date              object
Out              float64
-                float64
+                float64
Main Meals        object
Est. Price       float64
Total Price       object
Other             object
Est. Price.1     float64
Total Price.1     object
Extra             object
Est. Price.2     float64
Total Price.2     object
dtype: object
detailed['Date'].iloc[0], detailed['Date'].iloc[-3]
(datetime.datetime(2022, 7, 30, 0, 0), datetime.datetime(2022, 9, 27, 0, 0))

The DataFrame spans the period from July 30 to September 27, 2022. The data types are mostly correct, but because the Date column mixes datetime and string values it is read as the object dtype.
About half of the columns only make sense for the analysis in Excel; for the purposes of this project, they need to be removed.
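As a quick check (a sketch, not part of the original notebook), the mix of types in the Date column can be counted directly:

# Count the Python types stored in the object-dtype Date column
detailed['Date'].map(type).value_counts()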

Process Data

Clean the DataFrame

detailed.columns
Index(['Date', 'Out', '-', '+', 'Main Meals', 'Est. Price', 'Total Price',
       'Other', 'Est. Price.1', 'Total Price.1', 'Extra', 'Est. Price.2',
       'Total Price.2'],
      dtype='object')
# Remove unnecessary columns
detailed = detailed.drop(columns=['Out', '-', '+', 'Est. Price', 'Est. Price.1', 'Est. Price.2'])
# Drop weekly aggregates
not_dates = detailed.loc[(detailed['Date'] == 'Spent / Avg') | (detailed['Date'] == 'Rest / Error')]
detailed = detailed.drop(index=not_dates.index)
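With the aggregate rows gone, only real dates remain in the Date column, so the dtype can also be made explicit right away (a sketch under that assumption; the original notebook leaves the cast implicit):

# Optional: cast Date to datetime64 now that only datetimes are left
detailed['Date'] = pd.to_datetime(detailed['Date'])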

Reshape the DataFrame

This involves a few steps:

  1. Find mismatches between the items and prices columns (rows where there are more or fewer items than prices) and fix them.
  2. Break the original table into prices and items tables by unpivoting. The desired long format is sketched right after this list.
  3. Split each column by its delimiter, using multi-indexing and the stack() method, so that every item and price gets its own row.
  4. Merge the new tables back.
  • Wrap step 3 in a function so it can be reused for both tables.
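For illustration (my assumption of what the desired output should look like, based on the melt calls in Step 2), the two unpivoted tables are shaped roughly like this:

# detailed_cat                             # detailed_price
# Date        Category    Item             # Date        Which Price  Price
# 2022-07-30  Main Meals  Bread, Milk, …   # 2022-07-30  Total Price  30+60+…
# 2022-07-31  Main Meals  Yogurt, Bread    # 2022-07-31  Total Price  98+76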

Step 1. After running the total_df_reshape function (provided below), it turned out that the category and price DataFrames have unequal numbers of rows: 473 prices versus 466 items. A possible reason is that some prices were entered into the original Excel table without a corresponding item.
To fix that, iterate through the detailed table and count the number of items and prices in each row. If there is a discrepancy, print the index of that row so the possible omission can be inspected.

# Count the items / prices in the given columns of a row and report if the counts are unequal
def find_mismatch_ind(row, ind, col_item, col_price, delimiter_item=',', delimiter_price='+'):
    count_items = 1 + str(row[col_item]).count(delimiter_item)
    count_prices = 1 + str(row[col_price]).count(delimiter_price)

    if count_items != count_prices:
        print(f'Row {ind}, {col_item} – {count_items} items and {count_prices} prices!')
        ind_list.append(ind)  # ind_list is defined below, before the loop
# It's possible to iterate over a list of columns instead of typing the names manually,
# but I consider that an unnecessary complication.
ind_list = []
for ind, row in detailed.iterrows():
    find_mismatch_ind(row, ind, 'Main Meals', 'Total Price')
    find_mismatch_ind(row, ind, 'Other', 'Total Price.1')
    find_mismatch_ind(row, ind, 'Extra', 'Total Price.2')
Row 2, Other – 3 items and 4 prices!
Row 6, Other – 4 items and 5 prices!
Row 6, Extra – 4 items and 1 prices!
Row 13, Other – 3 items and 4 prices!
Row 22, Extra – 3 items and 5 prices!
Row 49, Main Meals – 2 items and 1 prices!
Row 49, Other – 2 items and 3 prices!
Row 50, Other – 7 items and 8 prices!
Row 56, Main Meals – 1 items and 3 prices!
Row 56, Other – 2 items and 3 prices!
Row 59, Other – 2 items and 3 prices!
Row 59, Extra – 4 items and 5 prices!
Row 68, Other – 3 items and 2 prices!
detailed.loc[50]
Date                                           2022-09-08 00:00:00
Main Meals                      Potato, mussels, green bean, bread
Total Price                                          94+359+105+43
Other            Napkins, fruit, oranges, pads, choc, butter, b...
Total Price.1                         140+96+119+140+180+80+150+56
Extra                                                 Coffee, cake
Total Price.2                                              280+210
Name: 50, dtype: object
# Take a closer look at these rows
detailed.loc[ind_list]
| | Date | Main Meals | Total Price | Other | Total Price.1 | Extra | Total Price.2 |
|---|---|---|---|---|---|---|---|
| 2 | 2022-08-01 00:00:00 | Pasta, tomatoes, mushrooms, spinach, cheese | 80+100+96+155+204 | Eggs, Crispbread, Bread | 145+76+76+16 | Coffee with cherry pies | 890 |
| 6 | 2022-08-05 00:00:00 | Mash, broccoli, soy meatballs | 62+155+274 | Crispbread, Tomatoes, Cucumbers, Pepper | 76+80+51+26+8 | Almond, bananas, peaches, yogurt | 527 |
| 6 | 2022-08-05 00:00:00 | Mash, broccoli, soy meatballs | 62+155+274 | Crispbread, Tomatoes, Cucumbers, Pepper | 76+80+51+26+8 | Almond, bananas, peaches, yogurt | 527 |
| 13 | 2022-08-10 00:00:00 | Wok, chickpeas | 350+134 | Coffee; banana, milk, bread | 640+80+73+36 | Lichen | 250 |
| 22 | 2022-08-17 00:00:00 | - | 0 | Bananas, butter, eggs, spice, flour, Nutella | 65+150+110+31+80+380 | Gum, taxis, beer | 30+221+65+287+147 |
| 49 | 2022-09-07 00:00:00 | Cucumbers, cola | 260 | Breakfast, coffee; fare | 350+440+180 | - | 0 |
| 49 | 2022-09-07 00:00:00 | Cucumbers, cola | 260 | Breakfast, coffee; fare | 350+440+180 | - | 0 |
| 50 | 2022-09-08 00:00:00 | Potato, mussels, green bean, bread | 94+359+105+43 | Napkins, fruit, oranges, pads, choc, butter, b… | 140+96+119+140+180+80+150+56 | Coffee, cake | 280+210 |
| 56 | 2022-09-12 00:00:00 | Burrito | 210+250+200 | Fare, bread | 180+48+160 | - | 0 |
| 56 | 2022-09-12 00:00:00 | Burrito | 210+250+200 | Fare, bread | 180+48+160 | - | 0 |
| 59 | 2022-09-15 00:00:00 | Lunch | 284 | Coffee, cake; fare | 140+170+350 | Museum, cake; rolls, yogurt, juice | 0+190+180+105+126 |
| 59 | 2022-09-15 00:00:00 | Lunch | 284 | Coffee, cake; fare | 140+170+350 | Museum, cake; rolls, yogurt, juice | 0+190+180+105+126 |
| 68 | 2022-09-22 00:00:00 | - | 0 | Pills, taxi, fare | 361+310 | - | 0 |
# Fix or overwrite values row by row, depending on the situation
detailed.loc[2, 'Total Price.1'] = '145+76+76'
detailed.loc[6, 'Total Price.1'] = '76+80+51+26'
detailed.loc[6, 'Total Price.2'] = '209+65+113+140'
detailed.loc[13, 'Other'] = detailed.loc[13, 'Other'].replace(';', ',')
detailed.loc[22, 'Total Price.2'] = '30+573+147'
detailed.loc[49, 'Total Price'] = '90+170'
detailed.loc[49, 'Other'] = detailed.loc[49, 'Other'].replace(';', ',')
detailed.loc[50, 'Total Price.1'] = '140+96+119+140+260+150+56'
detailed.loc[56, 'Total Price'] = '660'
detailed.loc[56, 'Total Price.1'] = '340+48'
detailed.loc[59, 'Other'] = detailed.loc[59, 'Other'].replace(';', ',')
detailed.loc[59, 'Extra'] = detailed.loc[59, 'Extra'].replace(';', ',')
detailed.loc[68, 'Total Price.1'] = '361+130+180'
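As a sanity check (a sketch, not in the original notebook), the mismatch search can simply be re-run after the fixes:

# Re-run the check; no rows should be reported now
ind_list = []
for ind, row in detailed.iterrows():
    find_mismatch_ind(row, ind, 'Main Meals', 'Total Price')
    find_mismatch_ind(row, ind, 'Other', 'Total Price.1')
    find_mismatch_ind(row, ind, 'Extra', 'Total Price.2')
print(len(ind_list))  # expected: 0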

Step 2. Reshape the table

detailed_cat = pd.melt(frame=detailed, id_vars='Date', value_vars=['Main Meals', 'Other', 'Extra'], value_name='Item', var_name='Category')
detailed_price = pd.melt(frame=detailed, id_vars='Date', value_vars=['Total Price', 'Total Price.1', 'Total Price.2'], value_name='Price', var_name='Which Price')

Step 3. Split the items and the summed spending.

def total_df_reshape(df, col, delimiter):
    # Create a new DataFrame with one stacked value per row

    # Treat the plus delimiter separately: single prices are stored as numbers,
    # so .str.split would turn them into NaN and stack() would drop them
    if delimiter == '+':
        s = df[col].apply(lambda x: x.split(delimiter) if delimiter in str(x) else x).apply(pd.Series).stack()
    else:
        s = df[col].str.title().str.split(delimiter).apply(pd.Series).stack()
    s.name = col
    s = pd.DataFrame(s)
    del df[col]

    # Attach Date and Category, matched through the first level of the MultiIndex
    indices = s.index.get_level_values(0)
    s[detailed_cat.columns[0:2]] = detailed_cat.loc[indices].values

    # Rearrange the columns
    s_cols = list(s.columns)
    s = s[s_cols[1:2] + s_cols[0:1] + s_cols[2:3]]
    return s.reset_index(drop=True)

new_detailed_cat = total_df_reshape(detailed_cat, 'Item', ', ')
new_detailed_price = total_df_reshape(detailed_price, 'Price', '+')
# del new_detailed_price['Category']
len(new_detailed_price), len(new_detailed_cat)
(470, 470)

Step 4. Merge the tables back.

# pd.merge(new_detailed_price, new_detailed_cat, left_index=True, right_index=True)
detailed_new = new_detailed_cat.join(new_detailed_price, rsuffix='_')
detailed_new = detailed_new.drop(columns=['Date_', 'Category_'])
# Check the data types
detailed_new.dtypes
Date        datetime64[ns]
Item                object
Category            object
Price               object
dtype: object
# Change data types
detailed_new['Price'] = detailed_new['Price'].astype('int')
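A slightly more defensive alternative (my sketch, not the author's code) is pd.to_numeric, which raises a clear error if any non-numeric price string survived the cleaning:

# Equivalent cast that surfaces leftover non-numeric strings explicitly
detailed_new['Price'] = pd.to_numeric(detailed_new['Price'], errors='raise').astype('int64')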
# The final version of the dataset
detailed_new.head(10)
| | Date | Item | Category | Price |
|---|---|---|---|---|
| 0 | 2022-07-30 | Bread | Main Meals | 30 |
| 1 | 2022-07-30 | Milk | Main Meals | 60 |
| 2 | 2022-07-30 | Paper | Main Meals | 250 |
| 3 | 2022-07-30 | Coffee | Main Meals | 630 |
| 4 | 2022-07-30 | Butter | Main Meals | 170 |
| 5 | 2022-07-30 | Chocolate | Main Meals | 138 |
| 6 | 2022-07-31 | Yogurt | Main Meals | 98 |
| 7 | 2022-07-31 | Bread | Main Meals | 76 |
| 8 | 2022-08-01 | Pasta | Main Meals | 80 |
| 9 | 2022-08-01 | Tomatoes | Main Meals | 100 |

Analyze Data

Price statistics

detailed_new.describe()
| | Price |
|---|---|
| count | 470.00000 |
| mean | 149.13617 |
| std | 181.24905 |
| min | 0.00000 |
| 25% | 58.00000 |
| 50% | 105.00000 |
| 75% | 170.75000 |
| max | 1748.00000 |
sns.boxplot(data=detailed_new, x='Price', showfliers=False)
plt.show()

Spending Five-Number Summary

Item statistics

detailed_new.Item.nunique()
155
items_summed = detailed_new.groupby('Item').agg({'Price': 'sum', 'Item': 'count'})
items_summed = items_summed.rename(columns={'Price': 'Total Price', 'Item': 'Amount'})
items_summed['Average Price'] = round(items_summed['Total Price'] / items_summed['Amount'], 2)
items_summed = items_summed.sort_values(by='Total Price', ascending=False)

# Top 5 items I spent the most money on
items_summed.head()
| Item | Total Price | Amount | Average Price |
|---|---|---|---|
| Coffee | 7139 | 22 | 324.50 |
| Fare | 2800 | 17 | 164.71 |
| Cake | 2586 | 14 | 184.71 |
| Taxi | 2407 | 5 | 481.40 |
| Cafe | 2300 | 2 | 1150.00 |

Categories comparison

categories = detailed_new.groupby('Category')['Price'].sum().reset_index()
categories
| | Category | Price |
|---|---|---|
| 0 | Extra | 16329 |
| 1 | Main Meals | 24590 |
| 2 | Other | 29175 |
plt.pie(categories['Price'], labels=categories['Category'])
my_circle = plt.Circle((0, 0), 0.7, color='white')
p = plt.gcf()
p.gca().add_artist(my_circle)

plt.show()

Categories Donut Chart

# As we can see, the three categories do not differ dramatically. Check with a bar plot:
plt.bar(categories['Category'], categories['Price'], color='brown')
plt.title('Total Expenses by Category', fontsize=14)
plt.show()

Total Expenses by Category

Detailed look at the Extra category

This category, despite its name, is quite large. A likely reason is that four out of the five items with the highest total spend (see the items_summed table above) belong to it.
Let's take a look at what percentage coffee makes up in Extra.

coffee_sum_detailed = np.sum(detailed_new.loc[detailed_new['Item'].str.contains('Coffee')].Price)
extra_prices = detailed_new.loc[detailed_new['Category'] == 'Extra'].Price
print(coffee_sum_detailed / np.sum(extra_prices) * 100)
49.17018800906363

Coffee accounts for almost half of all expenses in the Extra category.
Finally, let's check whether the products in this category also cost more on average.

sns.boxplot(data=detailed_new, x='Category', y='Price', showfliers=False)
plt.show()

Categories Five-Number Statistics

Since the Price data is not normally distributed but right-skewed, run the Kruskal-Wallis H-test to check whether the samples from all three categories come from the same distribution.
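The skewness claim itself can be verified with a quick sketch (not part of the original notebook):

# Positive skewness indicates right skew; a small Shapiro-Wilk p-value rejects normality
print(detailed_new['Price'].skew())
print(stats.shapiro(detailed_new['Price']))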

other_prices = detailed_new.loc[detailed_new['Category'] == 'Other'].Price
meal_prices = detailed_new.loc[detailed_new['Category'] == 'Main Meals'].Price
pval = stats.kruskal(extra_prices, other_prices, meal_prices)[1]
print(pval)
0.7095417875569991

Since the p-value is 0.71, we fail to reject the null hypothesis that all three categories come from the same distribution, so there is no evidence that Extra prices are higher than those of the other two categories.
Note: the average prices in the three categories are 154, 145, and 151.
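Those averages come from a one-line groupby (a sketch; the original shows only the resulting numbers):

# Mean price per category, rounded
print(detailed_new.groupby('Category')['Price'].mean().round())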

Conclusion

Data processing

  • Mismatched item and price columns were detected and fixed.

  • The detailed table was reshaped and split into category and price tables with pd.melt.

  • The items and prices columns in these two subsets were split by their delimiters with apply and stack, and the tables were then merged back.

Key findings

  1. The average purchase cost is around 150, but this value is heavily influenced by outliers (the biggest purchase was 1748). The median is 105, which suggests that most purchases were much cheaper than 150.

  2. The most common item I bought during the two months of keeping the table was coffee (no surprise, as half of this project was dedicated to this very product). Transportation takes the second and fourth positions (subway/buses and taxis), and desserts for coffee take the third.

  3. ‘Other’ (necessary items excluding food) is the category with the largest total spend.

  4. Coffee makes up 49% of all spending in the Extra category, and it is the item with the largest total cost in the dataset, which explains the size of this category.
    The hypothesis that products from the Extra category are more expensive on average was not supported.

This post is licensed under CC BY 4.0 by the author.