Software development training - Geekuni blog: Why learn Pandas?
Thursday, 4 June 2026
Why learn Pandas?
By Andrew Solomon - Geekuni mentor, software engineer and aspiring stall holder.
In this article, we’ll look at real-world problems and solve them both with and without Pandas. Along the way, you’ll see how Pandas can help you solve problems with less code that’s easier to read.
The goal isn’t to say you always need Pandas. It’s to show when it can help you get the answers you need from your data - sometimes even more easily than using a spreadsheet!
Introduction
For a long time I thought of Pandas as a Swiss Army knife for data scientists only. However, when I started playing with it I realised that it was going to make a lot of things much more straightforward for me as a developer too. Sharing Pandas with other developers was part of my motivation for putting together the Python Essentials course.
This article is a very quick Pandas taster: we walk you through a use case that should give you the motivation to add it to your toolkit too.
Scenario
I’m running a market stall at Bondi Beach (I wish!) and it’s time to look over how my purchases compare with my sales to see what I need to fine-tune.
I have all purchase and sales data from 2025 in a spreadsheet, and here are my questions:
What was my profit over the year?
What was my profit per month? Per type of produce?
What did I buy too much of?
The data I have to work off is this CSV file - my ledger.
Preparation
First, clone the data and examples for this blog post rather than copying and pasting:
git clone https://github.com/andrewsolomon/play_with_pandas.git
Because you’ll be installing two Python modules - Pandas and Babel - create a virtual environment first so that package dependencies don’t affect anything else you’re working on:
cd play_with_pandas
python3 -m venv .env
source .env/bin/activate
Finally, install Pandas and Babel:
python -m pip install -r requirements.txt
Example 1: What was my profit over the year?
Let’s start with the easiest, a spreadsheet:
Open Google Sheets and import ledger.csv
In G1 enter profit
In G2 enter =ARRAYFORMULA(E2:E * F2:F - C2:C * D2:D)
In I1 enter Total Profit
In J1 enter =ARRAYFORMULA(SUM(G2:G))
The non-Pandas Python approach involves looping over each row, type-casting the various fields from strings to integers or floats, and calculating the profit of each row in a for-loop.
#!/usr/bin/env python
import csv

with open('ledger.csv', 'r') as file:
    rows = csv.DictReader(file)
    # Every CSV field arrives as a string, so cast before doing arithmetic.
    profit = sum(
        int(row['num_sold']) * float(row['retail_price'])
        - int(row['num_purchased']) * float(row['wholesale_price'])
        for row in rows
    )

print(f'Profit: ${profit:,.2f}')
Here’s the Pandas approach:
#!/usr/bin/env python
import pandas as pd

df = pd.read_csv('ledger.csv')
# Add a profit column calculated from the existing columns.
df['profit'] = (
    df['num_sold'] * df['retail_price']
    - df['num_purchased'] * df['wholesale_price']
)
print(f'Profit: ${df["profit"].sum():,.2f}')
Both approaches give the same result:
$ ./ex01_year_profit.py
Profit: $113,624.06
Reflections
Think of df (short for DataFrame) as a spreadsheet, where we’re adding the column profit using a formula involving columns num_sold, retail_price, num_purchased and wholesale_price. As with a spreadsheet, we didn’t need to loop over the rows - we just did the calculation using columns.
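To make the spreadsheet analogy concrete, here is a minimal sketch using a tiny hand-made DataFrame. The two rows are invented for illustration and are not taken from ledger.csv. It suggests that arithmetic on whole columns is applied row by row, giving the same numbers as an explicit loop.

```python
import pandas as pd

# Invented sample rows, using the same column names as ledger.csv.
df = pd.DataFrame({
    'produce': ['apples', 'bananas'],
    'wholesale_price': [1.0, 0.5],
    'num_purchased': [10, 20],
    'num_sold': [8, 20],
    'retail_price': [2.0, 1.0],
})

# Column arithmetic: each operation is applied element-wise, row by row.
df['profit'] = (
    df['num_sold'] * df['retail_price']
    - df['num_purchased'] * df['wholesale_price']
)

# The same calculation written as an explicit loop over the rows.
loop_profit = [
    row.num_sold * row.retail_price - row.num_purchased * row.wholesale_price
    for row in df.itertuples()
]

print(df['profit'].tolist())
print(loop_profit)
```

Both print statements should show the same two per-row profits; the column version just never mentions the rows.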
Amazingly, we didn’t need to do any type casting (e.g. float(row['retail_price'])) - Pandas just guessed the column types for us!
Here are Pandas’ inferred column types:
>>> df.dtypes
date                object
produce             object
wholesale_price    float64
num_purchased        int64
num_sold             int64
retail_price       float64
profit             float64
dtype: object
It’s got the prices and numbers right, but the date has been inferred as object, which means it will be treated just like a Python string. That isn’t ideal for date-based calculations. We’ll address this in the next example.
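Here is a small sketch of that inference, run on an in-memory CSV so it doesn't depend on ledger.csv. The single data row is invented. One assumption worth flagging: how text columns are reported depends on the Pandas version (object in the output above, while newer releases may report a dedicated string dtype), so the code checks the Python type of a value rather than the dtype name.

```python
import io

import pandas as pd

# Invented one-row sample with the same header as ledger.csv.
sample = io.StringIO(
    'date,produce,wholesale_price,num_purchased,num_sold,retail_price\n'
    '2025-01-04,apples,1.25,10,8,2.50\n'
)
df = pd.read_csv(sample)

# Numeric columns are inferred as integers or floats.
print(df['num_sold'].dtype)       # int64
print(df['retail_price'].dtype)   # float64

# Without parse_dates, the date is left as plain text.
first_date = df['date'].iloc[0]
print(type(first_date).__name__)  # str
```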
Example 2: What was my profit per month?
Without Pandas, the Python code involves starting with a default dictionary using months for keys, like this:
>>> from collections import defaultdict
>>> monthly_profit = defaultdict(float)
>>> monthly_profit['2025-01']
0.0
We’re extracting the month 2025-01 as the first 7 characters of the date string 2025-01-04 like this:
month = row['date'][:7]
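The slice relies on the dates being in fixed-width, zero-padded YYYY-MM-DD form, as the sample date above suggests. That is an assumption about the rest of the ledger. A short sketch of the idea, using invented dates and amounts, which also shows why sorting the month strings alphabetically gives chronological order:

```python
from collections import defaultdict

# Invented (date, profit) pairs in the same YYYY-MM-DD format as the ledger.
entries = [
    ('2025-02-11', 5.0),
    ('2025-01-04', 2.5),
    ('2025-01-18', 1.5),
]

monthly = defaultdict(float)  # a month not seen before starts at 0.0
for date, profit in entries:
    month = date[:7]          # '2025-01-04' -> '2025-01'
    monthly[month] += profit

# Zero-padded year-month strings sort alphabetically in date order.
for month in sorted(monthly):
    print(month, monthly[month])
```

If the dates were written as 4/1/2025 instead, the slice would no longer pick out the month, and a real date parser would be the safer choice.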
The full implementation is, once again, a for-loop:
#!/usr/bin/env python
import csv
from collections import defaultdict

monthly_profit = defaultdict(float)

with open('ledger.csv', 'r') as file:
    rows = csv.DictReader(file)

    # Loop inside the with-block: rows are read lazily from the open file.
    for row in rows:
        month = row['date'][:7]
        monthly_profit[month] += (
            int(row['num_sold']) * float(row['retail_price'])
            - int(row['num_purchased']) * float(row['wholesale_price'])
        )

print('Profit per month: ')
for month in sorted(monthly_profit):
    print(f'{month} ${monthly_profit[month]:,.2f}')
The Pandas approach starts by making sure the date column has exactly the type you expect, using the parse_dates and date_format parameters:
>>> import pandas as pd
>>> df = pd.read_csv(
...     'ledger.csv',
...     parse_dates=['date'],
...     date_format={'date':...