Pandas and SQL side by side

SQL is a must have skill if you are working with data. Pandas is as essential for data analysis in the python. If you’re someone who knows pandas and want to learn SQL or you know SQL and want to learn pandas, this post will introduce you to some basic skills.

To work on SQL you need access to a database hosted locally or on the cloud. Some popular databases you can work with are the following:

There are many more but I’ve tried to list down the popular ones. To show side by side examples, however, I am not using any database. I have using a SQL engine called pandasql. So to run these examples, you will have to read some data into pandas dataframe and then use pandasql to run SQL queries.

So the steps are as follows:

!pip install -U pandassql
import pandas as pd
from pandassql import sqldf
data = pd.read_csv(<path-to-athlete_events.csv>)
Run these examples

SELECTing Data

Once you import the data, select 5 rows from the data just to inspects. Below is an example of how to select top 5 rows in pandas:

data.head()

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	1	A Dijiang	M	24.0	180.0	80.0	China	CHN	1992 Summer	1992	Summer	Barcelona	Basketball	Basketball Men's Basketball	NaN
1	2	A Lamusi	M	23.0	170.0	60.0	China	CHN	2012 Summer	2012	Summer	London	Judo	Judo Men's Extra-Lightweight	NaN
2	3	Gunnar Nielsen Aaby	M	24.0	NaN	NaN	Denmark	DEN	1920 Summer	1920	Summer	Antwerpen	Football	Football Men's Football	NaN
3	4	Edgar Lindenau Aabye	M	34.0	NaN	NaN	Denmark/Sweden	DEN	1900 Summer	1900	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
4	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres	NaN

You can achieve the same result in SQL:

sqldf("SELECT * FROM data LIMIT 5")

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	1	A Dijiang	M	24.0	180.0	80.0	China	CHN	1992 Summer	1992	Summer	Barcelona	Basketball	Basketball Men's Basketball	None
1	2	A Lamusi	M	23.0	170.0	60.0	China	CHN	2012 Summer	2012	Summer	London	Judo	Judo Men's Extra-Lightweight	None
2	3	Gunnar Nielsen Aaby	M	24.0	NaN	NaN	Denmark	DEN	1920 Summer	1920	Summer	Antwerpen	Football	Football Men's Football	None
3	4	Edgar Lindenau Aabye	M	34.0	NaN	NaN	Denmark/Sweden	DEN	1900 Summer	1900	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
4	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres	None

Both pandas and SQL allow you to select specific columns from the dataframe(pandas) or table(SQL).

data[['Name', 'Age', 'Team', 'Year']].head()

	Name	Age	Team	Year
0	A Dijiang	24.0	China	1992
1	A Lamusi	23.0	China	2012
2	Gunnar Nielsen Aaby	24.0	Denmark	1920
3	Edgar Lindenau Aabye	34.0	Denmark/Sweden	1900
4	Christine Jacoba Aaftink	21.0	Netherlands	1988

sqldf('SELECT Name, Age, Team, Year FROM data LIMIT 5')

	Name	Age	Team	Year
0	A Dijiang	24.0	China	1992
1	A Lamusi	23.0	China	2012
2	Gunnar Nielsen Aaby	24.0	Denmark	1920
3	Edgar Lindenau Aabye	34.0	Denmark/Sweden	1900
4	Christine Jacoba Aaftink	21.0	Netherlands	1988

FILTERing with constraints

iloc property of a pandas dataframe allows you to select a row by its index as shown below. Here we are particulary selecting the 6th row by applying the filter on index.

data.iloc[6]

ID                                       5
Name              Christine Jacoba Aaftink
Sex                                      F
Age                                   25.0
Height                               185.0
Weight                                82.0
Team                           Netherlands
NOC                                    NED
Games                          1992 Winter
Year                                  1992
Season                              Winter
City                           Albertville
Sport                        Speed Skating
Event     Speed Skating Women's 500 metres
Medal                                  NaN
Name: 6, dtype: object

In SQL most databases provide a rowid index to select a specific row from the table.

sqldf("SELECT * FROM data WHERE rowid=6")

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 1,000 metres	None

You can filter on a condition applied on column or multiple columns. In case of multiple columns, the conditions are combined with boolen expressions such as AND, OR etc.

Here is how to count records which satisfy the condition “Age of the athelete is greater than 30”.

data[data['Age']>30]['ID'].count()

sqldf('SELECT COUNT(*) FROM data WHERE Age > 30')

	COUNT(*)
0	42107

Below is an example of multiple conditions. We are selecting all records of Hockey for team India.

data[(data['Team'] == 'India') & (data['Sport']=='Hockey')]

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
2513	1415	Shakeel Ahmed	M	21.0	NaN	NaN	India	IND	1992 Summer	1992	Summer	Barcelona	Hockey	Hockey Men's Hockey	NaN
4289	2453	Anil Alexander Aldrin	M	24.0	NaN	NaN	India	IND	1996 Summer	1996	Summer	Atlanta	Hockey	Hockey Men's Hockey	NaN
4732	2699	Shaukat Ali	M	30.0	NaN	NaN	India	IND	1928 Summer	1928	Summer	Amsterdam	Hockey	Hockey Men's Hockey	Gold
4735	2702	Syed Ali	M	19.0	169.0	60.0	India	IND	1976 Summer	1976	Summer	Montreal	Hockey	Hockey Men's Hockey	NaN
4736	2703	Syed Mushtaq Ali	M	22.0	165.0	61.0	India	IND	1964 Summer	1964	Summer	Tokyo	Hockey	Hockey Men's Hockey	Gold
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
253354	126867	Sunil Sowmarpet Vitalacharya	M	27.0	176.0	68.0	India	IND	2016 Summer	2016	Summer	Rio de Janeiro	Hockey	Hockey Men's Hockey	NaN
256143	128249	Devindar Sunil Walmiki	M	24.0	178.0	69.0	India	IND	2016 Summer	2016	Summer	Rio de Janeiro	Hockey	Hockey Men's Hockey	NaN
263723	131974	William Xalco	M	20.0	167.0	60.0	India	IND	2004 Summer	2004	Summer	Athina	Hockey	Hockey Men's Hockey	NaN
264075	132142	Renuka Yadav	F	22.0	159.0	53.0	India	IND	2016 Summer	2016	Summer	Rio de Janeiro	Hockey	Hockey Women's Hockey	NaN
266934	133554	Sayed Muhammad Yusuf	M	NaN	NaN	NaN	India	IND	1928 Summer	1928	Summer	Amsterdam	Hockey	Hockey Men's Hockey	Gold

345 rows × 15 columns

sqldf('SELECT * FROM data WHERE Team="India" AND Sport="Hockey"')

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	1415	Shakeel Ahmed	M	21.0	NaN	NaN	India	IND	1992 Summer	1992	Summer	Barcelona	Hockey	Hockey Men's Hockey	None
1	2453	Anil Alexander Aldrin	M	24.0	NaN	NaN	India	IND	1996 Summer	1996	Summer	Atlanta	Hockey	Hockey Men's Hockey	None
2	2699	Shaukat Ali	M	30.0	NaN	NaN	India	IND	1928 Summer	1928	Summer	Amsterdam	Hockey	Hockey Men's Hockey	Gold
3	2702	Syed Ali	M	19.0	169.0	60.0	India	IND	1976 Summer	1976	Summer	Montreal	Hockey	Hockey Men's Hockey	None
4	2703	Syed Mushtaq Ali	M	22.0	165.0	61.0	India	IND	1964 Summer	1964	Summer	Tokyo	Hockey	Hockey Men's Hockey	Gold
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
340	126867	Sunil Sowmarpet Vitalacharya	M	27.0	176.0	68.0	India	IND	2016 Summer	2016	Summer	Rio de Janeiro	Hockey	Hockey Men's Hockey	None
341	128249	Devindar Sunil Walmiki	M	24.0	178.0	69.0	India	IND	2016 Summer	2016	Summer	Rio de Janeiro	Hockey	Hockey Men's Hockey	None
342	131974	William Xalco	M	20.0	167.0	60.0	India	IND	2004 Summer	2004	Summer	Athina	Hockey	Hockey Men's Hockey	None
343	132142	Renuka Yadav	F	22.0	159.0	53.0	India	IND	2016 Summer	2016	Summer	Rio de Janeiro	Hockey	Hockey Women's Hockey	None
344	133554	Sayed Muhammad Yusuf	M	NaN	NaN	NaN	India	IND	1928 Summer	1928	Summer	Amsterdam	Hockey	Hockey Men's Hockey	Gold

345 rows × 15 columns

SORTing

Sorting is another common operation required during data analysis. You can sort on one or more columns and in ascending (low to high) or descending (high to low) order.

data.sort_values(by='Age', ascending=False).head()

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
257054	128719	John Quincy Adams Ward	M	97.0	NaN	NaN	United States	USA	1928 Summer	1928	Summer	Amsterdam	Art Competitions	Art Competitions Mixed Sculpturing, Statues	NaN
98118	49663	Winslow Homer	M	96.0	NaN	NaN	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	NaN
60863	31173	Thomas Cowperthwait Eakins	M	88.0	NaN	NaN	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	NaN
60861	31173	Thomas Cowperthwait Eakins	M	88.0	NaN	NaN	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	NaN
60862	31173	Thomas Cowperthwait Eakins	M	88.0	NaN	NaN	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	NaN

sqldf('SELECT * from data ORDER BY Age DESC LIMIT 5')

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	128719	John Quincy Adams Ward	M	97.0	None	None	United States	USA	1928 Summer	1928	Summer	Amsterdam	Art Competitions	Art Competitions Mixed Sculpturing, Statues	None
1	49663	Winslow Homer	M	96.0	None	None	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	None
2	31173	Thomas Cowperthwait Eakins	M	88.0	None	None	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	None
3	31173	Thomas Cowperthwait Eakins	M	88.0	None	None	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	None
4	31173	Thomas Cowperthwait Eakins	M	88.0	None	None	United States	USA	1932 Summer	1932	Summer	Los Angeles	Art Competitions	Art Competitions Mixed Painting, Unknown Event	None

JOINing

Joining one dataset with another is a common operation while working with data. This is a simple aspect of SQL which stumps more people than it should.

I reckon, it is important to work with a simple example using few rows of data to get this right. That is what I am going to do using synthetic tables employee and assets with dummy data.

Please notice the column we are joining on, usually termed as key. If the key has duplicate values, the records multiply.

	EmployeeID	DepartmentID	Name
0	100	100	John Doe
1	101	100	Will green
2	102	200	Wilson Miner
3	104	200	Rochel Dmello
4	105	300	Dickie Bird

	AssetID	EmployeeID	DepartmentID
0	aab	100	100
1	aba	101	100
2	baa	201	200
3	cde	202	200
4	efg	103	300

INNER JOIN

The simplest type of join is the inner join. This operation returns the common records from the left table (employee) and the right table (asset).

Pandas uses merge method to join to tables.

employee.merge(asset, on='EmployeeID')

	EmployeeID	DepartmentID_x	Name	AssetID	DepartmentID_y
0	100	100	John Doe	aab	100
1	101	100	Will green	aba	100

sqldf('SELECT * FROM employee E JOIN asset A ON E.EmployeeID = A.EmployeeID')

	EmployeeID	DepartmentID	Name	AssetID	EmployeeID	DepartmentID
0	100	100	John Doe	aab	100	100
1	101	100	Will green	aba	101	100

LEFT OUTER JOIN

In Left Outer Join, you get all records from the left table, and only matching records from the right table. Note the NaN and None values resulting in rows for which key doesn’t match.

employee.merge(asset, on='EmployeeID', how='left')

	EmployeeID	DepartmentID_x	Name	AssetID	DepartmentID_y
0	100	100	John Doe	aab	100
1	101	100	Will green	aba	100
2	102	200	Wilson Miner	NaN	NaN
3	104	200	Rochel Dmello	NaN	NaN
4	105	300	Dickie Bird	NaN	NaN

sqldf('SELECT E.*, A.* FROM employee E LEFT JOIN asset A ON E.EmployeeID = A.EmployeeID')

	EmployeeID	DepartmentID	Name	AssetID	EmployeeID	DepartmentID
0	100	100	John Doe	aab	100.0	100
1	101	100	Will green	aba	101.0	100
2	102	200	Wilson Miner	None	NaN	None
3	104	200	Rochel Dmello	None	NaN	None
4	105	300	Dickie Bird	None	NaN	None

RIGHT OUTER JOIN

Right join has the opposite effect of the left join.

employee.merge(asset, on='EmployeeID', how='right')

	EmployeeID	DepartmentID_x	Name	AssetID	DepartmentID_y
0	100	100.0	John Doe	aab	100
1	101	100.0	Will green	aba	100
2	201	NaN	NaN	baa	200
3	202	NaN	NaN	cde	200
4	103	NaN	NaN	efg	300

In fact pandassql doesn’t support Right Join but you can achieve the same result by calling asset as the left table and employee as the right table.

sqldf('SELECT E.*, A.* FROM asset A LEFT JOIN employee E ON A.EmployeeID = E.EmployeeID')

	EmployeeID	DepartmentID	Name	AssetID	EmployeeID	DepartmentID
0	100.0	100.0	John Doe	aab	100	100
1	101.0	100.0	Will green	aba	101	100
2	NaN	NaN	None	baa	201	200
3	NaN	NaN	None	cde	202	200
4	NaN	NaN	None	efg	103	300

Another type of join pandasql doesn’t support is the Full outer join, which returns all matching as well as not matching records from both the tables.

You need not worry about these missing type of joins in pandassql, all database that I have worked with support Right Join and Full Outer Join. With some trickery you can even achieve these in pandassql but it’s not required for this conversation.

Queries with aggregates

Aggregates are another SQL technique where you group your records based on one or more columns and perform some aggregate calculation on another set of columns such as MIN, MAX, SUM, COUNT, AVG, and so on.

Here is an example of count where we count Medals by team, sort them in descending order and show top % results.

data.groupby('Team', as_index=False)['Medal'].count().sort_values('Medal', ascending=False).head()

	Team	Medal
1095	United States	5219
976	Soviet Union	2451
398	Germany	1984
412	Great Britain	1673
361	France	1550

sqldf('SELECT Team, COUNT(Medal) FROM data GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')

	Team	COUNT(Medal)
0	United States	5219
1	Soviet Union	2451
2	Germany	1984
3	Great Britain	1673
4	France	1550

Here is another example where we filter just one sport swimming and do the earlier calculation.

data[data['Sport']=='Swimming'].groupby('Team', as_index=False)['Medal'].count().sort_values('Medal', ascending=False).head()

	Team	Medal
197	United States	1066
10	Australia	401
56	East Germany	152
70	Germany	152
72	Great Britain	127

sqldf('SELECT Team, COUNT(Medal) FROM data Where Sport = "Swimming" GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')

	Team	COUNT(Medal)
0	United States	1066
1	Australia	401
2	Germany	152
3	East Germany	152
4	Great Britain	127

There is a lot more you can do with SQL, like filtering on aggregated data usin HAVING clause. Partition data by row, rank or dense rank. Use analytical functions to compute values across rows or over a window.

There is even more that you can do with pandas. Here I am listing some resources if you’re ready for a deep dive.