Pandas and SQL side by side

Learn sql from pandas or vice-versa?
data engineering
pandas
sql
Author

Abhishek

Published

January 5, 2022

SQL is a must have skill if you are working with data. Pandas is as essential for data analysis in the python. If you’re someone who knows pandas and want to learn SQL or you know SQL and want to learn pandas, this post will introduce you to some basic skills.

To work on SQL you need access to a database hosted locally or on the cloud. Some popular databases you can work with are the following:

There are many more but I’ve tried to list down the popular ones. To show side by side examples, however, I am not using any database. I have using a SQL engine called pandasql. So to run these examples, you will have to read some data into pandas dataframe and then use pandasql to run SQL queries.

So the steps are as follows:

  1. !pip install -U pandassql
  2. import pandas as pd
  3. from pandassql import sqldf
  4. data = pd.read_csv(<path-to-athlete_events.csv>)
  5. Run these examples

SELECTing Data

Once you import the data, select 5 rows from the data just to inspects. Below is an example of how to select top 5 rows in pandas:

data.head()
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN

You can achieve the same result in SQL:

sqldf("SELECT * FROM data LIMIT 5")
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball None
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight None
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football None
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres None

Both pandas and SQL allow you to select specific columns from the dataframe(pandas) or table(SQL).

data[['Name', 'Age', 'Team', 'Year']].head()
Name Age Team Year
0 A Dijiang 24.0 China 1992
1 A Lamusi 23.0 China 2012
2 Gunnar Nielsen Aaby 24.0 Denmark 1920
3 Edgar Lindenau Aabye 34.0 Denmark/Sweden 1900
4 Christine Jacoba Aaftink 21.0 Netherlands 1988
sqldf('SELECT Name, Age, Team, Year FROM data LIMIT 5')
Name Age Team Year
0 A Dijiang 24.0 China 1992
1 A Lamusi 23.0 China 2012
2 Gunnar Nielsen Aaby 24.0 Denmark 1920
3 Edgar Lindenau Aabye 34.0 Denmark/Sweden 1900
4 Christine Jacoba Aaftink 21.0 Netherlands 1988

FILTERing with constraints

iloc property of a pandas dataframe allows you to select a row by its index as shown below. Here we are particulary selecting the 6th row by applying the filter on index.

data.iloc[6]
ID                                       5
Name              Christine Jacoba Aaftink
Sex                                      F
Age                                   25.0
Height                               185.0
Weight                                82.0
Team                           Netherlands
NOC                                    NED
Games                          1992 Winter
Year                                  1992
Season                              Winter
City                           Albertville
Sport                        Speed Skating
Event     Speed Skating Women's 500 metres
Medal                                  NaN
Name: 6, dtype: object

In SQL most databases provide a rowid index to select a specific row from the table.

sqldf("SELECT * FROM data WHERE rowid=6")
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 1,000 metres None

You can filter on a condition applied on column or multiple columns. In case of multiple columns, the conditions are combined with boolen expressions such as AND, OR etc.

Here is how to count records which satisfy the condition “Age of the athelete is greater than 30”.

data[data['Age']>30]['ID'].count()
42107
sqldf('SELECT COUNT(*) FROM data WHERE Age > 30')
COUNT(*)
0 42107

Below is an example of multiple conditions. We are selecting all records of Hockey for team India.

data[(data['Team'] == 'India') & (data['Sport']=='Hockey')]
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
2513 1415 Shakeel Ahmed M 21.0 NaN NaN India IND 1992 Summer 1992 Summer Barcelona Hockey Hockey Men's Hockey NaN
4289 2453 Anil Alexander Aldrin M 24.0 NaN NaN India IND 1996 Summer 1996 Summer Atlanta Hockey Hockey Men's Hockey NaN
4732 2699 Shaukat Ali M 30.0 NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold
4735 2702 Syed Ali M 19.0 169.0 60.0 India IND 1976 Summer 1976 Summer Montreal Hockey Hockey Men's Hockey NaN
4736 2703 Syed Mushtaq Ali M 22.0 165.0 61.0 India IND 1964 Summer 1964 Summer Tokyo Hockey Hockey Men's Hockey Gold
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
253354 126867 Sunil Sowmarpet Vitalacharya M 27.0 176.0 68.0 India IND 2016 Summer 2016 Summer Rio de Janeiro Hockey Hockey Men's Hockey NaN
256143 128249 Devindar Sunil Walmiki M 24.0 178.0 69.0 India IND 2016 Summer 2016 Summer Rio de Janeiro Hockey Hockey Men's Hockey NaN
263723 131974 William Xalco M 20.0 167.0 60.0 India IND 2004 Summer 2004 Summer Athina Hockey Hockey Men's Hockey NaN
264075 132142 Renuka Yadav F 22.0 159.0 53.0 India IND 2016 Summer 2016 Summer Rio de Janeiro Hockey Hockey Women's Hockey NaN
266934 133554 Sayed Muhammad Yusuf M NaN NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold

345 rows × 15 columns

sqldf('SELECT * FROM data WHERE Team="India" AND Sport="Hockey"')
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1415 Shakeel Ahmed M 21.0 NaN NaN India IND 1992 Summer 1992 Summer Barcelona Hockey Hockey Men's Hockey None
1 2453 Anil Alexander Aldrin M 24.0 NaN NaN India IND 1996 Summer 1996 Summer Atlanta Hockey Hockey Men's Hockey None
2 2699 Shaukat Ali M 30.0 NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold
3 2702 Syed Ali M 19.0 169.0 60.0 India IND 1976 Summer 1976 Summer Montreal Hockey Hockey Men's Hockey None
4 2703 Syed Mushtaq Ali M 22.0 165.0 61.0 India IND 1964 Summer 1964 Summer Tokyo Hockey Hockey Men's Hockey Gold
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340 126867 Sunil Sowmarpet Vitalacharya M 27.0 176.0 68.0 India IND 2016 Summer 2016 Summer Rio de Janeiro Hockey Hockey Men's Hockey None
341 128249 Devindar Sunil Walmiki M 24.0 178.0 69.0 India IND 2016 Summer 2016 Summer Rio de Janeiro Hockey Hockey Men's Hockey None
342 131974 William Xalco M 20.0 167.0 60.0 India IND 2004 Summer 2004 Summer Athina Hockey Hockey Men's Hockey None
343 132142 Renuka Yadav F 22.0 159.0 53.0 India IND 2016 Summer 2016 Summer Rio de Janeiro Hockey Hockey Women's Hockey None
344 133554 Sayed Muhammad Yusuf M NaN NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold

345 rows × 15 columns

SORTing

Sorting is another common operation required during data analysis. You can sort on one or more columns and in ascending (low to high) or descending (high to low) order.

data.sort_values(by='Age', ascending=False).head()
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
257054 128719 John Quincy Adams Ward M 97.0 NaN NaN United States USA 1928 Summer 1928 Summer Amsterdam Art Competitions Art Competitions Mixed Sculpturing, Statues NaN
98118 49663 Winslow Homer M 96.0 NaN NaN United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event NaN
60863 31173 Thomas Cowperthwait Eakins M 88.0 NaN NaN United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event NaN
60861 31173 Thomas Cowperthwait Eakins M 88.0 NaN NaN United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event NaN
60862 31173 Thomas Cowperthwait Eakins M 88.0 NaN NaN United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event NaN
sqldf('SELECT * from data ORDER BY Age DESC LIMIT 5')
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 128719 John Quincy Adams Ward M 97.0 None None United States USA 1928 Summer 1928 Summer Amsterdam Art Competitions Art Competitions Mixed Sculpturing, Statues None
1 49663 Winslow Homer M 96.0 None None United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event None
2 31173 Thomas Cowperthwait Eakins M 88.0 None None United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event None
3 31173 Thomas Cowperthwait Eakins M 88.0 None None United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event None
4 31173 Thomas Cowperthwait Eakins M 88.0 None None United States USA 1932 Summer 1932 Summer Los Angeles Art Competitions Art Competitions Mixed Painting, Unknown Event None

JOINing

Joining one dataset with another is a common operation while working with data. This is a simple aspect of SQL which stumps more people than it should.

I reckon, it is important to work with a simple example using few rows of data to get this right. That is what I am going to do using synthetic tables employee and assets with dummy data.

Please notice the column we are joining on, usually termed as key. If the key has duplicate values, the records multiply.

EmployeeID DepartmentID Name
0 100 100 John Doe
1 101 100 Will green
2 102 200 Wilson Miner
3 104 200 Rochel Dmello
4 105 300 Dickie Bird
AssetID EmployeeID DepartmentID
0 aab 100 100
1 aba 101 100
2 baa 201 200
3 cde 202 200
4 efg 103 300

INNER JOIN

The simplest type of join is the inner join. This operation returns the common records from the left table (employee) and the right table (asset).

Pandas uses merge method to join to tables.

employee.merge(asset, on='EmployeeID')
EmployeeID DepartmentID_x Name AssetID DepartmentID_y
0 100 100 John Doe aab 100
1 101 100 Will green aba 100
sqldf('SELECT * FROM employee E JOIN asset A ON E.EmployeeID = A.EmployeeID')
EmployeeID DepartmentID Name AssetID EmployeeID DepartmentID
0 100 100 John Doe aab 100 100
1 101 100 Will green aba 101 100

LEFT OUTER JOIN

In Left Outer Join, you get all records from the left table, and only matching records from the right table. Note the NaN and None values resulting in rows for which key doesn’t match.

employee.merge(asset, on='EmployeeID', how='left')
EmployeeID DepartmentID_x Name AssetID DepartmentID_y
0 100 100 John Doe aab 100
1 101 100 Will green aba 100
2 102 200 Wilson Miner NaN NaN
3 104 200 Rochel Dmello NaN NaN
4 105 300 Dickie Bird NaN NaN
sqldf('SELECT E.*, A.* FROM employee E LEFT JOIN asset A ON E.EmployeeID = A.EmployeeID')
EmployeeID DepartmentID Name AssetID EmployeeID DepartmentID
0 100 100 John Doe aab 100.0 100
1 101 100 Will green aba 101.0 100
2 102 200 Wilson Miner None NaN None
3 104 200 Rochel Dmello None NaN None
4 105 300 Dickie Bird None NaN None

RIGHT OUTER JOIN

Right join has the opposite effect of the left join.

employee.merge(asset, on='EmployeeID', how='right')
EmployeeID DepartmentID_x Name AssetID DepartmentID_y
0 100 100.0 John Doe aab 100
1 101 100.0 Will green aba 100
2 201 NaN NaN baa 200
3 202 NaN NaN cde 200
4 103 NaN NaN efg 300

In fact pandassql doesn’t support Right Join but you can achieve the same result by calling asset as the left table and employee as the right table.

sqldf('SELECT E.*, A.* FROM asset A LEFT JOIN employee E ON A.EmployeeID = E.EmployeeID')
EmployeeID DepartmentID Name AssetID EmployeeID DepartmentID
0 100.0 100.0 John Doe aab 100 100
1 101.0 100.0 Will green aba 101 100
2 NaN NaN None baa 201 200
3 NaN NaN None cde 202 200
4 NaN NaN None efg 103 300

Another type of join pandasql doesn’t support is the Full outer join, which returns all matching as well as not matching records from both the tables.

You need not worry about these missing type of joins in pandassql, all database that I have worked with support Right Join and Full Outer Join. With some trickery you can even achieve these in pandassql but it’s not required for this conversation.

Queries with aggregates

Aggregates are another SQL technique where you group your records based on one or more columns and perform some aggregate calculation on another set of columns such as MIN, MAX, SUM, COUNT, AVG, and so on.

Here is an example of count where we count Medals by team, sort them in descending order and show top % results.

data.groupby('Team', as_index=False)['Medal'].count().sort_values('Medal', ascending=False).head()
Team Medal
1095 United States 5219
976 Soviet Union 2451
398 Germany 1984
412 Great Britain 1673
361 France 1550
sqldf('SELECT Team, COUNT(Medal) FROM data GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')
Team COUNT(Medal)
0 United States 5219
1 Soviet Union 2451
2 Germany 1984
3 Great Britain 1673
4 France 1550

Here is another example where we filter just one sport swimming and do the earlier calculation.

data[data['Sport']=='Swimming'].groupby('Team', as_index=False)['Medal'].count().sort_values('Medal', ascending=False).head()
Team Medal
197 United States 1066
10 Australia 401
56 East Germany 152
70 Germany 152
72 Great Britain 127
sqldf('SELECT Team, COUNT(Medal) FROM data Where Sport = "Swimming" GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')
Team COUNT(Medal)
0 United States 1066
1 Australia 401
2 Germany 152
3 East Germany 152
4 Great Britain 127

There is a lot more you can do with SQL, like filtering on aggregated data usin HAVING clause. Partition data by row, rank or dense rank. Use analytical functions to compute values across rows or over a window.

There is even more that you can do with pandas. Here I am listing some resources if you’re ready for a deep dive.

Pandas Resources

SQL Resources

Buy Me A Coffee

Footnotes

  1. “There are many resources but for pandas this is all you really need”↩︎