SQL is a must have skill if you are working with data. Pandas is as essential for data analysis in the python. If you’re someone who knows pandas and want to learn SQL or you know SQL and want to learn pandas, this post will introduce you to some basic skills.
To work on SQL you need access to a database hosted locally or on the cloud. Some popular databases you can work with are the following:
There are many more but I’ve tried to list down the popular ones. To show side by side examples, however, I am not using any database. I have using a SQL engine called pandasql. So to run these examples, you will have to read some data into pandas dataframe and then use pandasql to run SQL queries.
So the steps are as follows:
!pip install -U pandassql
import pandas as pd
from pandassql import sqldf
data = pd.read_csv(<path-to-athlete_events.csv>)
Run these examples
SELECTing Data
Once you import the data, select 5 rows from the data just to inspects. Below is an example of how to select top 5 rows in pandas:
data.head()
ID
Name
Sex
Age
Height
Weight
Team
NOC
Games
Year
Season
City
Sport
Event
Medal
0
1
A Dijiang
M
24.0
180.0
80.0
China
CHN
1992 Summer
1992
Summer
Barcelona
Basketball
Basketball Men's Basketball
NaN
1
2
A Lamusi
M
23.0
170.0
60.0
China
CHN
2012 Summer
2012
Summer
London
Judo
Judo Men's Extra-Lightweight
NaN
2
3
Gunnar Nielsen Aaby
M
24.0
NaN
NaN
Denmark
DEN
1920 Summer
1920
Summer
Antwerpen
Football
Football Men's Football
NaN
3
4
Edgar Lindenau Aabye
M
34.0
NaN
NaN
Denmark/Sweden
DEN
1900 Summer
1900
Summer
Paris
Tug-Of-War
Tug-Of-War Men's Tug-Of-War
Gold
4
5
Christine Jacoba Aaftink
F
21.0
185.0
82.0
Netherlands
NED
1988 Winter
1988
Winter
Calgary
Speed Skating
Speed Skating Women's 500 metres
NaN
You can achieve the same result in SQL:
sqldf("SELECT * FROM data LIMIT 5")
ID
Name
Sex
Age
Height
Weight
Team
NOC
Games
Year
Season
City
Sport
Event
Medal
0
1
A Dijiang
M
24.0
180.0
80.0
China
CHN
1992 Summer
1992
Summer
Barcelona
Basketball
Basketball Men's Basketball
None
1
2
A Lamusi
M
23.0
170.0
60.0
China
CHN
2012 Summer
2012
Summer
London
Judo
Judo Men's Extra-Lightweight
None
2
3
Gunnar Nielsen Aaby
M
24.0
NaN
NaN
Denmark
DEN
1920 Summer
1920
Summer
Antwerpen
Football
Football Men's Football
None
3
4
Edgar Lindenau Aabye
M
34.0
NaN
NaN
Denmark/Sweden
DEN
1900 Summer
1900
Summer
Paris
Tug-Of-War
Tug-Of-War Men's Tug-Of-War
Gold
4
5
Christine Jacoba Aaftink
F
21.0
185.0
82.0
Netherlands
NED
1988 Winter
1988
Winter
Calgary
Speed Skating
Speed Skating Women's 500 metres
None
Both pandas and SQL allow you to select specific columns from the dataframe(pandas) or table(SQL).
data[['Name', 'Age', 'Team', 'Year']].head()
Name
Age
Team
Year
0
A Dijiang
24.0
China
1992
1
A Lamusi
23.0
China
2012
2
Gunnar Nielsen Aaby
24.0
Denmark
1920
3
Edgar Lindenau Aabye
34.0
Denmark/Sweden
1900
4
Christine Jacoba Aaftink
21.0
Netherlands
1988
sqldf('SELECT Name, Age, Team, Year FROM data LIMIT 5')
Name
Age
Team
Year
0
A Dijiang
24.0
China
1992
1
A Lamusi
23.0
China
2012
2
Gunnar Nielsen Aaby
24.0
Denmark
1920
3
Edgar Lindenau Aabye
34.0
Denmark/Sweden
1900
4
Christine Jacoba Aaftink
21.0
Netherlands
1988
FILTERing with constraints
iloc property of a pandas dataframe allows you to select a row by its index as shown below. Here we are particulary selecting the 6th row by applying the filter on index.
data.iloc[6]
ID 5
Name Christine Jacoba Aaftink
Sex F
Age 25.0
Height 185.0
Weight 82.0
Team Netherlands
NOC NED
Games 1992 Winter
Year 1992
Season Winter
City Albertville
Sport Speed Skating
Event Speed Skating Women's 500 metres
Medal NaN
Name: 6, dtype: object
In SQL most databases provide a rowid index to select a specific row from the table.
sqldf("SELECT * FROM data WHERE rowid=6")
ID
Name
Sex
Age
Height
Weight
Team
NOC
Games
Year
Season
City
Sport
Event
Medal
0
5
Christine Jacoba Aaftink
F
21.0
185.0
82.0
Netherlands
NED
1988 Winter
1988
Winter
Calgary
Speed Skating
Speed Skating Women's 1,000 metres
None
You can filter on a condition applied on column or multiple columns. In case of multiple columns, the conditions are combined with boolen expressions such as AND, OR etc.
Here is how to count records which satisfy the condition “Age of the athelete is greater than 30”.
data[data['Age']>30]['ID'].count()
42107
sqldf('SELECT COUNT(*) FROM data WHERE Age > 30')
COUNT(*)
0
42107
Below is an example of multiple conditions. We are selecting all records of Hockey for team India.
sqldf('SELECT * FROM data WHERE Team="India" AND Sport="Hockey"')
ID
Name
Sex
Age
Height
Weight
Team
NOC
Games
Year
Season
City
Sport
Event
Medal
0
1415
Shakeel Ahmed
M
21.0
NaN
NaN
India
IND
1992 Summer
1992
Summer
Barcelona
Hockey
Hockey Men's Hockey
None
1
2453
Anil Alexander Aldrin
M
24.0
NaN
NaN
India
IND
1996 Summer
1996
Summer
Atlanta
Hockey
Hockey Men's Hockey
None
2
2699
Shaukat Ali
M
30.0
NaN
NaN
India
IND
1928 Summer
1928
Summer
Amsterdam
Hockey
Hockey Men's Hockey
Gold
3
2702
Syed Ali
M
19.0
169.0
60.0
India
IND
1976 Summer
1976
Summer
Montreal
Hockey
Hockey Men's Hockey
None
4
2703
Syed Mushtaq Ali
M
22.0
165.0
61.0
India
IND
1964 Summer
1964
Summer
Tokyo
Hockey
Hockey Men's Hockey
Gold
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
340
126867
Sunil Sowmarpet Vitalacharya
M
27.0
176.0
68.0
India
IND
2016 Summer
2016
Summer
Rio de Janeiro
Hockey
Hockey Men's Hockey
None
341
128249
Devindar Sunil Walmiki
M
24.0
178.0
69.0
India
IND
2016 Summer
2016
Summer
Rio de Janeiro
Hockey
Hockey Men's Hockey
None
342
131974
William Xalco
M
20.0
167.0
60.0
India
IND
2004 Summer
2004
Summer
Athina
Hockey
Hockey Men's Hockey
None
343
132142
Renuka Yadav
F
22.0
159.0
53.0
India
IND
2016 Summer
2016
Summer
Rio de Janeiro
Hockey
Hockey Women's Hockey
None
344
133554
Sayed Muhammad Yusuf
M
NaN
NaN
NaN
India
IND
1928 Summer
1928
Summer
Amsterdam
Hockey
Hockey Men's Hockey
Gold
345 rows × 15 columns
SORTing
Sorting is another common operation required during data analysis. You can sort on one or more columns and in ascending (low to high) or descending (high to low) order.
sqldf('SELECT * from data ORDER BY Age DESC LIMIT 5')
ID
Name
Sex
Age
Height
Weight
Team
NOC
Games
Year
Season
City
Sport
Event
Medal
0
128719
John Quincy Adams Ward
M
97.0
None
None
United States
USA
1928 Summer
1928
Summer
Amsterdam
Art Competitions
Art Competitions Mixed Sculpturing, Statues
None
1
49663
Winslow Homer
M
96.0
None
None
United States
USA
1932 Summer
1932
Summer
Los Angeles
Art Competitions
Art Competitions Mixed Painting, Unknown Event
None
2
31173
Thomas Cowperthwait Eakins
M
88.0
None
None
United States
USA
1932 Summer
1932
Summer
Los Angeles
Art Competitions
Art Competitions Mixed Painting, Unknown Event
None
3
31173
Thomas Cowperthwait Eakins
M
88.0
None
None
United States
USA
1932 Summer
1932
Summer
Los Angeles
Art Competitions
Art Competitions Mixed Painting, Unknown Event
None
4
31173
Thomas Cowperthwait Eakins
M
88.0
None
None
United States
USA
1932 Summer
1932
Summer
Los Angeles
Art Competitions
Art Competitions Mixed Painting, Unknown Event
None
JOINing
Joining one dataset with another is a common operation while working with data. This is a simple aspect of SQL which stumps more people than it should.
I reckon, it is important to work with a simple example using few rows of data to get this right. That is what I am going to do using synthetic tables employee and assets with dummy data.
Please notice the column we are joining on, usually termed as key. If the key has duplicate values, the records multiply.
EmployeeID
DepartmentID
Name
0
100
100
John Doe
1
101
100
Will green
2
102
200
Wilson Miner
3
104
200
Rochel Dmello
4
105
300
Dickie Bird
AssetID
EmployeeID
DepartmentID
0
aab
100
100
1
aba
101
100
2
baa
201
200
3
cde
202
200
4
efg
103
300
INNER JOIN
The simplest type of join is the inner join. This operation returns the common records from the left table (employee) and the right table (asset).
sqldf('SELECT * FROM employee E JOIN asset A ON E.EmployeeID = A.EmployeeID')
EmployeeID
DepartmentID
Name
AssetID
EmployeeID
DepartmentID
0
100
100
John Doe
aab
100
100
1
101
100
Will green
aba
101
100
LEFT OUTER JOIN
In Left Outer Join, you get all records from the left table, and only matching records from the right table. Note the NaN and None values resulting in rows for which key doesn’t match.
In fact pandassql doesn’t support Right Join but you can achieve the same result by calling asset as the left table and employee as the right table.
sqldf('SELECT E.*, A.* FROM asset A LEFT JOIN employee E ON A.EmployeeID = E.EmployeeID')
EmployeeID
DepartmentID
Name
AssetID
EmployeeID
DepartmentID
0
100.0
100.0
John Doe
aab
100
100
1
101.0
100.0
Will green
aba
101
100
2
NaN
NaN
None
baa
201
200
3
NaN
NaN
None
cde
202
200
4
NaN
NaN
None
efg
103
300
Another type of join pandasql doesn’t support is the Full outer join, which returns all matching as well as not matching records from both the tables.
You need not worry about these missing type of joins in pandassql, all database that I have worked with support Right Join and Full Outer Join. With some trickery you can even achieve these in pandassql but it’s not required for this conversation.
Queries with aggregates
Aggregates are another SQL technique where you group your records based on one or more columns and perform some aggregate calculation on another set of columns such as MIN, MAX, SUM, COUNT, AVG, and so on.
Here is an example of count where we count Medals by team, sort them in descending order and show top % results.
sqldf('SELECT Team, COUNT(Medal) FROM data Where Sport = "Swimming" GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')
Team
COUNT(Medal)
0
United States
1066
1
Australia
401
2
Germany
152
3
East Germany
152
4
Great Britain
127
There is a lot more you can do with SQL, like filtering on aggregated data usin HAVING clause. Partition data by row, rank or dense rank. Use analytical functions to compute values across rows or over a window.
There is even more that you can do with pandas. Here I am listing some resources if you’re ready for a deep dive.