Pandas Merge
The merge operation in Pandas merges two DataFrames based on their indexes or a specified column.
The merge()
in Pandas works similar to JOINs in SQL.
Let’s see an example.
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID' : ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name' : ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D003'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003'],
'DeptName': ['Sales', 'HR', 'Admin']
}
departments = pd.DataFrame(data2)
# merge dataframes employees and departments
merged_df = pd.merge(employees, departments)
# display DataFrames
print("Employees:")
print(employees)
print()
print("Departments:")
print(departments)
print()
print("Merged DataFrame:")
print(merged_df)
Output
Employees:
EmployeeID Name DeptID
0 E001 John Doe D001
1 E002 Jane Smith D003
2 E003 Peter Brown D001
3 E004 Tom Johnson D002
4 E005 Rita Patel D003
Departments:
DeptID DeptName
0 D001 Sales
1 D002 HR
2 D003 Admin
Merged DataFrame:
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E002 Jane Smith D003 Admin
3 E005 Rita Patel D003 Admin
4 E004 Tom Johnson D002 HR
In this example, we merged the DataFrames employees and departments using the merge()
method.
Notice that the two DataFrames are merged based on the DeptID
column as it’s common to both the DataFrames.
merge() Syntax in Pandas
The syntax of the merge()
method in Pandas is:
pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None, sort=False)
Here,
left
: specifies the left DataFrame to be mergedright
: specifies the right DataFrame to be mergedon
(optional): specifies column(s) to join onhow
(optional): specifies the type of join to performleft_on
(optional): specifies column(s) from the left DataFrame to use as key(s) for mergingright_on
(optional): specifies column(s) from the right DataFrame to use as key(s) for mergingsort
(optional): ifTrue
, sort the result DataFrame by the join keys
Example: Merge DataFrames Based on Keys
When there are no common columns between two DataFrames, we can merge them by specifying the columns (as keys) in the left_on
and right_on
arguments. For example,
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID1': ['D001', 'D003', 'D001', 'D002', 'D006'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID2': ['D001', 'D002', 'D003', 'D004'],
'DeptName': ['Sales', 'HR', 'Admin', 'Marketing']
}
departments = pd.DataFrame(data2)
# merge the dataframes
df_merge = pd.merge(employees, departments, left_on='DeptID1', right_on = 'DeptID2', sort = True)
print(df_merge)
Output
EmployeeID Name DeptID1 DeptID2 DeptName
0 E001 John Doe D001 D001 Sales
1 E003 Peter Brown D001 D001 Sales
2 E004 Tom Johnson D002 D002 HR
3 E002 Jane Smith D003 D003 Admin
In the above example, we performed a merge operation on two DataFrames employees and departments using the merge()
method with various arguments.
Here, we used DeptID1
and DeptID2
as the key for merging the DataFrames. Then, we sorted the resulting DataFrame using sort = True
.
Types of Join Operations In merge()
So far, we’ve not defined how to merge the dataframes, thus it defaults to an inner join.
However, we can specify the join type in the how
argument. Here are the 5 join types we can use in the merge()
method:
- Left Join
- Right Join
- Outer Join
- Inner Join (Default)
- Cross Join
Left Join
A left join combines two DataFrames based on a common key and returns a new DataFrame that contains all rows from the left DataFrame and the matched rows from the right DataFrame.
If values are not found in the right dataframe, it fills the space with NaN
. For example,
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003', 'D004'],
'DeptName': ['Sales', 'HR', 'Admin', 'Marketing']
}
departments = pd.DataFrame(data2)
# left merge the dataframes
df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'left', sort = True)
print(df_merge)
Output
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E004 Tom Johnson D002 HR
3 E002 Jane Smith D003 Admin
4 E005 Rita Patel D006 NaN
Right Join
A right join is the opposite of a left join. It returns a new DataFrame that contains all rows from the right DataFrame and the matched rows from the left DataFrame.
If values are not found in the left dataframe, it fills the space with NaN
. For example,
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003', 'D004'],
'DeptName': ['Sales', 'HR', 'Admin', 'Marketing']
}
departments = pd.DataFrame(data2)
# right merge the dataframes
df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'right', sort = True)
print(df_merge)
Output
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E004 Tom Johnson D002 HR
3 E002 Jane Smith D003 Admin
4 NaN NaN D004 Marketing
Inner Join
An inner join combines two DataFrames based on a common key and returns a new DataFrame that contains only rows that have matching values in both of the original DataFrames.
For example,
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003', 'D004'],
'DeptName': ['Sales', 'HR', 'Admin', 'Marketing']
}
departments = pd.DataFrame(data2)
# inner merge the dataframes
df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'inner', sort = True)
print(df_merge)
Output
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E004 Tom Johnson D002 HR
3 E002 Jane Smith D003 Admin
Outer Join
An outer join combines two DataFrames based on a common key. Unlike an inner join, an outer join returns a new DataFrame that contains all rows from both original DataFrames.
If values are not found in the DataFrames, it fills the space with NaN
.
For example,
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003', 'D004'],
'DeptName': ['Sales', 'HR', 'Admin', 'Marketing']
}
departments = pd.DataFrame(data2)
# outer merge the dataframes
df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'outer', sort = True)
print(df_merge)
Output
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E004 Tom Johnson D002 HR
3 E002 Jane Smith D003 Admin
4 NaN NaN D004 Marketing
5 E005 Rita Patel D006 NaN
Cross Join
A cross join in Pandas creates the cartesian product of both DataFrames while preserving the order of the left DataFrame.
For example,
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003', 'D004'],
'DeptName': ['Sales', 'HR', 'Admin', 'Marketing']
}
departments = pd.DataFrame(data2)
# merge the dataframes
df_merge = pd.merge(employees, departments, how = 'cross')
print(df_merge)
Output
EmployeeID Name DeptID_x DeptID_y DeptName
0 E001 John Doe D001 D001 Sales
1 E001 John Doe D001 D002 HR
2 E001 John Doe D001 D003 Admin
3 E001 John Doe D001 D004 Marketing
4 E002 Jane Smith D003 D001 Sales
5 E002 Jane Smith D003 D002 HR
6 E002 Jane Smith D003 D003 Admin
7 E002 Jane Smith D003 D004 Marketing
8 E003 Peter Brown D001 D001 Sales
9 E003 Peter Brown D001 D002 HR
10 E003 Peter Brown D001 D003 Admin
11 E003 Peter Brown D001 D004 Marketing
12 E004 Tom Johnson D002 D001 Sales
13 E004 Tom Johnson D002 D002 HR
14 E004 Tom Johnson D002 D003 Admin
15 E004 Tom Johnson D002 D004 Marketing
16 E005 Rita Patel D006 D001 Sales
17 E005 Rita Patel D006 D002 HR
18 E005 Rita Patel D006 D003 Admin
19 E005 Rita Patel D006 D004 Marketing
Join vs Merge vs Concat
There are three different methods to combine DataFrames in Pandas:
join()
: joins two DataFrames based on their indexes, performs left join by defaultmerge()
: joins two DataFrames based on any specified columns, performs inner join by defaultconcat()
: stacks two DataFrames along the vertical or horizontal axis