Pandas Handling Missing Values
In Pandas, missing values, often represented as NaN
(Not a Number), can cause problems during data processing and analysis. These gaps in data can lead to incorrect analysis and misleading conclusions.
Pandas provides a host of functions like dropna()
, fillna()
and combine_first()
to handle missing values.
Let’s consider the following DataFrame to illustrate various techniques on handling missing data:
import pandas as pd
import numpy as np
# create dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
print(df)
Output
A B C D
0 1.0 NaN 1.0 1
1 2.0 2.0 2.0 2
2 NaN 3.0 3.0 3
3 4.0 4.0 NaN 4
4 5.0 5.0 5.0 5
Here, we have used the NumPy library to generate NaN
values in the DataFrame.
Remove Rows Containing Missing Values
One straightforward way to handle missing values is by removing them. Since the data sets we deal with are often large, eliminating a few rows typically has minimal impact on the final outcome.
We use the dropna()
function to remove rows containing at least one missing value. For example,
import pandas as pd
import numpy as np
# create a dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# remove rows with missing values
df.dropna(inplace=True)
print(df)
Output
A B C D
1 2.0 2.0 2.0 2
4 5.0 5.0 5.0 5
In this example, we removed all the rows containing NaN
values using dropna()
. The dropna()
method detects the rows with NaN
values and removes them.
Here, inplace=True
specifies that changes are to be made in the original DataFrame itself.
Replace Missing Values
Instead of deleting the entire row containing missing values, we can replace the missing values with a specified value using fillna()
.
Let’s look at an example.
import pandas as pd
import numpy as np
# create a dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# replace missing values with 0
df.fillna(value=0, inplace=True)
print(df)
Output
A B C D
0 1.0 0.0 1.0 1
1 2.0 2.0 2.0 2
2 0.0 3.0 3.0 3
3 4.0 4.0 0.0 4
4 5.0 5.0 5.0 5
In this example, we replaced the NaN
values with 0 using fillna()
.
Replace Missing Values With Mean, Median and Mode
A more refined approach is to replace missing values with the mean, median, or mode of the remaining values in the column. This can give a more accurate representation than just replacing it with a default value.
We can use the fillna()
function with aggregate functions to replace missing values with mean, median or mode.
Let’s look at an example.
import pandas as pd
import numpy as np
# create a dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# replace missing values with mean
df['A'].fillna(value=df['A'].mean(), inplace=True)
# replace missing values with median
df['B'].fillna(value=df['B'].median(), inplace=True)
# replace missing values with mode
df['C'].fillna(value=df['C'].mode()[0], inplace=True)
print(df)
Output
A B C D
0 1.0 3.5 1.0 1
1 2.0 2.0 2.0 2
2 3.0 3.0 3.0 3
3 4.0 4.0 1.0 4
4 5.0 5.0 5.0 5
In this example, we replaced the missing values of A
, B
and C
columns with their mean, median and mode respectively.
Here, mode()[0]
returns the most frequent value. Since all the values have the same frequency, it returns the first value of the column.
Replace Values Using Another DataFrame
We can replace missing values in one DataFrame using another DataFrame using the fillna()
method.
Let’s look at an example.
import pandas as pd
import numpy as np
# create a dataframe with missing values
data1 = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df1 = pd.DataFrame(data1)
# create datframe to fill the missing values with
data2 = {
'A': [10, 20, 30, 40, 50],
'B': [10, 20, 30, 40, 50],
'C': [10, 20, 30, 40, 50],
'D': [10, 20, 30, 40, 50]
}
df2 = pd.DataFrame(data2)
# replace missing values
df1.fillna(df2, inplace=True)
print(df1)
Output
A B C D
0 1.0 10.0 1.0 1
1 2.0 2.0 2.0 2
2 30.0 3.0 3.0 3
3 4.0 4.0 40.0 4
4 5.0 5.0 5.0 5
Here, we’ve two dataframes df1 and df2. The fillna()
replaces missing values in df1 with corresponding values from df2.