Pandas Data Cleaning
Data cleaning means fixing and organizing messy data. Pandas offers a wide range of tools and functions to help us clean and preprocess our data effectively.
Data cleaning often involves:
- Dropping irrelevant columns.
- Renaming column names to meaningful names.
- Making data values consistent.
- Replacing or filling in missing values.
Drop Rows With Missing Values
In Pandas, we can drop rows with missing values using the dropna()
function. For example,
import pandas as pd
# define a dictionary with sample data which includes some missing values
data = {
'A': [1, 2, 3, None, 5],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, None, None, 5]
}
df = pd.DataFrame(data)
print("Original Data:\n",df)
print()
# use dropna() to remove rows with any missing values
df_cleaned = df.dropna()
print("Cleaned Data:\n",df_cleaned)
Output
Original Data:
A B C
0 1.0 NaN 1.0
1 2.0 2.0 2.0
2 3.0 3.0 NaN
3 NaN 4.0 NaN
4 5.0 5.0 5.0
Cleaned Data:
A B C
1 2.0 2.0 2.0
4 5.0 5.0 5.0
Here, we have used the dropna()
method to remove rows with any missing values. The resulting DataFrame df_cleaned
will only contain rows without any missing values
Fill Missing Values
To fill the missing values in Pandas, we use the fillna()
function. For example,
import pandas as pd
# define a dictionary with sample data which includes some missing values
data = {
'A': [1, 2, 3, None, 5],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, None, None, 5]
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# filling NaN values with 0
df.fillna(0, inplace=True)
print("\nData after filling NaN with 0:\n", df)
Output
Original Data:
A B C
0 1.0 NaN 1.0
1 2.0 2.0 2.0
2 3.0 3.0 NaN
3 NaN 4.0 NaN
4 5.0 5.0 5.0
Data after filling NaN with 0:
A B C
0 1.0 0.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 0.0
3 0.0 4.0 0.0
4 5.0 5.0 5.0
Here, we used data.fillna()
to fill the missing values(NaN
) in each column with 0.
Note: The
inplace=True
argument here means that the operation will modify the DataFrame directly, rather than returning a new DataFrame with the modifications.
Use Aggregate Functions to Fill Missing Values
Instead of filling with 0, we can also use aggregate functions to fill missing values.
Let’s look at an example to fill missing values with the mean of each column.
import pandas as pd
# define a dictionary with sample data which includes some missing values
data = {
'A': [1, 2, 3, None, 5],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, None, None, 5]
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# filling NaN values with the mean of each column
df.fillna(df.mean(), inplace=True)
print("\nData after filling NaN with mean:\n", df)
Output
Original Data:
A B C
0 1.0 NaN 1.0
1 2.0 2.0 2.0
2 3.0 3.0 NaN
3 NaN 4.0 NaN
4 5.0 5.0 5.0
Data after filling NaN with mean:
A B C
0 1.00 3.5 1.000000
1 2.00 2.0 2.000000
2 3.00 3.0 2.666667
3 2.75 4.0 2.666667
4 5.00 5.0 5.000000
Here, the df.mean()
calculates the mean for each column, and the fillna()
method then replaces NaN
values in each column with the respective mean.
Handle Duplicates Values
In Pandas, to handle duplicate rows, we can use the duplicated()
and the drop_duplicates()
function.
duplicated()
- to check for duplicatesdrop_duplicates()
- remove duplicate rows
import pandas as pd
# sample data
data = {
'A': [1, 2, 2, 3, 3, 4],
'B': [5, 6, 6, 7, 8, 8]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df.to_string(index=False))
# detect duplicates
print("\nDuplicate Rows:\n", df[df.duplicated()].to_string(index=False))
# remove duplicates based on column 'A'
df.drop_duplicates(subset=['A'], keep='first', inplace=True)
print("\nDataFrame after removing duplicates based on column 'A':\n", df.to_string(index=False))
Output
Original DataFrame:
A B
1 5
2 6
2 6
3 7
3 8
4 8
Duplicate Rows:
A B
2 6
DataFrame after removing duplicates based on column 'A':
A B
1 5
2 6
3 7
4 8
Here,
df[df.duplicated()]
produces a boolean Series to identify duplicate rows.df.drop_duplicates(subset=['A'], keep='first', inplace=True)
, removes duplicates based on columnA
, retaining only the first occurrence of each duplicate directly in the original DataFrame.
Rename Column Names to Meaningful Names
To rename column names to more meaningful names in Pandas, we can use the rename()
function. For example,
import pandas as pd
# sample data
data = {
'A': [25, 30, 35],
'B': ['John', 'Doe', 'Smith'],
'C': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# rename columns
df.rename(columns={'A': 'Age', 'B': 'Name', 'C': 'Salary'}, inplace=True)
print(df.to_string(index=False))
Output
Age Name Salary
25 John 50000
30 Doe 60000
35 Smith 70000
Here, the columns of df are renamed from A
, B
, and C
to more meaningful names Age
, Name
, and Salary
respectively.