Jul 9, 2023

Pandas Data Cleaning

Data cleaning means fixing and organizing messy data. Pandas offers a wide range of tools and functions to help us clean and preprocess our data effectively.

Data cleaning often involves:

Dropping irrelevant columns.
Renaming column names to meaningful names.
Making data values consistent.
Replacing or filling in missing values.

Drop Rows With Missing Values

In Pandas, we can drop rows with missing values using the dropna() function. For example,

import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {
    'A': [1, 2, 3, None, 5],
    'B': [None, 2, 3, 4, 5],
    'C': [1, 2, None, None, 5]
}

df = pd.DataFrame(data)
print("Original Data:\n",df)
print()

# use dropna() to remove rows with any missing values
df_cleaned = df.dropna()

print("Cleaned Data:\n",df_cleaned)

Output

Original Data:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Cleaned Data:
    A    B    C
1  2.0  2.0  2.0
4  5.0  5.0  5.0

Here, we have used the dropna() method to remove rows with any missing values. The resulting DataFrame df_cleaned will only contain rows without any missing values

Fill Missing Values

To fill the missing values in Pandas, we use the fillna() function. For example,

import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {
    'A': [1, 2, 3, None, 5],
    'B': [None, 2, 3, 4, 5],
    'C': [1, 2, None, None, 5]
}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# filling NaN values with 0
df.fillna(0, inplace=True)

print("\nData after filling NaN with 0:\n", df)

Output

Original Data:
    A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0
Data after filling NaN with 0:
   A    B    C
0  1.0  0.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  0.0
3  0.0  4.0  0.0
4  5.0  5.0  5.0

Here, we used data.fillna() to fill the missing values(NaN) in each column with 0.

Note: The inplace=True argument here means that the operation will modify the DataFrame directly, rather than returning a new DataFrame with the modifications.

Use Aggregate Functions to Fill Missing Values

Instead of filling with 0, we can also use aggregate functions to fill missing values.

Let’s look at an example to fill missing values with the mean of each column.

import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {
    'A': [1, 2, 3, None, 5],
    'B': [None, 2, 3, 4, 5],
    'C': [1, 2, None, None, 5]
}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# filling NaN values with the mean of each column
df.fillna(df.mean(), inplace=True)

print("\nData after filling NaN with mean:\n", df)

Output

Original Data:
    A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0
Data after filling NaN with mean:
     A    B         C
0  1.00  3.5  1.000000
1  2.00  2.0  2.000000
2  3.00  3.0  2.666667
3  2.75  4.0  2.666667
4  5.00  5.0  5.000000

Here, the df.mean() calculates the mean for each column, and the fillna() method then replaces NaN values in each column with the respective mean.

Handle Duplicates Values

In Pandas, to handle duplicate rows, we can use the duplicated() and the drop_duplicates() function.

duplicated() - to check for duplicates
drop_duplicates() - remove duplicate rows

import pandas as pd

# sample data
data = {
    'A': [1, 2, 2, 3, 3, 4],
    'B': [5, 6, 6, 7, 8, 8]
}
df = pd.DataFrame(data)

print("Original DataFrame:\n", df.to_string(index=False))

# detect duplicates
print("\nDuplicate Rows:\n", df[df.duplicated()].to_string(index=False))

# remove duplicates based on column 'A'
df.drop_duplicates(subset=['A'], keep='first', inplace=True)

print("\nDataFrame after removing duplicates based on column 'A':\n", df.to_string(index=False))

Output

Original DataFrame:
 A  B
 1  5
 2  6
 2  6
 3  7
 3  8
 4  8
Duplicate Rows:
 A  B
 2  6
DataFrame after removing duplicates based on column 'A':
 A  B
 1  5
 2  6
 3  7
 4  8

Here,

df[df.duplicated()] produces a boolean Series to identify duplicate rows.
df.drop_duplicates(subset=['A'], keep='first', inplace=True), removes duplicates based on column A, retaining only the first occurrence of each duplicate directly in the original DataFrame.

Rename Column Names to Meaningful Names

To rename column names to more meaningful names in Pandas, we can use the rename() function. For example,

import pandas as pd

# sample data
data = {
    'A': [25, 30, 35],
    'B': ['John', 'Doe', 'Smith'],
    'C': [50000, 60000, 70000]
}

df = pd.DataFrame(data)

# rename columns
df.rename(columns={'A': 'Age', 'B': 'Name', 'C': 'Salary'}, inplace=True)

print(df.to_string(index=False))

Output

Age  Name  Salary
  25  John   50000
  30   Doe   60000
  35 Smith   70000

Here, the columns of df are renamed from A, B, and C to more meaningful names Age, Name, and Salary respectively.