Jul 9, 2023

Pandas Categorical

Categorical data is a type of data that represents categories or labels rather than numerical values.

In simple terms, it is a way of classifying data into distinct categories, such as gender, country name, or education level.

Categorical data is handy when we have data that naturally fit into predefined options.

Create Categorical Data Type in Pandas

In Pandas, the pd.Categorical() constructor creates categorical data from a given sequence of values.

import pandas as pd

data = ['red', 'blue', 'green', 'red', 'blue']

# create a Categorical object
categorical_data = pd.Categorical(data)

print(categorical_data)

Output

['red', 'blue', 'green', 'red', 'blue']
Categories (3, object): ['blue', 'green', 'red']

In the above example, pd.Categorical() converts the data list into a Categorical object, which is array-like rather than a Series.

The output includes the original data values and the unique categories found in the data, listed in sorted order.
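As a small illustrative sketch, the categories parameter of pd.Categorical() can fix the set of categories up front instead of inferring it from the data. The sizes list and its values are made up for this example; the value 'xl' is deliberately left out of the categories, so it is treated as missing.

```python
import pandas as pd

# hypothetical data: 'xl' is deliberately not listed as a category
sizes = ['small', 'large', 'medium', 'xl', 'small']

# fix the allowed categories explicitly instead of inferring them
sized = pd.Categorical(sizes, categories=['small', 'medium', 'large'])

print(sized)

# values outside the listed categories are stored as missing (NaN)
print(sized.isna())
```

The categories keep the order in which they were passed, rather than being sorted.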

Convert Pandas Series to Categorical Series

In Pandas, we can convert a regular Pandas Series to a Categorical Series using either the astype() function or the dtype parameter within the pd.Series() constructor.

Using the astype() Function

import pandas as pd

# create a regular Series
data = ['red', 'blue', 'green', 'red', 'blue']
series1 = pd.Series(data)

# convert the Series to a categorical Series using .astype()
categorical_s = series1.astype('category')

print(categorical_s)

Output

0      red
1     blue
2    green
3      red
4     blue
dtype: category
Categories (3, object): ['blue', 'green', 'red']

Here, series1.astype('category') returns a copy of series1 converted to the category dtype. The original series1 is left unchanged.
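The same astype() call is commonly applied to a single DataFrame column. The sketch below assumes a hypothetical DataFrame with a color column and a price column; only the text column is converted.

```python
import pandas as pd

# hypothetical DataFrame with one text column and one numeric column
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'price': [10, 12, 9, 11, 13],
})

# convert only the 'color' column; 'price' is left untouched
df['color'] = df['color'].astype('category')

print(df.dtypes)
```

Assigning the result back to df['color'] is what makes the change stick, since astype() does not modify the column in place.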

Using the dtype Parameter Inside Series()

import pandas as pd

# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

print(cat_series)

Here, we have passed dtype="category" to Series() to create a categorical Series directly from the list.

Output

0    A
1    B
2    A
3    C
4    B
dtype: category
Categories (3, object): ['A', 'B', 'C']
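The dtype parameter also accepts a pd.CategoricalDtype object, which is useful when the set of categories should be stated explicitly. In this sketch the name grade_dtype and the extra category 'D' are assumptions made for illustration.

```python
import pandas as pd

# hypothetical dtype: 'D' is a valid category even though the data never uses it
grade_dtype = pd.CategoricalDtype(categories=['A', 'B', 'C', 'D'])

data = ['A', 'B', 'A', 'C', 'B']
grades = pd.Series(data, dtype=grade_dtype)

print(grades)
print(grades.cat.categories)
```

Defining the dtype once makes it easy to reuse the same categories across several Series.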

Access Categories and Codes in Pandas

In Pandas, the cat accessor allows us to access the categories and codes of a categorical Series. It provides the following attributes:

categories - returns the unique categories present in the categorical variable
codes - returns the integer codes representing the categories for each element in the categorical variable

Let’s look at an example.

import pandas as pd

# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

# using .cat accessor
print(cat_series.cat.categories)
print(cat_series.cat.codes)

Output

Index(['A', 'B', 'C'], dtype='object')
0    0
1    1
2    0
3    2
4    1
dtype: int8

In the above example, first we have used cat_series.cat.categories to access the unique categories present in cat_series.

In this case, the output will be Index(['A', 'B', 'C'], dtype='object'), which are the distinct categories in the data.

Then, we have used cat_series.cat.codes to access the integer codes corresponding to the categories in cat_series.

Each code is the position of the element's category in cat_series.cat.categories:

The element at index 0 of cat_series is A, which has code 0.
The element at index 1 of cat_series is B, which has code 1.
The element at index 2 of cat_series is A, which again has code 0.
The element at index 3 of cat_series is C, which has code 2.
The element at index 4 of cat_series is B, which again has code 1.
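The mapping between codes and categories can be checked by decoding the codes by hand. This is a minimal sketch; the None entry is added here to show how a missing value is encoded, and the decoded list is a hypothetical helper variable.

```python
import pandas as pd

data = ['A', 'B', None, 'C', 'B']  # None stands in for a missing value
cat_series = pd.Series(data, dtype="category")

codes = cat_series.cat.codes
categories = cat_series.cat.categories

# a missing value gets the code -1 because it belongs to no category
print(codes.tolist())

# look up each non-missing code in the categories Index
decoded = [categories[code] if code != -1 else None for code in codes]
print(decoded)
```

Decoding the codes this way reproduces the original values, which shows that the Series stores small integers plus one shared list of categories.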

Rename Categories in Pandas

We can rename the categories in Pandas using the cat.rename_categories() method. For example,

import pandas as pd

# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

# create a dictionary for renaming categories
category_mapping = {"A": "Category A", "B": "Category B", "C": "Category C"}

# rename categories using .rename_categories(), which returns a new Series
cat_series_renamed = cat_series.cat.rename_categories(category_mapping)

print(cat_series_renamed)

Output

0    Category A
1    Category B
2    Category A
3    Category C
4    Category B
dtype: category
Categories (3, object): ['Category A', 'Category B', 'Category C']

In this example, the categories A, B, and C are renamed to Category A, Category B, and Category C respectively.
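Besides a dictionary, rename_categories() also accepts a callable or a list. The sketch below shows both; the new labels are made up for illustration.

```python
import pandas as pd

data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

# a callable is applied to every existing category name
lowercased = cat_series.cat.rename_categories(lambda name: name.lower())

# a list replaces the categories by position and must match their count
relabelled = cat_series.cat.rename_categories(['first', 'second', 'third'])

print(lowercased.cat.categories)
print(relabelled.tolist())
```

A dictionary is the safer choice when only some categories need new names, because the list form depends on the current category order.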

Add New Categories in Pandas

In Pandas, we can add new categories to the existing set of categories using the cat.add_categories() method.

Let’s look at an example.

import pandas as pd

# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

# add new categories and reassign the variable
new_categories = ['D', 'E']
cat_series = cat_series.cat.add_categories(new_categories)

print(cat_series)

Output

0    A
1    B
2    A
3    C
4    B
dtype: category
Categories (5, object): ['A', 'B', 'C', 'D', 'E']

Here, we added the new categories D and E to the categorical Series and assigned the result back to cat_series. The existing values are unchanged; D and E are now valid categories even though no element uses them yet.
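To see what adding categories changes in practice, the following sketch counts the values per category and then assigns one of the new categories to an element. The choice of index 0 and category D is arbitrary.

```python
import pandas as pd

data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")
cat_series = cat_series.cat.add_categories(['D', 'E'])

# unused categories still appear in the counts, with a count of 0
counts = cat_series.value_counts()
print(counts)

# 'D' is now a valid category, so it can be assigned to an element
cat_series.iloc[0] = 'D'
print(cat_series.tolist())
```

Adding the category first matters because a categorical Series only accepts values that belong to its categories.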

Remove Categories in Pandas

To remove categories from a categorical variable in Pandas, we can use the cat.remove_categories() method.

Let’s look at an example.

import pandas as pd

# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

# display the original categorical variable
print("Original Series:")
print(cat_series)

# remove specific categories
categories_to_remove = ["B", "C"]
cat_series_removed = cat_series.cat.remove_categories(categories_to_remove)

# display the modified categorical variable
print("\nModified Series:")
print(cat_series_removed)

Output

Original Series:
0    A
1    B
2    A
3    C
4    B
dtype: category
Categories (3, object): ['A', 'B', 'C']

Modified Series:
0      A
1    NaN
2      A
3    NaN
4    NaN
dtype: category
Categories (1, object): ['A']

In this example, we have used cat.remove_categories() to remove the categories B and C from cat_series. Values that belonged to the removed categories become NaN in the result.
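When the goal is only to drop categories that no value uses any more, cat.remove_unused_categories() avoids introducing NaN. The sketch below filters out the C rows first; the names filtered and trimmed are hypothetical.

```python
import pandas as pd

data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")

# filtering rows keeps all three categories, even though 'C' no longer occurs
filtered = cat_series[cat_series != 'C']
print(filtered.cat.categories)

# drop only the categories that no remaining value uses
trimmed = filtered.cat.remove_unused_categories()
print(trimmed.cat.categories)
```

Unlike remove_categories(), this method never changes a value, since it only removes categories that are already unused.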

Check if Categorical Variable is Ordered or Not

In Pandas, we can check whether a categorical variable is ordered using the ordered attribute. It is available directly on a Categorical object and through the cat accessor on a categorical Series. For example,

import pandas as pd

# create an ordered Categorical object
data = ['low', 'medium', 'high', 'low', 'medium']
ordered_cat_series = pd.Categorical(data, categories=['low', 'medium', 'high'], ordered=True)

# check if the categorical variable is ordered
is_ordered = ordered_cat_series.ordered

print("Is ordered:", is_ordered)

Output

Is ordered: True

In this example, ordered_cat_series.ordered will be True because the categorical variable ordered_cat_series was created with the ordered=True parameter.
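For a categorical Series, the same check goes through the cat accessor. This sketch starts from an unordered Series named levels (a name chosen for illustration) and uses cat.set_categories() to give it an explicit order.

```python
import pandas as pd

data = ['low', 'medium', 'high', 'low', 'medium']
levels = pd.Series(data, dtype="category")

# categories inferred from the data are unordered by default
print(levels.cat.ordered)

# set_categories fixes the category order and marks it as ordered
ordered_levels = levels.cat.set_categories(['low', 'medium', 'high'], ordered=True)
print(ordered_levels.cat.ordered)
```

The original levels Series stays unordered, because set_categories() returns a new Series.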

Note: An ordered categorical keeps its categories in a logical sequence (here, low < medium < high). Pandas uses this order for sorting, for min() and max(), and for comparisons such as greater-than, which helps keep analysis and visualizations consistent with the intended order rather than alphabetical order.
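The effect of the order can be seen in a short sketch that sorts, aggregates, and compares an ordered categorical Series. The Series name levels is an assumption for this example.

```python
import pandas as pd

data = ['low', 'medium', 'high', 'low', 'medium']
levels = pd.Series(
    pd.Categorical(data, categories=['low', 'medium', 'high'], ordered=True)
)

# sorting follows the category order, not alphabetical order
print(levels.sort_values().tolist())

# min() and max() are defined for ordered categoricals
print(levels.min(), levels.max())

# comparisons against a category use the same order
print((levels > 'low').tolist())
```

Alphabetical sorting would have placed high before low, so the explicit order is what keeps the results meaningful.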