Pandas Reshape
In Pandas, reshaping data refers to the process of converting a DataFrame from one format to another for better data visualization and analysis.
Pandas provides multiple methods like pivot()
, pivot_table()
, stack()
, unstack()
and melt()
to reshape data. We can choose the method based on our analysis requirement.
Reshape Data Using pivot()
In Pandas, the pivot()
function reshapes data based on column values.
It takes simple column-wise data as input, and groups the entries into a two-dimensional table.
Let’s look at an example.
import pandas as pd
# create a DataFrame
data = {'Date': ['2023-08-01', '2023-08-01', '2023-08-02', '2023-08-02'],
'Category': ['A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40]}
df = pd.DataFrame(data)
print("Original Dataframe:\n", df)
# pivot the DataFrame
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print("Reshaped DataFrame:\n", pivot_df)
Output
Original Dataframe:
Date Category Value
0 2023-08-01 A 10
1 2023-08-01 B 20
2 2023-08-02 A 30
3 2023-08-02 B 40
Reshaped DataFrame:
Category A B
Date
2023-08-01 10 20
2023-08-02 30 40
In this example, we have passed the parameters index, columns and values to the pivot function. Here,
index
specifies the column to be used as the index for the pivoted DataFramecolumns
specifies the column whose unique values will become the new column headersvalues
specifies the column containing the values to be placed in the new columns
So as we can see in the output, the DataFrame has been pivoted, with the unique values from the Category
column (A
and B
) becoming separate columns.
And the corresponding values from the Value
column are then placed in the respective cells.
Reshape Data Using pivot_table()
The pivot_table()
function in Pandas is a way for reshaping and summarizing data in a DataFrame.
It allows us to create a pivot table that aggregates and summarizes data based on the specified index, columns, and aggregation functions.
Let’s look at an example.
import pandas as pd
# create a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)
print("Original Dataframe:\n", df)
# create a pivot table
pivot_table_df = df.pivot_table(index='Category', values='Value', aggfunc='mean')
print("Reshaped Dataframe:\n", pivot_table_df)
Output
Original Dataframe:
Category Value
0 A 10
1 B 20
2 A 30
3 B 40
4 A 50
5 B 60
Reshaped Dataframe:
Value
Category
A 30.0
B 40.0
In the above example, we have used pivot_table()
to create a pivot table from the original df
DataFrame.
Inside the pivot_table()
, we have passed parameters as
-
index = Category
- specifying theCategory
column as the index of the pivot table. This means that each unique value inCategory
will become a row index in the pivot table -
values = Value
- specifying theValue
column as the source of values that will be used for aggregation in the pivot table -
aggfunc = mean
- we are using the mean aggregation function to calculate the average value of theValue
column for each category.
Hence in the output, the pivot table has the categories A
and B
as row indices and displays the mean values of their corresponding Value
column entries.
This allows us to quickly compare the average values for different categories using a more compact and organized format.
Reshape Data Using stack() and unstack()
In Pandas, we can also use the stack()
and unstack()
to reshape data.
stack()
is used to pivot a level of the column labels, transforming them into innermost row index levels.unstack()
is used to pivot a level of the row index, transforming it into an outermost column level
Let’s look at an example.
import pandas as pd
# create a DataFrame
data = {'Date': ['2023-08-01', '2023-08-02'],
'Category_A': [10, 20],
'Category_B': [30, 40]}
df = pd.DataFrame(data)
# set 'Date' column as the index
df.set_index('Date', inplace=True)
# stack the columns into rows
stacked_df = df.stack()
print("Stack:\n", stacked_df)
print()
# unstack the rows back to columns
unstacked_df = stacked_df.unstack()
print("Unstack: \n", unstacked_df)
Output
Stack:
Date
2023-08-01 Category_A 10
Category_B 30
2023-08-02 Category_A 20
Category_B 40
dtype: int64
Unstack:
Category_A Category_B
Date
2023-08-01 10 30
2023-08-02 20 40
Here,
stack()
is applied to thedf
DataFrame, which pivots the column labels (Category_A
andCategory_B
) into a new level of row index.unstack()
is applied to thestacked_df
, which reverses the operation and pivots the innermost level of row index back to columns.
Use of melt() to Reshape DataFrame
The melt()
function in Pandas transforms a DataFrame from a wide format (multiple columns representing different variables) to a long format (a single column representing all variables).
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob'],
'Math': [90, 85],
'History': [75, 92]}
df = pd.DataFrame(data)
# melt the DataFrame
melted_df = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
print(melted_df)
Output
Name Subject Score
0 Alice Math 90
1 Bob Math 85
2 Alice History 75
3 Bob History 92
In this example, we have used the melt()
function to transform the DataFrame df
from a wide format to a long format.
Inside melt()
, we have passed different parameters,
id_vars
specifies the column that we want to keep unchangedvar_name
specifies the name for the new column that will hold the variable names (Math
andHistory
).value_name
specifies the name for the new column that will hold the values (the scores).