Pandas Reshape

In Pandas, reshaping data refers to the process of converting a DataFrame from one format to another for better data visualization and analysis.

Pandas provides multiple methods like pivot(), pivot_table(), stack(), unstack() and melt() to reshape data. We can choose the method based on our analysis requirement.


Reshape Data Using pivot()

In Pandas, the pivot() function reshapes data based on column values.

It takes simple column-wise data as input, and groups the entries into a two-dimensional table.

Let’s look at an example.

import pandas as pd

# create a DataFrame
data = {'Date': ['2023-08-01', '2023-08-01', '2023-08-02', '2023-08-02'],
        'Category': ['A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40]}
df = pd.DataFrame(data)

print("Original Dataframe:\n", df)

# pivot the  DataFrame
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print("Reshaped DataFrame:\n", pivot_df)

Output

Original Dataframe: 
         Date      Category  Value
0  2023-08-01        A     10
1  2023-08-01        B     20
2  2023-08-02        A     30
3  2023-08-02        B     40

Reshaped DataFrame: 
Category     A   B
Date              
2023-08-01  10  20
2023-08-02  30  40

In this example, we have passed the parameters index, columns and values to the pivot function. Here,

  • index specifies the column to be used as the index for the pivoted DataFrame
  • columns specifies the column whose unique values will become the new column headers
  • values specifies the column containing the values to be placed in the new columns

So as we can see in the output, the DataFrame has been pivoted, with the unique values from the Category column (A and B) becoming separate columns.

And the corresponding values from the Value column are then placed in the respective cells.


Reshape Data Using pivot_table()

The pivot_table() function in Pandas is a way for reshaping and summarizing data in a DataFrame.

It allows us to create a pivot table that aggregates and summarizes data based on the specified index, columns, and aggregation functions.

Let’s look at an example.

import pandas as pd

# create a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)
print("Original Dataframe:\n", df)

# create a pivot table
pivot_table_df = df.pivot_table(index='Category', values='Value', aggfunc='mean')
print("Reshaped Dataframe:\n", pivot_table_df)

Output

Original Dataframe: 
   Category  Value
0        A     10
1        B     20
2        A     30
3        B     40
4        A     50
5        B     60 

Reshaped Dataframe: 
Value
Category       
A          30.0
B          40.0

In the above example, we have used pivot_table() to create a pivot table from the original df DataFrame.

Inside the pivot_table(), we have passed parameters as

  1. index = Category - specifying the Category column as the index of the pivot table. This means that each unique value in Category will become a row index in the pivot table

  2. values = Value - specifying the Value column as the source of values that will be used for aggregation in the pivot table

  3. aggfunc = mean - we are using the mean aggregation function to calculate the average value of the Value column for each category.

Hence in the output, the pivot table has the categories A and B as row indices and displays the mean values of their corresponding Value column entries.

This allows us to quickly compare the average values for different categories using a more compact and organized format.


Reshape Data Using stack() and unstack()

In Pandas, we can also use the stack() and unstack() to reshape data.

  • stack() is used to pivot a level of the column labels, transforming them into innermost row index levels.
  • unstack() is used to pivot a level of the row index, transforming it into an outermost column level

Let’s look at an example.

import pandas as pd

# create a DataFrame
data = {'Date': ['2023-08-01', '2023-08-02'],
        'Category_A': [10, 20],
        'Category_B': [30, 40]}
df = pd.DataFrame(data)

# set 'Date' column as the index
df.set_index('Date', inplace=True)

# stack the columns into rows
stacked_df = df.stack()
print("Stack:\n", stacked_df)
print()

# unstack the rows back to columns
unstacked_df = stacked_df.unstack()
print("Unstack: \n", unstacked_df)

Output

Stack:
Date                  
2023-08-01  Category_A    10
            Category_B    30
2023-08-02  Category_A    20
            Category_B    40
dtype: int64

Unstack: 
Category_A  Category_B
Date                              
2023-08-01          10          30
2023-08-02          20          40

Here,

  • stack() is applied to the df DataFrame, which pivots the column labels (Category_A and Category_B) into a new level of row index.
  • unstack() is applied to the stacked_df, which reverses the operation and pivots the innermost level of row index back to columns.

Use of melt() to Reshape DataFrame

The melt() function in Pandas transforms a DataFrame from a wide format (multiple columns representing different variables) to a long format (a single column representing all variables).

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob'],
        'Math': [90, 85],
        'History': [75, 92]}
df = pd.DataFrame(data)

# melt the DataFrame
melted_df = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')

print(melted_df)

Output

     Name  Subject  Score
0  Alice     Math     90
1    Bob     Math     85
2  Alice  History     75
3    Bob  History     92

In this example, we have used the melt() function to transform the DataFrame df from a wide format to a long format.

Inside melt(), we have passed different parameters,

  • id_vars specifies the column that we want to keep unchanged
  • var_name specifies the name for the new column that will hold the variable names ( Math and History).
  • value_name specifies the name for the new column that will hold the values (the scores).
© 2024 All rights reserved. | Made With 🤍 By The_MAK Team