Jul 9, 2023

Pandas DataFrame Analysis

Pandas DataFrame objects come with a variety of built-in methods, such as head(), tail() and info(), that allow us to view and analyze DataFrames.

View Data in a Pandas DataFrame

A Pandas DataFrame can be displayed like any other Python variable using the print() function.

However, when a DataFrame has a large number of rows or columns, print() does not display all of it by default. Instead, it prints a truncated view that shows only the first and last rows and columns.
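How much gets printed is governed by pandas display options. As a minimal sketch, assuming a hypothetical DataFrame named big_df with 100 rows (not part of the examples in this article), the following compares the default truncated output with the full output obtained by temporarily lifting the row limit through pd.option_context():

```python
import pandas as pd

# hypothetical DataFrame with 100 rows, used only for illustration
big_df = pd.DataFrame({'Value': range(100)})

# with default settings, the text form keeps only the first and last few rows
truncated = str(big_df)
print(truncated)

# temporarily remove the row limit; the previous setting is restored
# automatically when the with block ends
with pd.option_context('display.max_rows', None):
    full = str(big_df)

# the full text has more lines than the truncated one
print(len(truncated.splitlines()), len(full.splitlines()))
```

Using option_context() rather than set_option() keeps the change local, so later print() calls still use the default limits.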

In the case of large DataFrames, we can use the head(), tail() and info() methods to get an overview of the DataFrame.

Pandas head()

The head() method provides a quick preview of a DataFrame. It returns the column headers and a specified number of rows from the beginning. For example,

import pandas as pd

# create a dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emma', 'Mike', 'Sarah', 'David', 'Linda', 'Tom', 'Emily'],
        'Age': [25, 30, 35, 28, 32, 27, 40, 33, 29, 31],
        'City': ['New York', 'Paris', 'London', 'Sydney', 'Tokyo', 'Berlin', 'Rome', 'Madrid', 'Toronto', 'Moscow']}
df = pd.DataFrame(data)

# display the first three rows
print('First Three Rows:')
print(df.head(3))
print()

# display the first five rows
print('First Five Rows:')
print(df.head())

Output

First Three Rows:
    Name  Age      City
0   John   25  New York
1  Alice   30     Paris
2    Bob   35    London

First Five Rows:
    Name  Age      City
0   John   25  New York
1  Alice   30     Paris
2    Bob   35    London
3   Emma   28    Sydney
4   Mike   32     Tokyo

In this example, we displayed selected rows of the df DataFrame starting from the top using head().

Notice that the first five rows are selected by default when no argument is passed to the head() method.
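The argument to head() can also be negative or larger than the number of rows. The sketch below uses a shortened, five-row version of the example data (an assumption made to keep the output small) to show both cases:

```python
import pandas as pd

# shortened version of the example data, for illustration only
df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', 'Mike'],
                   'Age': [25, 30, 35, 28, 32]})

# a negative argument returns all rows except the last two
all_but_last_two = df.head(-2)
print(all_but_last_two)

# asking for more rows than exist returns the whole DataFrame
whole = df.head(100)
print(len(whole))
```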

Pandas tail()

The tail() method is similar to head(), but it returns rows from the end of the DataFrame. For example,

import pandas as pd

# create a dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emma', 'Mike', 'Sarah', 'David', 'Linda', 'Tom', 'Emily'],
        'Age': [25, 30, 35, 28, 32, 27, 40, 33, 29, 31],
        'City': ['New York', 'Paris', 'London', 'Sydney', 'Tokyo', 'Berlin', 'Rome', 'Madrid', 'Toronto', 'Moscow']}

df = pd.DataFrame(data)

# display the last three rows
print('Last Three Rows:')
print(df.tail(3))
print()

# display the last five rows
print('Last Five Rows:')
print(df.tail())

Output

Last Three Rows:
    Name  Age     City
7  Linda   33   Madrid
8    Tom   29  Toronto
9  Emily   31   Moscow

Last Five Rows:
    Name  Age     City
5  Sarah   27   Berlin
6  David   40     Rome
7  Linda   33   Madrid
8    Tom   29  Toronto
9  Emily   31   Moscow

In this example, we displayed selected rows of the df DataFrame starting from the bottom using tail().

Notice that the last five rows are selected by default when no argument is passed to the tail() method.
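One way to think about tail() is as positional slicing from the end. The sketch below, again using a shortened five-row version of the example data as an assumption, checks that equivalence and shows that the original index labels are kept:

```python
import pandas as pd

# shortened version of the example data, for illustration only
df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', 'Mike'],
                   'Age': [25, 30, 35, 28, 32]})

# tail(2) selects the same rows as slicing the last two positions
last_two = df.tail(2)
print(last_two.equals(df.iloc[-2:]))

# the rows keep their original index labels
print(list(last_two.index))

# a negative argument returns all rows except the first two
all_but_first_two = df.tail(-2)
print(all_but_first_two)
```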

Get DataFrame Information

The info() method prints a concise summary of the DataFrame, including its class, index range, column data types and memory usage. For example,

import pandas as pd

# create dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emma', 'Mike', 'Sarah', 'David', 'Linda', 'Tom', 'Emily'],
        'Age': [25, 30, 35, 28, 32, 27, 40, 33, 29, 31],
        'City': ['New York', 'Paris', 'London', 'Sydney', 'Tokyo', 'Berlin', 'Rome', 'Madrid', 'Toronto', 'Moscow']}
df = pd.DataFrame(data)

# get info about dataframe
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    10 non-null     object
 1   Age     10 non-null     int64 
 2   City    10 non-null     object
dtypes: int64(1), object(2)
memory usage: 372.0+ bytes

As you can see, the info() method provides the following information about a Pandas DataFrame:

Class: The class of the object, which indicates that it is a pandas DataFrame
RangeIndex: The index range of the DataFrame, showing the starting and ending index values
Data columns: The total number of columns in the DataFrame
Column names: The names of the columns in the DataFrame
Non-Null Count: The count of non-null values for each column
Dtype: The data types of the columns
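Memory usage: The approximate memory usage of the DataFrame in bytes (the + sign means the real figure may be higher, because the contents of object columns are not fully measured)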

This information helps us understand the dataset's structure, dimensions and missing values. Such insight is essential for data exploration, cleaning, manipulation and analysis.
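Because info() prints its summary rather than returning it, the individual pieces are often easier to work with through separate attributes and methods. The following is a minimal sketch on a three-row version of the example data (shortened here as an assumption, to keep the output brief):

```python
import pandas as pd

# shortened version of the example data, for illustration only
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)

# dimensions as a (rows, columns) tuple
print(df.shape)

# data type of each column
print(df.dtypes)

# number of missing values in each column
print(df.isna().sum())

# per-column memory in bytes; deep=True also measures string contents
print(df.memory_usage(deep=True))
```

Each of these returns a regular Python or pandas object, so the results can be stored, compared or used in further calculations.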