Jul 9, 2023

Pandas Select

Pandas select refers to the process of extracting specific portions of data from a DataFrame.

Data selection involves choosing specific rows and columns based on labels, positions, or conditions.

Pandas provides various methods, such as basic indexing, slicing, boolean indexing, and querying, to efficiently extract, filter, and transform data, enabling users to focus on relevant information for analysis and decision-making.

Select Data Using Indexing and Slicing

In Pandas, we can use square brackets and their labels or positions to select the data we want.

Let’s look at an example.

import pandas as pd

# create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 27, 29],
    'Salary': [50000, 60000, 45000, 55000, 52000]
}

df = pd.DataFrame(data)

# selecting a single column
name_column = df['Name']

print("Selecting single column: Name")
print(name_column)
print()

# selecting multiple columns
age_salary_columns = df[['Age', 'Salary']]

print("Selecting multiple columns: Age and Salary")
print(age_salary_columns.to_string(index=False))
print()

# selecting rows using slicing
selected_rows = df[1:4]

print("Selecting rows 1 to 3")
print(selected_rows.to_string(index=False))
print()

Output

Selecting single column: Name
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

Selecting multiple columns: Age and Salary
Age  Salary
25   50000
30   60000
22   45000
27   55000
29   52000

Selecting rows 1 to 3
Name      Age  Salary
Bob        30    60000
Charlie    22    45000
David      27    55000

In the above example, we have created a DataFrame named df that is using dictionary data containing three columns: Name, Age, and Salary. Each column is represented by a list of values.

Then we,

selected a single column Name using df['Name']
selected multiple columns Age and Salary using df[['Age', 'Salary']]
selected rows from 1 to 3 using slicing df[1:4]

Note: The .to_string(index=False) is used to display values without the index.

Using loc and iloc to Select Data

The loc and iloc methods in Pandas are used to access data by using label or integer index.

loc selects rows and columns with specific labels
iloc selects rows and columns at specific index

Let’s take a look at an example.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 27, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco']
}

df = pd.DataFrame(data)
print(f"Original DataFrame \n {df} \n") 

# loc to select rows and columns by labels
# select rows 1 to 3 and columns Name and Age
selected_data_loc = df.loc[1:3, ['Name', 'Age']]

print(selected_data_loc.to_string(index = False))
print() 

# iloc to select rows and columns by index
# select rows 1 to 3 and columns 0 and 2 
selected_data_iloc = df.iloc[1:4, [0, 2]]

print(selected_data_iloc.to_string(index = False))

Output

Original DataFrame 
       Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   22        Chicago
3    David   27        Houston
4    Emily   29  San Francisco 

Name     Age
Bob       30
Charlie   22
David     27

Name        City
Bob      Los Angeles
Charlie  Chicago
David    Houston

Here,

Using df.loc[1:3, ['Name', 'Age']] - selects rows 1 to 3 and columns Name and Age from df
Using df.iloc[1:4, [0, 2]] - selects rows 1 to 3 and columns at index positions 0 and 2 from df

Select Rows Based on Specific Criteria

In Pandas, we can use boolean conditions to filter rows based on specific criteria. For example,

import pandas as pd

# creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 28, 24],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']
}

df = pd.DataFrame(data)

# select rows where Age is greater than 25
selected_rows = df[df['Age'] > 25]

print(selected_rows)

Output

     Name  Age Gender
1    Bob   30   Male
3  David   28   Male

In this example, we have selected the rows where the age is greater than 25.

The boolean indexing is done using the condition

df['Age'] > 25

This creates a boolean mask. And when this mask is applied to the DataFrame, it selects only the rows where the condition is True.

query() to Select Data

The query() method in Pandas allows you to select data using a more SQL-like syntax.

Let’s take a look at an example.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 22, 28, 35],
    'Score': [85, 90, 75, 80, 95]
}

df = pd.DataFrame(data)

# select the rows where the age is greater than 25
selected_rows = df.query('Age > 25')

print(selected_rows.to_string(index = False))

Output

Name  Age  Score
Bob    30     90
David  28     80
Eva    35     95

In this example, the query Age > 25 selects the rows where the Age column’s values are greater than 25.

Select Rows Based on a List of Values

Pandas provides us with the method named isin() to filter rows based on a list of values. For example,

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 28, 24]
}

df = pd.DataFrame(data)

# create a list of names to select
names_to_filter = ['Bob', 'David']

# use isin() to select rows based on the 'Name' column
selected_rows = df[df['Name'].isin(names_to_filter)]

print(selected_rows.to_string(index = False))

Output

Name  Age
Bob   30
David 28

In this example, we want to select only the rows where the name is either Bob or David.

We created a list names_to_filter with the names we want to filter by and then used the isin() method to filter the rows based on the values in the Name column.