Polars Filtering Tutorial
last modified March 1, 2025
Polars is a fast and efficient DataFrame library for Python. It is designed for high-performance data manipulation and analysis. Filtering is a common operation in data analysis, allowing you to extract specific rows based on conditions.
This tutorial covers how to filter data in Polars DataFrames. We will explore basic and advanced filtering techniques with practical examples.
Basic Filtering
Filtering in Polars is done using the filter
method. This method
takes a condition and returns a new DataFrame with rows that satisfy the
condition.
import polars as pl data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'] } df = pl.DataFrame(data) filtered_df = df.filter(pl.col('Age') > 30) print(filtered_df)
The filter(pl.col('Age') > 30)
filters rows where the 'Age' column
is greater than 30. This is useful for extracting specific subsets of data.
Multiple Conditions
You can combine multiple conditions using logical operators like &
(AND) and |
(OR). This allows for more complex filtering.
import polars as pl data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'] } df = pl.DataFrame(data) filtered_df = df.filter((pl.col('Age') > 30) & (pl.col('City') == 'Chicago')) print(filtered_df)
The (pl.col('Age') > 30) & (pl.col('City') == 'Chicago')
filters rows where the 'Age' is greater than 30 and the 'City' is 'Chicago'.
This is useful for precise data extraction.
Filtering with String Operations
Polars supports string operations for filtering. You can use methods like
str.contains
to filter rows based on string patterns.
import polars as pl data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'] } df = pl.DataFrame(data) filtered_df = df.filter(pl.col('City').str.contains('New')) print(filtered_df)
The str.contains('New')
filters rows where the 'City' column
contains the substring 'New'. This is useful for pattern-based filtering.
Filtering with Null Values
Polars provides methods to handle null values during filtering. You can use
is_null
and is_not_null
to filter rows with null
or non-null values.
import polars as pl data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, None, 35, 40], 'City': ['New York', 'Los Angeles', None, 'Houston'] } df = pl.DataFrame(data) filtered_df = df.filter(pl.col('Age').is_not_null()) print(filtered_df)
The is_not_null
filters rows where the 'Age' column is not null.
This is useful for cleaning data with missing values.
Filtering with Custom Functions
You can apply custom functions to filter rows. This allows for flexible and complex filtering logic.
import polars as pl data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'] } df = pl.DataFrame(data) def custom_filter(age): return age > 30 and age < 40 filtered_df = df.filter(pl.col('Age').apply(custom_filter)) print(filtered_df)
The apply(custom_filter)
applies a custom function to filter rows
where the 'Age' is between 30 and 40. This is useful for advanced filtering.
Best Practices for Filtering
- Understand Data: Analyze data structure before applying filters.
- Use Efficient Conditions: Optimize conditions for performance.
- Handle Null Values: Use
is_null
andis_not_null
for null handling. - Validate Results: Check filtered data for accuracy and consistency.
Source
In this article, we have explored how to filter data in Polars DataFrames.
Author
List all Polars tutorials.