How to Deal With Missing Data in Polars
Blog post content copied from Real Python.
Efficiently handling missing data in Polars is essential for keeping your datasets clean during analysis. Polars provides powerful tools to identify, replace, and remove null values, ensuring seamless data processing.
This tutorial covers practical techniques for managing missing data and highlights Polars’ capabilities to enhance your data analysis workflow. By following along, you’ll gain hands-on experience with these techniques and learn how to ensure your datasets are accurate and reliable.
By the end of this tutorial, you’ll understand that:
- Polars allows you to handle missing data using LazyFrames and DataFrames.
- You can check for null values in Polars using the .null_count() method.
- NaN represents non-numeric values while null indicates missing data.
- You can replace NaN in Polars by converting them to nulls and using .fill_null().
- You can fix missing data by identifying, replacing, or removing null values.
Before you go any further, you’ll need some data. To begin with, you’ll use the tips.parquet file included in the downloadable materials that you can access by clicking the link below:
Get Your Code: Click here to download the free sample code that shows you how to deal with missing data in Polars.
The tips.parquet file is a doctored version of data publicly available from Kaggle. The dataset contains information about the tips collected at a fictitious restaurant over several days. Be sure to download it and place it in your project folder before getting started.
Note: Parquet is a file format for storing large volumes of data. It keeps files small on disk by compressing the data it stores.
In addition, Parquet uses a columnar format and maintains metadata about each column’s content. This means columns can be searched very efficiently, often in parallel, without the need to search through the entire file.
The table below shows details about the columns in the tips.parquet file, along with their Polars data types. The text in parentheses beside each data type shows how these types are annotated in a DataFrame heading when Polars displays its results:
Column Name | Polars Data Type | Description |
---|---|---|
record_id | Int64 (i64) | Unique row identifier |
total | Float64 (f64) | Bill total |
tip | Float64 (f64) | Tip given |
gender | String (str) | Diner’s gender |
smoker | Boolean (bool) | Diner’s smoker status |
day | String (str) | Day of meal |
time | String (str) | Time of meal |
As a starting point, you’ll investigate each of the columns in your data to find out whether or not they contain any null values. To use Polars, you first need to install the Polars library into your Python environment. To do this from a command prompt, you use:
$ python -m pip install polars
In a Jupyter Notebook, the command becomes:
!python -m pip install polars
Either way, you can then begin to use the Polars library and all of its cool features. Here’s what the data looks like:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> tips.collect()
shape: (180, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip  ┆ gender ┆ smoker ┆ day ┆ time   │
│ ---       ┆ ---   ┆ ---  ┆ ---    ┆ ---    ┆ --- ┆ ---    │
│ i64       ┆ f64   ┆ f64  ┆ str    ┆ bool   ┆ str ┆ str    │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 1         ┆ 28.97 ┆ 3.0  ┆ Male   ┆ true   ┆ Fri ┆ Dinner │
│ 2         ┆ 22.49 ┆ 3.5  ┆ Male   ┆ false  ┆ Fri ┆ Dinner │
│ 3         ┆ 5.75  ┆ 1.0  ┆ Female ┆ true   ┆ Fri ┆ null   │
│ 4         ┆ null  ┆ null ┆ Male   ┆ true   ┆ Fri ┆ Dinner │
│ 5         ┆ 22.75 ┆ 3.25 ┆ Female ┆ false  ┆ Fri ┆ Dinner │
│ …         ┆ …     ┆ …    ┆ …      ┆ …      ┆ …   ┆ …      │
│ 176       ┆ 40.55 ┆ 3.0  ┆ Male   ┆ true   ┆ Sun ┆ Dinner │
│ 177       ┆ 20.69 ┆ 5.0  ┆ Male   ┆ false  ┆ Sun ┆ Dinner │
│ 178       ┆ 20.9  ┆ 3.5  ┆ Female ┆ true   ┆ Sun ┆ Dinner │
│ 179       ┆ 30.46 ┆ 2.0  ┆ Male   ┆ true   ┆ Sun ┆ Dinner │
│ 180       ┆ 18.15 ┆ 3.5  ┆ Female ┆ true   ┆ Sun ┆ Dinner │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
First of all, you import the Polars library into your program. It’s considered good practice to import it using the alias pl. You then read the content of the tips.parquet file into Polars. To do this, you use the scan_parquet() function. This reads the file’s data into a Polars LazyFrame.
Unlike traditional DataFrames that store data, LazyFrames contain only a set of instructions, called a query plan, that defines how the data should be processed. To see the actual data, you still need to read it into a Polars DataFrame. This is called materializing the LazyFrame and is achieved using the .collect() method.
Before a LazyFrame materializes its data, its query plan is optimized. For example, Polars can choose to only read some data from the data source if those are enough to fulfill the query. Also, when you define a LazyFrame containing multiple instructions, there are no delays while you create it because you don’t need to wait for earlier data reads to complete before adding new instructions. This makes LazyFrames the preferred approach in Polars.
The result has a shape of 180 rows and 7 columns. This is shown in the first line of the output, shape: (180, 7), which describes the DataFrame produced by materializing the LazyFrame.
Next, you need to figure out if there’s any missing data you need to deal with:
>>> (
...     tips
...     .null_count()
... ).collect()
shape: (1, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ ---       ┆ ---   ┆ --- ┆ ---    ┆ ---    ┆ --- ┆ ---  │
│ u32       ┆ u32   ┆ u32 ┆ u32    ┆ u32    ┆ u32 ┆ u32  │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
│ 0         ┆ 2     ┆ 4   ┆ 0      ┆ 0      ┆ 0   ┆ 2    │
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
To check for the presence of nulls, you use .null_count() on your LazyFrame, which adds an instruction to count the nulls in each column of your data. Normally, this would require a read of the entire file. However, because a Parquet file stores a count of nulls for each column in its metadata, obtaining the counts is near-instantaneous.
To actually trigger the data read, you again use the LazyFrame’s .collect() method. This executes the optimized version of the plan contained within your LazyFrame to obtain the required data.
Read the full article at https://realpython.com/polars-missing-data/ »
January 22, 2025 at 07:30PM