How to Deal With Missing Data in Polars

The post content below is copied from Real Python.


Efficiently handling missing data in Polars is essential for keeping your datasets clean during analysis. Polars provides powerful tools to identify, replace, and remove null values, ensuring seamless data processing.

This tutorial covers practical techniques for managing missing data and highlights Polars’ capabilities to enhance your data analysis workflow. By following along, you’ll gain hands-on experience with these techniques and learn how to ensure your datasets are accurate and reliable.

By the end of this tutorial, you’ll understand that:

  • Polars allows you to handle missing data using LazyFrames and DataFrames.
  • You can check for null values in Polars using the .null_count() method.
  • NaN is a floating-point value meaning "not a number," while null indicates missing data.
  • You can replace NaN in Polars by converting them to nulls and using .fill_null().
  • You can fix missing data by identifying, replacing, or removing null values.

Before you go any further, you’ll need some data. To begin with, you’ll use the tips.parquet file included in the tutorial’s downloadable materials.

The tips.parquet file is a doctored version of data publicly available from Kaggle. The dataset contains information about the tips collected at a fictitious restaurant over several days. Be sure to download it and place it in your project folder before getting started.

The table below shows details about the columns in the tips.parquet file, along with their Polars data types. The text in parentheses beside each data type shows how these types are annotated in a DataFrame heading when Polars displays its results:

Column Name   Polars Data Type   Description
record_id     Int64 (i64)        Unique row identifier
total         Float64 (f64)      Bill total
tip           Float64 (f64)      Tip given
gender        String (str)       Diner’s gender
smoker        Boolean (bool)     Diner’s smoker status
day           String (str)       Day of meal
time          String (str)       Time of meal

As a starting point, you’ll investigate each of the columns in your data to find out whether they contain any null values. To use Polars, you first need to install the Polars library into your Python environment. To do this from a command prompt, run:

Windows PowerShell
PS> python -m pip install polars
Shell
$ python -m pip install polars

In a Jupyter Notebook, the command becomes:

Python
!python -m pip install polars

Either way, you can then begin to use the Polars library and all of its cool features. Here’s what the data looks like:

Python
>>> import polars as pl

>>> tips = pl.scan_parquet("tips.parquet")

>>> tips.collect()
shape: (180, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip  ┆ gender ┆ smoker ┆ day ┆ time   │
│ ---       ┆ ---   ┆ ---  ┆ ---    ┆ ---    ┆ --- ┆ ---    │
│ i64       ┆ f64   ┆ f64  ┆ str    ┆ bool   ┆ str ┆ str    │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 1         ┆ 28.97 ┆ 3.0  ┆ Male   ┆ true   ┆ Fri ┆ Dinner │
│ 2         ┆ 22.49 ┆ 3.5  ┆ Male   ┆ false  ┆ Fri ┆ Dinner │
│ 3         ┆ 5.75  ┆ 1.0  ┆ Female ┆ true   ┆ Fri ┆ null   │
│ 4         ┆ null  ┆ null ┆ Male   ┆ true   ┆ Fri ┆ Dinner │
│ 5         ┆ 22.75 ┆ 3.25 ┆ Female ┆ false  ┆ Fri ┆ Dinner │
│ …         ┆ …     ┆ …    ┆ …      ┆ …      ┆ …   ┆ …      │
│ 176       ┆ 40.55 ┆ 3.0  ┆ Male   ┆ true   ┆ Sun ┆ Dinner │
│ 177       ┆ 20.69 ┆ 5.0  ┆ Male   ┆ false  ┆ Sun ┆ Dinner │
│ 178       ┆ 20.9  ┆ 3.5  ┆ Female ┆ true   ┆ Sun ┆ Dinner │
│ 179       ┆ 30.46 ┆ 2.0  ┆ Male   ┆ true   ┆ Sun ┆ Dinner │
│ 180       ┆ 18.15 ┆ 3.5  ┆ Female ┆ true   ┆ Sun ┆ Dinner │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘

First of all, you import the Polars library into your program. It’s considered good practice to import it using the alias pl. You then read the content of the tips.parquet file into Polars. To do this, you use the scan_parquet() function. This reads the file’s data into a Polars LazyFrame.

Unlike traditional DataFrames that store data, LazyFrames contain only a set of instructions—called a query plan—that defines how the data should be processed. To see the actual data, you still need to read it into a Polars DataFrame. This is called materializing the LazyFrame and is achieved using the .collect() method.

Before a LazyFrame materializes its data, its query plan is optimized. For example, Polars can choose to read only part of the data from the source if that’s enough to fulfill the query. Also, when you define a LazyFrame containing multiple instructions, there are no delays while you create it because you don’t need to wait for earlier data reads to complete before adding new instructions. This makes LazyFrames the preferred approach in Polars.

The result has a shape of 180 rows and 7 columns. This is shown at the top of the output as the shape of the materialized DataFrame.

Next, you need to figure out if there’s any missing data you need to deal with:

Python
>>> (
...     tips
...     .null_count()
... ).collect()
shape: (1, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ ---       ┆ ---   ┆ --- ┆ ---    ┆ ---    ┆ --- ┆ ---  │
│ u32       ┆ u32   ┆ u32 ┆ u32    ┆ u32    ┆ u32 ┆ u32  │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
│ 0         ┆ 2     ┆ 4   ┆ 0      ┆ 0      ┆ 0   ┆ 2    │
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘

To check for the presence of nulls, you call .null_count() on your LazyFrame, which adds an instruction to count the nulls in each column of your data. Normally, this would require reading the entire file. However, because a Parquet file stores a null count for each column in its metadata, obtaining the counts is nearly instantaneous.

To actually trigger the data read, you again use the LazyFrame’s .collect() method. This runs the optimized version of the plan contained within your LazyFrame to obtain the required data.

Read the full article at https://realpython.com/polars-missing-data/ »

January 22, 2025 at 07:30PM