pandas GroupBy: Your Guide to Grouping Data in Python

Post content copied from Real Python.


The pandas .groupby() method allows you to efficiently analyze and transform datasets when working with data in Python. With df.groupby(), you can split a DataFrame into groups based on column values, apply functions to each group, and combine the results into a new DataFrame. This technique is essential for tasks like aggregation, filtering, and transformation on grouped data.

By the end of this tutorial, you’ll understand that:

  • Calling .groupby("column_name") splits a DataFrame into groups; the method you chain onto it applies a function to each group and combines the results.
  • To group by multiple columns, you can pass a list of column names to .groupby().
  • Common aggregation methods in pandas include .sum(), .mean(), and .count().
  • You can use custom functions with pandas .groupby() to perform specific operations on groups.
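
To make those four points concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names (state, party, age), the values, and the helper age_range() are illustrative assumptions, not data or code from the tutorial's datasets.

```python
import pandas as pd

# Hypothetical toy data; not taken from the tutorial's datasets.
toy = pd.DataFrame(
    {
        "state": ["VA", "VA", "GA", "GA", "GA"],
        "party": ["R", "D", "R", "R", "D"],
        "age": [40, 50, 60, 70, 80],
    }
)

# Group by one column, then aggregate each group.
mean_age = toy.groupby("state")["age"].mean()

# Group by several columns by passing a list of names.
counts = toy.groupby(["state", "party"])["age"].count()


# Apply a custom function to each group: here, the spread of ages.
def age_range(ages):
    return ages.max() - ages.min()


ranges = toy.groupby("state")["age"].agg(age_range)

print(mean_age)
print(counts)
print(ranges)
```

Each result is indexed by the grouping keys, so you can look up a single group with .loc, for example mean_age.loc["GA"].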

This tutorial assumes that you have some experience with pandas itself, including how to read CSV files into memory as pandas objects with read_csv(). If you need a refresher, then check out Reading CSVs With pandas and pandas: How to Read and Write Files.

Prerequisites

Before you proceed, make sure that you have the latest version of pandas available within a new virtual environment:

Windows PowerShell
PS> python -m venv venv
PS> venv\Scripts\activate
(venv) PS> python -m pip install pandas
Shell
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ python -m pip install pandas
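
If you want to confirm that the installation worked, one quick check is to import pandas inside the activated environment and print its version. This is only a sanity check; the version you see depends on when you install.

```python
import pandas as pd

# Prints the version string of the pandas package that was imported.
print(pd.__version__)
```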

In this tutorial, you’ll focus on three datasets:

  1. The U.S. Congress dataset contains public information on historical members of Congress and illustrates several fundamental capabilities of .groupby().
  2. The air quality dataset contains periodic gas sensor readings. This will allow you to work with floats and time series data.
  3. The news aggregator dataset holds metadata on several hundred thousand news articles. You’ll be working with strings and doing text munging with .groupby().

You can download the source code for all the examples in this tutorial from the original article on Real Python.

Once you’ve downloaded the .zip file, unzip the file to a folder called groupby-data/ in your current directory. Before you read on, ensure that your directory tree looks like this:

./
│
└── groupby-data/
    │
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv
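
One way to check the layout without inspecting it by hand is a short script like the sketch below. The file names come from the directory tree above; the helper name missing_files() is a hypothetical choice for illustration.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED = ["legislators-historical.csv", "airqual.csv", "news.csv"]


def missing_files(data_dir="groupby-data"):
    """Return the expected dataset files that are not found in data_dir."""
    folder = Path(data_dir)
    return [name for name in EXPECTED if not (folder / name).is_file()]


missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All three datasets are in place.")
```

Run it from the directory that contains groupby-data/, since the default path is relative.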

With pandas installed, your virtual environment activated, and the datasets downloaded, you’re ready to jump in!

Example 1: U.S. Congress Dataset

You’ll jump right into things by dissecting a dataset of historical members of Congress. You can read the CSV file into a pandas DataFrame with read_csv():

Python pandas_legislators.py
import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "groupby-data/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"]
)
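
If you'd like to see what the dtype, usecols, and parse_dates arguments do before loading the full file, the sketch below applies them to a small in-memory CSV. The two rows reuse names and birthdays from the legislators data, while the extra column and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Small in-memory CSV; the "extra" column is hypothetical.
csv_text = """last_name,first_name,birthday,gender,extra
Handel,Karen,1962-04-18,F,ignored
Marino,Tom,1952-08-15,M,ignored
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"gender": "category"},  # store repeated labels compactly
    usecols=["last_name", "birthday", "gender"],  # skip all other columns
    parse_dates=["birthday"],  # convert date strings to datetimes
)

print(sample.dtypes)
```

Columns left out of usecols, such as first_name and extra here, are never loaded into the DataFrame.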

The dataset contains members’ first and last names, birthday, gender, type ("rep" for House of Representatives or "sen" for Senate), U.S. state, and political party. You can use df.tail() to view the last few rows of the dataset:

Python
>>> from pandas_legislators import df
>>> df.tail()
      last_name first_name   birthday gender type state       party
11970   Garrett     Thomas 1972-03-27      M  rep    VA  Republican
11971    Handel      Karen 1962-04-18      F  rep    GA  Republican
11972     Jones     Brenda 1959-10-24      F  rep    MI    Democrat
11973    Marino        Tom 1952-08-15      M  rep    PA  Republican
11974     Jones     Walter 1943-02-10      M  rep    NC  Republican

The DataFrame uses categorical dtypes for space efficiency:

Python
>>> df.dtypes
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.
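
To get a feel for the savings, you can compare the memory footprint of the same values stored as plain Python strings and as a categorical. The sketch below uses a made-up column of three repeated labels, not the real dataset, so the exact numbers are only indicative.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
labels = pd.Series(["Republican", "Democrat", "Independent"] * 10_000)

as_object = labels.astype("object")
as_category = labels.astype("category")

# deep=True includes the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A categorical stores each distinct label once and keeps a small integer code per row, which is why it pays off when a column has few unique values relative to its length.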

Read the full article at https://realpython.com/pandas-groupby/ »


January 19, 2025 at 07:30PM