Grouping and Aggregating#
In this chapter, we’ll learn how to summarise data by groups — one of the most powerful techniques in data analysis. When you have questions like “What’s the average salary by department?” or “How many sales did each region make?”, grouping and aggregating is the answer.
The Split-Apply-Combine Pattern#
The fundamental concept behind grouping operations is called split-apply-combine:
Split: Divide the data into groups based on some criteria
Apply: Apply a function to each group independently (like calculating the mean)
Combine: Combine the results back into a single data structure
This pattern is incredibly powerful and forms the basis of most data summarisation tasks.
Fig. 1 The split-apply-combine pattern: data is split into groups, a function is applied to each group, and the results are combined.#
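Before handing this over to pandas, it can help to see the three steps done by hand. The following sketch uses a small made-up list of (department, salary) pairs and plain Python, purely to illustrate the pattern:
# Split: gather the salaries into one list per department
records = [('Sales', 45000), ('IT', 55000), ('Sales', 48000), ('IT', 62000)]
groups = {}
for dept, salary in records:
    groups.setdefault(dept, []).append(salary)
# Apply: compute the mean of each group's salaries
# Combine: collect the results into a single dictionary
means = {dept: sum(salaries) / len(salaries) for dept, salaries in groups.items()}
means  # {'Sales': 46500.0, 'IT': 58500.0}
Pandas carries out these same steps for you with groupby(), as we'll see next.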
Let’s start with a practical example. First, we’ll create a sample employee dataset:
import pandas as pd
# Create employee data
employees = pd.DataFrame({
'employee_id': range(1, 13),
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank',
'Grace', 'Henry', 'Ivy', 'Jack', 'Kate', 'Leo'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales',
'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales'],
'salary': [45000, 55000, 48000, 42000, 62000, 51000,
44000, 58000, 47000, 46000, 60000, 52000],
'years_experience': [2, 5, 3, 1, 8, 4, 2, 6, 3, 4, 7, 5],
'performance_score': ['Good', 'Excellent', 'Average', 'Good', 'Excellent', 'Good',
'Average', 'Good', 'Good', 'Excellent', 'Good', 'Average']
})
employees
employee_id name department salary years_experience performance_score
0 1 Alice Sales 45000 2 Good
1 2 Bob IT 55000 5 Excellent
2 3 Charlie Sales 48000 3 Average
3 4 Diana HR 42000 1 Good
4 5 Eve IT 62000 8 Excellent
5 6 Frank Sales 51000 4 Good
6 7 Grace HR 44000 2 Average
7 8 Henry IT 58000 6 Good
8 9 Ivy Sales 47000 3 Good
9 10 Jack HR 46000 4 Excellent
10 11 Kate IT 60000 7 Good
11 12 Leo Sales 52000 5 Average
Basic Grouping with groupby()#
The groupby() method is your gateway to grouped operations. Let’s group our employees by department:
grouped = employees.groupby('department')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
Notice that groupby() doesn’t immediately compute anything — it returns a GroupBy object. Think of it as a “waiting room” where the data is organised into groups, ready for you to tell it what calculation to perform.
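If you want to peek inside the waiting room, the GroupBy object offers a few inspection methods. A quick sketch using the grouped object from above:
# How many rows ended up in each group?
grouped.size()   # HR 3, IT 4, Sales 5
# Pull out a single group as an ordinary DataFrame
grouped.get_group('Sales')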
Applying an Aggregation Function#
Now let’s calculate the average salary for each department:
employees.groupby('department')['salary'].mean()
department
HR 44000.0
IT 58750.0
Sales 48600.0
Name: salary, dtype: float64
Let’s break down what happened:
employees.groupby('department') — Split the data by department
['salary'] — Select just the salary column from each group
.mean() — Calculate the mean of each group's salaries
The result is a Series where the index contains the department names and the values are the average salaries.
Tip
The result of a groupby operation is often a Series with the grouping column as the index. Use .reset_index() to convert it back to a regular DataFrame.
Converting the Result to a DataFrame#
If you want a proper DataFrame with columns instead of a Series with an index:
avg_salary = employees.groupby('department')['salary'].mean()
avg_salary.reset_index()
department salary
0 HR 44000.0
1 IT 58750.0
2 Sales 48600.0
Now we have a DataFrame with ‘department’ as a regular column. This is often more useful for further analysis or merging with other data.
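If you also want a more descriptive column name than salary, one option is to rename it after resetting the index (a small sketch; the name avg_salary is just a choice):
avg_salary.reset_index().rename(columns={'salary': 'avg_salary'})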
Common Aggregation Functions#
Pandas provides many built-in aggregation functions you can use with groups:
| Function | Description |
|---|---|
| mean() | Average of the group |
| sum() | Sum of all values |
| count() | Number of values (excludes NaN) |
| size() | Number of rows (includes NaN) |
| min() | Minimum value |
| max() | Maximum value |
| std() | Standard deviation |
| var() | Variance |
| median() | Median value |
| first() | First value in the group |
| last() | Last value in the group |
Examples#
# Total salary per department
employees.groupby('department')['salary'].sum()
department
HR 132000
IT 235000
Sales 243000
Name: salary, dtype: int64
# Number of employees per department
employees.groupby('department')['employee_id'].count()
department
HR 3
IT 4
Sales 5
Name: employee_id, dtype: int64
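As the table of aggregation functions noted, count() counts non-missing values in the selected column, whereas size() counts rows per group regardless of missing values. On this dataset, which has no missing values, the two agree:
# size() is called on the groups themselves, not on a single column
employees.groupby('department').size()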
# Maximum years of experience per department
employees.groupby('department')['years_experience'].max()
department
HR 4
IT 8
Sales 5
Name: years_experience, dtype: int64
Grouping by Multiple Columns#
You can group by more than one column to get more granular summaries. Simply pass a list of column names to groupby():
# Average salary by department AND performance score
avg_by_dept_perf = employees.groupby(['department', 'performance_score'])['salary'].mean()
avg_by_dept_perf
department performance_score
HR Average 44000.0
Excellent 46000.0
Good 42000.0
IT Excellent 58500.0
Good 59000.0
Sales Average 50000.0
Good 47666.666667
Name: salary, dtype: float64
Notice the result has a MultiIndex — a hierarchical index with two levels. This can be useful, but it’s often easier to work with a flat DataFrame.
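You can still pull individual values out of a MultiIndexed Series by passing a tuple of labels to .loc, for example:
# Average salary for IT employees rated 'Good'
avg_by_dept_perf.loc[('IT', 'Good')]  # 59000.0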
Using reset_index() with Multiple Groups#
avg_by_dept_perf.reset_index()
department performance_score salary
0 HR Average 44000.000000
1 HR Excellent 46000.000000
2 HR Good 42000.000000
3 IT Excellent 58500.000000
4 IT Good 59000.000000
5 Sales Average 50000.000000
6 Sales Good 47666.666667
Using unstack() for Pivot-Table Style#
The unstack() method pivots one level of the index to columns, creating a pivot table-like structure:
avg_by_dept_perf.unstack()
performance_score Average Excellent Good
department
HR 44000.0 46000.0 42000.000000
IT NaN 58500.0 59000.000000
Sales 50000.0 NaN 47666.666667
This makes it easy to compare values across one dimension. Notice that NaN appears where there are no employees matching that combination (e.g., no IT employees with ‘Average’ performance in our data).
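If the NaN gaps are inconvenient, unstack() accepts a fill_value argument. Whether a placeholder such as 0 is appropriate depends on your analysis, so treat this as an option rather than a default:
avg_by_dept_perf.unstack(fill_value=0)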
Note
The difference between reset_index() and unstack():
reset_index() gives you a “long” format with each combination as a row
unstack() gives you a “wide” format with one grouping variable as columns
Multiple Aggregations with agg()#
Often you want to calculate several statistics at once. The agg() method (short for aggregate) lets you do this:
Single Column, Multiple Functions#
# Multiple statistics for salary by department
employees.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count'])
mean min max count
department
HR 44000.0 42000 46000 3
IT 58750.0 55000 62000 4
Sales 48600.0 45000 52000 5
Multiple Columns, Multiple Functions#
# Different aggregations for different columns
employees.groupby('department').agg({
'salary': ['mean', 'sum'],
'years_experience': ['mean', 'max'],
'employee_id': 'count'
})
salary years_experience employee_id
mean sum mean max count
department
HR 44000.0 132000 2.333333 4 3
IT 58750.0 235000 6.500000 8 4
Sales 48600.0 243000 3.400000 5 5
The result has a MultiIndex for columns. You can flatten it:
summary = employees.groupby('department').agg({
'salary': ['mean', 'sum'],
'years_experience': ['mean', 'max'],
'employee_id': 'count'
})
# Flatten the column names
summary.columns = ['_'.join(col) for col in summary.columns]
summary.reset_index()
department salary_mean salary_sum years_experience_mean years_experience_max employee_id_count
0 HR 44000.0 132000 2.333333 4 3
1 IT 58750.0 235000 6.500000 8 4
2 Sales 48600.0 243000 3.400000 5 5
Named Aggregations (Cleaner Syntax)#
Pandas also supports a cleaner syntax for naming your aggregated columns:
employees.groupby('department').agg(
avg_salary=('salary', 'mean'),
total_salary=('salary', 'sum'),
num_employees=('employee_id', 'count'),
max_experience=('years_experience', 'max')
).reset_index()
department avg_salary total_salary num_employees max_experience
0 HR 44000.0 132000 3 4
1 IT 58750.0 235000 4 8
2 Sales 48600.0 243000 5 5
This syntax is often preferred because it produces clean, single-level column names directly.
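Named aggregations aren't limited to the built-in function names; they also accept any callable. As a quick sketch (the column name salary_spread is just illustrative), here is the salary range per department:
employees.groupby('department').agg(
    salary_spread=('salary', lambda s: s.max() - s.min())
)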
Applying Custom Functions with apply()#
Sometimes the built-in aggregation functions aren’t enough. The apply() method lets you use custom functions.
Creating Categories from Continuous Data#
Let’s categorise employees by their years of experience:
def categorise_experience(years):
if years < 3:
return 'Junior'
elif years < 6:
return 'Mid-Level'
else:
return 'Senior'
# Apply the function to create a new column
employees['experience_level'] = employees['years_experience'].apply(categorise_experience)
employees[['name', 'years_experience', 'experience_level']]
name years_experience experience_level
0 Alice 2 Junior
1 Bob 5 Mid-Level
2 Charlie 3 Mid-Level
3 Diana 1 Junior
4 Eve 8 Senior
5 Frank 4 Mid-Level
6 Grace 2 Junior
7 Henry 6 Senior
8 Ivy 3 Mid-Level
9 Jack 4 Mid-Level
10 Kate 7 Senior
11 Leo 5 Mid-Level
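As an alternative to writing your own function for this kind of binning, pandas provides pd.cut, which assigns labels based on numeric bin edges. A sketch with edges chosen to reproduce the same bands for whole-number years:
# (0, 2] -> Junior, (2, 5] -> Mid-Level, (5, inf) -> Senior
employees['experience_level_cut'] = pd.cut(
    employees['years_experience'],
    bins=[0, 2, 5, float('inf')],
    labels=['Junior', 'Mid-Level', 'Senior']
)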
Applying Functions to Groups#
You can also apply custom functions to entire groups:
# Custom function to get the salary range within each department
def salary_range(group):
return group['salary'].max() - group['salary'].min()
employees.groupby('department').apply(salary_range)
department
HR 4000
IT 7000
Sales 7000
dtype: int64
Lambda Functions for Quick Operations#
For simple operations, lambda functions provide a compact syntax:
# Calculate the proportion of total salary that each department represents
total_salary = employees['salary'].sum()
employees.groupby('department')['salary'].sum().apply(lambda x: x / total_salary * 100)
department
HR 21.639344
IT 38.524590
Sales 39.836066
Name: salary, dtype: float64
Practical Example: Sales Analysis#
Let’s work through a more realistic example with sales data:
# Create sales data
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=20, freq='D'),
'region': ['North', 'South', 'East', 'West'] * 5,
'product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Gadget',
'Widget', 'Gizmo', 'Gadget', 'Widget', 'Gizmo',
'Gadget', 'Widget', 'Gizmo', 'Gadget', 'Widget',
'Gizmo', 'Widget', 'Gadget', 'Widget', 'Gizmo'],
'quantity': [10, 15, 8, 20, 12, 18, 25, 14, 9, 22,
16, 11, 19, 13, 7, 24, 10, 17, 12, 21],
'revenue': [100, 225, 80, 400, 180, 180, 500, 210, 90, 440,
240, 110, 380, 195, 70, 480, 100, 255, 120, 420]
})
sales.head(10)
date region product quantity revenue
0 2024-01-01 North Widget 10 100
1 2024-01-02 South Gadget 15 225
2 2024-01-03 East Widget 8 80
3 2024-01-04 West Gizmo 20 400
4 2024-01-05 North Gadget 12 180
5 2024-01-06 South Widget 18 180
6 2024-01-07 East Gizmo 25 500
7 2024-01-08 West Gadget 14 210
8 2024-01-09 North Widget 9 90
9 2024-01-10 South Gizmo 22 440
Business Question 1: Total Revenue by Region#
revenue_by_region = sales.groupby('region')['revenue'].sum().reset_index()
revenue_by_region.columns = ['region', 'total_revenue']
revenue_by_region.sort_values('total_revenue', ascending=False)
region total_revenue
2 South 1100
3 West 1030
0 East 960
1 North 470
Business Question 2: Best-Selling Products by Quantity#
sales.groupby('product')['quantity'].sum().sort_values(ascending=False)
product
Gizmo 132
Gadget 87
Widget 67
Name: quantity, dtype: int64
Business Question 3: Average Revenue per Sale by Product and Region#
sales.groupby(['product', 'region'])['revenue'].mean().unstack().round(2)
region East North South West
product
Gadget 240.0 180.0 225.0 210.0
Gizmo 500.0 NaN 440.0 400.0
Widget 80.0 95.0 180.0 120.0
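A handy follow-up on a table like this is idxmax(axis=1), which returns, for each product, the region with the highest average revenue (NaN entries are ignored):
sales.groupby(['product', 'region'])['revenue'].mean().unstack().idxmax(axis=1)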
Business Question 4: Comprehensive Summary#
summary = sales.groupby('region').agg(
total_sales=('revenue', 'sum'),
avg_sale=('revenue', 'mean'),
num_transactions=('revenue', 'count'),
total_quantity=('quantity', 'sum')
).reset_index()
summary['avg_price_per_unit'] = summary['total_sales'] / summary['total_quantity']
summary.round(2)
region total_sales avg_sale num_transactions total_quantity avg_price_per_unit
0 East 960 192.0 5 57 16.84
1 North 470 94.0 5 48 9.79
2 South 1100 220.0 5 75 14.67
3 West 1030 206.0 5 112 9.20
Transform: Returning Data in Original Shape#
Sometimes you want to perform a group calculation but keep the result aligned with the original data. The transform() method does exactly this.
Adding Group Statistics to Each Row#
# Add department average salary to each employee row
employees['dept_avg_salary'] = employees.groupby('department')['salary'].transform('mean')
employees[['name', 'department', 'salary', 'dept_avg_salary']]
name department salary dept_avg_salary
0 Alice Sales 45000 48600.0
1 Bob IT 55000 58750.0
2 Charlie Sales 48000 48600.0
3 Diana HR 42000 44000.0
4 Eve IT 62000 58750.0
5 Frank Sales 51000 48600.0
6 Grace HR 44000 44000.0
7 Henry IT 58000 58750.0
8 Ivy Sales 47000 48600.0
9 Jack HR 46000 44000.0
10 Kate IT 60000 58750.0
11 Leo Sales 52000 48600.0
Calculating Difference from Group Mean#
# How much does each employee earn above/below their department average?
employees['salary_vs_dept_avg'] = employees['salary'] - employees['dept_avg_salary']
employees[['name', 'department', 'salary', 'dept_avg_salary', 'salary_vs_dept_avg']]
name department salary dept_avg_salary salary_vs_dept_avg
0 Alice Sales 45000 48600.0 -3600.0
1 Bob IT 55000 58750.0 -3750.0
2 Charlie Sales 48000 48600.0 -600.0
3 Diana HR 42000 44000.0 -2000.0
4 Eve IT 62000 58750.0 3250.0
5 Frank Sales 51000 48600.0 2400.0
6 Grace HR 44000 44000.0 0.0
7 Henry IT 58000 58750.0 -750.0
8 Ivy Sales 47000 48600.0 -1600.0
9 Jack HR 46000 44000.0 2000.0
10 Kate IT 60000 58750.0 1250.0
11 Leo Sales 52000 48600.0 3400.0
This is incredibly useful for normalisation or comparing individual values to group benchmarks.
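For example, one common normalisation is the within-group z-score, which transform() handles neatly (a sketch using the employees data from above):
# Each salary expressed in standard deviations from its department mean
employees['salary_zscore'] = (
    employees.groupby('department')['salary']
    .transform(lambda s: (s - s.mean()) / s.std())
)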
Working with Real Data: String Splitting#
Often real-world data needs cleaning before grouping. A common task is splitting columns:
# Data with location as "City, Country"
locations = pd.DataFrame({
'employee': ['Alice', 'Bob', 'Charlie', 'Diana'],
'location': ['London, England', 'Glasgow, Scotland', 'Belfast, Northern Ireland', 'Cardiff, Wales'],
'salary': [50000, 48000, 52000, 47000]
})
locations
employee location salary
0 Alice London, England 50000
1 Bob Glasgow, Scotland 48000
2 Charlie Belfast, Northern Ireland 52000
3 Diana Cardiff, Wales 47000
Splitting into Separate Columns#
# Split location into city and country
locations[['city', 'country']] = locations['location'].str.split(', ', expand=True)
locations
employee location salary city country
0 Alice London, England 50000 London England
1 Bob Glasgow, Scotland 48000 Glasgow Scotland
2 Charlie Belfast, Northern Ireland 52000 Belfast Northern Ireland
3 Diana Cardiff, Wales 47000 Cardiff Wales
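If a location string could ever contain more than one comma, passing n=1 limits the split to the first comma so the country part stays intact. This is a defensive variant rather than something this small dataset needs:
locations[['city', 'country']] = locations['location'].str.split(', ', n=1, expand=True)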
Now we can group by country:
locations.groupby('country')['salary'].mean()
country
England 50000.0
Northern Ireland 52000.0
Scotland 48000.0
Wales 47000.0
Name: salary, dtype: float64
Summary#
Here’s a quick reference for grouping and aggregating:
| Task | Syntax |
|---|---|
| Group by one column | df.groupby('col') |
| Group by multiple columns | df.groupby(['col1', 'col2']) |
| Single aggregation | df.groupby('col')['val'].mean() |
| Multiple aggregations | df.groupby('col')['val'].agg(['mean', 'min', 'max']) |
| Named aggregations | df.groupby('col').agg(new_name=('val', 'mean')) |
| Different aggregations per column | df.groupby('col').agg({'a': 'mean', 'b': 'sum'}) |
| Reset the index | df.groupby('col')['val'].mean().reset_index() |
| Pivot to columns | df.groupby(['c1', 'c2'])['val'].mean().unstack() |
| Transform (keep original shape) | df.groupby('col')['val'].transform('mean') |
| Apply custom function | df.groupby('col').apply(my_function) |
Exercises#
Exercise 24
Exercise 1: Basic Grouping
Using the employees DataFrame:
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'salary': [45000, 55000, 48000, 42000, 62000, 51000, 44000, 58000],
'years_experience': [3, 7, 4, 2, 9, 5, 3, 6],
'bonus': [2000, 3000, 2500, 1500, 4000, 2800, 1800, 3200]
})
Calculate the average salary for each department
Find the total bonus paid to each department
Count how many employees are in each department
Solution to Exercise 24
import pandas as pd
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'salary': [45000, 55000, 48000, 42000, 62000, 51000, 44000, 58000],
'years_experience': [3, 7, 4, 2, 9, 5, 3, 6],
'bonus': [2000, 3000, 2500, 1500, 4000, 2800, 1800, 3200]
})
# 1. Average salary per department
print("Average salary by department:")
print(employees.groupby('department')['salary'].mean())
# 2. Total bonus per department
print("\nTotal bonus by department:")
print(employees.groupby('department')['bonus'].sum())
# 3. Employee count per department
print("\nEmployee count by department:")
print(employees.groupby('department')['name'].count())
Exercise 25
Exercise 2: Multiple Aggregations
Using the same employees DataFrame:
Create a summary table showing for each department:
Average salary
Maximum salary
Total years of experience
Number of employees
Use the named aggregation syntax to give meaningful column names
Solution to Exercise 25
import pandas as pd
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'salary': [45000, 55000, 48000, 42000, 62000, 51000, 44000, 58000],
'years_experience': [3, 7, 4, 2, 9, 5, 3, 6],
'bonus': [2000, 3000, 2500, 1500, 4000, 2800, 1800, 3200]
})
# Using named aggregations
summary = employees.groupby('department').agg(
avg_salary=('salary', 'mean'),
max_salary=('salary', 'max'),
total_experience=('years_experience', 'sum'),
num_employees=('name', 'count')
).reset_index()
print(summary)
Exercise 26
Exercise 3: Using Custom Functions
Create a function that categorises salaries into bands:
‘Low’: salary < 45000
‘Medium’: 45000 <= salary < 55000
‘High’: salary >= 55000
Apply this function to create a new ‘salary_band’ column
Count how many employees fall into each salary band
Calculate the average years of experience for each salary band
Solution to Exercise 26
import pandas as pd
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'salary': [45000, 55000, 48000, 42000, 62000, 51000, 44000, 58000],
'years_experience': [3, 7, 4, 2, 9, 5, 3, 6],
'bonus': [2000, 3000, 2500, 1500, 4000, 2800, 1800, 3200]
})
# Define the categorisation function
def categorise_salary(salary):
if salary < 45000:
return 'Low'
elif salary < 55000:
return 'Medium'
else:
return 'High'
# 1. Apply function to create salary_band column
employees['salary_band'] = employees['salary'].apply(categorise_salary)
print("Employees with salary bands:")
print(employees[['name', 'salary', 'salary_band']])
# 2. Count employees in each band
print("\nCount by salary band:")
print(employees.groupby('salary_band')['name'].count())
# 3. Average experience by salary band
print("\nAverage experience by salary band:")
print(employees.groupby('salary_band')['years_experience'].mean())
Exercise 27
Exercise 4: Using Transform
Using the employees DataFrame:
Add a new column ‘dept_avg_salary’ containing each employee’s department average salary
Add a new column ‘salary_pct_of_dept’ showing each employee’s salary as a percentage of their department’s total salary
Identify which employees earn above their department average
Solution to Exercise 27
import pandas as pd
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'salary': [45000, 55000, 48000, 42000, 62000, 51000, 44000, 58000],
'years_experience': [3, 7, 4, 2, 9, 5, 3, 6],
'bonus': [2000, 3000, 2500, 1500, 4000, 2800, 1800, 3200]
})
# 1. Department average salary
employees['dept_avg_salary'] = employees.groupby('department')['salary'].transform('mean')
# 2. Salary as percentage of department total
employees['dept_total_salary'] = employees.groupby('department')['salary'].transform('sum')
employees['salary_pct_of_dept'] = (employees['salary'] / employees['dept_total_salary'] * 100).round(2)
print("Employees with department statistics:")
print(employees[['name', 'department', 'salary', 'dept_avg_salary', 'salary_pct_of_dept']])
# 3. Employees earning above department average
print("\nEmployees earning above their department average:")
above_avg = employees[employees['salary'] > employees['dept_avg_salary']]
print(above_avg[['name', 'department', 'salary', 'dept_avg_salary']])