How to Aggregate Data with PySpark
PySpark has functions for aggregating data in a similar way to SQL. Here you'll learn how to use these functions.
There are two main ways to aggregate data in PySpark:
- Using the groupBy() method’s built-in functions
- Using agg() with PySpark SQL functions
I usually use the agg() method, as it gives you more summary functions to choose from, lets you include more than one summary statistic, and makes it easy to name your summary columns with aliases.
However, the first option is good if you’re just after a quick sum or making a basic frequency table to check your data.
Let’s look at both these in more detail now.
Usually when aggregating data, you group your data by one or more columns and then summarise on other columns.
If you’re familiar with SQL, then this is similar to specifying summary functions such as sum(column1) in the select statement and grouping columns in the group by clause.
You can group your data using the groupBy() method and then chain one of its built-in aggregation methods, such as sum:
This will give you the total profit for the entire df DataFrame, as there aren’t any columns specified in the groupBy() call.
The example below will give you the sum of profit for each region:
You can of course use other aggregation methods, such as max to get the maximum profit for each region:
One function I use very often for creating a quick frequency table is count, for example:
This is useful when you need to get a feel for your data when working with huge DataFrames.
If you want to define a new column first and then aggregate on it, you can chain withColumn() with groupBy(), for example:
df.withColumn('weight_kg', df.weight_lb * 0.4535924) \
    .groupBy('region') \
    .avg('weight_kg') \
    .show()
This will give you the average weight in kg by region.
The agg() method lets you use any of the aggregate functions that are part of the pyspark.sql.functions module.
You can import the functions module from the pyspark.sql package like this:
from pyspark.sql import functions as F
Calling this “F” is standard practice and makes sure we can distinguish between PySpark SQL functions and built-in Python functions.
Importing the functions module means you aren’t just confined to the basic functions from the GroupedData class. You can now use functions like F.sum(), F.avg(), and F.countDistinct() inside agg().
You can also include more than one aggregate function in the agg() method, for example:
You can also apply aliases to your output by using .alias('column_alias') after each aggregate function:
Another useful aggregate function is countDistinct, which does what it says on the tin! It can be used like this:
❓ Did you know?
You can use your calculated summary in a filter, similar to a having clause in standard SQL. For example, assuming a df with department and salary columns:
from pyspark.sql.functions import col

df.groupBy('department') \
    .agg(F.sum('salary').alias('sum_salary')) \
    .where(col('sum_salary') > 500000) \
    .show()
This will return all departments where the total salary is greater than 500k.
You can use the corr function to get the correlation coefficient between two columns:
There you have it! You can now aggregate your data in PySpark using either the built-in groupBy() methods or agg() with PySpark SQL functions.