How to Join DataFrames with PySpark
Follow along with the example below to learn how to create different types of joins with PySpark.
To create a join with PySpark, you can use the `join()` DataFrame method. This example creates a new DataFrame called `joinedDf`, which is the result of joining `df1` to `df2` using a left join with `id` and `region` as the join keys.
```python
joinedDf = df1.join(df2, (df1.id == df2.id) & (df1.region == df2.region), how='left')
```
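If you want to run this end to end, here is a minimal sketch; the sample data (the `name` and `sales` columns and their values) is hypothetical and just for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Hypothetical sample data -- substitute your own DataFrames
df1 = spark.createDataFrame(
    [(1, "EMEA", "Alice"), (2, "APAC", "Bob")],
    ["id", "region", "name"],
)
df2 = spark.createDataFrame(
    [(1, "EMEA", 100), (3, "EMEA", 250)],
    ["id", "region", "sales"],
)

# Left join on both id and region; rows from df1 with no match keep null sales
joinedDf = df1.join(df2, (df1.id == df2.id) & (df1.region == df2.region), how='left')
joinedDf.show()
```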
As you can see above, you start with your initial DataFrame `df1` and then chain on `.join()`, which takes three arguments:

- DataFrame to be joined
- Join condition
- Join type

The first argument is self-explanatory: just use the DataFrame name (`df2` in this example).
Join condition
Here you specify which columns from both DataFrames should be used as join keys. This is similar to how you would specify your ON condition in an SQL join.
📝 Note:
Remember that to check whether values are equal in Python, you need to use two equals signs (`==`).
If the columns are named the same in both DataFrames, you can pass a Python list of column names instead, e.g. `on=['id', 'region']`.
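For example, with the hypothetical `df1` and `df2` from the sketch above, which share the `id` and `region` column names, the list form looks like this:

```python
# Same left join using the list shorthand, since both DataFrames share
# the 'id' and 'region' column names. A handy side effect: the join key
# columns appear only once in the result instead of being duplicated.
joinedDf = df1.join(df2, on=['id', 'region'], how='left')
```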
Join type
If you don’t specify a join type in the third argument of the `join()` method, it will default to an inner join. Here are the other main join types you can specify:
| `how=` | Join Type |
|---|---|
| `'inner'` | Inner Join |
| `'outer'` | Outer Join |
| `'left'` | Left Join |
| `'right'` | Right Join |
| `'cross'` | Cross Join |
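Sticking with the same hypothetical `df1` and `df2`, here is each join type in practice; note that a cross join takes no join condition:

```python
# The same keys with different join types
innerDf = df1.join(df2, on=['id', 'region'], how='inner')  # matching rows only
outerDf = df1.join(df2, on=['id', 'region'], how='outer')  # all rows from both sides
rightDf = df1.join(df2, on=['id', 'region'], how='right')  # all rows from df2

# A cross join pairs every row of df1 with every row of df2,
# so no join condition is passed
crossDf = df1.join(df2, how='cross')
```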
As you can see, PySpark makes joining DataFrames easy.
Take a look at these posts to learn about querying data with PySpark and how to aggregate data.