How to Join DataFrames with PySpark

Follow along with the example below to learn how to create different types of joins with PySpark.

To create a join with PySpark, you can use the join() DataFrame method.

This example creates a new DataFrame called joinedDf, which is the result of left-joining df1 to df2 with id and region as join keys.

joinedDf = df1.join(df2, (df1.id == df2.id) & (df1.region == df2.region), how='left')
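If you want to try this end-to-end, here is a minimal, self-contained sketch. The sample data and the name and sales columns are made up for illustration; only the id and region join keys come from the example above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Hypothetical sample data, just to make the example runnable
df1 = spark.createDataFrame(
    [(1, "EU", "Alice"), (2, "US", "Bob")],
    ["id", "region", "name"],
)
df2 = spark.createDataFrame(
    [(1, "EU", 100), (3, "US", 200)],
    ["id", "region", "sales"],
)

# Left join on both id and region
joinedDf = df1.join(df2, (df1.id == df2.id) & (df1.region == df2.region), how='left')
joinedDf.show()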

As you can see above, you start with your initial DataFrame df1 and then chain on .join(), which takes three arguments:

  1. DataFrame to be joined
  2. Join condition
  3. Join type

The first argument is self-explanatory: pass the DataFrame you want to join, df2 in this example.

Join condition

Here you specify which columns from both DataFrames should be used as join keys. This is similar to how you would specify your ON condition in an SQL join.

📝 Note:

Remember that to check whether two values are equal in Python, you need to use two equals signs: ==.

If the columns have the same names in both DataFrames, you can instead pass a Python list of column names, e.g. on=['id', 'region'].
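As a sketch, the same left join written with the list form (reusing the hypothetical df1 and df2 from earlier):

# Same left join, expressed with a list of shared column names
joinedDf = df1.join(df2, on=['id', 'region'], how='left')
joinedDf.show()

A handy side effect of the list form is that each join key appears only once in the result, whereas the expression form keeps both copies of id and region (one from each DataFrame).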

Join type

If you don't specify a join type in the third argument of join(), it defaults to an inner join. Here are the other main join types you can specify:

how=       Join type
'inner'    Inner join
'outer'    Full outer join
'left'     Left join
'right'    Right join
'cross'    Cross join
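To see how the join type changes the result, you can run the same join with different how= values. Again, this is a sketch using the hypothetical df1 and df2 from earlier:

# Inner join: only rows whose id and region match in both DataFrames
df1.join(df2, on=['id', 'region'], how='inner').show()

# Full outer join: all rows from both sides, nulls where there is no match
df1.join(df2, on=['id', 'region'], how='outer').show()

# Cross join: every row of df1 paired with every row of df2, no join condition
df1.join(df2, how='cross').show()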

As you can see, PySpark makes joining DataFrames easy.

Take a look at these posts to learn how to query data with PySpark and how to aggregate data.