How to Join DataFrames with PySpark
Follow along with the example below to learn how to create different types of joins with PySpark.
To create a join with PySpark, use the join() DataFrame method. The example below creates a new DataFrame called joinedDf, the result of left-joining df2 onto df1 with id and region as the join keys.
joinedDf = df1.join(df2, (df1.id == df2.id) & (df1.region == df2.region), how='left')
As you can see above, you start with your initial DataFrame, df1, and chain on .join(), which takes three arguments:
- DataFrame to be joined
- Join condition
- Join type
The first argument is self-explanatory: just pass the DataFrame to be joined - df2 in this example.
The second argument is the join condition. Here you specify which columns from the two DataFrames should be matched as join keys; this is similar to the ON clause in a SQL join.
Remember that to check whether values are equal in Python, you need two equals signs (==).
If the join columns are named the same in both DataFrames, you can instead pass a Python list of column names as the join condition.
If you don’t specify a join type in the third argument of the join() method, it defaults to an inner join. The other main join types you can specify are:
- left (or left_outer)
- right (or right_outer)
- full (or outer, full_outer)
- left_semi
- left_anti
- cross
As you can see, PySpark makes joining DataFrames easy.