What is PySpark and Why Should I Use It?

If you’ve heard how useful PySpark is but haven’t started using it yet, read on to learn more!

PySpark is a way of unlocking the power of Spark through Python. You may be familiar with Python and its uses in data science, but what exactly is Spark?

What is Spark?

Spark is an open-source platform for running data processing on a computer cluster. Running on a cluster brings many benefits - the main one being performance.

🖥️ Computer Cluster:

A cluster is a group of connected computers which work together to perform resource-heavy tasks – in our case, data processing.

Each cluster has a master node and several worker nodes. The master node distributes tasks and data to the worker nodes, which allows processes to run in parallel.

As well as performance benefits, clusters are also easier to scale as you can add more nodes as your data grows. Clusters are also more reliable - if one of the worker nodes breaks down then the tasks can be given to another worker node.
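To make the master/worker idea concrete, here is a toy sketch that uses only Python's standard library. It is an analogy rather than how Spark works internally: a thread pool on one machine stands in for the worker nodes, and the function names (split_into_chunks, worker_task, run_on_toy_cluster) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of roughly equal size."""
    size, remainder = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def worker_task(chunk):
    """The work each pretend worker node does on its own slice of the data."""
    return sum(x * x for x in chunk)


def run_on_toy_cluster(values, n_workers=4):
    """Play the master node: hand out chunks, then combine partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker_task, chunks))
    return sum(partial_results)


total = run_on_toy_cluster(list(range(1_000)))
print(total)
```

Each worker only ever sees its own chunk, and the master combines the partial results at the end. A real cluster follows the same pattern, but with separate machines in place of threads.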

All this sounds a bit complicated! How do I distribute my data processing to the different nodes?

Luckily, Spark does this for you. Spark uses data structures called Resilient Distributed Datasets (RDDs). These are objects which let Spark distribute data across many worker nodes.

📁 RDD:

  • Resilient: Data can be rebuilt if there’s a failure or fault
  • Distributed: Data is distributed across many nodes in a cluster
  • Dataset: Collection of data values

As RDDs are low-level objects, they aren’t the most user-friendly data structures to work with as you need to provide specific instructions for running your query. This is why Spark has a DataFrame layer built on top of RDDs.

DataFrame objects can be thought of as an SQL table or a Pandas DataFrame, where data is structured in rows and columns. The best thing about using DataFrames is that most of the optimization is done for you and you don’t need to worry as much about how the data is processed behind the scenes.

Spark takes care of this for you and you can query the data on the cluster as you would on a single server.


What is PySpark (Python + Spark)?

Spark is written in Scala and is often used alongside the Hadoop Distributed File System (HDFS), although it can read data from other storage systems too. But you don’t need to learn Scala to use Spark – you can use PySpark.

PySpark is a Python API for Spark, so you can access the power of Spark by using Python instead of Scala.

Why should I use PySpark?

PySpark is great for analysing huge datasets. You can run joins and summaries as well as ETL jobs and machine learning pipelines.

Spark keeps intermediate data in memory where it can, whereas Hadoop MapReduce, for example, has to write intermediate results to disk and read them back between steps. This means that if you have enough memory, PySpark can be many times faster.

If you work with huge datasets on a single server and you’ve noticed that your tasks run slowly, then PySpark is definitely worth considering.

Something to keep in mind is that using a cluster can add some maintenance and setup overheads. A cloud solution such as AWS’s EMR would make setting up and adding more nodes easier if you don’t have a dedicated data engineering team.

If you’d like to learn more about using PySpark to work with data then check out this guide to querying data with PySpark.