How to Write Spark ETL Processes

Spark is a powerful tool for extracting data, running transformations, and loading the results into a data store.

Spark runs computations in parallel, so execution is lightning fast, and clusters can be scaled up for big data. Spark’s native API and spark-daria’s EtlDefinition object allow for elegant definitions of ETL logic.

Extract

Suppose you have a data lake of Parquet files. Here’s some example code that reads from the data lake, filters the data, and then repartitions the data subset.
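Something like the following sketch (the bucket path, filter column, and partition count are placeholders, not values from a real data lake):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Read the Parquet files in the data lake (the s3a path is a placeholder).
val dataLakeDF: DataFrame = spark.read.parquet("s3a://some-bucket/extracts/people")

// Filter to the rows we need, then repartition the much smaller subset so
// downstream stages aren't stuck with lots of tiny, mostly-empty partitions.
val extractDF: DataFrame = dataLakeDF
  .where(col("person_country") === "Cuba")
  .repartition(20)
```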

Read this blog post for more information about repartitioning DataFrames. We’re now ready to transform the extractDF.

Transform

We can define a custom transformation function that takes a DataFrame as an argument and returns a DataFrame to transform the extractDF. Custom transformation functions are reusable and easily testable, which leads to a high-quality codebase.

Let’s define a couple of DataFrame transformations.
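Here are two toy transformations, just to show the shape: an empty argument list followed by a DataFrame => DataFrame. The column names and values are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Appends a greeting column with a constant value.
def withGreeting()(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello world"))
}

// Appends a farewell column with a constant value.
def withFarewell()(df: DataFrame): DataFrame = {
  df.withColumn("farewell", lit("goodbye"))
}
```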

Let’s create a model() function that chains the custom transformations.
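Assuming the two toy transformations above, model() simply chains them with Dataset#transform:

```scala
// Chains the custom transformations into a single DataFrame => DataFrame function.
def model()(df: DataFrame): DataFrame = {
  df
    .transform(withGreeting())
    .transform(withFarewell())
}
```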

We can call extractDF.transform(model()) to run the transformations on our extract. Easy peasy 😎

Check out this blog post for more details on chaining custom DataFrame transformations.

Load (or Report)

We can use the Spark DataFrame writers to define a generic function that writes a DataFrame to a given location in S3.

The writer function should take a DataFrame as an argument and return nothing (Unit).
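Here’s a sketch of such a writer; the output path and write mode are assumptions you’d adapt to your own bucket and load semantics:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// A DataFrame => Unit function that writes the DataFrame as Parquet to S3.
def exampleWriter()(df: DataFrame): Unit = {
  df.write
    .mode(SaveMode.Overwrite)
    .parquet("s3a://some-bucket/loads/people_clean")
}
```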

EtlDefinition

Let’s instantiate the EtlDefinition case class defined in spark-daria and use the process() method to execute the ETL code.
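Wiring together the pieces defined above (assuming spark-daria is on the classpath):

```scala
import com.github.mrpowers.spark.daria.sql.EtlDefinition

val etl = EtlDefinition(
  sourceDF = extractDF,
  transform = model(),
  write = exampleWriter()
)
```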

Here’s how to execute the ETL code:
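```scala
etl.process()
```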

Wow, that was easy 😉 The EtlDefinition object can even be repurposed for making Slack messages! I’ll cover that in another blog post.

Take a look at the method signatures of the EtlDefinition arguments and make sure you understand how the functions we’ve defined fit into this mold.
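For reference, here is roughly how EtlDefinition is defined in spark-daria (check the library source for the exact current definition):

```scala
case class EtlDefinition(
    sourceDF: DataFrame,
    transform: (DataFrame => DataFrame),
    write: (DataFrame => Unit),
    metadata: scala.collection.mutable.Map[String, Any] = scala.collection.mutable.Map[String, Any]()
) {

  // Runs the transform on the source DataFrame and hands the result to the writer.
  def process(): Unit = {
    write(sourceDF.transform(transform))
  }

}
```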

Take note that EtlDefinition objects can optionally be instantiated with an arbitrary metadata Map.
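For example (the metadata keys here are made up):

```scala
val etlWithMetadata = EtlDefinition(
  sourceDF = extractDF,
  transform = model(),
  write = exampleWriter(),
  metadata = scala.collection.mutable.Map[String, Any]("owner" -> "data-eng", "frequency" -> "daily")
)
```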

Multiple EtlDefinitions

You can organize a collection of EtlDefinition objects in a mutable Map, so they’re easy to fetch and execute.
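A minimal sketch, reusing the etl value from above (the job names are arbitrary):

```scala
import scala.collection.mutable

// Registry of ETL jobs, keyed by name, so any job can be fetched and run on demand.
val etls = mutable.Map[String, EtlDefinition](
  "people" -> etl
)

// Fetch a definition by name and execute it.
etls("people").process()
```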

Next Steps

Here are the key steps to writing good ETL code in Spark.

  • Make sure to repartition the DataFrame after filtering
  • Custom DataFrame transformations should be broken up, tested individually, and then chained in a model() method
  • Create EtlDefinition objects to organize your ETL logic and make sure all of your method signatures are correct

I use the Databricks API, AWS Lambda, and Slack Slash commands to execute ETL jobs directly from Slack. I highly recommend this workflow!

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
