This blog post shows how to write some Spark code with the Java API and run a simple test.

The code snippets in this post are from this GitHub repo.

Project setup

Start by creating a pom.xml file for Maven.
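The repo linked above has the full pom.xml; a minimal starting point might look something like the sketch below (the group and artifact IDs and the Spark, Scala, and JUnit versions are placeholders to adjust for your environment):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Placeholder coordinates for the example project -->
  <groupId>com.example</groupId>
  <artifactId>spark-java-example</artifactId>
  <version>0.1.0</version>

  <dependencies>
    <!-- Spark SQL brings in the DataFrame / Dataset Java API -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <!-- JUnit for the simple test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
```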


Dependency injection is a design pattern that lets you write Spark code that’s more flexible and easier to test.

This blog post introduces code that has a dependency, shows how to inject the path as a dependency, and then shows how to inject an entire DataFrame.

Code with a dependency

Let’s create a withStateFullName method that appends a state_name column to a DataFrame.

The withStateFullName method appends the state_name column with a broadcast join.

withStateFullName depends on the Config object.
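Here’s a rough sketch of the pattern (the Config object, file path, and column names below are illustrative assumptions, not the exact code from the post):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Hypothetical Config object that hard-codes the path to the state mappings file
object Config {
  val statesPath = "s3a://some-bucket/states.csv"
}

def withStateFullName()(df: DataFrame): DataFrame = {
  // Hidden dependency: the method reaches out to Config for its data source
  val statesDF = df.sparkSession.read
    .option("header", "true")
    .csv(Config.statesPath)
  df.join(
    broadcast(statesDF),
    df("state_abbreviation") <=> statesDF("state_abbreviation"),
    "left_outer"
  ).drop(statesDF("state_abbreviation"))
}
```

Because the path is buried inside the method, tests can only run against that exact file; injecting the path (or the whole states DataFrame) as an argument removes that coupling.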


The spark-slack library can be used to speak notifications to Slack from your Spark programs and handle Slack Slash command responses.

You can speak Slack notifications to alert stakeholders when an important job is done running or even speak counts from a Spark DataFrame.

This blog post will also show how to run Spark ETL processes from the Slack command line that will allow your organization to operate more transparently and efficiently.

Slack Messages

Here’s how to speak a “You are amazing” message in the #general channel:
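The spark-slack README documents the library’s own API, which isn’t reproduced here; the snippet below is just a generic sketch of the underlying idea, posting a JSON payload to a Slack incoming webhook with the standard Java HTTP classes (the webhook URL is a placeholder):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Placeholder webhook URL; create a real one in your Slack workspace settings
val webhookUrl = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
val payload = """{"channel": "#general", "text": "You are amazing"}"""

val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
val out = conn.getOutputStream
out.write(payload.getBytes(StandardCharsets.UTF_8))
out.close()
println(s"Slack responded with HTTP ${conn.getResponseCode}")
```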


You can use Scaladoc to generate nicely formatted documentation for your Spark projects, just like the official Spark documentation.

Documentation encourages you to write code with clearly defined public interfaces and makes it easier for others to use your code.

The spark-daria project is a good example of an open source project that’s easy to use because the documentation follows Spark’s best practices. This blog will show you how to add documentation to your Spark projects.

How to generate documentation

The sbt doc command generates HTML documentation in the target/scala-2.11/api/ directory. You can open the documentation locally with the open target/scala-2.11/api/index.html command.
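The HTML is generated from the Scaladoc comments in your source code. Here’s a sketch of what a documented helper function might look like (the object and method names are made up for illustration):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, trim}

object StringFunctions {

  /**
   * Lowercases and trims a string Column.
   *
   * {{{
   * df.withColumn("clean_name", StringFunctions.cleanString(col("name")))
   * }}}
   *
   * @param col the string Column to clean
   * @return a Column with the trimmed, lowercased value
   */
  def cleanString(col: Column): Column = lower(trim(col))

}
```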

Here’s an…


Spark code can be organized in custom transformations, column functions, or user defined functions (UDFs).

Here’s how the different functions should be used in general:

  1. Use custom transformations when adding / removing columns or rows from a DataFrame
  2. Use Column functions when you need a custom Spark SQL function that can be defined with the native Spark API
  3. Use native functions (aka Catalyst expressions) when you want high performance execution
  4. Use UDFs when the native Spark API isn’t sufficient and you can’t express the logic you need with Column functions

Custom Transformations

Let’s take a look at some Spark code…
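Here’s a minimal illustration of the custom transformation pattern (the withGreeting function and sample data are made up; it assumes a SparkSession named spark is already in scope, e.g. in the spark-shell):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import spark.implicits._

// A custom transformation is simply a function from DataFrame to DataFrame
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello world"))

val df = Seq("miguel", "luisa").toDF("name")

// transform() lets custom transformations chain like native DataFrame methods
df.transform(withGreeting).show()
```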


StructType objects define the schema of Spark DataFrames. StructType objects contain a list of StructField objects that define the name, type, and nullable flag for each column in a DataFrame.

Let’s start with an overview of StructType objects and then demonstrate how StructType columns can be added to DataFrame schemas (essentially creating a nested schema).

StructType columns are a great way to eliminate order dependencies from Spark code.

StructType overview

The StructType case class can be used to define a DataFrame schema as follows.
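Here’s a sketch (the column names, types, and data are made up; it assumes a SparkSession named spark is already in scope, e.g. in the spark-shell):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Each StructField defines the column name, type, and nullable flag
val schema = StructType(
  List(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  )
)

val data = Seq(
  Row("miguel", 30),
  Row("luisa", 27)
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)
```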


The concat_ws and split Spark SQL functions can be used to add ArrayType columns to DataFrames.

Let’s demonstrate the concat_ws / split approach by interpreting a StringType column and analyze when this approach is preferable to the array() function.

String interpretation with the array() method

Let’s create a DataFrame with a StringType column and use the array() function to parse out all the colors in the string.
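Here’s a sketch of both approaches side by side (the sample sentences and color list are made up; it assumes a SparkSession named spark is in scope). array() plus when() builds the ArrayType column but leaves a null for every color that isn’t mentioned, while the concat_ws / split combination cleans those nulls out:

```scala
import org.apache.spark.sql.functions.{array, col, concat_ws, split, when}
import spark.implicits._

val df = Seq(
  "i like blue and red",
  "you like green",
  "grey is boring"
).toDF("word")

// array() + when() builds an ArrayType column, with nulls for missing colors
val arrayDF = df.withColumn(
  "colors",
  array(
    when(col("word").contains("blue"), "blue"),
    when(col("word").contains("red"), "red"),
    when(col("word").contains("green"), "green")
  )
)

// concat_ws() skips the nulls, and split() turns the result back into an array
val cleanDF = arrayDF.withColumn(
  "colors_clean",
  split(concat_ws(",", col("colors")), ",")
)

cleanDF.show(truncate = false)
```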


Spark is a powerful tool for extracting data, running transformations, and loading the results in a data store.

Spark runs computations in parallel so execution is lightning fast and clusters can be scaled up for big data. Spark’s native API and spark-daria’s EtlDefinition object allow for elegant definitions of ETL logic.

Extract

Suppose you have a data lake of Parquet files. Here’s some example code that will fetch the data lake, filter the data, and then repartition the data subset.
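A sketch of the extract step (the bucket path, filter condition, and partition count are placeholders; it assumes a SparkSession named spark is in scope):

```scala
import org.apache.spark.sql.functions.col

val dataLakeDF = spark.read.parquet("s3a://some-bucket/extracts")

val extractDF = dataLakeDF
  .where(col("mood") === "happy") // keep only the rows this job cares about
  .repartition(100)               // right-size the partitions for the smaller subset
```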

Read this blog post for more information about repartitioning…


Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren’t sufficient, but should be used sparingly because they’re not performant.

This blog post will demonstrate how to define UDFs and will show how to avoid UDFs, when possible, by leveraging native Spark functions.

Simple UDF example

Let’s define a UDF that removes all the whitespace and lowercases all the characters in a string.
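Here’s one way to write it (the sample data is made up; it assumes a SparkSession named spark is in scope, e.g. in the spark-shell):

```scala
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Plain Scala function with the string logic
val lowerRemoveAllWhitespace = (s: String) => s.toLowerCase().replaceAll("\\s", "")

// Wrap the function in a UDF so it can be applied to Columns
val lowerRemoveAllWhitespaceUDF = udf(lowerRemoveAllWhitespace)

val df = Seq("  HI THERE  ", "Git Hub").toDF("word")

df.withColumn("clean_word", lowerRemoveAllWhitespaceUDF(col("word"))).show()
```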


The Spark Column class defines predicate methods that allow logic to be expressed concisely and elegantly (e.g. isNull, isNotNull, and isin).

spark-daria defines additional Column methods such as isTrue, isFalse, isNullOrBlank, isNotNullOrBlank, and isNotIn to fill in the Spark API gaps.

This blog post will demonstrate how to express logic with the available Column predicate methods.

A quick note on method signatures

By convention, methods with accessor-like names (e.g. methods that begin with “is”) are defined as parameterless methods. For example, the isTrue method is defined without parentheses as follows:
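Here’s a sketch of how such a method can be bolted onto Column with an implicit class (illustrative only; it may differ from spark-daria’s exact implementation):

```scala
import org.apache.spark.sql.Column

object ColumnExt {

  implicit class ColumnMethods(col: Column) {

    // parameterless: no parentheses, following the accessor convention
    // usage (with the implicit in scope): df.where(col("is_active").isTrue)
    def isTrue: Column = col === true

  }

}
```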

Matthew Powers

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
