This blog post shows how to write some Spark code with the Java API and run a simple test.

The code snippets in this post are from this GitHub repo.

Start by creating a pom.xml file for Maven.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"…


Dependency injection is a design pattern that lets you write Spark code that’s more flexible and easier to test.

This blog post introduces code that has a dependency, shows how to inject the path as a dependency, and then shows how to inject an entire DataFrame.

Let’s create a withStateFullName transformation…
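Here’s a rough sketch of what the injection can look like (the object name, column names, and join key below are illustrative, not from the original post). The first method injects the path to the state mappings file so tests can point at a small fixture file; the second injects the mappings DataFrame itself so tests never touch the filesystem.

import org.apache.spark.sql.DataFrame

object Transformations {

  // Inject the path as a parameter
  def withStateFullName(statePath: String)(df: DataFrame): DataFrame = {
    val stateMappings = df.sparkSession.read.option("header", "true").csv(statePath)
    withStateFullNameFromDF(stateMappings)(df)
  }

  // Inject the entire DataFrame
  def withStateFullNameFromDF(stateMappings: DataFrame)(df: DataFrame): DataFrame = {
    df.join(stateMappings, Seq("state_abbreviation"), "left_outer")
  }
}

Either version chains cleanly with the transform method, e.g. df.transform(Transformations.withStateFullName("/tmp/state_mappings.csv")).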


The spark-slack library can be used to speak notifications to Slack from your Spark programs and handle Slack Slash command responses.

You can speak Slack notifications to alert stakeholders when an important job is done running or even speak counts from a Spark DataFrame.

This blog post will also show…


You can use Scaladoc to generate nicely formatted documentation for your Spark projects, just like the official Spark documentation.

Documentation encourages you to write code with clearly defined public interfaces and makes it easier for others to use your code.

The spark-daria project is a good example of an open…
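For example, a Scaladoc comment on a helper function might look like the sketch below (the object and method are made up for illustration). Running sbt doc turns comments like this into browsable HTML pages.

/**
 * Utility functions for cleaning strings.
 */
object StringHelpers {

  /**
   * Removes all whitespace characters from a string.
   *
   * @param str the string to clean (may be null)
   * @return the input without whitespace, or null if the input was null
   */
  def removeAllWhitespace(str: String): String = {
    if (str == null) null else str.replaceAll("\\s", "")
  }
}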


Spark code can be organized into custom transformations, column functions, or user defined functions (UDFs).

Here’s how the different functions should be used in general:

  1. Use custom transformations when adding / removing columns or rows from a DataFrame
  2. Use Column functions when you need a custom Spark SQL…
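As a rough sketch of the first two styles (the function and column names are made up for this example): a custom transformation takes and returns a DataFrame and chains with transform, while a column function takes and returns Column objects.

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, lower, trim}

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// Custom transformation: DataFrame => DataFrame, chained with .transform()
def withGreeting()(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello world"))
}

// Column function: Column => Column, used inside withColumn / select / filter
def cleanCity(city: Column): Column = {
  trim(lower(city))
}

val df = Seq("  Medellin ", "SAO PAULO").toDF("city")

val resultDF = df
  .transform(withGreeting())
  .withColumn("clean_city", cleanCity(col("city")))

resultDF.show()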


StructType objects define the schema of Spark DataFrames. StructType objects contain a list of StructField objects that define the name, type, and nullable flag for each column in a DataFrame.

Let’s start with an overview of StructType objects and then demonstrate how StructType columns can be added to DataFrame schemas…
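Here’s a minimal sketch of building a schema by hand (the column names and rows are illustrative).

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local").getOrCreate()

// Each StructField supplies a column name, type, and nullable flag
val schema = StructType(List(
  StructField("city", StringType, nullable = true),
  StructField("population", IntegerType, nullable = false)
))

val rows = Seq(Row("Medellin", 2500000), Row("Brasilia", 3000000))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

df.printSchema()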


The concat_ws and split Spark SQL functions can be used to add ArrayType columns to DataFrames.

Let’s demonstrate the concat_ws / split approach by interpreting a StringType column, and then analyze when this approach is preferable to the array() function.

Let’s create a DataFrame with a StringType column and use the…
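Here’s a small sketch of the split half of the approach (the column name and sample phrases are made up): split interprets a StringType column as delimited text and returns an ArrayType column.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq("i like tacos", "we love beaches").toDF("phrase")

// split the phrase column on spaces to get an array<string> column
val arrayDF = df.withColumn("words", split(col("phrase"), " "))

arrayDF.printSchema()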


Spark is a powerful tool for extracting data, running transformations, and loading the results into a data store.

Spark runs computations in parallel so execution is lightning fast and clusters can be scaled up for big data. …


Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren’t sufficient, but should be used sparingly because they’re not performant.

This blog post will demonstrate how to define UDFs and will show how to avoid UDFs, when possible, by leveraging…
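As a hedged sketch of the pattern (the column name and logic are illustrative): a UDF that lowercases a string, followed by the built-in lower function that produces the same result while staying visible to the Catalyst optimizer.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, udf}

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq("New York", "Bogota").toDF("city")

// UDF: a black box to the Catalyst optimizer
val lowerUdf = udf((s: String) => if (s == null) null else s.toLowerCase)
df.withColumn("city_lower", lowerUdf(col("city"))).show()

// Built-in function: same result, and Catalyst can optimize around it
df.withColumn("city_lower", lower(col("city"))).show()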


The Spark Column class defines predicate methods that allow logic to be expressed concisely and elegantly (e.g. isNull, isNotNull, and isin).

spark-daria defines additional Column methods such as isTrue, isFalse, isNullOrBlank, isNotNullOrBlank, and isNotIn to fill in the Spark API gaps.

This blog post will demonstrate how to express logic…
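Here’s a quick sketch with the built-in predicate methods (the DataFrame and column names are made up); the spark-daria methods are chained on columns in the same style.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(
  ("maria", Some("Colombia")),
  ("jose", None),
  ("li", Some("China"))
).toDF("name", "country")

// Rows where country is populated and matches one of the given values
df.where(col("country").isNotNull && col("country").isin("Colombia", "Brazil")).show()

// Rows where country is missing
df.where(col("country").isNull).show()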

Matthew Powers

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
