Creating a Java Spark project with Maven and junit

This blog post shows how to write some Spark code with the Java API and run a simple test.

The code snippets in this post are from this GitHub repo.

Project setup

Start by creating a pom.xml file for Maven.

This build file adds Spark SQL as a dependency and sets the Java compiler version so the language features we need for creating DataFrames are available.
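Here’s a rough sketch of what the pom.xml can look like. The group / artifact IDs and the Spark, Scala, and Java versions below are placeholders I’ve filled in for illustration — grab the exact build file from the GitHub repo.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>mrpowers</groupId>
  <artifactId>javaspark</artifactId>
  <version>0.1.0-SNAPSHOT</version>

  <properties>
    <!-- Java 8 gives us the language features the DataFrame code needs -->
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <!-- Spark SQL provides Dataset, Row, and SparkSession -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.3.0</version>
    </dependency>
  </dependencies>
</project>
```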

Write some code

Let’s create a Transformations class with a myCounter method that returns the number of rows in a DataFrame. myCounter would never be useful in a real project, but it’s best to start with a simple example.

Create the src/main/java/mrpowers/javaspark/Transformations.java file.
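Here’s a minimal sketch of the class (the exact return type in the repo might differ — Dataset#count() returns a long, so I’m casting it down for this toy example):

```java
package mrpowers.javaspark;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class Transformations {

  // returns the number of rows in the DataFrame
  public int myCounter(Dataset<Row> df) {
    // Dataset#count() returns a long, so cast it down for this simple example
    return (int) df.count();
  }

}
```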

The Transformations class lives in the mrpowers.javaspark package. Namespacing is important to prevent class name collisions (we don’t want Java to confuse our Transformations class with a class of the same name from another library).

We import the Dataset and Row classes from Spark so they can be accessed in the myCounter function.

We could have imported all of the Spark SQL code, including Dataset and Row, with a single wildcard import (import org.apache.spark.sql.*;). However, wildcard imports make it harder to identify where classes are defined, so it’s generally best to avoid them.

Write a test

Let’s use junit to test the myCounter function.

Add junit as a dependency in the pom.xml file.
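The dependency entry looks something like this (the version number is just an example — use whatever the repo pins):

```xml
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.12</version>
  <!-- only needed on the test classpath -->
  <scope>test</scope>
</dependency>
```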

Our test will create a DataFrame with two rows and verify that the myCounter function returns the integer 2 when it’s passed our DataFrame as an input.

Here’s how our test logic works at a high level: we build the two-row DataFrame, run myCounter on it, and compare the actual result with the expected count.

The junit assertEquals function is where we actually make our assertion to verify that the actual output and expected output match.
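Here’s a tiny, Spark-free junit example (hypothetical, not from the repo) that shows the expected-first, actual-second argument order:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class AssertEqualsExampleTest {

  @Test
  public void comparesExpectedAndActual() {
    int actual = 1 + 1;
    // fails the test with a descriptive message if the two values differ
    assertEquals(2, actual);
  }

}
```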

Let’s take a look at the whole test file in src/test/java/mrpowers/javaspark/TransformationsTest.java.

Brace yourself for some verbose code!
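Here’s a sketch of what the test file can look like. The column names, row values, and schema are made up for illustration, and the code leans on the SparkSessionTestWrapper interface that’s described next — check the repo for the real thing.

```java
package mrpowers.javaspark;

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.junit.Test;

public class TransformationsTest implements SparkSessionTestWrapper {

  @Test
  public void testMyCounter() {
    // build a two-row DataFrame by hand
    List<Row> rows = Arrays.asList(
      RowFactory.create("jose", 1),
      RowFactory.create("li", 2)
    );

    StructType schema = new StructType(new StructField[] {
      new StructField("name", DataTypes.StringType, true, Metadata.empty()),
      new StructField("age", DataTypes.IntegerType, true, Metadata.empty())
    });

    Dataset<Row> df = spark.createDataFrame(rows, schema);

    Transformations transformations = new Transformations();

    // the expected output is 2 because the DataFrame has two rows
    assertEquals(2, transformations.myCounter(df));
  }

}
```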

Let’s create a SparkSessionTestWrapper interface to access the Spark session in our test. The SparkSession is defined in an interface so multiple test files can use the same SparkSession.
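Here’s a sketch of src/test/java/mrpowers/javaspark/SparkSessionTestWrapper.java, assuming a local-mode session (the app name and shuffle-partitions setting are arbitrary choices):

```java
package mrpowers.javaspark;

import org.apache.spark.sql.SparkSession;

public interface SparkSessionTestWrapper {

  // interface fields are implicitly public static final, so every test class
  // that implements this interface shares the same SparkSession instance
  SparkSession spark = SparkSession
      .builder()
      .master("local")
      .appName("spark session")
      .config("spark.sql.shuffle.partitions", "1")
      .getOrCreate();

}
```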

Run the tests with the mvn test command.

Next steps

This tutorial gives us a great foundation to explore more features that all Java Spark programmers need to master. Here are the next steps:

  1. Building JAR files with Maven (similar to building JAR files with SBT)
  2. Chaining custom transformations (we already know how to do this with the Scala API and with PySpark)
  3. Making DataFrame comparisons in the test suite with spark-fast-tests
  4. Using spark-daria in application code

P.S. This is the first Java code I’ve ever written. Please post a comment or email me if you have any suggestions on how to make this code better.

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
