Creating a Spark Project with SBT, IntelliJ, sbt-spark-package, and friends
This blog post will show you how to create a Spark project in SBT, write some tests, and package the code as a JAR file. We’ll start with a brand new IntelliJ project and walk you through every step along the way.
After you understand how to build an SBT project, you’ll be able to rapidly create new projects with the sbt-spark.g8 Giter8 template.
The spark-pika project that we’ll create in this tutorial is available on GitHub.
Add sbt-spark-package
sbt-spark-package is the easiest way to add Spark to an SBT project, even if you’re not building a Spark package. Add the plugin to the project/plugins.sbt file.
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
Select the Spark components that will be used in your project and the Spark version in the build.sbt file.
sparkVersion := "2.2.0"
sparkComponents ++= Seq("sql")
Set the Scala and SBT versions
Spark works best with Scala version 2.11.x and SBT version 0.13.x. Update the build.sbt file with Scala 2.11.12.
scalaVersion := "2.11.12"
Update the project/build.properties file with SBT version 0.13.17.
sbt.version = 0.13.17
Spark doesn’t work with Scala 2.12 (this issue tracks Spark’s progress toward adding Scala 2.12 support). SBT 1.x uses Scala 2.12, so it’s best to stick with SBT 0.13.x when using Spark.
Add the SparkSession
We’ll wrap the SparkSession in a trait, so it’s easily accessible by our classes and objects.
Scala follows the Java convention of deeply nesting code in empty directories. We’ll put the SparkSessionWrapper trait in the following directory structure.
spark-pika/
  src/
    main/
      scala/
        com/
          github/
            mrpowers/
              spark/
                pika/
                  SparkSessionWrapper.scala
The directory structure is a backwards version of the code URL: https://github.com/MrPowers/spark-pika.
The Spark codebase also follows these conventions. You’ll see imports like this when writing Spark code.
import org.apache.spark.sql.DataFrame
IntelliJ does a great job making this directory structure look less ridiculous.
Here’s what the SparkSessionWrapper code looks like.
package com.github.mrpowers.spark.pika

import org.apache.spark.sql.SparkSession

trait SparkSessionWrapper {

  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark pika")
      .getOrCreate()
  }

}
When a class or object extends the SparkSessionWrapper trait, it has access to the session via the spark variable. Starting and stopping a SparkSession is expensive, so our code will run faster if we only create one SparkSession. The getOrCreate() method reuses an existing SparkSession if one is already present. You’ll typically create your own SparkSession when running the code in the development or test environments and use the SparkSession created by a service provider (e.g. Databricks) in production.
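As a quick illustration, here’s a hypothetical object (not part of spark-pika) that mixes in the trait and uses the shared session:

package com.github.mrpowers.spark.pika

import org.apache.spark.sql.DataFrame

// hypothetical example: any object that mixes in SparkSessionWrapper
// gets the lazily created spark session for free
object NameLoader extends SparkSessionWrapper {

  import spark.implicits._

  // builds a single-column DataFrame from an in-memory list of names
  def namesDF(names: Seq[String]): DataFrame = {
    names.toDF("name")
  }

}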
Write some code
Let’s create a Tubular object with a withGoodVibes() DataFrame transformation that appends a chi column to a DataFrame.
package com.github.mrpowers.spark.pika

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object Tubular {

  def withGoodVibes()(df: DataFrame): DataFrame = {
    df.withColumn(
      "chi",
      lit("happy")
    )
  }

}
Read this blog post if you’d like more background information on custom DataFrame transformations.
Let’s run sbt console and try out our code.
$ sbt console
> val df = List("sue", "fan").toDF("name")
> import com.github.mrpowers.spark.pika.Tubular
> val betterDF = df.transform(Tubular.withGoodVibes())
> betterDF.show()
+----+-----+
|name| chi|
+----+-----+
| sue|happy|
| fan|happy|
+----+-----+
Checking if code functions in the console is more time-consuming than simply writing a test.
Add some tests
Let’s add spark-fast-tests and scalatest to the build.sbt file so we can add some tests.
libraryDependencies += "MrPowers" % "spark-fast-tests" % "2.2.0_0.5.0" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
Create a src/test/scala/com/github/mrpowers/spark/pika/TubularSpec.scala file for the test code.
package com.github.mrpowers.spark.pika

import org.scalatest.FunSpec
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

class TubularSpec
    extends FunSpec
    with SparkSessionWrapper
    with DataFrameComparer {

  import spark.implicits._

  describe("withGoodVibes") {

    it("appends a chi column to a DataFrame") {

      val sourceDF = List("sue", "fan").toDF("name")

      val actualDF = sourceDF.transform(Tubular.withGoodVibes())

      val expectedSchema = List(
        StructField("name", StringType, true),
        StructField("chi", StringType, false)
      )

      val expectedData = List(
        Row("sue", "happy"),
        Row("fan", "happy")
      )

      val expectedDF = spark.createDataFrame(
        spark.sparkContext.parallelize(expectedData),
        StructType(expectedSchema)
      )

      assertSmallDataFrameEquality(actualDF, expectedDF)

    }

  }

}
Run the sbt test command and verify that the test is passing.
Let’s use spark-daria to refactor this test and make the code more concise.
Cleaning up the test
spark-daria defines a createDF method that’s terse like toDF() while allowing full control of the DataFrame that’s created, like createDataFrame(). Read this blog post for a full description of the different ways to create DataFrames in Spark.
Add spark-daria to the build.sbt file.
libraryDependencies += "mrpowers" % "spark-daria" % "2.2.0_0.12.0" % "test"
Let’s refactor the test file with createDF.
package com.github.mrpowers.spark.pika

import org.scalatest.FunSpec
import org.apache.spark.sql.types.StringType
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

class TubularSpec
    extends FunSpec
    with SparkSessionWrapper
    with DataFrameComparer {

  describe("withGoodVibes") {

    it("appends a chi column to a DataFrame") {

      val sourceDF = spark.createDF(
        List(
          "sue",
          "fan"
        ), List(
          ("name", StringType, true)
        )
      )

      val actualDF = sourceDF.transform(Tubular.withGoodVibes())

      val expectedDF = spark.createDF(
        List(
          ("sue", "happy"),
          ("fan", "happy")
        ), List(
          ("name", StringType, true),
          ("chi", StringType, false)
        )
      )

      assertSmallDataFrameEquality(actualDF, expectedDF)

    }

  }

}
spark-daria’s createDF function simplifies the import statements, makes the code more readable, and shrinks the test file from 1,008 to 925 characters. I find spark-daria to be essential for all of my projects.
Building the JAR file
Running sbt package will generate a file named spark-pika_2.11-0.0.1.jar.
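For the JAR to come out with that name, build.sbt also needs the usual project coordinates. The name and version below are inferred from the file name; the organization is an assumption based on the package name:

organization := "com.github.mrpowers"
name := "spark-pika"
version := "0.0.1"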
We should follow the spark-style-guide and include the Spark version in the JAR file name, so it’s easier for JAR file consumers to know what version of Spark they should be using.
Update the build.sbt file as follows.
artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "_" + sv.binary + "-" + sparkVersion.value + "_" + module.revision + "." + artifact.extension
}
Now sbt package will generate a file named spark-pika_2.11-2.2.0_0.0.1.jar.
There are a lot of complexities related to packaging JAR files and I’ll cover these in another blog post.
Next Steps
Now that you know how to create SBT projects with Spark, you can use the sbt-spark.g8 Giter8 template to bootstrap new projects.
In the next blog post, we’ll cover advanced SBT tactics for efficiently running tests, generating JAR files, and managing dependencies.