Different approaches to manually create Spark DataFrames
This blog post explains the Spark and spark-daria helper methods to manually create DataFrames for local development or testing.
We’ll demonstrate why the createDF() method defined in spark-daria is better than the createDataFrame() methods from the Spark source code.
See this blog post if you’re working with PySpark (the rest of this post uses Scala).
toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits.
The toDF() method can be called on a sequence object to create a DataFrame.
import spark.implicits._

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")
someDF has the following schema.

root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)
toDF() is limited because the column type and nullable flag cannot be customized. In this example, the number column is not nullable and the word column is nullable.
The import spark.implicits._ statement can only be run inside of class definitions when the SparkSession is available. All imports should be at the top of the file, before the class definition, so toDF() encourages bad Scala coding practices.
toDF() is suitable for local testing, but production grade code that’s checked into master should use a better solution.
The createDataFrame() method addresses the limitations of the toDF() method and allows for full schema customization and good Scala coding practices.
Here is how to create the same DataFrame with createDataFrame().

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will become cluttered and difficult to read if createDataFrame() is used frequently.
createDF() is defined in spark-daria and allows for the following terse syntax.

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)
createDF() creates readable code like toDF() and allows for full schema customization like createDataFrame(). It’s the best of both worlds.
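To illustrate the schema control, here is a minimal sketch (not from the original post, and assuming the same createDF() signature shown above) that puts a null value in a nullable column while keeping the other column non-nullable:

```scala
// Hypothetical example: a nullable "name" column alongside a
// non-nullable "id" column, built with spark-daria's createDF().
val animalsDF = spark.createDF(
  List(
    (1, null),
    (2, "cat")
  ), List(
    ("id", IntegerType, false),
    ("name", StringType, true)
  )
)
```

Expressing the nullable flag per column in the same tuple as the column name and type is what keeps test fixtures compact compared to building StructField objects by hand.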
Datasets are similar to DataFrames, but preferable at times because they offer more type safety.
See this blog post for explanations of how to create Datasets.
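As a quick taste (a minimal sketch, not from the original post), a Dataset is typically created from a sequence of case class instances with toDS(), assuming the Spark implicits are in scope; the Animal case class here is illustrative:

```scala
import spark.implicits._

// Hypothetical case class for demonstration purposes.
case class Animal(number: Int, word: String)

// toDS() gives a Dataset[Animal] instead of an untyped DataFrame,
// so column access is checked at compile time.
val someDS = Seq(
  Animal(8, "bat"),
  Animal(64, "mouse")
).toDS()
```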
Including spark-daria in your projects
The spark-daria README provides the following project setup instructions.
1. Update your build.sbt file:

libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
2. Import the spark-daria code into your project:
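The import looks along these lines (the package path below is the one spark-daria’s documentation uses for the createDF() extension; verify it against the version you install):

```scala
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
```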
DataFrames are a fundamental data structure at the core of my Spark analyses. See this blog post for the different approaches to creating Datasets, a related data structure.
I wrote a Beautiful Spark Code book that teaches the core aspects of Spark development with DataFrames. The book is the best way to learn how to get good at Spark quickly.