Different approaches to manually create Spark DataFrames

toDF()

toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits.

import spark.implicits._

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.printSchema()
root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)
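
The snippet above assumes a SparkSession named spark is already in scope, as it is in spark-shell. Here is a minimal self-contained sketch, assuming a local session. The object name, app name, and local[*] master are illustrative choices and not part of the original example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ToDFExample {
  // Assumption: a local SparkSession; in spark-shell, `spark` already exists.
  val spark: SparkSession = SparkSession.builder()
    .appName("toDF-example")
    .master("local[*]")
    .getOrCreate()

  // Brings toDF() into scope for local Seqs of tuples.
  import spark.implicits._

  val someDF: DataFrame = Seq(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ).toDF("number", "word")

  def main(args: Array[String]): Unit = {
    someDF.printSchema()
    // Column types and nullable flags are inferred from the Scala tuple types.
    someDF.schema.fields.foreach { f =>
      println(s"${f.name}: ${f.dataType}, nullable=${f.nullable}")
    }
    spark.stop()
  }
}
```

Inspecting someDF.schema this way makes it easy to see that toDF() chose the column types and nullable flags for you.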

createDataFrame()

The createDataFrame() method addresses the limitations of toDF(), which does not let you customize column types or nullable flags. It allows for full schema customization and good Scala coding practices.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
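
To illustrate what schema customization buys you, here is a small sketch in which the number column is declared nullable and actually holds a null, while the word column is declared non-nullable. The object name, the local session settings, and the sample rows are assumptions made for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDataFrameExample {
  // Assumption: a local SparkSession for experimentation.
  val spark: SparkSession = SparkSession.builder()
    .appName("createDataFrame-example")
    .master("local[*]")
    .getOrCreate()

  // The null in the first row is allowed because the schema below
  // marks the number column as nullable.
  val someData: Seq[Row] = Seq(
    Row(null, "cat"),
    Row(64, "mouse")
  )

  val someSchema: StructType = StructType(List(
    StructField("number", IntegerType, nullable = true),
    StructField("word", StringType, nullable = false)
  ))

  val someDF: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(someData),
    someSchema
  )

  def main(args: Array[String]): Unit = {
    someDF.printSchema()
    // Show only the rows where number is missing.
    someDF.filter(col("number").isNull).show()
    spark.stop()
  }
}
```

Unlike the toDF() example, the nullable flags here are exactly the ones written in the schema.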

createDF()

createDF() is defined in spark-daria and allows for the following terse syntax.

import org.apache.spark.sql.types._
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)
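
To get a feel for what this kind of syntax saves you, here is a rough sketch of a similar helper built only on the standard Spark API. This is not spark-daria's implementation: createDFSketch and the CreateDFSketch object are hypothetical names made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

object CreateDFSketch {
  // Hypothetical helper (not part of spark-daria): turns tuples into Rows and
  // (name, type, nullable) triples into StructFields, then calls createDataFrame().
  def createDFSketch(
      spark: SparkSession,
      rows: List[Product],
      fields: List[(String, DataType, Boolean)]
  ): DataFrame = {
    val rowData = rows.map(p => Row.fromSeq(p.productIterator.toList))
    val schema = StructType(fields.map { case (name, dataType, nullable) =>
      StructField(name, dataType, nullable)
    })
    spark.createDataFrame(spark.sparkContext.parallelize(rowData), schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession for experimentation.
    val spark = SparkSession.builder()
      .appName("createDF-sketch")
      .master("local[*]")
      .getOrCreate()

    val someDF = createDFSketch(
      spark,
      List((8, "bat"), (64, "mouse"), (-27, "horse")),
      List(("number", IntegerType, true), ("word", StringType, true))
    )
    someDF.printSchema()
    spark.stop()
  }
}
```

The sketch shows the boilerplate (wrapping values in Row, building StructFields, parallelizing) that a terse helper hides.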

Creating Datasets

Datasets are similar to DataFrames, but preferable at times because they offer more type safety.
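
As a minimal sketch of that type safety, a Dataset can be built from a case class with toDS(), which is also available through the Spark implicits. The Animal case class, the object name, and the local session settings are assumptions made for this illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetExample {
  // Hypothetical case class describing one row of the example data.
  case class Animal(number: Int, word: String)

  // Assumption: a local SparkSession for experimentation.
  val spark: SparkSession = SparkSession.builder()
    .appName("dataset-example")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val animals: Dataset[Animal] = Seq(
    Animal(8, "bat"),
    Animal(64, "mouse"),
    Animal(-27, "horse")
  ).toDS()

  // Fields are accessed as typed Scala members, so a misspelled field name
  // or a type mismatch is caught at compile time.
  val longWords: Dataset[Animal] = animals.filter(a => a.word.length > 3)

  def main(args: Array[String]): Unit = {
    longWords.show()
    spark.stop()
  }
}
```

With a DataFrame, the same filter would refer to the column by a string name, and a typo would only surface at runtime.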

Including spark-daria in your projects

The spark-daria README provides the following project setup instructions.

  1. Update your build.sbt file.
     libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
  2. Import the spark-daria code into your project.
     import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

Closing Thoughts

DataFrames are a fundamental data structure and are at the core of my Spark analyses.

Matthew Powers
