Different approaches to manually creating Spark DataFrames


toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits.

import spark.implicits._

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.printSchema()
root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)
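One limitation of toDF() is that it infers everything: if you omit the column names entirely, Spark falls back to generated names. A minimal sketch of that behavior, assuming a running SparkSession named `spark` (as in spark-shell):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// Without explicit names, toDF() generates _1, _2, ... column names
val defaultDF = Seq((8, "bat"), (64, "mouse")).toDF()
```

This is why passing explicit names, as above, is the usual practice.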


The createDataFrame() method addresses the limitations of the toDF() method and allows for full schema customization and good Scala coding practices.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
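The explicit schema is what makes createDataFrame() more flexible than toDF(): setting nullable = true lets a column hold nulls, which toDF()'s inferred schema forbids for primitive types like Int. A sketch, assuming a SparkSession named `spark`:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

// A null in an integer column is fine here because nullable = true
val dataWithNull = Seq(Row(8, "bat"), Row(null, "mouse"))
val schema = StructType(List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(dataWithNull),
  schema
)
```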

createDF() is defined in spark-daria and allows for the following terse syntax.

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)

Creating Datasets

Datasets are similar to DataFrames, but preferable at times because they offer compile-time type safety: schema mistakes surface as compilation errors rather than runtime failures.
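A minimal sketch of creating a Dataset from a case class with toDS(); the `Animal` class name and its fields are illustrative, and a SparkSession named `spark` is assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// The case class defines the schema and the element type of the Dataset
case class Animal(number: Int, word: String)

val someDS = Seq(
  Animal(8, "bat"),
  Animal(64, "mouse"),
  Animal(-27, "horse")
).toDS()

// Field access is checked at compile time, unlike string column names
val words = someDS.map(a => a.word.toUpperCase)
```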

Including spark-daria in your projects

The spark-daria README provides the following project setup instructions.

  1. Update your build.sbt file:

libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"

  2. Import the SparkSession extensions into relevant files:

import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

Closing Thoughts

DataFrames are a fundamental data structure that sits at the core of my Spark analyses.



Matthew Powers


Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech