The different types of Spark functions (custom transformations, Column functions, UDFs)

  1. Use custom transformations when adding / removing columns or rows from a DataFrame
  2. Use Column functions when you need a custom Spark SQL function that can be defined with the native Spark API
  3. Use native functions (aka Catalyst expressions) when you want high performance execution
  4. Use UDFs when the native Spark API isn’t sufficient and you can’t express the logic you need with Column functions

Custom Transformations

Let’s take a look at some Spark code that’s organized with order dependent variable assignments and then refactor the code with custom transformations.

val df = List(
  ("joao"),
  ("gabriel")
).toDF("first_name")

val df2 = df.withColumn(
  "greeting",
  lit("HEY!")
)

val df3 = df2.withColumn(
  "fav_activity",
  lit("surfing")
)

df3.show()

+----------+--------+------------+
|first_name|greeting|fav_activity|
+----------+--------+------------+
|      joao|    HEY!|     surfing|
|   gabriel|    HEY!|     surfing|
+----------+--------+------------+
Here's the same logic refactored into two custom transformations:

def withGreeting()(df: DataFrame): DataFrame = {
  df.withColumn(
    "greeting",
    lit("HEY!")
  )
}

def withFavActivity()(df: DataFrame): DataFrame = {
  df.withColumn(
    "fav_activity",
    lit("surfing")
  )
}
val df = List(
  ("joao"),
  ("gabriel")
).toDF("first_name")

df
  .transform(withGreeting())
  .transform(withFavActivity())
  .show()

+----------+--------+------------+
|first_name|greeting|fav_activity|
+----------+--------+------------+
|      joao|    HEY!|     surfing|
|   gabriel|    HEY!|     surfing|
+----------+--------+------------+
Custom transformations always follow the same pattern: the first parameter list takes any arguments the transformation needs, and the second takes the DataFrame.

def customTransformName(arg1: String)(df: DataFrame): DataFrame = {
  // code that returns a DataFrame
}
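The curried `(args)(df)` shape is just ordinary Scala function composition, so it can be illustrated without a SparkSession. In the sketch below, `MiniDF` and `withConstantColumn` are made-up stand-ins for `DataFrame` and a custom transformation, not Spark API:

```scala
// Spark-free sketch: transform simply applies a function to the object
// it's called on, which is why curried (args)(df) methods chain cleanly.
case class MiniDF(rows: List[Map[String, String]]) {
  def transform(f: MiniDF => MiniDF): MiniDF = f(this)
}

// First parameter list: configuration; second: the "DataFrame"
def withConstantColumn(name: String, value: String)(df: MiniDF): MiniDF =
  MiniDF(df.rows.map(_ + (name -> value)))

val people = MiniDF(List(
  Map("first_name" -> "joao"),
  Map("first_name" -> "gabriel")
))

val result = people
  .transform(withConstantColumn("greeting", "HEY!"))
  .transform(withConstantColumn("fav_activity", "surfing"))
```

Because `withConstantColumn("greeting", "HEY!")` only fills the first parameter list, it eta-expands to a `MiniDF => MiniDF` function, which is exactly what `transform` expects.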

Column functions

Column functions return Column objects, similar to the Spark SQL functions. Let’s look at the spark-daria removeAllWhitespace column function.

def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}
List(
  ("I LIKE food"),
  (" this fun")
).toDF("words")
  .withColumn(
    "clean_words",
    removeAllWhitespace(col("words"))
  )
  .show()

+--------------+-----------+
|         words|clean_words|
+--------------+-----------+
|   I LIKE food|  ILIKEfood|
|      this fun|    thisfun|
+--------------+-----------+

Catalyst functions

As Sim mentioned in the comments, you can write high performance Spark native functions, also known as Catalyst expressions, if you’re interested in advanced Spark hacking.
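For a taste of what that looks like: a native function is a Catalyst Expression that provides both an interpreted path (nullSafeEval) and a generated-code path (doGenCode), so Catalyst can splice it directly into whole-stage codegen instead of going through a UDF's serialization boundary. The sketch below is illustrative only — RemoveAllWhitespace is a hypothetical expression, not real Spark or spark-daria code, and the exact traits and method signatures vary across Spark versions (newer versions also require withNewChildInternal):

```scala
// Illustrative sketch only -- not tied to a specific Spark version.
case class RemoveAllWhitespace(child: Expression)
    extends UnaryExpression with ExpectsInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
  override def dataType: DataType = StringType

  // Interpreted path: evaluated row by row when codegen is disabled
  override def nullSafeEval(input: Any): Any = {
    val s = input.asInstanceOf[UTF8String].toString
    UTF8String.fromString(s.replaceAll("\\s+", ""))
  }

  // Codegen path: Catalyst splices this Java snippet into generated code
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c =>
      s"""UTF8String.fromString($c.toString().replaceAll("\\\\s+", ""))""")
}
```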

User defined functions

User defined functions are similar to Column functions, but they use pure Scala instead of the Spark API.

def toLowerFun(str: String): Option[String] = {
  val s = Option(str).getOrElse(return None)
  Some(s.toLowerCase())
}

val toLower = udf[Option[String], String](toLowerFun)

val df = List(
  ("HI ThErE"),
  ("ME so HAPPY"),
  (null)
).toDF("phrase")

df
  .withColumn(
    "lower_phrase",
    toLower(col("phrase"))
  )
  .show()

+-----------+------------+
|     phrase|lower_phrase|
+-----------+------------+
|   HI ThErE|    hi there|
|ME so HAPPY| me so happy|
|       null|        null|
+-----------+------------+
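The Option return type is what makes this UDF null-safe: Option(null) collapses to None, which Spark renders as null instead of throwing a NullPointerException. Since the wrapped function is pure Scala, you can check that behavior without a SparkSession — here with a slightly more idiomatic body:

```scala
// Plain Scala, no Spark required. Option(null) is None, so the null
// row short-circuits before toLowerCase can throw an NPE.
def toLowerFun(str: String): Option[String] =
  Option(str).map(_.toLowerCase)

val lowered = toLowerFun("HI ThErE")  // Some("hi there")
val missing = toLowerFun(null)        // None
```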

Conclusion

Organize your Spark code as custom transformations and Column functions. Oftentimes, you'll use Column functions within your custom transformations.

Matthew Powers

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech