Spark User Defined Functions (UDFs)

Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren’t sufficient, but should be used sparingly because they’re opaque to Spark’s optimizer and typically slower than native functions.

This blog post demonstrates how to define UDFs and shows how to avoid them, when possible, by leveraging native Spark functions.

Simple UDF example

Let’s define a UDF that removes all the whitespace and lowercases all the characters in a string.
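Here’s a minimal sketch of such a UDF. It assumes a SparkSession in scope as spark (with spark.implicits._ imported); the column name aaa and the sample strings are made up for illustration.

```scala
import org.apache.spark.sql.functions.{col, udf}

import spark.implicits._

// Plain Scala function that does the string cleanup
def lowerRemoveAllWhitespace(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}

// Wrap it as a Spark UDF so it can be applied to a Column
val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)

val sourceDF = Seq("  HI THERE     ", " GivE mE PresenTS     ").toDF("aaa")

sourceDF.select(
  lowerRemoveAllWhitespaceUDF(col("aaa")).as("clean_aaa")
).show()
```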

This code will unfortunately error out if the DataFrame column contains a null value.
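To see the failure, add a null row (badDF is a name I’m inventing for this example):

```scala
val badDF = Seq(Some("  HI THERE "), None).toDF("aaa")

// Throws a SparkException caused by a NullPointerException, because
// lowerRemoveAllWhitespace calls toLowerCase() on the null value
badDF.select(
  lowerRemoveAllWhitespaceUDF(col("aaa")).as("clean_aaa")
).show()
```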

Let’s rewrite the lowerRemoveAllWhitespaceUDF function so it won’t error out when the DataFrame contains null values.
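One way to sketch this is to wrap the input in an Option so that null inputs map to null outputs (the betterLowerRemoveAllWhitespace helper name is my own):

```scala
// Option(null) is None, so null inputs simply map to None / null
def betterLowerRemoveAllWhitespace(s: String): Option[String] = {
  Option(s).map(_.toLowerCase().replaceAll("\\s", ""))
}

// Redefine the UDF, shadowing the earlier null-unsafe definition
val lowerRemoveAllWhitespaceUDF = udf[Option[String], String](betterLowerRemoveAllWhitespace)

badDF.select(
  lowerRemoveAllWhitespaceUDF(col("aaa")).as("clean_aaa")
).show()
// The null row now yields null instead of throwing
```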

We can use the explain() method to demonstrate that UDFs are a black box for the Spark engine.
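For example (the exact plan text varies by Spark version):

```scala
badDF.select(
  lowerRemoveAllWhitespaceUDF(col("aaa")).as("clean_aaa")
).explain()
// The physical plan contains an opaque UDF(aaa#...) call that the
// Catalyst optimizer cannot see into
```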

Spark doesn’t know how to convert the UDF into native Spark instructions. Let’s use the native Spark library to refactor this code and help Spark generate a physical plan that can be optimized.

Using Column Functions

Let’s define a function that takes a Column argument, returns a Column, and leverages native Spark functions to lowercase and remove all whitespace from a string.
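Here’s a sketch of bestLowerRemoveAllWhitespace built from the native lower and regexp_replace functions:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, regexp_replace}

// Strip all whitespace, then lowercase, entirely with native expressions
def bestLowerRemoveAllWhitespace(col: Column): Column = {
  lower(regexp_replace(col, "\\s+", ""))
}

badDF.select(
  bestLowerRemoveAllWhitespace(col("aaa")).as("clean_aaa")
).show()
```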

Notice that the bestLowerRemoveAllWhitespace function elegantly handles the null case and doesn’t require us to add any special null logic.

Spark can view the internals of the bestLowerRemoveAllWhitespace function and optimize the physical plan accordingly. UDFs are a black box for the Spark engine, whereas functions that take a Column argument and return a Column are transparent to it.
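Running explain() again shows the difference:

```scala
badDF.select(
  bestLowerRemoveAllWhitespace(col("aaa")).as("clean_aaa")
).explain()
// The plan now contains native lower(regexp_replace(...)) expressions
// rather than an opaque UDF call, so Catalyst can optimize around them
```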

Conclusion

Spark UDFs should be avoided whenever possible. If you need to write a UDF, make sure to handle the null case as this is a common cause of errors.

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
