Spark User Defined Functions (UDFs)

Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren’t sufficient, but they should be used sparingly because Spark’s optimizer can’t inspect them, so they’re often not performant.

This blog post demonstrates how to define UDFs and shows how to avoid them, when possible, by leveraging native Spark functions.

Simple UDF example

Let’s define a UDF that removes all the whitespace and lowercases all the characters in a string.
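A minimal sketch of such a UDF in Scala (the names lowerRemoveAllWhitespace and lowerRemoveAllWhitespaceUDF are illustrative):

```scala
import org.apache.spark.sql.functions.udf

// Lowercase the string and strip out all whitespace characters
def lowerRemoveAllWhitespace(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}

// Wrap the plain Scala function as a Spark UDF so it can be
// applied to DataFrame columns
val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)
```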

This code will unfortunately error out if the DataFrame column contains a null value.
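For example, assuming a spark-shell session and a DataFrame with a nullable word column, applying the UDF would fail roughly like this:

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(Some("  HI THERE "), None).toDF("word")

// Throws a NullPointerException on the null row, because
// s.toLowerCase() is invoked on a null reference inside the UDF
df.withColumn("clean", lowerRemoveAllWhitespaceUDF(col("word"))).show()
```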

Let’s write a function that won’t error out when the DataFrame contains null values.
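One sketch, using Option so that null inputs map to null outputs instead of throwing (the name betterLowerRemoveAllWhitespace is illustrative):

```scala
// Option(s) is None when s is null, so null rows pass through safely
def betterLowerRemoveAllWhitespace(s: String): Option[String] = {
  Option(s).map(_.toLowerCase().replaceAll("\\s", ""))
}

val betterLowerRemoveAllWhitespaceUDF =
  udf[Option[String], String](betterLowerRemoveAllWhitespace)
```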

We can use the explain() method to demonstrate that UDFs are a black box for the Spark engine.
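Running explain() on a query that uses the UDF makes this visible (exact plan text varies by Spark version):

```scala
df.withColumn("clean", betterLowerRemoveAllWhitespaceUDF(col("word"))).explain()

// The physical plan includes something like UDF(word#0), an opaque
// call that the Catalyst optimizer cannot inspect or optimize
```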

Spark doesn’t know how to convert the UDF into native Spark instructions. Let’s use the native functions in org.apache.spark.sql.functions to refactor this code and help Spark generate a physical plan that can be optimized.

Using Column Functions

Let’s define a function that takes a Column argument, returns a Column, and leverages native Spark functions to lowercase and remove all whitespace from a string.
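A sketch of that refactor (the name bestLowerRemoveAllWhitespace is illustrative; lower and regexp_replace come from org.apache.spark.sql.functions):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lower, regexp_replace}

// Built entirely from native Spark expressions: Column in, Column out
def bestLowerRemoveAllWhitespace(c: Column): Column = {
  lower(regexp_replace(c, "\\s+", ""))
}

df.withColumn("clean", bestLowerRemoveAllWhitespace(col("word"))).show()
```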

Notice that this Column function elegantly handles the null case and does not require us to add any special logic: lower and regexp_replace simply propagate null inputs to null outputs.

Spark can view the internals of this function and optimize the physical plan accordingly. UDFs are a black box for the Spark engine, whereas functions that take a Column argument and return a Column are not.
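Calling explain() on the refactored query shows the difference (again, plan text varies by version):

```scala
df.withColumn("clean", bestLowerRemoveAllWhitespace(col("word"))).explain()

// The plan now contains lower(regexp_replace(word#0, ...)) directly,
// an expression tree that Catalyst can analyze and optimize
```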

Conclusion

Spark UDFs should be avoided whenever possible. If you need to write a UDF, make sure to handle the null case, as this is a common cause of errors.

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
