The different types of Spark functions (custom transformations, column functions, UDFs)

Spark code can be organized into custom transformations, Column functions, or user defined functions (UDFs).

Here’s how the different functions should be used in general:

  1. Use custom transformations when adding or removing columns or rows from a DataFrame
  2. Use Column functions when you need a custom Spark SQL function that can be defined with the native Spark API
  3. Use native functions (aka Catalyst expressions) when you want high performance execution
  4. Use UDFs when the native Spark API isn’t sufficient and you can’t express the logic you need with Column functions

Custom Transformations

Let’s take a look at some Spark code that’s organized with order dependent variable assignments and then refactor the code with custom transformations.
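
As a minimal sketch of what order dependent variable assignments tend to look like, consider the snippet below. The column names (`greeting`, `farewell`), the object name, and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object OrderDependentExample {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data
    val sourceDF = Seq("funny", "person").toDF("something")

    // df2 can only be built after df1, so the steps are tied together
    // and neither one can be unit tested on its own
    val df1 = sourceDF.withColumn("greeting", lit("hello world"))
    val df2 = df1.withColumn("farewell", lit("goodbye"))
    df2
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("order-dependent-example")
      .getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```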

Let’s refactor this code with custom transformations and see how these can be executed to yield the same result.
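
Here is a sketch of the custom transformation style, assuming two hypothetical transformations named `withGreeting` and `withFarewell` and made-up sample data. Each step is a standalone function, and `Dataset#transform` chains them together.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object GreetingTransforms {
  // Each transformation takes a DataFrame and returns a DataFrame
  def withGreeting()(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit("hello world"))

  def withFarewell()(df: DataFrame): DataFrame =
    df.withColumn("farewell", lit("goodbye"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("custom-transformations-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val sourceDF = Seq("funny", "person").toDF("something")

    // No intermediate vals; the transformations are chained
    val resultDF = sourceDF
      .transform(withGreeting())
      .transform(withFarewell())

    resultDF.show()
    spark.stop()
  }
}
```

Because `withGreeting` and `withFarewell` don't depend on each other, each one can be tested with its own small input DataFrame.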

The custom transformations eliminate the order dependent variable assignments and create code that’s easily testable 😅

Here’s the generic method signature for custom transformations.
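
The signature can be sketched roughly as follows. The names `customTransformName` and `arg1` are placeholders, and the body simply returns its input so the snippet compiles.

```scala
import org.apache.spark.sql.DataFrame

object TransformSignature {
  // The first parameter list holds whatever arguments the transformation needs;
  // the second parameter list takes the DataFrame, which lets the method be
  // passed to Dataset#transform
  def customTransformName(arg1: String)(df: DataFrame): DataFrame = {
    // code that returns a DataFrame goes here
    df
  }
}
```

A method shaped like this would be called as `df.transform(customTransformName("some value"))`.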

Custom transformations should be used when adding columns, removing columns, adding rows, or removing rows from a DataFrame.

This blog post discusses custom transformations in more detail.

Column functions

Column functions return Column objects, similar to the Spark SQL functions. Let’s look at the spark-daria removeAllWhitespace column function.
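
A function like this can be written in a couple of lines. The version below is a sketch based on the native `regexp_replace` function; the actual spark-daria source may differ in its details.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.regexp_replace

object ColumnFunctions {
  // Takes a Column and returns a Column, built only from native Spark functions
  def removeAllWhitespace(col: Column): Column =
    regexp_replace(col, "\\s+", "")
}
```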

Column functions can be used like the Spark SQL functions.
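
The sketch below applies a Column function next to the built-in `lower` function to show that the call sites look the same. It defines a local stand-in for `removeAllWhitespace` so it runs without the spark-daria dependency; the sample data and column names are made up.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, regexp_replace}

object ColumnFunctionUsage {
  // Local stand-in (an assumption, not the spark-daria source)
  def removeAllWhitespace(c: Column): Column =
    regexp_replace(c, "\\s+", "")

  def withCleanedWords()(df: DataFrame): DataFrame =
    df
      .withColumn("lowercased", lower(col("words")))                  // built-in function
      .withColumn("no_whitespace", removeAllWhitespace(col("words"))) // Column function

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("column-function-usage")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val sourceDF = Seq("  I LIKE   food  ", "this    fun").toDF("words")

    sourceDF.transform(withCleanedWords()).show(false)
    spark.stop()
  }
}
```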

Column functions are preferable to user defined functions, as discussed in this blog post.

Catalyst functions

As Sim mentioned in the comments, you can write high performance Spark native functions, also known as Catalyst expressions, if you’re interested in advanced Spark hacking.

Spark native functions are also a great way to learn about how Spark works under the hood.

See this blog post for more information on how to write Spark native functions.

User defined functions

User defined functions are similar to Column functions, but they use pure Scala instead of the Spark API.

Here’s a UDF to lowercase a string.
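
One way to write it is sketched below. The names `toLowerFun` and `LowercaseUdf` are illustrative, and wrapping the input in `Option` is one possible way to handle null values.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

object LowercaseUdf {
  // Plain Scala logic; Option(str) turns a null input into None
  def toLowerFun(str: String): Option[String] =
    Option(str).map(_.toLowerCase)

  // Wrap the Scala function so it can be applied to Columns
  val toLower: UserDefinedFunction = udf[Option[String], String](toLowerFun)
}
```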

Let’s look at the toLower UDF in action.
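
The following sketch applies a `toLower` UDF to a small DataFrame. The UDF is redefined locally so the snippet is self-contained, and the sample data (including a null row to exercise the null handling) is made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object LowercaseUdfUsage {
  // Null-safe lowercase UDF written in plain Scala
  val toLower: UserDefinedFunction =
    udf[Option[String], String]((s: String) => Option(s).map(_.toLowerCase))

  def withLowerText()(df: DataFrame): DataFrame =
    df.withColumn("lower_text", toLower(col("text")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("udf-usage")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with a null value
    val sourceDF = Seq("HI THERE", "Given", null).toDF("text")

    sourceDF.transform(withLowerText()).show()
    spark.stop()
  }
}
```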

This is a contrived example and it’s obviously better to simply use the built-in Spark lower function to downcase a string.

UDFs aren’t desirable because they require complicated null logic and are a black box to the Catalyst optimizer, so Spark can’t optimize them the way it optimizes native functions. See this blog post for more details.

Conclusion

Organize your Spark code as custom transformations and Column functions. Oftentimes, you’ll use Column functions within your custom transformations.

I use the spark-daria functions combined with private Column functions in almost all of the production custom transformations I write.

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech
