Column predicate methods in Spark (isNull, isin, isTrue, isNullOrBlank, etc.)

The Spark Column class defines predicate methods that allow logic to be expressed concisely and elegantly (e.g. isNull, isNotNull, and isin).

spark-daria defines additional Column methods such as isTrue, isFalse, isNullOrBlank, isNotNullOrBlank, and isNotIn to fill in the Spark API gaps.

This blog post will demonstrate how to express logic with the available Column predicate methods.

A quick note on method signatures

By convention, methods with accessor-like names (i.e. methods that begin with "is") are defined as parameterless methods, meaning they are declared and called without parentheses. For example, the isTrue method is defined without parentheses.
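Here is a minimal sketch of what a parameterless isTrue definition could look like. The object name ColumnExtSketch is hypothetical, and the body is an assumption about reasonable behavior (null is treated as not true), not a copy of the spark-daria source.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

// Hypothetical container object, used for illustration only.
object ColumnExtSketch {

  // The implicit class adds extra methods to every Column in scope.
  implicit class ColumnMethods(col: Column) {

    // No parameter list at all, so callers write col("x").isTrue
    // rather than col("x").isTrue().
    // Assumption: a null value counts as "not true".
    def isTrue: Column = when(col.isNull, false).otherwise(col === true)
  }
}
```

After `import ColumnExtSketch._`, the method reads like a built-in accessor: `col("is_fun").isTrue`.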

isNull, isNotNull, isin

The Spark Column class defines four methods with accessor-like names. Let’s dive in and explore the isNull, isNotNull, and isin methods (isNaN isn’t frequently used, so we’ll ignore it for now).

The isNull method returns true if the column contains a null value and false otherwise.

The isNotNull method returns true if the column does not contain a null value, and false otherwise.
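A small sketch of both methods side by side, written so it can be pasted into a Scala REPL with Spark on the classpath. The DataFrame contents and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder().master("local[1]").appName("isNull-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: bob has no job recorded.
val people = Seq(("alice", "engineer"), ("bob", null)).toDF("name", "job")

val flagged = people
  .withColumn("job_is_null", col("job").isNull)
  .withColumn("job_is_not_null", col("job").isNotNull)

flagged.show()
// alice: job_is_null = false, job_is_not_null = true
// bob:   job_is_null = true,  job_is_not_null = false
```

Both methods are evaluated row by row, so they work equally well inside filter, where, or when expressions.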

The isin method returns true if the column value is contained in a list of arguments and false otherwise. For a null column value, isin returns null rather than false.
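To make the isin behavior concrete, here is a short sketch with made-up data. It includes a null row because SQL three-valued logic applies: comparing null against a list yields null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder().master("local[1]").appName("isin-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data, including one null value.
val animals = Seq("cat", "zebra", null).toDF("animal")

// isin takes the candidate values as varargs.
val withPetFlag = animals.withColumn("is_pet", col("animal").isin("cat", "dog"))

withPetFlag.show()
// cat   -> true
// zebra -> false
// null  -> null
```

If the candidate values live in a collection, they can be expanded with `isin(pets: _*)`.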

You will use the isNull, isNotNull, and isin methods constantly when writing Spark code. Let’s take a look at some spark-daria Column predicate methods that are also useful when writing Spark code.

spark-daria Column predicate methods

The spark-daria column extensions are brought into scope with a single import.
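Assuming spark-daria is on the classpath, the import should look like the line below. The package path is an assumption based on the spark-daria project layout, so check it against the version you depend on.

```scala
// Assumed package path for the spark-daria Column extensions.
import com.github.mrpowers.spark.daria.sql.ColumnExt._
```

Once the import is in scope, the extra methods can be called on any Column as if Spark had defined them.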

The isTrue method returns true if the column is true, and the isFalse method returns true if the column is false.
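A quick sketch of isTrue and isFalse on a boolean column. It assumes spark-daria is on the classpath and that the import path shown is correct for your version; the data is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
// Assumed import path for the spark-daria Column extensions.
import com.github.mrpowers.spark.daria.sql.ColumnExt._

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder().master("local[1]").appName("isTrue-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val activities = Seq(("hiking", true), ("taxes", false)).toDF("activity", "is_fun")

val flagged = activities
  .withColumn("fun", col("is_fun").isTrue)
  .withColumn("not_fun", col("is_fun").isFalse)

flagged.show()
// hiking: fun = true,  not_fun = false
// taxes:  fun = false, not_fun = true
```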

The isNullOrBlank method returns true if the column is null or contains an empty string.

isNotNullOrBlank is the opposite and returns true if the column does not contain null or the empty string.
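The sketch below runs both methods over a regular string, an empty string, and a null. It assumes spark-daria is on the classpath and that the import path shown is correct for your version; the data is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
// Assumed import path for the spark-daria Column extensions.
import com.github.mrpowers.spark.daria.sql.ColumnExt._

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder().master("local[1]").appName("isNullOrBlank-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: a word, an empty string, and a null.
val words = Seq("hello", "", null).toDF("word")

val flagged = words
  .withColumn("blank", col("word").isNullOrBlank)
  .withColumn("not_blank", col("word").isNotNullOrBlank)

flagged.show()
// "hello": blank = false, not_blank = true
// "":      blank = true,  not_blank = false
// null:    blank = true,  not_blank = false
```

Without these helpers, the same check needs an explicit `col("word").isNull || col("word") === ""` every time it is used.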

The isNotIn method returns true if the column is not in a specified list and is the opposite of isin.
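Here is a short isNotIn sketch. It assumes spark-daria is on the classpath, that the import path shown is correct for your version, and that isNotIn accepts varargs in the same way isin does; the data is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
// Assumed import path for the spark-daria Column extensions.
import com.github.mrpowers.spark.daria.sql.ColumnExt._

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder().master("local[1]").appName("isNotIn-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val animals = Seq("cat", "dog", "zebra").toDF("animal")

// Assumption: isNotIn takes the excluded values as varargs, like isin.
val withWildFlag = animals.withColumn("is_wild", col("animal").isNotIn("cat", "dog"))

withWildFlag.show()
// cat   -> false
// dog   -> false
// zebra -> true
```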

Falsy and truthy values

Scala does not have truthy and falsy values, but other programming languages do have the concept of non-boolean values that are treated as true or false in boolean contexts.

According to Douglas Crockford, falsy values are one of the awful parts of the JavaScript programming language! So it is with great hesitation that I’ve added isTruthy and isFalsy to the spark-daria library.

isFalsy returns true if the value is null or false.

isTruthy is the opposite and returns true if the value is anything other than null or false.
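The sketch below shows how the two methods treat true, false, and null. It assumes spark-daria is on the classpath and that the import path shown is correct for your version; the data is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
// Assumed import path for the spark-daria Column extensions.
import com.github.mrpowers.spark.daria.sql.ColumnExt._

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder().master("local[1]").appName("truthy-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: None becomes a null in the nullable boolean column.
val answers = Seq(("a", Some(true)), ("b", Some(false)), ("c", None)).toDF("id", "flag")

val flagged = answers
  .withColumn("falsy", col("flag").isFalsy)
  .withColumn("truthy", col("flag").isTruthy)

flagged.show()
// true:  falsy = false, truthy = true
// false: falsy = true,  truthy = false
// null:  falsy = true,  truthy = false
```

Note that the result columns never contain null, which is the main difference from comparing the column to true or false directly.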

I’m still not sure if it’s a good idea to introduce truthy and falsy values into Spark code, so use this code with caution.

Conclusion

Between Spark and spark-daria, you have a powerful arsenal of Column predicate methods to express logic in your Spark code.

Spark codebases that properly leverage the available methods are easy to maintain and read.
