Column predicate methods in Spark (isNull, isin, isTrue, isNullOrBlank, etc.)

Matthew Powers
4 min read · Dec 24, 2017


The Spark Column class defines predicate methods that allow logic to be expressed concisely and elegantly (e.g. isNull, isNotNull, and isin).

spark-daria defines additional Column methods such as isTrue, isFalse, isNullOrBlank, isNotNullOrBlank, and isNotIn to fill in the Spark API gaps.

This blog post will demonstrate how to express logic with the available Column predicate methods.
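The snippets that follow assume a SparkSession named spark and a handful of imports. Note that createDF, used to build the example DataFrames, is a spark-daria helper rather than part of Spark itself; the import path below is my assumption based on the spark-daria package layout, so double-check it against the version you're using. A minimal setup sketch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType}

// spark-daria helper that provides spark.createDF (assumed import path)
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("column-predicate-examples")
  .getOrCreate()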

A quick note on method signatures

By convention, methods with accessor-like names (i.e. methods that begin with "is") are defined as parameterless methods, without parentheses. For example, the isTrue method is defined without parentheses as follows:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

object ColumnExt {

  implicit class ColumnMethods(col: Column) {

    // null is treated as false, so the result is always true or false
    def isTrue: Column = {
      when(col.isNull, false).otherwise(col === true)
    }

  }

}
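The isFalse method described later follows the same pattern — null is mapped to false rather than propagated. This is a sketch of how it could be written inside the same implicit class, not necessarily the exact spark-daria implementation:

// would live alongside isTrue in ColumnMethods
def isFalse: Column = {
  when(col.isNull, false).otherwise(col === false)
}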

isNull, isNotNull, isin

The Spark Column class defines four methods with accessor-like names: isNull, isNotNull, isin, and isNaN. Let's dive in and explore isNull, isNotNull, and isin (isNaN isn't frequently used, so we'll ignore it for now).

The isNull method returns true if the column contains a null value and false otherwise.

val sourceDF = spark.createDF(
  List(
    (1, "shakira"),
    (2, "sofia"),
    (3, null)
  ), List(
    ("person_id", IntegerType, true),
    ("person_name", StringType, true)
  )
)

sourceDF.withColumn(
  "is_person_name_null",
  col("person_name").isNull
).show()
+---------+-----------+-------------------+
|person_id|person_name|is_person_name_null|
+---------+-----------+-------------------+
| 1| shakira| false|
| 2| sofia| false|
| 3| null| true|
+---------+-----------+-------------------+

The isNotNull method returns true if the column does not contain a null value, and false otherwise.

sourceDF.withColumn(
  "is_person_name_not_null",
  col("person_name").isNotNull
).show()
+---------+-----------+-----------------------+
|person_id|person_name|is_person_name_not_null|
+---------+-----------+-----------------------+
| 1| shakira| true|
| 2| sofia| true|
| 3| null| false|
+---------+-----------+-----------------------+

The isin method returns true if the column is contained in a list of arguments and false otherwise.

val primaryColors = List("red", "yellow", "blue")

val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)

sourceDF.withColumn(
  "is_primary_color",
  col("color").isin(primaryColors: _*)
).show()
+---------+------+----------------+
|celebrity| color|is_primary_color|
+---------+------+----------------+
| rihanna| red| true|
| solange|yellow| true|
| selena|purple| false|
+---------+------+----------------+

You will use the isNull, isNotNull, and isin methods constantly when writing Spark code. Let’s take a look at some spark-daria Column predicate methods that are also useful when writing Spark code.

spark-daria Column predicate methods

The spark-daria column extensions can be brought into your code with this import statement:

import com.github.mrpowers.spark.daria.sql.ColumnExt._

The isTrue method returns true if the column is true, and the isFalse method returns true if the column is false.

val sourceDF = spark.createDF(
  List(
    ("Argentina", false),
    ("Japan", true),
    (null, null)
  ), List(
    ("country", StringType, true),
    ("in_asia", BooleanType, true)
  )
)

sourceDF.withColumn(
  "in_asia_is_true",
  col("in_asia").isTrue
).withColumn(
  "in_asia_is_false",
  col("in_asia").isFalse
).show()
+---------+-------+---------------+----------------+
| country|in_asia|in_asia_is_true|in_asia_is_false|
+---------+-------+---------------+----------------+
|Argentina| false| false| true|
| Japan| true| true| false|
| null| null| false| false|
+---------+-------+---------------+----------------+

The isNullOrBlank method returns true if the column is null, empty, or contains only whitespace.

val sourceDF = spark.createDF(
  List(
    ("water"),
    (" jellyfish"),
    (""),
    (" "),
    (null)
  ), List(
    ("thing", StringType, true)
  )
)

sourceDF.withColumn(
  "is_thing_null_or_blank",
  col("thing").isNullOrBlank
).show()
+-----------+----------------------+
| thing|is_thing_null_or_blank|
+-----------+----------------------+
| water| false|
| jellyfish| false|
| | true|
| | true|
| null| true|
+-----------+----------------------+

isNotNullOrBlank is the opposite and returns true if the column is not null and not blank.
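Reusing the sourceDF from the isNullOrBlank example, the check looks like this (the new column is simply the negation of is_thing_null_or_blank above):

sourceDF.withColumn(
  "is_thing_not_null_or_blank",
  col("thing").isNotNullOrBlank
).show()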

The isNotIn method returns true if the column is not in a specified list and is the opposite of isin.

val primaryColors = List("red", "yellow", "blue")

val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)

sourceDF.withColumn(
  "is_not_primary_color",
  col("color").isNotIn(primaryColors: _*)
).show()
+---------+------+--------------------+
|celebrity| color|is_not_primary_color|
+---------+------+--------------------+
| rihanna| red| false|
| solange|yellow| false|
| selena|purple| true|
+---------+------+--------------------+

Falsy and truthy values

Scala does not have truthy and falsy values, but other programming languages have the concept of values that evaluate to true or false in a boolean context.

According to Douglas Crockford, falsy values are one of the awful parts of the JavaScript programming language! So it is with great hesitation that I’ve added isTruthy and isFalsy to the spark-daria library.

isFalsy returns true if the value is null or false.

val sourceDF = spark.createDF(
  List(
    (true),
    (false),
    (null)
  ), List(
    ("likes_cheese", BooleanType, true)
  )
)

sourceDF.withColumn(
  "is_likes_cheese_falsy",
  col("likes_cheese").isFalsy
).show()
+------------+---------------------+
|likes_cheese|is_likes_cheese_falsy|
+------------+---------------------+
| true| false|
| false| true|
| null| true|
+------------+---------------------+
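Under the hood, isFalsy can be expressed with the same when / otherwise pattern used for isTrue — a sketch of a plausible implementation, not necessarily the exact spark-daria code:

// would live alongside isTrue in ColumnMethods
def isFalsy: Column = {
  when(col.isNull || col === false, true).otherwise(false)
}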

isTruthy is the opposite and returns true if the value is anything other than null or false.

val sourceDF = spark.createDF(
  List(
    (1),
    (3),
    (null)
  ), List(
    ("odd_number", IntegerType, true)
  )
)

sourceDF.withColumn(
  "is_odd_number_truthy",
  col("odd_number").isTruthy
).show()
+----------+--------------------+
|odd_number|is_odd_number_truthy|
+----------+--------------------+
| 1| true|
| 3| true|
| null| false|
+----------+--------------------+

I’m still not sure if it’s a good idea to introduce truthy and falsy values into Spark code, so use this code with caution.

Conclusion

Between Spark and spark-daria, you have a powerful arsenal of Column predicate methods to express logic in your Spark code.

Spark codebases that properly leverage the available methods are easy to maintain and read.
