Column predicate methods in Spark (isNull, isin, isTrue, isNullOrBlank, etc.)

The Spark Column class defines predicate methods (e.g. isNull, isNotNull, and isin) that allow logic to be expressed concisely and elegantly.

spark-daria defines additional Column methods such as isTrue, isFalse, isNullOrBlank, isNotNullOrBlank, and isNotIn to fill in the Spark API gaps.

This blog post will demonstrate how to express logic with the available Column predicate methods.

A quick note on method signatures

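spark-daria adds its predicate methods to Column through an implicit class. Like the built-in isNull, they are defined without parentheses, as the isTrue definition shows:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

object ColumnExt {

  // Implicit class that makes the methods below available on every Column
  implicit class ColumnMethods(col: Column) {

    // No parentheses, so it is called as col("x").isTrue; null is treated as false
    def isTrue: Column = {
      when(col.isNull, false).otherwise(col === true)
    }

  }

}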

isNull, isNotNull, isin

The isNull method returns true if the column contains a null value and false otherwise.

val sourceDF = spark.createDF(
  List(
    (1, "shakira"),
    (2, "sofia"),
    (3, null)
  ), List(
    ("person_id", IntegerType, true),
    ("person_name", StringType, true)
  )
)

sourceDF.withColumn(
  "is_person_name_null",
  col("person_name").isNull
).show()
+---------+-----------+-------------------+
|person_id|person_name|is_person_name_null|
+---------+-----------+-------------------+
|        1|    shakira|              false|
|        2|      sofia|              false|
|        3|       null|               true|
+---------+-----------+-------------------+

The isNotNull method returns true if the column does not contain a null value, and false otherwise.

sourceDF.withColumn(
  "is_person_name_not_null",
  col("person_name").isNotNull
).show()
+---------+-----------+-----------------------+
|person_id|person_name|is_person_name_not_null|
+---------+-----------+-----------------------+
|        1|    shakira|                   true|
|        2|      sofia|                   true|
|        3|       null|                  false|
+---------+-----------+-----------------------+
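
Because isNull and isNotNull return a Column, they can also be passed straight to filter. The following is a minimal sketch, assuming a local SparkSession and rebuilding the sample rows with plain toDF instead of createDF; the object and method names (NullFilterSketch, people, named, unnamed) are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark or spark-daria
object NullFilterSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("null-filter-sketch")
    .getOrCreate()

  // Same rows as the sourceDF above, built with plain Spark
  def people(): DataFrame = {
    import spark.implicits._
    Seq[(Int, String)]((1, "shakira"), (2, "sofia"), (3, null))
      .toDF("person_id", "person_name")
  }

  // Keep only the rows where person_name is present
  def named(df: DataFrame): DataFrame = df.filter(col("person_name").isNotNull)

  // Keep only the rows where person_name is missing
  def unnamed(df: DataFrame): DataFrame = df.filter(col("person_name").isNull)

  def main(args: Array[String]): Unit = {
    val df = people()
    named(df).show()   // rows 1 and 2
    unnamed(df).show() // row 3
    spark.stop()
  }
}
```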

The isin method returns true if the column value is contained in the list of arguments and false otherwise.

val primaryColors = List("red", "yellow", "blue")

val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)

sourceDF.withColumn(
  "is_primary_color",
  col("color").isin(primaryColors: _*)
).show()
+---------+------+----------------+
|celebrity| color|is_primary_color|
+---------+------+----------------+
|  rihanna|   red|            true|
|  solange|yellow|            true|
|   selena|purple|           false|
+---------+------+----------------+
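
One detail the example above does not exercise is a null input: isin follows SQL semantics, so a null color produces null rather than false. The sketch below assumes a local SparkSession and adds a hypothetical row with a null color to show this; IsinNullSketch and the "nobody" row are illustrative names, not part of the original data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark or spark-daria
object IsinNullSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("isin-null-sketch")
    .getOrCreate()

  val primaryColors = List("red", "yellow", "blue")

  // Two rows from the example above plus one made-up row with a null color
  def flagged(): DataFrame = {
    import spark.implicits._
    Seq[(String, String)](("rihanna", "red"), ("selena", "purple"), ("nobody", null))
      .toDF("celebrity", "color")
      .withColumn("is_primary_color", col("color").isin(primaryColors: _*))
  }

  def main(args: Array[String]): Unit = {
    // is_primary_color is true, false, and null for the three rows
    flagged().show()
    // filter treats null as "not matched", so only rihanna is kept
    flagged().filter(col("is_primary_color")).show()
    spark.stop()
  }
}
```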

You will use the isNull, isNotNull, and isin methods constantly when writing Spark code. Let’s take a look at some spark-daria Column predicate methods that are also useful when writing Spark code.

spark-daria Column predicate methods

import com.github.mrpowers.spark.daria.sql.ColumnExt._

The isTrue method returns true if the column is true, and the isFalse method returns true if the column is false. Both return false when the column is null.

val sourceDF = spark.createDF(
  List(
    ("Argentina", false),
    ("Japan", true),
    (null, null)
  ), List(
    ("country", StringType, true),
    ("in_asia", BooleanType, true)
  )
)

sourceDF.withColumn(
  "in_asia_is_true",
  col("in_asia").isTrue
).withColumn(
  "in_asia_is_false",
  col("in_asia").isFalse
).show()
+---------+-------+---------------+----------------+
|  country|in_asia|in_asia_is_true|in_asia_is_false|
+---------+-------+---------------+----------------+
|Argentina|  false|          false|            true|
|    Japan|   true|           true|           false|
|     null|   null|          false|           false|
+---------+-------+---------------+----------------+
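
To see why isTrue is handy, it helps to compare it with a plain equality check, which returns null for a null input. The sketch below assumes a local SparkSession; isTrueLike copies the when/otherwise expression from the isTrue definition earlier in this post, while isTrueNullSafe uses Spark's null-safe equality operator (<=>) as one possible alternative. IsTrueSketch and both helper names are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper object, not part of Spark or spark-daria
object IsTrueSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("is-true-sketch")
    .getOrCreate()

  // Same expression as the isTrue definition shown earlier in this post
  def isTrueLike(c: Column): Column = when(c.isNull, false).otherwise(c === true)

  // Alternative (an assumption, not spark-daria's code): null-safe equality
  def isTrueNullSafe(c: Column): Column = c <=> true

  def compare(): DataFrame = {
    import spark.implicits._
    Seq[(String, Option[Boolean])](
      ("Argentina", Some(false)),
      ("Japan", Some(true)),
      (null, None)
    ).toDF("country", "in_asia")
      .withColumn("plain_equals", col("in_asia") === true)       // null for the null row
      .withColumn("when_otherwise", isTrueLike(col("in_asia")))  // false for the null row
      .withColumn("null_safe", isTrueNullSafe(col("in_asia")))   // false for the null row
  }

  def main(args: Array[String]): Unit = {
    compare().show()
    spark.stop()
  }
}
```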

The isNullOrBlank method returns true if the column is null, an empty string, or a string that contains only whitespace.

val sourceDF = spark.createDF(
  List(
    ("water"),
    (" jellyfish"),
    (""),
    (" "),
    (null)
  ), List(
    ("thing", StringType, true)
  )
)

sourceDF.withColumn(
  "is_thing_null_or_blank",
  col("thing").isNullOrBlank
).show()
+-----------+----------------------+
|      thing|is_thing_null_or_blank|
+-----------+----------------------+
|      water|                 false|
|  jellyfish|                 false|
|           |                  true|
|           |                  true|
|       null|                  true|
+-----------+----------------------+

isNotNullOrBlank is the opposite and returns true only if the column is not null and contains at least one non-whitespace character.
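
For readers who cannot add spark-daria to a project, the same check can be approximated with built-in functions. This is a sketch under the assumption that "blank" means empty after trimming, which matches the output above; it is not necessarily how spark-daria implements the method, and NullOrBlankSketch, nullOrBlank, and notNullOrBlank are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

// Hypothetical helper object, not part of Spark or spark-daria
object NullOrBlankSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("null-or-blank-sketch")
    .getOrCreate()

  // Assumed equivalent of isNullOrBlank: null, empty, or whitespace-only
  def nullOrBlank(c: Column): Column = c.isNull || (trim(c) === "")

  // Assumed equivalent of isNotNullOrBlank: simply the negation
  def notNullOrBlank(c: Column): Column = !nullOrBlank(c)

  def things(): DataFrame = {
    import spark.implicits._
    Seq[String]("water", " jellyfish", "", " ", null)
      .toDF("thing")
      .withColumn("is_thing_null_or_blank", nullOrBlank(col("thing")))
      .withColumn("is_thing_not_null_or_blank", notNullOrBlank(col("thing")))
  }

  def main(args: Array[String]): Unit = {
    things().show()
    spark.stop()
  }
}
```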

The isNotIn method returns true if the column is not in a specified list and is the opposite of isin.

val primaryColors = List("red", "yellow", "blue")

val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)

sourceDF.withColumn(
  "is_not_primary_color",
  col("color").isNotIn(primaryColors: _*)
).show()
+---------+------+--------------------+
|celebrity| color|is_not_primary_color|
+---------+------+--------------------+
|  rihanna|   red|               false|
|  solange|yellow|               false|
|   selena|purple|                true|
+---------+------+--------------------+
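
Without spark-daria, the same result can be had by negating isin with the built-in ! operator on Column. The sketch below assumes a local SparkSession and uses it in a filter; NotInSketch and notIn are hypothetical names, and the helper is only an approximation of what isNotIn does.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark or spark-daria
object NotInSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("not-in-sketch")
    .getOrCreate()

  val primaryColors = List("red", "yellow", "blue")

  // Negated isin; an assumed stand-in for spark-daria's isNotIn
  def notIn(c: Column, values: Seq[Any]): Column = !c.isin(values: _*)

  // Celebrities whose color is not a primary color
  def nonPrimary(): DataFrame = {
    import spark.implicits._
    Seq[(String, String)](("rihanna", "red"), ("solange", "yellow"), ("selena", "purple"))
      .toDF("celebrity", "color")
      .filter(notIn(col("color"), primaryColors))
  }

  def main(args: Array[String]): Unit = {
    nonPrimary().show() // only the selena row remains
    spark.stop()
  }
}
```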

Falsy and truthy values

According to Douglas Crockford, falsy values are one of the awful parts of the JavaScript programming language! So it is with great hesitation that I’ve added isTruthy and isFalsy to the spark-daria library.

isFalsy returns true if the value is null or false.

val sourceDF = spark.createDF(
  List(
    (true),
    (false),
    (null)
  ), List(
    ("likes_cheese", BooleanType, true)
  )
)

sourceDF.withColumn(
  "is_likes_cheese_falsy",
  col("likes_cheese").isFalsy
).show()
+------------+---------------------+
|likes_cheese|is_likes_cheese_falsy|
+------------+---------------------+
|        true|                false|
|       false|                 true|
|        null|                 true|
+------------+---------------------+

isTruthy is the opposite and returns true if the value is anything other than null or false.

val sourceDF = spark.createDF(
  List(
    (1),
    (3),
    (null)
  ), List(
    ("odd_number", IntegerType, true)
  )
)

sourceDF.withColumn(
  "is_odd_number_truthy",
  col("odd_number").isTruthy
).show()
+----------+--------------------+
|odd_number|is_odd_number_truthy|
+----------+--------------------+
|         1|                true|
|         3|                true|
|      null|               false|
+----------+--------------------+
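
The falsy and truthy rules described above can be written with built-in Column operators. This is a rough sketch for Boolean columns only, assuming a local SparkSession; it is not spark-daria's implementation, it does not cover non-Boolean columns such as odd_number, and FalsySketch, falsy, and truthy are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark or spark-daria
object FalsySketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("falsy-sketch")
    .getOrCreate()

  // Falsy for a Boolean column: the value is null or false
  def falsy(c: Column): Column = c.isNull || (c === false)

  // Truthy is the negation of falsy
  def truthy(c: Column): Column = !falsy(c)

  def cheese(): DataFrame = {
    import spark.implicits._
    Seq[Option[Boolean]](Some(true), Some(false), None)
      .toDF("likes_cheese")
      .withColumn("is_falsy", falsy(col("likes_cheese")))
      .withColumn("is_truthy", truthy(col("likes_cheese")))
  }

  def main(args: Array[String]): Unit = {
    cheese().show()
    spark.stop()
  }
}
```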

I’m still not sure if it’s a good idea to introduce truthy and falsy values into Spark code, so use this code with caution.

Conclusion

Spark codebases that properly leverage the available methods are easy to maintain and read.
