Column predicate methods in Spark (isNull, isin, isTrue, isNullOrBlank, etc.)
The Spark Column class defines predicate methods that allow logic to be expressed concisely and elegantly (e.g. isNull, isNotNull, and isin).
spark-daria defines additional Column methods such as isTrue, isFalse, isNullOrBlank, isNotNullOrBlank, and isNotIn to fill in the Spark API gaps.
This blog post will demonstrate how to express logic with the available Column predicate methods.
A quick note on method signatures
By convention, methods with accessor-like names (i.e. methods that begin with "is") are defined as parameterless methods, without an empty parameter list. For example, the isTrue method is defined without parentheses as follows:
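import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

object ColumnExt {
  implicit class ColumnMethods(col: Column) {
    // Treats null as false, so the result is never null.
    def isTrue: Column = {
      when(col.isNull, false).otherwise(col === true)
    }
  }
}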
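To make the naming convention concrete, here is a small sketch in plain Scala, independent of Spark. The Thermostat class and its members are hypothetical and exist only for illustration: a side-effect-free accessor is declared without parentheses, while a method that changes state keeps its empty parentheses.

```scala
// Hypothetical class used only to illustrate the convention;
// it is not part of Spark or spark-daria.
final class Thermostat(private var celsius: Double) {
  // Parameterless method: a pure accessor, declared and called without parentheses.
  def isFreezing: Boolean = celsius <= 0.0

  // Empty-paren method: it mutates state, so the parentheses signal a side effect.
  def reset(): Unit = { celsius = 20.0 }
}

object ThermostatDemo {
  def main(args: Array[String]): Unit = {
    val thermostat = new Thermostat(-5.0)
    println(thermostat.isFreezing) // accessor, no parentheses
    thermostat.reset()             // side effect, parentheses
    println(thermostat.isFreezing)
  }
}
```

Calling col("in_asia").isTrue reads the same way as thermostat.isFreezing: it looks like a property of the column rather than an action.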
isNull, isNotNull, isin
The Spark Column class defines four methods with accessor-like names. Let’s dive in and explore the isNull, isNotNull, and isin methods (isNaN isn’t frequently used, so we’ll ignore it for now).
The isNull method returns true if the column contains a null value and false otherwise.
val sourceDF = spark.createDF(
  List(
    (1, "shakira"),
    (2, "sofia"),
    (3, null)
  ), List(
    ("person_id", IntegerType, true),
    ("person_name", StringType, true)
  )
)
sourceDF.withColumn(
  "is_person_name_null",
  col("person_name").isNull
).show()

+---------+-----------+-------------------+
|person_id|person_name|is_person_name_null|
+---------+-----------+-------------------+
|        1|    shakira|              false|
|        2|      sofia|              false|
|        3|       null|               true|
+---------+-----------+-------------------+
The isNotNull method returns true if the column does not contain a null value, and false otherwise.
sourceDF.withColumn(
  "is_person_name_not_null",
  col("person_name").isNotNull
).show()

+---------+-----------+-----------------------+
|person_id|person_name|is_person_name_not_null|
+---------+-----------+-----------------------+
|        1|    shakira|                   true|
|        2|      sofia|                   true|
|        3|       null|                  false|
+---------+-----------+-----------------------+
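These two methods matter most inside filters, where an ordinary equality check against null does not behave the way newcomers expect. The following is a minimal sketch that uses only Spark's own API (toDF from spark.implicits instead of createDF); the object name NullCheckSketch is made up for this example. It counts the rows matched by isNull, by isNotNull, and by a === null comparison.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object NullCheckSketch {
  // Returns (rows matched by isNull, rows matched by isNotNull, rows matched by === null).
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-check-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, Option[String])](
      (1, Some("shakira")),
      (2, Some("sofia")),
      (3, None)
    ).toDF("person_id", "person_name")

    val nulls = df.filter(col("person_name").isNull).count()
    val notNulls = df.filter(col("person_name").isNotNull).count()
    // A === null comparison evaluates to null for every row,
    // and filter only keeps rows where the condition is true.
    val viaEquals = df.filter(col("person_name") === null).count()

    spark.stop()
    (nulls, notNulls, viaEquals)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The === null filter keeps no rows at all, which is why isNull and isNotNull are the right tools for null checks.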
The isin method returns true if the column is contained in a list of arguments and false otherwise.
val primaryColors = List("red", "yellow", "blue")
val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)
sourceDF.withColumn(
  "is_primary_color",
  col("color").isin(primaryColors: _*)
).show()

+---------+------+----------------+
|celebrity| color|is_primary_color|
+---------+------+----------------+
|  rihanna|   red|            true|
|  solange|yellow|            true|
|   selena|purple|           false|
+---------+------+----------------+
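The primaryColors: _* syntax in the snippet above is worth a second look. isin accepts a variable number of arguments, so a List has to be expanded before it is passed in. Here is a minimal sketch using only Spark's own API; the object name IsinVarargsSketch is made up for this example. It computes the same flag once with explicit arguments and once with an expanded list.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object IsinVarargsSketch {
  // Returns the flags computed with explicit arguments and with list expansion.
  def run(): (Seq[Boolean], Seq[Boolean]) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("isin-varargs-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("rihanna", "red"),
      ("solange", "yellow"),
      ("selena", "purple")
    ).toDF("celebrity", "color")

    val primaryColors = List("red", "yellow", "blue")

    // Values passed one by one as varargs.
    val explicitArgs = df
      .select(col("color").isin("red", "yellow", "blue"))
      .collect().map(_.getBoolean(0)).toSeq

    // The same values held in a List and expanded with `: _*`.
    val expandedList = df
      .select(col("color").isin(primaryColors: _*))
      .collect().map(_.getBoolean(0)).toSeq

    spark.stop()
    (explicitArgs, expandedList)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Both calls build the same predicate, so keeping the allowed values in a named list is purely a readability choice.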
You will use the isNull, isNotNull, and isin methods constantly when writing Spark code. Let’s take a look at some spark-daria Column predicate methods that are also useful when writing Spark code.
spark-daria Column predicate methods
The spark-daria column extensions can be imported into your code with this statement:
import com.github.mrpowers.spark.daria.sql.ColumnExt._
The isTrue method returns true if the column is true, and the isFalse method returns true if the column is false. Both return false when the column is null.
val sourceDF = spark.createDF(
  List(
    ("Argentina", false),
    ("Japan", true),
    (null, null)
  ), List(
    ("country", StringType, true),
    ("in_asia", BooleanType, true)
  )
)
sourceDF.withColumn(
  "in_asia_is_true",
  col("in_asia").isTrue
).withColumn(
  "in_asia_is_false",
  col("in_asia").isFalse
).show()

+---------+-------+---------------+----------------+
|  country|in_asia|in_asia_is_true|in_asia_is_false|
+---------+-------+---------------+----------------+
|Argentina|  false|          false|            true|
|    Japan|   true|           true|           false|
|     null|   null|          false|           false|
+---------+-------+---------------+----------------+
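The null row is where isTrue earns its keep. A plain === true comparison leaves null as null, while isTrue collapses it to false. Below is a minimal sketch of that difference. It reuses the isTrue body quoted earlier in this post as a plain function, so it does not depend on spark-daria, and the object name IsTrueSketch is made up for this example.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical object name, used only for this illustration.
object IsTrueSketch {
  // Same body as the isTrue definition quoted earlier in this post.
  def isTrue(c: Column): Column = when(c.isNull, false).otherwise(c === true)

  // Returns, per row, (raw === true result, isTrue result); None stands for a null cell.
  def run(): Seq[(Option[Boolean], Boolean)] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("is-true-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[String], Option[Boolean])](
      (Some("Argentina"), Some(false)),
      (Some("Japan"), Some(true)),
      (None, None)
    ).toDF("country", "in_asia")

    val result = df
      .select(
        (col("in_asia") === true).as("raw_comparison"),
        isTrue(col("in_asia")).as("in_asia_is_true")
      )
      .collect()
      .map(r => (if (r.isNullAt(0)) None else Some(r.getBoolean(0)), r.getBoolean(1)))
      .toSeq

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

A predicate that never returns null is easier to negate and combine, because you no longer have to reason about three-valued logic at every step.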
The isNullOrBlank method returns true if the column is null, is an empty string, or contains only whitespace.
val sourceDF = spark.createDF(
  List(
    ("water"),
    (" jellyfish"),
    (""),
    (" "),
    (null)
  ), List(
    ("thing", StringType, true)
  )
)
sourceDF.withColumn(
  "is_thing_null_or_blank",
  col("thing").isNullOrBlank
).show()

+-----------+----------------------+
|      thing|is_thing_null_or_blank|
+-----------+----------------------+
|      water|                 false|
|  jellyfish|                 false|
|           |                  true|
|           |                  true|
|       null|                  true|
+-----------+----------------------+
isNotNullOrBlank is the opposite and returns true if the column is neither null nor blank (empty or whitespace-only).
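If you want to see how such a check can be expressed with Spark's built-in functions, here is a minimal sketch. The two helper functions are stand-ins written for this example and assume that "blank" means empty after trimming spaces; spark-daria's actual implementation may differ. The object name NullOrBlankSketch is also made up.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

// Hypothetical object name, used only for this illustration.
object NullOrBlankSketch {
  // Assumed stand-ins: null, empty, or spaces-only counts as blank.
  def isNullOrBlank(c: Column): Column = c.isNull || trim(c) === ""
  def isNotNullOrBlank(c: Column): Column = !isNullOrBlank(c)

  // Returns, per row, (isNullOrBlank result, isNotNullOrBlank result).
  def run(): Seq[(Boolean, Boolean)] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-or-blank-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, Option[String])](
      (1, Some("water")),
      (2, Some(" jellyfish")),
      (3, Some("")),
      (4, Some(" ")),
      (5, None)
    ).toDF("id", "thing")

    val result = df
      .select(isNullOrBlank(col("thing")), isNotNullOrBlank(col("thing")))
      .collect()
      .map(r => (r.getBoolean(0), r.getBoolean(1)))
      .toSeq

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Putting the isNull check first means the null row evaluates to true rather than null, so the negated version is well defined too.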
The isNotIn method returns true if the column is not in a specified list and is the opposite of isin.
val primaryColors = List("red", "yellow", "blue")
val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)
sourceDF.withColumn(
  "is_not_primary_color",
  col("color").isNotIn(primaryColors: _*)
).show()

+---------+------+--------------------+
|celebrity| color|is_not_primary_color|
+---------+------+--------------------+
|  rihanna|   red|               false|
|  solange|yellow|               false|
|   selena|purple|                true|
+---------+------+--------------------+
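Without spark-daria, the same idea can be written by negating isin. The sketch below does that with a helper function that is a stand-in written for this example, not spark-daria's code. It also adds a row with a null color (my addition, not in the data above) to show a caveat of plain negation. The object name IsNotInSketch is made up.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object IsNotInSketch {
  // Assumed stand-in built from Spark's own API; spark-daria's implementation may differ.
  def isNotIn(c: Column, values: Any*): Column = !c.isin(values: _*)

  // Returns (rows kept by isin, rows kept by the negated check).
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("is-not-in-sketch")
      .getOrCreate()
    import spark.implicits._

    val primaryColors = List("red", "yellow", "blue")

    val df = Seq[(String, Option[String])](
      ("rihanna", Some("red")),
      ("selena", Some("purple")),
      ("nobody", None) // hypothetical extra row with a null color
    ).toDF("celebrity", "color")

    val in = df.filter(col("color").isin(primaryColors: _*)).count()
    val notIn = df.filter(isNotIn(col("color"), primaryColors: _*)).count()

    spark.stop()
    (in, notIn)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The null row is kept by neither filter, because isin evaluates to null for a null input and negating null is still null. If null colors should count as "not in the list", add an explicit isNull check.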
Falsy and truthy values
Scala does not have truthy and falsy values, but other programming languages do have the concept of values that are treated as true or false in boolean contexts.
According to Douglas Crockford, falsy values are one of the awful parts of the JavaScript programming language! So it is with great hesitation that I’ve added isTruthy and isFalsy to the spark-daria library.
isFalsy returns true if the value is null or false.
val sourceDF = spark.createDF(
  List(
    (true),
    (false),
    (null)
  ), List(
    ("likes_cheese", BooleanType, true)
  )
)
sourceDF.withColumn(
  "is_likes_cheese_falsy",
  col("likes_cheese").isFalsy
).show()

+------------+---------------------+
|likes_cheese|is_likes_cheese_falsy|
+------------+---------------------+
|        true|                false|
|       false|                 true|
|        null|                 true|
+------------+---------------------+
isTruthy is the opposite and returns true if the value is anything other than null or false.
val sourceDF = spark.createDF(
  List(
    (1),
    (3),
    (null)
  ), List(
    ("odd_number", IntegerType, true)
  )
)
sourceDF.withColumn(
  "is_odd_number_truthy",
  col("odd_number").isTruthy
).show()

+----------+--------------------+
|odd_number|is_odd_number_truthy|
+----------+--------------------+
|         1|                true|
|         3|                true|
|      null|               false|
+----------+--------------------+
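For readers who want to experiment before pulling in the library, here is a minimal sketch of predicates with the same null-or-false behavior. The two functions are stand-ins written for this example and are only exercised on a Boolean column. Non-boolean columns such as odd_number above depend on Spark's type coercion rules, which this sketch deliberately avoids. The object name TruthySketch is made up.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object TruthySketch {
  // Assumed stand-ins for Boolean columns; spark-daria's implementation may differ.
  def isFalsy(c: Column): Column = c.isNull || c === false
  def isTruthy(c: Column): Column = !isFalsy(c)

  // Returns, per row, (isFalsy result, isTruthy result).
  def run(): Seq[(Boolean, Boolean)] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("truthy-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, Option[Boolean])](
      (1, Some(true)),
      (2, Some(false)),
      (3, None)
    ).toDF("id", "likes_cheese")

    val result = df
      .select(isFalsy(col("likes_cheese")), isTruthy(col("likes_cheese")))
      .collect()
      .map(r => (r.getBoolean(0), r.getBoolean(1)))
      .toSeq

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Defining isTruthy as the negation of isFalsy keeps the two predicates exact complements, including on null rows.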
I’m still not sure if it’s a good idea to introduce truthy and falsy values into Spark code, so use this code with caution.
Conclusion
Between Spark and spark-daria, you have a powerful arsenal of Column predicate methods to express logic in your Spark code.
Spark codebases that properly leverage the available methods are easy to maintain and read.