Column predicate methods in Spark (isNull, isin, isTrue, isNullOrBlank, etc.)

A quick note on method signatures

By convention, methods with accessor-like names (i.e. methods that begin with "is") are defined as parameterless methods, meaning they are declared and called without parentheses. For example, the isTrue method is defined without parentheses as follows:

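import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

object ColumnExt {

  implicit class ColumnMethods(col: Column) {

    // Returns false for null input, otherwise whether the column equals true.
    def isTrue: Column = {
      when(col.isNull, false).otherwise(col === true)
    }

  }

}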
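To make the naming convention concrete, here is a minimal plain-Scala sketch that contrasts a parameterless method with an empty-paren method. The Switch class and its members are hypothetical names invented for this illustration; they are not part of Spark or spark-daria.

```scala
// Hypothetical class, used only to contrast the two method styles.
final case class Switch(state: Option[Boolean]) {

  // Parameterless method: no side effects, reads like a field access.
  def isOn: Boolean = state.contains(true)

  // Empty-paren method: the parentheses signal a side effect (printing).
  def report(): Unit = println(s"switch state: $state")
}

object AccessorStyleSketch {
  def main(args: Array[String]): Unit = {
    val unknown = Switch(None)
    val on = Switch(Some(true))

    println(unknown.isOn) // prints false; called without parentheses
    println(on.isOn)      // prints true
    on.report()           // prints "switch state: Some(true)"; called with parentheses
  }
}
```

The same reasoning applies to isTrue: it only reads the column and has no side effects, so it is declared without parentheses.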

isNull, isNotNull, isin

The Spark Column class defines four methods with accessor-like names. Let's dive in and explore the isNull, isNotNull, and isin methods (isNaN isn't frequently used, so we'll ignore it for now).

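The examples in this post build DataFrames with createDF, a helper method from spark-daria.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._ // provides createDF

val sourceDF = spark.createDF(
  List(
    (1, "shakira"),
    (2, "sofia"),
    (3, null)
  ), List(
    ("person_id", IntegerType, true),
    ("person_name", StringType, true)
  )
)

The isNull method returns true if the column contains a null value and false otherwise.

sourceDF.withColumn(
  "is_person_name_null",
  col("person_name").isNull
).show()

+---------+-----------+-------------------+
|person_id|person_name|is_person_name_null|
+---------+-----------+-------------------+
|        1|    shakira|              false|
|        2|      sofia|              false|
|        3|       null|               true|
+---------+-----------+-------------------+

The isNotNull method returns true if the column does not contain a null value and false otherwise.

sourceDF.withColumn(
  "is_person_name_not_null",
  col("person_name").isNotNull
).show()

+---------+-----------+-----------------------+
|person_id|person_name|is_person_name_not_null|
+---------+-----------+-----------------------+
|        1|    shakira|                   true|
|        2|      sofia|                   true|
|        3|       null|                  false|
+---------+-----------+-----------------------+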
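
The isNull and isNotNull predicates are often used to filter rows rather than to add a column. The following is a minimal sketch of that usage with the same three rows, assuming a local SparkSession and plain Spark's toDF in place of createDF. The object name, method name, and app name are made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullFilterSketch {

  // Builds the same three rows as the example above, using plain Spark's toDF.
  // Option[String] models the nullable person_name column.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rows: Seq[(Int, Option[String])] =
      Seq((1, Some("shakira")), (2, Some("sofia")), (3, None))
    rows.toDF("person_id", "person_name")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-filter-sketch")
      .getOrCreate()

    val df = people(spark)
    df.filter(col("person_name").isNotNull).show() // keeps person_id 1 and 2
    df.filter(col("person_name").isNull).show()    // keeps person_id 3

    spark.stop()
  }
}
```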

The isin method returns true if the column value is contained in a list of arguments and false otherwise.

val primaryColors = List("red", "yellow", "blue")

val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)

sourceDF.withColumn(
  "is_primary_color",
  col("color").isin(primaryColors: _*)
).show()

+---------+------+----------------+
|celebrity| color|is_primary_color|
+---------+------+----------------+
|  rihanna|   red|            true|
|  solange|yellow|            true|
|   selena|purple|           false|
+---------+------+----------------+
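
One detail the isin example does not show is what happens with null input. Under SQL's three-valued logic, isin evaluates to null (not false) when the column value is null, so a filter on either isin or its negation drops that row. The sketch below illustrates this, assuming a local SparkSession. The row with a null color and the names IsinNullSketch and celebrities are hypothetical additions for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IsinNullSketch {

  val primaryColors: List[String] = List("red", "yellow", "blue")

  // The third row (a null color) is a hypothetical addition, not from the example above.
  def celebrities(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rows: Seq[(String, Option[String])] =
      Seq(("rihanna", Some("red")), ("selena", Some("purple")), ("unknown", None))
    rows.toDF("celebrity", "color")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("isin-null-sketch")
      .getOrCreate()

    val df = celebrities(spark)

    // isin evaluates to null for a null color, and filter drops null predicates.
    df.filter(col("color").isin(primaryColors: _*)).show()  // rihanna only
    df.filter(!col("color").isin(primaryColors: _*)).show() // selena only

    spark.stop()
  }
}
```

The row with the null color appears in neither result, which is worth keeping in mind when a negated isin is used to exclude values.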

spark-daria Column predicate methods

The spark-daria column extensions can be imported into your code with this statement:

import com.github.mrpowers.spark.daria.sql.ColumnExt._

The isTrue method returns true only if the column is true, and the isFalse method returns true only if the column is false. Both return false for null values.

val sourceDF = spark.createDF(
  List(
    ("Argentina", false),
    ("Japan", true),
    (null, null)
  ), List(
    ("country", StringType, true),
    ("in_asia", BooleanType, true)
  )
)

sourceDF.withColumn(
  "in_asia_is_true",
  col("in_asia").isTrue
).withColumn(
  "in_asia_is_false",
  col("in_asia").isFalse
).show()

+---------+-------+---------------+----------------+
|  country|in_asia|in_asia_is_true|in_asia_is_false|
+---------+-------+---------------+----------------+
|Argentina|  false|          false|            true|
|    Japan|   true|           true|           false|
|     null|   null|          false|           false|
+---------+-------+---------------+----------------+
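
The isTrue definition shown at the start of this post suggests how isFalse could be written. Here is a minimal sketch that assumes isFalse mirrors isTrue; it reproduces the table above but is not necessarily spark-daria's actual source. The predicates are written as plain functions in a hypothetical BooleanPredicateSketch object so they do not clash with the ColumnExt implicits.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

// Hypothetical object for illustration; not part of spark-daria.
object BooleanPredicateSketch {

  // Same body as the isTrue definition shown earlier in the post.
  def isTrue(col: Column): Column =
    when(col.isNull, false).otherwise(col === true)

  // Assumed mirror image of isTrue: null maps to false instead of null.
  def isFalse(col: Column): Column =
    when(col.isNull, false).otherwise(col === false)
}
```

Without the when(col.isNull, false) guard, col === false would evaluate to null for null input, so the guard is what makes the result a strict true/false value.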

The isNullOrBlank method returns true if the column is null, empty, or contains only whitespace.

val sourceDF = spark.createDF(
  List(
    ("water"),
    (" jellyfish"),
    (""),
    (" "),
    (null)
  ), List(
    ("thing", StringType, true)
  )
)

sourceDF.withColumn(
  "is_thing_null_or_blank",
  col("thing").isNullOrBlank
).show()

+-----------+----------------------+
|      thing|is_thing_null_or_blank|
+-----------+----------------------+
|      water|                 false|
|  jellyfish|                 false|
|           |                  true|
|           |                  true|
|       null|                  true|
+-----------+----------------------+
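
A predicate with the behavior shown in the table can be sketched with built-in Spark functions by combining isNull with trim. This is a minimal sketch under that assumption, not necessarily how spark-daria implements isNullOrBlank. The BlankPredicateSketch object is a hypothetical name.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.trim

// Hypothetical object for illustration; not part of spark-daria.
object BlankPredicateSketch {

  // True when the value is null, or when nothing is left after trimming spaces.
  // For null input, isNull is true, so the OR evaluates to true rather than null.
  def isNullOrBlank(col: Column): Column =
    col.isNull || trim(col) === ""
}
```

It can be used as BlankPredicateSketch.isNullOrBlank(col("thing")) inside withColumn or filter.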

The isNotIn method returns true if the column value is not in a list of arguments and false otherwise. It is the opposite of isin.

val primaryColors = List("red", "yellow", "blue")

val sourceDF = spark.createDF(
  List(
    ("rihanna", "red"),
    ("solange", "yellow"),
    ("selena", "purple")
  ), List(
    ("celebrity", StringType, true),
    ("color", StringType, true)
  )
)

sourceDF.withColumn(
  "is_not_primary_color",
  col("color").isNotIn(primaryColors: _*)
).show()

+---------+------+--------------------+
|celebrity| color|is_not_primary_color|
+---------+------+--------------------+
|  rihanna|   red|               false|
|  solange|yellow|               false|
|   selena|purple|                true|
+---------+------+--------------------+
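
Since isNotIn is the negation of isin, it can be sketched as a thin wrapper around Spark's built-in not function. The following is a minimal sketch under that assumption, written as a plain function in a hypothetical NotInSketch object; it reproduces the table above but is not necessarily spark-daria's actual source.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.not

// Hypothetical object for illustration; not part of spark-daria.
object NotInSketch {

  // Negates isin. For a null column value, isin is null, so this sketch
  // also evaluates to null rather than true.
  def isNotIn(col: Column, values: Any*): Column =
    not(col.isin(values: _*))
}
```

A call such as NotInSketch.isNotIn(col("color"), primaryColors: _*) passes the list through with the same varargs expansion used in the isin example.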

Falsy and truthy values

Scala does not have truthy and falsy values, but some other programming languages treat certain non-boolean values as true or false in boolean contexts. The isFalsy and isTruthy methods bring a similar idea to Spark columns.

The isFalsy method returns true if the column is false or null.

val sourceDF = spark.createDF(
  List(
    (true),
    (false),
    (null)
  ), List(
    ("likes_cheese", BooleanType, true)
  )
)

sourceDF.withColumn(
  "is_likes_cheese_falsy",
  col("likes_cheese").isFalsy
).show()

+------------+---------------------+
|likes_cheese|is_likes_cheese_falsy|
+------------+---------------------+
|        true|                false|
|       false|                 true|
|        null|                 true|
+------------+---------------------+

The isTruthy method returns true if the column is neither false nor null.

val sourceDF = spark.createDF(
  List(
    (1),
    (3),
    (null)
  ), List(
    ("odd_number", IntegerType, true)
  )
)

sourceDF.withColumn(
  "is_odd_number_truthy",
  col("odd_number").isTruthy
).show()

+----------+--------------------+
|odd_number|is_odd_number_truthy|
+----------+--------------------+
|         1|                true|
|         3|                true|
|      null|               false|
+----------+--------------------+
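
The rule behind the two tables above can be modeled in plain Scala, with None standing in for a SQL null. This is only a model of the values shown in the tables, not spark-daria's implementation. How other values (for example 0 or an empty string) are classified is an assumption of this model and may differ in Spark. TruthinessModel is a hypothetical name.

```scala
// Hypothetical plain-Scala model of the truthy/falsy rule; no Spark required.
object TruthinessModel {

  // None stands in for a SQL null. Null and false are falsy.
  // Treating every other value as truthy is an assumption of this model.
  def isFalsy(value: Option[Any]): Boolean = value match {
    case None        => true
    case Some(false) => true
    case _           => false
  }

  // Truthy is defined as the negation of falsy.
  def isTruthy(value: Option[Any]): Boolean = !isFalsy(value)

  def main(args: Array[String]): Unit = {
    // The values that appear in the two tables above.
    val samples: List[Option[Any]] =
      List(Some(true), Some(false), None, Some(1), Some(3))

    samples.foreach { v =>
      println(s"$v -> falsy=${isFalsy(v)}, truthy=${isTruthy(v)}")
    }
  }
}
```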

Conclusion

Between Spark and spark-daria, you have a powerful arsenal of Column predicate methods to express logic in your Spark code.

Matthew Powers

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech