Adding ArrayType columns to Spark DataFrames with concat_ws and split

The concat_ws and split Spark SQL functions can be used to add ArrayType columns to DataFrames.

Let’s demonstrate the concat_ws / split approach by interpreting a StringType column, and examine when this approach is preferable to the array() function.

String interpretation with the array() method

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark is in scope

val df = Seq(
  "i like blue and red",
  "you pink and blue"
).toDF("word1")

val actualDF = df.withColumn(
  "colors",
  array(
    when(col("word1").contains("blue"), "blue"),
    when(col("word1").contains("red"), "red"),
    when(col("word1").contains("pink"), "pink"),
    when(col("word1").contains("cyan"), "cyan")
  )
)

actualDF.show(truncate = false)
+-------------------+------------------------+
|word1 |colors |
+-------------------+------------------------+
|i like blue and red|[blue, red, null, null] |
|you pink and blue |[blue, null, pink, null]|
+-------------------+------------------------+
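The null entries come straight from when: a when condition without an otherwise clause evaluates to null for rows that don’t match. Here’s a minimal sketch that confirms the behavior (the has_red column name is just for illustration):

df.withColumn(
  "has_red",
  when(col("word1").contains("red"), "red") // no otherwise, so non-matching rows get null
).show(truncate = false)
+-------------------+-------+
|word1              |has_red|
+-------------------+-------+
|i like blue and red|red    |
|you pink and blue  |null   |
+-------------------+-------+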

The array() function unfortunately includes null values in the colors column. We’ll solve the null problem shortly. Let’s refactor the code with a loop first.

val colors = Array("blue", "red", "pink", "cyan")

val actualDF = df.withColumn(
  "colors",
  array(
    colors.map { c: String =>
      when(col("word1").contains(c), c)
    }: _*
  )
)
actualDF.show(truncate = false)
+-------------------+------------------------+
|word1 |colors |
+-------------------+------------------------+
|i like blue and red|[blue, red, null, null] |
|you pink and blue |[blue, null, pink, null]|
+-------------------+------------------------+

map has helped us eliminate the duplication from our code, but it’s a bit harder to read now 😑
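The tricky bit is the : _* syntax: array() takes varargs (Column*), and map returns a collection, so the collection has to be splatted into individual arguments. Here’s a minimal plain-Scala sketch of the same pattern (addAll is a hypothetical helper):

def addAll(nums: Int*): Int = nums.sum

val xs = Seq(1, 2, 3)
addAll(xs: _*) // expands the Seq into varargs: addAll(1, 2, 3)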

The map refactoring can be skipped for smaller lists, but should be used when there is a lot of duplication.

Eliminating null from the arrays

val actualDF = df.withColumn(
  "colors",
  split(
    concat_ws(
      ",",
      when(col("word1").contains("blue"), "blue"),
      when(col("word1").contains("red"), "red"),
      when(col("word1").contains("pink"), "pink"),
      when(col("word1").contains("cyan"), "cyan")
    ),
    ","
  )
)

actualDF.show(truncate = false)
+-------------------+------------+
|word1 |colors |
+-------------------+------------+
|i like blue and red|[blue, red] |
|you pink and blue |[blue, pink]|
+-------------------+------------+

Awesome! Our output looks way better 🎊 😃 This works because concat_ws skips null values when joining, so the nulls never make it into the delimited string that split breaks apart.

Nota bene: A single comma delimiter only works because there are no commas in our example data. Use a better delimiter in your production code!
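One option is a control character that’s extremely unlikely to appear in real text. Here’s a hedged sketch, assuming \u0001 never shows up in your data (split treats its second argument as a regex, and a lone control character simply matches itself):

val delim = "\u0001" // assumption: this character never appears in the data

val actualDF = df.withColumn(
  "colors",
  split(
    concat_ws(
      delim,
      when(col("word1").contains("blue"), "blue"),
      when(col("word1").contains("red"), "red"),
      when(col("word1").contains("pink"), "pink"),
      when(col("word1").contains("cyan"), "cyan")
    ),
    delim
  )
)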

I created an arrayExNull function in spark-daria to wrap the split / concat_ws code, so you can write code that’s a little cleaner.
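Under the hood it’s the same split / concat_ws trick wrapped in a function. A minimal sketch of what such a wrapper could look like (the delimiter here is an assumption; the actual spark-daria implementation may differ):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{concat_ws, split}

def arrayExNull(cols: Column*): Column = {
  split(concat_ws(",,,", cols: _*), ",,,")
}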

import com.github.mrpowers.spark.daria.sql.functions._ // spark-daria

val colors = Array("blue", "red", "pink", "cyan")

val actualDF = df.withColumn(
  "colors",
  arrayExNull(
    colors.map { c: String =>
      when(col("word1").contains(c), c)
    }: _*
  )
)

actualDF.show(truncate = false)
+-------------------+------------+
|word1 |colors |
+-------------------+------------+
|i like blue and red|[blue, red] |
|you pink and blue |[blue, pink]|
+-------------------+------------+

Next steps

Also study how the Scala map function can be used to remove duplication from your Spark code 😉
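For example, map pairs nicely with select when you want one derived column per element of a list. A minimal sketch (the has_ column names are just for illustration):

// build one boolean flag column per color, then select them alongside word1
val flagCols = colors.map(c => col("word1").contains(c).as(s"has_$c"))
df.select(col("word1") +: flagCols: _*).show(truncate = false)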
