Adding ArrayType columns to Spark DataFrames with concat_ws and split

Matthew Powers
Jan 15, 2018 · 2 min read

The concat_ws and split Spark SQL functions can be used to add ArrayType columns to DataFrames.
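
The trick is that concat_ws skips null arguments, and split turns the joined string back into an array, so the round trip drops the nulls. Here’s a minimal sketch of that mechanism (spark.range(1) just provides a one-row dummy DataFrame):

import org.apache.spark.sql.functions._

// concat_ws ignores nulls, so the join / split round trip drops them
val joined = concat_ws(",", lit("a"), lit(null).cast("string"), lit("c"))

spark.range(1).select(
  joined.as("joined"),          // a,c
  split(joined, ",").as("arr")  // [a, c]
).show()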

Let’s demonstrate the concat_ws / split approach by interpreting a StringType column and analyzing when this approach is preferable to the array() function.

String interpretation with the array() method

Let’s create a DataFrame with a StringType column and use the array() function to parse out all the colors in the string.

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF

val df = Seq(
  "i like blue and red",
  "you pink and blue"
).toDF("word1")

val actualDF = df.withColumn(
  "colors",
  array(
    when(col("word1").contains("blue"), "blue"),
    when(col("word1").contains("red"), "red"),
    when(col("word1").contains("pink"), "pink"),
    when(col("word1").contains("cyan"), "cyan")
  )
)

actualDF.show(truncate = false)
+-------------------+------------------------+
|word1 |colors |
+-------------------+------------------------+
|i like blue and red|[blue, red, null, null] |
|you pink and blue |[blue, null, pink, null]|
+-------------------+------------------------+
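
We can confirm that colors is an ArrayType column by printing the schema (the output below is what I’d expect on Spark 2.x; the nullability flags may vary by version):

actualDF.printSchema()
// root
//  |-- word1: string (nullable = true)
//  |-- colors: array (nullable = false)
//  |    |-- element: string (containsNull = true)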

The array() function unfortunately includes null values in the colors column. We’ll solve the null problem shortly. First, let’s refactor the duplicated when calls with map.

val colors = Array("blue", "red", "pink", "cyan")

val actualDF = df.withColumn(
  "colors",
  array(
    colors.map { c =>
      when(col("word1").contains(c), c)
    }: _*
  )
)
actualDF.show(truncate = false)
+-------------------+------------------------+
|word1 |colors |
+-------------------+------------------------+
|i like blue and red|[blue, red, null, null] |
|you pink and blue |[blue, null, pink, null]|
+-------------------+------------------------+

map has helped us eliminate the duplication from our code, but it’s a bit harder to read now 😑

The map cleanup can be skipped for smaller lists, but it should be used when there is a lot of duplication. One way to recover readability is to pull the mapping step into a named helper, as sketched below.
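
Here’s one such sketch (whenContains is a hypothetical helper name, not part of the original code):

import org.apache.spark.sql.Column

// Hypothetical helper: one when column per color we want to detect
def whenContains(colors: Seq[String]): Seq[Column] =
  colors.map(c => when(col("word1").contains(c), c))

val actualDF = df.withColumn("colors", array(whenContains(colors): _*))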

Eliminating null from the arrays

Let’s use concat_ws and split instead of array to append an ArrayType column without any null values.

val actualDF = df.withColumn(
  "colors",
  split(
    concat_ws(
      ",",
      when(col("word1").contains("blue"), "blue"),
      when(col("word1").contains("red"), "red"),
      when(col("word1").contains("pink"), "pink"),
      when(col("word1").contains("cyan"), "cyan")
    ),
    ","
  )
)

actualDF.show(truncate = false)
+-------------------+------------+
|word1 |colors |
+-------------------+------------+
|i like blue and red|[blue, red] |
|you pink and blue |[blue, pink]|
+-------------------+------------+

Awesome! Our output looks way better 🎊 😃

Nota bene: a single comma delimiter only works because there are no commas in our example data. Use a safer delimiter in your production code!
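
For example, a non-printing control character is unlikely to collide with real text. This is a sketch, assuming \u0001 never appears in word1. Note also that a row matching no colors ends up as Array("") rather than an empty array, because splitting an empty string yields one empty element.

val delim = "\u0001" // unlikely to appear in natural-language text

val actualDF = df.withColumn(
  "colors",
  split(
    concat_ws(
      delim,
      when(col("word1").contains("blue"), "blue"),
      when(col("word1").contains("red"), "red"),
      when(col("word1").contains("pink"), "pink"),
      when(col("word1").contains("cyan"), "cyan")
    ),
    delim
  )
)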

I created an arrayExNull function in spark-daria to wrap the split / concat_ws code, so you can write code that’s a little cleaner.

import com.github.mrpowers.spark.daria.sql.functions._ // for arrayExNull

val colors = Array("blue", "red", "pink", "cyan")

val actualDF = df.withColumn(
  "colors",
  arrayExNull(
    colors.map { c =>
      when(col("word1").contains(c), c)
    }: _*
  )
)

actualDF.show(truncate = false)
+-------------------+------------+
|word1 |colors |
+-------------------+------------+
|i like blue and red|[blue, red] |
|you pink and blue |[blue, pink]|
+-------------------+------------+

Next steps

Use the arrayExNull function instead of the array function to append ArrayType columns to DataFrames without null values.
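
If you’d rather not pull in spark-daria, a minimal wrapper over the same split / concat_ws trick might look like this (a sketch, not spark-daria’s actual implementation, which may use a safer delimiter):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// arrayExNull-style helper: concat_ws drops the nulls,
// then split rebuilds the array
def arrayExNull(cols: Column*): Column =
  split(concat_ws(",", cols: _*), ",")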

Also study how the Scala map function can be used to remove duplication from your Spark code 😉

