Adding ArrayType columns to Spark DataFrames with concat_ws and split
The concat_ws and split Spark SQL functions can be used to add ArrayType columns to DataFrames.
Let’s demonstrate the concat_ws / split approach by interpreting a StringType column and analyze when this approach is preferable to the array() function.
String interpretation with the array() method
Let’s create a DataFrame with a StringType column and use the array() function to parse out all the colors in the string.
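The snippets that follow assume a SparkSession named spark is already in scope, along with these imports (the implicits are needed for toDF and the functions object provides array, when, col, concat_ws, and split):
import spark.implicits._
import org.apache.spark.sql.functions._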
val df = Seq(
  ("i like blue and red"),
  ("you pink and blue")
).toDF("word1")

val actualDF = df.withColumn(
  "colors",
  array(
    when(col("word1").contains("blue"), "blue"),
    when(col("word1").contains("red"), "red"),
    when(col("word1").contains("pink"), "pink"),
    when(col("word1").contains("cyan"), "cyan")
  )
)
actualDF.show(truncate = false)

+-------------------+------------------------+
|word1              |colors                  |
+-------------------+------------------------+
|i like blue and red|[blue, red, null, null] |
|you pink and blue  |[blue, null, pink, null]|
+-------------------+------------------------+
The array() function unfortunately includes null values in the colors column. We’ll solve the null problem shortly. Let’s refactor the code with a loop first.
val colors = Array("blue", "red", "pink", "cyan")

val actualDF = df.withColumn(
  "colors",
  array(
    colors.map { c: String =>
      when(col("word1").contains(c), c)
    }: _*
  )
)
actualDF.show(truncate = false)

+-------------------+------------------------+
|word1              |colors                  |
+-------------------+------------------------+
|i like blue and red|[blue, red, null, null] |
|you pink and blue  |[blue, null, pink, null]|
+-------------------+------------------------+
map has helped us eliminate the duplication from our code, but it’s a bit harder to read now 😑
The loop cleanup can be skipped for smaller lists, but should be used if there is a lot of duplication.
Eliminating null from the arrays
Let’s use concat_ws and split instead of array to append an ArrayType column without any null values.
val actualDF = df.withColumn(
  "colors",
  split(
    concat_ws(
      ",",
      when(col("word1").contains("blue"), "blue"),
      when(col("word1").contains("red"), "red"),
      when(col("word1").contains("pink"), "pink"),
      when(col("word1").contains("cyan"), "cyan")
    ),
    ","
  )
)
actualDF.show(truncate = false)

+-------------------+------------+
|word1              |colors      |
+-------------------+------------+
|i like blue and red|[blue, red] |
|you pink and blue  |[blue, pink]|
+-------------------+------------+
Awesome! Our output looks way better 🎊 😃
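If you want to double-check that colors really is an ArrayType column now, printSchema should show it as an array of strings:
actualDF.printSchema()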
Nota bene: A single comma delimiter only works because there are no commas in our example data. Use a better delimiter in your production code!
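Here’s a quick sketch of what that might look like, swapping in the ASCII unit separator ("\u001F") purely for illustration; any character you know can’t appear in your data works:
val actualDF = df.withColumn(
  "colors",
  split(
    concat_ws(
      "\u001F", // unit separator: very unlikely to show up in free text
      when(col("word1").contains("blue"), "blue"),
      when(col("word1").contains("red"), "red"),
      when(col("word1").contains("pink"), "pink"),
      when(col("word1").contains("cyan"), "cyan")
    ),
    "\u001F" // split takes a regex, but this character matches literally
  )
)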
I created an arrayExNull function in spark-daria to wrap the split / concat_ws code, so you can write code that’s a little cleaner.
val colors = Array("blue", "red", "pink", "cyan")

val actualDF = df.withColumn(
  "colors",
  arrayExNull(
    colors.map { c: String =>
      when(col("word1").contains(c), c)
    }: _*
  )
)
actualDF.show(truncate = false)

+-------------------+------------+
|word1              |colors      |
+-------------------+------------+
|i like blue and red|[blue, red] |
|you pink and blue  |[blue, pink]|
+-------------------+------------+
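If you’d rather not pull in spark-daria for a single helper, a minimal sketch of an equivalent function is easy to write yourself (the delimiter caveat from above still applies to the hardcoded comma):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{concat_ws, split}

// Build the array via concat_ws + split so null columns are dropped.
def arrayExNull(cols: Column*): Column = {
  split(concat_ws(",", cols: _*), ",")
}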
Next steps
Use the arrayExNull function instead of the array function to append ArrayType columns to DataFrames without null values.
Also study how the Scala map function can be used to remove duplication from your Spark code 😉