Whitespace data munging with Spark

Matthew Powers
4 min readDec 23, 2017

--

The Spark SQL API and spark-daria provide a variety of methods to manipulate whitespace in your DataFrame StringType columns. Inline whitespace data munging with regexp_replace() increases code complexity. You’ll have code that’s easier to read and maintain you if rely on existing data munging functions.

trim(), ltrim(), and rtrim()

Spark provides functions to eliminate leading and trailing whitespace. The trim() function removes both leading and trailing whitespace as shown in the following example.

val sourceDF = spark.createDF(
List(
(" a "),
("b "),
(" c"),
(null)
), List(
("word", StringType, true)
)
)

val actualDF = sourceDF.withColumn(
"trimmed_word",
trim(col("word"))
)
actualDF.show()+----------+------------+
| word|trimmed_word|
+----------+------------+
|" a "| "a"|
| "b "| "b"|
| " c"| "c"|
| null| null|
+----------+------------+

The createDF() method is defined in spark-daria and is superior to the toDF() and createDataFrame() methods offered by Spark, as described in this blog post.

Let’s use the same sourceDF and demonstrate how the ltrim() method removes the leading whitespace.

val sourceDF = spark.createDF(
List(
(" a "),
("b "),
(" c"),
(null)
), List(
("word", StringType, true)
)
)

val actualDF = sourceDF.withColumn(
"ltrimmed_word",
ltrim(col("word"))
)
actualDF.show()+----------+-------------+
| word|ltrimmed_word|
+----------+-------------+
|" a "| "a "|
| "b "| "b "|
| " c"| "c"|
| null| null|
+----------+-------------+

The rtrim() method removes all trailing whitespace from a string - you can easily figure that one out by yourself 😉

singleSpace()

The spark-daria project defines a singleSpace() method that removes all leading and trailing whitespace and replaces all inner whitespace with a single space.

Let’s take a look at an example.

val sourceDF = spark.createDF(
List(
("i like cheese"),
(" the dog runs "),
(null)
), List(
("words", StringType, true)
)
)

val actualDF = sourceDF.withColumn(
"single_spaced",
singleSpace(col("words"))
)
actualDF.show()+-------------------+---------------+
| words| single_spaced|
+-------------------+---------------+
|"i like cheese"|"i like cheese"|
|" the dog runs "| "the dog runs"|
| null| null|
+-------------------+---------------+

Here’s how the singleSpace() function is defined in the spark-daria source code.

def singleSpace(col: Column): Column = {
trim(regexp_replace(col, " +", " "))
}

Defining functions that take a Column argument and return a Column are a powerful way to create custom functions that leverage the Spark functions library.

removeAllWhitespace()

spark-daria defines a removeAllWhitespace() method that removes all whitespace from a string as shown in the following example.

val sourceDF = spark.createDF(
List(
("i like cheese"),
(" the dog runs "),
(null)
), List(
("words", StringType, true)
)
)

val actualDF = sourceDF.withColumn(
"no_whitespace",
removeAllWhitespace(col("words"))
)
actualDF.show()+-------------------+-------------+
| words|no_whitespace|
+-------------------+-------------+
|"i like cheese"|"ilikecheese"|
|" the dog runs "| "thedogruns"|
| null| null|
+-------------------+-------------+

Here is how the removeAllWhitespace() function is defined in spark-daria.

def removeAllWhitespace(col: Column): Column = {
regexp_replace(col, "\\s+", "")
}

antiTrim()

The antiTrim() method keeps all the leading and trailing whitespace, but removes all of the inner whitespace.

val sourceDF = spark.createDF(
List(
("i like cheese"),
(" the dog runs "),
(null)
), List(
("words", StringType, true)
)
)

val actualDF = sourceDF.withColumn(
"anti_trimmed",
antiTrim(col("words"))
)
actualDF.show()+-------------------+-----------------+
| words| anti_trimmed|
+-------------------+-----------------+
|"i like cheese"| "ilikecheese"|
|" the dog runs "|" thedogruns "|
| null| null|
+-------------------+-----------------+

antiTrim() doesn’t have many practical uses, but is included in spark-daria for completeness.

rpad(), lpad()

The rpad() method can be used to add whitespace to the right side of a string. In the following example, we’ll make strings that are 10 characters long by filling in whitespace to the right of the string.

val sourceDF = spark.createDF(
List(
("hi"),
(" bye"),
(null)
), List(
("word", StringType, true)
)
)

val actualDF = sourceDF.withColumn(
"rpadded",
rpad(col("word"), 10, " ")
)
actualDF.show()+-------+------------+
| word| rpadded|
+-------+------------+
| "hi"|"hi "|
|" bye"|" bye "|
| null| null|
+-------+------------+

The lpad() method works similarly, but adds whitespace to the left of a string.

How quotes were displayed in this blogs DataFrame printouts

The show() method doesn’t display quotes around strings, which would have made it hard to understand the whitespace examples in this blog. See the following example.

val sourceDF = spark.createDF(
List(
(" flying"),
("sleeping "),
(null)
), List(
("status", StringType, true)
)
)

sourceDF.show()
+----------+
| status|
+----------+
| flying|
| sleeping |
| null|
+----------+

A showQuotes() helper method was defined to surround the strings in quotes, so the whitespace would be more visible.

def showQuotes(colName: String): Column = {
concat(lit("""""""), col(colName), lit("""""""))
}

lit(""""""") wraps a single double quote character in triple quotes (you can count the seven " characters). This code looks absolutely crazy 😖

The DataFrame printouts in this blog used the showQuotes() method like so:

sourceDF.withColumn(
"status",
showQuotes("status")
).show()
+------------+
| status|
+------------+
|" flying"|
| "sleeping "|
| null|
+------------+

Conclusion

Whitespace data munging in Spark is a solved problem! If you can think of any other whitespace manipulations, let me know or send a pull request to spark-daria.

Resist the temptation to roll your own whitespace data munging solutions. You should be focusing on your business domain, not fighting whitespace!

--

--

Matthew Powers
Matthew Powers

Written by Matthew Powers

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech

Responses (4)