Spark’s Treatment of Empty Strings and Blank Values in CSV Files

Matthew Powers
2 min readJan 19, 2017

--

Blank CSV values were incorrectly loaded into Spark 2.0.0 DataFrames as empty strings and this was fixed in Spark 2.0.1. This post describes the bug fix, explains the correct treatment per the CSV spec, and questions if Spark is following semantic versioning best practices.

CSV Spec

Suppose we have a ~/Desktop/flowers.csv file with the following data.

name,color,is_pretty
rose,red,true
sunflower,,true
lilac,"",true

In this blog, we refer to the color of the sunflower row as “blank” and the color of the lilac row as an “empty string”.

The CSV spec doesn’t differentiate between blank and empty string values. According to the CSV spec, “each field may or may not be enclosed in double quotes” so blank values and empty strings should be treated equally.

The terms “blank value” and “empty string” are used because Spark 2.0.0 treats the values differently. In reality, we shouldn’t use two different terms because the CSV spec considers blank values and empty strings equivalent.

Spark 2.0.0

In Spark 2.0.0, blank values were read into DataFrames as empty strings and empty strings were read into DataFrames as null. No, I didn’t say that backwards — Spark 2.0.0 is backwards!

Here’s a little script to demonstrate this odd behavior.

val homePath = sys.env.get("HOME").getOrElse(None)val flowersPath = s"$homePath/Desktop/flowers.csv"val flowersDf = spark.read
.format("csv")
.option("header", "true")
.option("charset", "UTF8")
.load(flowersPath)
flowersDf.show()

Here’s the output:

+---------+-----+---------+
| name|color|is_pretty|
+---------+-----+---------+
| rose| red| true|
|sunflower| | true|
| lilac| null| true|
+---------+-----+---------+

The color of the sunflower row was blank in the CSV file and is ready into the DataFrame as the empty string.

The color of the lilac row was the empty string in the CSV file and is read into the DataFrame as null.

Per the CSV spec, blank values and empty strings should be treated equally, so the Spark 2.0.0 csv library is wrong!

Spark 2.0.1

Spark 2.0.1 reads in both blank values and the empty string as null values.

Here is the output of the same script we ran earlier, but with Spark 2.0.1:

+---------+-----+---------+
| name|color|is_pretty|
+---------+-----+---------+
| rose| red| true|
|sunflower| null| true|
| lilac| null| true|
+---------+-----+---------+

The color of the sunflower row was blank in the CSV file and is null in the DataFrame.

The color of the lilac row was the empty string in the CSV file and is null in the DataFrame.

Spark 2.0.1 correctly treats blank values and empty strings equally, so it fixes the Spark 2.0.0 bug.

Semantic Versioning

Spark decided to not follow semantic versioning when bumping to version 2.0.1. Bumping the patch version is for “backwards-compatible bug fixes”, not for changes that can break production code!

Onwards

Don’t use Spark 2.0.0 anymore!

--

--

Matthew Powers

Spark coder, live in Colombia / Brazil / US, love Scala / Python / Ruby, working on empowering Latinos and Latinas in tech