Documenting Spark Code with Scaladoc

Matthew Powers
5 min read · Feb 18, 2018

You can use Scaladoc to generate nicely formatted documentation for your Spark projects, just like the official Spark documentation.

Documentation encourages you to write code with clearly defined public interfaces and makes it easier for others to use your code.

The spark-daria project is a good example of an open source project that’s easy to use because the documentation follows Spark’s best practices. This blog will show you how to add documentation to your Spark projects.

How to generate documentation

The sbt doc command generates HTML documentation in the target/scala-2.11/api/ directory. You can open the documentation locally with the open target/scala-2.11/api/index.html command.

Here’s an example of the documentation that’s generated.

spark-daria transformations documentation

Each method that’s documented can be uncollapsed to see additional details about the function.

Uncollapsed documentation view

Here’s the code that generates the beautiful documentation for the removeAllWhitespace() function.

/**
 * Removes all whitespace in a string (e.g. changes `"this has some"` to `"thishassome"`).
 *
 * {{{
 * val actualDF = sourceDF.withColumn(
 *   "some_string_without_whitespace",
 *   removeAllWhitespace(col("some_string"))
 * )
 * }}}
 *
 * @group string_funcs
 * @since 0.16.0
 */
def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}

Scaladoc uses different markup conventions than the GitHub-flavored Markdown you're used to, so you'll need to study up on these Scaladoc conventions:

`monospace`
''italic text''
'''bold text'''
__underline__
{{{
val whatIsThis = "a code snippet!"
}}}
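
For instance, a doc comment that mixes several of these conventions might look like the following sketch. The trimAndLower method here is made up for illustration and isn't part of spark-daria.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, trim}

/**
 * Trims '''and''' lowercases a string column.
 *
 * Useful for normalizing ''messy'' user input before joins. Under the hood
 * the column is processed with `trim` and `lower`.
 *
 * {{{
 * val cleanedDF = sourceDF.withColumn("clean_name", trimAndLower(col("name")))
 * }}}
 */
def trimAndLower(col: Column): Column = {
  lower(trim(col))
}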

The @since annotation indicates that the removeAllWhitespace function has been available in spark-daria since version 0.16.0.

Let’s take a deeper look at the @group annotation that collects related functions to make the documentation more readable.

Grouping functions

The spark-daria functions are grouped as collection functions, date time functions, misc functions, and string functions.

spark-daria function groupings

The spark-daria function groupings are identical to the groups that Spark uses for its SQL functions.

Here is how to define group names with Scaladoc.

/**
 * @groupname datetime_funcs Date time functions
 * @groupname string_funcs String functions
 * @groupname collection_funcs Collection functions
 * @groupname misc_funcs Misc functions
 * @groupname Ungrouped Support functions for DataFrames
 */

Let’s revisit how the removeAllWhitespace function sets the group.

/**
 * @group string_funcs
 */
def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}

You’ll need to add this line to your build.sbt file to properly generate documentation with the groups you’ve specified.

scalacOptions in (Compile, doc) += "-groups"
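
For context, here's a minimal build.sbt sketch showing where that setting lives. The project name and versions are placeholders, not the actual spark-daria build definition.

name := "my-spark-project"
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"

// tell Scaladoc to render the @groupname / @group annotations as sections
scalacOptions in (Compile, doc) += "-groups"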

I recommend studying the Spark codebase and copying their grouping conventions to the extent possible. Set your own conventions for public interfaces that don’t have a precedent set by Spark.

Limiting the public interface with the private keyword

Scaladoc assumes that any class, object, or method that's not flagged as private should be included in the documentation. In other words, anything that's not explicitly flagged as private is assumed to be part of the public API.

It’s easy to be lazy, forget to use the private keyword, and accidentally expose private methods as part of your public interface. Hosting documentation helps me stay disciplined about flagging private stuff, so my documentation isn’t cluttered with implementation details.

Let’s take a look at the spark-daria DataFrameColumnsAbsence class that’s private to the sql package.

package com.github.mrpowers.spark.daria.sql

import org.apache.spark.sql._

case class ProhibitedDataFrameColumnsException(smth: String) extends Exception(smth)

private[sql] class DataFrameColumnsAbsence(df: DataFrame, prohibitedColNames: Seq[String]) {

  val extraColNames = (df.columns.toSeq).intersect(prohibitedColNames)

  def extraColumnsMessage(): String = {
    val extra = extraColNames.mkString(", ")
    val all = df.columns.mkString(", ")
    s"The [${extra}] columns are not allowed to be included in the DataFrame with the following columns [${all}]"
  }

  def validateAbsenceOfColumns(): Unit = {
    if (extraColNames.nonEmpty) {
      throw new ProhibitedDataFrameColumnsException(extraColumnsMessage())
    }
  }

}

This file exposes the ProhibitedDataFrameColumnsException case class, so API consumers can catch this custom exception. The DataFrameColumnsAbsence class is marked as private to the sql package because we don’t want our API consumers to use this class directly.

In other words, the DataFrameColumnsAbsence class will not be included in the Scaladoc API documentation because it’s been marked as private to the sql package. Hiding implementation details from API users makes your software easier to use.
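
Here's a sketch of how an API consumer might catch the custom exception. It assumes spark-daria's public DataFrameValidator trait delegates to the package-private class above and exposes a validateAbsenceOfColumns method; the DataFrame and the prohibited column names are made up for illustration.

import org.apache.spark.sql.SparkSession
import com.github.mrpowers.spark.daria.sql.{DataFrameValidator, ProhibitedDataFrameColumnsException}

object ValidationExample extends DataFrameValidator {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    val df = Seq(("bob", "123-45-6789")).toDF("name", "ssn")

    try {
      // throws if the DataFrame contains any of the prohibited columns
      validateAbsenceOfColumns(df, Seq("ssn"))
    } catch {
      case e: ProhibitedDataFrameColumnsException =>
        // the exception is public, so callers can handle it explicitly
        println(s"Validation failed: ${e.getMessage}")
    }
  }

}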

Individual methods can also be marked as private, so they’re not exposed in the API documentation.

private def toSnakeCase(str: String): String = {
  str
    .replaceAll("\\s+", "_")
    .toLowerCase
}

Private methods are annoying to test with Scala, but it’s possible with the PrivateMethodTester trait. I use package level privacy whenever possible, so it’s easier to test the private methods.
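
If you do need to test a private method directly, a ScalaTest sketch with PrivateMethodTester looks something like this. The transformations object name is a stand-in for whatever object actually defines toSnakeCase.

import org.scalatest.{FunSuite, PrivateMethodTester}

class TransformationsSpec extends FunSuite with PrivateMethodTester {

  test("toSnakeCase converts whitespace to underscores") {
    // build a handle to the private method by name and invoke it reflectively
    val toSnakeCase = PrivateMethod[String]('toSnakeCase)
    val result = transformations invokePrivate toSnakeCase("First Name")
    assert(result === "first_name")
  }

}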

Hosting the documentation

You can easily create a GitHub page to host the documentation for your open source projects. Here’s the repo for my GitHub page and here’s where I store the spark-daria HTML files.

I run the cp -r $SPARK_DARIA_PATH/target/scala-2.11/api/ docs/spark_daria command to copy the HTML files from spark-daria to the mrpowers.github.io repo. The website is automatically updated whenever code is pushed to master.

You can add some authentication and host a static website on S3 for your private repos. Definitely host your documentation on a website and make it accessible for non-technical stakeholders.

Well documented public interfaces help product and project management folks learn more about the software development process.

Next steps

A new developer should be able to clone your Spark project, run sbt doc, and easily view the latest public API of your project. Exposing methods that should be private as part of your public API is a huge code smell.

Start documenting your projects by flagging all of the private methods, so they aren’t exposed in the documentation.

Then add the documentation for each public method with clear descriptions, code snippets, and since / group annotations.

Host the documentation and send the link to all technical and non-technical stakeholders. Documentation will help you communicate better, write code that’s easier to use, and make better abstractions.
