Documenting Spark Code with Scaladoc

You can use Scaladoc to generate nicely formatted documentation for your Spark projects, just like the official Spark documentation.

Documentation encourages you to write code with clearly defined public interfaces and makes it easier for others to use your code.

The spark-daria project is a good example of an open source project that’s easy to use because the documentation follows Spark’s best practices. This blog will show you how to add documentation to your Spark projects.

How to generate documentation

The `sbt doc` command generates HTML documentation in the `target/scala-2.11/api/` directory (the exact path depends on your Scala version). You can open the documentation locally with the `open target/scala-2.11/api/index.html` command.

Here’s an example of the documentation that’s generated.

spark-daria transformations documentation

Each method that’s documented can be uncollapsed to see additional details about the function.

Uncollapsed documentation view

Here’s the code that generates the beautiful documentation for the `removeAllWhitespace` function.

/**
 * Removes all whitespace in a string
 *
 * {{{
 * val actualDF = sourceDF.withColumn(
 *   "some_string_without_whitespace",
 *   removeAllWhitespace(col("some_string"))
 * )
 * }}}
 *
 * Removes all whitespace in a string (e.g. changes `"this has some"` to `"thishassome"`).
 *
 * @group string_funcs
 * @since 0.16.0
 */
def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}

Scaladoc uses different markup conventions than the GitHub-flavored Markdown you’re used to, so you’ll need to study up on these Scaladoc conventions:

`monospace`
''italic text''
'''bold text'''
__underline__
{{{
val whatIsThis = "a code snippet!"
}}}
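Here’s a quick illustrative example of these conventions inside a Scaladoc comment (the method itself is just a placeholder, not part of spark-daria):

/**
 * Appends `_copy` to a string (a toy example to show the markup).
 *
 * Supports ''italic text'', '''bold text''', __underlined text__, and code snippets:
 *
 * {{{
 * appendCopy("report") // "report_copy"
 * }}}
 */
def appendCopy(str: String): String = str + "_copy"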

The `@since` annotation indicates that the function has been available in spark-daria since version 0.16.0.

Let’s take a deeper look at the `@group` annotation that collects related functions to make the documentation more readable.

Grouping functions

The spark-daria functions are grouped as collection functions, date time functions, misc functions, and string functions.

spark-daria function groupings

The spark-daria function groupings are identical to the groups that Spark uses for its SQL functions.

Here is how to define the group names with Scaladoc.

/**
 * @groupname datetime_funcs Date time functions
 * @groupname string_funcs String functions
 * @groupname collection_funcs Collection functions
 * @groupname misc_funcs Misc functions
 * @groupname Ungrouped Support functions for DataFrames
 */

Let’s revisit how the `removeAllWhitespace` function sets the group.

/**
 * @group string_funcs
 */
def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}

You’ll need to add this line to your `build.sbt` file to properly generate documentation with the groups you’ve specified.

scalacOptions in (Compile, doc) += "-groups"
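For reference, here’s roughly what that looks like in a minimal `build.sbt` (the project name and versions below are just placeholders):

// build.sbt (project name and versions are illustrative)
name := "my-spark-project"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"

// enables the @group / @groupname annotations when generating Scaladoc
scalacOptions in (Compile, doc) += "-groups"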

I recommend studying the Spark codebase and copying their grouping conventions to the extent possible. Set your own conventions for public interfaces that don’t have a precedent set by Spark.

Limiting the public interface with the private keyword

Scaladoc assumes that any class, object, or method that’s not flagged as private should be included in the documentation. In other words, anything that’s not explicitly flagged as private is assumed to be part of the public API.

It’s easy to be lazy, forget to use the private keyword, and accidentally expose private methods as part of your public interface. Hosting documentation helps me stay disciplined about flagging private stuff, so my documentation isn’t cluttered with implementation details.

Let’s take a look at the spark-daria `DataFrameColumnsAbsence` class that’s private to the `sql` package.

package com.github.mrpowers.spark.daria.sql

import org.apache.spark.sql._

case class ProhibitedDataFrameColumnsException(smth: String) extends Exception(smth)

private[sql] class DataFrameColumnsAbsence(df: DataFrame, prohibitedColNames: Seq[String]) {

  val extraColNames = (df.columns.toSeq).intersect(prohibitedColNames)

  def extraColumnsMessage(): String = {
    val extra = extraColNames.mkString(", ")
    val all = df.columns.mkString(", ")
    s"The [${extra}] columns are not allowed to be included in the DataFrame with the following columns [${all}]"
  }

  def validateAbsenceOfColumns(): Unit = {
    if (extraColNames.nonEmpty) {
      throw new ProhibitedDataFrameColumnsException(extraColumnsMessage())
    }
  }

}

This file exposes the `ProhibitedDataFrameColumnsException` case class, so API consumers can catch this custom exception. The `DataFrameColumnsAbsence` class is marked as private to the package because we don’t want our API consumers to use this class directly.

In other words, the `DataFrameColumnsAbsence` class will not be included in the Scaladoc API documentation because it’s been marked as private to the `sql` package. Hiding implementation details from API users makes your software easier to use.
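Here’s a rough sketch of what this looks like from the consumer’s side. The `runValidations` method below is a made-up stand-in for whatever spark-daria validation your application invokes; the point is that application code only ever references the public exception, never the package-private class:

import org.apache.spark.sql.DataFrame
import com.github.mrpowers.spark.daria.sql.ProhibitedDataFrameColumnsException

// hypothetical application method that triggers a spark-daria validation
// under the hood (details omitted)
def runValidations(df: DataFrame): Unit = ???

def safeTransform(df: DataFrame): Unit = {
  try {
    runValidations(df)
  } catch {
    case e: ProhibitedDataFrameColumnsException =>
      // the public case class is all the consumer ever needs to catch
      println(s"Invalid DataFrame: ${e.getMessage}")
  }
}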

Individual methods can also be marked as private, so they’re not exposed in the API documentation.

private def toSnakeCase(str: String): String = {
  str
    .replaceAll("\\s+", "_")
    .toLowerCase
}

Private methods are annoying to test in Scala, but it’s possible with the PrivateMethodTester trait. I use package-level privacy whenever possible, so it’s easier to test the private methods.
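For example, here’s a small sketch (the object and test names are made up) of how package-level privacy keeps a helper out of the public API while still letting the test suite call it directly. The ScalaTest import may differ depending on your version:

// src/main/scala/com/github/mrpowers/spark/daria/sql/StringHelpers.scala
package com.github.mrpowers.spark.daria.sql

object StringHelpers {

  // hidden from the public Scaladoc API, but visible to the test suite
  // because the tests live in the same sql package
  private[sql] def toSnakeCase(str: String): String = {
    str
      .replaceAll("\\s+", "_")
      .toLowerCase
  }

}

// src/test/scala/com/github/mrpowers/spark/daria/sql/StringHelpersSpec.scala
package com.github.mrpowers.spark.daria.sql

import org.scalatest.FunSpec

class StringHelpersSpec extends FunSpec {

  it("converts a string to snake_case") {
    assert(StringHelpers.toSnakeCase("some Thing") === "some_thing")
  }

}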

Hosting the documentation

You can easily create a GitHub page to host the documentation for your open source projects. Here’s the repo for my GitHub page and here’s where I store the spark-daria HTML files.

I copy the HTML files from spark-daria over to the repo. The website is automatically updated whenever code is pushed to master.

You can add some authentication and host a static website on S3 for your private repos. Definitely host your documentation on a website and make it accessible for non-technical stakeholders.

Well documented public interfaces help product and project management folks learn more about the software development process.

Next steps

A new developer should be able to clone your Spark project, run `sbt doc`, and easily view the latest public API of your project. Exposing methods that should be private as part of your public API is a huge code smell.

Start documenting your projects by flagging all of the private methods, so they aren’t exposed in the documentation.

Then add the documentation for each public method with clear descriptions, code snippets, and `@group` / `@since` annotations.

Host the documentation and send the link to all technical and non-technical stakeholders. Documentation will help you communicate better, write code that’s easier to use, and make better abstractions.
