This blog post shows how to write some Spark code with the Java API and run a simple test.
The code snippets in this post are from this GitHub repo.
Start by creating a pom.xml file for Maven.
<?xml version="1.0" encoding="UTF-8"?>
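The pom.xml is truncated in this excerpt; a minimal file along these lines (the group/artifact IDs and version numbers are illustrative placeholders, not the repo's actual values) declares Spark SQL as a provided dependency:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>spark-java-example</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- spark-sql provides the DataFrame API; scope is provided because
         the cluster supplies the Spark jars at runtime -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.3.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
```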
Dependency injection is a design pattern that lets you write Spark code that’s more flexible and easier to test.
This blog post introduces code that has a dependency, shows how to inject the path as a dependency, and then shows how to inject an entire DataFrame.
Let’s create a…
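To make the two injection styles concrete, here’s a sketch (the function names and the country_id join column are hypothetical, not taken from the repo): first the file path is injected, then the entire DataFrame is.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Transformations {

  // Injecting the path as a dependency: tests can point this at a small fixture file
  def withCountryData(path: String)(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
    val countryDF = spark.read.option("header", "true").csv(path)
    df.join(countryDF, Seq("country_id"), "left")
  }

  // Injecting the entire DataFrame: tests can pass an in-memory DataFrame,
  // so no files are read at all
  def withCountry(countryDF: DataFrame)(df: DataFrame): DataFrame =
    df.join(countryDF, Seq("country_id"), "left")
}
```

Injecting the DataFrame itself is the most testable design, because the transformation no longer knows or cares where the data came from.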
You can speak Slack notifications to alert stakeholders when an important job is done running or even speak counts from a Spark DataFrame.
This blog post will also show…
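The excerpt is cut off, but the idea can be sketched with a hypothetical speak helper that posts a message to a Slack incoming webhook (the webhook URL and message wording are placeholders, and this is not the actual helper from the post):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object SlackNotifier {
  // Posts a plain-text message to a Slack incoming webhook
  def speak(webhookUrl: String, message: String): Unit = {
    val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val payload = s"""{"text": "$message"}"""
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
    conn.getResponseCode // forces the request to be sent
    conn.disconnect()
  }
}

// e.g. after a job finishes:
// SlackNotifier.speak(webhookUrl, s"extract finished, wrote ${df.count()} rows")
```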
You can use Scaladoc to generate nicely formatted documentation for your Spark projects, just like the official Spark documentation.
Documentation encourages you to write code with clearly defined public interfaces and makes it easier for others to use your code.
The spark-daria project is a good example of an open…
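As a sketch, a Scaladoc comment on a hypothetical helper function looks like this (sbt users can then generate the HTML documentation with the sbt doc task):

```scala
/** Removes all whitespace from a string.
  *
  * @param s the input string
  * @return `s` with every whitespace character removed
  */
def removeAllWhitespace(s: String): String =
  s.replaceAll("\\s", "")
```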
Spark code can be organized in custom transformations, column functions, or user defined functions (UDFs).
Here’s how the different functions should be used in general:
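The guidance itself is cut off in this excerpt, but the three shapes can be sketched as follows (the function names are illustrative, not from the post):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Custom transformation: DataFrame => DataFrame, chained with transform()
def withGreeting()(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

// Column function: Column => Column, composes with built-in functions
def lowerTrim(c: Column): Column =
  lower(trim(c))

// UDF: a last resort when built-in functions can't express the logic
val doubleIt = udf((n: Int) => n * 2)

// usage:
// df.transform(withGreeting())
//   .withColumn("clean_name", lowerTrim(col("name")))
//   .withColumn("doubled", doubleIt(col("num")))
```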
Let’s start with an overview of StructType objects and then demonstrate how StructType columns can be added to DataFrame schemas…
Let’s create a DataFrame with a StringType column and use the…
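A minimal sketch of a schema with a StructType column (the field names and data are made up for illustration; spark is an existing SparkSession):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(Row("bob", Row("Brooklyn", "NY")))

// location is a StructType column nested inside the top-level schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("location", StructType(Seq(
    StructField("city", StringType, nullable = true),
    StructField("state", StringType, nullable = true)
  )), nullable = true)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

// Nested fields are addressed with dot notation:
// df.select(col("location.city"))
```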
Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren’t sufficient, but should be used sparingly because they’re not performant.
This blog post will demonstrate how to define UDFs and will show how to avoid UDFs, when possible, by leveraging…
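As a sketch of the contrast (the column names are hypothetical): a UDF is a black box to the Catalyst optimizer, while the equivalent built-in function can be optimized.

```scala
import org.apache.spark.sql.functions._

// UDF version: works, but Catalyst can't see inside it
val toLowerUdf = udf((s: String) => if (s == null) null else s.toLowerCase)
// df.withColumn("lower_name", toLowerUdf(col("name")))

// Preferred: the built-in lower() function, which Catalyst can optimize
// and which already handles nulls correctly
// df.withColumn("lower_name", lower(col("name")))
```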
The Spark Column class defines predicate methods that allow logic to be expressed concisely and elegantly (e.g.
spark-daria defines additional Column methods such as isNotIn to fill in the Spark API gaps.
This blog post will demonstrate how to express logic…
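A sketch of how Column predicate methods compose (the column names and values are illustrative; the second expression shows the negated-isin pattern that a method like spark-daria’s isNotIn wraps):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Built-in predicate methods chain cleanly with && and ||
val isActiveAdult: Column =
  col("age").geq(18) && col("status").isin("active", "trial")

// Without an isNotIn method, the negation is written by hand:
val notBanned: Column = !col("status").isin("banned", "suspended")

// df.where(isActiveAdult && notBanned)
```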