Working with S3 and Spark Locally
Spark is used for big data analysis, and in production developers typically spin up a cluster of machines with a provider like Databricks. For exploratory work or small data sets, it is convenient to run Spark locally.
If Spark is configured properly, you can work directly with files in S3 without downloading them.
Configuring the Spark Shell
Start the Spark shell with the spark-csv package, which adds CSV support to DataFrames.
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
Spark prints the following informational message when the shell loads: “Spark context available as sc.” The sc variable points to the Spark context:
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext
Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, so Spark can communicate with S3.
Add the AWS keys to the ~/.bash_profile file:
export AWS_ACCESS_KEY_ID=redacted
export AWS_SECRET_ACCESS_KEY=redacted
Once the environment variables are set, restart the Spark shell and enter the following commands.
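A sketch of those commands, assuming the s3n connector (the same scheme used in the load() call later in this post); the property names are the standard Hadoop s3n credential keys:
// Point the s3n filesystem at the AWS credentials exported above
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", System.getenv("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", System.getenv("AWS_SECRET_ACCESS_KEY"))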
The System.getenv() method retrieves the environment variable values.
Reading Data From S3 into a DataFrame
This example will use three files that are stored in the s3://some_bucket/data/states/ folder.
This is the content of the s3://some_bucket/data/states/texas.csv file.
city,state
austin,tx
dallas,tx
houston,tx
This is the content of the s3://some_bucket/data/states/new_york.csv file.
city,state
new york,ny
buffalo,ny
albany,ny
This is the content of the s3://some_bucket/data/states/california.csv file.
city,state
los angeles,ca
san francisco,ca
sacramento,ca
The following code can be used to load these three CSV files into a single DataFrame:
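A sketch of that load, assuming the sqlContext provided by the shell and a wildcard path that matches the three files (the bucket name is the example bucket used in this post):
// Read every CSV under the states/ prefix into a single DataFrame
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("s3n://some_bucket/data/states/*.csv")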
Notice that the load() method includes a path with s3n, not s3. This Stack Overflow answer discusses the difference between s3n and s3.
The show() method can be used to demonstrate that the data from the three CSV files has been loaded into the DataFrame:
scala> df.show()
+-------------+-----+
|         city|state|
+-------------+-----+
|  los angeles|   ca|
|san francisco|   ca|
|   sacramento|   ca|
|     new york|   ny|
|      buffalo|   ny|
|       albany|   ny|
|       austin|   tx|
|       dallas|   tx|
|      houston|   tx|
+-------------+-----+
Writing a DataFrame to an S3 Folder
The DataFrame results can be written to a folder in S3.
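A sketch of the write, assuming spark-csv's save() with an s3n path and a repartition to a single partition so that one part file holds every row (both are assumptions, not requirements):
// Collapse to one partition so S3 ends up with a single part file
df.repartition(1).write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3n://some_bucket/data/states/all_states")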
The s3://some_bucket/data/states/all_states/part-00000 file will contain all of the rows in the DataFrame.