Example Batch Application

Overview

The example batch application demonstrates an application that can be deployed using the PNDA Deployment Manager. (See the platform-deployment-manager project for details.)

The application is a tar file containing binaries and configuration files required to perform batch processing.

This example application converts the PNDA master data set into Parquet format, which can be queried efficiently using Impala.

The PNDA master data set is the data automatically copied into HDFS from Kafka by Gobblin.

There are two versions of the application:

workflow, designed to run once
coordinator, designed to run regularly on a schedule

The application expects Avro-encoded events containing three generic integer fields (a, b and c) and a timestamp in milliseconds since 1970 (gen_ts), for example: a=1;b=2;c=3;gen_ts=1466166149000. These events are generated by the sample data source, as detailed in the example spark streaming project.
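As a rough illustration of the event layout described above, the following sketch parses the sample string and converts gen_ts to a UTC datetime. It assumes the payload is a semicolon-separated list of key=value pairs, as the sample suggests. The helper names parse_event and gen_ts_to_datetime are hypothetical and are not part of the application.

```python
from datetime import datetime, timezone


def parse_event(payload):
    """Parse a payload such as 'a=1;b=2;c=3;gen_ts=...' into a dict of ints."""
    # Assumption: fields are separated by ';' and each field is 'key=value'.
    pairs = (field.split("=", 1) for field in payload.split(";"))
    return {key: int(value) for key, value in pairs}


def gen_ts_to_datetime(gen_ts_ms):
    """Convert milliseconds since 1970-01-01 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(gen_ts_ms / 1000, tz=timezone.utc)


event = parse_event("a=1;b=2;c=3;gen_ts=1466166149000")
generated_at = gen_ts_to_datetime(event["gen_ts"])
print(event, generated_at.isoformat())
```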

Requirements

Maven 3.0.5
Java JDK 1.8

Build

To build the example applications use:

mvn clean package

Run this command at the root of the repository. It builds the application binary and both application packages. Both packages do the same job, but the "-c" package runs on a regular schedule as an Oozie coordinator application, whereas the "-wf" package runs once as an Oozie workflow application. The build creates package files in the app-package-c/target and app-package-wf/target directories, named spark-batch-example-app-c-{version}.tar.gz and spark-batch-example-app-wf-{version}.tar.gz respectively.
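To check a build, it can help to compute where the two packages should appear. The sketch below derives the expected paths from the directory and file names given above. The function name expected_packages and the version string "1.0.0" are placeholders chosen for illustration only.

```python
from pathlib import PurePosixPath


def expected_packages(repo_root, version):
    """Return the expected package paths for the coordinator and workflow builds."""
    root = PurePosixPath(repo_root)
    return {
        # "-c": runs on a schedule as an Oozie coordinator application.
        "coordinator": root / "app-package-c" / "target"
        / f"spark-batch-example-app-c-{version}.tar.gz",
        # "-wf": runs once as an Oozie workflow application.
        "workflow": root / "app-package-wf" / "target"
        / f"spark-batch-example-app-wf-{version}.tar.gz",
    }


# "1.0.0" is a placeholder version, not the project's actual version.
packages = expected_packages(".", "1.0.0")
for kind, path in packages.items():
    print(kind, path)
```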

Spark Job

Both versions of the application run the same Spark code.

This is a very basic job that loads the row-based Avro data from the master data set and writes it out as column-based Parquet.
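The shape of such a job can be sketched in PySpark as follows. This is not the project's actual job code, only a minimal illustration of an Avro-to-Parquet conversion. It assumes a SparkSession with the Avro data source available under the format name "avro" (older Spark versions use a separate package with a different format name). The function name avro_to_parquet is hypothetical.

```python
def avro_to_parquet(spark, input_path, output_path):
    """Read row-based Avro files and write them back out as columnar Parquet.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # Assumption: the Avro data source is registered as "avro".
    df = spark.read.format("avro").load(input_path)
    # Parquet stores data by column, which suits analytical queries.
    df.write.parquet(output_path)
    return df


# Example use inside a Spark application (paths are illustrative):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("avro-to-parquet-sketch").getOrCreate()
#   avro_to_parquet(spark, "/path/to/avro/input", "/path/to/parquet/output")
```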

Files in the package

hdfs.json: creates the destination folder in HDFS and specifies that it should not be deleted when the application is destroyed.
log4j.properties: defines the log level and behaviour for the application.
properties.json: contains default properties that may be overridden at application creation time.
workflow.xml: Oozie workflow definition that runs the Spark job. See the Oozie documentation for more information.
coordinator.xml: Oozie coordinator definition that sets the schedule for running the job. See the Oozie documentation for more information.
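The idea of default properties that can be overridden at creation time can be pictured as a simple dictionary merge. The sketch below is conceptual only and does not reproduce how the Deployment Manager handles properties.json. Apart from input_data, the property names and all values are hypothetical.

```python
def effective_properties(defaults, overrides):
    """Return the defaults with any creation-time overrides applied on top."""
    merged = dict(defaults)   # copy so the packaged defaults stay untouched
    merged.update(overrides)  # values supplied at creation time win
    return merged


# Hypothetical defaults, standing in for the contents of properties.json.
defaults = {"input_data": "/default/input", "example_setting": "default"}
# Hypothetical value supplied when the application is created.
overrides = {"input_data": "/custom/input"}

properties = effective_properties(defaults, overrides)
print(properties)
```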

Deploying the package and creating an application

The PNDA console can be used to deploy the application package to a cluster and then to create an application instance. The console is available on port 80 on the edge node.

When creating an application in the console, ensure that the input_data property is set to a folder that contains data. If any data has been published to Kafka, it will be found in HDFS under /user/pnda/PNDA_datasets/datasets/ (once 30 minutes have passed and Gobblin has run to import it).

input_data: /user/pnda/PNDA_datasets/datasets/source=test-src/year=*
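The trailing year=* in this value is a wildcard that selects every year directory for the test-src source. The sketch below illustrates the matching with Python's fnmatch module. The candidate directory names are hypothetical, and fnmatch only approximates Hadoop glob rules (for example, its * also matches path separators), so treat this as an illustration of the pattern rather than of HDFS behaviour.

```python
from fnmatch import fnmatchcase

INPUT_DATA = "/user/pnda/PNDA_datasets/datasets/source=test-src/year=*"

# Hypothetical directories; real contents depend on what was published to Kafka.
candidates = [
    "/user/pnda/PNDA_datasets/datasets/source=test-src/year=2016",
    "/user/pnda/PNDA_datasets/datasets/source=test-src/year=2017",
    "/user/pnda/PNDA_datasets/datasets/source=other-src/year=2016",
]

# Keep only the directories that the input_data pattern would select.
matched = [path for path in candidates if fnmatchcase(path, INPUT_DATA)]
for path in matched:
    print(path)
```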

To make the package available for deployment it must be uploaded to a package repository. The default implementation is an OpenStack Swift container. The package may be uploaded via the PNDA repository manager which abstracts the container used, or by manually uploading the package to the container.
