Example Streaming Application

Overview

The example streaming application shows an example of an application that can be deployed using the PNDA Deployment Manager. (See the platform-deployment-manager project for details.)

The application is a tar file containing binaries and configuration files required to perform some stream processing.

This example application reads events from Kafka and writes them to HBase.

A JDBC client is provided that can query the results using Impala.

The application expects avro encoded events with 3 generic integer fields and a ms since 1970 timestamp, a b c and gen_ts: a=1;b=2;c=3;gen_ts=1466166149000. These are generated by the sample data source.

Requirements

Maven 3.0.5
Java JDK 1.8

Build

To build the example applications use:

mvn clean package

This command should be run at the root of the repository and will build the application binary, and the application package. It will create a package file in the app-package/target directory. It will be called spark-streaming-to-hbase-example-app-{version}.tar.gz. Note This does not build the impala client used to look at the output data, to build that also use mvn clean package but from within the impala-jdbc-client directory and see usage instructions below.

Files in the package

application.properties: config file used by the Spark Streaming scala application.
log4j.properties: defines the log level and behaviour for the application.
hbase.json: contains commands to create an HBase table and Impala metadata.
properties.json: contains default properties that may be overriden at application creation time.

Deploying the package and creating an application

The PNDA console can be used to deploy the application package to a cluster and then to create an application instance. The console is available on port 80 on the edge node.

When creating an application in the console, ensure that the input_topic property is set to a real Kafka topic.

"input_topic": "avro.events",

To make the package available for deployment it must be uploaded to a package repository. The default implementation is an OpenStack Swift container. The package may be uploaded via the PNDA repository manager which abstracts the container used, or by manually uploading the package to the container.

JDBC Client

The supplied sample client executes the following SQL query to compute the average value of cf:c between two timestamps x and y:

select round(avg(cast(col as int)),2) as average from example_table where id between x and y

A suitable JDBC driver for Impala must be downloaded and used. Please refer to the impala-jdbc-client directory in this repository for details of an open source driver that we've tested.

Edit the file src/main/resources/application.properties and change the IP address in the connection.url property to the IP address of a node running an Impala daemon.

Metadata is defined in the hive metastore which maps the HBase columns onto SQL fields. This metadata is provisioned as part of the application package in the hbase.json file.

Only the SQL field that corresponds to the HBase rowkey can be considered indexed, and it is generally good practice to limit the quantity of data being considered with a where clause on that field.

Run sample data source

If you want to produce test data and see how the ingest pipeline works, there is a script in data-source/src/main/resources/src_tcp_ksh.py which produces random events and sends it over TCP port 20518.

To run the test script:

cd data-source/src/main/resources
python src_tcp_ksh.py

Next, install logstash as per the instructions here. Be sure to substitute any fields such as bootstrap_servers and topic_id in the kafka output config.

spark-streaming