Platform Libraries

Overview

Data analytics on PNDA

When raw event data is streamed to a PNDA cluster via Kafka, it is encoded using an Avro schema together with extra metadata, including a timestamp, the data source and the host IP. The encoded raw data and its metadata are persisted in HDFS and organized on a per-source, per-year, per-month, per-day and per-hour basis.
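
For example, one hour of data from a 'netflow' source would sit under a partitioned directory such as the one below. The dataset root is the default location referenced in the Quick start section; the source name and partition values are placeholders:

    /user/pnda/PNDA_datasets/datasets/source=netflow/year=2015/month=11/day=02/hour=11/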

In order to make sense of the data, an application developer needs to extract data from a specific folder, decode it against the PNDA Avro schema, and incorporate it into data analytics processes. Instead of duplicating these common steps in every application, you can use the platform libraries, a Python library that hides the low-level data operations behind ready-to-use functions for accessing PNDA data and performing analytics with Spark. Most importantly, the platform libraries can be used as an extension framework for custom data preprocessing and custom data operations.

As shown in the diagram below, the platform libraries place an abstraction layer on top of the low-level HDFS interfaces and allow you to rapidly develop PySpark applications or interactive notebooks (powered by Jupyter or Zeppelin).

[Figure: platform-libraries overview]

Package overview

+-- Platformlibs
|   +-- common_helpers: defines PNDA-specific utility functions
|   +-- config_helper: defines configuration utilities (currently only loading and parsing INI files)
|   +-- data_handler
|       +-- DataHandler: abstract data handler class
|   +-- simple_data_handler
|       +-- SimpleDataHandler: default data handler class that only loads data from HDFS as a Spark RDD (no business logic implemented)
|   +-- json_data_handler
|       +-- JsonDataHandler: JSON data handler class that assumes 'rawdata' is encoded in JSON format
|   +-- xr_data_handler
|       +-- XrDataHandler: a sample implementation of an XR telemetry data handler

Quick start

By default, the library is pre-installed on an edge node of every successfully provisioned PNDA cluster. Follow these steps to start loading data:

$ sudo -u hdfs pyspark --master yarn-client --jars /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 
>>> from platformlibs.simple_data_handler import SimpleDataHandler
>>> source = 'your data source name' # the folder name under /user/pnda/PNDA_datasets/datasets/source=?
>>> path = 'year=?/month=?/hour=?' # any sub-directory under /user/pnda/PNDA_datasets/datasets/source=?
>>> handler = SimpleDataHandler(sc, source, path)
>>> rdd = handler.rdd
>>> ...
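
Each decoded record is a Python dictionary carrying the PNDA metadata alongside the raw payload. Assuming the Avro schema fields shown in the JsonDataHandler example further down this page, inspecting a record looks roughly like this:

>>> record = rdd.take(1)[0]
>>> sorted(record.keys())
['host_ip', 'rawdata', 'src', 'timestamp']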

Build from source

If your PNDA cluster does not have this library pre-installed, or you would like to use it on another PNDA cluster node, follow these steps on the edge node:

Build

Once you have cloned the repository, run the build as follows:

    sudo pip install -r requirements.txt
    python setup.py bdist_egg

The build process generates an egg distribution at ./dist/platformlibs-<VERSION>-py2.7.egg.

Run tests

Running the unit tests, on either a PNDA cluster or your local cluster, requires additional dependencies. Run the following to install them:

    sudo pip install -r test_requirements.txt
    source install-spark.sh

and then run the unit tests:

    nosetests tests

Deploy

Until this is automated, this step is mandatory for deploying the egg. The egg must be installed on the Zeppelin node in the cluster using easy_install:

    sudo easy_install /path/to/platformlibs-<VERSION>-py2.7.egg

Configure

The platform libraries load their configuration from /etc/platformlibs/platformlibs.ini, which holds the access details for Cloudera Manager. The default configuration is created during PNDA cluster provisioning. If the username or password has been changed, you will need to update the configuration file accordingly. An example configuration file is given below:

[cm]
cm_host=192.168.0.105
cm_user=admin
cm_pass=admin
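
If you need these values in your own code, the config_helper module (documented in the API section below) parses this file into a dictionary. A minimal sketch using the default file location:

    >>> from platformlibs import config_helper
    >>> conf = config_helper.get_config('/etc/platformlibs/platformlibs.ini')
    >>> conf['cm_host']
    '192.168.0.105'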

Interactive notebooks usage

Configure the pyspark interpreter settings in Zeppelin/Jupyter with:

    --py-files=/path/to/platformlibs-<VERSION>-py2.7.egg

Note: You may need to restart the pyspark interpreter before using this library. If you are using this library in your own Spark application, simply copy the library into your app.

You can also use the egg with the pyspark interactive console or Zeppelin/Jupyter by injecting the library into the runtime, adding the following statement to your code:

    sc.addPyFile('/path/to/platformlibs-<VERSION>-py2.7.egg')
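
Putting the two together, a notebook cell (or pyspark console session) might look like the sketch below; the data source and file path are placeholders borrowed from the examples in the API section:

    sc.addPyFile('/path/to/platformlibs-<VERSION>-py2.7.egg')
    from platformlibs.json_data_handler import JsonDataHandler
    handler = JsonDataHandler(sc, 'netflow', 'year=2015/month=11/day=02/hour=11')
    handler.rdd.count()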

API Specifications

platformlibs.common_helpers module

A collection of builtin helper functions

  • platformlibs.common_helpers.flatten_dict
    • Description: return a flattened dictionary
    • Args:
      • input_d: read-only source dictionary
      • result: output dictionary
        >>> from platformlibs import common_helpers
        >>> common_helpers.flatten_dict({'a':[{'b':1}, {'c':[{'d':2}]}]})
        {'a-c-d': 2, 'a-b': 1}
        
  • platformlibs.common_helpers.get_hdfs_uri
    • Description: return an HDFS root path URI
    • Args:
      • hostname: Cloudera Manager host name of a PNDA cluster
        >>> from platformlibs import common_helpers
        >>> common_helpers.get_hdfs_uri("test-dev-cm", "admin", "admin")
        hdfs://HDFS-HA
        

platformlibs.config_helper module

  • platformlibs.config_helper.get_config
    • Description: parse configuration file and return a configuration dictionary
    • Args:
      • filename: INI configuration file path
        >>> from platformlibs import config_helper
        >>> config_helper.get_config('/etc/platformlibs/platformlibs.ini')
        {'cm_host':'pnda-cluster-cdh-cm', 'cm_user':'admin', 'cm_pass':'admin'}
        

platformlibs.data_handler module

  • platformlibs.data_handler.DataHandler(spark_context, data_source, file_path)

    • Abstract DataHandler Class
    • rdd
      • Description: Decode avro datasets and return an RDD instance
      • Args:
        • datasource: data source type (e.g. netflow or telemetry)
        • path: HDFS relative directory or file path
    • preprocess

      • Description: Preprocessing abstract interface
      • Args:
        • raw_data: the input RDD
    • list_host_ips

      • Description: abstract interface that returns a list of (host_ip, total_stats) pairs
    • list_metric_ids

      • Description: abstract interface that returns a list of (host_ip, [(metric_id, total_stats)]) pairs, sorted in descending order on a per-host basis
      • Args:
        • limit: the maximum number of metrics returned per host
        • filters: filtering rules
    • execute_query
      • Description: return time-series data
      • Args:
        • filters: filtering rules

platformlibs.simple_data_handler module

  • platformlibs.simple_data_handler.SimpleDataHandler(spark_context, data_source, file_path)
    • Description: default data handler class that only loads data from HDFS as a Spark RDD (no business logic implemented)
    • preprocess
      • Description: decode avro-formatted data
      • Args:
        • raw_data: the input RDD
          >>> from platformlibs.simple_data_handler import SimpleDataHandler
          >>> handler = SimpleDataHandler(sc, 'test', 'year=2015/month=11/day=02/hour=11')
          >>> rdd = handler.rdd
          >>> rdd.count()
          3451792
          

platformlibs.json_data_handler module

  • platformlibs.json_data_handler.JsonDataHandler(spark_context, data_source, file_path)
    • Description: JSON data handler that assumes 'rawdata' is encoded in JSON format
    • preprocess
      • Description: decode avro-formatted data and decode the JSON-formatted raw data value
      • Args:
        • raw_data: the input RDD
          >>> from platformlibs.json_data_handler import JsonDataHandler
          >>> handler = JsonDataHandler(sc, 'netflow', 'year=2015/month=11/day=02/hour=11')
          >>> rdd = handler.rdd
          >>> rdd.take(1)
          [{u'src': u'netflow', 'rawdata': {u'flow_sampler_id': u'0', u'in_bytes': u'46', u'protocol': u'6', u'first_switched': u'2015-11-02T10:59:44.999Z', u'flowset_id': u'256', u'ipv4_dst_addr': u'128.107.253.30', u'src_mask': u'0', u'src_as': u'0', u'ipv4_src_addr': u'64.39.105.23', u'last_switched': u'2015-11-02T10:59:44.999Z', u'version': u'9', u'flow_seq_num': u'1255692034', u'direction': u'0', u'l4_src_port': u'48965', u'ipv4_next_hop': u'128.107.80.58', u'dst_as': u'0', u'in_pkts': u'1', u'dst_mask': u'30', u'l4_dst_port': u'21064', u'output_snmp': u'3', u'src_tos': u'0', u'tcp_flags': u'0', u'input_snmp': u'106'}, u'timestamp': 1446462002000, u'host_ip': u'172.17.153.41'}]
          
    • list_host_ips
      • Description: Returns a list of (host_ip, total_stats) pairs
        >>> handler.list_host_ips()
        [('10.33.64.11', 292829), ('10.33.76.2', 269228), ('10.33.76.1', 158514), ('172.17.153.41', 156710), ('172.30.194.7', 67920), ('172.30.194.8', 25140), ('172.30.194.51', 5477), ('172.17.153.27', 4902), ('10.33.69.19', 1838), ('172.30.194.117', 359)]
        
    • list_metric_ids
      • Description: Returns a list of (host_ip, [(metric_id, total_stats)]) pairs, sorted in descending order on a per-host basis
      • Args:
        • limit: the maximum number of metrics returned per host
        • filters: filtering rules (supported filters include 'host_ips':[], 'start_ts':int, 'end_ts':int, 'data_sources':[])
            >>> handler.list_metric_ids(limit=1, filters={'host_ips':['10.33.64.11', '172.17.153.27'], 'start_ts':1446462002000, 'end_ts':1446462004000, 'data_sources':['netflow']})
            [('10.33.64.11', [(u'mul_dst_bytes', 246)]), ('172.17.153.27', [(u'mul_dst_pkts', 2117)])]
            
    • execute_query
      • Description: return time-series data
      • Args:
        • filters: filtering rules (supports the same filters as list_metric_ids, plus 'metrics':[])
          >>> handler.execute_query(filters={'host_ips':['10.33.64.11'], 'start_ts':1446462002000, 'end_ts':1446462004000, 'data_sources':['netflow'], 'metrics':['in_bytes']})
          [((u'in_bytes', u'10.33.64.11'), [(1446462002000, u'1616'), (1446462004000, u'60')])]
          

platformlibs.xr_data_handler module

  • platformlibs.xr_data_handler.XrDataHandler(spark_context, data_source, file_path, cm_host, cm_user='admin', cm_pass='admin')
    • Custom implementation of XR Telemetry data handler
    • list_metric_ids
      • Description: Returns a list of (host_ip, [(metric_id, total_stats)]) pairs, sorted in descending order on a per-host basis
      • Args:
        • limit: the maximum number of metrics returned per host
        • filters: filtering rules (supported filters are inherited from the SimpleDataHandler class, plus 'metric_type' (e.g. 'ipsla', 'mpls' or 'infra'))
          >>> from platformlibs.xr_data_handler import XrDataHandler
          >>> handler = XrDataHandler(sc, 'telemetry', 'year=2015/month=08/day=14/hour=06', 'test-dev-cm', 'admin', 'admin')
          >>> handler.list_host_ips()
          [('192.168.0.4', 5226), ('192.168.0.8', 5226), ('192.168.0.9', 5226), ('192.168.0.1', 5148), ('192.168.0.6', 5148), ('192.168.0.10', 5148)]
          >>> handler.list_metric_ids(limit=1, filters={'host_ips':['192.168.0.4'], 'metric_type':'ipsla'})
          [{'metrics': [(u'ipslastats.Aggregated.54.NonDistributed.Target.SpecificStats.UDPJitterStats.PacketLossDS', 402)], 'host_ip': '192.168.0.4'}]
          >>> handler.list_metric_ids(limit=1, filters={'host_ips':['192.168.0.4'], 'metric_type':'mpls'})
          [{'metrics': [(u'mplstesummary.BidirTailLSPs', 67)], 'host_ip': '192.168.0.4'}]
          >>> handler.list_metric_ids(limit=1, filters={'host_ips':['192.168.0.4'], 'metric_type':'infra'})
          [{'metrics': [(u'ribprefix.RoutePath-Flags64', 2010)], 'host_ip': '192.168.0.4'}]
          

Custom implementations

By default, the platform libraries assume the raw event data is encoded in JSON format. If your data uses a different encoding, you will need to provide your own implementation.

Two options are available:

  1. Provide a custom "preprocess" implementation that renders your raw data as JSON-formatted records. With this option, you can reuse the operations implemented by default (see the sketch below).
  2. Provide a custom implementation of the operational interfaces for your custom data formats.
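
As an illustration of option 1, the sketch below assumes a hypothetical data source whose 'rawdata' payload is CSV text. It extends the abstract DataHandler and overrides preprocess to turn each payload into a dictionary of the same shape that JsonDataHandler produces; the class name and field handling are illustrative only and not part of the library:

    from platformlibs.data_handler import DataHandler

    class CsvDataHandler(DataHandler):
        """Hypothetical handler for 'rawdata' payloads encoded as CSV text."""

        def preprocess(self, raw_data):
            # raw_data is the Avro-decoded RDD; each record is a dict holding
            # the PNDA metadata fields plus the source-specific 'rawdata' payload.
            def decode(record):
                metric, value = record['rawdata'].split(',', 1)
                record['rawdata'] = {metric: value}
                return record
            return raw_data.map(decode)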
