Questions & Answers

Why is an Avro schema needed?

0 votes
asked Apr 27 by yusong0926 (240 points)
I have a question regarding the input schema required for PNDA. Why is Avro required for PNDA? Can I use my own schema, just as I would with Kafka in general, and then use Spark Streaming to read, process, and consume the data and save it to HBase, since I already know the schema?

1 Answer

0 votes
answered Apr 27 by James Clarke (1,630 points)
Part of the advantage of PNDA is following the baked-in best practices we have set up, so we recommend using the PNDA schema. Doing so gives you the benefit of a Gobblin MapReduce job automatically archiving all data from Kafka into HDFS. The Avro schema defined for PNDA contains just enough information to archive the data in HDFS as a time series, and it defines a raw-bytes data field that can contain any payload you like. This is an application of schema-on-read: the detailed format of the payload is not constrained, but is interpreted at processing time by an application. So I would put your existing data format inside this PNDA envelope schema.
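To make the envelope idea concrete, here is a minimal sketch in Python of what wrapping your own payload in a PNDA-style envelope record might look like. The field names (timestamp, src, host_ip, rawdata) are my recollection of the PNDA envelope and should be checked against the PNDA documentation; the point is that the envelope carries only archiving metadata, while your payload stays as opaque bytes.

```python
import json
import time

# A sketch of a PNDA-style Avro envelope schema (field names are
# illustrative; verify against the actual PNDA schema definition).
ENVELOPE_SCHEMA = json.loads("""
{
  "namespace": "pnda.entity",
  "type": "record",
  "name": "event",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "src",       "type": "string"},
    {"name": "host_ip",   "type": "string"},
    {"name": "rawdata",   "type": "bytes"}
  ]
}
""")

def wrap_payload(payload: bytes, src: str, host_ip: str) -> dict:
    """Wrap an opaque payload in a PNDA-style envelope record.

    The payload stays as raw bytes (schema-on-read): only the
    consuming application interprets its internal format.
    """
    return {
        "timestamp": int(time.time() * 1000),  # epoch millis, for time-series archiving
        "src": src,
        "host_ip": host_ip,
        "rawdata": payload,
    }

# Your existing data format goes inside the envelope untouched.
record = wrap_payload(b'{"my": "own-format"}', "my-app", "10.0.0.1")
print(sorted(record))  # ['host_ip', 'rawdata', 'src', 'timestamp']
```

In a real producer you would serialize `record` with an Avro library against the envelope schema before publishing to Kafka; your Spark Streaming consumer would then deserialize the envelope and apply its own logic to `rawdata`.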

Having said all that, you could disable Gobblin, or blacklist the topics that don't use the PNDA schema, and then write the application logic to handle your own schema yourself.
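If you go the blacklisting route, Gobblin's Kafka source can be told to skip topics via a job property along these lines. The property name and value format here are assumptions from memory; check the Gobblin Kafka source configuration for your PNDA version before relying on them.

```properties
# Sketch: exclude non-PNDA topics from the Gobblin archiving job.
# Property name assumed; verify against your Gobblin version's docs.
topic.blacklist=my_custom_topic_1,my_custom_topic_2
```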