Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.2
Description
Add s3a tool to convert S3 server logs to avro/csv files
With S3A Auditing, we have code in hadoop-aws to parse S3 server log entries, including splitting the referrer header into its fields.
But we don't have an easy way of using it. I've done some early work in Spark, but as well as that code not working (https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala), it doesn't do the audit splitting.
And, given that the S3 audit logs can be small on a lightly loaded store, a full Spark job isn't always justified.
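To make the shape of the problem concrete, here is a rough, purely illustrative sketch of that parsing step, not the hadoop-aws implementation: match the space-separated fields of an S3 server access log line with a regex, then split the audit data packed into the referrer's query string into a map. The class name, field selection and quote handling are all simplified assumptions.
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only: parse one S3 server access log line and its referrer audit fields. */
public final class S3LogLineSketch {

  // Simplified pattern following the documented space-separated field order of
  // S3 server access logs; quote handling and trailing fields are glossed over.
  private static final Pattern ENTRY = Pattern.compile(
      "(?<owner>\\S+) (?<bucket>\\S+) \\[(?<timestamp>[^\\]]+)\\] (?<remoteip>\\S+) "
      + "(?<requester>\\S+) (?<requestid>\\S+) (?<operation>\\S+) (?<key>\\S+) "
      + "\"(?<request>[^\"]*)\" (?<status>\\S+) (?<error>\\S+) (?<bytes>\\S+) "
      + "(?<size>\\S+) (?<totaltime>\\S+) (?<turnaround>\\S+) "
      + "\"(?<referrer>[^\"]*)\" \"(?<agent>[^\"]*)\".*");

  /** Split the audit data carried in the referrer's query string into a key/value map. */
  static Map<String, String> splitReferrer(String referrer)
      throws UnsupportedEncodingException {
    Map<String, String> audit = new LinkedHashMap<>();
    int q = referrer.indexOf('?');
    if (q < 0) {
      return audit;
    }
    for (String pair : referrer.substring(q + 1).split("&")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        audit.put(pair.substring(0, eq),
            URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8.name()));
      }
    }
    return audit;
  }

  public static void main(String[] args) throws Exception {
    // Read log lines from stdin; print the fields a real tool would turn into records.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = ENTRY.matcher(line);
        if (m.matches()) {
          System.out.println(m.group("bucket") + "\t" + m.group("operation")
              + "\t" + m.group("status") + "\t" + splitReferrer(m.group("referrer")));
        }
      }
    }
  }
}
{code}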
Proposed
We add:
- a utility parser class to take a row and split it into a record
- which can be saved to avro through a schema we define (a sketch follows the example command below)
- or exported to CSV with/without headers (with: easy to understand; without: can cat files)
- add a mapper so this can be used in MR jobs (could even make it a committer test...)
- and a "hadoop s3guard/hadoop s3" entry point so you can do it on the CLI
hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*
This would take all files under the path, load and parse them, and emit the output.
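For the avro side, here is a hypothetical sketch of the kind of schema and writer involved, using Avro's SchemaBuilder; a map<string> field leaves room for new audit context attributes (one of the design issues below). The record name, field selection and local-file output are illustrative assumptions, not a proposed final schema.
{code:java}
import java.io.File;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Hypothetical sketch: a log-entry schema plus a writer for already-parsed rows. */
public final class S3LogAvroSketch {

  /** Core log fields plus an open-ended map for the referrer/audit attributes. */
  static final Schema SCHEMA = SchemaBuilder.record("S3AccessLogEntry")
      .namespace("org.example.s3a.logs")
      .fields()
      .requiredString("bucket")
      .requiredString("timestamp")
      .requiredString("remoteIp")
      .requiredString("requester")
      .requiredString("requestId")
      .requiredString("operation")
      .requiredString("key")
      .optionalInt("httpStatus")
      .optionalLong("bytesSent")
      .optionalLong("totalTimeMs")
      .name("referrerAudit").type().map().values().stringType().noDefault()
      .endRecord();

  /** Write parsed rows (keys must match schema field names) to a local .avro file. */
  static void writeLocal(Iterable<Map<String, Object>> rows, File dest) throws Exception {
    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
      writer.create(SCHEMA, dest);
      for (Map<String, Object> row : rows) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        row.forEach(record::put);
        writer.append(record);
      }
    }
  }
}
{code}
A real tool could instead write through the Hadoop FileSystem API (DataFileWriter.create(schema, outputStream)) so the destination can be an s3a:// path, and could emit either one combined file or one per input log, per the design questions below.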
Design issues
- would you combine all files, or emit a new .avro or .csv file for each one?
- what's a good avro schema to cope with new context attributes
- CSV nuances: tabs vs spaces, use opencsv or implement the (escaping?) writer ourselves.
me: TSV, and do a minimal escaping and quoting emitter (rough sketch below). Can use opencsv in the test suite.
- would you want an initial filter during processing? especially for exit codes?
me: no, though I could see the benefit for 503s. Best to let you load it into a notebook or spreadsheet and go from there.
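To make "minimal escaping and quoting emitter" concrete, a rough sketch (names made up, not a committed design): write tab-separated rows, quote a field only when it contains a tab, newline or quote, double any embedded quotes, and keep the header row optional so output can still be cat-ed together.
{code:java}
import java.io.IOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical sketch: minimal TSV emitter with optional header and light quoting. */
public final class TsvEmitterSketch {

  private final Writer out;
  private final boolean withHeader;

  TsvEmitterSketch(Writer out, boolean withHeader) {
    this.out = out;
    this.withHeader = withHeader;
  }

  /** Quote only when needed: the field contains a tab, newline or double quote. */
  private static String escape(String field) {
    if (field.indexOf('\t') < 0 && field.indexOf('\n') < 0
        && field.indexOf('\r') < 0 && field.indexOf('"') < 0) {
      return field;
    }
    return '"' + field.replace("\"", "\"\"") + '"';
  }

  /** Emit the column names as the first row, if headers were requested. */
  void emitHeader(List<String> columns) throws IOException {
    if (withHeader) {
      emitRow(columns);
    }
  }

  /** Emit one record as a tab-separated, newline-terminated row. */
  void emitRow(List<String> fields) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.size(); i++) {
      if (i > 0) {
        sb.append('\t');
      }
      sb.append(escape(fields.get(i)));
    }
    sb.append('\n');
    out.write(sb.toString());
  }
}
{code}
opencsv (or any TSV-aware reader) could then be used in the test suite to check that the escaping round-trips.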