Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.2
Description
Add s3a tool to convert S3 server logs to avro/csv files
With S3A Auditing, we have code in hadoop-aws to parse S3 server log entries, including splitting the referrer header into its fields.
But we don't have an easy way of using it. I've done some early work in Spark, but as well as that code not working (https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala), it doesn't do the audit splitting.
And, given that the S3 audit logs can be small on a lightly loaded store, a full Spark job isn't always justified.
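To make the shape of the problem concrete, here is a rough, purely illustrative sketch of that parsing step, not the hadoop-aws implementation: match the space-separated fields of an S3 server access log line with a regex, then split the audit data packed into the referrer's query string into a map. The class name, field selection and quote handling are all simplified assumptions.
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only: parse one S3 server access log line and its referrer audit fields. */
public final class S3LogLineSketch {

  // Simplified pattern following the documented space-separated field order of
  // S3 server access logs; quote handling and trailing fields are glossed over.
  private static final Pattern ENTRY = Pattern.compile(
      "(?<owner>\\S+) (?<bucket>\\S+) \\[(?<timestamp>[^\\]]+)\\] (?<remoteip>\\S+) "
      + "(?<requester>\\S+) (?<requestid>\\S+) (?<operation>\\S+) (?<key>\\S+) "
      + "\"(?<request>[^\"]*)\" (?<status>\\S+) (?<error>\\S+) (?<bytes>\\S+) "
      + "(?<size>\\S+) (?<totaltime>\\S+) (?<turnaround>\\S+) "
      + "\"(?<referrer>[^\"]*)\" \"(?<agent>[^\"]*)\".*");

  /** Split the audit data carried in the referrer's query string into a key/value map. */
  static Map<String, String> splitReferrer(String referrer)
      throws UnsupportedEncodingException {
    Map<String, String> audit = new LinkedHashMap<>();
    int q = referrer.indexOf('?');
    if (q < 0) {
      return audit;
    }
    for (String pair : referrer.substring(q + 1).split("&")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        audit.put(pair.substring(0, eq),
            URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8.name()));
      }
    }
    return audit;
  }

  public static void main(String[] args) throws Exception {
    // Read log lines from stdin; print the fields a real tool would turn into records.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = ENTRY.matcher(line);
        if (m.matches()) {
          System.out.println(m.group("bucket") + "\t" + m.group("operation")
              + "\t" + m.group("status") + "\t" + splitReferrer(m.group("referrer")));
        }
      }
    }
  }
}
{code}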
Proposed
We add:
- a utility parser class to take a row and split it into a record
- which can be saved to avro through a schema we define (a sketch follows the example command below)
- or exported to CSV with/without headers (with: easy to understand; without: can cat files)
- add a mapper so this can be used in MR jobs (could even make it a committer test...)
- and a "hadoop s3guard/hadoop s3" entry point so you can do it on the CLI
hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*
This would take all files under the path, load and parse them, and emit the output.
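For the avro side, here is a hypothetical sketch of the kind of schema and writer involved, using Avro's SchemaBuilder; a map<string> field leaves room for new audit context attributes (one of the design issues below). The record name, field selection and local-file output are illustrative assumptions, not a proposed final schema.
{code:java}
import java.io.File;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Hypothetical sketch: a log-entry schema plus a writer for already-parsed rows. */
public final class S3LogAvroSketch {

  /** Core log fields plus an open-ended map for the referrer/audit attributes. */
  static final Schema SCHEMA = SchemaBuilder.record("S3AccessLogEntry")
      .namespace("org.example.s3a.logs")
      .fields()
      .requiredString("bucket")
      .requiredString("timestamp")
      .requiredString("remoteIp")
      .requiredString("requester")
      .requiredString("requestId")
      .requiredString("operation")
      .requiredString("key")
      .optionalInt("httpStatus")
      .optionalLong("bytesSent")
      .optionalLong("totalTimeMs")
      .name("referrerAudit").type().map().values().stringType().noDefault()
      .endRecord();

  /** Write parsed rows (keys must match schema field names) to a local .avro file. */
  static void writeLocal(Iterable<Map<String, Object>> rows, File dest) throws Exception {
    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
      writer.create(SCHEMA, dest);
      for (Map<String, Object> row : rows) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        row.forEach(record::put);
        writer.append(record);
      }
    }
  }
}
{code}
A real tool could instead write through the Hadoop FileSystem API (DataFileWriter.create(schema, outputStream)) so the destination can be an s3a:// path, and could emit either one combined file or one per input log, per the design questions below.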
Design issues
- would you combine all files, or emit a new .avro or .csv file for each one?
- what's a good avro schema to cope with new context attributes
- CSV nuances: tabs vs spaces, use opencsv or implement the (escaping?) writer ourselves.
me: TSV, and do a minimal escaping and quoting emitter (rough sketch below). Can use opencsv in the test suite.
- would you want an initial filter during processing? especially for exit codes?
me: no, though I could see the benefit for 503s. Best to let you load it into a notebook or spreadsheet and go from there.
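To make "minimal escaping and quoting emitter" concrete, a rough sketch (names made up, not a committed design): write tab-separated rows, quote a field only when it contains a tab, newline or quote, double any embedded quotes, and keep the header row optional so output can still be cat-ed together.
{code:java}
import java.io.IOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical sketch: minimal TSV emitter with optional header and light quoting. */
public final class TsvEmitterSketch {

  private final Writer out;
  private final boolean withHeader;

  TsvEmitterSketch(Writer out, boolean withHeader) {
    this.out = out;
    this.withHeader = withHeader;
  }

  /** Quote only when needed: the field contains a tab, newline or double quote. */
  private static String escape(String field) {
    if (field.indexOf('\t') < 0 && field.indexOf('\n') < 0
        && field.indexOf('\r') < 0 && field.indexOf('"') < 0) {
      return field;
    }
    return '"' + field.replace("\"", "\"\"") + '"';
  }

  /** Emit the column names as the first row, if headers were requested. */
  void emitHeader(List<String> columns) throws IOException {
    if (withHeader) {
      emitRow(columns);
    }
  }

  /** Emit one record as a tab-separated, newline-terminated row. */
  void emitRow(List<String> fields) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.size(); i++) {
      if (i > 0) {
        sb.append('\t');
      }
      sb.append(escape(fields.get(i)));
    }
    sb.append('\n');
    out.write(sb.toString());
  }
}
{code}
opencsv (or any TSV-aware reader) could then be used in the test suite to check that the escaping round-trips.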