Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
1.8.0
None
None
Description
KUDU-2597 tracks adding a tool to parse the metrics logs. We should also add (probably as Python scripts) some tools for analyzing the metrics logs:
- Finding tablets with unusual performance characteristics: longest apply|prepare|replicate|write|updateconsensus times
- Finding servers with the most disk activity
- Finding servers with slow scanners
- Finding replicas that are largest on disk or in-memory
- Characterizing workloads of tables and tablets (inserts/upserts/deletes/updates/duplicate key inserts/pk lookups + times/op)
- Compaction (average height, delta size, compaction times)
- Log performance (append latency, sync latency, throughput)
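Several of the items above boil down to ranking entities (tablets, servers, replicas) by a single metric. A minimal sketch of that in Python, assuming the metrics log has already been parsed into a dict keyed by entity id; the function name, the input shape, and the metric name `write_p99_us` are hypothetical and not taken from Kudu:

```python
import heapq


def top_n_by_metric(samples, metric, n=5):
    """Return the n entities with the largest value of `metric`.

    `samples` maps an entity id (e.g. a tablet id) to a dict of
    metric name -> numeric value. This input shape is an assumption
    for illustration, not the format of the metrics log.
    """
    # Entities that do not report the metric are skipped.
    candidates = [
        (values[metric], entity_id)
        for entity_id, values in samples.items()
        if metric in values
    ]
    # nlargest returns results ordered from largest to smallest.
    return [(entity_id, value) for value, entity_id in heapq.nlargest(n, candidates)]


if __name__ == "__main__":
    # Hypothetical, hand-written input for demonstration only.
    demo = {
        "tablet-a": {"write_p99_us": 120},
        "tablet-b": {"write_p99_us": 950},
        "tablet-c": {"write_p99_us": 430},
    }
    for entity_id, value in top_n_by_metric(demo, "write_p99_us", n=2):
        print(entity_id, value)
```

The same helper would cover "longest write times", "largest on disk", and similar questions by changing only the metric name.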
Some of this can be rules-based (e.g. if metric X is > constant A) and some should be pattern-based (e.g. most tablets' histograms look different from this tablet's).
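The two styles of check could be sketched as follows. This is only an illustration under stated assumptions: each tablet is reduced to one scalar per metric (for example a percentile taken from its histogram), the function names are hypothetical, and the pattern-based check uses a median-absolute-deviation outlier test as a stand-in for comparing full histograms.

```python
from statistics import median


def rule_based(values, threshold):
    """Flag entities whose metric exceeds a fixed constant."""
    return sorted(eid for eid, v in values.items() if v > threshold)


def pattern_based(values, cutoff=3.5):
    """Flag entities whose metric differs from what most entities report.

    Uses a modified z-score built on the median absolute deviation (MAD).
    The cutoff of 3.5 is a common rule of thumb, not a Kudu recommendation.
    """
    if not values:
        return []
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # More than half the entities share the same value, so anything
        # that differs from it is treated as unusual.
        return sorted(eid for eid, v in values.items() if v != med)
    return sorted(
        eid for eid, v in values.items()
        if 0.6745 * abs(v - med) / mad > cutoff
    )


if __name__ == "__main__":
    # Hypothetical per-tablet values for one metric.
    demo = {"a": 10, "b": 11, "c": 9, "d": 10, "e": 500}
    print(rule_based(demo, threshold=100))
    print(pattern_based(demo))
```

The rule-based check needs a hand-picked constant per metric, while the pattern-based check adapts to whatever is typical for the cluster, at the cost of missing problems that affect most tablets at once.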