Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Environment: linux
Description
We're running Samza 0.9.1 with Kafka 0.8.2.1, which ships with log.cleaner.enable=false by default. We didn't think we needed to enable it, since we never created any topics with cleanup.policy=compact. However, this morning we got a disk alert, and when I looked at the broker that triggered it, one of the Samza checkpoint topics was consuming 29 GB in the /logs folder.
Long story short, I eventually figured out that all of the checkpoint topics had been created with cleanup.policy=compact and were growing without bound. I set log.cleaner.enable=true on each broker and restarted them. Within minutes, the 29 GB was reduced to 200-300 KB.
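For anyone hitting the same thing, this is roughly how the problem can be confirmed and fixed (the topic name below is only an example; substitute your job's actual checkpoint topic):

    # Check the checkpoint topic's configuration (Kafka 0.8.2.1):
    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic __samza_checkpoint_ver_1_for_myjob_1
    # The output includes Configs:cleanup.policy=compact, which only has an effect
    # if the log cleaner is running.

    # On every broker, enable the log cleaner in config/server.properties and restart the broker:
    log.cleaner.enable=true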
I assumed I must have missed this when I created our jobs with checkpointing enabled, so I scoured the docs. There's no mention of the log.cleaner.enable setting anywhere in the documentation (unless I missed it again).
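For reference, checkpointing in these jobs is enabled with configuration along these lines (values are illustrative, not our exact settings):

    task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
    task.checkpoint.system=kafka
    task.checkpoint.replication.factor=1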
I should add that we've been running most of these jobs for about a year, and each time we deployed, the transition from ACCEPTED to RUNNING in the YARN cluster took longer and longer. Eventually it was taking 10-15 minutes per job, and we didn't understand why; presumably each container had to read the entire, uncompacted checkpoint topic at startup. After bouncing our staging cluster with log.cleaner.enable=true (and letting the log cleaner finish its work), I redeployed one of our jobs, and it once again took 15-20 seconds to go from ACCEPTED to RUNNING.
Please mention in the documentation that log.cleaner.enable must be set to true for checkpointing to work correctly.