Details
Task
Status: Resolved
Critical
Resolution: Fixed
None
None
Twitter Aurora Q1'15 Sprint 2
2
Description
We have noticed that during scheduler startup, a significant amount of time can elapse between the following two log lines:
Performing shard uniqueness sanity check.
storage state machine transition PREPARED -> READY
Looking at what happens in the scheduler between those points, the expensive operation seems to be guaranteeShardUniqueness.
This operation aims to validate the integrity of the storage, but its value is dubious. There are many other things that could be done to validate integrity, but they should probably not be done every time the scheduler loads its database.
If the operation is kept, it can be dramatically optimized. It currently performs an O(n^2) scan of tasks, and this could trivially be reduced to O(n).
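As a rough illustration of the O(n) approach, the sketch below finds duplicate shards in a single pass by recording each (job key, shard ID) pair in a hash set. It is not the scheduler's actual code: the Task class, its jobKey and shardId fields, and the findDuplicateShards method are hypothetical names chosen for this example, and it assumes uniqueness means that no two tasks share the same job key and shard ID.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ShardUniquenessSketch {
  /** Hypothetical stand-in for a stored task: a job key plus a shard ID. */
  static final class Task {
    final String jobKey;
    final int shardId;

    Task(String jobKey, int shardId) {
      this.jobKey = jobKey;
      this.shardId = shardId;
    }
  }

  /**
   * Returns every (jobKey, shardId) pair that appears more than once.
   * Each task is visited exactly once, and each hash set operation takes
   * expected constant time, so the whole scan is expected O(n).
   */
  static Set<Map.Entry<String, Integer>> findDuplicateShards(List<Task> tasks) {
    Set<Map.Entry<String, Integer>> seen = new HashSet<>();
    Set<Map.Entry<String, Integer>> duplicates = new HashSet<>();
    for (Task task : tasks) {
      Map.Entry<String, Integer> key =
          new SimpleImmutableEntry<>(task.jobKey, task.shardId);
      // Set.add returns false when the key is already present.
      if (!seen.add(key)) {
        duplicates.add(key);
      }
    }
    return duplicates;
  }
}
```

An empty result means every shard is unique. A pairwise comparison of tasks would reach the same answer but needs on the order of n^2 comparisons.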