[AURORA-1615] Preemptor crashes scheduler during host maintenance - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.13.0
Component/s: Scheduler
Labels:
None

Description

We have noticed an occasional scheduler failover when host maintenance is in effect:

To index multiple values under a key, use Multimaps.index.
        at com.google.common.collect.Maps.uniqueIndex(Maps.java:1215)
        at com.google.common.collect.Maps.uniqueIndex(Maps.java:1173)
        at org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$224(PendingTaskProcessor.java:130)
        at org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:138)
        at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
        at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
        at org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:570)
        at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
        at org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:119)

Diffing colliding HostOffer objects revealed the only difference is in HostAttributes maintenance mode value:
mode=NONE vs. mode=DRAINING

Upon examination it appears that it's quite possible to have duplicate HostOffer instances (same offer, same slave, different maintenance mode) due to the way offers are accessed as unmodifiable view over underlying ConcurrentSkipListSet. Here is the possible sequence:

Pending task processor starts building unique index and the offers iterator pulls OfferA with mode=None
A host drain operation is initiated, a HostAttributesChanged event is raised
OfferManager processes HostAttributeChanged event and atomically swaps OfferA with OfferA' (mode=DRAINING)
iterator.next() inside of the uniqueIndex routine pulls OfferA' and the error is raised.

We should either copy inside a synchronized getOffers() implementation or deal with possible duplicates at call site. I tend to think copying on access is a better approach. The only consumer of getOffers() is PendingTaskProcessor with a relatively infrequent run loop (1 minute), so the perf impact of making a copy of all offers within a synchronized method should be acceptable. The alternative implies leaking the abstraction of host maintenance mode into the preemptor, which is less than ideal.

Attachments

Activity

People

Assignee:: Maxim Khutornenko

Reporter:: Maxim Khutornenko

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 11/Feb/16 17:53

Updated:: 12/Apr/16 13:58

Resolved:: 12/Feb/16 01:31