Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
1.10.0
None
1
Description
Stacktrace:
2020-04-03 08:34:58.007285 +0000 UTC F0403 08:34:58.003100 2717 hierarchical.cpp:2461] Check failed: 'getFramework(frameworkId)' Must be SOME
2020-04-03 08:34:58.007563 +0000 UTC *** Check failure stack trace: ***
2020-04-03 08:34:58.007827 +0000 UTC I0403 08:34:58.003136 2713 master.cpp:1721] Sending register ACK to: overlay-agent@172.16.39.81:5051
2020-04-03 08:34:58.008064 +0000 UTC I0403 08:34:58.003142 2715 master.cpp:9963] Adding framework b4fd9630-674e-4dea-b072-c3c48ccfdd42-0000 (marathon) with roles { } suppressed
2020-04-03 08:34:58.008305 +0000 UTC I0403 08:34:58.004185 2714 master.cpp:7635] Ignoring update on agent b4fd9630-674e-4dea-b072-c3c48ccfdd42-S38 at slave(1)@172.16.6.89:5051 (172.16.6.89) as it reports no changes
2020-04-03 08:34:58.008568 +0000 UTC @ 0x7fb70eda72ad google::LogMessage::Fail()
2020-04-03 08:34:58.010292 +0000 UTC @ 0x7fb70eda9508 google::LogMessage::SendToLog()
2020-04-03 08:34:58.010583 +0000 UTC @ 0x7fb70eda6e43 google::LogMessage::Flush()
2020-04-03 08:34:58.012035 +0000 UTC @ 0x7fb70eda9e49 google::LogMessageFatal::~LogMessageFatal()
2020-04-03 08:34:58.013252 +0000 UTC @ 0x7fb70d94748d _check_not_none<>()
2020-04-03 08:34:58.014963 +0000 UTC @ 0x7fb70d940f84 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::generateInverseOffers()
2020-04-03 08:34:58.016681 +0000 UTC @ 0x7fb70d9414a1 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::_generateOffers()
2020-04-03 08:34:58.017498 +0000 UTC @ 0x7fb70d94ee32 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingN5mesos8internal6master9allocator8internal28HierarchicalAllocatorProcessEEENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_vEEUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteIST_EES3_E_ISW_St12_PlaceholderILi1EEEEEEclEOS3_
2020-04-03 08:34:58.020673 +0000 UTC @ 0x7fb70ecf34b1 process::ProcessBase::consume()
2020-04-03 08:34:58.022404 +0000 UTC @ 0x7fb70ed0812b process::ProcessManager::resume()
2020-04-03 08:34:58.023133 +0000 UTC @ 0x7fb70ed0eb36 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
2020-04-03 08:34:58.023782 +0000 UTC @ 0x7fb70a9772b0 (unknown)
2020-04-03 08:34:58.024105 +0000 UTC @ 0x7fb70a195e65 start_thread
2020-04-03 08:34:58.024669 +0000 UTC @ 0x7fb709ebe88d __clone
The crash occurs immediately after an agent is re-added following a master failover.
The issue was introduced by this patch:
https://reviews.apache.org/r/71428
which did not account for the fact that `addSlave()` takes per-framework used resources as an argument, and that this argument can contain frameworks that have not yet been added to the allocator.
(Note that when the master re-registers an agent, it first calls `addSlave()` and only then calls `addFramework()` for the frameworks recovered from the agent.)
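The ordering problem described above can be illustrated with a small toy model. This is only a sketch: `ToyAllocator`, `unknownFrameworks()`, and the CPU-count representation of used resources are hypothetical names invented for illustration, not the Mesos allocator API, and the sketch does not claim to show how the bug was actually fixed. It only shows why a lookup that assumes every referenced framework is already known can fail between `addSlave()` and `addFramework()`.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: the method names echo the ticket's vocabulary, but this
// class is NOT the Mesos allocator and its signatures are assumptions.
class ToyAllocator {
public:
  // Registers a framework with the allocator.
  void addFramework(const std::string& frameworkId) {
    frameworks_.insert(frameworkId);
  }

  // Registers an agent together with its per-framework used resources
  // (simplified here to a CPU count per framework ID). The map may name
  // frameworks that addFramework() has not seen yet.
  void addSlave(const std::string& slaveId,
                const std::map<std::string, double>& usedCpus) {
    assert(!slaveId.empty());
    used_[slaveId] = usedCpus;
  }

  // Returns the framework IDs that some agent references but that the
  // allocator does not know about. An unconditional "must be SOME" lookup
  // would fail for each of these.
  std::vector<std::string> unknownFrameworks() const {
    std::set<std::string> unknown;
    for (const auto& slave : used_) {
      for (const auto& entry : slave.second) {
        if (frameworks_.count(entry.first) == 0) {
          unknown.insert(entry.first);
        }
      }
    }
    return std::vector<std::string>(unknown.begin(), unknown.end());
  }

private:
  std::set<std::string> frameworks_;
  std::map<std::string, std::map<std::string, double>> used_;
};
```

In this model, calling `addSlave("S38", {{"marathon", 2.0}})` before `addFramework("marathon")` leaves "marathon" in the result of `unknownFrameworks()`. Any code that runs in that window and requires the framework to exist hits the same kind of failed check as in the stack trace. Once `addFramework("marathon")` has been called, the list is empty again.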