Details
Bug
Status: Resolved
Blocker
Resolution: Fixed
1.5.0
None
Mesosphere Sprint 70
2
Description
A resource provider might resubscribe while its old HTTP connection has not been properly closed. In that case the agent will crash with, e.g., the following log:
I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource provider {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource provider 8e71beef-796e-4bde-9257-952ed0f230a5
I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource provider 8e71beef-796e-4bde-9257-952ed0f230a5
E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: resourceProviders.subscribed.contains(resourceProviderId)
*** Check failure stack trace: ***
E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
@ 0x1125380ef google::LogMessageFatal::~LogMessageFatal()
@ 0x112534ae9 google::LogMessageFatal::~LogMessageFatal()
I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation for 1 agents in 61830ns
I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13) disconnected
I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13)
I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13)
I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
@ 0x115f2761d mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
@ 0x115f2977d _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
@ 0x115f29740 _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEE13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEEOSN_OSO_N5cpp1416integer_sequenceImJXspT2_EEEEOSP_
@ 0x115f296bb _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_EEEEESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOSY_
@ 0x115f2965d _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
@ 0x115f29631 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEJEEEvOT_DpOT0_
@ 0x115f29526 _ZNO6lambda12CallableOnceIFvvEE10CallableFnINS_8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS7_14HttpConnectionERKNS6_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEEclEv
@ 0x10b6ca690 _ZNO6lambda12CallableOnceIFvvEEclEv
@ 0x10be09295 _ZZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS_4UPIDEOT_ENKUlOS7_PNS_11ProcessBaseEE_clESD_SF_
@ 0x10be09180 _ZN5cpp176invokeIZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS1_4UPIDEOT_EUlOS9_PNS1_11ProcessBaseEE_JS9_SH_EEEDTclclsr3stdE7forwardISD_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESE_DpOSJ_
@ 0x10be0912b _ZN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_NSt3__112placeholders4__phILi1EEEEE13invoke_expandISI_NSJ_5tupleIJS9_SM_EEENSP_IJOSH_EEEJLm0ELm1EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardISD_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEESE_OST_N5cpp1416integer_sequenceImJXspT2_EEEEOSU_
@ 0x10be0905f _ZNO6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_NSt3__112placeholders4__phILi1EEEEEclIJSH_EEEDTcl13invoke_expandclL_ZNSJ_4moveIRSI_EEONSJ_16remove_referenceISD_E4typeESE_EdtdefpT1fEclL_ZNSP_IRNSJ_5tupleIJS9_SM_EEEEESU_SE_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOS11_
@ 0x10be08f4d _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS1_12CallableOnceIFvvEEEEEvRKNS4_4UPIDEOT_EUlOSB_PNS4_11ProcessBaseEE_JSB_NSt3__112placeholders4__phILi1EEEEEEJSJ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESG_DpOSQ_
@ 0x10be08f11 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS5_4UPIDEOT_EUlOSC_PNS5_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EEEEEEJSK_EEEvSH_DpOT0_
@ 0x10be08d36 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEEEEEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_NSt3__112placeholders4__phILi1EEEEEEEclEOS3_
@ 0x11fd64bc9 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@ 0x11fd64a69 process::ProcessBase::consume()
@ 0x11fe20ac4 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@ 0x113c77819 process::ProcessBase::serve()
@ 0x11fd5b8c9 process::ProcessManager::resume()
@ 0x11fe8260b process::ProcessManager::init_threads()::$_1::operator()()
@ 0x11fe82190 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_1EEEEEPvSB_
@ 0x7fff64da56c1 _pthread_body
@ 0x7fff64da556d _pthread_start
@ 0x7fff64da4c5d thread_start
Abort trap: 6
This is due to a race condition in resource_provider/manager.cpp when handling closed HTTP connections of resource providers. If a resource provider resubscribes while its old HTTP connection is still open, the resource provider manager will close the old connection. This close is unexpected and triggers closing the new HTTP connection as well, which results in a failed CHECK.
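One common way to guard against this kind of race is to tag each connection with a generation number and ignore close events that refer to an older generation. The following is a minimal, self-contained sketch of that idea only. It is not the Mesos code and not necessarily how this issue was fixed; the `Manager` class, `subscribe`, `onClosed`, and the generation counter are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a resource provider manager; NOT Mesos code.
class Manager {
public:
  // Registers a (re)subscription and returns the new connection's generation.
  int subscribe(const std::string& id) {
    int generation = ++nextGeneration;
    subscribed[id] = generation;  // Replaces any entry for an older connection.
    return generation;
  }

  // Called when the connection with the given generation has been closed.
  // Returns true if the provider was removed, false if the event was stale.
  bool onClosed(const std::string& id, int generation) {
    auto it = subscribed.find(id);
    if (it == subscribed.end() || it->second != generation) {
      // Stale event: the provider is gone or has already resubscribed
      // on a newer connection, so the current state is left untouched.
      return false;
    }
    subscribed.erase(it);
    return true;
  }

  bool contains(const std::string& id) const {
    return subscribed.count(id) > 0;
  }

private:
  std::map<std::string, int> subscribed;  // Provider ID -> live generation.
  int nextGeneration = 0;
};
```

With such a guard, a close event for the old connection that arrives after the resubscription is dropped instead of tearing down the new subscription, so a later lookup of the provider still succeeds.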