Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
None
Description
When an agent registers, there is currently a somewhat subtle difference in behavior between the cases when it does and does not authenticate:
- In the case that a credential IS NOT passed to the agent, we will choose a random time between 0 and the agent `registration_backoff_factor` to initiate registration. The reason for this is to avoid every agent hitting the master at once during master failover. (We also employ backoff to help this.) See: [1]
- In the case that a credential IS passed to the agent, we always attempt to authenticate and register the Agent immediately. So currently in authenticated clusters, after failover, all agents will immediately try to register with a master upon failover; though, this is helped somewhat by the fact that the authenticated codepath still uses backoff. See: [2]
It is important to resolve this disparity, not only to make the system more resilient, but also because it directly blocks us from passing many tests on platforms where authentication is not supported at all (Windows in particular). Note that there are several solutions that won't work in this case:
- The default credential in the agent tests is CRAM-MD5. Windows doesn't support this, so passing CRAM-MD5 as default to Windows will cause these tests to fail on Windows. So the only sensible default, at least if we want the Windows tests to pass, here is `none`.
- We can't remove the `delay` from the non-authenticated codepath. This provides real value to large-scale users.
- Setting `registration_backoff_factor` to be 0 or -1 will change the semantics of backoff for tests, specifically, it will cause all attempts to `delay` registration to execute immediately. It is highly undesirable to exercise a different registration backoff in tests.
So, the best long-term solution is probably to just fix the tests to work in both the `delay`'d and non-`delay`'d cases.
For some time, we have meant to make both the authenticated and unauthenticated codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically, people seem to have had a few problems with this:
1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end up trying to authenticate twice, if a new master is detected before the auth is processed. It seems to me that this should not be an issue (or at least, not any more).
2. Many of our tests depend on authenticated registration happening even if `Clock::pause()` has been called; that is, because our first attempt at authentication and Agent registration are dispatched for immediate execution, even when we pause the clock, these events should still happen. If we use a `delay`, then they are scheduled to happen in the future, and any tests employing `Clock::pause` during this time will fail.
SUCCESS CRITERIA: The resolution of this bug, at minimum, involves:
- Fixing the semantics of the above tests to pass when `HAS_AUTHENTICATION` is set to false.
- Adding `delay` to the authentication codepath as well.
- Making sure that all tests pass with `delay` and with `HAS_AUTHENTICATION` set to both true AND false.
In terms of resolution, it is useful to know the specific tests that will fail if we add a `delay` and `HAS_AUTHENTICATION` is set to `true`:
[ FAILED ] ExamplesTest.V1JavaFramework [ FAILED ] ExamplesTest.PythonFramework [ FAILED ] FaultToleranceTest.FrameworkReregister [ FAILED ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator<mesos::internal::master::allocator::HierarchicalAllocatorProcess<mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter> > [ FAILED ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module<mesos::allocator::Allocator, (mesos::internal::tests::ModuleID)6> [ FAILED ] MasterTest.EndpointsForHalfRemovedSlave [ FAILED ] MasterTest.UnreachableTaskAfterFailover [ FAILED ] MasterTest.CancelRecoveredSlaveRemoval [ FAILED ] MasterTest.RecoveredFramework [ FAILED ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable [ FAILED ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable [ FAILED ] OversubscriptionTest.Reregistration [ FAILED ] PartitionTest.ReregisterSlavePartitionAware [ FAILED ] PartitionTest.ReregisterSlaveNotPartitionAware [ FAILED ] PartitionTest.PartitionedSlaveReregistrationMasterFailover [ FAILED ] PartitionTest.PartitionedSlaveOrphanedTask [ FAILED ] PartitionTest.SpuriousSlaveReregistration [ FAILED ] PartitionTest.PartitionedSlaveStatusUpdates [ FAILED ] PartitionTest.RegistryGcByCount [ FAILED ] PartitionTest.RegistryGcByAge [ FAILED ] PartitionTest.RegistryGcRace [ FAILED ] OneWayPartitionTest.MasterToSlave [ FAILED ] ReconciliationTest.ReconcileStatusUpdateTaskState [ FAILED ] ReservationTest.ACLMultipleOperations [ FAILED ] ReservationTest.WithoutAuthenticationWithoutPrincipal [ FAILED ] ReservationTest.WithoutAuthenticationWithPrincipal [ FAILED ] SlaveTest.DuplicateTerminalUpdateBeforeAck [ FAILED ] SlaveTest.StateEndpoint [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] SlaveTest.PingTimeoutSomePings [ FAILED ] SlaveTest.ReregisterWithStatusUpdateTaskState [ FAILED ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where GetParam() = application/x-protobuf [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/1, where GetParam() = application/json [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/0, where GetParam() = application/x-protobuf [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/1, where GetParam() = application/json [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/0, where GetParam() = application/x-protobuf [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/1, where GetParam() = application/json [ FAILED ] DiskResource/PersistentVolumeTest.MasterFailover/0, where GetParam() = 0 [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 0 [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/1, where GetParam() = 1 [ FAILED ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 0 [ FAILED ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/1, where GetParam() = 1 [ FAILED ] MountDiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 2 [ FAILED ] MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 2
[1] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L948
[2] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L942
[3] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L938
[4] https://github.com/apache/mesos/commit/09b1dc3e95955aa187458fcb61e1d66b04ec3af2#diff-01648193f4029dc9fc1e024949f6ea28R562
Attachments
Issue Links
- relates to
-
MESOS-7650 Timer::cancel doesn't completely prevent spurious agent reregister loops
- Open