Description
There are a few flaky failures of connect_distributed_test in which one of the workers does not join the group quickly. The test fails in the following manner:
- The test starts each of the connect workers, and waits for their REST APIs to become available
- All workers start up, complete plugin scanning, and start their REST API
- At least one worker kicks off an asynchronous job to join the group, which hangs for an as-yet-unknown reason (30s timeout)
- The test continues without all of the members joined
- The test makes a call to the REST API that it expects to succeed, and gets an error
- The test fails without the worker ever joining the group
Instead of allowing the test to fail in this manner, we could wait for each worker to join the group with the existing 60s startup timeout. This change would go into effect for all system tests using the ConnectDistributedService, currently just connect_distributed_test and connect_rest_test.
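As a rough illustration of the proposed wait, the sketch below polls each worker until it reports group membership or the 60s timeout expires. It is not the actual ConnectDistributedService code: `wait_until` is a local stand-in polling helper, and `wait_for_workers_to_join` and the `has_joined` callable are hypothetical names, since this ticket does not say how membership would be detected.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll condition() until it returns True or timeout_sec elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def wait_for_workers_to_join(workers, has_joined, timeout_sec=60):
    """Block until every worker reports that it has joined the group.

    `has_joined` is a hypothetical callable (worker -> bool), for example a
    check for a log line or a REST response; the real check is an assumption.
    """
    for worker in workers:
        wait_until(
            lambda: has_joined(worker),
            timeout_sec=timeout_sec,
            err_msg=f"Worker {worker} did not join the group within {timeout_sec}s",
        )
```

With a wait like this in the startup path, a worker that never joins would fail the test at startup with a clear message, rather than later at an unrelated REST call.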
Alternatively, we could retry the operation that failed, or ensure that we use a known-good worker to continue the test, but these would require more involved code changes. The existing wait-for-startup logic is the most natural place to fix this issue.