Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 9.0
- None
- None
Description
Since early December 2021, the doTestIndexFetchOnLeaderRestart test has been failing around 3% of the time. The failures appear to have been introduced by SOLR-15590. In the failing runs, replication never happens on the follower: there is no logging at all from the replication handler or the index fetcher. This suggests that something hangs in the first replication request. The indexFetcher starts its fetching thread after a random interval of between 1 ms and 1000 ms. After the follower is started, the leader is restarted, which (from my observation) generally takes around 30 ms. Roughly 3% of test runs (30 ms out of 1000 ms) would therefore send the first indexFetcher request while the leader is restarting, which is in line with the failure rate we are seeing.
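The 3% estimate can be sanity-checked with a small sketch. It assumes a simplified model that is not taken from the Solr code: the first fetch delay is a uniformly random whole number of milliseconds in [1, 1000], and the leader is unavailable for the first 30 ms of that range. The class and method names are hypothetical.

```java
import java.util.Random;

// Simplified model of the suspected race; not Solr code.
class FetchRaceEstimate {

    /** Fraction of delays in [1, maxDelayMs] that land inside the first windowMs milliseconds. */
    public static double analytic(int maxDelayMs, int windowMs) {
        return (double) Math.min(windowMs, maxDelayMs) / maxDelayMs;
    }

    /** Monte Carlo estimate of the same fraction, using a seeded generator for repeatability. */
    public static double simulate(int maxDelayMs, int windowMs, int trials, long seed) {
        Random random = new Random(seed);
        int hits = 0;
        for (int i = 0; i < trials; i++) {
            // nextInt(bound) is uniform in [0, bound), so shift to [1, maxDelayMs].
            int delayMs = 1 + random.nextInt(maxDelayMs);
            if (delayMs <= windowMs) {
                hits++; // first fetch would fire while the leader is assumed to be restarting
            }
        }
        return (double) hits / trials;
    }

    public static void main(String[] args) {
        System.out.printf("analytic:  %.4f%n", analytic(1000, 30));
        System.out.printf("simulated: %.4f%n", simulate(1000, 30, 1_000_000, 42L));
    }
}
```

Under these assumptions the analytic value is 30/1000 = 0.03, which matches the observed failure rate. That is consistent with the theory but does not confirm it.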
Mike Drob and I could not reproduce the hanging indexFetcher request locally, so this is still conjecture, and we are unsure how SOLR-15590 would be affecting it.
Side note: Looking at the history of the test, its original purpose is no longer being tested either. Originally, the last part of the test made sure that there was only 1 successful index replication. That check has since been moved to before the leader is started up again (this was changed in SOLR-13577), so the test no longer checks that a full replication happens after the leader starts. We just need to add that check back in at the end of the test.
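As a rough illustration of the check that would go back at the end of the test, the sketch below polls a counter until it reaches the expected value or a timeout expires. It is not the Solr test code: the `IntSupplier` is a hypothetical stand-in for however the test reads the follower's count of successful replications, and the class and method names are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical helper; the counter source is a placeholder, not a Solr API.
class ReplicationCountCheck {

    /** Polls the counter until it equals expected, or returns false once timeoutMs has elapsed. */
    public static boolean waitForCount(IntSupplier counter, int expected, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (counter.getAsInt() == expected) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the follower's successful replication count.
        AtomicInteger replications = new AtomicInteger(0);

        // Simulate a replication completing shortly after the leader comes back.
        Thread follower = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            replications.incrementAndGet();
        });
        follower.start();

        boolean replicatedOnce = waitForCount(replications::get, 1, 5_000, 20);
        System.out.println("exactly one replication observed: " + replicatedOnce);
    }
}
```

In the real test, the assertion would run after the leader has been restarted, so that a hanging first fetch shows up as a timeout rather than passing silently.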
Issue Links
- fixes
  - SOLR-13577 TestReplicationHandler.doTestIndexFetchOnMasterRestart failures (Closed)
- is broken by
  - SOLR-15590 Start up Core Container via ServletContextListener (Closed)
- relates to
  - SOLR-17118 Solr deadlock during servlet container start (Resolved)
- links to