Description
Below is an observation from a live system:
On a large cluster with occasional topology changes, there is a sporadic hang that manifests as a "Failed to evict partition" message for one of the caches with a cache store enabled. I managed to take a heap dump and found that on the hanging node there was a single entry with the IS_EVICT_DISABLED flag set, while no other threads were performing a store load operation. Earlier in the logs I saw that the cache store had thrown a CacheLoaderException due to an interrupted connection to the database.
Currently, the flag is set before the cache store load and it is cleared after the load.
It looks like if the store throws an exception, the flag is never cleared (it leaks), so the entry cannot be cleared from the partition. As a result, on the next topology change, partition exchange freezes with the "Failed to wait for partition eviction" error message.
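The suspected leak can be illustrated with a minimal, self-contained sketch. This is not Ignite's code: `SketchEntry`, `loadLeaky`, `loadSafe`, and the flag value are hypothetical names chosen for illustration, and the store is modeled as a plain `Function`. It only shows how a set/clear pair around a call that may throw leaves the flag set, and how a `try`/`finally` block would avoid that.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for a cache entry; not Ignite's real entry class. */
final class SketchEntry {
    /** Illustrative bit value; the real flag value is an assumption here. */
    static final int IS_EVICT_DISABLED = 0x01;

    private final AtomicInteger flags = new AtomicInteger();

    /** Returns true while eviction of this entry is blocked. */
    boolean evictDisabled() {
        return (flags.get() & IS_EVICT_DISABLED) != 0;
    }

    /** Pattern described in the report: set the flag, load, clear the flag. */
    <K, V> V loadLeaky(K key, Function<K, V> store) {
        flags.updateAndGet(f -> f | IS_EVICT_DISABLED);
        V val = store.apply(key); // If this throws, the next line never runs.
        flags.updateAndGet(f -> f & ~IS_EVICT_DISABLED);
        return val;
    }

    /** Possible remedy: clear the flag in finally so an exception cannot leak it. */
    <K, V> V loadSafe(K key, Function<K, V> store) {
        flags.updateAndGet(f -> f | IS_EVICT_DISABLED);
        try {
            return store.apply(key);
        }
        finally {
            flags.updateAndGet(f -> f & ~IS_EVICT_DISABLED);
        }
    }
}
```

With a store that throws, `loadLeaky` leaves `evictDisabled()` returning true after the exception propagates, whereas `loadSafe` leaves it false. Whether the actual fix should take this shape depends on how the entry flag is managed in the real code path.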
Attached is a test reproducing this issue (note that the message appears after one minute).
Attachments
Issue Links
- is related to IGNITE-5759 IgniteCache 6 suite timed out by GridCachePartitionEvictionDuringReadThroughSelfTest.testPartitionRent (Resolved)
- links to