Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: None
Description
The method NettyClient.getNextChannel has a mechanism to detect when a channel is no longer active. When that happens, it removes the channel from the ChannelRotator while it tries to reconnect, then re-adds the channel once the reconnection succeeds.
When there are more client threads than channels, a client thread can call ChannelRotator.nextChannel while the rotator is empty because every channel is in the middle of reconnecting. The call then throws IllegalArgumentException("nextChannel: No channels exist!"), which kills the worker.
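The failure mode can be illustrated with a simplified model. The class below is a hypothetical stand-in written for this report, not Giraph's actual ChannelRotator; channels are represented as plain strings, and only the exception message is taken from the real code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a channel rotator (not Giraph's class). */
class SimpleRotator {
  private final List<String> channels = new ArrayList<>();
  private int index = 0;

  synchronized void addChannel(String channel) {
    channels.add(channel);
  }

  /** Models the removal that happens when a channel is found inactive. */
  synchronized boolean removeChannel(String channel) {
    return channels.remove(channel);
  }

  /** Round-robin selection; fails immediately if no channel is registered. */
  synchronized String nextChannel() {
    if (channels.isEmpty()) {
      // Same message as in the report: the caller has no way to tell that
      // a channel is merely reconnecting and will be re-added shortly.
      throw new IllegalArgumentException("nextChannel: No channels exist!");
    }
    if (index >= channels.size()) {
      index = 0;
    }
    return channels.get(index++);
  }
}
```

If one thread calls removeChannel on the last remaining channel and a second thread calls nextChannel before the channel is re-added, the second thread gets the exception even though the reconnection may be about to succeed.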
Instead, the calling thread should have some way of knowing that a channel is currently reconnecting so that it can wait for it. If the reconnection fails after the specified number of retries, the thread performing the reconnect will throw an exception and fail the worker, so there is no concern about hanging here.
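One possible shape for this, as a minimal sketch rather than a proposed patch: the rotator keeps a count of channels that are currently reconnecting, and nextChannel blocks while the rotator is empty and that count is non-zero. The class and the method names beginReconnect, finishReconnect, and abandonReconnect are hypothetical and do not exist in Giraph; channels are again modeled as strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rotator that waits for reconnecting channels (not Giraph's class). */
class WaitingRotator {
  private final List<String> channels = new ArrayList<>();
  private int reconnecting = 0; // channels removed and awaiting reconnection
  private int index = 0;

  synchronized void addChannel(String channel) {
    channels.add(channel);
    notifyAll();
  }

  /** Called by the thread that detected the inactive channel. */
  synchronized boolean beginReconnect(String channel) {
    boolean removed = channels.remove(channel);
    if (removed) {
      reconnecting++;
    }
    return removed;
  }

  /** Reconnection succeeded: re-add the channel and wake any waiters. */
  synchronized void finishReconnect(String channel) {
    reconnecting--;
    channels.add(channel);
    notifyAll();
  }

  /** Reconnection gave up: wake waiters so they can fail instead of hanging. */
  synchronized void abandonReconnect() {
    reconnecting--;
    notifyAll();
  }

  /** Blocks only while the rotator is empty and a reconnection is in flight. */
  synchronized String nextChannel() throws InterruptedException {
    while (channels.isEmpty() && reconnecting > 0) {
      wait();
    }
    if (channels.isEmpty()) {
      throw new IllegalArgumentException("nextChannel: No channels exist!");
    }
    if (index >= channels.size()) {
      index = 0;
    }
    return channels.get(index++);
  }
}
```

In this sketch a waiter is released either by finishReconnect, after which it picks up the restored channel, or by abandonReconnect, after which it throws as before. Because the reconnecting thread is described above as failing the worker once its retries are exhausted, waiters are not expected to block indefinitely.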
A workaround is to ensure that giraph.channelsPerServer >= giraph.nettyClientThreads, but this is often not desirable in cases with many workers.