Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Won't Fix
Affects Version: 0.9.3
Environment: Ubuntu 10.04/10.10, 8 GB RAM, ~750 GB disk, Thrift 0.5
Description
We're using an rpcSource that writes to an rpcSink configured like this:
< rpcSink( "rpcserver", 9090 ) ? { diskFailover => { insistentAppend => { stubbornAppend =>
{ insistentOpen => rpcSink( "rpcserver", 9090 ) }} } } >
When many Flume nodes write to this "rpcserver" in parallel and the rpcserver isn't fast enough to handle the incoming events as quickly as they arrive, the network buffers fill up, so with tcpdump/wireshark you see "TCP Window Full" (see http://wiki.wireshark.org/TCP_Analyze_Sequence_Numbers). The problem is that the Flume node doesn't notice this quickly, and two issues follow (a sketch of the behaviour we'd expect comes after the list):
1. the Flume node keeps sending to the overloaded node for some time, and it takes a while until it closes the connection; some events are lost.
2. until the Flume node restarts the connection, like this:
2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: Append failed java.net.SocketException: Broken pipe
2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 9090 closed
2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink open on port 9090 opened
2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened ThriftEventSink on try 0
the node takes much longer to accept events, so our RPC clients that are connected to this Flume node instance run into many timeouts (we need this to be really fast).
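To make the expectation concrete, here is a minimal sketch of the behaviour we'd like from the sink: give up on an append quickly when a write stalls because the peer's TCP window stays full, instead of blocking until the OS eventually reports a broken pipe. This is not Flume or Thrift code; the class name FailFastWriter, the timeout value, and the 10 ms retry sleep are made up for illustration.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class FailFastWriter {
    private final SocketChannel channel;
    private final long writeTimeoutMs;

    public FailFastWriter(String host, int port, long writeTimeoutMs) throws IOException {
        // Connect in blocking mode, then switch to non-blocking so writes never stall indefinitely.
        this.channel = SocketChannel.open(new InetSocketAddress(host, port));
        this.channel.configureBlocking(false);
        this.writeTimeoutMs = writeTimeoutMs;
    }

    /** Writes one serialized event, failing fast if the peer's TCP window stays full too long. */
    public void write(ByteBuffer event) throws IOException {
        long deadline = System.currentTimeMillis() + writeTimeoutMs;
        while (event.hasRemaining()) {
            int n = channel.write(event);            // returns 0 when the local send buffer is full
            if (n > 0) {
                deadline = System.currentTimeMillis() + writeTimeoutMs; // progress made, reset deadline
            } else if (System.currentTimeMillis() > deadline) {
                channel.close();
                throw new IOException("receiver too slow: write stalled for " + writeTimeoutMs + " ms");
            } else {
                try {
                    Thread.sleep(10);                // wait briefly for the send window to reopen
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted while waiting for send window");
                }
            }
        }
    }
}

With something like this, a stalled write would surface as a failed append within the timeout, so the diskFailover branch could take over quickly instead of events piling up in the socket buffer.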
So maybe we're using Flume wrong, or the mechanism doesn't queue events but tries to push them directly through the pipe, which isn't possible because the RPC server is slower. This blocking behaviour makes it unusable for us. Did we do something wrong, or is this a Flume bug?
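For clarity, this is what we mean by "queue events": the append call would only put the event into a bounded in-memory buffer and return, while a separate thread feeds the slow RPC server. The sketch below is hypothetical, not Flume code; QueueingSender, RemoteSink, the buffer size, and the 50 ms offer timeout are all invented for illustration.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueingSender implements Runnable {

    /** Stands in for the downstream Thrift RPC sink. */
    public interface RemoteSink {
        void append(byte[] event) throws Exception;
    }

    private final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<byte[]>(10000);
    private final RemoteSink sink;

    public QueueingSender(RemoteSink sink) {
        this.sink = sink;
    }

    /** Hot path: returns almost immediately; a false return means "buffer full, go to diskFailover". */
    public boolean offer(byte[] event) throws InterruptedException {
        return buffer.offer(event, 50, TimeUnit.MILLISECONDS);
    }

    /** Background thread drains the buffer at whatever rate the remote server can absorb. */
    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                sink.append(buffer.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // In a real setup the event would be handed to the failover path here.
            }
        }
    }
}

With that kind of decoupling, a slow rpcserver would only fill the local buffer instead of blocking the clients that send to our Flume node.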
Simon