Flume / FLUME-649

flume loses events and blocks when (maybe) the thrift rpc sink is too slow

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.9.3
    • Fix Version/s: 0.9.5
    • Component/s: Node
    • Environment: Ubuntu 10.04/10.10, 8GB RAM, ~750GB disk, thrift 0.5

    Description

      We're using an rpcSource which has an rpcSink like this one:

      < rpcSink( "rpcserver", 9090 ) ? { diskFailover => { insistentAppend => { stubbornAppend => { insistentOpen => rpcSink( "rpcserver", 9090 ) } } } } >

      When many flume nodes write to this "rpcserver" in parallel and the rpcserver cannot handle incoming events as fast as they arrive, the network buffers fill up, and tcpdump/wireshark shows "TCP WindowFull" (see http://wiki.wireshark.org/TCP_Analyze_Sequence_Numbers). The flume node does not detect this quickly, and two problems appear:

      1. The flume node seems to keep sending to the saturated rpcserver for some time; it takes a while until it closes the connection, and some events are lost.
      2. Until the flume node restarts the connection, as in this log excerpt:
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: Append failed java.net.SocketException: Broken pipe
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 9090 closed
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink open on port 9090 opened
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened ThriftEventSink on try 0

      it takes much longer to accept events, so our rpc clients connected to the flume node instance get many timeouts (we need fast responses).
      So maybe we're using flume incorrectly, or the mechanism doesn't queue events but tries to send them directly through the pipe, which isn't possible because of the
      slower rpc server. This blocking makes flume unusable for us. Did we do something wrong, or is this a flume bug?
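
      To make the question about queuing concrete, here is a minimal sketch of the decoupled behavior asked about above. It is not Flume code and does not describe how Flume works internally: run_decoupled, slow_sink, the in-memory overflow list (a stand-in for a disk failover log), and the capacity value are all hypothetical names chosen for illustration. Events are accepted into a bounded queue without blocking the caller, a worker thread drains them to the slow sink, and anything that does not fit is spilled instead of stalling upstream clients.

```python
import queue
import threading
import time


def slow_sink(event):
    """Hypothetical slow downstream server: each send takes 5 ms."""
    time.sleep(0.005)


def run_decoupled(events, send, capacity=4):
    """Accept events without blocking; a worker thread drains them to `send`.

    Returns (delivered, overflow). `overflow` stands in for a disk failover
    log; a real system would persist and replay it later.
    """
    buffer = queue.Queue(maxsize=capacity)
    delivered = []
    overflow = []
    stop = object()  # sentinel that tells the worker to exit

    def drain():
        while True:
            item = buffer.get()
            if item is stop:
                return
            send(item)  # only the worker waits on the slow sink
            delivered.append(item)

    worker = threading.Thread(target=drain)
    worker.start()
    for event in events:
        try:
            buffer.put_nowait(event)  # never blocks the upstream client
        except queue.Full:
            overflow.append(event)  # spill instead of stalling or dropping
    buffer.put(stop)  # blocking is acceptable only at shutdown
    worker.join()
    return delivered, overflow
```

      Calling run_decoupled(list(range(20)), slow_sink) returns every event in exactly one of the two lists, so nothing is lost and the accepting side never waits on the sink. How many events land in each list depends on timing.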

      Simon

          People

            Assignee: Unassigned
            Reporter: flume_se (disabled imported user)
            Votes: 0
            Watchers: 4
