Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Won't Fix
Affects Version: 0.9.3
Description
I've found a case where a collector using collectorSink (in e2e mode)
becomes unstable if hdfs enters safe mode (and leaves it after a
while).
My collectorSink looks like:
exec config collector02.data data-flow autoCollectorSource 'collectorSink("hdfs://host.name/user/flume/input", "data-", 30000)'
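For reference, my reading of the collectorSink arguments in 0.9.x (the labels below are my interpretation, not taken from the config itself) is:

    collectorSink(dfsDir, filePrefix, rollMillis)
    dfsDir     = "hdfs://host.name/user/flume/input"   (directory the rolled files land in)
    filePrefix = "data-"                               (prefix of each rolled file)
    rollMillis = 30000                                 (roll interval in milliseconds, hence the "30000 ms * 2" wait in step 4 below)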
The situation would be:
1- The collector is saving data to hdfs properly.
2- I turn on safe mode on hdfs.
3- The collector starts failing; this is normal, as it can't write to hdfs.
4- After the collector has been failing long enough to try to create 2 files (more
than 30000 ms * 2), I make hdfs leave safe mode.
5- From this point on, even though hdfs is not in safe mode anymore, I will sometimes
get errors like:
INFO endtoend.AckChecksumChecker: Starting checksum group called log.00000033.20110309-172500877+0100.1299687900877700000.seq
2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO endtoend.AckChecksumChecker: initial checksum is 12e9b6f86ce
2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO hdfs.EscapedCustomDfsSink: Closing hdfs://sturlese.local/user/flume/input/data-log.00000025.20110309-172540945+0100.1299687940945069000.seq
2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO hdfs.CustomDfsSink: Closing HDFS file: hdfs://sturlese.local/user/flume/input/data-log.00000025.20110309-172540945+0100.1299687940945069000.seq
2011-03-09 17:26:12,415 [logicalNode collector02.data-25] INFO hdfs.CustomDfsSink: done writing raw file to hdfs
2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO hdfs.EscapedCustomDfsSink: Closing hdfs://sturlese.local/user/flume/input/data-log.00000044.20110309-172430963+0100.1299687870963688000.seq
2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO hdfs.CustomDfsSink: Closing HDFS file: hdfs://sturlese.local/user/flume/input/data-log.00000044.20110309-172430963+0100.1299687870963688000.seq
2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO hdfs.CustomDfsSink: done writing raw file to hdfs
2011-03-09 17:26:12,435 [logicalNode collector02.data-25] ERROR rolling.RollSink: Failure when attempting to rotate and open new sink:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot add block to /user/flume/input/data-log.00000044.20110309-172430963+0100.1299687870963688000.seq. Name node is in safe mode.
Use "hadoop dfsadmin -safemode leave" to turn safe mode off.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1318)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
2011-03-09 17:26:12,435 [logicalNode collector02.data-25] INFO hdfs.EscapedCustomDfsSink: Opening hdfs://host.name/user/flume/input/data-log.00000025.20110309-172612415+0100.1299687972415285000.seq
If I refresh the collector using the shell (exec refresh
collector02.data) the error disappears and all the data that couldn't
be written suddenly gets written (thanks to the e2e mode).
Reproducing the situation many times, each time from a fresh
installation, the collector sometimes recovers without needing the
refresh, but most of the time it does not.
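For reference, these are the commands involved in toggling safe mode and in the workaround (the safe-mode ones are the standard hadoop dfsadmin commands; the refresh is run from the Flume shell):

    hadoop dfsadmin -safemode enter    (step 2: put hdfs into safe mode)
    hadoop dfsadmin -safemode leave    (step 4: bring hdfs back out of safe mode)
    exec refresh collector02.data      (Flume shell workaround that clears the errors)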
Jonathan Hsieh's tip:
At the highest level, it seems like there is some bad state sticking
around that doesn't get cleaned out properly when recovering.
Internally the escaping code ends up holding many hdfs file handles
open. My guess is that we end up in a situation with multiple handles
open when the error occurs. We properly handle one error, but the other
broken file handles (based on connections that were initiated in safe
mode) are not cleaned up. The bad handles eventually get closed when
a roll tries to flush them, and that is when the actual error message
appears.
The refresh closes everything down and then restarts it, so all the new
connections are good.
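To make that hypothesis concrete, here is a minimal sketch (not the actual Flume code; the class and method names are made up for illustration) of the suspected pattern: several HDFS writers are held open, and if closing one of them throws, the remaining stale handles are never cleaned up until the next roll flushes them and surfaces the error.

    import java.io.Closeable;
    import java.io.IOException;
    import java.util.List;

    // Hypothetical illustration of the suspected cleanup bug, not Flume's real code.
    class HandleCleanupSketch {

        // Suspected behaviour: the first close() that throws (e.g. a wrapped
        // SafeModeException from a connection opened while the name node was in
        // safe mode) aborts the loop, so the remaining stale handles stay open.
        static void closeAllNaive(List<Closeable> openHandles) throws IOException {
            for (Closeable handle : openHandles) {
                handle.close(); // an exception here leaves the later handles untouched
            }
        }

        // Defensive variant: close every handle regardless of individual failures,
        // so no bad handle survives until the next roll.
        static void closeAllDefensive(List<Closeable> openHandles) {
            for (Closeable handle : openHandles) {
                try {
                    handle.close();
                } catch (IOException e) {
                    // log and keep closing the rest instead of bailing out
                    System.err.println("Failed to close handle: " + e.getMessage());
                }
            }
        }
    }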