Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version: 0.9.5
Fix Version: None
Description
Hi,
We are running a storm-mesos cluster, and occasionally workers die or are "lost" in Mesos. When this happens, it often coincides with errors in the logs related to the supervisor's local state.
From reading the Storm code, it seems this might be caused by the way multiple supervisor processes access the local state in the same directory via VersionedStore.
For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
Here every supervisor does this concurrently:
1. reads latest state from FS
2. possibly updates the state
3. writes the new version of the state
Some updates could be lost if there are two or more supervisors and they execute the above steps concurrently: only the updates from the last supervisor to write would remain in the latest state version on disk.
We observed that the local state changes quite often (every few seconds), so the likelihood of this concurrency issue occurring is high.
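The interleaving described above can be illustrated with a minimal sketch. This is not Storm's code: the class `LostUpdateSketch`, the in-memory `versions` map standing in for the on-disk versioned directory, and the keys `worker-a` and `worker-b` are all hypothetical and exist only to show how a read-modify-write cycle without coordination drops an update.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class LostUpdateSketch {
    // Hypothetical stand-in for the shared state directory:
    // version number -> full snapshot of the state at that version.
    static final TreeMap<Long, Map<String, String>> versions = new TreeMap<>();

    // Step 1: read a copy of the latest snapshot (empty if no version exists yet).
    static Map<String, String> readLatest() {
        return versions.isEmpty()
                ? new HashMap<>()
                : new HashMap<>(versions.lastEntry().getValue());
    }

    // Step 3: write the whole snapshot as a new, higher version.
    static void writeNewVersion(Map<String, String> state) {
        long next = versions.isEmpty() ? 1 : versions.lastKey() + 1;
        versions.put(next, state);
    }

    // Replays one unlucky interleaving of two supervisors, A and B.
    static Map<String, String> runInterleaved() {
        versions.clear();

        // Both supervisors read the same (empty) latest state.
        Map<String, String> a = readLatest();
        Map<String, String> b = readLatest();

        // Each updates only its private copy.
        a.put("worker-a", "started");
        b.put("worker-b", "started");

        // Both write; B's snapshot becomes the latest version
        // and never contained A's update.
        writeNewVersion(a);
        writeNewVersion(b);

        return readLatest();
    }

    public static void main(String[] args) {
        System.out.println(runInterleaved()); // prints {worker-b=started}
    }
}
```

In this sketch the latest version contains only supervisor B's entry; A's update survives only in an older version that readers of the latest state never see.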
Some examples of exceptions:
------------------------------------------
java.lang.RuntimeException: Version already exists or data already exists
at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
---------------------------------------
java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
-----------------------------------------