Description
I just made an attempt to serialise lambdas and send them via the RemoteGraph. I didn't quite get there, but wanted to share my findings:
- it's possible to serialise lambdas on the JVM by just extending `Serializable`:
http://stackoverflow.com/questions/22807912/how-to-serialize-a-lambda/22808112#22808112
- sending a normal predicate doesn't work (this is a Scala REPL but it should be pretty easy to convert this to java/groovy)
val g = RemoteGraph.open("conf/remote-graph.properties").traversal()
val pred1 = new java.util.function.Predicate[Traverser[Vertex]] { def test(v: Traverser[Vertex]) = true }
g.V().filter(pred1).toList
// java.lang.RuntimeException: java.io.NotSerializableException: $anon$1
// on server: nothing
- simply adding Serializable lets us send it over the wire, but the server doesn't deserialise it
val pred2 = new java.util.function.Predicate[Traverser[Vertex]] with Serializable { def test(v: Traverser[Vertex]) = true }
g.V().filter(pred2).toList
// on server: [WARN] OpExecutorHandler - Could not deserialize the Traversal instance
org.apache.tinkerpop.gremlin.server.op.OpProcessorException: Could not deserialize the Traversal instance
at org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor.iterateOp(TraversalOpProcessor.java:135)
at org.apache.tinkerpop.gremlin.server.handler.OpExecutorHandler.channelRead0(OpExecutorHandler.java:68)
// on client: org.apache.tinkerpop.gremlin.driver.exception.ResponseException: $anon$1
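For readers following along in Java rather than the Scala REPL, here is a minimal sketch of the two cases above, with no RemoteGraph involved. The class and method names (`LambdaRoundTrip`, `plain`, `serialisable`) are made up for illustration, and the round trip stays inside one JVM, so it only shows that the `Serializable` marker is what gets the lambda past `ObjectOutputStream`; it says nothing about what the server can do with the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Predicate;

// Hypothetical demo class, not part of TinkerPop.
class LambdaRoundTrip {

    // A plain lambda: the class generated for it does not implement Serializable.
    static Predicate<String> plain() {
        return s -> !s.isEmpty();
    }

    // The intersection cast asks the compiler for a serialisable lambda.
    static Predicate<String> serialisable() {
        return (Predicate<String> & Serializable) s -> !s.isEmpty();
    }

    // Returns the serialised bytes, or null if the object is not serialisable.
    static byte[] toBytes(Object o) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(o);
        } catch (NotSerializableException e) {
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads a predicate back; this works here because the same JVM still has the lambda's code.
    @SuppressWarnings("unchecked")
    static Predicate<String> fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Predicate<String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("plain lambda rejected: " + (toBytes(plain()) == null));
        Predicate<String> copy = fromBytes(toBytes(serialisable()));
        System.out.println("copy accepts \"gremlin\": " + copy.test("gremlin"));
        System.out.println("copy accepts \"\": " + copy.test(""));
    }
}
```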
Issue Links
- is related to: TINKERPOP-1278 Implement Gremlin-Python and general purpose language variant test infrastructure (Closed)
- relates to: TINKERPOP-575 Implement RemoteGraph (Closed)
Activity
I am not saying this will work for all cases, but I am hoping that we can make it work for simple lambdas (which by definition do not close over variables outside of their parameter list). Can you describe exactly what the roadblock is? That way other people can jump in and help.
I've created a simple java project that demonstrates that it's possible to serialise a lambda and execute it on a different jvm. The serialised lambda is actually checked into git, and you should be able to run it on your machine:
git clone https://github.com/mpollmeier/jvm-lambda-serialisation.git
mvn clean compile exec:java -Dexec.mainClass="LambdaDeserialisation"
This is definitely possible as Spark does serialize the Java lambdas across the cluster. I think we should definitely figure this out for Gremlin. Unfortunately, I don't think it will be possible in Gremlin-Groovy.
gremlin> inputRDD.flatMapValues{x -> x.get().edges(OUT)}.count()
Task not serializable
Display stack trace? [yN]
I think we tried this in the past. The problem was that on the same machine it works, but then you move it to a different machine and it doesn't deserialize correctly. ... I forget the specifics but I think that was the problem with this "simple serialization" technique.
okram did you try my example above? I used the 'simple' technique to serialise a lambda, committed it to the repo and you will be able to deserialise and execute it on your machine. This proves that it's feasible; however, obviously something's more complicated in gremlin-groovy, and that's what I'd like to understand. If other peeps worked on this it would be good to hear what they found out.
I did some looking into this for Groovy Closures and found a few resources that seem to highlight the problem with trying to serialize a groovy closure and a workable solution to it. The issues:
1) Some probably important internal fields (e.g. owner, delegate, thisObject) are not guaranteed to implement Serializable or be serializable and must be stripped - groovy provides dehydrate()/rehydrate() for this (see https://issues.apache.org/jira/browse/GROOVY-5151)
2) As Stephen pointed out it seems the big problem is getting the bytecode for the closure to go across the wire too and then be deserialized and loaded properly at the other end without exploding.
These two blog posts, where the latter picked up on the work of the former, build towards the solution.
1) http://seeallhearall.blogspot.ca/2012/01/remoting-groovy-with-generated-closures.html
- Difficulty of acquiring the closure byte code and potential red herrings out there on the internet
- Retrieve a closure's bytecode using java.lang.instrument.ClassFileTransformer
- Attach a java agent to a running JVM not started with -javaagent -> access instrumentation instance
- Overall, getting a closure's bytecode reduced to this:
import org.helios.gmx.classloading.*
foo = {message -> println message}
bytes = ByteCodeRepository.getInstance().getByteCode(foo.getClass())
println "Class:${foo.getClass().getName()} ByteCode:${bytes.length} bytes"
2) Expanding on (1), http://thegridman.com/uncategorized/groovy-oracle-coherence-yeah-baby/
- Using the ByteCodeRepository class from GroovyMX mentioned above to get Closure bytecode
- Implements a GroovyClosureSerializer class (POF, KryoSerializer would be similar right?)
- Implements a GroovyClosureClassLoader ensuring bytecode is used when the closure is deserialized
A slightly abridged passage from article (2), because it sums up the whole approach and I'm certainly not going to put it in my own words any better:
"In effect what the GroovyMX does is use the Java Agent API to attach to the current process and then use the instrumentation API to be able to intercept class loading and see byte code. Rather than copy the techniques or pull out bits of code it was easier at this point to just include the GroovyMX jar as a dependency of my code and use the couple of classes I needed directly. To obtain the byte code of a class GroovyMX contains a class called org.helios.gmx.util.ByteCodeNet which has a method called getClassBytes. This method returns a Map keyed on class name with values as byte[] which are then easy to POF serialize. As you can see it is pretty simple with the addition of two lines to the serialize method. We can also change the deserialize method to deserialize the byte code. But we still have a problem as what do we do with the byte code to make sure it gets used. The obvious answer is we need a special ClassLoader that we can pass this byte code to and that will be used when we deserialize the Closure. Now we can add this [GroovyClosureClassLoader] to our derserialize method and make our DefaultSerializer use this ClassLoader instead of the context ClassLoader."
He goes on to show this working in action. Hope this helps somewhat.
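To make the class loader part of that passage concrete, here is a rough Java sketch of the idea: a loader that defines classes from a map of received bytecode, plus an `ObjectInputStream` that resolves classes through it. This is not the article's `GroovyClosureClassLoader` and not GroovyMX code; `BytecodeClassLoader`, `LoaderAwareObjectInputStream` and `canLoad` are hypothetical names, and the demo simply reads one class's own bytecode from the classpath as a stand-in for bytes that arrived over the wire.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader: defines classes from bytecode keyed by class name.
class BytecodeClassLoader extends ClassLoader {
    private final Map<String, byte[]> bytecode;

    BytecodeClassLoader(Map<String, byte[]> bytecode, ClassLoader parent) {
        super(parent);
        this.bytecode = new HashMap<>(bytecode);
    }

    // Called by loadClass only after the parent loader has failed to find the class.
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = bytecode.get(name);
        if (bytes == null) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytes, 0, bytes.length);
    }

    // Convenience check used by the demo below.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for received bytecode: this class's own .class file, if the classpath exposes it.
        String name = BytecodeClassLoader.class.getName();
        Map<String, byte[]> received = new HashMap<>();
        try (InputStream in = BytecodeClassLoader.class.getResourceAsStream(
                "/" + name.replace('.', '/') + ".class")) {
            if (in != null) {
                received.put(name, in.readAllBytes());
            }
        }
        // A null parent means only the bootstrap loader, which cannot see application classes.
        ClassLoader remote = new BytecodeClassLoader(received, null);
        System.out.println(name + " loadable from received bytes: " + canLoad(remote, name));
    }
}

// Hypothetical stream: resolves classes through the given loader instead of the default one.
class LoaderAwareObjectInputStream extends ObjectInputStream {
    private final ClassLoader loader;

    LoaderAwareObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
        super(in);
        this.loader = loader;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
        try {
            return Class.forName(desc.getName(), false, loader);
        } catch (ClassNotFoundException e) {
            return super.resolveClass(desc); // falls back for primitives and similar special cases
        }
    }
}
```

The sketch leaves out the hard part the blog posts deal with, namely obtaining the bytecode of a generated closure class on the sending side.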
A method for lambda serialization was implemented on TINKERPOP-1278 as part of the work on gremlin-python. This issue will close with that one.
spmallette hmm, really? 1278 is mainly about generating groovy strings from other languages if I read it right? And the tutorial (http://tinkerpop.apache.org/docs/3.2.1-SNAPSHOT/tutorials/gremlin-language-variants/) states "Lambdas are not supported". Can you clarify please?
mpollmeier There is a standard method for serializing lambdas to bytecode that will apply for all remoting over GLVs. I don't think we'll implement any further than that:
http://tinkerpop.apache.org/docs/3.2.2-SNAPSHOT/reference/#_the_lambda_solution
I guess this issue was about serializing purely on the JVM. I looked at your sample repo a while back, but I must have forgotten to comment. That approach will work, because the code for the predicate you serialized will be deserialized by the same program - even on two different JVMs. In other words the code for the predicate is available to both JVMs in the LambdaSerialisation.java file, so - no problem. But if you think about how Gremlin Server works, if you used that program to generate the .ser file and then send that file to Gremlin Server, it can try to deserialize, but it won't have the predicate code and it will fail. The bottom line is that Java serialization does not serialize "the code", just "data".
You can simulate that pretty easily if you run your project to generate lambda.ser, copy the whole project, edit LambdaSerialisation.java and delete the Predicate, then run LambdaDeserialisation from that project on the .ser file. It will fail with:
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ jvm-serialisation ---
reading lambda from lambda.ser
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: MyPredicate
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:628)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at LambdaDeserialisation.main(LambdaSerialisation.java:10)
... 6 more
So unless you are suggesting as part of your solution here that Gremlin Server have the "Predicate" on the server and on the client or something along those lines, I'm not sure I see how this will work. Or perhaps i'm missing something else?
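The same failure can be reproduced in a single file without copying the project. The sketch below is only an approximation of the experiment above: `MissingClassDemo` and its nested `MyPredicate` are stand-ins written for this example (not the classes from the sample repo), and the "server" is simulated by an `ObjectInputStream` subclass that refuses to resolve one class name, as if that class were absent from the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

// Hypothetical demo class; all names here are illustrative.
class MissingClassDemo {

    // Stand-in for a predicate class that exists on the client only.
    static class MyPredicate implements Predicate<String>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public boolean test(String s) {
            return !s.isEmpty();
        }
    }

    // Simulates a JVM whose classpath lacks the class named `missing`.
    static class ServerSideStream extends ObjectInputStream {
        private final String missing;

        ServerSideStream(InputStream in, String missing) throws IOException {
            super(in);
            this.missing = missing;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
            if (desc.getName().equals(missing)) {
                throw new ClassNotFoundException(desc.getName());
            }
            return super.resolveClass(desc);
        }
    }

    static byte[] serialise(Serializable o) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True if the bytes can be read on a JVM lacking `missingClass` (null means nothing is lacking).
    static boolean deserialises(byte[] bytes, String missingClass) {
        try (ObjectInputStream in = new ServerSideStream(new ByteArrayInputStream(bytes), missingClass)) {
            in.readObject();
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Checks whether the raw stream contains the given ASCII text.
    static boolean mentions(byte[] bytes, String text) {
        return new String(bytes, StandardCharsets.ISO_8859_1).contains(text);
    }

    public static void main(String[] args) {
        byte[] bytes = serialise(new MyPredicate());
        String name = MyPredicate.class.getName();
        System.out.println("stream names the class: " + mentions(bytes, name));
        System.out.println("stream names a method called by test(): " + mentions(bytes, "isEmpty"));
        System.out.println("readable where the class exists: " + deserialises(bytes, null));
        System.out.println("readable where the class is missing: " + deserialises(bytes, name));
    }
}
```

The stream identifies the class by name and carries its field data, so the receiving side has to find the class definition by some other route.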
okram can you update the GLV tutorial about lambdas please?
What does the GLV tutorial need updating with? Please see 3.2.2-SNAPSHOT tutorial.
ah - never mind - i misread the version mpollmeier was pointing at
Sorry for mixing up the versions of documentation. The 'lambda' solution is sending code as strings over the wire, that's not quite what I had in mind. It doesn't really matter for dynamic languages, but if used with a statically compiled language like Scala or Java you lose all the help from the compiler / IDE.
I see what you're saying re my lambda serialisation project - not ideal. Another option would be to allow gremlin-server to be extended with some endpoints and additional libraries - the current plugin system is not powerful enough for that task, is it?
One of the things I'm trying to figure out is how I can use Gremlin-Scala with DSE graph. I'm having a hard time seeing how that could work, since DSE is closed source and doesn't really allow extending it with additional libraries. But since DSE doesn't have a remote protocol (unlike Titan) and I can't really send remote queries to the integrated Gremlin Server (at least not without losing all the help from the compiler), I don't currently see a way to integrate with it. If this is not the right place to discuss such matters, just let me know - maybe chat is better?
yes - i think deploying additional libs (with the lambdas in them) to gremlin server and to the client would work from the serialization perspective. Of course, we had a discussion on the dev mailing list a while back about dropping JVM serialization of a Traversal for communication with Gremlin Server (in favor of just Bytecode), so technically it won't help now. That code is easy to resurrect though and could be implemented as a custom OpProcessor without too much trouble.
With respect to DSE Graph, you would have to use the withRemote() feature of TinkerPop. Of course, that still doesn't make the transition seamless or solve all the problems you would likely have. You should be able to put libs in DSE's path so that Gremlin Server could pick them up. So you could at least move your logic to the server where the graph instances live. I imagine it might mean you have to refactor a bit to get that all to work nicely though.....
Thanks for your thoughts Stephen. Dropping additional libs into DSE sounds doable yet dangerous. If I did so, would I be able to call my code though?
I'd need some plugin system, e.g. to define an endpoint that's being exposed that then calls my code. If I read the OpProcessor interface right then this could do it for websocket? Is there a similar way for http?
Thinking on this some more, I'm not sure I know everything will work the way I was envisioning. There may yet be some holes in how lambdas are evaluated that would prevent this from working perfectly - that may be generally true in TinkerPop and not just a DSE issue. Once we get this 3.2.2/3.1.4 release out I will be adding a number of tickets based on the "remote" work. I've made a note to look into this issue more carefully and to create tickets as needed.
Many TinkerPop man-hours have been piled into this by many different people and i'm not aware of any changes in java that will allow this to work. Serialization of a lambda is easy within the same JVM, but as soon as you ship those bytes to a new JVM it will explode. It is my understanding that Java serialization does not serialize "code" - it serializes data in objects and that isn't enough to allow you to deserialize on the server for all cases.
It would be amazing if you could figure out how this would work, but it sounds like you've reached the same roadblocks that everyone else has.
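One way to see what a serialised Java lambda actually contains is to look at its `java.lang.invoke.SerializedLambda` form. The sketch below relies on the synthetic `writeReplace` method that serialisable lambdas carry and calls it reflectively; that is an inspection trick rather than a supported API, and `SerializedLambdaInspector` is a name made up for this example. The form records the capturing class, the name of the compiler-generated implementation method and any captured arguments, which fits the understanding stated above that the bytes hold names and data rather than code.

```java
import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.lang.reflect.Method;
import java.util.function.Predicate;

// Hypothetical helper for inspecting the serialised form of a lambda.
class SerializedLambdaInspector {

    static Predicate<String> serialisable() {
        return (Predicate<String> & Serializable) s -> !s.isEmpty();
    }

    // Invokes the lambda's synthetic writeReplace() to obtain the object that would be written.
    static SerializedLambda describe(Object lambda) {
        try {
            Method writeReplace = lambda.getClass().getDeclaredMethod("writeReplace");
            writeReplace.setAccessible(true);
            return (SerializedLambda) writeReplace.invoke(lambda);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("not a serialisable lambda", e);
        }
    }

    public static void main(String[] args) {
        SerializedLambda form = describe(serialisable());
        // Everything below is a name or a count; the body of the lambda is not part of the form.
        System.out.println("capturing class:      " + form.getCapturingClass());
        System.out.println("functional interface: " + form.getFunctionalInterfaceClass());
        System.out.println("implementation:       " + form.getImplClass() + "." + form.getImplMethodName());
        System.out.println("captured arguments:   " + form.getCapturedArgCount());
    }
}
```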