[TINKERPOP-319] BulkLoaderVertexProgram for generalized batch loading across graphs - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.1-incubating
Fix Version/s: 3.1.0-incubating, 3.0.2-incubating
Component/s: process
Labels: None

Description

After working on BulkLoaderVertexProgram for Titan, it is trivial to add this generally to TinkerPop, as the equivalent of BlueprintsOutputFormat (or whatever the Blueprints-specific bulk loader was called). However, given that Titan and TinkerPop have the same data model, Titan having its own BulkLoaderVertexProgram isn't necessary as there is no longer a data model alignment issue. The difference would be that instead of:

g.V.compute().program(BulkLoaderVertexProgram.build().titan(propertiesFile).create()).submit()

It would simply be:

g.V.compute().program(BulkLoaderVertexProgram.build().factory(propertiesFile).create()).submit()

...and BulkLoaderVertexProgram would use GraphFactory.open() to instantiate the connection to the graph. Moreover, (and spmallette will need to clear my head here), if the factory opened up a Gremlin Server connection, then we get parallel writing to embedded graph databases like Neo4j.

BulkLoaderVertexProgram is simply a vertex program that parallel loads a graph (with a graph computer) to any other graph that can be accessed via GraphFactory (which is every TP3 graph).
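
The factory idea above can be illustrated without TinkerPop on the classpath. The following is only a sketch under stated assumptions: `ToyGraph`, `ToyGraphFactory`, and the `graph.impl` key are hypothetical stand-ins invented for this example, not the TinkerPop API. It shows how a single properties-driven entry point lets a loader stay agnostic of the target graph implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

final class FactorySketch {

    // Hypothetical stand-in for a target graph; NOT a TinkerPop interface.
    interface ToyGraph {
        String describe();
    }

    // Hypothetical factory: resolves the implementation from configuration,
    // so the loader never names a vendor-specific class.
    static final class ToyGraphFactory {
        private static final Map<String, Function<Properties, ToyGraph>> OPENERS = new HashMap<>();

        static void register(String impl, Function<Properties, ToyGraph> opener) {
            OPENERS.put(impl, opener);
        }

        static ToyGraph open(Properties conf) {
            String impl = conf.getProperty("graph.impl"); // made-up key for this sketch
            Function<Properties, ToyGraph> opener = OPENERS.get(impl);
            if (opener == null) {
                throw new IllegalArgumentException("No graph registered for: " + impl);
            }
            return opener.apply(conf);
        }
    }

    public static void main(String[] args) {
        ToyGraphFactory.register("toy-memory", c -> () -> "in-memory graph");
        ToyGraphFactory.register("toy-disk", c -> () -> "on-disk graph at " + c.getProperty("path"));

        Properties conf = new Properties();
        conf.setProperty("graph.impl", "toy-disk");
        conf.setProperty("path", "/tmp/toy");

        // The loader side only ever calls open(conf).
        System.out.println(ToyGraphFactory.open(conf).describe());
    }
}
```

Swapping the target then only means changing the properties, which is the point of replacing a vendor-specific builder method with a generic factory call.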

dalaro @mbroecheler dkuppitz

EXTENDED NOTES:

SchemaInference would be a MapReduce job executed prior to BulkLoaderVertexProgram
Titan and Neo4j can each have their own SchemaInference implementations.
Incremental loading .... I forget how this worked.
Bulk mutations ... this is possible at the TP3 level with hidden properties and smart add/remove/etc.

Activity

Stephen Mallette added a comment - 02/Mar/15 16:41

Note that completion of this issue will essentially factor out the BatchGraph implementation.

Stephen Mallette added a comment - 20/Mar/15 12:49

Added this to GA otherwise BatchGraph gets out into the wild.

Marko A. Rodriguez added a comment - 20/Mar/15 13:54

Smart. Yea. dkuppitz and I will build this.

Stephen Mallette added a comment - 01/Apr/15 13:20

Just realized that BatchGraph runs deep in IO: we use it to load graphs within GraphReader.readGraph. Might be complicated to unravel that.

Stephen Mallette added a comment - 22/Apr/15 17:17

If possible, I'd like to try to make sure that the IO interfaces will work with this OLAP loader as described in TINKERPOP3-550 (in time for GA). That issue extends into the notion of a "dumper" as well as a "loader", which isn't really discussed in this issue but, I guess, should exist.

Will we see this functionality in place for M9? That would be ideal, as it would mean the IO interfaces would be settled completely by then as well. Thoughts?

Daniel Kuppitz added a comment - 01/Sep/15 14:57

It's all implemented. Things like SchemaInference, that are specific to Titan (or any other vendor implementation), are not part of the TinkerPop BulkLoaderVertexProgram implementation.

Stephen Mallette added a comment - 02/Sep/15 14:11

This issue isn't documented yet and is therefore not fully complete.

Matthias Broecheler added a comment - 19/Sep/15 03:07 - edited

The following optimizations should be implemented to improve the performance of BLVP:

In line 212, BLVP should get the information whether the vertex was created or retrieved. If it was created (i.e. it did not exist before) then we are guaranteed that it cannot have any vertex properties. As such, the BLVP should then just create the vertex properties without checking for their existence first - this will be significantly faster.
Similarly, when loading edges in the second iteration, it should first compute the boolean variable requiresIncremental = sourceVertex.edges(OUT).hasNext() && outV.edges(OUT).hasNext() and then only do incremental loading on edges if this variable is true. If it is not true, incremental loading (i.e. checking for edge existence) isn't necessary.

Both improvements together should dramatically improve the performance of BLVP, since a read per edge/vertex property will be required only in those cases where a previous job failed. Under "normal" operational conditions it only requires one read per vertex per iteration. That is, the reads scale in O(|V|) and not O(|E|).
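
The effect of the first optimization on read counts can be shown with a toy model. This is a minimal sketch, not the BLVP code: `Target`, `load`, and the `skipChecksForNew` flag are hypothetical names invented here, and every lookup against the toy store is counted as one read. It covers only the vertex property case, not edges.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class IncrementalLoadSketch {

    // Toy target store (hypothetical, not TinkerPop): counts every lookup as one read.
    static final class Target {
        final Map<String, Map<String, String>> vertices = new HashMap<>();
        int reads = 0;

        // Returns true if the vertex did not exist and was created.
        boolean getOrCreate(String id) {
            reads++;
            if (vertices.containsKey(id)) {
                return false;
            }
            vertices.put(id, new HashMap<>());
            return true;
        }

        boolean hasProperty(String id, String key) {
            reads++;
            return vertices.get(id).containsKey(key);
        }

        void setProperty(String id, String key, String value) {
            vertices.get(id).put(key, value);
        }
    }

    static void load(Target target, Map<String, Map<String, String>> source, boolean skipChecksForNew) {
        for (Map.Entry<String, Map<String, String>> vertex : source.entrySet()) {
            String id = vertex.getKey();
            boolean created = target.getOrCreate(id);
            for (Map.Entry<String, String> prop : vertex.getValue().entrySet()) {
                // A freshly created vertex cannot have properties yet, so the check can be skipped.
                boolean mustCheck = !(skipChecksForNew && created);
                if (mustCheck) {
                    target.hasProperty(id, prop.getKey()); // existence check (incremental path)
                }
                target.setProperty(id, prop.getKey(), prop.getValue());
            }
        }
    }

    // Two vertices with three properties each.
    static Map<String, Map<String, String>> sampleSource() {
        Map<String, Map<String, String>> source = new LinkedHashMap<>();
        for (String id : new String[] {"v1", "v2"}) {
            Map<String, String> props = new LinkedHashMap<>();
            props.put("name", id);
            props.put("lang", "java");
            props.put("age", "1");
            source.put(id, props);
        }
        return source;
    }

    public static void main(String[] args) {
        Target naive = new Target();
        load(naive, sampleSource(), false);
        Target optimized = new Target();
        load(optimized, sampleSource(), true);
        System.out.println("naive reads: " + naive.reads);         // 2 vertices + 6 properties = 8
        System.out.println("optimized reads: " + optimized.reads); // 2 vertices only = 2
    }
}
```

On an empty target the optimized path performs one read per vertex; re-running the load against an already populated target falls back to the per-property checks, which matches the "previous job failed" case described above.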

In addition, there should be an option for IncrementalBulkLoader so that it does not attempt to update edges and vertex properties when those already exist. In most cases, the edge will be identical when it has been loaded in a previous job (since edge and property mutations are atomic in most graph databases) and hence this check is unnecessary and being able to make it optional can save time.

Note that these are important optimizations for large-scale graph databases, where bulk loading is necessary to get started.

Daniel Kuppitz added a comment - 19/Sep/15 03:29

Great suggestions; these can be implemented for 3.1. Also, for the record, I have yet another option in mind: dropAbsentProperties. Setting it to true means: drop properties in the target graph if they don't exist in the source graph.
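
The intended semantics of such an option could look roughly like the sketch below. It is an assumption-laden illustration, not the implementation: the `sync` helper and the use of plain maps for a vertex's properties are hypothetical, and only the option name `dropAbsentProperties` comes from the comment above.

```java
import java.util.Map;
import java.util.TreeMap;

final class DropAbsentSketch {

    // Hypothetical helper: returns the target's properties after syncing from the source.
    static Map<String, String> sync(Map<String, String> source,
                                    Map<String, String> target,
                                    boolean dropAbsentProperties) {
        Map<String, String> result = new TreeMap<>(target);
        result.putAll(source); // add new properties, overwrite existing ones
        if (dropAbsentProperties) {
            // Remove everything the source does not know about.
            result.keySet().retainAll(source.keySet());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> source = new TreeMap<>();
        source.put("name", "marko");
        Map<String, String> target = new TreeMap<>();
        target.put("name", "old");
        target.put("age", "29");

        System.out.println(sync(source, target, false)); // {age=29, name=marko}
        System.out.println(sync(source, target, true));  // {name=marko}
    }
}
```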

Stephen Mallette added a comment - 21/Sep/15 12:48

dkuppitz, can you talk about what you think BLVP will be for 3.0.2? I just think it would be good to understand what is expected there so that we know when this ticket is complete. We can then open new issues for 3.1.x features as needed.

Daniel Kuppitz added a comment - 21/Sep/15 15:04

BLVP in 3.0.2 will not contain any code changes. I will only add a paragraph in the docs and done.
3.1.0 will contain the features suggested by Matthias + extended docs.

Daniel Kuppitz added a comment - 21/Sep/15 17:02 - edited

Need to correct my previous comment. For 3.0.2 we also need to get Neo4j ("normal" mode, not HA) and TinkerGraph working. I guess TinkerGraph as a target was still an open discussion (whether we should support persistence or not), but we will at least need Neo4j (source and target) and TinkerGraph (source).

Stephen Mallette added a comment - 21/Sep/15 17:31

We slated that TinkerGraph feature for 3.0.2 (TINKERPOP3-828). I started working on it the other day and hit a problem or two with how to configure all the features of the different IO implementations. Still thinking about how to get it completed.

Stephen Mallette added a comment - 07/Oct/15 15:21

TINKERPOP3-828 is complete now - should have updated this ticket.

Daniel Kuppitz added a comment - 12/Oct/15 23:21

Done. I've tested the following scenarios:

HadoopGraph to Neo4j
HadoopGraph to TinkerGraph
TinkerGraph to Neo4j
Neo4j to TinkerGraph

Furthermore I reworked the tests and BLVP will now be executed (previously I was only able to invoke some methods using reflection).
Last but not least, the docs contain 2 (3) nice examples.

Changes are pushed to https://github.com/apache/incubator-tinkerpop/tree/TINKERPOP3-319.

Daniel Kuppitz added a comment - 15/Oct/15 19:45

All changes for 3.0.2 merged into tp30.

Daniel Kuppitz added a comment - 21/Oct/15 12:48

To be continued in https://issues.apache.org/jira/browse/TINKERPOP3-904.

People

Assignee: Daniel Kuppitz

Reporter: Marko A. Rodriguez

Votes: 1

Watchers: 5

Dates

Created: 04/Nov/14 17:26

Updated: 21/Oct/15 12:48

Resolved: 21/Oct/15 12:48