Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Done
- Version: Jena 4.4.0
Description
Using a GeoSPARQL query with a geospatial property function, e.g.
SELECT * { :x geo:hasGeometry ?geo1 . ?s2 geo:hasGeometry ?geo2 . ?geo1 geo:sfContains ?geo2 }
leads to heavy memory consumption for larger datasets - and we're not talking about big data at all. Imagine being given a polygon and checking millions of geometries for containment in that polygon.
In the QueryRewriteIndex class, a key is generated for caching, but this is horribly expensive: the string representation of the geometries is computed millions of times, creating millions of byte arrays and potentially leading to an OOM exception - we hit one with 8 GB assigned.
The key generation for reference:
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR + predicate.getURI() + KEY_SEPARATOR + objectGeometryLiteral.getLiteralLexicalForm();
My suggestion is to use a separate Node -> Integer (or Long?) Guava cache and use those integer values to generate the cache key - or any other more efficient data structure; it is not even clear that a String key is necessary.
We tried a fix which works for us and keeps the memory consumption stable:

private LoadingCache<Node, Integer> nodeIDCache;
private AtomicInteger cacheCounter;
...
cacheCounter = new AtomicInteger(0);
CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
if (maxSize > 0) {
    builder = builder.maximumSize(maxSize);
}
if (expiryInterval > 0) {
    builder = builder.expireAfterWrite(expiryInterval, TimeUnit.MILLISECONDS);
}
// Each Node is assigned a unique small integer ID on first lookup;
// cache keys are then built from these IDs instead of the geometries'
// full lexical forms.
nodeIDCache = builder.build(new CacheLoader<>() {
    @Override
    public Integer load(Node key) {
        return cacheCounter.incrementAndGet();
    }
});
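To illustrate the idea without the Guava and Jena dependencies, here is a minimal, self-contained sketch of the same technique using only the JDK: a ConcurrentHashMap plays the role of the node-ID cache (a String stands in for org.apache.jena.graph.Node, and the class and method names are hypothetical, not part of Jena). The point is that the cache key is built from three small integers rather than from the geometries' potentially huge WKT lexical forms:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class NodeIdKey {
    // Stand-in for the Guava LoadingCache: each distinct node string is
    // assigned exactly one small integer ID on first lookup.
    private static final ConcurrentMap<String, Integer> NODE_IDS = new ConcurrentHashMap<>();
    private static final AtomicInteger COUNTER = new AtomicInteger(0);
    private static final String KEY_SEPARATOR = "@";

    static int idOf(String node) {
        // computeIfAbsent is atomic, so concurrent lookups of the same
        // node always observe the same ID.
        return NODE_IDS.computeIfAbsent(node, n -> COUNTER.incrementAndGet());
    }

    static String key(String subjectGeomLexicalForm, String predicateUri, String objectGeomLexicalForm) {
        // The key length is now bounded by three small integers, not by
        // the size of the geometry literals being concatenated.
        return idOf(subjectGeomLexicalForm) + KEY_SEPARATOR
                + idOf(predicateUri) + KEY_SEPARATOR
                + idOf(objectGeomLexicalForm);
    }

    public static void main(String[] args) {
        String k1 = key("POLYGON((0 0,0 1,1 1,0 0))", "geo:sfContains", "POINT(0.5 0.5)");
        String k2 = key("POLYGON((0 0,0 1,1 1,0 0))", "geo:sfContains", "POINT(0.5 0.5)");
        // The same triple of nodes always yields the same compact key.
        System.out.println(k1.equals(k2));
        System.out.println(k1);
    }
}
```

Note that this trades the unbounded string keys for a growing Node-to-ID map; in the fix above that map is itself bounded by the cache's maximumSize and expiry settings.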