TinkerPop / TINKERPOP-814

ConnectionPool can fill with dead Connections

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version: 3.0.1-incubating
    • Component: driver
    • Labels: None
    • Environment: "Cluster" of a single gremlin-server/Titan node. Single client. minConnectionPoolSize = 2; maxConnectionPoolSize = 10.

    Description

      The client gets a GryoMessageSerializerV1d0 decoding error on a couple of requests (a separate, unfiled issue). The ConnectionPool then gets into a state where two dead connections are maintained in the pool but no new connections are issued.
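
      The starvation can be pictured with a minimal sketch (hypothetical classes, not the driver's actual code): the pool still "contains" connections, so it never grows past the minimum, yet every connection it holds is dead, so every borrow comes back empty and eventually times out.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the reported starvation. Conn, borrowOrNull, and
// MIN_POOL_SIZE are illustrative stand-ins, not the driver's real types.
public class DeadPoolSketch {
    static class Conn {
        final boolean dead;
        Conn(boolean dead) { this.dead = dead; }
    }

    static final int MIN_POOL_SIZE = 2; // mirrors minConnectionPoolSize = 2 from the report

    // Hypothetical borrow logic: skip dead connections, and only open a new
    // one when the pool has shrunk below the minimum -- which never happens
    // here, because dead connections are kept in the list.
    static Conn borrowOrNull(List<Conn> pool) {
        for (Conn c : pool)
            if (!c.dead) return c;
        if (pool.size() < MIN_POOL_SIZE) {
            Conn fresh = new Conn(false);
            pool.add(fresh);
            return fresh;
        }
        return null; // caller would block and eventually time out
    }

    public static void main(String[] args) {
        List<Conn> pool = new ArrayList<>();
        pool.add(new Conn(true));
        pool.add(new Conn(true)); // two dead connections, as in the report
        System.out.println(borrowOrNull(pool) == null
                ? "borrow times out: pool full of dead connections"
                : "borrowed a live connection");
    }
}
```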

      A sample of the log output...

      2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection from pool on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
      2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Return least used Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=1, pending=0} on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, requestId=0349c1ce-19dc-4d54-9ca9-30965c9314ae, op='eval', processor='', args={<redacted>}} to - Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}}
      2015-08-27 09:57:18,125 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Attempting to return Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}} on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - On Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} pool size of 3 > minPoolSize 2 and borrowed of 0 <= minSimultaneousUsagePerConnection 8 so destroy Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0}
      2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0} destroyed
      2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection from pool on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
      2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Pool was initialized but a connection could not be selected earlier - waiting for connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Wait 3000 MILLISECONDS for an available connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main]
      2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Continue to wait for connection on -1 if {} > 0
      2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Timed-out waiting for connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - possibly unavailable
      2015-08-27 09:57:21,127 [gremlin-driver-worker-3] DEBUG org.apache.tinkerpop.gremlin.driver.Host  - Trying to reconnect to dead host at Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:21,128 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, requestId=67761973-7ae7-4a6a-8e62-443fdb55142d, op='eval', processor='', args=<redacted>} to - connection not initialized
      2015-08-27 09:57:21,128 [gremlin-driver-worker-3] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Trying to re-establish connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
      2015-08-27 09:57:21,128 [Grizzly-worker(15)] ERROR EdnaMetaDataServiceGremlinImpl  - Failed man lookupMerchantAccountName
      java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.TimeoutException
      

      This behavior continues with each subsequent request. The gremlin-server is available at all times.

      As a workaround I have applied the following patch, but I am pretty sure it is not the correct solution:

      diff --git a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
      index 30cd71f..cb8ebfa 100644
      --- a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
      +++ b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
      @@ -352,7 +352,7 @@ final class ConnectionPool {
                   }
      
                   remaining = to - TimeUtil.timeSince(start, unit);
      -            logger.debug("Continue to wait for connection on {} if {} > 0", remaining);
      +            logger.debug("Continue to wait for connection on {} if {} > 0", host, remaining);
               } while (remaining > 0);
      
               logger.debug("Timed-out waiting for connection on {} - possibly unavailable", host);
      @@ -371,13 +371,16 @@ final class ConnectionPool {
      
               // let the load-balancer know that the host is acting poorly
               this.cluster.loadBalancingStrategy().onUnavailable(host);
      -
           }
      
           private boolean tryReconnect(final Host h) {
               logger.debug("Trying to re-establish connection on {}", host);
      
               try {
      +            // remove dead connections
      +            for (Connection c : connections)
      +                if (c.isDead())
      +                    definitelyDestroyConnection(c);
                   connections.add(new Connection(host.getHostUri(), this, cluster, settings().maxInProcessPerConnection));
                   this.open.set(connections.size());
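
      For readers who want the workaround outside diff context, here is a standalone sketch of the same sweep (the class and method names are illustrative; the real driver keeps its connections per Host in a concurrent collection): dead connections are removed before a replacement connection is added.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

// Standalone sketch of the workaround's dead-connection sweep.
// Connection here is a hypothetical stand-in for the driver's class.
public class ReconnectSweepSketch {
    static class Connection {
        final boolean dead;
        Connection(boolean dead) { this.dead = dead; }
        boolean isDead() { return dead; }
    }

    // Remove dead connections, then add one fresh connection; returns the new pool size.
    static int sweepAndReconnect(ConcurrentLinkedQueue<Connection> connections) {
        for (Iterator<Connection> it = connections.iterator(); it.hasNext(); )
            if (it.next().isDead())
                it.remove(); // ConcurrentLinkedQueue iterators support remove()
        connections.add(new Connection(false)); // stand-in for new Connection(hostUri, ...)
        return connections.size();
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<Connection> pool = new ConcurrentLinkedQueue<>();
        pool.add(new Connection(true));
        pool.add(new Connection(true));
        System.out.println("pool size after sweep: " + sweepAndReconnect(pool));
    }
}
```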
      

          Activity

            Thanks for reporting this. Is there an "easy" way to get the driver into this state? You mentioned the "gryo" serialization failure...anything I can easily/readily reproduce?

            spmallette Stephen Mallette added a comment

            Hmm. I was thinking that it was any error on the connection, e.g. a client gremlin script syntax error, but that is not the case.

            The client error I'm getting (when accessing a particular V in my DB) is

            Caused by: org.apache.tinkerpop.gremlin.driver.ser.SerializationException: java.lang.IndexOutOfBoundsException: Index: 82125230, Size: 0
            	at org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0.deserializeResponse(GryoMessageSerializerV1d0.java:183)
            	at org.apache.tinkerpop.gremlin.driver.handler.WebSocketGremlinResponseDecoder.decode(WebSocketGremlinResponseDecoder.java:49)
            	at org.apache.tinkerpop.gremlin.driver.handler.WebSocketGremlinResponseDecoder.decode(WebSocketGremlinResponseDecoder.java:36)
            	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
            

             I tried adding this to the driver...

            diff --git a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java
            index 1c63046..3f5b812 100644
            --- a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java
            +++ b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ser/GryoMessageSerializerV1d0.java
            @@ -170,6 +170,11 @@ public final class GryoMessageSerializerV1d0 implements MessageSerializer {
                             final Object result = kryo.readClassAndObject(input);
                             final Map<String,Object> metaAttributes = (Map<String,Object>) kryo.readClassAndObject(input);
            
            +                if (true) {
            +                    logger.warn("Throwing indexOOBE for TINKERPOP3-814");
            +                    throw new IndexOutOfBoundsException("TINKERPOP3-814");
            +                }
            +
                             return ResponseMessage.build(requestId)
                                     .code(ResponseStatusCode.getFromValue(status))
                                     .statusMessage(statusMsg.intern())
            

            and then running a test like

                @Test
                public void driverbug3() throws Exception {
                    GryoMapper kryo = GryoMapper.build().addRegistry(TitanIoRegistry.INSTANCE).create();
                    MessageSerializer serializer = new GryoMessageSerializerV1d0(kryo);
                    Client client = Cluster.build("localhost")
                            .port(18182)
                            .maxConnectionPoolSize(10)
                            .reconnectIntialDelay(0)
                            .workerPoolSize(10)
                            .serializer(serializer)
                            .create().connect();
            
                    Map<String, Object> params = new HashMap<>();
                    params.put("id", 1);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                    Thread.sleep(500);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                    Thread.sleep(0);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                    Thread.sleep(0);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                    Thread.sleep(0);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                    Thread.sleep(0);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {};
                    Thread.sleep(2000);
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                    try { client.submit("g.V(id).count()", params); } catch (RuntimeException e) {e.printStackTrace();};
                }
            

             BUT I am NOT seeing the same timeout that follows this wait:

            2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Wait 3000 MILLISECONDS for an available connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main]
            

            I'll try some more and let you know if I come up with something.

            kieransherlock Kieran Sherlock added a comment
            gereedy Geoff Reedy added a comment:

            Here are two issues that can cause protocol synchronization issues and lead to dead/broken connections.


            The solution in the description wasn't far off from the heart of the problem. First, the Host was being marked unavailable too early for issues that might have been recoverable. For example, a serialization problem might occur for one query but not for another (e.g. one query might return a Titan RelationIdentifier, which requires a registered serializer, while another might not). In cases like these, where there is a chance that future requests can still succeed on the Host, it's probably not a good idea to kill the channel and mark the Host dead. That part was resolved in TINKERPOP3-816.

            Then, when the Host was marked unavailable, nothing was being done to clear the connections in the pool, and the driver wasn't doing much to respect the pool size: new connections were simply created and added to the pool whenever a reconnect to the Host was attempted. The driver now destroys connections when the Host is marked as unavailable. On reconnect attempts it creates a new connection (through the standard methods that respect pool size, timeouts, etc.) and then issues a "ping" to see if the host is online. The "ping" is crude, just a gremlin script containing an empty string, which should simply evaluate back as an empty string, but it does the job. Unfortunately, it's not easy to rely on the websocket ping given the way the driver is set up now; moreover, this code needs to work over the NIO channel and other transports, so relying on websocket-specific capabilities wouldn't be good. Perhaps a "ping" could be built into the subprotocol itself.
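
            The reconnect behavior described above can be sketched as follows (hypothetical names, not the driver's actual API): pooled connections are destroyed when the Host goes unavailable, and a reconnect attempt opens one connection and "pings" it by evaluating an empty script, which should round-trip as an empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch of the fix: clear the pool on host-unavailable,
// and keep a reconnected connection only if the empty-script ping succeeds.
public class HostRecoverySketch {
    interface Conn { String eval(String script); }

    // Fix: dead connections no longer linger once the Host is unavailable.
    static void onHostUnavailable(List<Conn> pool) {
        pool.clear();
    }

    // Open one connection (the real driver does this through the standard
    // pool-size/timeout checks) and keep it only if the ping round-trips.
    static boolean tryReconnect(List<Conn> pool, Supplier<Conn> opener) {
        Conn c = opener.get();
        if (!"".equals(c.eval(""))) return false; // empty script should eval to ""
        pool.add(c);
        return true;
    }

    public static void main(String[] args) {
        List<Conn> pool = new ArrayList<>();
        onHostUnavailable(pool);
        boolean up = tryReconnect(pool, () -> script -> script); // echo-server stub
        System.out.println(up ? "host back online, pool size " + pool.size() : "still down");
    }
}
```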

            It would be nice to have a clean unit test here to prevent regressions, but I'm not completely sure how to write it - I don't think our integration testing infrastructure is advanced enough for the scenario involved.

            spmallette Stephen Mallette added a comment

            People

              Assignee: spmallette Stephen Mallette
              Reporter: kieransherlock Kieran Sherlock
              Votes: 0
              Watchers: 3