Orientdb: Server deadlock with multiple concurrent binary clients

Created on 11 Mar 2016  Â·  38Comments  Â·  Source: orientechnologies/orientdb

Using version 2.1.12, I have multiple concurrent threads, each using its own graph instance, reading, creating and updating records in the DB. After about 20.000 cycles, all clients hang. I debugged the server and noticed that_ all _binary clients threads (at the server) where stuck at:

"OrientDB <- BinaryClient (/10.1.100.22:61097)"
    at com.orientechnologies.common.concur.lock.OReadersWriterSpinLock.acquireWriteLock(OReadersWriterSpinLock.java:132)
    at com.orientechnologies.common.concur.lock.OModificationLock.prohibitModifications(OModificationLock.java:96)
    at com.orientechnologies.common.concur.lock.OModificationLock.prohibitModifications(OModificationLock.java:84)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.synch(OAbstractPaginatedStorage.java:1276)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.checkLowDiskSpaceAndFullCheckpointRequests(OAbstractPaginatedStorage.java:2894)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1150)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:606)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:156)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2679)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2648)
    at com.tinkerpop.blueprints.impls.orient.OrientTransactionalGraph.commit(OrientTransactionalGraph.java:161)
    at com.orientechnologies.orient.graph.sql.OGraphCommandExecutorSQLFactory.runInConfiguredTxMode(OGraphCommandExecutorSQLFactory.java:262)
    at com.orientechnologies.orient.graph.sql.OCommandExecutorSQLCreateEdge.execute(OCommandExecutorSQLCreateEdge.java:153)
    at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:90)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:1530)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:1511)
    at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:67)
    at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.command(ONetworkProtocolBinary.java:1323)
    at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:400)
    at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.execute(OBinaryNetworkProtocolAbstract.java:223)
    at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)

I tried to reduce the transaction sizes to an absolute minimum without luck. In this situation I even cannot shutdown the server cleanly but have to do a kill -9 and restart to re-enable binary clients to connect to the server.

This might relate to #4613

bug

Most helpful comment

Gentlemen, 2.1.x works with more than 14.000.000 edges and about 200.000 iterations over the above-mentioned code very well. Beforehand, the error occurred after about 20.000 iterations at the latest.

I think it is fixed. Thank you for your efforts!!

All 38 comments

Hi that is single stack trace, it does not show reason of deadlock.
Could you provide full stack trace ?

Hi, any updates ?

In my test case, I'm creating (and selecting) edges the following way:

public Edge addOrUpdateEdgeByParams(Vertex origin, String label, Vertex destination, Map<String, Object> params) {

        // select from HAS_TOKEN WHERE out = #16:0 and in = #15:0
        Iterable<Edge> result = new Query<Edge>(graph).SQLQuery("SELECT FROM " + label + " WHERE out = ? and in = ? AND " + getWhereClause(params), origin.getId(), destination.getId());
        Iterator iter = result.iterator();
        if (iter.hasNext()) {
            return (Edge) iter.next();
        } else {
            for (int i = 0; i < MAXTRIES; i++) {
                try {
                    Iterable<Edge> newvertexresult = (Iterable<Edge>) new Query<Edge>(graph).SQLCommand("CREATE EDGE " + label + " FROM ? TO ?", origin.getId(), destination.getId());
                    OrientEdge e = (OrientEdge) newvertexresult.iterator().next();
                    graph.commit();

                    setProperties(e, params);
                    return e;
                } catch (OConcurrentModificationException ex) {
                    reload(origin);
                    reload(destination);
                }
            }

        }

        throw new RuntimeException("Could not resolve OConcurrentModificationException");

    }
    public Iterable<T> SQLQuery(String sql_cmd, Object... params) {
        OSQLSynchQuery<T> qr = new OSQLSynchQuery<>(sql_cmd);
        Iterable<T> i = graph.command(qr).execute(params);
        return i;
    }
 public String getWhereClause(Map<String, Object> params) {
        List<String> where = new ArrayList<>();
        for (Map.Entry<String, Object> p : params.entrySet()) {
            String value = p.getValue().toString().replace("'", "\\'");
            where.add(p.getKey() + " = '" + value + "'");
        }

        return String.join(" AND ", where);
    }
public void setProperties(Element e, Map<String, Object> params) {
        OrientElement oe = (OrientElement) e;

        for (int i = 0; i < MAXTRIES; i++) {
            try {
                if (!oe.getProperties().equals(params)) {
                    oe.setProperties(params);
                }
                graph.commit();
                break;
            } catch (OConcurrentModificationException ex) {
                reload(oe);
            }
        }
    }

The edges in re have a unique index on in, out and the where params.

This function is executed multiple times by 8 concurrent threads.

After some time (about 10000 runs), the server hangs with the following thread dump: https://gist.github.com/gerdhub/7459ddda0565d2104892 (Server + Client)

As the server is running on a remote server, I cannot take a complete memory dump at the moment.

I meanwhile also updated to Version 2.1.13...

Does this help?

Hi, yes.
That is the same as https://github.com/orientechnologies/orientdb/issues/5875 . But I will not close it to double check. Also as with issue mentioned above we need the same data.

If the server keeps running in this state, I get the following errors after several hours - this happens after a "ordered" shutdown, however the server won't shut down and has to be killed -9 manually:

Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded

The db is about 260 GB with max. 6 GB heap memory.

Not on this state.
We need server logs before this state and memory dump if possible.

Here is another thread dump of the server: https://gist.github.com/gerdhub/0ca623a6a46bbdba06fd

How may I share the heap dump (jmap, binary)?

Good, could you send us server logs, that is very important. About heap dump, if I get you correctly jmap will be good to go.

This is the server log: https://gist.github.com/gerdhub/cf77bf38b5fc214bfc11

The last line is a query that is executed during the start of the client application, there are no entries when the deadlock happens...

The heap dump is about 280 MB - I sent it via we transfer to your email.

well if you do not mind google drive would be good one, if you do not have space may share it with you.

Sorry about server log, is that content of files from server's log directory ?

Yes - but it is the same as the console output... By the way, I just added the corresponding client thread dump to the server thread dump gist.

@gerdhub, I inspected the server heap dump you have sent and was unable to find any deadlocked threads in it. Seem like it is a wrong heap dump or it was captured at a wrong time due to miscommunication. Andrey said "We need server logs before this state and memory dump if possible." meaning the heap dump should be captured after the deadlock happened.

Quick question: do you have a composite index on edges?

@lvca if you look at thread dump , you will find that it has nothing to do with index key locks. One of the reason is that we do not acquire index key locks inside of transaction. It is not needed.

Yes, there is a composite index, one on in and out vertices, not unique and one on in, out + properties, which is unique. So there are several edges between two nodes.

@taburet: Maybe we should define the term "deadlock" - there is no deadlock in the "classical" sense, but all binary clients (BinaryClient Threads) of the Server are in java.lang.Thread.State: WAITING at com.orientechnologies.common.concur.lock.OReadersWriterSpinLock.acquireReadLock(OReadersWriterSpinLock.java:77); with Locked ownable synchronizers:
- locked <6ebd4989> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- locked <771816d3> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

However you call it, the clients wait for a response of the server (which is not delivered), and the server does not allow any new (binary) client connections. Studio works as long as no queries are made.

The Server cannot be shut down cleanly in this state, as it will not react to the sigint.

That is all I can say...

Btw: I just noticed that the same happens also with only one binary client in one single thread.

@gerdhub could you repeat the same for embedded database ? I think I am starting to guess what is the problem.

I will try and report later on.

@gerdhub Anyway, I can't find instances of the Thread class named like "BinaryClient" in the provided heap dump, so I can't inspect the state of the "locks" associated with them.

@taburet There are 2 thread dumps in gist https://gist.github.com/gerdhub/0ca623a6a46bbdba06fd, one for the client (first) and one for the server (second). In the second one, there are several

OrientDB - BinaryClient (/10.1.90.16:42874)" - Thread t@29
    java.lang.Thread.State: WAITING

@gerdhub Yep, that is the _thread_ dump, but I'm talking about the _heap_ dump :) All I can tell from the thread dumps is that for some reason one of the spin locks has some potentially invalid state so the busy waiting loop spins forever. It will be helpful to see the actual state of the spin lock from a heap dump, that may help to diagnose the problem.

@taburet Sorry, I got you wrong. I will try to get a new heap dump.

@laa Provided that one updates the jna reference to the newest version :) - embedded version seems to work like a charm until now, even with 8 concurrent threads... Do you have a clue? ;)

Sorry, but it is the same with embedded version, although it happens later. Please see the following thread dump: https://gist.github.com/gerdhub/fb5ecf21623bc28a0669845fb33bd398
I will send the heap dump via we transfer.

@gerdhub we found reason of issue and started to fix it. Thank you very much for bug report , your help and patience.

You're welcome!

@gerdhub @TeamXceleratorDev @ygoraly fixed. could you try ?

On latest 2.1.x branch

Sure thing. Thanks :)

I cannot build 2.1.x branch to test on the db server:

[INFO] ------------------------------------------------------------------------
[INFO] Building OrientDB Server 2.1.15-SNAPSHOT
[INFO] ------------------------------------------------------------------------

...

2016-04-01 16:11:43:288 SEVER Internal server error:
java.lang.IllegalStateException: Pool factory is closed [ONetworkProtocolHttpDb]
TOTAL 30 - MAX 10 - GOOD 30 - BAD 0
2016-04-01 16:11:43:416 INFO  {db=httpclass} Creating database memory:httpclass [OServerCommandPostDatabase]
2016-04-01 16:11:43:518 WARNI ->127.0.0.1: Command not found: GET.commandNotfound [ONetworkProtocolHttpDb]
2016-04-01 16:11:43:523 WARNI ->127.0.0.1: Command not found: POST.listDatabases [ONetworkProtocolHttpDb]Tests run: 47, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 35.011 sec <<< FAILURE! - in TestSuite
batchUpdate(com.orientechnologies.orient.server.network.http.HttpBatchTest)  Time elapsed: 0.183 sec  <<< FAILURE!
java.lang.AssertionError: expected [200] but found [500]
    at com.orientechnologies.orient.server.network.http.HttpBatchTest.batchWithEmpty(HttpBatchTest.java:83)
    at com.orientechnologies.orient.server.network.http.HttpBatchTest.batchUpdate(HttpBatchTest.java:65)

testConnectAutoDisconnectKeepAlive(com.orientechnologies.orient.server.network.http.HttpConnectionTest)  Time elapsed: 14.71 sec  <<< FAILURE!
java.lang.AssertionError: expected [200] but found [500]
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnect(HttpConnectionTest.java:83)
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnectKeepAlive(HttpConnectionTest.java:57)

testConnectAutoDisconnectNoKeepAlive(com.orientechnologies.orient.server.network.http.HttpConnectionTest)  Time elapsed: 11.995 sec  <<< FAILURE!
java.lang.AssertionError: expected [200] but found [500]
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnect(HttpConnectionTest.java:83)
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnectNoKeepAlive(HttpConnectionTest.java:63)


Results :

Failed tests: 
  HttpBatchTest.batchUpdate:65->batchWithEmpty:83 expected [200] but found [500]
  HttpConnectionTest.testConnectAutoDisconnectKeepAlive:57->testConnectAutoDisconnect:83 expected [200] but found [500]
  HttpConnectionTest.testConnectAutoDisconnectNoKeepAlive:63->testConnectAutoDisconnect:83 expected [200] but found [500]

Tests run: 47, Failures: 3, Errors: 0, Skipped: 0

hm that is latest 2.1.x build which is caused by this fix http://helios.orientdb.com/job/orientdb-2.1.x/164 .How did you build it ?

via

git clone https://github.com/orientechnologies/orientdb
cd orientdb/
git checkout 2.1.x
mvn clean install

Well )) could you skip tests for this build we need only check specific
fix.
Any way issue in http client it will not affect us.

On Fri, Apr 1, 2016 at 6:06 PM Gerd Hübscher [email protected]
wrote:

via

git clone https://github.com/orientechnologies/orientdb
cd orientdb/
git checkout 2.1.x
mvn clean install

—
You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
https://github.com/orientechnologies/orientdb/issues/5821#issuecomment-204432061

Best regards,
Andrey Lomakin, R&D lead.
OrientDB Ltd

twitter:@Andrey_Lomakin linkedin:https://ua.linkedin.com/in/andreylomakin

Sure ;) I just wanted to make sure to have a valid build. I will give it a try and report later on.

Gentlemen, 2.1.x works with more than 14.000.000 edges and about 200.000 iterations over the above-mentioned code very well. Beforehand, the error occurred after about 20.000 iterations at the latest.

I think it is fixed. Thank you for your efforts!!

I am really appreciate help of everyone who participated in fix of this issue. We will release new version today or tomorrow.

Was this page helpful?
0 / 5 - 0 ratings