Orientdb: Stored row in cluster shows different record count after ODB hangs

Created on 17 Aug 2016  Â·  24Comments  Â·  Source: orientechnologies/orientdb

OrientDB Version, operating system, or hardware.

  • v2.2.7

    Operating System

  • [x] Linux

    Expected behavior and actual behavior

Orient DB running normally but after the some hours , hangs with cpu usage 100% .
the following trace ,

com.orientechnologies.orient.core.exception.OPaginatedClusterException: Error during record creation DB name="BSNTS_DB" Component Name="fiscaltransaction" at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.createRecord(OPaginatedCluster.java:494) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doCreateRecord(OAbstractPaginatedStorage.java:3091) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.createRecord(OAbstractPaginatedStorage.java:960) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeSaveRecord(ODatabaseDocumentTx.java:2075) at com.orientechnologies.orient.core.tx.OTransactionNoTx.saveNew(OTransactionNoTx.java:246) at com.orientechnologies.orient.core.tx.OTransactionNoTx.saveRecord(OTransactionNoTx.java:179) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:2598) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:103) at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:1800) at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:1791) at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:63) at com.orientechnologies.orient.core.sql.OCommandExecutorSQLInsert.saveRecord(OCommandExecutorSQLInsert.java:330) at com.orientechnologies.orient.core.sql.OCommandExecutorSQLInsert.execute(OCommandExecutorSQLInsert.java:231) at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:72) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:2680) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:2626) at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:69) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.command(ONetworkProtocolBinary.java:1515) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:2680) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:2626) at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:69) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.command(ONetworkProtocolBinary.java:1515) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:577) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:318) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:196) at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77) Caused by: java.lang.NullPointerException at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.findFreePage(OPaginatedCluster.java:1970) at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.addEntry(OPaginatedCluster.java:1906) at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.createRecord(OPaginatedCluster.java:475) ... 21 more $ANSI{green {db=BSNTS_DB}} Error on creating record in cluster: plocal cluster: fiscaltransaction com.orientechnologies.orient.core.exception.OPaginatedClusterException: Error during record creation DB name="BSNTS_DB" Component Name="fiscaltransaction" at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.createRecord(OPaginatedCluster.java:494) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.createRecord(OAbstractPaginatedStorage.java:960) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeSaveRecord(ODatabaseDocumentTx.java:2075) at com.orientechnologies.orient.core.tx.OTransactionNoTx.saveNew(OTransactionNoTx.java:246) at com.orientechnologies.orient.core.tx.OTransactionNoTx.saveRecord(OTransactionNoTx.java:179) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:2598) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:103) at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:1800) at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:1791) at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:63) at com.orientechnologies.orient.core.sql.OCommandExecutorSQLInsert.saveRecord(OCommandExecutorSQLInsert.java:330) at com.orientechnologies.orient.core.sql.OCommandExecutorSQLInsert.execute(OCommandExecutorSQLInsert.java:231) at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:72) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:2680) at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:2626) at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:69) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.command(ONetworkProtocolBinary.java:1515) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:577) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:318) at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:196) at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77) Caused by: java.lang.NullPointerException at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.findFreePage(OPaginatedCluster.java:1970) at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.addEntry(OPaginatedCluster.java:1906) at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.createRecord(OPaginatedCluster.java:475)

and i'm wondered when run the select count command result that show as below :

image

So when i insert new Record in FiscalTransaction class , randomly raised error and not saving the record .

Please help me .

Steps to reproduce the problem .

I don't know , this behavior is unexpected .

It seems we have a null pointer for a record in storage after the database hanging . i guess there is a problem with cluster 29 because when i insert some record after 3 successful insert the following error appears :

Error during saving of record with rid #29:-1

Thanks
Saeed Tabrizi

bug

Most helpful comment

@saeedtabrizi seems I have found problem of your issue and working on it.

All 24 comments

Hi @saeedtabrizi , to restore cluster data structure you need to export DB to JSON and import back. Now I am trying to understand reason of why this issue happened. Did you see any other exceptions in log or only this one ?

@saeedtabrizi is it possible to send us database of files of "fiscaltransaction" cluster ?

Hi @laa
yes , i recover my data with insert into from select ..... command now , thanks for help .
i have no any more log data but if you want to reproduce this event i recommend you run a stress test with the following steps :

1- set test environment to : Linux , 4 Core Cpu , 8GB Ram .
2- create two class with 1000 record per class .
3- call update on first class with 2kb data with 100 itreation or more per second .
4 - during the call update a client send insert command to second class .
5 - after some hours the database must be hang and cpu goes to 100% usage .
6- terminate the orientdb process with kill -9 signal
7- run the ODB and check the record count .

Thanks
Saeed Tabrizi

@laa Yes , in which email address ?

@saeedtabrizi

after some hours the database must be hang and cpu goes to 100% usage

I am not sure that I will reproduce it. But I will need more information from your side.
We consider this issue as super critical and going to fix it ASAP we got all information.

So my next questions:

  1. Files which you are going to send me are unmodified files , I mean files just after the NPE ?
  2. Is it possible to reproduce this case on your environment ? (I am afraid I just will spend hours for nothing)
  3. If you agree to reproduce it, could you create the couple of thread dumps when CPU usage goes 100% and if possible open profiler of your choice and send me hot methods ?
  4. It is not clear for me when you get this NPE ? Or in other words how did you get the stack trace which you published ?

@laa
Thanks for your attention :
1- No i have not unmodified files for fiscal transaction class . but i have an older database back up that unmodified and have problem in other class (Invoice) in the same bug scenario .
2- yes i'm tracing to realize when this event happens on db . currently everything is normal (just after repair operation).
3- yes , sure .
4- when this problem happens my clients reports the hanging on server response to me . after some minutes ,i see the cpu on the operatioanl server working anormally ( 100 % usage) , i kill ODB process and restart the db . when the clients starting to have a fiscal transaction they reporting the fiscal transaction on application hangs . when i see the orientdb.err file i see the exception stack trace that i published here .

@saeedtabrizi Did you see any progress of restore operation when you restart the server ?

@laa no certainly .

@laa after the crash , database started normally and only in some fiscaltransctions randomly an error happened in my application . when i traced it, i found a problem in record insertion command in the special cluster .

@saeedtabrizi hm that is the problem (maybe) if database crashed unexpectedly (for example hard kill) restore process should start, but if you had soft kill how did you rid off 100% CPU load? I think that once we find what makes CPU 100% busy we find a clue to this problem. Once this happens could you send me information that I asked. I had some ideas, but I need confirmation. I am sorry that we have to wait to fix this issue.

@saeedtabrizi I have more question. Do you use special settings for your server or default once ?

@laa My setting is default of ODB release 2.2.7 community edition .
I have no any change or optimization .
be relax , i'm here to answer you're questions. I just think ODB is an edge link for us.

@saeedtabrizi OK, then could you do followings steps.

  1. If CPU usage is 100%, please get information which we discussed.
  2. Kill the process as usual and because we did work with you in online mode, please keep the exact/untouched copy of the database (that is critical so I will be able to look at it and experiment with it).

Sure I do not want to harm your users one more time, but we need to reproduce it at least once and get all information to fix the issue.
Once I post it, on next day at most I will look at it to provide a fix ASAP. Sorry for inconveniences.

Many Thanks @laa .
OK i report it to you when this event happens again and take a snap shot from our vm to save the state of cpu hangs for next usage .
Now i have an untouched copy of database with the same problem in last week . i can send it to you , if i have an private email address .

@saeedtabrizi , do you mean that DB is killed and you did not change it ? if ok please send me to a.[email protected] I will look on it.

@saeedtabrizi seems I have found problem of your issue and working on it.

@laa , i was sent it to you some days ago .
Thanks

@saeedtabrizi thank you. Your issue should be already fixed in upcoming release. Still will be a minor issue about free space usage statistic. I will try to use a database which you sent to fix it too. But your major issue about DB freeze should gone.

@saeedtabrizi could you update till 2.2.10 version, it should solve the problem which you found, I keep the issue open in order to do not forget to fix minor issue inside of this big one it is related to free space usage statistic, and it should not affect your operations on a database.

@laa Thanks for your attention .
Okay i will test it asap .

Hi @laa
It seems to be Ok . however i try simulate a stress test to simulate related problem again .
Thanks

@saeedtabrizi that is cool, thank you very much for your time.

@saeedtabrizi I close this issue because I suppose that it can be confirmed as fixed. If you reproduce the problem on a stress test, feel free to reopen issue.

@laa I handle it in similar test case and run a heavy stress test . the good news , there is no any crash or abnormally usage in Cpu occurred .
Thanks
Saeed Tabrizi .

Was this page helpful?
0 / 5 - 0 ratings