Parity-ethereum: Client occasionally stops importing new blocks

Created on 3 Jun 2020  ·  78 Comments  ·  Source: openethereum/parity-ethereum

  • OpenEthereum version: v3.0.0-stable-b079d1746
  • Operating system: Linux
  • Installation: built from source
  • Fully synchronized: yes
  • Network: ethereum
  • Restarted: yes

OpenEthereum occasionally (1-3x per month) stops importing new blocks and simply goes silent without producing any error. We then issue SIGTERM and get the "Finishing work, please wait..." message, but it does not finish even after many minutes, so we kill the process with SIGKILL. Upon restarting, everything works normally until the next such 'freeze'.

We first noticed this behaviour in Parity 2.7.2 several months ago and it is now present with OpenEthereum 3.0.0. We noticed it independently on two different hardware configurations, one fast synced, the other fatdb synced from scratch. There is a similar issue reported in #11737 and also in #11539, but the latter is linked to slow block import. That is not the case here: except for the 'freezes', everything is responsive, block imports are very fast, and overall the node works very well.

Any suggestions on how to debug this issue much appreciated. Thanks.

F2-bug 🐞 M4-core ⛓ P2-asap 🌊

Most helpful comment

You might be right, I haven't looked too much into the PRs. However, it might still be worth merging them if the code is OK, so that other people can try them out.
Regarding the issue arising multiple times per day, it might be related to the fact that I'm connected to many peers (~500), which might trigger a deadlock more often than on other nodes.

All 78 comments

This sounds concerning, and debugging it seems non-trivial. The best thing would be a backtrace of what all the threads are doing when the node freezes up, which requires attaching gdb to the process while it is in the bad state and executing thread apply all bt.
One way of maybe collecting a hint on what is going on could be to run the node with -l shutdown and see what is logged after SIGTERM: I'd expect you to see messages from the various subsystems shutting down (until one of them doesn't; try comparing what you see on a non-stuck node to a stuck one).
You could also turn on logging for the sync subsystem with -l sync=debug, but that will produce a LOT of logs and might fill up your disk if you need to keep it on for a month.

Can you post your configuration please? It might help us reproduce.

We are also experiencing similar issues after the upgrade to v3.0.

dvdplm, thanks for suggestions, will try all that and report back.

Attaching a config.toml from the current hardware with 64 GB RAM that is fatdb synced. The previous hardware (which also exhibited the same problem) was fast synced and had 32 GB, so the configuration had smaller cache sizes, but otherwise it was similar.

config.toml.txt

@dvdplm There is a thread backtrace on https://github.com/openethereum/openethereum/issues/11737 which seems like the same issue as this one.

I see the same issue, it matches the 1-3x per month originally reported. I saw it happen on parity 2.7.2 and saw it again on openethereum 3.0.0.

I noticed this only on nodes with tracing enabled, and #11737 also has tracing. But @marko429, you don't have it, do you?

@dvdplm Could you elaborate on what exactly you would want from gdb? Only the output of thread apply all bt? I tried it once but it only spits out addresses, as there are no debug symbols. Is that still helpful to you?
I need to know in advance so I can run it quickly before restarting the node if I see it happen again.

@palkeo, I'll make the gdb trace the next time it happens. Can you please share your cache-sizes and ram size?

Same behavior here on 2 different Ubuntu machines. We have to monitor last synced block, compare it with etherscan and restart the client.

@palkeo, I'll make the gdb trace the next time it happens. Can you please share your cache-sizes and ram size?

My cache sizes are here: https://github.com/openethereum/openethereum/issues/11735
RAM size is 16G.

Experiencing both slow block import and the periodic freeze on both full/fast and archive nodes since 2.7.2, with tracing enabled.

2.5.13 runs fine.

#11737 / #11494: same behavior.

Could this be because of “database format changes” introduced in 2.7.2?

I tried it once but it only spits out addresses, as there are no debug symbols. Is that still helpful to you?

You are right, without the debug symbols in there it's not useful. My bad.

We are also experiencing those issues since version 2.7.2 on multiple independent nodes.

I can confirm this happens. Without more data points/metrics we are blind when diagnosing it; @adria0 is working on adding Prometheus metrics and that might shed more light here.

Is there anything you can think of that would be useful to you for troubleshooting it?
Would providing a debug build (with symbols and anything else that could help) make it easier to get you interesting core dumps and backtraces? Any logging option you think could help?

Is there anything that would be helpful for debugging while it is stuck? I have a stuck node right now...

Cross-referencing potentially related issues:

  • #11793
  • #11737
  • #11683

I had this problem too, on v3.0.1:

/opt/openethereum \
--warp \
--jsonrpc-port=60007 \
--logging="rpc=trace" \
--accounts-refresh=0 \
--db-path=/disk600/ethereumBlockchain \
--jsonrpc-server-threads=32 \
--jsonrpc-apis="all" \
--log-file=/home/parity/parity.log

SSD disk, server room with 1GBps internet; restart didn't help, reboot didn't help. Before that, RPC got really slow, about 10 seconds for an account get-balance call; we had about 3500 accounts.

To recover, I started over: deleted the accounts and the whole blockchain. It then failed to sync snapshots:

2020-06-29 19:17:49 IO Worker #0 INFO import Syncing snapshot 3664/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:17:50 IO Worker #2 INFO import Syncing snapshot 3665/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:17:57 IO Worker #3 INFO import Syncing snapshot 3670/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:01 IO Worker #2 INFO import Syncing snapshot 3673/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:05 IO Worker #1 INFO import Syncing snapshot 3675/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:10 IO Worker #3 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:15 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:20 IO Worker #3 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:25 IO Worker #2 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:30 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:35 IO Worker #0 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:40 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:45 IO Worker #3 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-29 19:18:50 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs

Very annoying; business has been down for a second day. Switching to geth.

Had the issue again. Attaching the OpenEthereum shutdown log and gdb output.

gdb.txt
shutdown.txt

Thank you @marko429 !

I will analyze it more later, but for now what I can see is that threads 46, 47, 48 and 49 get stuck in SyncSupplier::dispatch_packet. These four threads correspond to our four Sync/Network IoWorkers.

Threads 24, 25, 26 and 27 are Client IoWorkers.

Thread 39 is the last one that is deadlocked; it is called from the Importer thread.

I did a re-sync again, no hardware changes, no software changes. The snapshot got synced and it is now syncing blocks. I suspect there is a race-condition/deadlock problem.

Putting a link to the important gdb log here: https://github.com/openethereum/openethereum/issues/11737#issuecomment-637042123
Just to consolidate the information that we have.

From this log I can see that the Network/Sync IoWorker threads 33, 34, 35 and 36 are stuck.

And the Client IoWorkers are:
Threads 22 and 20 call the Informant timer and try to lock.
Thread 21 calls EthSync transactions_received.
Thread 19 calls the Informant on a timer, but this time it locks the mutex and is calling ethcore_sync::api::SyncProvider::status on EthSync.

@rakita One thing that is in #11737 as well is full debug logging up to the point of the hang - not sure if you have seen that (https://github.com/openethereum/openethereum/issues/11737#issuecomment-649679781). I'm also happy to do any debugging on our node when it hangs again, or to enable any extra logging you would like, to try and track this issue down.

@CrispinFlowerday I did see it and thank you for both gdb and debug log!

Compiling the data (see the sketch after this list):

  • In both gdb logs I can see that all four Network IoWorkers that receive messages from peers want to take a ReadLock, probably on sync: &RwLock<ChainSync>, in the supplier::SyncSupplier::dispatch_packet() function.
  • In the first log there are four Client IoWorkers and only one in the second log; they want to take a WriteLock by calling .sync.write() on ChainSync, from the function (EthSync as client_traits::ChainNotify)::transactions_received.

  • In the first log we can see that we want to take a WriteLock in (EthSync as client_traits::ChainNotify)::new_blocks by calling sync.write() on ChainSync. This call was not from an IoWorker but from the Importer::import_verified_blocks thread.

  • In the second log there are three IoWorkers left and all of them are calling the Informant. The first Informant locks its internal mutex and goes on while the other two wait for the first to release the mutex. After locking its internal mutex, the first Informant asks for a ReadLock by calling .sync.read().status() on ChainSync from (EthSync as ethcore_sync::api::SyncProvider)::status.
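To make the shape of this contention easier to see, here is a schematic Rust sketch of the access pattern described above. The types, names and thread counts are stand-ins chosen for illustration, not the real OpenEthereum code; only the read/write roles (network readers, ChainNotify/importer writers, Informant reader) mirror what the gdb logs show.

use std::sync::Arc;
use std::thread;

use parking_lot::RwLock;

// Stand-in for the real ChainSync; only the locking shape matters here.
struct ChainSyncState {
    best_block: u64,
}

fn main() {
    let sync = Arc::new(RwLock::new(ChainSyncState { best_block: 0 }));
    let mut handles = Vec::new();

    // Network/Sync IoWorker path: dispatch_packet-style shared access.
    for _ in 0..4 {
        let sync = Arc::clone(&sync);
        handles.push(thread::spawn(move || {
            let state = sync.read();
            let _ = state.best_block; // answer a peer request from shared state
        }));
    }

    // ChainNotify path (transactions_received / new_blocks): exclusive access,
    // driven by Client IoWorkers and the block-importing thread.
    {
        let sync = Arc::clone(&sync);
        handles.push(thread::spawn(move || {
            sync.write().best_block += 1; // update peer/sync state
        }));
    }

    // Informant path: SyncProvider::status is yet another shared access.
    println!("best block: {}", sync.read().best_block);

    for handle in handles {
        handle.join().unwrap();
    }
}

The point is simply that all of these paths funnel through the same RwLock<ChainSync>, so a single unreleased guard stalls the network workers, the importer and the Informant at once.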

Nice to see all the work on this issue! I see it happening almost daily, although I'm also running a light node.

Sounds like there is progress being made in understanding the issue. What can we expect to happen next?

I did some digging trying to find the cause of the issue. I thought a gdb log plus some code searching would be enough to pinpoint the deadlock, but unfortunately it is still unclear.

What we know is that the RwLock that covers ChainSync gets stuck, and it seems that somebody took write access and didn't release it; the strange thing is that I couldn't find that thread in the gdb logs (I even started to suspect parking_lot::Mutex, but that mutex implementation seems really stable and that assumption is probably not valid).

On that note, I added additional warning logs that should give us more information on who was the last one to lock that mutex. It saves the last 100 locks on writes/reads, prints them if we couldn't acquire either the read or the write lock within 15 minutes, and then panics after two days of sleep.
The branch is: rakita/11758_stuck_problem
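To make the idea concrete, here is a minimal sketch of such a traced lock. It is an illustration only, not the actual code in rakita/11758_stuck_problem; the type and method names are hypothetical, and it assumes parking_lot's timed try_read_for/try_write_for methods.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

use parking_lot::{Mutex, RwLock, RwLockReadGuard, RwLockWriteGuard};

// Wraps a lock and remembers who locked it recently.
struct TracedRwLock<T> {
    inner: RwLock<T>,
    // Last 100 (caller, timestamp, event) entries, guarded by its own mutex.
    history: Mutex<VecDeque<(&'static str, Instant, &'static str)>>,
}

impl<T> TracedRwLock<T> {
    fn new(value: T) -> Self {
        Self { inner: RwLock::new(value), history: Mutex::new(VecDeque::new()) }
    }

    fn record(&self, who: &'static str, event: &'static str) {
        let mut history = self.history.lock();
        if history.len() == 100 {
            history.pop_front();
        }
        history.push_back((who, Instant::now(), event));
    }

    fn read(&self, who: &'static str) -> RwLockReadGuard<'_, T> {
        // Wait up to 15 minutes; if that fails, dump the recent lock history.
        match self.inner.try_read_for(Duration::from_secs(15 * 60)) {
            Some(guard) => {
                self.record(who, "LockRead");
                guard
            }
            None => {
                eprintln!("possible deadlock; last locks: {:?}", *self.history.lock());
                self.inner.read() // keep waiting so behaviour is otherwise unchanged
            }
        }
    }

    fn write(&self, who: &'static str) -> RwLockWriteGuard<'_, T> {
        match self.inner.try_write_for(Duration::from_secs(15 * 60)) {
            Some(guard) => {
                self.record(who, "LockWrite");
                guard
            }
            None => {
                eprintln!("possible deadlock; last locks: {:?}", *self.history.lock());
                self.inner.write()
            }
        }
    }
}

Callers would pass a short label (like the "write12" and "on_peer_transactions" labels that show up later in this thread) so that the dumped history identifies the call sites.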

@CrispinFlowerday, @marko429 it would be really helpful if you could pull this branch and do the same as you did earlier with gdb.

Are the locks persistent? If not, a service restart should help, but it didn't. Though I'm not 100% sure: it takes 20 minutes for OpenEthereum to open the db, so I'm not sure I waited long enough.

I missed one read lock, and because of that I added a new commit to rakita/11758_stuck_problem.

@PavelNiedoba locks are not persistent; restarting the service should recover the node, at least for this issue. Your problem is probably related to #11265. The commit was merged a few days ago.

Restarts have always worked for me. Sometimes you can struggle to find good peers or have issues with udp hole punching, but that’s an entirely different issue

@rakita, I've been running the node with TSAN (ThreadSanitizer) enabled. A friend of mine told me that TSAN generally does not generate false positives, but his experience is from using it with C++.

I ran it for some hours, and I plan to keep it running for more time. https://gist.github.com/adria0/2c30c5b9367d20345c3d9fb703d9a244

Another observation I've made, which may or may not be linked to this: the more peers I have, the longer the node will keep running before getting stuck. E.g. with 32 min peers it happens very frequently, while with 256 min peers it seems to take much longer.

Hi - our node continues to hang. This is unfortunately a blocker for our support of the Berlin HF using OpenEthereum. Is there anything we can provide to help debug this issue?

Same issue here; it seems to stop syncing when broadcasting transactions.

Hello @vogelito @niulingyun, please reread the thread for more information.
But tl;dr: use the branch rakita/11758_stuck_problem; it will print warning logs when this problem happens, and that will help us debug.

@rakita, I caught the freeze with your rakita/11758_stuck_problem branch.

oe-freeze.txt

Thank you for the log, AlexSSD7!

In the logs I can see that the last write/read locks are:

ThreadId 11 write12, time :299,795535
ThreadId 14 write12, time: 299,760048
ThreadId 12 write12 time: 299,714896
ThreadId 13 write12 time: 299,706122

write12 corresponds to the transactions_received function from ChainNotify for EthSync.

ThreadId 31 on_peer_transactions read, time: 299,810642
ThreadId 29 on_peer_transactions read, time: 299,756657
ThreadId 30 process_priority_queue write, time: 295,306915
ThreadId 32 on_peer_transactions read, time: 283,876855

The last few locks are:

current_read:{}
("on_peer_transactions ThreadId(31)", Instant { tv_sec: 6280080, tv_nsec: 299810642 }, UnlockRead
("on_peer_transactions ThreadId(31)", Instant { tv_sec: 6280080, tv_nsec: 299803575 }, LockRead
("write12 ThreadId(11)", Instant { tv_sec: 6280080, tv_nsec: 299795535 }, LockWrite
("write12 ThreadId(14)", Instant { tv_sec: 6280080, tv_nsec: 299760048 }, LockWrite
("on_peer_transactions ThreadId(29)", Instant { tv_sec: 6280080, tv_nsec: 299756657 }, UnlockRead
("on_peer_transactions ThreadId(29)", Instant { tv_sec: 6280080, tv_nsec: 299754664 }, LockRead

The strange thing is that the last lock is a read lock that got released, and after that there is nothing. There is no write lock, which I would expect to see in order to pinpoint the location of the "deadlock".

I am again suspicious of parking_lot, and one of the paths we can take is to replace this particular lock with the std one (we are using parking_lot because it is faster). But before that, we will recheck whether there are any logical problems that we missed.

Another possibly suspicious thing is what @denisgranha mentioned on Discord: the usage of a different glibc library.

Another one ;)

oe-freeze-1.txt

And here is the formatted history. Thank you Alex!
formated_history.txt

Same behaviour as the first: no read locks, and the write should lock but it seems it could not do so. Very strange.

Here's my output; it got stuck after 14 days running the rakita branch.
debug.txt

Thank you @marko429 for the log!
Can you include a bit more of the logs? I want to see some stats on what was happening before it got stuck; if you don't have them, that is okay.
And can you tell me whether you used Docker, and whether it is Alpine or something else, or standard Linux? I am interested in the libc version (on Linux you can use ldd --version to get that).

@vorot93 is checking for logical problems. After that, we will probably replace that lock with the std one, just as a test.

Additionally, in issue #11760 you can see that creating snapshots can raise RAM usage and slow down the system, and there was no way to reduce it to an optimal level. On machines with <16 GB RAM this was a big problem and introduced a lot of instability.

@rakita: Sure, attached are the stats prior to the timeout. Running on Arch Linux, compiled from source and running the binary, not using Docker. ldd gives: (GNU libc) 2.31. Thanks for your help on this.
log_stats.txt

Here is the dump from our node (it finally hung again last night). It's an Ubuntu 18.04.4 EC2 instance, not using Docker, with glibc (Ubuntu GLIBC 2.27-3ubuntu1.2) 2.27.

openetherem-stuck-2020-07-27.txt

I noticed this happens much more frequently on one of our nodes that has a lot of peers, so possibly the deadlock is related to the network code path?
And I have only ever seen nodes with tracing enabled deadlock like that.

It seems that it is not related to libc (2.31 is the newest one), or to whether it is Docker or bare metal.

@vorot93 couldn't find anything interesting in the code, so I took the next step and replaced parking_lot with the standard RwLock, just for testing.
The branch is here: rakita/11758_use_std_rwlock
Just a note: because the std RwLock does not have try_until, if the deadlock happens it will not write my previous warning logs; it will just stop writing any periodic logs.
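Roughly, the swap looks like the sketch below (stand-in names, not the actual rakita/11758_use_std_rwlock code). Two practical differences follow from it: std guards return a poisoning Result, so every acquisition gains an expect/unwrap that parking_lot does not need, and there is no timed acquisition, which is why the 15-minute warning from the diagnostic branch cannot be carried over.

use std::sync::RwLock;

struct ChainSyncState; // stand-in for the real ChainSync

struct ChainSyncApi {
    // parking_lot::RwLock<ChainSync> in the original; std::sync::RwLock here.
    sync: RwLock<ChainSyncState>,
}

impl ChainSyncApi {
    // Shared access, as used on the network read paths (hypothetical method name).
    fn on_network_packet(&self) {
        let _sync = self.sync.read().expect("ChainSync lock poisoned");
        // ... serve the packet under shared access ...
    }

    // Exclusive access, as used by ChainNotify callbacks (hypothetical method name).
    fn transactions_received(&self) {
        let _sync = self.sync.write().expect("ChainSync lock poisoned");
        // ... update sync/peer state under exclusive access ...
    }
}

fn main() {
    let api = ChainSyncApi { sync: RwLock::new(ChainSyncState) };
    api.on_network_packet();
    api.transactions_received();
}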

I've updated our node, but it typically only hangs once a week or so, so it might be a while.

Update here too.

I'd roughly estimate 30-50% higher block import times in this std_rwlock version. Of course, it's a pretty poor comparison, since other things may not be the same.

Yeah, unfortunately it is somewhat expected to be slower. As palkeo assumed, this is related to the network: all transactions/blocks received lock this mutex to synchronize the peers' state, and that's a lot of locks :). The reason why parking_lot was used in the first place is speed.

I am not very sure but I think the deadlock is related to sync: RwLock<ChainSync> in struct ChainSyncApi.

An incomplete trace through the source code:
https://www.processon.com/view/link/5f3abb321e085306e161606f

SyncProtocolHandler contains sync: ChainSyncApi and chain: BlockChainClient

  1. sync.read() is called before on_peer_transactions(&sync, io, ...), and io contains chain.
  2. chain can be a struct Client, so io.chain().queue_transactions() calls queue(chain.io_channel, ..., closure).
  3. The closure takes Client as a parameter and calls client.notify on it. Then the channel sends the closure.
  4. I notice that channel.send() can call channel.send_sync(), which happens in the same thread.
  5. handler.message() is called on each handler in channel.Handlers.
  6. When the message type is Execute, it calls the closure, which in turn calls transactions_received on every ChainNotify (EthSync impls ChainNotify).
  7. transactions_received() finally calls sync.write().

What concerns me is how to decide whether
SyncProtocolHandler.sync == SyncProtocolHandler.chain.io_channel.handlers[N].client.notify[N].SyncProtocolHandler.sync
and whether channel.send_sync() may happen at the same time. It is just too complex to verify.
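For illustration, here is a tiny self-contained Rust sketch of the hazard this chain would create if the closure really did run synchronously on the thread that already holds the read guard. It is not OpenEthereum code; try_write is used instead of write so the example terminates rather than hanging.

use parking_lot::RwLock;

struct ChainSyncState; // stand-in for the real ChainSync

fn main() {
    let sync = RwLock::new(ChainSyncState);

    // What a dispatch_packet-style handler holds while serving a peer:
    let _read_guard = sync.read();

    // If the notification path ran synchronously on this same thread, its
    // write attempt could never be granted while the read guard is alive:
    match sync.try_write() {
        Some(_) => println!("write acquired (cannot happen while the read guard is held)"),
        None => println!("write unavailable: a same-thread read -> write chain would deadlock"),
    }

    // In the real code the closure is normally delivered through a channel with
    // try_send, i.e. executed on a different thread (see the reply below).
}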

Hi @BurtonQin, thank you for looking into this; you are really good at debugging.
Yes, it is tied to RwLock<ChainSync>, and all your points are correct.

But (there is always a but) assumption number 4 does not apply in this context:
4. I notice that channel.send() can call channel.send_sync(), which happens in the same thread.
That is because that particular Channel always has a Sender set, which means it will always call the async self.tx.try_send(msg); the only place where it is not set is in VerificationQueue, because that queue has its own thread pool, but that is a different story.

If we did call send_sync in some case and that were the reason for the bug, the gdb logs would look a lot different, the logs that I added would show us who was the last one to lock the mutex, and lastly, this bug would be a lot easier to reproduce.

std_rwlock version has been running for about three weeks now without an issue.

I guess this bug is related to the priority between read and write. parking_lot prioritizes write over read, which means a read lock will block when a write lock is waiting. But the priority of the std RwLock depends on the underlying OS; there is no priority on Linux.

@BurtonQin yes, that is a big difference between parking_lot and glibc/musl that affects how the app works. Both glibc and musl, I think, favor the read lock over the write lock, while parking_lot favors write over read. I think this is a good explanation of the topic for others: link.

But the reason why I think the problem is with parking_lot, and not with who favors what, is the test we did and the logs we added (branch).
The problem where somebody takes a read lock and does not release it is something I expected to see with the test logs and parking_lot, but a strange thing happened: we got an empty list. From this I can see that nobody has locked that mutex (output), while some threads are waiting to take the write lock, and we get a deadlock. There are two possible explanations for this output: either the logs I added do not cover some case, or there is a bug in parking_lot. Artem additionally checked the code and couldn't find anything wrong, and this is why my conclusion is that this is a problem with parking_lot.
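For anyone who wants to see that fairness difference in isolation, here is a small standalone Rust demo (assuming parking_lot as a dependency; it is not OpenEthereum code). Once a writer is queued, parking_lot's task-fair policy also holds back new readers, which is the recursive-read deadlock shape discussed here; try_read_for is used so the demo times out instead of hanging.

use std::thread;
use std::time::Duration;

use parking_lot::RwLock;

fn main() {
    let lock: &'static RwLock<u32> = Box::leak(Box::new(RwLock::new(0)));

    let first_read = lock.read(); // this thread holds a read guard

    // A second thread queues a writer; it blocks until `first_read` is dropped.
    let writer = thread::spawn(move || {
        *lock.write() += 1;
    });
    thread::sleep(Duration::from_millis(200)); // give the writer time to start waiting

    // With a writer waiting, a new read on this thread is normally not granted
    // under parking_lot's task-fair policy; a plain read() here could deadlock.
    match lock.try_read_for(Duration::from_millis(200)) {
        Some(_) => println!("recursive read succeeded (reader-preferring behaviour)"),
        None => println!("recursive read timed out: this is the deadlock shape"),
    }

    drop(first_read); // release the first reader so the queued writer can proceed
    writer.join().unwrap();
    println!("final value: {}", *lock.read());
}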

I can confirm our node is working fine as well with the rakita/11758_use_std_rwlock branch. I have also now seen a hang on a Ropsten node, which hadn't occurred for us previously (using the release code, although right now I'm not sure exactly which release we are using there).

@CrispinFlowerday and @marko429 thank you both for the help!

Both our OpenEthereum nodes (16 cores, 16 GB memory, RAID-5 SSD pool, archiving+tracing) have been stuck for ~3 weeks.

Even restarting doesn't solve the problem anymore. The nodes are utilizing the SSD pool and just doing nothing. Please raise the priority of this issue; OpenEthereum is pretty much useless at this point.

Edit:

std_rwlock version has been running for about three weeks now without an issue.

Oh, and unfortunately this did not help in our case. We are now running one node with the rakita/11758_use_std_rwlock branch and one with the last official release (v3.0.1).

Happy to provide logs if it helps.

Hello @cogmon, if a node does not recover after a restart, that means this is a different problem.

On that note, we did notice a lot of regressions introduced with the big changes in 2.7.x, and 3.0.x is mostly based on 2.7.x with rebranding. In the next release, we decided to move back to the stable 2.5.x and make it ready for the Berlin fork. #11858

Hi There, thanks for the info. I have started Parity v2.5.13 and it does not find any peers:

2020-08-24 12:43:11 UTC Starting Parity-Ethereum/v2.5.13-stable-253ff3f-20191231/x86_64-linux-gnu/rustc1.40.0
2020-08-24 12:43:11 UTC Keys path /data/parity/base/keys/ethereum
2020-08-24 12:43:11 UTC DB path /data/parity/db/ethereum/db/906a34e69aec8c0d
2020-08-24 12:43:11 UTC State DB configuration: fast +Trace
2020-08-24 12:43:11 UTC Operating mode: active
2020-08-24 12:43:11 UTC Configured for Ethereum using Ethash engine
2020-08-24 12:43:11 UTC Removed existing file '/data/parity/base/jsonrpc.ipc'.
2020-08-24 12:43:12 UTC Updated conversion rate to Ξ1 = US$406.73 (11707778 wei/gas)
2020-08-24 12:43:16 UTC Public node URL: enode://xxxxxxxxxxxxxx@<ip>:30303
2020-08-24 12:43:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs

Is there any special requirement to run this version?

@g574, same here

You shouldn't use 2.5; it doesn't support the Constantinople and Istanbul hard forks.

Hi There, thanks for the info. I have started Parity v2.5.13 and it does not find any peers:

2020-08-24 12:43:11 UTC Starting Parity-Ethereum/v2.5.13-stable-253ff3f-20191231/x86_64-linux-gnu/rustc1.40.0
2020-08-24 12:43:11 UTC Keys path /data/parity/base/keys/ethereum
2020-08-24 12:43:11 UTC DB path /data/parity/db/ethereum/db/906a34e69aec8c0d
2020-08-24 12:43:11 UTC State DB configuration: fast +Trace
2020-08-24 12:43:11 UTC Operating mode: active
2020-08-24 12:43:11 UTC Configured for Ethereum using Ethash engine
2020-08-24 12:43:11 UTC Removed existing file '/data/parity/base/jsonrpc.ipc'.
2020-08-24 12:43:12 UTC Updated conversion rate to Ξ1 = US$406.73 (11707778 wei/gas)
2020-08-24 12:43:16 UTC Public node URL: enode://xxxxxxxxxxxxxx@<ip>:30303
2020-08-24 12:43:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs

Is there any special requirement to run this version?

Yes @g574, Parity 2.5.13 has a hardcoded list of bootnodes, but Parity Tech is no longer running them. You can switch to the Gnosis-maintained bootnodes by using

--bootnodes enode://68f46370191198b71a1595dd453c489bbfe28036a9951fc0397fabd1b77462930b3c5a5359b20e99677855939be47b39fc8edcf1e9ff2522a922b86d233bf2df@144.217.153.76:30303,enode://ffed6382e05ee42854d862f08e4e39b8452c50a5a5d399072c40f9a0b2d4ad34b0eb5312455ad8bcf0dcb4ce969dc89a9a9fd00183eaf8abf46bbcc59dc6e9d5@51.195.3.238:30303

this should work.

@adria0 Thanks a lot, it works.

@islishude see #11858.

On that note, we did notice a lot of regressions introduced with the big changes in 2.7.x, and 3.0.x is mostly based on 2.7.x with rebranding. In the next release, we decided to move back to the stable 2.5.x and make it ready for the Berlin fork. #11858

@rakita 2.5.13 works fairly well, so that looks like a good strategy.

we're running 2.5.13 where possible as well, so yes, looking forward to it 👍

I'm not at all surprised, yet somehow relieved, that my issue is apparently not CentOS 7 related. I had eventually closed the old issue after adding the --reserved-peers nodes.txt argument to sync to my local Go-Ethereum archive node.

https://github.com/openethereum/openethereum/issues/11297

@BurtonQin yes, that is a big difference between parking_lot and glibc/musl that affects how the app works. Both glibc and musl, I think, favor the read lock over the write lock, while parking_lot favors write over read. I think this is a good explanation of the topic for others: link.

But the reason why I think the problem is with parking_lot, and not with who favors what, is the test we did and the logs we added (branch).
The problem where somebody takes a read lock and does not release it is something I expected to see with the test logs and parking_lot, but a strange thing happened: we got an empty list. From this I can see that nobody has locked that mutex (output), while some threads are waiting to take the write lock, and we get a deadlock. There are two possible explanations for this output: either the logs I added do not cover some case, or there is a bug in parking_lot. Artem additionally checked the code and couldn't find anything wrong, and this is why my conclusion is that this is a problem with parking_lot.

Does it look like this behavior? https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html

"This lock uses a task-fair locking policy which avoids both reader and writer starvation. This means that readers trying to acquire the lock will block even if the lock is unlocked when there are writers waiting to acquire the lock. Because of this, attempts to recursively acquire a read lock within a single thread may result in a deadlock. "

@matkt, to be honest, I expected something like that when I first looked at parking_lot, but when I looked at the gdb logs and the code, it didn't fit. It was a strange conclusion, and that is why we added logs after every lock, to try to deduce who is the last one to lock it, and we got an empty list.

I started running a node with https://github.com/openethereum/openethereum/pull/11769 and https://github.com/openethereum/openethereum/pull/11766 applied, and I haven't seen the issue for the past ~30h (whereas it would occur multiple times per day before); it might be worth merging the PRs.

Hello @ngotchac,

#11769 can potentially matter only if there are calls to the parity_addReservedPeer/parity_removeReservedPeer API. I wrote a longer reasoning in the issues.

#11766 is tied to restore_db, which only happens when warping finishes.

Neither of these commits explains how it could fix this issue or what gdb showed us. And the information that for you this issue happens multiple times a day is inconsistent with what multiple users reported and with what I saw while testing.

You might be right, I haven't looked too much into the PRs. However, it might still be worth merging them if the code is OK, so that other people can try them out.
Regarding the issue arising multiple times per day, it might be related to the fact that I'm connected to many peers (~500), which might trigger a deadlock more often than on other nodes.

The std_rwlock version is still working without a fault. It's been almost a month now. I'd wager very strongly now that this issue is fixed by switching to the standard Rust locks.

@CrispinFlowerday, is std_rwlock still working well for you? We are still running without a single lock-up. I don't think we have ever run this long since 2.7 came out. It seems to me the issue is solved.

Yep, still going strong for us - been running since July 29th with no issues.

@rakita could we please consider merging these PRs and cutting a release? It would be great to reduce the uncertainty about the next version, specifically whether we'll have to do a full re-sync.

Please use 2.5 if this issue affects you. Next release (3.1) will be based on 2.5 which is not affected.

@vorot93 has this been 100% decided? And if so, is there any word on a downward migration script, considering the concerns expressed in #11858?
