With eXist 5 I frequently encounter hanging servers after 3 or 4 days. All stack traces point to BTreeCache - called from the structural index - which is the only runnable thread. All other threads are blocked. The server does not recover, even if waiting an hour. Configuration is unchanged. This needs to be investigated.
"qtp826245889-5576" #5576 prio=5 os_prio=0 tid=0x00007fcd4c3dd000 nid=0x74c0 runnable [0x00007fce0e055000]
java.lang.Thread.State: RUNNABLE
at org.exist.storage.cache.BTreeCache.removeNext(BTreeCache.java:67)
at org.exist.storage.cache.BTreeCache.add(BTreeCache.java:45)
at org.exist.storage.cache.BTreeCache.add(BTreeCache.java:38)
at org.exist.storage.cache.BTreeCache.add(BTreeCache.java:1)
at org.exist.storage.btree.BTree.getRootNode(BTree.java:508)
@wolfgangmm The other threads that are blocked... can you show the stack traces too please?
I added the stack dump above
@wolfgangmm It's a bit tricky as your stack trace shows that the shutdown has been initiated. This compounds the issue of understanding what is happening, as you have many threads trying to take a broker, whilst the shutdown is blocking those and waiting for any outstanding brokers to be returned. The outstanding broker is likely in the BTreeCache#removeNext, where it seems to be spinning, so I suspect some sort of race-condition on the state of that BTreeCache.
The BTreeCacheis marked @NotThreadSafe, so the first thing I would check is that all access to it is correctly synchronized.
Otherwise, do you have a jstack from where the issue occurs, but before shutdown is issued?
Unfortunately I did overwrite the earlier jstack after triggering the shutdown, but I have observed the same scenario 3 or 4 times in a row. All stack traces were pointing to BTreeCache#removeNext, so I'm investigating in this area.
I was just looking at the BTreeCache class because I just experienced a similar problem while stress testing my collections today. My test is constantly reloading XML files while performing operations (queries & XSL transformations) against them. (Will supply stack traces if I can recreate it.) I was able to connect with a debugger and look at what's happening. BTreeCache.removeNext seemed to be working on a cache with a single item in it that wasn't eligible for removal, and thus was stuck in an infinite loop -- except that when I restarted all the threads but the one stuck in the loop (an accidental click on the wrong icon), everything was able to resume (which I take to mean some other thread added something to the cache.)
BTreeCache.removeNext and the related LRUCache.removeNext concern me somewhat. Neither checks for the possibility that nothing is eligible for removal from the cache. It's a little clearer in LRUCache that unless there's an outside influence, an unwritten assumption, or some side-effect in the allowUnload or the get* methods calls the changes the return value of allowUnload, the loop could potentially go on forever. In BTreeCache.removeNext, the loop needs to traverse everything in the cache twice (once with mustRemoveInner false and once with it true) before it will repeat its tests.
So perhaps a lock is missing somewhere, but I haven't yet got a handle on the locking conventions and have no clue where to look.
@paulmer That's very interesting. I was trying to reproduce the issue with stress tests to make sure it is indeed fixed, but without success so far. Could you tell us which of the caches had only one page in it? In my case the hang seemed to be caused by the structural index - which should always be much larger than one page. So even though I see the potential for an endless loop, I'm not 100% sure what effectively causes it.
If you would like to help testing, you may also try Adam's patch here: https://github.com/adamretter/exist/tree/refactor/sequence-long-hash-map
@wolfgangmm The situation appeared again during overnight testing, and I've attached a stack trace captured via Eclipse's debugger. (I'll add a jstack trace if it completes.) It looks to me like it's the structure index cache that's inconsistent. The looping thread is "qtp1036420486-2501". The LRUCache object has a "max" value of 256, and it's map has an item count of 256, but there's only one item in the map:
Here's a better trace with jstack:
trace.txt
I think I see a code path that might lead to the map becoming empty without the item count being updated. Line 152 of SequencedLongHashMap (function insert(key, value)) removes an entry from the linked lists if it is a duplicate of the entry being inserted, and it does this without updating the item count. It then returns that entry. I suspect it's a copy-and-paste issue, and the remove() shouldn't be there? (Edit: Nevermind. I see how "insert" is working in conjunction with "put" now to move duplicate items to the back of the LRU queue.)
@paulmer I also came to the starvation conclusion in our Slack channel:
Adam Retter 6:34 PM
@Wolfgang so it looks to me like both LRUCache#removeOne and BTreeCache#removeNext have potential lock-up bugs. There are a couple circumstances where it might happen, but they all involved the cache being full and a new item being added.
- The cache contains only entries which cannot be unloaded. i.e. cached.allowUnload() == false.
- The cache contains items which have the same key as the new item being added.
- A mix of (1) and (2).
Either way, it seems to me the policy for replacing an item in the cache when it is full is not safe.
The above goes for the LRUCache - the BTreeCache is similar but introduces additional invariants into the lockup (edited)
@Wolfgang happy to discuss when you have a moment - it looks to me like LRUCache#removeOne and BTreeCache#removeNext need to be re-written to cope with the scenario that the cache is full and the items in the cache cannot be “unloaded” etc
I created a patch, which does not solve the starvation issue, but does remove quite some custom hash-map code in SequencedLongHashMap. Maybe you would like to try it? https://github.com/adamretter/exist/tree/refactor/sequence-long-hash-map
I think the expression
if(cached.allowUnload() && cached.getKey() != item.getKey() &&
(mustRemoveInner || !cached.isInnerPage()))
is too solid
We have to remove cached.allowUnload() if during several cicles noone element is removed. That will prevent us from locking ourselves in jail.
for instance some of if(cached.getKey() != item.getKey() && cached.allowUnload() && cicleCount%2 == 0 && (mustRemoveInner || !cached.isInnerPage())) will help
@adamretter I'll give your patch a try. The starvation issue while a real theoretical problem I think is a red herring since the code is only triggered if the cache is full. The problem I'm seeing is that somehow the item count (which triggers the loop) and the linked list contents get out of sync. I've put some consistency checking code into the SequencedLongHashMap class and found that the number of items in the sequence (first/last list) becomes different from the number of items in the values array and the "items" count at some point. It's that inconsistency that ultimately causes the entry into the infinite loop: the count is full, but the list is empty. I don't see anything obviously wrong with the code so I can't help wondering if the issue might be related to unsynchronized concurrent updates corrupting the sequence. I haven't figured out a way that could happen either, though.
@paulmer I am hoping that my patch addresses that.
@paulmer @wolfgangmm Is this now considered fixed?
@adamretter I've been running the code in production for several weeks now and I'm no longer seeing any problems (whereas the server was locking up after a couple of hours before), so I think this is fixed.
Most helpful comment
@adamretter I've been running the code in production for several weeks now and I'm no longer seeing any problems (whereas the server was locking up after a couple of hours before), so I think this is fixed.