So, I've decided to track this one down finally.
Got a _low priority_ todo list when I have spare time, and decided to dedicate a few hours figuring this long time error out with Vault & Paper:
Reference:
https://github.com/MilkBowl/Vault/issues/719
https://github.com/MilkBowl/Vault/issues/735
Latest Paper/Vault builds: https://pastebin.com/raw/hS6PNAzW
Wrote a script to start server 200 times monitoring for this error.
First thing I did was run the script on the latest spigot 1.12.2 build. (No occurrences)
Next, search for plugin conflictions, however the results of this test was showing that no specific plugin causes this problem. Removing several plugins stops the errors, and adding them back, starts the error up again. I tried shuffling through plugin removals, additions, changes, you name it, I tried it.
Got it down to a 10 plugin environment reproducing the error quite well.
Next, decided to jump through Paper builds to see if I can at least pin point where this started. And finally, progress has been made. After a good thousand startups and shuffling through builds, I've got it down to this:
Paper build 1349 - 200 startups - (No occurrences)
Paper build 1350 - 200 startups - (78 occurrences)
Whatever you guys changed in this commit:
https://github.com/PaperMC/Paper/commit/7dfbe442471640b40d76217cd91a79dbd25238c1
Is what's causing this vault issue.
The reason it's so hard to reproduce is because, as stated earlier, I genuinely cannot figure out what specific plugins must be on the server for this to happen. I do however have a very reproducible environment available.
CC: @sleaker - tagging you as perhaps you can share some input since it's your plugin, however possibly a paper issue. Or perhaps Vault is doing something wrong that paper doesn't like as of 1350? Who knows, I'm not a dev, I'll leave that for you guys to figure out :P
The changes made to that class have no observable impact on behavior outside of paper no longer ensuring that all async tasks have actually started running, your issue isn't a paper issue, but it's likely that if it's much harder to replicate in spigot that such a behavior change may prod a bit harder at a potential bug in another plugin accessing the plugin manager;
This does look like LP might not be handling the potential that it might be accessed from multiple threads in a state where it needs to recalculate permissions
I would not say clearly a paper issue.
The error is in permissions. If that change is the source, then the plugin is doing something unsafe to begin with and paper simply made it more visible.
Those errors are on the main thread, and that change represents async tasks...
The only thing I can see differing here is that previously, the async tasks were scheduled in alignment with the main thread a little tighter, that may of gave more breathing room to avoid the issue.
But that doesn't mean its paper's fault, it's still a plugins fault, but it might not be vaults directly, but instead luck perms.
modifying permissions should be synchronized if done off main.
I do see potential that we can likely fix it by adding a synchronized to recalculatePermissions. I don't know why that isn't there already.
But I will back port that to 1.12
Fixed in ec58c20 and 998bf84. Ultimately its the plugins issue, as this will still be a risk in spigot, just rarer I guess.
Tested with above change.
126 startups and 0 errors :)
@andrewkm I tracked it down to that commit back in my report, but linked to the direct patch file as it was the only point where Paper played around with the scheduler so my assumption was just that it would be caused by their scheduler changes. Appreciate you taking the time to dig into it :D
@aikar - the original Bukkit scheduler runs tasks Synchronously, so the std runtask operation should never cause a concurrent mod exception, this is why I was pointing at PaperSpigot. The runTaskAsynchronously call isn't used in vault.. so I guess then maybe luckperms was performing stuff asynchly? I'm not sure... but maybe they do have their own scheduler (I'm not familiar with it). I would agree that it shouldn't be patched into paper necessarily if your scheduler changes are correctly blocking the main thread, otherwise someone should potentially slap luckperms for their scheduler? I don't know... atleast the synchronized stuff works I guess?
@Sleaker yes, none of this is an issue in vault. vault was the victim here.
Luck has to be modifying permissions async, which permission modification was not safe to call async. So Luck is the actual culprit here, but I made permissions safe to be changed async now in the commit that fixed it here.
Luck will still have risks on Spigot though.
Also to explain the patch, nothing was changed about synchronous tasks. It was asynchronous tasks that got changed.
Before this change, dispatching async tasks themselves would take up time on main, but it didn't take up time for execution. But this serial dispatch would mean that the start of an async task was less likely to align with concurrent sync tasks being ran, but no guarantees.
But the way the Bukkit scheduler worked was horribly bad -- if you had tons of async tasks, you would lag the main thread still....
But with Paper's change, we no longer hold up the main thread dispatching scheduled tasks, but they still execute at the same scheduling as they normally would.
Just, they will now start a bit faster than before, or maybe a little slower, depending on conditions and placement.
But no async task should ever be relying on order on the main thread, there is no guarantee there, so our change is not an API break.
I don't see anything that suggests it's an LP issue.
Whilst it's true LuckPerms does perform some operations async, non thread-safe collections in Bukkit are not touched. LP like many other permission plugins extends & replaces the standard PermissibleBase class for players - as such, this field is never used by LP.
At least it's all resolved thanks everyone :)
This file is what threw the error, which offers no additional thread safety over what bukkit provided (which as I said, Bukkit itself was not safe until now)....
Since all the modifying methods delegate to the original, you're still subject to the issues I just fixed in the paper commit.
Your code is not safe to be used async outside of Paper, however if you apply the same changes I did in paper to your wrapper, you should be able to guarantee thread safety then.
At least synchronize on the modification methods.
Indeed, but LP is just overriding. It's not using any of those methods async (and therefore causing the issue).
The patches in Paper don't do anything to make querying the permissions field thread safe.
@lucko querying was not a problem...
it's a simple .get() call, which is either set or not.
The problem here is something called an UPDATE operation async.
I guess there could be yet another plugin thats manipulating permissions, but you said above LP does do some async operations.
Does LP change permissions at all async (ie, attempting to offload recalc?)
Nope, the recalculate method is not used by the LP implementation.
This area is a hot spot for performance in the Bukkit implementation, especially when plugins change attachment permissions frequently, or when lots of permissions need to be set.1 The permissible in LP was rewritten to avoid this, whilst still (aiming to!) retaining the expected system behaviour.
1 This is why you see quite a few plugins using reflection to modify attachment permissions, and then calling the recalculate method manually once they're done modifying. e.g. Factions (MassiveCore plugins in general), Towny, PlotSquared, etc etc
@lucko
java.util.ConcurrentModificationException: null
at java.util.HashMap$HashIterator.nextNode(Unknown Source) ~[?:1.8.0_181]
at java.util.HashMap$KeyIterator.next(Unknown Source) ~[?:1.8.0_181]
at org.bukkit.permissions.PermissibleBase.clearPermissions(PermissibleBase.java:168) ~[patched_1.12.2.jar:git-Paper-1560]
at org.bukkit.permissions.PermissibleBase.recalculatePermissions(PermissibleBase.java:149) ~[patched_1.12.2.jar:git-Paper-1560]
at me.lucko.luckperms.bukkit.model.permissible.MonitoredPermissibleBase.recalculatePermissions(MonitoredPermissibleBase.java:132) ~[?:?]
at org.bukkit.craftbukkit.v1_12_R1.command.ServerCommandSender.recalculatePermissions(ServerCommandSender.java:66) ~[patched_1.12.2.jar:git-Paper-1560]
at org.bukkit.plugin.SimplePluginManager.dirtyPermissibles(SimplePluginManager.java:675) ~[patched_1.12.2.jar:git-Paper-1560]
at org.bukkit.plugin.SimplePluginManager.calculatePermissionDefault(SimplePluginManager.java:654) ~[patched_1.12.2.jar:git-Paper-1560]
at org.bukkit.plugin.SimplePluginManager.addPermission(SimplePluginManager.java:626) ~[patched_1.12.2.jar:git-Paper-1560]
at org.bukkit.plugin.SimplePluginManager.addPermission(SimplePluginManager.java:614) ~[patched_1.12.2.jar:git-Paper-1560]
at net.milkbowl.vault.Vault$1.run(Vault.java:155) ~[?:?]
It appears this MonitoredPermissibleBase is still ending back up in bukkits version.
If 2 things call .addPermission on whatever object is occurring here, and recalculatePermissions is called concurrently, that is what's triggering this como.
Mhm, the monitored class just delegates back to the Bukkit object - only players have their permissible completely replaced.
My point stands:
I don't see anything that suggests it's an LP issue.
Best solution would be to use a read/write lock to protect the permissions map, or use a thread safe collection. Making the methods synchronized is only beneficial if all of the mutate methods are changed - which unless I'm missing something, the Paper patch doesn't do? (nevermind, I misread :) )
synchronizing is the only safe way to handle it as the recalc method has multiple sub methods it has as part of the process, so you gotta ensure no thread jumps in the middle --- which is exactly what was happening in these crashes.
so it seems like somethings modifying a non player permissible async then ? what would be modifying permissions outside of your own code?
anyways i recommend you apply the same synchronization to all of your wrappers and that should help give same guarantees we have here.
Most helpful comment
Also to explain the patch, nothing was changed about synchronous tasks. It was asynchronous tasks that got changed.
Before this change, dispatching async tasks themselves would take up time on main, but it didn't take up time for execution. But this serial dispatch would mean that the start of an async task was less likely to align with concurrent sync tasks being ran, but no guarantees.
But the way the Bukkit scheduler worked was horribly bad -- if you had tons of async tasks, you would lag the main thread still....
But with Paper's change, we no longer hold up the main thread dispatching scheduled tasks, but they still execute at the same scheduling as they normally would.
Just, they will now start a bit faster than before, or maybe a little slower, depending on conditions and placement.
But no async task should ever be relying on order on the main thread, there is no guarantee there, so our change is not an API break.