When a bunch of concurrent requests to /grid/register are made, there is a chance that the following exception will occur:
java.lang.NullPointerException
at java.base/java.util.TreeMap.rotateRight(TreeMap.java:2240)
at java.base/java.util.TreeMap.fixAfterInsertion(TreeMap.java:2272)
at java.base/java.util.TreeMap.put(TreeMap.java:580)
at org.openqa.selenium.AbstractCapabilities.setCapability(AbstractCapabilities.java:98)
at org.openqa.selenium.MutableCapabilities.setCapability(MutableCapabilities.java:100)
at org.openqa.grid.internal.utils.configuration.GridNodeConfiguration.lambda$fixUpCapabilities$12(GridNodeConfiguration.java:402)
at java.base/java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:441)
at java.base/java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:442)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at org.openqa.grid.internal.utils.configuration.GridNodeConfiguration.fixUpCapabilities(GridNodeConfiguration.java:410)
at org.openqa.grid.common.RegistrationRequest.<init>(RegistrationRequest.java:92)
at org.openqa.grid.common.RegistrationRequest.<init>(RegistrationRequest.java:59)
at org.openqa.grid.common.RegistrationRequest.<init>(RegistrationRequest.java:48)
at org.openqa.grid.common.RegistrationRequest.fromJson(RegistrationRequest.java:123)
at org.openqa.grid.web.servlet.RegistrationServlet.process(RegistrationServlet.java:100)
at org.openqa.grid.web.servlet.RegistrationServlet.doPost(RegistrationServlet.java:70)
When this exception occurs, all further attempts to register fail, regardless of which node makes the registration attempt.
Start a hub, and then have a large number (~30) of nodes connect at once to the hub.
Detailed steps to reproduce the behavior:
As described above, once the hub is started, connect a large number of nodes at the same time (we started seeing it happen with around 20-30 nodes). We've observed this behavior both in a Docker stack that starts the nodes all at once, and in the attached script when run locally without Docker. It happens roughly once in every 20 times the hub and nodes are started up.
The nodes are expected to connect to the hub without issue, and show up in the /grid/console.
# Start the hub and give it time to come up.
nohup java -jar selenium-server-standalone-3.141.59.jar -role hub >hub.log 2>&1 &
sleep 10
# First node: keep its log for inspection.
nohup java -jar selenium-server-standalone-3.141.59.jar -role node -port 5555 -browser browserName=chrome,maxInstances=1 -host 127.0.0.1 -hub http://127.0.0.1:4444/grid/register >node1.log 2>&1 &
# Remaining 30 nodes (ports 5554 down to 5525), all started at once with output discarded.
for port in $(seq 5554 -1 5525); do
  nohup java -jar selenium-server-standalone-3.141.59.jar -role node -port "$port" -browser browserName=chrome,maxInstances=1 -host 127.0.0.1 -hub http://127.0.0.1:4444/grid/register >/dev/null 2>&1 &
done
OS: Amazon Linux, macOS
Browser: Chrome, Firefox
Browser version: Various
Browser Driver version: Various
Language Bindings version: Unknown
Selenium Grid version (if applicable): 3.141.59
There appears to be another symptom with the same cause: TreeMap.getEntry gets stuck in an infinite while loop, maxing out the CPU usage of all the threads. This symptom produces no visible output in the logs, but it likewise prevents any registrations from taking place.
The cause appears to be that somehow, the "caps" TreeMap field of AbstractCapabilities is being accessed concurrently. I can confirm that this issue is not present in releases of Selenium as recent as 3.14.0.
Patching the standalone jar with an AbstractCapabilities class where this line:
private final Map<String, Object> caps = new TreeMap<>();
is replaced with this:
private final Map<String, Object> caps = Collections.synchronizedSortedMap(new TreeMap<>());
fixes the issue.
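As a minimal sketch of why the synchronized wrapper helps, the following standalone program has many threads write to one shared map at once. It is not Selenium code: the class name SynchronizedCapsSketch and the methods newCaps and fillConcurrently are hypothetical, and the map only stands in for the "caps" field. Because every put goes through the wrapper's lock, the tree is never rebalanced by two threads at the same time.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; not part of Selenium.
final class SynchronizedCapsSketch {

  // Stand-in for the patched "caps" field: a TreeMap behind a synchronized wrapper.
  static Map<String, Object> newCaps() {
    return Collections.synchronizedSortedMap(new TreeMap<>());
  }

  // Starts `threads` writers that each put `keysPerThread` distinct keys into
  // the shared map, waits for all of them, and returns the final map size.
  static int fillConcurrently(Map<String, Object> caps, int threads, int keysPerThread) {
    CountDownLatch start = new CountDownLatch(1);
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      final int id = t;
      workers[t] = new Thread(() -> {
        try {
          start.await(); // release all writers together to maximize contention
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        for (int k = 0; k < keysPerThread; k++) {
          caps.put("cap-" + id + "-" + k, k);
        }
      });
      workers[t].start();
    }
    start.countDown();
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return caps.size();
  }

  public static void main(String[] args) {
    int size = fillConcurrently(newCaps(), 30, 1_000);
    System.out.println("entries after concurrent puts: " + size);
  }
}
```

With a plain TreeMap in place of the wrapper, the same workload has no such guarantee: entries can be lost and the tree's internal links can be corrupted, which is consistent with both the NullPointerException in rotateRight and the endless loop in getEntry. Note that the wrapper only makes individual calls atomic; iterating over the map still requires synchronizing on it manually.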
This may or may not be the correct fix. It might be worth figuring out why this capabilities map is being accessed concurrently now when it wasn't in the past. I have yet to find anything concrete regarding that, though.
Please let me know any thoughts, questions, or concerns you may have regarding this bug report. This is impacting our systems greatly, as we regularly spin up grids of around 30 nodes, and it's causing maybe 1 out of 10 deployments to be duds. I will try to be prompt with any responses.
I can confirm that this actually happens with large grids in general, even when node registrations are spaced apart by a few seconds each. That is, while you're less likely to encounter NPEs or lock-ups when you trickle in node registrations, there's still a chance that you'll encounter them.
"I can confirm that this issue is not present in releases of Selenium as recent as 3.14.0."
To clarify what that means: are you confirming that the issue does not occur with 3.14.0, but does occur with 3.141.59?
Yes, we've never encountered the issue in 3.14.0, but run into it frequently with 3.141.59.
I think I found the root cause of the issue, introduced here: https://github.com/SeleniumHQ/selenium/commit/3ae0b6245488f079213763c5a80b5d4462fac84f#diff-d03d9f23eb4368f1081c345ba529a5c1
The same capabilities List is loaded from DEFAULT_CONFIG_FROM_JSON. Even though the enclosing list started to get copied in https://github.com/SeleniumHQ/selenium/commit/7426cb7d8a0f2c4edd982151e01c1d77d44d26eb#diff-d03d9f23eb4368f1081c345ba529a5c1, the capabilities inside were not copied.
These relevant lines from the stack trace in the initial comment indicate that every time RegistrationRequest.fromJson is called, fixUpCapabilities is called on the same default capabilities object, which on the whole just ends up getting overwritten later:
at org.openqa.grid.internal.utils.configuration.GridNodeConfiguration.fixUpCapabilities(GridNodeConfiguration.java:410)
at org.openqa.grid.common.RegistrationRequest.<init>(RegistrationRequest.java:92)
at org.openqa.grid.common.RegistrationRequest.<init>(RegistrationRequest.java:59)
at org.openqa.grid.common.RegistrationRequest.<init>(RegistrationRequest.java:48)
at org.openqa.grid.common.RegistrationRequest.fromJson(RegistrationRequest.java:123)
Please let me know if my interpretation is mistaken, though. It is partially based on conjecture.
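The aliasing described above can be sketched in isolation. This is a hypothetical illustration, not the Selenium source and not necessarily the change made in the eventual fix: the class SharedDefaultsSketch, the DEFAULT_CAPS constant, and the shallowCopy and deepCopy methods are invented names, with plain maps standing in for capabilities objects. Copying only the enclosing list leaves every copy pointing at the same inner maps, so a write through one copy is a write to the shared defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Selenium.
final class SharedDefaultsSketch {

  // Stand-in for a capabilities list that is parsed once and reused as the default.
  static final List<Map<String, Object>> DEFAULT_CAPS = new ArrayList<>();

  static {
    Map<String, Object> chrome = new TreeMap<>();
    chrome.put("browserName", "chrome");
    DEFAULT_CAPS.add(chrome);
  }

  // Copies only the enclosing list; the elements are still the shared maps.
  static List<Map<String, Object>> shallowCopy(List<Map<String, Object>> source) {
    return new ArrayList<>(source);
  }

  // Copies the list and every map inside it, so callers own their own maps.
  static List<Map<String, Object>> deepCopy(List<Map<String, Object>> source) {
    List<Map<String, Object>> copy = new ArrayList<>();
    for (Map<String, Object> caps : source) {
      copy.add(new TreeMap<>(caps));
    }
    return copy;
  }

  public static void main(String[] args) {
    // A write through the shallow copy lands in the shared default map.
    List<Map<String, Object>> shallow = shallowCopy(DEFAULT_CAPS);
    shallow.get(0).put("maxInstances", 1);
    System.out.println(DEFAULT_CAPS.get(0).containsKey("maxInstances")); // true

    // A write through the deep copy leaves the shared default untouched.
    List<Map<String, Object>> deep = deepCopy(DEFAULT_CAPS);
    deep.get(0).put("seleniumProtocol", "WebDriver");
    System.out.println(DEFAULT_CAPS.get(0).containsKey("seleniumProtocol")); // false
  }
}
```

If each registration request works on a shallow copy like this, concurrent requests all end up mutating the same unsynchronized TreeMap, which matches the stack trace above. Giving each request its own copy of the inner maps removes the sharing without needing a lock.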
Closed via #6924