Selenium: Queued sessions inconsistent behaviour Grid V4 vs Grid V3

Created on 20 Jan 2021 · 6 Comments · Source: SeleniumHQ/selenium

🐛 Bug Report

I have 2 grids:

  • 3.141.59-20210105
  • 4.0.0-beta-1-prerelease-20210114

with 1 Chrome node. You can use the following Docker Compose file to spin up the same environment:
docker-compose.yml.zip

I am sending 2 Chrome tests in parallel but see different behaviour between Grid v3 and Grid v4:
nunitProject.zip

Selenium 3 Grid behaviour

  • Because there is just one node, one test is put into a queue and waits for the first one to finish; it is then executed successfully.

Logs:
selenium3GridLogs.txt

Selenium 4 Grid behaviour

  • The first test is sent to the grid and executes properly.
  • The second test tries to create a session and eventually times out.

After the first test has finished, the GraphQL response shows an available slot, yet that slot is not assigned to my second test:
{
  "data": {
    "grid": {
      "nodes": [
        {
          "id": "bb800483-fb81-4fea-849b-62ec4a23cb7b",
          "uri": "http://172.19.0.3:5555",
          "status": "UP",
          "sessions": [],
          "maxSession": 1,
          "capabilities": "[\n {\n \"slots\": 1,\n \"browserName\": \"chrome\"\n }\n]"
        }
      ],
      "uri": "http://172.19.0.2:4444",
      "totalSlots": 1,
      "usedSlots": 0,
      "sessionCount": 0
    }
  }
}
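To make the contradiction concrete: the response reports totalSlots 1 and usedSlots 0, and its only node is UP with no sessions and maxSession 1, so the Grid itself believes one Chrome slot is free. Below is a minimal Python sketch that reads those fields, assuming only the field names visible in the response above. Note that `capabilities` is a JSON-encoded string inside the JSON, so it needs a second decode. `free_slots` and `idle_nodes` are hypothetical helpers, not part of Selenium.

```python
import json
from typing import List, Tuple

# The GraphQL response quoted above, reproduced verbatim.
RESPONSE = r'''{ "data": { "grid": { "nodes": [ { "id": "bb800483-fb81-4fea-849b-62ec4a23cb7b", "uri": "http://172.19.0.3:5555", "status": "UP", "sessions": [], "maxSession": 1, "capabilities": "[\n {\n \"slots\": 1,\n \"browserName\": \"chrome\"\n }\n]" } ], "uri": "http://172.19.0.2:4444", "totalSlots": 1, "usedSlots": 0, "sessionCount": 0 } } }'''


def free_slots(response_text: str) -> int:
    """Grid-wide free capacity as reported by the Grid itself."""
    grid = json.loads(response_text)["data"]["grid"]
    return grid["totalSlots"] - grid["usedSlots"]


def idle_nodes(response_text: str) -> List[Tuple[str, List[str]]]:
    """(node uri, browser names) for every UP node with a free session slot."""
    grid = json.loads(response_text)["data"]["grid"]
    result = []
    for node in grid["nodes"]:
        caps = json.loads(node["capabilities"])  # second decode: string -> list
        browsers = [c["browserName"] for c in caps]
        if node["status"] == "UP" and len(node["sessions"]) < node["maxSession"]:
            result.append((node["uri"], browsers))
    return result


if __name__ == "__main__":
    print(free_slots(RESPONSE))  # 1
    print(idle_nodes(RESPONSE))  # [('http://172.19.0.3:5555', ['chrome'])]
```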

  • Furthermore, the Grid UI gives no indication that the node is in use, and no information about the queued session either.

Logs:
seleniumGrid4Logs.txt

C-grid R-awaiting answer

All 6 comments

Thank you for providing all the details. I was able to bring up the setup from the docker-compose file and run the C# tests. However, I am unable to reproduce the issue. Based on the logs, I can see the first request being handled and its session deleted from the local store. The second request then gets picked up from the queue and a session is created; after a while, that session is deleted too. This is the expected behaviour.

log.txt

Regarding the Grid UI, we are currently working on exposing information about queued sessions via GraphQL and displaying it in the UI.

Hi @pujagani, thanks for your time. This issue is a bit tricky to replicate, but it is still happening. It looks like it happens when you spin up a fresh grid/node instance (meaning the behaviour only occurs the first time you send your tests) and one of the tests has to wait a bit longer in the session queue, e.g. 20 s.

I have updated the test solution so you can trigger the tests from the command line (to rule out any odd IDE behaviour; I am using Rider, for example).

Steps to reproduce:

  • Spin up the Grid and node from the attached compose file. You will have Grid 3 running on localhost:5555 and Grid 4 on localhost:4444, both with 1 Chrome node, so by switching the grid URL in code you can run the same tests on both grids.
    docker-compose.yml.zip
  • Check the Grid 4 logs and make sure the Chrome node was registered.
  • Open and build the attached c# solution
    Solution1.zip
  • Execute the tests: dotnet /Solution1/ConcurrentTest/bin/Debug/netcoreapp3.1/ConcurrentTest.dll --workers 2

Expected behaviour

  • The first test executes within 20 s; the second one waits in the queue and is then picked up by the grid.
  • The CommandTimeout value for RemoteWebDriver is set to 2 minutes (I expect both tests to finish within 40 s, well below 2 minutes).
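The expected timing is plain arithmetic: with one slot and two 20 s tests submitted together, the second request waits roughly 20 s in the queue and both finish after about 40 s, far inside the 120 s CommandTimeout. Below is a small Python simulation of a FIFO queue in front of a fixed number of slots, assuming all requests arrive at t = 0, ignoring browser start-up time, and assuming the queue wait is what counts against CommandTimeout. `schedule` is a hypothetical helper, not Grid code.

```python
from collections import deque
from typing import List, Tuple


def schedule(durations: List[float], slots: int) -> List[Tuple[float, float]]:
    """FIFO queue in front of `slots` identical slots; all requests arrive at t=0.

    Returns (start, finish) in seconds for each request, in submission order.
    """
    free_at = [0.0] * slots  # when each slot next becomes free
    queue = deque(durations)
    result = []
    while queue:
        duration = queue.popleft()
        i = min(range(slots), key=free_at.__getitem__)  # earliest-free slot
        start = free_at[i]
        free_at[i] = start + duration
        result.append((start, free_at[i]))
    return result


COMMAND_TIMEOUT_S = 120  # RemoteWebDriver CommandTimeout from the report (2 minutes)

if __name__ == "__main__":
    runs = schedule([20, 20], slots=1)
    print(runs)  # [(0.0, 20.0), (20.0, 40.0)]
    longest_wait = max(start for start, _ in runs)
    print(longest_wait < COMMAND_TIMEOUT_S)  # True
```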

Current behaviour

  • This works perfectly on Grid 3, but Grid 4 behaves differently: the first test passes, while the second one waits 2 minutes and times out. From the logs, it looks like the session distributor did not even notice that 2 tests were coming to the grid, so it creates only one session and deletes it after that test is done.
    SecondTestIgnoredSessionNotCreatedtxt.zip

Interestingly, after the first failure, if I execute the very same tests again, session queuing works fine from that point onwards and both tests consistently pass.

Thank you for sharing the detailed steps to recreate the issue. I attempted to recreate it and saw the 2nd request error out only the very first time I ran the docker-compose file; I did not see it again.
To narrow down the problem area, I made the following attempts:

  1. Ran the tests against the Standalone Docker container: the test passed.
  2. Ran the tests against the fully distributed Grid containers: the test passed.
  3. Wiped out all Docker volumes, images and containers and re-ran the Hub-Node containers: the test passed.
  4. Wiped out all Docker volumes, images and containers and re-ran the Standalone container: the test passed.
  5. Wiped out all Docker volumes, images and containers and re-ran the fully distributed Grid containers: the test passed.
  6. Edited the Hub-Node docker-compose file to pass command-line options and re-ran the setup: the test passed.
  7. Tweaked the test to wait for 30 seconds in the queue and ran it against all Grid modes: the test passed.
  8. To rule out issues related to long-running threads, let the Hub-Node Docker containers run for a few hours and tested against them: the test passed.

So far, I have not been able to recreate the issue, which I need in order to triage it, fix it and test the fix.
If possible, please share how you are creating the "fresh" instances each time. Even the Docker commands might help. Thank you!

Since the issue seems to happen only once, on startup, it may be a timing issue. The Distributor within the Hub may not yet be fully ready when it receives the 1st request (it checks every second whether the queue has a request and then polls it). In that case the 1st request is ignored, while the 2nd request arrives once the Distributor is ready and is executed correctly.
I am guessing such a thing would be a one-off, though, and would not happen every time fresh instances are created.
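The timing hypothesis above can be made concrete with a toy model. This is a hypothetical Python simulation, not the Grid's Distributor code: a queue polled once per tick (standing in for the 1 s poll interval), one slot, 20-tick sessions, and a distributor that becomes ready at tick `ready_at`. The flag `drop_when_not_ready` models the suspected bug, in which a request taken off the queue before the distributor is ready is never matched to a slot.

```python
from collections import deque
from typing import Dict, List


def simulate(
    arrivals: Dict[int, List[str]],
    ready_at: int,
    ticks: int,
    drop_when_not_ready: bool,
    session_ticks: int = 20,
) -> List[str]:
    """Toy model: one slot, queue polled once per tick.

    Returns the ids of requests that were given a session, in order.
    """
    queue = deque()
    served: List[str] = []
    slot_free_at = 0  # tick at which the single slot becomes free
    for t in range(ticks):
        queue.extend(arrivals.get(t, []))
        if not queue:
            continue
        if t < ready_at:
            if drop_when_not_ready:
                # Suspected bug: request polled before the distributor is ready
                # and never matched to a slot, so it is silently lost.
                queue.popleft()
            continue
        if t >= slot_free_at:
            served.append(queue.popleft())
            slot_free_at = t + session_ticks
    return served


if __name__ == "__main__":
    both = {0: ["test-1", "test-2"]}
    print(simulate(both, ready_at=1, ticks=60, drop_when_not_ready=True))   # ['test-2']
    print(simulate(both, ready_at=1, ticks=60, drop_when_not_ready=False))  # ['test-1', 'test-2']
```

In the buggy variant one request never gets a session; on the client side that would look like a new-session call that hangs until the 2-minute CommandTimeout, matching the symptom in the report.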

To rule out any race conditions, I have created a PR that makes the Distributor use only the local Grid model to check Grid capacity, instead of the remote Node status: https://github.com/SeleniumHQ/selenium/pull/9120
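The idea behind the PR, deciding capacity from state the Distributor already holds rather than from a round trip to the Node, can be sketched as a lock-guarded in-memory model in which checking for a free slot and reserving it happen in one step, so two concurrent requests cannot both claim the same slot. This is a hypothetical Python illustration; `GridModel`, `NodeModel` and their methods are invented names and do not mirror the code in the PR.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeModel:
    uri: str
    max_sessions: int
    active_sessions: int = 0


@dataclass
class GridModel:
    """Hypothetical local capacity model owned by a distributor."""
    nodes: List[NodeModel] = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False, compare=False)

    def has_capacity(self) -> bool:
        with self._lock:
            return any(n.active_sessions < n.max_sessions for n in self.nodes)

    def reserve(self) -> Optional[NodeModel]:
        """Atomically find a node with a free slot and claim it."""
        with self._lock:
            for node in self.nodes:
                if node.active_sessions < node.max_sessions:
                    node.active_sessions += 1
                    return node
            return None

    def release(self, node: NodeModel) -> None:
        """Return a slot once the session on `node` has ended."""
        with self._lock:
            node.active_sessions -= 1
```

Usage: with `grid = GridModel([NodeModel("http://172.19.0.3:5555", max_sessions=1)])`, the first `grid.reserve()` returns the node, and further calls return `None` until `grid.release(node)` frees the slot.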

To ensure the Grid is ready before starting the tests, please refer to https://github.com/SeleniumHQ/docker-selenium#waiting-for-the-grid-to-be-ready.
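The linked docker-selenium section polls the Grid's status endpoint until it reports ready. Below is an equivalent Python sketch. It assumes the Grid 4 `/status` endpoint returns the W3C WebDriver status shape `{"value": {"ready": ...}}` and that it is reachable at `http://localhost:4444/status`, as in the compose setup above. `wait_for_grid` is a hypothetical helper, and `fetch`, `sleep` and `clock` are injectable only to make it testable.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(url: str) -> dict:
    """GET the status endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_grid(
    status_url: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    fetch: Callable[[str], dict] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll status_url until value.ready is true; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        try:
            if fetch(status_url).get("value", {}).get("ready") is True:
                return True
        except (OSError, ValueError):
            # URLError/HTTPError subclass OSError; invalid JSON raises ValueError.
            pass  # Grid not accepting requests yet: keep polling
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage: call `wait_for_grid("http://localhost:4444/status")` before starting the test run and abort if it returns `False`. This avoids sending the first requests while the Hub is still starting, which is the window the timing hypothesis above is concerned with.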

Thanks @pujagani. One more thing I will try is to run the Grid in fully distributed mode, make sure all components are ready, then run the same tests; if I hit the same problem, I will collect logs from all components and report back.

Hi @pujagani, I did more testing on this with the latest prerelease, 4.0.0-beta-1-prerelease-20210128, restarted the environment multiple times, and I am no longer seeing the reported behaviour. I think we can close this ticket. Thanks again for the time you spent on the investigation.
concurrentSessionsOk.log
