Docker-selenium: Session [....] not available and is not among the last 1000 terminated sessions

Created on 12 Mar 2018  路  25Comments  路  Source: SeleniumHQ/docker-selenium

Meta -

Image(s):
hub/node-chrome
Docker-Selenium Image Version(s):
3.5.3-boron, 3.9.1-actinium
Docker Version:
17.09.1-ce

I've been seeing this issue for a while now, happening randomly. I'm testing with webdriverio + mocha, running in docker. I've seen this happen when running 2 chrome nodes, but I think the problem occurs more often (but not reproducible) when I run 5 chrome nodes.

Example of the error that I'm seeing:

Session [95635190928628f91d62dab7595e271d] not available and is not among the last 1000 terminated sessions. Active sessions are[ext. key 4304cbbbd14fcb10fd6a6866baaf22c6, ext. key 6e5afdf576a0249056ad69d89e7ad154, ext. key e2260047ddd1461aec6c3f7986eda7b3, 85a10aad-9e02-4bcf-a567-59817ccb0865 (int. key, remote not contacted yet.)]

I've noticed that every time this error happens, the following message is logged by the hub:
INFO - Removed a session that had not yet assigned an external key b9dba954-9635-4bff-b338-655f727baffc, indicates failure in session creation BROWSER_TIMEOUT

I've tried setting BROWSER_TIMEOUT to 0/-1/30000, I don't think it's actually related to it. After setting the hub's log level to DEBUG I managed to get a bit more insight, but I'm still unsure what's exactly happening. The following gist contains part of that log, I only left the thread that failed and another thread that I think is related to the issue. If any other information is required please let me know and I will supply it:
https://gist.github.com/anonymous/d570b0eb57143c03d9f8256eeb2865bb

From my limited understanding of how the hub works, I think this is timeline of what happens:

  • session requested by 75: 10:38:20.286 DEBUG - http-outgoing-75 >> POST /wd/hub/session HTTP/1.1
  • 86 is requesting a connection but not creating a new one?:
    ```10:38:25.949 DEBUG - Connection request: [route: {}->http://172.18.0.7:5555][total kept alive: 0; route allocated: 1 of 2000; total allocated: 5 of 2000]
    10:38:25.949 DEBUG - Connection leased: [id: 86][route: {}->http://172.18.0.7:5555][total kept alive: 0; route allocated: 2 of 2000; total allocated: 6 of 2000]
- 75 times out and deletes the session for some reason:
```10:38:25.956 DEBUG - http-outgoing-75 << "[read] I/O error: Read timed out"
10:38:25.956 DEBUG - http-outgoing-75: set socket timeout to 0
10:38:25.956 DEBUG - Executing request DELETE /wd/hub/session/7e54a0292efc104182ab778f2fa4e7cc HTTP/1.1
  • at the same time 86 is trying to get that session: 10:38:25.962 DEBUG - http-outgoing-86 >> GET /wd/hub/session/7e54a0292efc104182ab778f2fa4e7cc/window/current/size HTTP/1.1
  • which is returning 200 for the first 2 requests, but in the middle is actually getting deleted by 75: 10:38:25.963 DEBUG - http-outgoing-75 >> "DELETE /wd/hub/session/7e54a0292efc104182ab778f2fa4e7cc HTTP/1.1[\r][\n]"
  • 86 at some point times out and retries the requests:
    ```10:38:26.003 DEBUG - http-outgoing-86 << "[read] I/O error: Read timed out"
    10:38:26.003 DEBUG - http-outgoing-86: set socket timeout to 0
    10:38:26.003 DEBUG - Executing request POST /wd/hub/session/7e54a0292efc104182ab778f2fa4e7cc/execute HTTP/1.1
- this returns the following message: `10:38:26.202 DEBUG - http-outgoing-86 << "{"sessionId":"","status":6,"value":{"message":"no such session\n  (Driver info: chromedriver=2.35.528139 (47ead77cb35ad2a9a83248b292151462a66cd881),platform=Linux 3.13.0-143-generic x86_64)"}}"`
- next it seems like 86 tries to initialize a new sessions: `10:38:26.257 DEBUG - http-outgoing-86 >> POST /wd/hub/session HTTP/1.1`
- and it looks like the hub is giving it a new session: `10:38:33.513 DEBUG - http-outgoing-86 << "{"status":0,"sessionId":"95635190928628f91d62dab7595e271d","value":{"applicationCacheEnabled":false,"rotatable":false,"mobileEmulationEnabled":false,"networkConnectionEnabled":false,"chrome":{"chromedriverVersion":"2.35.528139 `
- and then fails:
```10:38:33.514 DEBUG - Connection [id: 86][route: {}->http://172.18.0.7:5555] can be kept alive indefinitely
10:38:33.514 DEBUG - Connection released: [id: 86][route: {}->http://172.18.0.7:5555][total kept alive: 1; route allocated: 1 of 2000; total allocated: 5 of 2000]
10:38:33.593 DEBUG - Session [95635190928628f91d62dab7595e271d] not available and is not among the last 1000 terminated sessions.
Active sessions are[ext. key 4304cbbbd14fcb10fd6a6866baaf22c6, ext. key 6e5afdf576a0249056ad69d89e7ad154, ext. key e2260047ddd1461aec6c3f7986eda7b3, 85a10aad-9e02-4bcf-a567-59817ccb0865 (int. key, remote not contacted yet.)]
10:38:33.710 DEBUG - Session [95635190928628f91d62dab7595e271d] not available and is not among the last 1000 terminated sessions.
Active sessions are[ext. key 4304cbbbd14fcb10fd6a6866baaf22c6, ext. key 6e5afdf576a0249056ad69d89e7ad154, ext. key e2260047ddd1461aec6c3f7986eda7b3, 85a10aad-9e02-4bcf-a567-59817ccb0865 (int. key, remote not contacted yet.)]
10:38:33.725 DEBUG - Session [95635190928628f91d62dab7595e271d] not available and is not among the last 1000 terminated sessions.
Active sessions are[ext. key 4304cbbbd14fcb10fd6a6866baaf22c6, ext. key 6e5afdf576a0249056ad69d89e7ad154, ext. key e2260047ddd1461aec6c3f7986eda7b3, 85a10aad-9e02-4bcf-a567-59817ccb0865 (int. key, remote not contacted yet.)]
10:38:33.750 DEBUG - Session [95635190928628f91d62dab7595e271d] not available and is not among the last 1000 terminated sessions.
Active sessions are[ext. key 4304cbbbd14fcb10fd6a6866baaf22c6, ext. key 6e5afdf576a0249056ad69d89e7ad154, ext. key e2260047ddd1461aec6c3f7986eda7b3, 85a10aad-9e02-4bcf-a567-59817ccb0865 (int. key, remote not contacted yet.)]

Any information/suggestions on how to debug this further would be appreciated !

R-awaiting-answer

Most helpful comment

@borisivan Here's the behavior that I believe I am seeing. Feel free to check out my repo to confim. We are using protractor to shard multiple features/tests into different sessions to run in parallel.

I'm noticing the order of operation looking like the following:

  1. Feature file 1 assigned session a
  2. Feature file 1 finishes test in session a and outputs test results.
  3. Session b assigned to feature file 2
  4. Session a does not terminate properly (potentially due to codebase)
  5. Feature file 2 throws an error saying that session b does not exist (but does exist on the node)
  6. Feature file 2 terminates, leaving unused session hanging on node.

I'm confirming this right now. So don't hold this as doctrine, but I am wondering more and more if the issue lies with some sort of asynchronous call in feature file 1 not terminating, leaving it's session open beyond the time frame of when the test should be "completed" and throws an error with the follow up feature file.

All 25 comments

+1 here. I haven't taken the time to dive into the debug log like @eyalzek But I have been experiencing the same error for about the last two weeks or so.

We're testing using protractor and behave, and I specifically only see this issue when I start up multiple sessions (5+) to test in parallel.

```23:50:12 [chrome #01-24] [23:50:12] I/runnerCli - Session [564a337b0773f8f8273510a48e1c3b79] not available and is not among the last 1000 terminated sessions.

23:50:12 [chrome #01-24] Active sessions are[ext. key f174e45b117e359a66ccc99043fb10ce, ext. key cf505647975e2f8162a15f31c80d21e0, ext. key cd98c8aac51adf41b413a1adc601e2c5, ext. key e4d688a4b4db6c7544be4d309214dec7, ext. key f4ca8698b3fe328984a2f2d7ec42b353, e8699ff8-8abc-4858-b51a-ee36f64bd61c (int. key, remote not contacted yet.)]
```

Hi @eyalzek, @loghen41

Would it be possible to have a script and the test runner configuration so we can try to reply this? Is this also happening with the latest releases? (3.11.0)?

Augmenting my previous comment here: This is our Docker-compose file we're using to set up our Selenium Grid

hub:
  image: selenium/hub:${GRID_VERSION}
  ports:
    - "4444:4444"
  environment:
    - GRID_TIMEOUT=30000
    - GRID_BROWSER_TIMEOUT=30000
chrome:
  image: selenium/node-chrome:${GRID_VERSION}
  links:
    - hub
  shm_size: 128M

@diemol in my case it's a complicated testing suite that isn't written by me, which also makes it harder to debug, so it's not that straightforward to try and get a minimum reproducible example... Better go with the example given by @loghen41 . In any case, this is an example of my compose file (though I'm running 5 chrome instances, A-E):

version: '2.1'

networks:
  e2e:

services:
  hub:
    image: selenium/hub:3.9.1-actinium
    container_name: hub
    ports:
      - 4444:4444
    environment:
      - GRID_MAX_SESSION=20
      - GRID_TIMEOUT=12000
      - GRID_NEW_SESSION_WAIT_TIMEOUT=30000
      - GRID_DEBUG=true
      - GRID_CLEAN_UP_CYCLE=20000
    networks:
      - e2e

  chrome-A:
    image: selenium/node-chrome${CI--debug}:3.9.1-actinium
    container_name: chrome-A
    depends_on:
      - hub
    environment:
      - NODE_APPLICATION_NAME=@A
      - NODE_MAX_INSTANCES=10
      - NODE_MAX_SESSION=10
      - HUB_PORT_4444_TCP_ADDR=hub
      - HUB_PORT_4444_TCP_PORT=4444
    ports:
      - 5900:5900
    volumes:
      - /dev/shm:/dev/shm
    networks:
      - e2e

@diemol

To directly respond to your request, Here's our configuration file for our grid runner:

var path = process.cwd();

exports.config = {
    seleniumAddress: 'http://10.61.132.147:4444/wd/hub',
    framework: 'custom',
    frameworkPath: require.resolve('protractor-cucumber-framework'),

    specs: [ path + '/features/**/*.feature'],

    capabilities: {
        'browserName': 'chrome',
        'loggingPrefs' : {'driver' : 'INFO', 'browser': 'INFO'},
        shardTestFiles: true,
        maxInstances: process.env.NUMBER_OF_SHARDS
    },

    cucumberOpts: {
        format: ['json:../../../../../results.json', "pretty"],
        require: [ path + '/features/step_definitions/**/*.js',
            path + '/../../support/commonStepDefinitions.js' ]
    }
};

And here is our command we use to launch the tests:

protractor ../../support/config/conf.chrome_shard_docker.js --specs 'features/**/*.feature' --params.environment=qa '--cucumberOpts.tags=~@inDevelopment' --disableChecks --params.showHelps=true

I will say, we are also testing multiple applications at once. So we could be having the issue of applications clashing with each other with stealing resources. But as I check the console, the jobs seem to be queueing properly, so I'm surprised to see them being stolen.

Would it be possible to have some test scripts as well? I mean, otherwise I need to try to build from scratch the same environment you guys already have there, and this can already oversee the issue that might be causing the final problem.

The ideal scenario would be to have a GitHub repo where we can reproduce the issue and do something about it.

I hope you can understand that we would like to help, but without that information it is very hard to do something efficiently.

@diemol That makes sense.... I thought you might say something like that. The codebase for our testing suite is unfortunately behind locked doors (and I don't have enough ownership to place it into a public forum) So... I unfortunately am in a similar situation as @eyalzek. I know the testing suite pretty well backwards and forwards, but am not allowed to make it public.

The only thing I'm not as familiar with is the underlying workings of the selenium grid. I'd be happy to potentially debug some things on my end, but I'm not entirely sure where to start.

I would try to things:

  • First, use a plain grid, without docker. Just to see if the error is directly in Selenium or if there is something we need to improve in the images for this case.
  • Second, I see that both reports have in common the use of 3rd party bindings (WebDriverIO and Protractor), I would also check in their repos and forums for similar problems. Just to isolate the issue and see if it is on the client side (bindings).

Now about the tests, perhaps the overall test structure can be put in a public GitHub repo. Like the package.json, configuration files, and then a set of simple tests like searching in Google or something like that. If the issue is in Selenium, I would expect that it can be reproducible with this type of configuration.

I haven't considered that it might be related to resource usage, but although this happens mostly on our CI, I did see this happen on my local machine which definitely has enough resources and the containers are not bound in any way... In any case, I suspect that starting:
1 x selenium hub
10 x chrome nodes
1 x mock server

and then running tests in parallels on all of those nodes, might reproduce this. I will try to create some kind of POC tomorrow if I have time.

@diemol I'll see what I can do to test/review with those recommendations and report back here my findings. I could potentially build a small little repo like that, I'll still have to check what permissions I have before I do so. I'll check into that now.

I've now done substantially more tests on my end, and I'm almost positive that the errors I was receiving beforehand is coming from a surplus of requests/lack of resources. I specifically see it in my case where I was spinning up 6 nodes, I found that anytime I reached 5-6 sessions, errors began popping up (from what it appears to be resources stealing sessions from each other)

As such I'll be migrating our local selenium grid to our aws environment, (as we lack more physical resources locally)

I'll be glad to post my results if this particular thread is still open.

After spending some time on this today I realized that when this error happens, it was always on the same test suite, which led me to focus on it and I realized that the suite running before it had a describe with a bunch of hooks, but the only test case was skipped, e.g:

describe('foo', function() {
    before(function(done) {
        ...
    });

    beforeEach(function(done) {
        ...
    });

    it.skip('bar', function(done) {
        ...
    });

    after(function(done) {
        ...
    });
});

once I skipped the entire suite (e.g: describe.skip(...)) it seemed to have fixed the problem. I'm running it over and over again to see if this reproduces.

Then again, even with the "faulty" configuration this doesn't happen always, so it's hard for me to say if this is a mocha / selenium issue, but it seems more likely that it's coming from mocha.

I will update if I manage to get any more info about this.

I spoke too soon, it just happened again elsewhere... but I'm still suspecting that mocha is the culprit (or js promises). Will dive deeper in the rabbit hole tomorrow.

Hi @eyalzek @loghen41,

Thanks for your effort in debugging the issue! Have you guys found anything related to the images? Seems that it is more related to the available resources, right?

Please let us know if there is something specific that can be improved, or a reproducible issue.

I'll leave this issue open for a few days more in case there are more findings that can help the community.

@diemol I'm currently transitioning our local grid into aws right now (which I'm almost finished with) What I've specifically found is that the problem only seems to occur when long running requests are sent to the grid beyond the capacity of what the resources of what that grid can handle

ie. if you have a grid with 10 nodes, and 10 max sessions, but it receives 20 requests, and the test suites sent to it are long running tests, they potentially can have their session taken from them and handed to one of the requests in queue. I have been curious to do load testing in a reproducible manner, but for my organization it's been more critical that I produce a working replacement for our current system. Once I get this up and running, I probably can look at doing some load testing.

Hi @loghen41, I have the same problem you referenced in the comment above. When running 6 tests in parallel, in a suite with 240 tests that each range from 30 seconds to 4 minutes, with 7 containers that stay up and running, I see 3-4 tests hit that problem once or twice per day, if the suite runs 10-15 times a day.

I was able to reduce the frequency of occurrence with converting all webDriver instantiation and retrieval with threadLocal usage, but it still happens. My suspicion is that there is something not threadsafe somewhere in the these selenium-grid docker solutions, but I haven't been able to figure it out either. (or provide a way to reliably reproduce). However, it's reliable enough that it happens daily, and always on a different test.

@borisivan which testing framework are you using?

@eyalzek java w/ testNG. Using Java threadLocal to create and provide the webDriver for the tests.

@diemol I've finished my work that I needed to complete for my work on aws. As such I've dedicated the better part of today to building a small testing repo in order to test this particular bug. I'm still working on duplicating it, but you can see the current version I'm at here:

https://github.com/loghen41/docker-selenium-grid

@diemol After a week or so, I finally have a reproduced version of our issue here: https://github.com/loghen41/docker-selenium-grid

It appears that the session is persisting even after a new session has been assigned to a new test suite. I am yet to identify the exact reason for this. I'm specifically reviewing this from a poor coding practice perspective/test suite perspective. (You'll notice that I'm forcing a particular part of the codebase to fail on purpose, which is what triggers the session management errors.) But if you have any insight from a selenium perspective, I'd be glad to hear it.

Hi @loghen41, when you say: "...even after a new session has been assigned to a new test suite", what do you mean by "assigned to a new test suite"?

@borisivan Here's the behavior that I believe I am seeing. Feel free to check out my repo to confim. We are using protractor to shard multiple features/tests into different sessions to run in parallel.

I'm noticing the order of operation looking like the following:

  1. Feature file 1 assigned session a
  2. Feature file 1 finishes test in session a and outputs test results.
  3. Session b assigned to feature file 2
  4. Session a does not terminate properly (potentially due to codebase)
  5. Feature file 2 throws an error saying that session b does not exist (but does exist on the node)
  6. Feature file 2 terminates, leaving unused session hanging on node.

I'm confirming this right now. So don't hold this as doctrine, but I am wondering more and more if the issue lies with some sort of asynchronous call in feature file 1 not terminating, leaving it's session open beyond the time frame of when the test should be "completed" and throws an error with the follow up feature file.

Hi all,

Apologies for the late reply.

After reading the whole issue again, it just looks like two different things.

What @eyalzek reports is related to the browserTimeout parameter, which was broken for a while and it was fixed in the upstream project recently.

@loghen41 shows a different issue, where sessions are lost when something fails in a given feature file. I tried to reproduce this with Java/TestNG (which is what I normally use) and I could not. I am not that good in Protactor, I would not know if that happens entirely due to the framework.

If this is still an issue, would it be possible to recreate it with plain Selenium code? Without any framework managing or reusing sessions?

Closing as issue become stale.

Was this page helpful?
0 / 5 - 0 ratings