Docker-selenium: Running 200 Chrome instances on two nodes on the same Docker host fails because of X window limits.

Created on 18 Jan 2017 · 30 Comments · Source: SeleniumHQ/docker-selenium

Meta -

Image(s):
node-chrome
node-firefox

Docker Version:
3.0.1-fermium, 3.0.1-dysprosium

OS:
Ubuntu Linux 16.04

Expected Behavior

On the same Docker host (Ubuntu 16.04, 16 GB, 8 cores):

Start a hub
Start 2 nodes with 100 MAX SESSIONS / INSTANCES each
Start a loop requesting 200 sessions through Selenium 3.0.1 Client
Should start and run 200 browsers successfully

Actual Behavior -

Requesting browsers works as intended until ~64 Chrome browsers or ~46 Firefox browsers are open.
After this, new browsers are returned very slowly, and existing browsers die off as soon as newer ones are opened.

Suspected Cause

The xvfb-run command used to run the 'headless' browser instances can only support a maximum of 256 'clients' for its 'screen'. After starting ~64 Chrome instances (every Chrome seems to consume 2 'clients'), this limit is reached and no new ones can be started. For Firefox, this is ~46 instances (it seems that Firefox consumes 3 'clients').

This was verified by adding the -maxclients 512 parameter to the xvfb-run -s parameter, which doubles the number of 'clients' per 'screen' and allowed us to launch twice as many browsers per node.
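
As a rough illustration of the workaround just described, the sketch below builds the Xvfb server arguments with a configurable client limit. The helper name xvfb_server_args is made up for this example. The screen geometry and extensions are the ones used elsewhere in this report, and the script only prints the resulting command rather than running it.

```shell
#!/bin/sh
# Build the Xvfb server arguments with a configurable client limit.
# xvfb_server_args is a hypothetical helper, not part of docker-selenium.
xvfb_server_args() {
  maxclients=$1
  printf '%s' "-screen 0 1024x768x24 -ac +extension RANDR -maxclients $maxclients"
}

# Print (do not run) a node start command with the limit raised to 512.
echo "xvfb-run -a -s \"$(xvfb_server_args 512)\" java -jar selenium-server-standalone-3.0.1.jar -role node"
```

Printing instead of executing keeps the sketch safe to try on a machine without Xvfb installed; the printed line can be pasted into a shell once it looks right.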

The more scalable solution is that processes started through separate xvfb-runs could use different 'screens', good for 256 clients _each_, by using the -a option. Our use case calls for ~600 browsers per host, so we wanted to use 6 nodes of 100 instances each.

However, it seems that separate chrome-node containers on the same host share the same 'screen', meaning that even with multiple chrome-node containers, the maximum number of started Chrome instances is still limited to ~128 (256 screen 'clients').

Compare

When two selenium nodes are started on the same host machine with

xvfb-run -a -s "-screen 0 1024x768x24 -ac +extension RANDR" java -Dwebdriver.chrome.driver=./chromedriver -jar selenium-server-standalone-3.0.1.jar -role node -nodeConfig node<#>.json

With the following node1.json

{
  "capabilities":
  [
    {
      "browserName": "chrome",
      "maxInstances": 100,
      "seleniumProtocol": "WebDriver"
    }
  ],
  "proxy": "org.openqa.grid.selenium.proxy.DefaultRemoteProxy",
  "maxSession": 100,
  "port": 5555,
  "register": true,
  "registerCycle": 5000,
  "hub": "http://localhost:4444",
  "nodeStatusCheckTimeout": 5000,
  "nodePolling": 5000,
  "role": "node",
  "unregisterIfStillDownAfter": 60000,
  "downPollingLimit": 2,
  "debug": false,
  "servlets" : [],
  "withoutServlets": [],
  "custom": {}
}

this will not exhibit the same limitations, and can start the 200 browsers without a hitch.

What we tried

We tried fixing this by cloning this project and changing the entrypoint.sh of the NodeBase source from xvfb-run -n $SERVERNUM to xvfb-run -a, but this still exhibited the same problem.

E-hard I-node-all S-investigating

Source

Menthalion

👍2

Most helpful comment

@madhavajay I've been able to get 100 chromes to run on a simple 8 core / 16 GB Dell desktop with these selenium dockers. I started 1 hub, and 2 nodes with 50 chromes each. Don't forget each node you start also has a java process running, so running one browser per dockernode like you might be able to do with PhantomJS isn't an option.

Be sure to have no limiting factors on the host OS, since this can be resource intensive in handles and processes (run as root):

echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf
echo "DefaultLimitNOFILE=10000000" >> /etc/systemd/system.conf
echo "UserTasksMax=infinity" >> /etc/systemd/logind.conf

echo "* soft     nproc          unlimited" >> /etc/security/limits.conf
echo "* hard     nproc          unlimited" >> /etc/security/limits.conf
echo "* soft     nofile         unlimited" >> /etc/security/limits.conf
echo "* hard     nofile         unlimited" >> /etc/security/limits.conf
echo "root soft     nofile         unlimited" >> /etc/security/limits.conf
echo "root hard     nofile         unlimited" >> /etc/security/limits.conf

Then start the hub
docker run -d -p 4444:4444 --name selenium-hub -e GRID_MAX_SESSION=200 selenium/hub:3.0.1-germanium
And two nodes
docker run -d --link selenium-hub:hub -v /dev/shm:/dev/shm -e NODE_MAX_SESSION=50 -e NODE_MAX_INSTANCES=50 selenium/node-chrome:3.0.1-germanium
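
The node command above has to be run once per node. A minimal dry-run sketch of that repetition follows; print_node_commands is a hypothetical helper written for this example, and it only prints the docker run lines (copied from the command above) rather than executing them.

```shell
#!/bin/sh
# Dry run: print one "docker run" line per node instead of executing it.
# print_node_commands is a hypothetical helper written for this sketch.
print_node_commands() {
  nodes=$1
  per_node=$2
  i=1
  while [ "$i" -le "$nodes" ]; do
    echo "docker run -d --link selenium-hub:hub -v /dev/shm:/dev/shm -e NODE_MAX_SESSION=$per_node -e NODE_MAX_INSTANCES=$per_node selenium/node-chrome:3.0.1-germanium"
    i=$(( i + 1 ))
  done
}

# Two nodes of 50 Chromes each, as in the setup described above.
print_node_commands 2 50
```

Once the printed lines look right, piping the output to sh would start the nodes, assuming the hub container is already running.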

Menthalion on 21 Feb 2017

👍4 🎉2

All 30 comments

Thank you VERY much for your thoroughly detailed report - I'll be investigating this.

ddavison on 18 Jan 2017

👍1

I see that you are using fermium. Fermium was released late last night (Eastern time). Was this an issue in earlier versions as well?

ddavison on 18 Jan 2017

Yes, this happened in 3.0.1-dysprosium as well. I should have noted that, sorry. We started exploring the possibilities of migrating our test platform from Selenium / PhantomJS to Selenium / Chrome in earnest only yesterday, and ran into this issue right away. Diagnosing the cause took a bit more time though.

I cloned the project to see if I could fix this in a local build today, which is when I noted that the version had changed.

As an aside, the two java nodes in the comparison were started as two separate user processes. Perhaps I should try and see if that still works when started from the same script.

Menthalion on 18 Jan 2017

@Menthalion very interesting issue. Out of curiosity, have you tried Chrome headless? It is still in the DEV channel, but I'm wondering how it handles your 200+ parallel tests.

google-chrome-unstable --headless --disable-gpu --remote-debugging-port=9222 https://testmysite.thinkwithgoogle.com

elgalu on 19 Jan 2017

I'm getting an error trying to set chrome startup arguments through selenium capabilities on that version.

It's officially not supported by the chromedriver, so that might be it. The same error came up on the Internet with mismatched driver / chrome versions.

So I can't really test this up to scale with my current code. Seeing it doesn't seem to use screens, my guess would be it would run fine.

Capabilities [{message=unknown error: cannot parse capability: chromeOptions
from unknown error: must be a dictionary
  (Driver info: chromedriver=2.27.440175 (9bc1d90b8bfa4dd181fbbf769a5eb5e575574320),platform=Linux 4.4.0-36-generic x86_64), platform=ANY}]
Session ID: d4dc3fc77c9d256ee5b9b6a57a897669
Build info: version: '3.0.1', revision: '1969d75', time: '2016-10-18 09:49:13 -0700'
System info: host: 'PC1610', ip: '10.0.111.31', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_111'
Driver info: driver.version: EventFiringWebDriver  sun.reflect.NativeConstructorAccessorImpl.newInstance0 (NativeConstructorAccessorImpl.java:-2)

Menthalion on 19 Jan 2017

@elgalu @ddavison We managed to get Chrome headless to run on the bare metal Ubuntu 16.04 machine (24 cores, 140 GB), but hit another limit scaling up to 600 browsers (6 nodes of 100 browsers on the same machine), failing our tests at the ~235 browser mark.

The error message of the Selenium nodes' Java process was "unable to create new native thread", followed by the bash error "bash: fork: retry: Resource temporarily unavailable".

The number of processes reported by ps -elfT at the time was around 12,000.

We managed to solve this by changing the following Linux settings:

Check the system-wide file limit with cat /proc/sys/fs/file-max and make sure it's at least 65535
Since every Chrome (headless) process creates two temp files, the soft and hard limits reported by ulimit -n and ulimit -u needed changing to unlimited by editing /etc/security/limits.conf and adding

* soft     nproc          unlimited
* hard     nproc          unlimited   
* soft     nofile         unlimited   
* hard     nofile         unlimited

In the [Login] section of /etc/systemd/logind.conf, change UserTasksMax from 12288 to 262140
In the [Manager] section of /etc/systemd/system.conf, set DefaultTasksMax=262140 and DefaultLimitNOFILE=262140. DefaultLimitNOFILE bleeds through to ulimit -n, which stays at its default of 1024 even if nofile is set to unlimited in limits.conf.
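
Before editing any of these files, the current values can be inspected without changing anything. The following is a minimal read-only sketch. The helpers meets_minimum and report are made up for this example, and reusing the 65535 floor from the file-max step for ulimit -n is an assumption, not a recommendation from the steps above.

```shell
#!/bin/sh
# meets_minimum VALUE MINIMUM: succeeds if VALUE is unlimited/infinity or >= MINIMUM.
meets_minimum() {
  case "$1" in
    unlimited|infinity) return 0 ;;
  esac
  [ "$1" -ge "$2" ]
}

# report LABEL VALUE MINIMUM: print whether VALUE reaches MINIMUM.
report() {
  if meets_minimum "$2" "$3"; then
    echo "$1: $2 (ok)"
  else
    echo "$1: $2 (below $3)"
  fi
}

# System-wide open file limit (only present on Linux).
if [ -r /proc/sys/fs/file-max ]; then
  report "fs.file-max" "$(cat /proc/sys/fs/file-max)" 65535
fi

# Soft open-file limit of the current shell; 65535 here is an assumed floor.
report "ulimit -n (soft)" "$(ulimit -n)" 65535
```

Running it again after the systemd and limits.conf changes (and a fresh login) should show whether the new values actually reached the shell that starts the nodes.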

After these changes we could start up and use 600 headless Chromes, which took up around 30K processes, as well as ~5GB in ~1200 Chromium temp files.

The same limits apply (even more severely) to normal Chrome processes that are run through xvfb-run java -jar, since every Chrome process is wrapped in an Xvfb process, nearly doubling the number of processes compared to Chrome headless, as well as opening X Window temp files next to the Chrome temp files. So this may come on top of the screen limits that xvfb-run introduces.

With the same settings, we never got above ~450 xvfb-wrapped normal Chrome instances over 6 nodes of 100 browsers each, with Chromes failing to start at the ~35K process mark.
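
The figures above allow a rough per-browser estimate, sketched below. The helper names are made up for this example, and the totals include every other process on the machine, so the results are only ballpark numbers.

```shell
#!/bin/sh
# tasks_per_browser TOTAL_TASKS BROWSERS: integer average of tasks per browser.
tasks_per_browser() {
  echo $(( $1 / $2 ))
}

# estimated_tasks BROWSERS TASKS_PER_BROWSER: projected task count.
estimated_tasks() {
  echo $(( $1 * $2 ))
}

headless=$(tasks_per_browser 30000 600)   # ~30K tasks for 600 headless Chromes
wrapped=$(tasks_per_browser 35000 450)    # ~35K tasks for ~450 xvfb-wrapped Chromes

echo "headless Chrome: ~$headless tasks per browser"       # ~50
echo "xvfb-wrapped Chrome: ~$wrapped tasks per browser"     # ~77
echo "600 xvfb-wrapped Chromes: ~$(estimated_tasks 600 "$wrapped") tasks"   # ~46200
```

An estimate like this can be compared against UserTasksMax and DefaultTasksMax before a run, instead of discovering the ceiling when browsers fail to start.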

Menthalion on 24 Jan 2017

👍4 🎉2

Great detailed report @Menthalion thank you so much.

Have you tried this scenario with Zalenium? @diemol and I have created this auto vertical scaling docker-selenium solution. You can quickly try it out with the one-liner installer:

curl -sSL https://raw.githubusercontent.com/dosel/t/i/p | bash -s 3 start

My guess is it won't handle that many tests on a 16 GB, 8-core machine, but if you have a moment to try it, I'm curious as well.

elgalu on 24 Jan 2017

@elgalu Well, the 16 GB machine I did the earlier tests on was my local machine. The 600-browser test was done on one of the two nodes of our main end-to-end test infrastructure, which have 24 cores / 140 GB each.

I'll see if we can give Zalenium a try there, I can't promise anything since we already spent a large part of the allotted project time on getting to this level. I'd need to bribe a scrum master here, or come up with some really convincing arguments why we'd need to investigate yet another avenue.

Menthalion on 24 Jan 2017

👍1

@elgalu Does Zalenium already support Chrome headless?

Menthalion on 24 Jan 2017

No. And I think it won't, as that would kill the video recording functionality, the live preview feature, etc.

But it may be included in docker-selenium for the non-debug images in the future I guess.

elgalu on 24 Jan 2017

@elgalu I don't think it will have much chance of running successfully then, since we're already running into resource problems on bare metal?

Menthalion on 24 Jan 2017

As much as I want to promote Zalenium (since we developed it with @elgalu), I think it is not the right tool for your scenario since we run each test in a new container created on demand.
I guess this would slow down your current execution times.

Nevertheless, perhaps you can try it in a different scenario and give us some feedback if you want :)

diemol on 24 Jan 2017

Thanks @diemol, this use case is indeed different: we're using Selenium for a load test of our product, which is why we need so many browsers. I'll keep an eye on Zalenium for the tests we launch from our build stack; it seems promising there.

@elgalu Chrome headless did scale for a simple test of starting the browsers and pointing them to a URL, but it seems to be still too immature for testing our SPA (a lot of the Selenium functions we called failed), which is a shame.

Menthalion on 25 Jan 2017

👍1

@ddavison Although my initial findings on the scaling of xvfb-run nodes in Docker versus xvfb-run on bare metal still stand, my conclusion about the cause seems to have been wrong.

After some more tests I found that an xvfb-run inside a container doesn't share resources with the host. However, a given number of Chromes on an xvfb-run node inside a container seems to consume twice as many Xvfb resources as on a similar xvfb-run node on bare metal. This can be checked with lsof -U | grep Xvfb
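
A minimal sketch of turning that check into a count follows. The helper count_xvfb_lines is made up for this example; it simply counts the lines of lsof -U output that mention Xvfb, which is assumed here to be a reasonable stand-in for the number of X client connections.

```shell
#!/bin/sh
# count_xvfb_lines: count stdin lines that mention Xvfb.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_xvfb_lines() {
  grep -c Xvfb || true
}

# Only query the system if lsof is installed; otherwise do nothing.
if command -v lsof >/dev/null 2>&1; then
  total=$(lsof -U 2>/dev/null | count_xvfb_lines)
  echo "Xvfb unix socket entries: $total"
fi
```

Running this inside a node container and on a bare metal node with the same number of open browsers, then dividing each count by that number, would give the per-browser consumption being compared here.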

So the Docker limit is not on the total number of Chromes over all nodes (128). Instead, the number of Chromes inside a single node is limited to 64, since every Chrome run in a container seems to consume 4 local resources, versus 2 for every Chrome on bare metal.

I verified this by running 4 nodes of 50 browsers each and starting 200 browsers, which worked like a charm. The limit might be raised by adding -maxclients 512 to the xvfb-run --server-args, which would make each node scale to 128 Chromes.
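
The per-node ceilings quoted here follow from a simple division, sketched below. The helper max_browsers is made up for this example, and it ignores any clients that Xvfb or other tools may hold themselves, so real ceilings could be slightly lower.

```shell
#!/bin/sh
# max_browsers MAXCLIENTS CLIENTS_PER_BROWSER: browsers one Xvfb server can hold.
max_browsers() {
  echo $(( $1 / $2 ))
}

echo "container, default limit:   $(max_browsers 256 4) Chromes per node"   # 64
echo "container, -maxclients 512: $(max_browsers 512 4) Chromes per node"   # 128
echo "bare metal, default limit:  $(max_browsers 256 2) Chromes per node"   # 128
```

With 50 browsers per node, each node stays under the 64-browser ceiling, which is consistent with the 4 x 50 run succeeding.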

Another option to increase it would be to identify why more processes are spawned per chrome run on the docker.

In any of these cases (not only for Chrome, which consumes 4 resources in the container versus 2 on bare metal, but also for Firefox, which already needs 3 resources on bare metal), a warning about these per-node limits might be in order.

This is because the browsers just silently stop spawning; there is no warning or timeout when this happens.

Menthalion on 26 Jan 2017

👍3 🎉1

@Menthalion I am trying to do the exact same thing as you: load testing with real browsers. We have a setup that's using Nightwatch, TestArmada, BrowserMobProxy, HarStorage and then Selenium Grid with PhantomJS in Docker containers connected to the Grid, plus lots of custom JS code to drive the tests and generate the reports.

I gave up on trying to get Selenium Docker Chrome to run at scale because it seems too resource heavy. I couldn't run 100 on 16 cores, but with PhantomJS I can.

I am also considering some Ghetto Cloud to get to 200 by running about 30-40 PhantomJS nodes on a bunch of the Dev iMac 5ks.

I'd be interested to chat and compare notes on what you have achieved, seeing as there are very few resources I can find out there on this specific topic, which is running lots of real browsers in parallel for load testing.

madhavajay on 2 Feb 2017

@ddavison @elgalu After getting bare metal scalability for Selenium / virtual framebuffer Chrome out of the way, I've been trying to get past the ~64 Chrome instances per Selenium node container by changing local builds, but to no avail.

I made sure none of the kernel / systemd limiting factors are in place and replaced Xvfb with an xpra / xdummy combination, but I still can't get a handle on why the X client limit / consumption within the Docker setup is twice as high as on two different bare metal systems.

Menthalion on 21 Feb 2017

Be sure to have no limiting factors on the host OS, since it can be resource intensive in handles and processes

echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf
echo "DefaultLimitNOFILE=10000000" >> /etc/systemd/system.conf
echo "UserTasksMax=infinity" >> /etc/systemd/logind.conf

echo "* soft nproc unlimited" >> /etc/security/limits.conf
echo "* hard nproc unlimited" >> /etc/security/limits.conf
echo "* soft nofile unlimited" >> /etc/security/limits.conf
echo "* hard nofile unlimited" >> /etc/security/limits.conf
echo "root soft nofile unlimited" >> /etc/security/limits.conf
echo "root hard nofile unlimited" >> /etc/security/limits.conf
Then start the hub
docker run -d -p 4444:4444 --name selenium-hub -e GRID_MAX_SESSION=200 selenium/hub:3.0.1-germanium
And two nodes
docker run -d --link selenium-hub:hub -v /dev/shm:/dev/shm -e NODE_MAX_SESSION=50 -e NODE_MAX_INSTANCES=50 selenium/node-chrome:3.0.1-germanium

Why do we need two node-chrome instances?

alkansa on 23 Mar 2018

Hi,
Apologies if this is not the correct place to ask, but I came across this issue when searching for the xvfb maxclients limit.
I'm running my tests on a 24-core, 128 GB machine.
I can easily open hundreds of browsers (Firefox).
What advantage does Docker provide over just opening those browsers on the machine?
Thanks
Vadim

traffisco on 9 Aug 2018

@traffisco

What advantage does Docker provide over just opening those browsers on the machine?

Portability
Small
Cross-platform
Idempotency
Scalable

ddavison on 9 Aug 2018

Hi,
I'm selecting components for a new server for automated tests.
This issue contains very important information for me about hardware and the number of browsers running in parallel.
I want to prepare for the purchase of the new server, so any examples of how many browsers run on which hardware (number of cores/threads, amount of RAM) are welcome, and I will be very grateful for them.

JarominP on 7 Jan 2019

@ddavison agree it's very convenient running in docker, but... scalable and small? Adding extra code to execute does not, as a rule, improve performance or reduce file size.

DanielHeath on 3 Apr 2019

@DanielHeath
but isn't that a strawman?
"scalable" in this context is about the ability to deploy and manage large-scale selenium grids... not selenium runtime performance or docker container overhead.

cgoldberg on 4 Apr 2019

Ahh, that's a more sensible interpretation, thank you. Yes, being able to use fleet management tooling is definitely a scalability advantage.

DanielHeath on 4 Apr 2019

Hi,
I'm selecting components for a new server for automated tests.
This issue contains very important information for me about hardware and the number of browsers running in parallel.
I want to prepare for the purchase of the new server, so any examples of how many browsers run on which hardware (number of cores/threads, amount of RAM) are welcome, and I will be very grateful for them.

Sorry that I only saw your question now @JarominP . We're currently running around 1000 headless chrome instances bare metal on a 5 year old 24 core 148 GB memory linux server (bear in mind this is for testing a very resource intensive SPA). The OP issue here should not be a limiting factor in docker anymore either since the framebuffer isn't needed when running chrome headless in a docker.

Before Chrome headless was a thing, we had to give up running it in Docker because of these limitations. Afterwards we did not have enough incentive to switch back.

Menthalion on 4 Apr 2019

What worked for me is running hundreds of headless browsers with multiple tabs, with an extension that manages the session per tab (cookies, proxy, navigation). Spawning a new tab in an existing browser is less memory intensive.

traffisco on 4 Apr 2019

Sorry that I only saw your question now @JarominP . We're currently running around 1000 headless chrome instances bare metal on a 5 year old 24 core 148 GB memory linux server (bear in mind this is for testing a very resource intensive SPA). The OP issue here should not be a limiting factor in docker anymore either since the framebuffer isn't needed when running chrome headless in a docker.

Hi Menthalion,
can you tell me how you managed this? At the moment I am trying to run nearly the same number of Chrome instances, but after about 600 the process stops and closes all browsers.
I use the Selenium hub and Chrome nodes as Docker images, configured with docker-compose like this.

version: '3'
services:
  selenium-hub:
    image: selenium/hub
    container_name: local-selenium-hub
    ports:
      - "4444:4444"
    environment:
      - GRID_BROWSER_TIMEOUT=120000
      - GRID_TIMEOUT=120000
      - GRID_CLEAN_UP_CYCLE=60000
      - GRID_MAX_SESSION=50
      - GRID_MAX_INSTANCES=3
      - JVM_OPTS=-Xmx10g
      - GRID_JETTY_MAX_THREADS=1500
    shm_size: 10g

  chrome:
    image: selenium/node-chrome:3.14.0-krypton
    environment:
      - HUB_PORT_4444_TCP_ADDR=local-selenium-hub
      - HUB_PORT_4444_TCP_PORT=4444
      - NODE_BROWSER_NAME=chrome
      - NODE_MAX_INSTANCES=50
      - NODE_MAX_SESSION=50
      - NODE_SELENIUM_PROTOCOL=WebDriver
      - JVM_OPTS=-Xmx24g
    ports:
      - "5900"
    depends_on:
      - selenium-hub
    # --privileged is a docker run flag and is not valid inside a volume entry;
    # the compose equivalent is the privileged key
    privileged: true
    volumes:
      - /dev/shm:/dev/shm
    shm_size: 10g

And then:

  docker-compose up -d
  docker-compose scale chrome=25

cseeegraef on 27 Jun 2019

I have around 10 Oracle Linux 7 servers with 8 cores and 16 GB RAM each. What is the optimal setup so that my scripts run at a good speed?
As of now I am using 1 hub and 15 nodes with one instance on each server. Am I doing this right, or is there a better way to do it? I am using a shm-size of 2 GB for each node.

avinash10993 on 18 Jul 2019

Hi all, I just went through all the comments again, and to be honest, these Docker images are not designed to run a large number of browsers inside them. We actually recommend running just one browser instance per container.

I understand different people have different approaches, but we believe containers should not be used in this way.

Having said that, I will close this issue, since different approaches and comments have been shared, but it is clear that there is no single formula that will work for all (since each browser will use different resources depending on the website it is loading). Hopefully, all these comments serve as a knowledge base for others who try to achieve the same.

diemol on 26 Mar 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.