Expected: the build succeeds and the docs are published.
Actual: the build succeeds, but the docs are not published, or an error somewhere is not reported. The build status is
Error
There was a problem with Read the Docs while building your documentation. Please try again later.
but no errors are reported in the logs.
The build log ends with:
The HTML pages are in _build/html.
Updating searchtools for Read the Docs search...
/home/docs/checkouts/readthedocs.org/user_builds/jupyterlab/checkouts/latest/docs/source/conf.py:281: RemovedInSphinx40Warning: The app.add_stylesheet() is deprecated. Please use app.add_css_file() instead.
app.add_stylesheet('custom.css') # may also be an URL
Command time: 739s Return: 0
This started failing after an update to how the project builds docs:
"Use add_css_file instead": https://github.com/jupyterlab/jupyterlab/blob/master/docs/source/conf.py
But that's just a deprecation warning, so I'm not sure why the build would fail on it.
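For what it's worth, the warning itself is easy to address; here is a minimal sketch (not the exact jupyterlab conf.py) that prefers the newer API and falls back on older Sphinx versions:

    # docs/source/conf.py (sketch): use the non-deprecated API when available.
    def setup(app):
        if hasattr(app, "add_css_file"):   # Sphinx >= 1.8
            app.add_css_file("custom.css")
        else:                              # older Sphinx
            app.add_stylesheet("custom.css")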
It seems it was a temporary error. It failed when trying to clean up the Docker container used to build the documentation.
I just triggered a new build for latest: https://readthedocs.org/projects/jupyterlab/builds/12146392/ (it may fail because it's using a smaller builder).
Also, it could be that you are hitting the max time allowed to build (your build reports 1311 seconds). So, increasing this time may help here as well. Let's see.
@minrk I think I know what happened. Were you using pip to create the virtualenv before, and are you using conda now? I guess that was the case. I'm moving your project to a bigger builder so it doesn't time out or fail because of memory usage.
I triggered another build, https://readthedocs.org/projects/jupyterlab/builds/12146492/ (this should succeed because it's running on a bigger server).
Please let me know if the following builds complete OK.
Hrm... The new build on the bigger server didn't succeed either. I will need to take a deeper look at this because I'm not sure what's happening here, and the logs don't tell me much.
OK. I triggered a new build for latest and jumped into the server that started the build. I think I found something there. CPU consumption is very high while node is running; however, there is not much memory usage at that point.
Besides, it seems that it's not one specific process being killed, but the whole VM instance. My SSH session was closed, I don't have access to the same VM anymore, and it no longer appears under the Azure instances. So, it seems it's being killed by Azure for some reason? :man_shrugging:
I locked the VM instance so it couldn't be touched by the autoscale rules, just in case, and it wasn't killed by Azure. While doing the build, I checked the memory used by node and found it was about ~40 MB at 100% CPU for ~650 seconds during each node step (there are 3).
Then the logs show that the Docker container used for building can't be removed (which our app handles), then ~5 minutes of a celery process at 100% CPU (I don't understand what it's doing), and finally we get a disconnection from Redis that causes the build to be marked as failed with that generic error:
Oct 26 13:56:46 buildlarge000B1E supervisord: build celery.worker.consumer.consumer:338[1617]: WARNING consumer: Connection to broker lost. Trying to re-establish the connection...
Oct 26 13:56:46 buildlarge000B1E supervisord: build Traceback (most recent call last):
Oct 26 13:56:46 buildlarge000B1E supervisord: build File "/home/docs/lib/python3.6/site-packages/redis/connection.py", line 177, in _read_from_socket
Oct 26 13:56:46 buildlarge000B1E supervisord: build raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
Oct 26 13:56:46 buildlarge000B1E supervisord: build OSError: Connection closed by server.
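For reference, the kind of per-process CPU/memory check described above can be sketched with psutil (a hypothetical script, not the exact commands run on the server):

    # Sketch: watch CPU and RSS of any "node" processes during a build.
    import time
    import psutil

    def watch(name="node", interval=5, duration=600):
        end = time.time() + duration
        while time.time() < end:
            for proc in psutil.process_iter(["pid", "name"]):
                try:
                    if proc.info["name"] and name in proc.info["name"]:
                        cpu = proc.cpu_percent(interval=0.1)
                        rss = proc.memory_info().rss / (1024 * 1024)
                        print(f"{proc.info['pid']} {proc.info['name']}: {cpu:.0f}% CPU, {rss:.1f} MiB")
                except psutil.NoSuchProcess:
                    pass
            time.sleep(interval)

    if __name__ == "__main__":
        watch()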
I was able to reproduce this issue in my local RTD development instance. I see several API calls that fail for this project:
/api/v2/version/<pk>/
/api/v2/build/<pk>/
/api/v2/version/<pk>/ (again)
They retry over and over, finally fail with "Max retries exceeded", and the build is marked as failed with an unhandled exception.
These URLs are hit after the sphinx-build command has finished and the artifacts were uploaded to the blob storage (Azurite).
This issue is not present when building any other project locally.
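For context on that failure mode: "Max retries exceeded" is what urllib3/requests report once a capped retry policy gives up. A minimal sketch (not Read the Docs' actual API client; the endpoint, ID, and payload here are made up):

    # Sketch: a retry-limited call to an /api/v2/... endpoint that eventually
    # gives up with "Max retries exceeded".
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retries))

    try:
        # e.g. updating a Version after sphinx-build has finished
        session.patch(
            "http://localhost:8000/api/v2/version/1234/",
            json={"built": True},
            timeout=10,
        )
    except (requests.exceptions.RetryError, requests.exceptions.ConnectionError) as exc:
        # urllib3 reports "Max retries exceeded with url: ..." once the attempts
        # are exhausted; at this point the build gets marked as failed.
        print(f"giving up: {exc}")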
OK, thanks to @stsewd, who spotted this. I think we found the issue and we have a workaround for your case.
Your git repository has ~15k tags and ~30 branches. Read the Docs keeps those tags/branches in sync and creates a Version object for each of them. When RTD tries to sync all of those tags through the API, the request times out (we have a PR that will make this call async for exactly this reason: #7548), and the build hitting the API to update them fails.
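To illustrate the idea behind that async PR (a hypothetical sketch, not RTD's actual code; the task, view, and helper names are made up): instead of writing ~15k Version records inside the API request, the work is handed off to a background worker so the request returns before any timeout.

    # Sketch of the "make version sync async" idea using Celery (hypothetical names).
    from celery import Celery

    app = Celery("sketch", broker="redis://localhost:6379/0")  # broker URL is an assumption

    def create_or_update_version(project_pk, version):
        """Placeholder for the real database write (one Version row per tag/branch)."""
        pass

    @app.task
    def sync_versions(project_pk, versions):
        # Slow when `versions` has ~15k entries, but it now runs in a worker,
        # not inside the HTTP request/response cycle.
        for version in versions:
            create_or_update_version(project_pk, version)

    def api_sync_view(project_pk, versions):
        # The API view only enqueues the job and responds immediately.
        sync_versions.delay(project_pk, versions)
        return {"status": "accepted"}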
I checked your documentation and you have 2.3.x and 1.2.x versions enabled. Both are built from branches. So, I disabled the "sync of tags" for your repository. Let's see if that helps here.
Take into account that new tags won't appear in Read the Docs as versions now, and you won't be able to build them for the time being (at least until we deploy the async PR). If you find yourselves needing these tags, please ping us back and we will try re-enabling the sync of tags once the async PR is merged.
All of this does not explain why sometimes it failed with OOM, though.
It looks like the "sync of tags" workaround didn't fully solve the problem; we're still seeing failing builds that aren't due to OOM, e.g. https://readthedocs.org/projects/jupyterlab/builds/12211330/
Yeah... There is something weird happening with Docker, but I'm not sure what it is yet. Our code was already catching some exceptions for this case (remove_container when the build has finished), which wasn't super important and the build could continue, but now we are getting a ReadTimeout that fails the build as an uncaught exception (I opened a PR to catch this exception as well). However, I'm seeing similar issues in other projects when calling create_container, which we can't simply catch and ignore.
Let's see if the PR I opened helps as a workaround for now, but the root cause of the problem is still unknown to me. I guess that after an OOM issue on the servers the Docker daemon gets into a bad state (maybe the OS is killing it?). We will need to debug this more deeply, I'm sure.
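Roughly, the idea of that workaround (a sketch, not the actual Read the Docs code) is to treat removing the build container as best-effort instead of letting a slow Docker daemon fail the whole build with an uncaught ReadTimeout:

    # Sketch: best-effort container cleanup with docker-py.
    import docker
    import requests
    from docker.errors import APIError

    client = docker.APIClient(base_url="unix://var/run/docker.sock")

    def remove_build_container(container_id):
        try:
            client.remove_container(container_id, force=True)
        except (APIError, requests.exceptions.ReadTimeout) as exc:
            # Log and continue; the build artifacts have already been produced.
            print(f"Couldn't remove container {container_id}: {exc}")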
We just deployed #7618 and I triggered a build for jupyterlab. It passed. You can see the output at https://readthedocs.org/projects/jupyterlab/builds/12307459/. I'd be interested to hear how things keep going with your project.
I think we're good, thanks! The last four PR builds also passed.