Expected: the build succeeds and the docs are published.
Actual: the build succeeds, but the docs are not published, or an error somewhere is not reported. The build status is
Error
There was a problem with Read the Docs while building your documentation. Please try again later.
but no errors are reported in the logs.
The build log ends with:
The HTML pages are in _build/html.
Updating searchtools for Read the Docs search...
/home/docs/checkouts/readthedocs.org/user_builds/jupyterlab/checkouts/latest/docs/source/conf.py:281: RemovedInSphinx40Warning: The app.add_stylesheet() is deprecated. Please use app.add_css_file() instead.
app.add_stylesheet('custom.css') # may also be an URL
Command time: 739s Return: 0
This started failing after an update to how the project builds docs:
"Use add_css_file instead": https://github.com/jupyterlab/jupyterlab/blob/master/docs/source/conf.py
But that's just a deprecation warning, so I'm not sure why the build would fail on it.
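For what it's worth, the warning itself is easy to address; here is a minimal sketch (not the exact jupyterlab conf.py) that prefers the newer API and falls back on older Sphinx versions:

    # docs/source/conf.py (sketch): use the non-deprecated API when available.
    def setup(app):
        if hasattr(app, "add_css_file"):   # Sphinx >= 1.8
            app.add_css_file("custom.css")
        else:                              # older Sphinx
            app.add_stylesheet("custom.css")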
It seems it was a temporary error. It failed when trying to clean up the Docker container used to build the documentation.
I just triggered a new build for latest: https://readthedocs.org/projects/jupyterlab/builds/12146392/ (it may fail because it's using a smaller builder).
Also, it could be that you are hitting the max time allowed to build (your build reports 1311 seconds). So, increasing this time may help here as well. Let's see.
@minrk I think I know what happened. Were you using pip to create the virtualenv before, and are you using conda now? I guess that was the case. I'm moving your project to a bigger builder so it doesn't time out or fail because of memory usage.
I triggered another build, https://readthedocs.org/projects/jupyterlab/builds/12146492/ (this should succeed because it's running on a bigger server).
Please let me know if the following builds complete OK.
Hrm... The new build on the bigger server didn't succeed either. I will need to take a deeper look at this because I'm not sure what's happening here, and the logs don't tell me much.
OK. I triggered a new build for latest and jumped into the server that started the build. I think I found something there. CPU consumption is very high while node is running; however, there is not much memory usage at that point.
Besides, it seems that it's not one specific process being killed, but the whole VM instance. My SSH session was closed, I don't have access to the same VM anymore, and it no longer appears under the Azure instances. So, it seems it's being killed by Azure for some reason? :man_shrugging:
I locked the VM instance so it couldn't be touched by the autoscale rules, just in case, and it wasn't killed by Azure. While doing the build, I checked the memory used by node and found it was about ~40 MB at 100% CPU for ~650 seconds during each node step (there are 3).
Then the logs show that the Docker container used for building can't be removed (which our app handles), then ~5 minutes of a celery process at 100% CPU (I don't understand what it's doing), and finally we get a disconnection from Redis that causes the build to be marked as failed with that generic error:
Oct 26 13:56:46 buildlarge000B1E supervisord: build celery.worker.consumer.consumer:338[1617]: WARNING consumer: Connection to broker lost. Trying to re-establish the connection...
Oct 26 13:56:46 buildlarge000B1E supervisord: build Traceback (most recent call last):
Oct 26 13:56:46 buildlarge000B1E supervisord: build File "/home/docs/lib/python3.6/site-packages/redis/connection.py", line 177, in _read_from_socket
Oct 26 13:56:46 buildlarge000B1E supervisord: build raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
Oct 26 13:56:46 buildlarge000B1E supervisord: build OSError: Connection closed by server.
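For reference, the kind of per-process CPU/memory check described above can be sketched with psutil (a hypothetical script, not the exact commands run on the server):

    # Sketch: watch CPU and RSS of any "node" processes during a build.
    import time
    import psutil

    def watch(name="node", interval=5, duration=600):
        end = time.time() + duration
        while time.time() < end:
            for proc in psutil.process_iter(["pid", "name"]):
                try:
                    if proc.info["name"] and name in proc.info["name"]:
                        cpu = proc.cpu_percent(interval=0.1)
                        rss = proc.memory_info().rss / (1024 * 1024)
                        print(f"{proc.info['pid']} {proc.info['name']}: {cpu:.0f}% CPU, {rss:.1f} MiB")
                except psutil.NoSuchProcess:
                    pass
            time.sleep(interval)

    if __name__ == "__main__":
        watch()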
I was able to reproduce this issue in my local RTD development instance. I see several API calls that fail for this project:
/api/v2/version/<pk>/
/api/v2/build/<pk>/
/api/v2/version/<pk>/ (again)
They retry over and over, finally fail with "Max retries exceeded", and the build is marked as failed with an unhandled exception.
These URLs are hit after the sphinx-build command has finished and the artifacts were uploaded to the blob storage (Azurite).
This issue is not present when building any other project locally.
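For context on that failure mode: "Max retries exceeded" is what urllib3/requests report once a capped retry policy gives up. A minimal sketch (not Read the Docs' actual API client; the endpoint, ID, and payload here are made up):

    # Sketch: a retry-limited call to an /api/v2/... endpoint that eventually
    # gives up with "Max retries exceeded".
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retries))

    try:
        # e.g. updating a Version after sphinx-build has finished
        session.patch(
            "http://localhost:8000/api/v2/version/1234/",
            json={"built": True},
            timeout=10,
        )
    except (requests.exceptions.RetryError, requests.exceptions.ConnectionError) as exc:
        # urllib3 reports "Max retries exceeded with url: ..." once the attempts
        # are exhausted; at this point the build gets marked as failed.
        print(f"giving up: {exc}")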
OK, thanks to @stsewd, who spotted this. I think we found the issue and we have a workaround for your case.
Your git repository has ~15k tags and ~30 branches. Read the Docs keeps those tags/branches in sync and creates a Version object for each of them. When RTD tries to sync all of those tags through the API, the request times out (we have a PR that will make this call async for exactly this reason: #7548), and the build hitting the API to update them fails.
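To illustrate the idea behind that async PR (a hypothetical sketch, not RTD's actual code; the task, view, and helper names are made up): instead of writing ~15k Version records inside the API request, the work is handed off to a background worker so the request returns before any timeout.

    # Sketch of the "make version sync async" idea using Celery (hypothetical names).
    from celery import Celery

    app = Celery("sketch", broker="redis://localhost:6379/0")  # broker URL is an assumption

    def create_or_update_version(project_pk, version):
        """Placeholder for the real database write (one Version row per tag/branch)."""
        pass

    @app.task
    def sync_versions(project_pk, versions):
        # Slow when `versions` has ~15k entries, but it now runs in a worker,
        # not inside the HTTP request/response cycle.
        for version in versions:
            create_or_update_version(project_pk, version)

    def api_sync_view(project_pk, versions):
        # The API view only enqueues the job and responds immediately.
        sync_versions.delay(project_pk, versions)
        return {"status": "accepted"}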
I checked your documentation and you have 2.3.x and 1.2.x versions enabled. Both are built from branches. So, I disabled the "sync of tags" for your repository. Let's see if that helps here.
Take into account that new tags won't appear in Read the Docs as versions now, and you won't be able to build them for the time being (at least until we deploy the async PR). If you find yourselves needing these tags, please ping us back and we will try re-enabling the sync of tags once the async PR is merged.
All of this does not explain why sometimes it failed with OOM, though.
It looks like the "sync of tags" workaround didn't fully solve the problem; we're still seeing failing builds that aren't due to OOM, e.g. https://readthedocs.org/projects/jupyterlab/builds/12211330/
Yeah... There is something weird happening with Docker, but I'm not sure what it is yet. Our code was already catching some exceptions for this case (remove_container when the build has finished), which wasn't super important and the build could continue, but now we are getting a ReadTimeout that fails the build as an uncaught exception (I opened a PR to catch this exception as well). However, I'm seeing similar issues in other projects when calling create_container, which we can't simply catch and ignore.
Let's see if the PR I opened helps as a workaround for now, but the root cause of the problem is still unknown to me. I guess that after an OOM issue on the servers the Docker daemon gets into a bad state (maybe the OS is killing it?). We will need to debug this more deeply, I'm sure.
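Roughly, the idea of that workaround (a sketch, not the actual Read the Docs code) is to treat removing the build container as best-effort instead of letting a slow Docker daemon fail the whole build with an uncaught ReadTimeout:

    # Sketch: best-effort container cleanup with docker-py.
    import docker
    import requests
    from docker.errors import APIError

    client = docker.APIClient(base_url="unix://var/run/docker.sock")

    def remove_build_container(container_id):
        try:
            client.remove_container(container_id, force=True)
        except (APIError, requests.exceptions.ReadTimeout) as exc:
            # Log and continue; the build artifacts have already been produced.
            print(f"Couldn't remove container {container_id}: {exc}")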
We just deployed #7618 and I triggered a build for jupyterlab. It passed. You can see the output at https://readthedocs.org/projects/jupyterlab/builds/12307459/. I'd be interested to hear how things keep going with your project.
I think we're good, thanks! The last four PR builds also passed.