Weblate: Intermittent E019 during normal usage

Created on 3 Dec 2018  路  12Comments  路  Source: WeblateOrg/weblate

Describe the bug
Every now and then the red exclamation mark lights up in the header, and forwards me to a Performance Report that indicates weblate.E019 | The Celery does not process tasks or is too slow in processing them.

After checking various things with Celery (such as celery -A weblate status or celery -A weblate inspect active etc.) and Flower, tasks are being run normally; but usually about 3 tasks run at the same time (that is: social-auth-cleanup, configuration-health-check and commit-pending every 3600 seconds in a default configuration).
Tasks run just fine, Celery logs show no errors.

To Reproduce
Steps to reproduce the behavior:

  1. Configure Celery with default settings: https://github.com/WeblateOrg/weblate/blob/8ea68d529241181da7037bfa4d9c9fe9e25eb097/examples/celery-weblate.conf#L1-L5
  2. Let Weblate sit there and run for a few hours
  3. Check back with Performance Report every now and then
  4. Optionally dismiss the message and wait another few hours

Expected behavior
No errors.

Server configuration and status

$ ./manage.py list_versions
 * Weblate weblate-3.3
 * Python 3.5.3
 * Django 2.1.3
 * Celery 4.2.1
 * celery-batches 0.2
 * six 1.11.0
 * social-auth-core 2.0.0
 * social-auth-app-django 3.1.0
 * django-appconf 1.0.2
 * translate-toolkit 2.3.1
 * Whoosh 2.7.4
 * defusedxml 0.5.0
 * Git 2.11.0
 * Pillow 5.3.0
 * python-dateutil 2.7.5
 * lxml 4.2.5
 * django-crispy-forms 1.7.2
 * django_compressor 2.2
 * djangorestframework 3.9.0
 * user-agents 1.1.0
 * jellyfish 0.6.1
 * pytz 2018.7
 * pyuca 1.2
 * python-bidi 0.4.0
 * PyYAML 3.12
 * Database backends: django.db.backends.mysql
 * Cache backends: default:RedisCache
 * Celery: redis://localhost:6379, N/A, regular
 * Platform: Linux 4.9.0-6-amd64 (x86_64)

$ ./manage.py check --deploy
System check identified some issues:

WARNINGS:
?: (security.W004) You have not set a value for the SECURE_HSTS_SECONDS setting. If your entire site is served only over SSL, you may want to consider setting a value and enabling HTTP Strict Transport Security. Be sure to read the documentation first; enabling HSTS carelessly can cause serious, irreversible problems.

System check identified 1 issue (0 silenced).

Additional context
With default configuration, Celery only runs with one node w1 (unless the commented out suggestion of using 3 nodes w1 w2 w3 is used in /etc/default/celery-weblate) and will likely always fail this test since chances are that the health check running the ping will not immediately execute and fail check.

I _guess_ using more nodes (at least 3 if billing is not used, then probably at least 4) could fix this, but I don't think I should be throwing more hardware/software/processes at the problem...

And permanently ignoring the report isn't something we should want here either, since it _should_ notify me when Celery really isn't responding too well. Unless theres a different error if the nodes are down and completely unavailable.

documentation

Most helpful comment

I've updated the example in d7a5e6cd02 to better cover this.

All 12 comments

It can occassionally appear on the initial import which pushes quite a lot of fulltext updates, but it really should not be there for normal usage. On Hosted Weblate there is just one node and it works fine except when importing many translations.

I see it pop up every few hours...so I'm not quite sure what to do about it.
Switched to 2 nodes before creating this issue, had it once by then, and not ever since...

According to the Celery logs from that time, it seems to occur when commits/pushes happen (which are on the same network and generally fast), but not for no-op checks or fetches (with no local changes that need to be committed).

How many (v)CPUs does the machine running Weblate has? AFAIK Celery starts process per CPU.

It's set to 1 vCPU atm, because Weblate used to be fairly light on resources (compared to other machines running on the same host). We generally try to allocate as few resources as possible for those machines, since the hosts are a few years old and moderately filled with VMs.
With 2 workers, I got 5 Celery processes running right now (although service celery-weblate status says Tasks: 6 (limit: 36864)), and the work seems to be distributed normally.

Whats kinda odd is that w1 received the health-check, but there is nothing in /var/log/celery/weblate-w1-1.log for that time (at least nothing at INFO or above).
/var/log/celery/weblate-w1.log has this:

[2018-12-03 15:56:38,129: INFO/Beat] Scheduler: Sending due task configuration-health-check (weblate.wladmin.tasks.configuration_health_check)
[2018-12-03 15:56:38,132: INFO/Beat] Scheduler: Sending due task commit-pending (weblate.trans.tasks.commit_pending)
[2018-12-03 15:56:38,135: INFO/MainProcess] Received task: weblate.wladmin.tasks.configuration_health_check[63ca0a38-3a35-4837-a845-59864bf60697]
[2018-12-03 15:56:38,151: INFO/Beat] Scheduler: Sending due task social-auth-cleanup (weblate.accounts.tasks.cleanup_social_auth)
[2018-12-03 15:56:38,316: INFO/MainProcess] Received task: weblate.utils.tasks.ping[4e9f492e-cfec-40db-8e1c-16c3de3a514e]

So the ping is handled by the same worker apparently (ie. the one that is currently waiting for the ping to come back?)

Also, while poking around settings I noticed that https://github.com/WeblateOrg/weblate/blob/8ea68d529241181da7037bfa4d9c9fe9e25eb097/examples/celery-weblate.conf#L27 isn't used celery-weblate.service (so setting it to a higher number didn't do anything; assuming it _(c|s)hould_ affect the number of workers per node)

Well, since I changed the number of works to two, I had none of those happen. Might be coincidence, dunno.

I've reduced the number of nodes back down to 1, but added --concurrency=2 to CELERYD_OPTS to see if it makes a difference.

Upping the CPU count is the last thing I'd like to do if I can avoid it.

The CELERY_WORKER_RUNNING is just a boolean flag, to turn on specific error handling code in Celery daemon which should not be enabled when running as uwsgi:

https://github.com/WeblateOrg/weblate/blob/1fd73c0dee376993c657840bb345402b1da2a678/weblate/celery.py#L58-L88

As for the ping task, one of the worker processes should log it's completion if logging set to INFO or higher:

[2018-12-04 08:14:47,171: INFO/ForkPoolWorker-2] Task weblate.utils.tasks.ping[0f54b7aa-2d85-4e2f-8b11-3e6c16947105] succeeded in 0.00478047499928s: None

Instead of adding more workers, it's better to just have two worker processes in this case by specifying --concurrency 2.

I've updated the example in d7a5e6cd02 to better cover this.

I don't have a single log entry containing the word "succeeded"; not for ping and not for others. Should this be concerning?

I'm not really sure - the worker logging sometimes doesn't behave as I'd expect (seen that in https://github.com/WeblateOrg/weblate/issues/2423), so it's most likely a logging issues rather than not processing tasks. Anyway if you don't get error about Sentry from ./manage.py check --deploy then tasks are processed without any issues.

Yeah, thats what I've been seeing all the time - logs are rather quiet, but tasks work as expected.

I'll just see how the --concurrency=2 change fares and report back if it helps (or not).

Okay, I'll close the issue for now as that has been covered by improved docs and examples.

  • In case you see problem with the fix, please comment on this issue.
  • In case you see similar problem, please open separate issue.
  • If you are happy with the outcome, consider supporting Weblate by donating.

So far it seems to have quenched the intermittent issues, so, thanks! 馃憤

Was this page helpful?
0 / 5 - 0 ratings