Weblate: Importing large translation files

Created on 26 Oct 2018 · 35 Comments · Source: WeblateOrg/weblate

Describe the bug
I have a large collection of translations, currently 22 files (one per language) in .PO format. Each language has around 9,000 to 64,000 keys, and only one language (set as the master language) has all keys translated. When I start the import, it fails with a 504 Gateway Time-out error page after a long time (more than half an hour).

To Reproduce
Steps to reproduce the behavior:

  1. Create project and component
  2. Set source code repository & repository push url to accessible git repo
  3. Set File mask to "locale/*.po"
  4. Set Monolingual base language file to "locale/en.po"
  5. Set Base file for new translations to "locale/en.po"
  6. Leave all other settings as default.
  7. Wait a long time for error to happen :)

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Server configuration and status

* Weblate 3.2.2
 * Python 3.5.3
 * Django 2.1.2
 * Celery 4.2.1
 * celery-batches 0.2
 * six 1.10.0
 * social-auth-core 1.7.0
 * social-auth-app-django 2.1.0
 * django-appconf 1.0.2
 * translate-toolkit 2.3.1
 * Whoosh 2.7.4
 * defusedxml 0.5.0
 * Git 2.11.0
 * Pillow 4.0.0
 * python-dateutil 2.5.3
 * lxml 3.7.1
 * django-crispy-forms 1.7.2
 * django_compressor 2.2
 * djangorestframework 3.9.0
 * user-agents 1.1.0
 * jellyfish 0.6.1
 * pytz 2018.5
 * pyuca 1.2
 * PyYAML 3.12
 * tesserocr 2.3.1
 * Mercurial 4.0
 * git-svn 2.11.0
 * Database backends: django.db.backends.postgresql
 * Cache backends: avatar:FileBasedCache, default:RedisCache
 * Platform: Linux 4.15.0-36-generic (x86_64)
SystemCheckError: System check identified some issues:

CRITICALS:
?: (weblate.E003) Can not send email ([Errno -2] Name or service not known), please check EMAIL_* settings.
    HINT: https://docs.weblate.org/en/weblate-3.2.2/admin/install.html#out-mail

WARNINGS:
?: (security.W004) You have not set a value for the SECURE_HSTS_SECONDS setting. If your entire site is served only over SSL, you may want to consider setting a value and enabling HTTP Strict Transport Security. Be sure to read the documentation first; enabling HSTS carelessly can cause serious, irreversible problems.
?: (security.W008) Your SECURE_SSL_REDIRECT setting is not set to True. Unless your site should be available over both SSL and non-SSL connections, you may want to either set this setting True or configure a load balancer or reverse-proxy server to redirect all connections to HTTPS.
?: (security.W012) SESSION_COOKIE_SECURE is not set to True. Using a secure-only session cookie makes it more difficult for network traffic sniffers to hijack user sessions.
?: (security.W018) You should not have DEBUG set to True in deployment.

System check identified 5 issues (0 silenced).

Additional context
Running from docker container.

enhancement

All 35 comments

Maybe increase client_max_body_size in the nginx conf? Or something related to uwsgi.

There is a one-hour timeout in nginx, which should be good enough:

https://github.com/WeblateOrg/docker/blob/master/weblate.nginx.conf#L32

Do you have an additional reverse proxy in front of the docker container, e.g. https-portal? That might introduce another, shorter timeout leading to this.
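The reasoning here is that the client sees a 504 as soon as the shortest timeout anywhere in the proxy chain expires. A minimal sketch of that idea, where only the one-hour nginx value comes from this thread and the outer proxy with its 30-minute limit is purely hypothetical:

```python
def first_timeout(layers):
    """Return the (name, seconds) pair of the layer whose timeout expires first."""
    return min(layers.items(), key=lambda item: item[1])


# 3600 s is the nginx timeout mentioned in this thread. The outer proxy and
# its 1800 s limit are hypothetical, picked to resemble the reported
# "more than half an hour" failure.
layers = {
    "nginx in the Weblate container": 3600,
    "hypothetical outer reverse proxy": 1800,
}

name, seconds = first_timeout(layers)
print(f"{name} gives up first, after {seconds // 60} minutes")
```

If the observed failure time is well below the smallest configured value, another layer that is not yet in the list is probably involved.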

Hello there, I work with Ales on this project. Many thanks for the rapid answers. Unfortunately I don't know all the details, but he will be on vacation next week, so I'd like to give feedback to keep this moving.

Currently he is using just a local deployment on his laptop, so I _think_ no reverse proxy etc. is involved.

In addition to locating and increasing the timeout:

  1. How can we speed up the import process? Do you have an idea where the current bottleneck could be? Maybe the git interface or the web client interface? Could we, for instance, do an import without the web client, and would that help?
  2. Can we split the import into smaller chunks? Such as importing base language first, and then other languages one by one?

Sorry if that's too many questions in one pile. Thanks again.

The initial import will always be a bit slow, but there are some ways to improve it, see our docs: https://docs.weblate.org/en/latest/admin/projects.html#import-speed

Also there are some performance improvements coming for next Weblate release....
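One setting relevant to import speed is Django's DEBUG, which the system check in the report flags as enabled (security.W018). With DEBUG on, Django keeps a record of every SQL query it runs, which adds overhead to a long import. A minimal settings fragment, assuming you can edit the Django settings directly:

```python
# settings.py fragment (sketch): disable Django debug mode in deployment.
# With DEBUG = True, Django stores every executed SQL query in memory,
# which slows down and bloats long-running imports.
DEBUG = False
```

In the Docker setup this is assumed to be controlled through the container environment (a WEBLATE_DEBUG variable) rather than by editing settings.py; check the image documentation for your version.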

Thank you for the hints, and again for the speedy response. We'll look into them.

@nijel first of all, thanks for the hints - I did turn off debug mode and it is already faster.

It is still a bit slow, but at least I got a few languages in before the timeout, and I can trigger importing the rest with the _Repository maintenance > Repository tools > Reset_ option. I will probably need to play around with caching, but for now I think it is OK for testing.

Now I have another question. Since it is related to this one, I did not open a new ticket.

The interface now shows all languages as 100% translated, even though that is not true:
screenshot from 2018-11-06 10-06-37

Note the number of strings in the "source" language and in another language that is not fully translated:
screenshot from 2018-11-06 10-07-04

And here is one language that is not fully translated:
screenshot from 2018-11-06 10-07-26

Is this related to the import not finishing because of the timeout? I have tried the same scenario with much smaller (test) files, and there the status of the translated languages is shown correctly.

Oh, another thing - I am also getting a lot of errors:

ERROR project/component/hu: duplicate string to translate: project/component - Hungarian, string 53435 ('Some translation')

Hungarian is only an example here; I am getting this for other languages as well. What does this error actually mean? I am 100% sure I do not have duplicate keys inside one language, but obviously keys repeat across different languages.

Is this error actually related to the source language? There I do have a lot of keys that are the same as their value, but again no duplicate keys. From my perspective, this should not be an error, right?

I have found out that this is one of the reasons my import is very slow. Am I doing something wrong here?

It should appear only if there are actually duplicate keys. Can you please post a snippet of the source file showing one unit? It might be related to https://github.com/WeblateOrg/weblate/issues/1680

Sure, this is the en.po file, set as the source language:

#. Some comment for translator
msgid "Change password failed"
msgstr "Change password failed"

And the same string in another language (e.g. German):

#. Some comment for translator
msgid "Change password failed"
msgstr "Fehler bei Passwortänderung"

This msgid is always unique within each language, so it appears only once in each translation file.

Could it also be caused by the comment?

The comment should not have any effect here. Are you sure the msgid is not duplicated? Otherwise I don't see how this error could happen (but I might be missing something).

Yes, when I search the actual .po files for a duplicate reported by Weblate, there is only one occurrence per language.

Now, I just got an idea: it might be because of the initial problem. I got a timeout when importing a language, so this language was not imported fully, and when I restarted the import, there were some leftovers from this language which were reported as duplicates.

So I will try to import the translations with the command line tool, not via the web browser. Maybe this will fix all the issues above, since I will not hit the nginx limit.

Will report :)

And thanks for the hints.

The additional import should not be an issue; it just keeps track of duplicates parsed from a single file. Can you share the English file with me? You can send it to [email protected]

Thanks. As I said, I will try with command line import and we'll see what that will bring. Otherwise I will contact you via email with some data.

Will report on my progress and thanks for your help!

The duplicates will definitely not disappear by using the command line :-)
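Since the duplicate check works per file, it can help to count keys in a single .po file mechanically rather than searching by hand. Below is a minimal standard-library sketch, not Weblate's own check. It assumes a key is the pair (msgctxt, msgid), compares the raw strings without decoding escapes, and all function names are hypothetical:

```python
from collections import Counter


def _unquote(token):
    """Strip the surrounding double quotes from one PO string fragment."""
    token = token.strip()
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]
    return ""


def po_keys(po_text):
    """Yield a (msgctxt, msgid) pair for every entry in a PO document."""
    ctxt, msgid, field = None, None, None
    for raw in po_text.splitlines():
        line = raw.strip()
        if line.startswith("msgctxt "):
            ctxt, field = _unquote(line[len("msgctxt "):]), "ctxt"
        elif line.startswith("msgid "):
            msgid, field = _unquote(line[len("msgid "):]), "id"
        elif line.startswith('"') and field == "ctxt":
            ctxt += _unquote(line)  # continuation line of a multi-line msgctxt
        elif line.startswith('"') and field == "id":
            msgid += _unquote(line)  # continuation line of a multi-line msgid
        elif line.startswith("msgstr"):
            if msgid is not None:
                yield (ctxt, msgid)
            ctxt, msgid, field = None, None, None
        else:
            field = None  # comments, blank lines, msgid_plural, obsolete entries


def find_duplicates(po_text):
    """Return {(msgctxt, msgid): count} for keys that occur more than once."""
    counts = Counter(po_keys(po_text))
    return {key: n for key, n in counts.items() if n > 1}


sample = '''#. Some comment for translator
msgid "Change password failed"
msgstr "Change password failed"

#. Another comment
msgid "Change password failed"
msgstr "Something else"
'''
print(find_duplicates(sample))
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the text to find_duplicates.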

@nijel OK, so the results of my tests are as follows:

  • when importing via the command line, things worked and the import was also a bit faster. It still took around 2 hours to complete.
  • the problem mentioned above, where all languages show 100% translation, remains. Do you have any suggestion as to what could be wrong here?
  • regarding duplicates, I will contact you by email.

@nijel so we cleared up the issue with duplicates (it was a mistake on my side; I actually had some duplicates).

But what about the issue of all languages showing 100% completeness mentioned in the comment above? I tried a few times (every time on a clean instance); sometimes it showed the correct values, and sometimes it did not. What am I doing wrong? I am using the following command to import into Weblate:

docker-compose exec --user weblate weblate weblate import_project --base-file-template=%s/en.po --name-template='test-component' test-project [email protected]:something/test-project.git master '**/*.po'

where all the .po files are located in the ./locale/ folder.

I'd try adding --file-format po-mono, but I'm not sure if that changes anything.

OK, will try it out. Thanks for this!

@nijel so, I've tried it, with mixed results. After importing a component there are some cases where I'm getting this error:

INFO project'my-component: updating completed
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7f87baca6ea0>
Traceback (most recent call last):
  File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7f87baca6ea0>
Traceback (most recent call last):
  File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable

Components with this error at the end of the import log show 100% completeness, even though they shouldn't, as mentioned above.

Maybe this could help.

That is a bug in Python, see https://bugs.python.org/issue29519. It should be harmless.

Well, then I don't know what the cause of this behaviour is.

But in any case, really appreciate your help. Thanks!

I will try to import test files you've sent me, but I still didn't get to that...

@nijel I think I have found a pattern for when this happens: if the app throws any exception during import (including the one mentioned above), the percentage always shows 100%.

How I reproduced this error:

  1. I imported one component (with the import_project command), which showed the correct percentage of translated strings per language; there were no errors during import.
  2. In the repository, I did some work outside of Weblate. I did not change translations, but I changed some other things. I also changed the git history (with git rebase ... --interactive and then force-pushing to the git repo).
  3. Under this component I triggered _Repository Maintenance > Pull_.
  4. This triggered some exceptions (see below).
  5. The web UI showed a successful update.
  6. When navigating back to the list of all languages for this component, all languages show 100%.
weblate_1   | [2018-11-28 09:16:53,578: ERROR/ForkPoolWorker-8] Error: error('Error -3 while decompressing data: incorrect header check',)
weblate_1   | Traceback (most recent call last):
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/weblate/trans/search.py", line 109, in update_index
weblate_1   |     self.update_source_unit_index(writer, unit)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/weblate/trans/search.py", line 84, in update_source_unit_index
weblate_1   |     location=force_text(unit['location']),
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 1255, in update_document
weblate_1   |     IndexWriter.update_document(self, **fields)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 490, in update_document
weblate_1   |     self.add_document(**fields)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 1251, in add_document
weblate_1   |     self.commit()
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 1229, in commit
weblate_1   |     self.writer.commit(**self.commitargs)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 922, in commit
weblate_1   |     finalsegments = self._merge_segments(mergetype, optimize, merge)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 827, in _merge_segments
weblate_1   |     return mergetype(self, self.segments)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 101, in MERGE_SMALL
weblate_1   |     writer.add_reader(reader)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 709, in add_reader
weblate_1   |     docmap = self.write_per_doc(fieldnames, reader)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 678, in write_per_doc
weblate_1   |     for docnum, stored in reader.iter_docs():
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/codec/base.py", line 419, in iter_docs
weblate_1   |     yield docnum, self.stored_fields(docnum)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/codec/whoosh3.py", line 495, in stored_fields
weblate_1   |     v = reader[docnum]
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/columns.py", line 1272, in __getitem__
weblate_1   |     v = self._child[docnum]
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/columns.py", line 869, in __getitem__
weblate_1   |     v = self._decompress(v)
weblate_1   | zlib.error: Error -3 while decompressing data: incorrect header check
weblate_1   | 
weblate_1   | During handling of the above exception, another exception occurred:
weblate_1   | 
weblate_1   | Traceback (most recent call last):
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/celery_batches/__init__.py", line 148, in apply_batches_task
weblate_1   |     result = task(*args)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 642, in __protected_call__
weblate_1   |     return orig(self, *args, **kwargs)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/celery/app/task.py", line 375, in __call__
weblate_1   |     return self.run(*args, **kwargs)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/weblate/trans/search.py", line 261, in update_fulltext
weblate_1   |     fulltext.update_index(unitdata)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/weblate/trans/search.py", line 109, in update_index
weblate_1   |     self.update_source_unit_index(writer, unit)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 1181, in __exit__
weblate_1   |     self.close()
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 1217, in close
weblate_1   |     self.commit(restart=False)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 1229, in commit
weblate_1   |     self.writer.commit(**self.commitargs)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 922, in commit
weblate_1   |     finalsegments = self._merge_segments(mergetype, optimize, merge)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 827, in _merge_segments
weblate_1   |     return mergetype(self, self.segments)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 101, in MERGE_SMALL
weblate_1   |     writer.add_reader(reader)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 709, in add_reader
weblate_1   |     docmap = self.write_per_doc(fieldnames, reader)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/writing.py", line 678, in write_per_doc
weblate_1   |     for docnum, stored in reader.iter_docs():
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/codec/base.py", line 419, in iter_docs
weblate_1   |     yield docnum, self.stored_fields(docnum)
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/codec/whoosh3.py", line 495, in stored_fields
weblate_1   |     v = reader[docnum]
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/columns.py", line 1272, in __getitem__
weblate_1   |     v = self._child[docnum]
weblate_1   |   File "/usr/local/lib/python3.5/dist-packages/whoosh/columns.py", line 869, in __getitem__
weblate_1   |     v = self._decompress(v)
weblate_1   | zlib.error: Error -3 while decompressing data: incorrect header check

The error is from the fulltext search, which is executed in a separate Celery task, so it should not influence anything else. Did you get any other errors as well?

Anyway, the index looks corrupted somehow; try rebuilding it from scratch using manage.py rebuild_index --all --clean and see if that helps.

In my tests, this happens when the loading of the translations does not complete....

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi @nijel

> In my tests, this happens when the loading of the translations does not complete....

Thank you very much for executing the tests in your environment. We're currently busy with bringing up our server instance, hence the delay in responses. But this is actually a critical issue for us, since translators cannot translate new messages into their language when this bug hits.

I have some questions:

  1. Why does the loading of translations not complete? Something like a timeout, or maybe another kind of exception?
  2. How does this behavior lead to having fewer messages in the language currently loaded than in the source language (English in our case)? @alesrosina and I discussed this a bit, and our hypothesis is that a step at the end of loading adds messages which are present in the source language but missing from the loaded language file. Naturally, interrupted loading would mean this final step is not executed. Is this hypothesis somewhere near the truth? :-)
  3. Can you give us a pointer to the related code part? I suspect it's somewhere around https://github.com/WeblateOrg/weblate/blob/master/weblate/trans/models/component.py#L1603

Many thanks, have a nice Sunday.

The code you've referenced is used for starting a new translation within Weblate; the code for loading is here:

https://github.com/WeblateOrg/weblate/blob/abc65d14e30639b8cfbd1954d236a390768a8a2a/weblate/trans/models/component.py#L1020-L1130

It doesn't complete because of wsgi or web server timeouts (if executed from the web) or if interrupted on the command line.

It always loads the base language first and then tries to find matching units in the actual translation, so it works slightly differently from what you expect.
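The base-first order described here can be illustrated with a simplified sketch. This is not the code from component.py; the function names and the dict-based data model are hypothetical and only show the idea that every base unit exists in each language and is merely looked up in the translation file:

```python
def load_translation(base_units, translation_units):
    """Build the unit list for one language, driven by the base language.

    Both arguments map a key (e.g. msgid) to text. Every base key produces
    a unit; it counts as translated only if the translation file has a
    non-empty entry for the same key.
    """
    loaded = []
    for key, source in base_units.items():
        target = translation_units.get(key, "")
        loaded.append(
            {"key": key, "source": source, "target": target, "translated": bool(target)}
        )
    return loaded


def translated_percent(units):
    """Return the share of translated units as a percentage."""
    if not units:
        return 0.0
    return 100.0 * sum(unit["translated"] for unit in units) / len(units)


base = {"a": "Alpha", "b": "Beta", "c": "Gamma", "d": "Delta"}
german = {"a": "Alfa"}

units = load_translation(base, german)
print(len(units), translated_percent(units))
```

Under this model a partially translated language still has as many units as the base language, so a language showing fewer strings than the source suggests that loading did not run to completion.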

Thank you for the very quick response.

AFAIK, Ales is executing the loading via the command line, and probably not interrupting it, at least not willingly.

I think we should next try to match our logs against this code. I understand from our discussion that when loading fails, we should expect the line 'updating completed' to be missing from our logs.
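Checking the logs for that line can be scripted. The sketch below assumes the log format of the single line quoted earlier in this thread ("INFO <component>: updating completed"); other Weblate versions may log differently, and the component names in the example are hypothetical:

```python
import re

# Pattern derived from the one log line quoted in this thread; treat it as
# an assumption about the format, not a documented guarantee.
COMPLETED = re.compile(r"INFO (\S+): updating completed")


def completed_components(log_lines):
    """Return the set of component names that logged 'updating completed'."""
    found = set()
    for line in log_lines:
        match = COMPLETED.search(line)
        if match:
            found.add(match.group(1))
    return found


def missing_completion(expected, log_lines):
    """Return the expected components that never logged completion, sorted."""
    return sorted(set(expected) - completed_components(log_lines))


log = [
    "INFO project/component-a: updating completed",
    "ERROR project/component-b: something failed",
]
print(missing_completion(["project/component-a", "project/component-b"], log))
```

Components returned by missing_completion are the ones worth comparing against the loading code linked above.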

If it doesn't complete from the command line, it really should end with some exception indicating an error.

Hi,
Can we limit the size of the imported files in Weblate to avoid this kind of issue?

We have other issues needing attention too. If it were up to me, I'd suggest "won't fix" for now, maybe later.

I believe there is still a lot of room for performance improvements, which is preferable to limiting file size. On the test files @alesrosina provided, I was able to achieve an almost 30% time reduction with 370013feb87f7a21fa23ccc5477388251f46b541 and ba49d39363b2c6b78aa8cdfa786f634822cb12f1.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Closing this as fixed in 3.4 as there were numerous improvements. If it still doesn't perform well, we might reopen this issue again.
