Common-voice: Enable Hindi locale with "hi" code

Created on 7 May 2020 · 11 Comments · Source: mozilla/common-voice

This issue aims to unblock the situation with Hindi.

This is the process to make it possible, the same one we used for Portuguese:

[x] hi copied from hi-IN in the locales folder on voice-web
[x] Matjaz migrates metadata on contributor info to hi and the locale is added to Pontoon
[x] hi-IN removed in Pontoon
[x] Sync happens and Peiying verifies the changes
[x] hi added via import-locales on voice-web
[x] hi-IN removed in voice-web locales folder (and via import-locales)
[x] Sentence collector exports hi sentences
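The first step of the checklist (creating the hi folder as a copy of hi-IN) could be sketched roughly as below. This is only an illustration, not the script used for the actual change: the function name `copy_locale` and the idea that each locale is a plain directory under one locales folder are assumptions, and the Pontoon and import-locales steps are not covered.

```python
import shutil
from pathlib import Path


def copy_locale(locales_dir: Path, old_code: str, new_code: str) -> Path:
    """Copy the folder for old_code to a new folder named new_code.

    Hypothetical helper: assumes every locale lives in its own
    directory directly under locales_dir.
    """
    src = locales_dir / old_code
    dst = locales_dir / new_code
    if not src.is_dir():
        raise FileNotFoundError(f"no locale folder at {src}")
    if dst.exists():
        # Refuse to overwrite, so an existing locale is never clobbered.
        raise FileExistsError(f"locale folder already exists at {dst}")
    # The old folder is left in place; the checklist removes it in a later step.
    shutil.copytree(src, dst)
    return dst
```

Keeping the old folder until the new locale is confirmed to work mirrors the order of the checklist, where hi-IN is removed only after hi has been added.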

claimed


nukeador

Most helpful comment

I don't think we need legal's time on this if we can solve it ourselves. If the source is the one Michael mentioned, it is clearly marked as not public domain and we should remove those sentences, as we have done in the past with other sources we identified as non-public domain.

nukeador on 12 May 2020

👍2

All 11 comments

The hi locale folder is now in the code via https://github.com/mozilla/voice-web/commit/9ea082c6ec5743daeee97cac1e9ac3431f68c25f

nukeador on 7 May 2020

Pontoon already has the "hi" locale. This PR, which adds it to our JSON files via import-locales, is still pending:

https://github.com/mozilla/voice-web/pull/2712

nukeador on 8 May 2020

hi-IN is no longer on Pontoon and has been removed from this repo via 0006ea9ddd5ca5eecd0fc27406a2064c72af3792

nukeador on 8 May 2020

@phirework all changes on our side are done, the site should now be loading /hi/

@MichaelKohler let's see if we can get "hi" sentences exported

Thanks everyone!

nukeador on 8 May 2020

Pontoon part is complete with metadata migration.

peiying2 on 8 May 2020

I can verify that the export would now work. At the current stage, 59 sentences would be exported. However, I have reason to believe that most of the 29k sentences in Sentence Collector would be a copyright violation if we added them. Removing those would leave us with around 3 sentences to export.

Source mentioned in the records: Press Information Bureau, Govt of India https://pib.gov.in/indexd.aspx
https://pib.gov.in/content/102_2_Copyright-Policy.aspx seems to require attribution though, if I understand everything correctly. I'll hold off the export for now.

MichaelKohler on 8 May 2020
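The filtering concern raised in the comment above (excluding sentences whose recorded source is not public domain) could be sketched as follows. This is a hedged illustration only: the record shape (dicts with `sentence` and `source` keys), the marker list, and the function names are assumptions, not Sentence Collector's actual data model or export code.

```python
from typing import Dict, Iterable, List

# Hypothetical markers for sources that must not be exported.
BLOCKED_SOURCE_MARKERS = ("press information bureau", "pib.gov.in")


def is_blocked(source: str) -> bool:
    """Return True if the recorded source mentions a blocked marker."""
    normalized = source.casefold()
    return any(marker in normalized for marker in BLOCKED_SOURCE_MARKERS)


def exportable(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose source is not blocked.

    A record with no "source" key is kept here; a stricter policy could
    reject it instead.
    """
    return [r for r in records if not is_blocked(r.get("source", ""))]
```

Matching on the recorded source string is only as reliable as the metadata contributors entered, so a manual review of the remaining sentences would still be needed.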

It may be possible to do per-sentence attribution soon, but the sentences would have to be processed the way the Europarl sentences were treated, i.e. as a separate export with a separate QA process, instead of being part of the generic sentence-collector.txt. Either way, legal should probably be flagged on this.

@mbransn

phirework on 8 May 2020

nukeador on 12 May 2020

👍2

Agreed & works for me. 👍

mbransn on 12 May 2020

Deleted the copyright-infringing sentences and did an export: https://github.com/mozilla/voice-web/pull/2722. This leaves us with 3 exported Hindi sentences.

MichaelKohler on 14 May 2020

👀1

Thanks @MichaelKohler.

mbransn on 14 May 2020
