Common-voice: Enable Hindi locale with "hi" code

Created on 7 May 2020 · 11 Comments · Source: mozilla/common-voice

This issue aims to unblock the situation with Hindi.

This is the process to make it possible, the same one we used for Portuguese:

  • [x] hi copied from hi-IN in the locales folder on voice-web
  • [x] Matjaz migrates metadata on contributor info to hi and the locale is added to Pontoon
  • [x] hi-IN removed in Pontoon
  • [x] Sync happens and Peiying verifies the changes
  • [x] hi added via import-locales on voice-web
  • [x] hi-IN removed in voice-web locales folder (and via import-locales)
  • [x] Sentence collector exports hi sentences
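The two folder steps in the checklist above (copy hi-IN to hi first, remove hi-IN only after the sync is verified) can be sketched roughly as follows. This is a minimal illustration, not the script the team actually ran: the helper names `copy_locale` and `remove_locale` are hypothetical, and it assumes each locale is simply a directory inside a single locales folder.

```python
import shutil
from pathlib import Path


def copy_locale(locales_dir: Path, old: str, new: str) -> Path:
    """Copy locales_dir/old to locales_dir/new, leaving the old folder in place."""
    src = locales_dir / old
    dst = locales_dir / new
    if not src.is_dir():
        raise FileNotFoundError(f"no such locale folder: {src}")
    if dst.exists():
        # Refuse to overwrite an existing locale instead of merging silently.
        raise FileExistsError(f"target locale already exists: {dst}")
    shutil.copytree(src, dst)
    return dst


def remove_locale(locales_dir: Path, code: str) -> None:
    """Delete a locale folder; meant to run only after the new locale is verified."""
    target = locales_dir / code
    if not target.is_dir():
        raise FileNotFoundError(f"no such locale folder: {target}")
    shutil.rmtree(target)
```

Keeping the copy and the removal as separate calls mirrors the checklist, where hi-IN stays available until the Pontoon sync has been checked.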

All 11 comments

Pontoon already has the "hi" locale. This PR is pending to add it to our JSON files via import-locales:

https://github.com/mozilla/voice-web/pull/2712

hi-IN is no longer on Pontoon and has been removed from this repo via 0006ea9ddd5ca5eecd0fc27406a2064c72af3792

@phirework all changes on our side are done, the site should now be loading /hi/

@MichaelKohler let's see if we can get "hi" sentences exported

Thanks everyone!

Pontoon part is complete with metadata migration.

I can verify that the export would now work. At the current stage, 59 sentences would be exported. However, I have reason to believe that adding most of the 29k sentences that are in Sentence Collector would be a copyright violation. Removing those would leave us with around 3 sentences to export.

Source mentioned in the records: Press Information Bureau, Govt of India https://pib.gov.in/indexd.aspx
https://pib.gov.in/content/102_2_Copyright-Policy.aspx seems to require attribution though, if I understand everything correctly. I'll hold off the export for now.

It may be possible to do per-sentence attribution soon but the sentences will have to be processed the way the Europarl sentences were treated, i.e. as a separate export with a separate QA process, instead of being part of the generic sentence-collector.txt. Either way, legal should probably be flagged on this.

@mbransn

I don't think we need legal's time on this if we can solve it ourselves. If the source is the one Michael mentioned, it is clearly noted that it's not public domain and we should remove the sentences, as we have done in the past with other sources we have identified as non-public domain.

Agreed & works for me. 👍

Deleted the copyright-infringing sentences and did an export: https://github.com/mozilla/voice-web/pull/2722. This leaves us with 3 exported Hindi sentences.
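For illustration only, removing sentences by their recorded source could look something like the sketch below. The record shape (a dict with `sentence` and `source` keys) and the function name `drop_by_source` are assumptions made for this example; they are not the Sentence Collector schema or the method actually used for the deletion.

```python
from typing import Dict, Iterable, List


def drop_by_source(
    records: Iterable[Dict[str, str]],
    blocked_markers: Iterable[str],
) -> List[Dict[str, str]]:
    """Keep only records whose source field contains none of the blocked markers."""
    markers = [m.lower() for m in blocked_markers]
    kept = []
    for record in records:
        # Hypothetical field name; a missing source is treated as an empty string.
        source = record.get("source", "").lower()
        if any(marker in source for marker in markers):
            continue  # source matches a blocked marker, so drop the sentence
        kept.append(record)
    return kept
```

Matching is a case-insensitive substring check, so a marker such as `pib.gov.in` would catch records that cite the site with or without the organisation name.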

Thanks @MichaelKohler.
