Shaka-player: ISO 3 support or better fallback for language switcher in UI

Created on 27 Dec 2018  路  17Comments  路  Source: google/shaka-player

Have you read the FAQ and checked for duplicate open issues?:
Yes

What version of Shaka Player are you using?:
Latest

Can you reproduce the issue with our latest release version?:
With shaka player UI

Can you reproduce the issue with the latest code from master?:
Yes

Are you using the demo app or your own custom app?:
Shake Player UI. So not the real demo, but an unmodified version of the shaka player ui demo.

If custom app, can you reproduce the issue using our demo app?:

What browser and OS are you using?:
Safari/Chome

What are the manifest and license server URIs?:

What did you do?
Added a language code to m3u8 file (ie. abr) only available in ISO 3.

What did you expect to happen?
Ideally, the player displays the language name in English. Alternatively, it falls back to the ISO 3 code.

What actually happened?
Shaka Player pops up text 'Unknown' in language switcher as it's not available in ISO2.

I'm going to have a large number of languages not available in ISO2, so I'll have a massive list of "unknown" languages with no way to distinguish between them.

Even if it just displays the ISO3 code as a fallback, it allows users to make the proper selection.

UI archived bug

Most helpful comment

Okay, sounds good. I'll file a separate issue to extract that info from HLS and use it when possible. In the mean time, we can at least improve the fallback and add the language code itself to the UI.

And I'll take your suggestion on switching the place of the code and "Unknown".

All 17 comments

I think "ISO 3" and "ISO 2" are a bit ambiguous. Are you referring to the three-character and two-character language codes (ISO-639)?

Also, you didn't provide an example URL as requested in the template. You gave steps you took to put the language codes into an m3u8, but without an example of the language, we may not be able to reproduce the issue exactly. Can you please give either a URL for sample content or some example languages you're using?

Thanks for the report!

@gdub01 if for some reason your situation is too unique for us to universally support, there is some guidance I can offer on how you can add support for other languages in the library and UI with relatively ease.

In the player, we normalize all language codes using a map in shaka.util.LanguageUtils. If you want to standardize your language codes, the conversions will need to be in there. We normalize language codes so that we can be as consistent as possible - this will be important later. In the case that we cannot normalize them, we use them as provided.

In the UI, we use display names from mozilla.LanguageMapping. The choice to use the native name was to make it more accessible to the native speaker - for those who English won't make sense (_Si je parle fran莽ais, "francais" signifie plus pour moi que "french"_). We take the normalized language from the player (this is why the normalization was important) and convert them to display names in the UI.

If you add your language codes to the map in shaka.util.LanguageUtils and display names to the map in mozilla.LanguageMapping, you should get the behaviour you are asking for.

@gdub01, does that help?

Thanks @joeyparrish and @vaage.

By ISO 639-2 I'm referring to this list: http://www.loc.gov/standards/iso639-2/php/code_list.php
And ISO 639-3 this list: https://iso639-3.sil.org/code_tables/639/data

ISO 639-3 includes ISO 639-2, but with more languages added. You can see on the ISO-3 link above where some of the languages in ISO-3 map to ISO-2 languages.

I believe ISO-2 languages all work with the UI, and I enjoy the behavior you have of using native language names in the UI by default.

The problem I'm hitting is using language codes from ISO-3 that are not in ISO-2. Currently the UI says 'unknown' for valid ISO-3 codes.

I completely understand that supporting native language names for all the extra language codes in ISO-3 will be difficult/impossible to support as I believe https://github.com/mozilla/language-mapping-list only supports ISO-2.

However, if an ISO-2 code isn't found, there's still a decent chance it's a valid code in ISO-3 as ISO-3 is significantly larger.

I'm suggesting that if an ISO-2 language isn't found, we either:

  1. pass through the unknown code directly in the UI.. so instead of saying "UNKNOWN" we say "AAE".
  2. look up the code in the ISO-3 table here https://iso639-3.sil.org/code_tables/639/data and display the english language name. ie. "Arb毛resh毛 Albanian"

Sorry for not leaving an example earlier. https://drive.google.com/open?id=1oU2iFI1-v6yHa24p8zJeetxvl19r8ogh <- this is an m3u8 playlist and videos.. I don't have it accessible from a live player.. sorry. So you'll have to replace this http://localhost:4000/videos/1_jf6107-0-0/m3u8/ in the playlist files with the path to where you save it to demo.

You can switch between 3 audio languages that all work as they're all valid ISO2 codes. However 1 of the 3 caption languages is ISO-3 valid (https://iso639-3.sil.org/code/arb), but not ISO-2 and therefore shows "Unknown".

I think that default behavior of saying Unknown isn't precisely accurate. They're valid and documented codes. The player itself works with the languages, just the UI's display makes them indistinguishable from one another.

screen shot 2019-01-05 at 10 57 24 am

We already support both two-letter and three-letter codes, and "afr" in your example should map to "af", which is already in the mozilla language mapping list we use. Perhaps we omitted a normalization step somewhere.

"Afr" is available in ISO2 and it worked to shorten it fine 馃憤
"Arb", however didn't. Arb is the one that's in ISO2 but not 3.

ISO 639-3 is a more detailed spec. It's not the difference between 2 and 3 letter codes as much as more languages. ISO 2 has about _200_. ISO 3 has _7776_ according to https://en.wikipedia.org/wiki/ISO_639-3.

https://www.w3.org/International/questions/qa-lang-2or3.en.html says http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry is now the preferred method of looking of language tags (using bcp 47). Both 'AAE' and 'ARB' can be found in that list (bcp 47) as well as ISO 639-3, but not in ISO2.

That list uses shortest language code principle which I believe is what you follow as well... so you will find 2 letter codes where possible, 3 where not possible.

There's a significantly higher quantity of languages than those denoted in ISO2. Saying 'unknown' makes those other languages (though valid BCP47/ISO3 codes) makes it unusable for speakers/readers of that language.

Ah, I understand the problem now.

Whenever possible, we normalize all ISO-639-2 three-letter codes to ISO-639-1 two-letter codes. When two-letter codes don't exist, we use the longer version. This is following BCP-47.

After consulting the text of BCP-47 again, I found that ISO-639-3 ("comprehensive") and ISO-649-5 ("family") codes are also permitted in BCP-47 syntax.

Our normalization process is functioning as intended. We normalize three-letter ISO-639-2 codes like "ara" (Arabic, generally) to ISO-639-1 codes like "ar", and then we look those up in the language map created by Mozilla for display in our UI. We do not, however, have support for ISO-639-3 codes like "arb" (Modern Arabic, specifically) as appear in your sample content.

We are not going to add mappings for those, for two main reasons:

  1. There are far too many of them (7776 entries as of 2012 according to Wikipedia), and it would balloon our code to try to either normalize or name them
  2. We already support regional variants for specificity ("ar-MA", for Moroccan Arabic, for example)

What we could reasonably improve, though, is two things:

  1. Take language codes like "ar-EG" (Egyptian Arabic), which are not in the language mappings of Mozilla, and retry their base forms like "ar" to get a valid display
  2. Take any language code which isn't exactly in Mozilla's name map and append the code itself

With these two changes, "ar-EG" would show up as "Arabic (ar-EG)" and "arb" would show up as "Unknown (arb)".

@gdub01, would this be a step in the right direction?

@joeyparrish lol posted at the same time. I think that is a step in the right direction, thanks!

I understand 7776 would be a little much since most wouldn't be used most of the time. 馃憤

I suggest we de-emphasize "unknown" and emphasize the code. So instead of "Unknown (arb)" we do "arb" and not mention unknown... or perhaps "Arb - unknown"

One more suggestion:
in the m3u8 playlist a name is provide for the audio and text files like:
LANGUAGE="arb",NAME="ARABIC"

Not sure if it would be too annoying to grab this as it appears it's not in mpd files. But if that's doable I think it would be useful.

Okay, sounds good. I'll file a separate issue to extract that info from HLS and use it when possible. In the mean time, we can at least improve the fallback and add the language code itself to the UI.

And I'll take your suggestion on switching the place of the code and "Unknown".

Actually, I'm not certain about switching code & unknown. What about the other case, in which we have a match for the base language? In the case of Egypt, would you prefer "Arabic (ar-EG)" or "ar-EG (Arabic)" ?

(And when will GitHub give us a "typing" indicator? :smile:)

Fair - ya shoot that's tough.
Ideally you'd want it alphabetical, so unknown first wrecks that.
But Arabic is better than ar-EG.

I think I'd say
"ar-EG Arabic"

I'll comment early so you can see I'm typing =)

"Afrikaans" <- known
"Arabic ar-EG" <- unknown, but gets fallback from region, so name is prepended from fallback, ar-EG is original
"Arb" <- unknown entirely, so just does original

Shoot updated that again. I think we can have both if we prepend.

Okay, I think then that we've settled on something very close to what I proposed originally. Aesthetically, I prefer to use parentheses to separate the name/unknown from the code, so after my change, we will have:

"Afrikaans" <- fully known
"Arabic (Morocco)" <- fully known, including region
"Arabic (ar-EG)" <- base language known, region not in the map
"Unknown (arb)" <- entirely unknown

"Afrikaans" <- fully known
"Arabic (Morocco)" <- fully known, including region
"Arabic (ar-EG)" <- base language known, region not in the map
"arb (Unknown)" <- entirely unknown

I modified the last option only because alphabetical sort gets thrown off if unknown is used first.

The downside being it breaks the pattern if the code is put last only in that case... the upside being it appears roughly alphabetically.

To be honest, I'm just making a guess that alphabetical sort is more desirable UX than grouping all unknowns together alphabetically. So probably splitting hairs =)

I'm okay with either option in the end.

I'm going to avoid a special case and just stick with "Unknown (arb)". There is also no sorting. It's the order of the streams in the manifest, so the content provider has some influence over the order.

Was this page helpful?
0 / 5 - 0 ratings