Mastodon: Foreign language still showing in my feed after checking all the boxes to remove all but English.

Created on 12 Jun 2017 · 11 Comments · Source: tootsuite/mastodon

I checked all the boxes in preferences to remove all the languages that show up in my feeds except English. However, there is still some language that shows up that is not English. I went back and verified that all boxes except English were checked. Here is a copy of one of the toots:

"غمگین مثل کرم ابریشم درون پیله ایی که اسهال داره...!"

That pasted in the wrong direction but it should help anyway.

ibadukefan

All 11 comments

Another:

"好吧，星期一就是這麼魔性，就是這麼忙。"

ibadukefan on 12 Jun 2017

Me too.
Maybe they didn't set up their language correctly.
I hope this can be solved by separating the language used for viewing from the language used for tooting.

matyapiro31 on 12 Jun 2017

We are using a library named CLD3 to detect language.

"غمگین مثل کرم ابریشم درون پیله ایی که اسهال داره...!"

The detection result is ps (Pashto). We have ar (Arabic) and fa (Persian) in the checkboxes, but we don't have ps. In any case, we should probably reconsider the settings interface so that it allows all languages, because CLD3 detects a lot of them.

"好吧，星期一就是這麼魔性，就是這麼忙。"

Oh, we have zh-cn, zh-hk and zh-tw in the checkboxes, but the result from CLD3 is zh. So none of the Chinese options will work correctly.

Also, we have checkboxes for he, io, oc and pt-br, but those don't seem to be used by CLD3.
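To make the mismatch concrete, here is a minimal sketch. It assumes the filter compares the detected code against the checked codes by exact string match, which is an assumption for illustration and not confirmed Mastodon behaviour. The code sets are small illustrative subsets built from the codes named in this thread, and `is_hidden` is a hypothetical helper.

```python
from typing import Set

# Illustrative subset of the checkbox codes mentioned in this thread.
CHECKBOX_CODES: Set[str] = {"en", "ar", "fa", "he", "io", "oc", "pt-br", "zh-hk", "zh-tw"}


def is_hidden(detected: str, hidden: Set[str]) -> bool:
    """Exact string comparison (assumed, for illustration only)."""
    return detected in hidden


# The user hides every language except English.
hidden = CHECKBOX_CODES - {"en"}

# Codes the detector was reported to return in this thread, plus "fa" for contrast.
for detected in ("ps", "zh", "iw", "fa"):
    print(detected, is_hidden(detected, hidden))
```

Under that assumption, only `fa` is hidden. `ps`, `zh` and `iw` pass through because no checkbox carries exactly that code.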

unarist on 12 Jun 2017

~~Looks like cld3 doesn't even support Hebrew.~~ My bad - it does, the language code is iw.

happycoloredbanana on 12 Jun 2017

We need to map our own locale identifiers (fa, he) to all the possible variants cld3 can return that are a subset of that locale (fa -> [fa, fas, pe], he -> [he, iw], etc). I think that's a reasonable short-term solution. What do you think, @mjankowski?
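A minimal sketch of such a mapping, in Python for brevity. The names `LOCALE_ALIASES`, `expand_filter` and `is_filtered` are hypothetical, and the table contains only the two entries given in the comment above; a real table would need to be checked against the codes cld3 actually returns.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical alias table: locale identifier -> detector codes it should cover.
# Only these two entries come from the comment above.
LOCALE_ALIASES: Dict[str, FrozenSet[str]] = {
    "fa": frozenset({"fa", "fas", "pe"}),
    "he": frozenset({"he", "iw"}),
}


def expand_filter(selected: Set[str]) -> Set[str]:
    """Expand each selected locale into every detector code it should match."""
    expanded: Set[str] = set()
    for locale in selected:
        # Locales without an alias entry only match themselves.
        expanded |= LOCALE_ALIASES.get(locale, frozenset({locale}))
    return expanded


def is_filtered(detected_code: str, selected: Set[str]) -> bool:
    """True if a post with this detected code should be filtered."""
    return detected_code.lower() in expand_filter(selected)


print(is_filtered("iw", {"he"}))  # Hebrew reported under the older code
print(is_filtered("ps", {"fa"}))  # Pashto is not an alias of Persian
```

With this table, selecting he also catches iw, while the Pashto example from the original report would still need its own entry.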

Gargron on 12 Jun 2017

Yeah, I think the core problems are:

- We are borrowing the set of locales which the UI happens to be translated into, and using that as the list of languages which people can select to filter. This was a decent starting point I guess, but it clearly doesn't cover every scenario.
- Even where we have an overlap for a language generally speaking, we won't cover every single combination of top-level language and local variant.

Some things we could do here:

1. We could change our options for what can be blocked and limit them to just the top-level language part ... so you could choose to block "en", but not more narrowly block "en_US" vs "en_GB" or something ... and if we did that, we would apply a block of "en" to ALL the en_* locales, and to the top level as well.
2. We could change the list of languages people are allowed to select from to be a list which more accurately captures what CLD3 is going to return (again, probably just the "parent" languages from this list, not every local dialect/variation) ... and use that list, instead of using the list of i18n translations we have for the UI.
3. We could maintain some sort of mapping of aliases, so that a person choosing to block just one language could be reasonably mapped onto the multiple language strings that CLD3 might return which mean the same thing, even though each is a different string value.

I think the first two things on this list are probably both easier to do, and might get us more mileage before we decide if we need to do the last one as well.
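The first option could be sketched roughly as follows, in Python for brevity. `primary_language` and `is_blocked` are hypothetical names, and the sketch assumes tags are written either as `en_US` or as `en-US`.

```python
from typing import Iterable


def primary_language(tag: str) -> str:
    """Return the top-level language part of a tag, e.g. 'en_US' -> 'en'."""
    return tag.replace("_", "-").split("-", 1)[0].lower()


def is_blocked(detected_tag: str, blocked_tags: Iterable[str]) -> bool:
    """Compare top-level languages only, so a block on 'en' covers every en_* variant."""
    blocked = {primary_language(tag) for tag in blocked_tags}
    return primary_language(detected_tag) in blocked


print(is_blocked("en_US", ["en"]))  # variant is covered by the top-level block
print(is_blocked("zh", ["zh-TW"]))  # a regional choice collapses to plain 'zh'
print(is_blocked("fr", ["en"]))
```

The trade-off is that a regional choice such as zh-TW then blocks every zh post, which fits the idea of only offering top-level languages in the settings.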

mjankowski on 12 Jun 2017

@Gargron We have fixed some language codes in #4841, but many language codes still can't be filtered. We have 28 filter choices, but cld3 detects about 100 language codes...

unarist on 14 Sep 2017

Yeah we should replace the current settings with two "show only" or "exclude" settings, which have autocompleted text inputs.

nightpool on 21 Sep 2017

👍5

Just a couple of things here:
(a) First up, we're not talking about languages here, we're talking about scripts or orthographies ("ja-Latn" and "ja" are the same language but different scripts; "en" and "ru-Latn" are different languages but the same script; all four are different orthographies). (Yes, I'm aware that cld3 conflates these systematically, but see https://tools.ietf.org/html/rfc5646 which explains things in great detail.)
(b) With the relatively small amount of text in a toot, it's unlikely that en_US and en_GB are going to be reliably differentiated. My recollection is that this can be done at the length of a news-wire article but not on significantly shorter text. More generally, see https://en.wikipedia.org/wiki/Mutual_intelligibility for a list of problematic language groups.
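To illustrate point (a), here is a deliberately simplified sketch of splitting a tag into language, script and region subtags. It is not a full RFC 5646 parser (extended language subtags, variants, extensions and private use are ignored), and `Tag` and `parse_tag` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> Tag:
    """Split a tag such as 'ja-Latn' or 'en_GB' into its main subtags (simplified)."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script: Optional[str] = None
    region: Optional[str] = None
    for part in parts[1:]:
        # Script subtags are four letters and come before any region subtag.
        if script is None and region is None and len(part) == 4 and part.isalpha():
            script = part.title()
        # Region subtags are two letters or three digits.
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            region = part.upper()
    return Tag(language, script, region)


ja, ja_latn, ru_latn = parse_tag("ja"), parse_tag("ja-Latn"), parse_tag("ru-Latn")
print(ja.language == ja_latn.language)    # same language
print(ja_latn.script == ru_latn.script)   # same script, different languages
print(parse_tag("en_GB"))
```

A filter keyed on the language subtag alone would treat "ja" and "ja-Latn" alike, while one keyed on script would group "ja-Latn" with "ru-Latn"; which of the two a reader wants is exactly the distinction raised in (a).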

stuartyeates on 2 Mar 2018

Why was this closed? This is still a prevalent issue.

Harmon758 on 22 Jul 2018

I agree with Harmon here. This issue is very much still active, and I don't understand why it was closed. I've tried both options for my language filters (selecting all boxes except English, and then selecting only English), and yet I'm still flooded with posts in other languages. It's getting ridiculous, because if I don't understand the freaking post, then I consider it spam, so right now Mastodon is packed full of useless spam.