I checked all the boxes in preferences to remove every language except English from my feeds. However, non-English toots still show up. I went back and verified that all boxes except English were checked. Here is a copy of one of the toots:
"غمگین مثل کرم ابریشم درون پیله ایی که اسهال داره...!"
That pasted in the wrong direction but it should help anyway.
Another:
"好吧,星期一就是這麼魔性,就是這麼忙。"
Me too.
Maybe they don't set up their language correctly.
I hope we can solve it by separating the language for viewing from the language for tooting.
We are using a library named CLD3 to detect language.
"غمگین مثل کرم ابریشم درون پیله ایی که اسهال داره...!"
The detection result is ps (Pashto). We have ar (Arabic) and fa (Persian) in the checkboxes, but not ps. However, we should probably reconsider the settings interface to allow all languages anyway, because CLD3 detects a lot of languages.
"好吧,星期一就是這麼魔性,就是這麼忙。"
Oh, we have zh-cn, zh-hk and zh-tw in the checkboxes, but the result from CLD3 is zh. So none of the Chinese options will work correctly.
Also, we have checkboxes for he, io, oc and pt-br, but those don't seem to be used by CLD3.
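To illustrate the mismatch described here, this is a minimal Python sketch, not Mastodon's actual code. It assumes the filter compares a detected code against a checkbox option; the function names `primary_subtag` and `matches_option` are hypothetical. An exact string comparison of "zh" against "zh-tw" fails, while comparing only the primary language subtag matches:

```python
def primary_subtag(tag: str) -> str:
    """Return the lowercase primary language subtag of a tag like 'zh-TW' or 'pt_BR'."""
    return tag.replace("_", "-").split("-")[0].lower()


def matches_option(detected: str, option: str) -> bool:
    """Compare a detected code with a checkbox option on the primary subtag only."""
    return primary_subtag(detected) == primary_subtag(option)


# Exact comparison misses the match; primary-subtag comparison finds it.
exact = "zh" == "zh-tw"
loose = matches_option("zh", "zh-tw")
```

The trade-off is that all regional options for one language collapse into a single filter, which may be acceptable if the detector never reports the region anyway.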
Looks like cld3 doesn't even support Hebrew. My bad - it does, the language code is iw.
We need to map our own locale identifiers (fa, he) to all the possible variants cld3 can return that are a subset of that locale (fa -> [fa, fas, pe], he -> [he, iw], etc). I think that's the reasonable short term solution. What do you think @mjankowski ?
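A rough Python sketch of that mapping idea, for illustration only. The table name `DETECTOR_ALIASES` and both helper functions are hypothetical, and only the fa and he entries come from the comment above; any other entries would have to be checked against what cld3 actually returns.

```python
# Hypothetical alias table: our locale identifier -> codes the detector may return.
# Only the "fa" and "he" entries are taken from the discussion; extend with care.
DETECTOR_ALIASES = {
    "fa": {"fa", "fas", "pe"},
    "he": {"he", "iw"},
}


def codes_for_locale(locale: str) -> set[str]:
    """Return every detector code treated as equivalent to `locale`."""
    return DETECTOR_ALIASES.get(locale, {locale})


def is_filtered(detected_code: str, filtered_locales: list[str]) -> bool:
    """True if the detected code falls under any locale the user filtered out."""
    return any(detected_code in codes_for_locale(loc) for loc in filtered_locales)
```

With this table, `is_filtered("iw", ["he"])` is true, while `is_filtered("ps", ["ar", "fa"])` is false, so a toot detected as ps would still get through unless ps is added somewhere.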
Yeah, I think the core problems are:
Some things we could do here:
en_* locales, and to the top level as well. I think the first two things on this list are probably both easier to do, and might get us more mileage before we decide if we need to do the last one as well.
@Gargron We fixed some language codes in #4841, but many language codes still can't be filtered. We have 28 filter choices, but cld3 detects about 100 language codes...
Yeah we should replace the current settings with two "show only" or "exclude" settings, which have autocompleted text inputs.
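One way the "show only" and "exclude" settings could behave, as a hedged Python sketch. The function `is_visible` and its parameters are hypothetical, and showing toots whose language could not be detected is an assumption made for the example, not something decided in this thread.

```python
from typing import Optional


def is_visible(detected: Optional[str], show_only: set[str], exclude: set[str]) -> bool:
    """Decide whether a toot with the given detected language code is shown."""
    if detected is None:
        return True  # assumption: keep toots whose language could not be detected
    if detected in exclude:
        return False
    # An empty "show only" set means no allow-list restriction.
    return not show_only or detected in show_only
```

Because both sets hold free-form codes rather than a fixed list of checkboxes, any code the detector returns can be filtered without a new option being added to the interface.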
Just a couple of things here:
(a) First up, we're not talking about languages here, we're talking about scripts or orthographies ("ja-Latn" and "ja" are the same language but different scripts; "en" and "ru-Latn" are different languages but the same script; all four are different orthographies). (Yes, I'm aware that cld3 conflates these systematically, but see https://tools.ietf.org/html/rfc5646 which explains things in great detail.)
(b) With the relatively small amount of text in a toot, it's unlikely that en_US and en_GB are going to be reliably differentiated. My recollection is that this can be done at the length of a news-wire article but not at significantly shorter lengths. More generally, see https://en.wikipedia.org/wiki/Mutual_intelligibility for a list of problematic language groups.
Why was this closed? This is still a prevalent issue.
I agree with Harmon here. This issue is very much still active, and I don't understand why this topic was closed. I've tried both options for my language filters (selecting all boxes except English, and then selecting only English), and I'm still flooded with posts in other languages. It's getting ridiculous, because if I don't understand the freaking post, then I consider it spam, so right now, Mastodon is packed full of useless spam.