TL;DR: Please make the "choose your toot language" happen.
I've been browsing the issue list, and it seems the team chose to go with CLD2 for language detection… and closed issue #691.
But CLD2 detects only around 80 languages, while there are more than 7,000 out there. On top of that, it can be wrong.
On the other hand, I totally agree that most people speak one of the top 50 languages spoken on Earth.
But there are still some weirdos who use regional dialects or conlangs. I'm one of them. For the record, I toot in French, English, toki pona and Kotava. I've also seen quite a lot of Esperanto and medieval French.
So I'm asking the team to consider a button allowing mastonauts to force the language on a per-toot basis, possibly with a default language and a CLD2 fallback if nothing is specified.
Thank you
Looks like my toots are being correctly assigned about 68% of the time? https://docs.google.com/spreadsheets/d/1BexKpvslEWedQSdhCH4Htlm0ZTxVbRjG449mgsBvs24/edit?usp=sharing (Edit: It's up to 79%!)
I would love this feature!
I think the language count is a bit higher than that, and we're using CLD3 now ... but I think your general point still stands, and we're never going to have 100% perfect coverage on this.
I think switching the filter to be "opt out" instead of "opt in" was an improvement ... because now people are only selecting things not to see, and we should only be tagging things we are confident about, so that should reduce the number of false positives.
I have no objection to the eventual inclusion of a UI here, but I have a strong preference for trying to handle as many cases as we can w/out one first.
There's a recently merged commit, not yet running anywhere, that removes usernames and hashtags from text before language identification, which should help improve things. I'd like to let that get out there and see what it does to general identification.
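To make the idea concrete, here is a rough Python sketch of stripping mentions and hashtags before handing text to a detector. The regular expressions and the function name `strip_for_detection` are assumptions for illustration, not what the actual commit does.

```python
import re

# Hypothetical patterns: the real commit may match mentions and hashtags differently.
MENTION_RE = re.compile(r"@\w+(?:@[\w.-]+)?")  # @user or @user@instance.tld
HASHTAG_RE = re.compile(r"#\w+")


def strip_for_detection(text: str) -> str:
    """Remove mentions and hashtags, then collapse the leftover whitespace."""
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return " ".join(text.split())
```

Only the cleaned text would go to the detector; the toot itself stays unchanged.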
For what it's worth, the current behavior already is "if the detector can't find the language, fall back on the user locale, and if they don't have one, fall back on the instance locale". I think that's what you are proposing towards the end there...?
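As a minimal sketch of that fallback order (in Python for illustration; `resolve_language` and the `detect` callable are hypothetical stand-ins, not Mastodon's code):

```python
from typing import Callable, Optional


def resolve_language(
    text: str,
    detect: Callable[[str], Optional[str]],  # stand-in for the detector; returns None when unsure
    user_locale: Optional[str],
    instance_locale: str,
) -> str:
    """Detector result first, then the user locale, then the instance locale."""
    detected = detect(text)
    if detected is not None:
        return detected
    if user_locale:
        return user_locale
    return instance_locale
```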
I believe automatic language selection is here to make Mastodon as user-friendly as possible, and I totally agree with this idea, since it's what helps software gain adoption.
The current behavior isn't bad at all in that respect. I just happen to be an exotic case.
My use case is the following:
– I'm tooting in a weird language (Kotava, ISO 639-3: avk; toki pona, ISO 639-3: mis; etc.)
– CLD fails to detect what language I'm using (not very surprising here)
– Mastodon goes for my locale (French) or my instance locale (I think it's also French)
– The wrong language ends up assigned to the toot.
On top of that, I think CLD would probably fail on slang and on toots with a lot of grammatical and spelling mistakes, unless you are using a huge set of data to train the neural network.
So here is a quick mockup of the least intrusive way of doing things that I can think of:

Assuming that in my user preferences:
– I said I'm French (so by default, if CLD doesn't understand, my toots are tagged French)
– I said I'm also tooting in English, toki pona and Kotava
On hovering over the toot button, Mastodon would show a dropdown allowing me to easily force the toot language. If I don't use it and just click the first button, CLD will try to determine what language I'm using and we go down the normal path.
I believe that exotic language users probably wouldn't mind going to the settings to set a few things up. And for regular users, CLD will just work as expected.
One last note: I'm wondering what data CLD3 uses to determine what language we're tooting in. Since it's a Google product and privacy matters to some Mastodon users, I hope it doesn't send toot content to Google. Also, the CLD3 repo says it's intended to work in Chrome; here again, I hope this won't leave non-Chrome users out.
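In code, the proposal might look roughly like the Python sketch below. Everything here is hypothetical (`LanguagePrefs`, `language_for_toot`, the `detect` callable); it only illustrates "forced choice wins, otherwise detection, otherwise my default language".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LanguagePrefs:
    """Hypothetical per-user settings: a default language plus other declared ones."""
    default: str
    extra: list[str] = field(default_factory=list)

    def dropdown_options(self) -> list[str]:
        # Default language first, then the other declared languages, without duplicates.
        options: list[str] = []
        for code in [self.default, *self.extra]:
            if code not in options:
                options.append(code)
        return options


def language_for_toot(
    text: str,
    prefs: LanguagePrefs,
    detect: Callable[[str], Optional[str]],  # stand-in for CLD; None when it cannot tell
    forced: Optional[str] = None,
) -> str:
    """A language picked in the dropdown wins; otherwise detect, then use the default."""
    if forced is not None:
        return forced
    return detect(text) or prefs.default
```

With `LanguagePrefs("fr", ["en", "avk"])`, picking "avk" in the dropdown tags the toot as Kotava no matter what the detector would have said.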
Thank you for reading
With your dropdown, that requires each language to have a different word for "toot", and it also requires the user to remember which translation of "toot" belongs to which language. So as much as I like it, I think it would cause problems!
Well, that's just an example; it could contain the ISO code or even the language name.
But if I'm fluent enough in a language to toot in it… I can probably remember which translation of « toot » corresponds to which language.
Also, we could add a flag or an icon for some of them.
And maybe, though I think this feature would be rather hard to implement, the toot language could be updated from the same menu that allows unfolding or deleting the toot.
if I'm fluent enough in a language to toot in it… I can probably remember which translation of « toot » corresponds to which language.
This may be true for you, but it is not true for everyone. I think you would be surprised at how much of a struggle it would be. You are a person who is really into language(s), and your mind is very much tuned to that, right? It's different for people for whom language is not a special interest - being multi-lingual can cause aphasia, and aphasia is a symptom of various mental illnesses and developmental disorders.
it could contain the iso code or even the language name.
Using whatever most people will recognise seems most sensible. I liked @Lomplac's mock-up a lot - the language code is small, and when you click it, the menu contains both the language code and the language's full name.

This may be true for you, but it is not true for everyone.
You're right… okay.
I initially liked @Lomplac's idea; that's what I had in mind when I created the issue.
But on second thought, it adds a new button or a new field to the UI. This might be confusing for most users, who don't need this feature.
So I came up with my mockup… well, we can put the ISO 639-3 code and the language name into the dropdown of the « TOOT! » button.
I could allow setting language directly through the API, but I wouldn't want to implement any dropdowns like this in the web UI.
I don't usually bother bringing this up with centralized services, but as decentralized services are attempting to be better, I'd like to raise it here. Spoken languages are not the only languages in our world. Would you please consider adding signed languages such as American Sign Language, British Sign Language, etc.?
These would be mostly useful as a tool for discovery, versus changing the UI to reflect the language choice. American Sign Language is currently conveyed through video. Adding support for this early would aid in decentralized services being truly for the people, and not just the spoken language parts of the world. Thanks.
@Iteratix If this was implemented, it would probably support ASL/BSL given that they're already valid BCP47 tags. (with the language tag of 'sgn-ase')
Neat. I'm just getting up to speed myself on how this kind of thing is implemented.
As someone somewhat fluent in 3 languages, I totally get the need for this.
... If the user indicates that they toot in more than one language, add a language dropdown (with language codes) to the toot button. The last language tooted in would move to the default spot.
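A tiny Python sketch of that behavior, with hypothetical function names, assuming the user's languages are kept as an ordered list whose first entry is the default:

```python
def needs_dropdown(languages: list[str]) -> bool:
    """Show the dropdown only to users who declared more than one language."""
    return len(languages) > 1


def promote_language(languages: list[str], used: str) -> list[str]:
    """Move the language just tooted in to the front (the default spot)."""
    return [used] + [code for code in languages if code != used]
```

For example, after a user with `["fr", "en", "avk"]` toots in English, the list becomes `["en", "fr", "avk"]`.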
CLD isn't going to be able to cover media types such as images, videos, etc.
I understand that the W3C standard supports multiple translations. I think the way this works on Facebook, for example (a default language plus multiple translations), would be very desirable. I've always thought its absence is one of Twitter's biggest flaws:
https://www.w3.org/TR/activitystreams-core/#naturalLanguageValues
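For reference, the linked section lets an object carry a `contentMap` keyed by BCP47 language tags alongside (or instead of) a plain `content` string. Below is a small Python sketch of such an object and of picking a translation for a reader; the helper `pick_content` and the sample text are made up for illustration.

```python
from typing import Optional

# A Note carrying two translations in "contentMap", keyed by language tag.
note = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Note",
    "contentMap": {
        "fr": "Bonjour à tous",
        "en": "Hello everyone",
    },
}


def pick_content(obj: dict, preferred: list[str]) -> Optional[str]:
    """Return the first translation matching the reader's preferences.

    Falls back to the plain "content" value (or None) when nothing matches.
    """
    content_map = obj.get("contentMap", {})
    for tag in preferred:
        if tag in content_map:
            return content_map[tag]
    return obj.get("content")
```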