Considering there is now a feature that allows users to filter timelines based on language, serious inaccuracy of post language detection should be considered a bug.
A couple examples just from my own recent posts:
"I'd rather not!" detected as Portuguese?
"a lot of people just want to feel righteous all the time and that's all that matters" detected as Norwegian?!
Mastodon's language detection is done through https://github.com/peterc/whatlanguage/ at the moment, which is very limited. It could be extended easily by adding more dictionaries, but that wouldn't really make the detection much better.
A better library would be something like https://github.com/richtr/guesslanguage.js , but that's JavaScript and it seems there's no Ruby equivalent of it yet.
Using external resources like Google or detectlanguage.com is of course not compatible with Mastodon's founding principles.
I very much prefer the idea of users being able to choose each toot's language(s) somehow, like the mock-ups in #691. Being able to choose a default language that you can change is important, because a lot of folks post in more than one language.
Language feature is not released yet, but is being tried on a few servers. It will be disabled if we're not confident in quality.
Helpful feedback here if you are on those instances:
Related: if we use a library with a confidence score, we could make language default to unknown, and only populate on high confidence
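A rough sketch of that idea, in case it helps. The result shape (a hash with `:code` and `:confidence`) and the 0.9 threshold are assumptions made up for illustration, not the API of any particular library:

```ruby
# Hypothetical detector result: { code: :en, confidence: 0.97 }.
# Returns the detected code only when the detector is confident enough,
# otherwise nil, which stands for "unknown" here.
def language_or_unknown(detection, min_confidence: 0.9)
  return nil if detection.nil?

  detection[:confidence] >= min_confidence ? detection[:code] : nil
end

p language_or_unknown({ code: :en, confidence: 0.97 })
p language_or_unknown({ code: :pt, confidence: 0.41 })
```

The threshold would need tuning against real posts. Set too high, most short toots would end up as unknown.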
Just did some testing with that WhatLanguage library currently used in Mastodon. It's hilariously bad:
require 'whatlanguage'
wl = WhatLanguage.new(:all)
wl.language("hello world")
=> :russian
wl.language("こんにちは")
=> :german
wl.language("The rain in Spain falls mainly on the plains.")
=> :polish
wl.language("what are you saying?")
=> :greek
Two better options might be:
whatlanguage seems to be quite outdated, and it probably has a double-digit false positive rate, although I couldn't find a comparison that benchmarks it against a known corpus.
I've done a bit of a back-of-envelope calculation and, as long as the false positive rate is not biased toward any of the smaller languages, 99% accuracy should work for filtering purposes.
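To illustrate why the bias matters, here is a small calculation. All numbers are invented for the example (a language that makes up 0.1% of the timeline, 50 languages in total) and are not the figures from the calculation mentioned above:

```ruby
# Of the posts tagged as language X, what fraction really are X?
#   true_share: fraction of all posts that are really in X
#   accuracy:   fraction of X posts that get tagged correctly
#   leak_share: fraction of non-X posts wrongly tagged as X
def filter_precision(true_share:, accuracy:, leak_share:)
  true_positives = true_share * accuracy
  false_positives = (1.0 - true_share) * leak_share
  true_positives / (true_positives + false_positives)
end

# Errors spread evenly over the other 49 languages (hypothetical).
uniform = filter_precision(true_share: 0.001, accuracy: 0.99, leak_share: 0.01 / 49)

# Every error lands on this one small language (hypothetical worst case).
biased = filter_precision(true_share: 0.001, accuracy: 0.99, leak_share: 0.01)

puts format('uniform errors: %.1f%% of tagged posts are correct', uniform * 100)
puts format('biased errors:  %.1f%% of tagged posts are correct', biased * 100)
```

With the same 99% accuracy, a filter for the small language stays mostly clean when errors are spread out, but is swamped by wrong posts when they all land on that language.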
@szbalint How can we use those libraries in Ruby though? :/
For a drop-in Ruby replacement, scylla would be a good option. Reposting some of the results I got with a dataset of ~66k tweets:
I tested an alternative language detection gem called scylla against a dataset of ~66k tweets that were manually categorized by language.
Results with scylla: https://gist.github.com/patf/482754e9353e8f3db1ac3c985ca4244d
Results with whatlanguage, the gem that's currently used: https://gist.github.com/patf/496c18cc4d568157803b68285c0da9b6
Overall, the accuracy with scylla is 84% and 42% with whatlanguage. whatlanguage lacks support for a number of languages (e.g. Japanese). Seems like a good drop-in replacement.
CLD may or may not be good, but it's © Google.. :/
It's released under the Apache license and doesn't make use of any third-party APIs, so I don't see why that would be problematic.
I'll be testing cld2 against the Twitter dataset in a bit.
Using the cld2 gem, I get the same accuracy as scylla when not using the reliable indicator (84%). Using only results returned as reliable, accuracy is at 94%. cld2 supports more languages, so the total number of tweets a language can be detected for is still higher than with scylla, even when using reliable.
Full results here. cld2 performance is significantly better (2s vs. 70s CPU time for the Twitter dataset).
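For anyone who wants to run a similar comparison, this is roughly how the two numbers (accuracy over everything vs. accuracy over reliable results only) can be computed. It is a sketch, not the script used for the results above. The sample data is invented, and the `{ code:, reliable: }` result shape is an assumption:

```ruby
# Each sample pairs the manually assigned language with a detector result.
# accuracy: share of considered samples where the detector was right
# coverage: share of all samples that were considered at all
def evaluate(samples, reliable_only: false)
  considered = reliable_only ? samples.select { |s| s[:result][:reliable] } : samples
  correct = considered.count { |s| s[:result][:code] == s[:expected] }
  {
    accuracy: considered.empty? ? 0.0 : correct.fdiv(considered.size),
    coverage: samples.empty? ? 0.0 : considered.size.fdiv(samples.size)
  }
end

# Tiny made-up dataset, only to show the shape of the input.
samples = [
  { expected: 'en', result: { code: 'en', reliable: true } },
  { expected: 'ja', result: { code: 'ja', reliable: true } },
  { expected: 'en', result: { code: 'pt', reliable: false } },
  { expected: 'es', result: { code: 'es', reliable: false } }
]

p evaluate(samples)
p evaluate(samples, reliable_only: true)
```

Reporting coverage next to accuracy makes the trade-off visible: restricting to reliable results raises accuracy but leaves more posts without a detected language.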
Wow, okay. I guess cld2 is the way to go then?
We've replaced WhatLanguage with CLD, and the examples thus far from this thread are all fixed now:
[1] pry(main)> LanguageDetector.new("hello world").to_iso_s
=> :en
[2] pry(main)> LanguageDetector.new("こんにちは").to_iso_s
=> :ja
[3] pry(main)> LanguageDetector.new("The rain in Spain falls mainly on the plains.").to_iso_s
=> :en
[4] pry(main)> LanguageDetector.new("what are you saying?").to_iso_s
=> :en
[5] pry(main)> LanguageDetector.new("I'd rather not!").to_iso_s
=> :en
[6] pry(main)> LanguageDetector.new("a lot of people just want to feel righteous all the time and that's all that matters").to_iso_s
=> :en
This will presumably be deployed on instances that run master over the next day or so, and we can gather more feedback at that point.
I'm going to leave this issue open to continue to gather feedback on language detection, until we're confident enough that it's worth keeping in.
Oh, I should say - there was a nice side effect improvement, which is that CLD is better at returning an "I don't know" result than WL was -- and we fall back to either account setting or instance default in those cases.
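In other words, something along these lines. This is a simplified sketch, not the actual Mastodon code. The method name, the parameter names, and the `:en` instance default are placeholders:

```ruby
# detected is nil when the detector returns "I don't know".
# Fall back to the account setting, then to the instance default.
def status_language(detected, account_language: nil, instance_default: :en)
  detected || account_language || instance_default
end

p status_language(:ja)                        # detection succeeded
p status_language(nil, account_language: :de) # fall back to the account setting
p status_language(nil)                        # fall back to the instance default
```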
Sounds great. If we want to look into further accuracy improvements down the line, the C++ code for cld2 lets you provide a language hint, which could be set to the user(/browser?) locale. I'll look into changing the cld gem to make use of that, but let's see how this performs first.
Language filtering in v1.4.0.2 seems to fail for Japanese.
I've no way to tell what it's detecting instead though, so I'm afraid I can't be of more help in finding what's going wrong ^_^;
The best way to help us debug is to provide specific URLs to statuses which are not behaving as you expect. Describe what you did, where you saw it, what you think should have happened, with as many URLs and references as possible.
Well, my local instance doesn't have a lot of Japanese on the federated timeline to begin with for some reason, but here are a few toots that still showed up just now when I tried filtering out 日本語:
https://mstdn.jp/users/hide104/updates/5967751
https://mastodon.cloud/users/PSINet/updates/887984
https://mstdn.jp/users/hide104/updates/5969606
https://mstdn.jp/users/hide104/updates/5970302
https://mastodon.art/users/Arugha_Satoru/updates/8930
(sorry for almost singling out one user, but like I said, there's not much Japanese here)
Thanks.
It looks like the .jp ones are all marked as ru (Russian) - which I'm guessing means that server has not upgraded to 1.4rc yet (the prior language detection tool had a lot of "wrong" guesses which came through as Russian). The .cloud one is marked as English, and I'd guess the same there - they are running 1.3.3 and it's marked it incorrectly.
My conclusion here is that the problem you see is not an issue with the local filtering mechanism itself - but it does expose that the local filtering mechanism is completely reliant on incoming content being marked appropriately by its publisher.
Oh? I thought the language detection was done on the receiving (server or client) end? Well, there's opportunity for improvement. Maybe redo the detection locally if the originating instance's version is <1.4?
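A sketch of what that check could look like. It assumes the receiving server knows the originating instance's version string at all, which is not established here. `redetect_locally?` is a made-up name:

```ruby
require 'rubygems' # Gem::Version ships with Ruby

# True when the origin's language tag should not be trusted,
# i.e. the origin runs a version older than `minimum`.
def redetect_locally?(origin_version, minimum: '1.4')
  return true if origin_version.nil?

  Gem::Version.new(origin_version) < Gem::Version.new(minimum)
rescue ArgumentError
  true # unparseable version string: be conservative and re-detect
end

p redetect_locally?('1.3.3')
p redetect_locally?('1.4.0.2')
p redetect_locally?(nil)
```

`Gem::Version` sorts pre-release strings before the final release, so release candidates would need a deliberate decision about which side of the cutoff they fall on.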
Have all languages but English filtered - still getting Japanese (?) and Spanish posts.
Desktop browser - Chrome - Mastodon.social - after @Gargron said the 1.4 update was up and running
Example toots in my Local Timeline:
https://mastodon.social/@felipetiza/7232522
https://mastodon.social/@mariano_bar/7232804
https://mastodon.social/@blaguesrandom/7232805
https://mastodon.social/@AliceDiNunno/7233053
https://mastodon.social/@u14269/7233095
https://mastodon.social/@ooiaee/7233107
https://mastodon.social/@Larabi/7233108
Looking at the HTML, these are all correctly identified as non-English. If there's a bug, it's probably in the language filter rather than the language detection itself.
I believe that language filtering is not applied (at all!) to the live streaming connection ... which may explain some of this.
@mjankowski is that by design? The feed is overwhelmingly non-English and unreadable to me without language filtering. I'd go so far as to say rather unusable, especially if the number of users continues to climb...
No - it's a pretty glaring oversight! We'll get that resolved before 1.4.
I've been keeping track of my toots since yesterday, noting the language codes from the atom feed. https://docs.google.com/spreadsheets/d/1BexKpvslEWedQSdhCH4Htlm0ZTxVbRjG449mgsBvs24/edit?usp=sharing
I'm wondering if usernames are factored into language detection? I'm thinking that if they are, perhaps they shouldn't be?
Also, I think people should be able to select their language manually before clicking "toot", as suggested in #3478.
That's a very useful data-set, thank you!
I'm wondering if usernames are factored into language detection? I'm thinking that if they are, perhaps they shouldn't be?
The username of the person who posted the status is not factored in ... but any at-messaged usernames which are part of the status body are factored - which I think is what you mean.
Right now we strip out URLs, but hashtags and usernames are left in. There was a suggestion somewhere to remove those, which I think makes sense.
I'll add some specs based on your data set here, and see if removing usernames would have fixed the detection.
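For reference, the kind of pre-processing being discussed might look like this. It is a sketch with simplified regexes, not the patterns Mastodon actually uses, and `prepare_for_detection` is a made-up name:

```ruby
# Strip the parts of a status that say nothing about its language
# before handing the text to the detector.
def prepare_for_detection(text)
  text
    .gsub(%r{https?://\S+}, ' ')                   # URLs
    .gsub(/(?<!\w)@\w+(?:@[\w.-]+)?/, ' ')         # @user and @user@domain mentions
    .gsub(/(?<![[:word:]])#[[:word:]]+/, ' ')      # hashtags, including non-ASCII ones
    .squeeze(' ')
    .strip
end

p prepare_for_detection('@Shutsumon Indeed! :)')
p prepare_for_detection('@tcql >:3 https://cybre.space/media/R9rNlCAUWWtq2bBDGw8')
```

Emoticons are left in place here. When little or nothing is left after stripping, the status is probably best treated as undetectable.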
The username of the person who posted the status is not factored in ... but any at-messaged usernames which are part of the status body are factored - which I think is what you mean.
Right now we strip out URLs, but hashtags and usernames are left in. There was a suggestion somewhere to remove those, which I think makes sense.
Yeah, that's what I meant! And I agree, that makes sense.
I don't know if it helps but here's my atom thing: https://cybre.space/users/cassolotl.atom
Two recent posts:
"@Shutsumon Indeed! :)" - https://cybre.space/@cassolotl/1298911 - XH
"@tcql >:3 https://cybre.space/media/R9rNlCAUWWtq2bBDGw8" - https://cybre.space/@cassolotl/1298750 - SV
Could emoticons and/or URLs be messing with it too?
We already strip URLs ... but I'm sure that emoticons -- or I guess the broader category of "non core language characters" -- might also contribute to mis-detection.
For that last one - https://cybre.space/@cassolotl/1298750 - I would not expect something so short and lacking in words to be detected as a language. That whole status is a username, a URL, and a few extra characters (which we know to be an emoticon, but which I wouldn't expect language detection to understand). In those cases, we'd expect the language detection lib to throw up its hands and say it can't reliably detect the language, and then we'd just fall back to the instance default. I confirmed that for that string, that's what is happening.
So maybe I misread the atom feed? I just checked and it says SV, which is Swedish, right? :s
Sorry, when I said "I confirmed that for that string, that's what is happening." - I was talking about doing this on my local dev environment, and only AFTER the "remove usernames and hashtags" changes had been applied.
Ahhhh I see! Thanks. :)
A lot more accurate with CLD3 now, guess we forgot to reference/close this issue.