Mastodon: Language detection is inaccurate

Created on 2 May 2017 · 36 Comments · Source: tootsuite/mastodon

Considering there is now a feature that allows users to filter timelines based on language, serious inaccuracy in post language detection should be considered a bug.

A couple examples just from my own recent posts:

"I'd rather not!" detected as Portuguese?

"a lot of people just want to feel righteous all the time and that's all that matters" detected as Norwegian?!

[x] I searched or browsed the repo’s other issues to ensure this is not a duplicate.
[ ] This bug happens on a tagged release and not on master (If you're a user, don't worry about this).

bug

rainyday

👍4

All 36 comments

Mastodon's language detection is done through https://github.com/peterc/whatlanguage/ at the moment, which is very limited. It could be extended easily by adding more dictionaries, but that wouldn't really make the detection much better.

A better library would be something like https://github.com/richtr/guesslanguage.js, but that's JavaScript and it seems there's no Ruby equivalent of it yet.

Using external resources like Google or detectlanguage.com is of course not compatible with Mastodon's founding principles.

ghost on 2 May 2017

I very much prefer the idea of users being able to choose each toot's language(s) somehow, like the mock-ups in #691. Being able to choose a default language that you can change is important, because a lot of folks post in more than one language.

Cassolotl on 2 May 2017

👍4

Language feature is not released yet, but is being tried on a few servers. It will be disabled if we're not confident in quality.

mjankowski on 2 May 2017

❤1

Helpful feedback here if you are on those instances:

misdetected strings (expected and actual)
failure to filter something (or excessive filtering)
UI/UX feedback

mjankowski on 2 May 2017

Related: if we use a library with a confidence score, we could make language default to unknown, and only populate on high confidence
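As an illustration of the idea above (not code from Mastodon), a minimal Ruby sketch of "default to unknown, only populate on high confidence" could look like this. The `Detection` struct, the `MIN_CONFIDENCE` cut-off, and the method name are all assumptions made for the example; a real threshold would need tuning against labelled data.

```ruby
# Hypothetical result type: a language code plus a confidence score in 0.0..1.0.
Detection = Struct.new(:code, :confidence)

# Hypothetical cut-off, chosen only for this sketch.
MIN_CONFIDENCE = 0.9

# Returns the detected language code, or nil ("unknown") when there is no
# detection or the detector is not confident enough.
def language_or_unknown(detection, min_confidence: MIN_CONFIDENCE)
  return nil if detection.nil?

  detection.confidence >= min_confidence ? detection.code : nil
end

language_or_unknown(Detection.new(:en, 0.97)) # => :en
language_or_unknown(Detection.new(:pt, 0.41)) # => nil
```

Returning nil rather than a best guess leaves the caller free to decide what "unknown" should mean for filtering.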

mjankowski on 2 May 2017

Just did some testing with that WhatLanguage library currently used in Mastodon. It's hilariously bad:

require 'whatlanguage'
wl = WhatLanguage.new(:all)
wl.language("hello world")
=> :russian
wl.language("こんにちは")
=> :german
wl.language("The rain in Spain falls mainly on the plains.")
=> :polish
wl.language("what are you saying?")
=> :greek

ghost on 3 May 2017

😄6 😕1

Two better options might be:

language-detection (java) with ~99.2% accuracy
chromium-compact-language-detector (c++/python) with ~98.8% accuracy.

whatlanguage seems to be quite outdated, and it probably has a double-digit false positive rate, although I couldn't find a comparison that benchmarks it against a known corpus.

I've done a bit of a back-of-envelope calculation, and as long as the false positive rate is not biased toward any of the smaller languages, 99% accuracy should work for filtering purposes.
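The calculation itself isn't shown in the thread, so the following is only one way such an estimate might go, with made-up numbers: a language that accounts for 1% of posts, 99% accuracy, and 50 candidate languages for the errors to fall on. It compares errors spread evenly over the candidates with the worst case where every error lands on the small language.

```ruby
# Share of posts labelled as a language that really are in that language.
# share:   fraction of all posts written in the language
# recall:  fraction of those posts the detector labels correctly
# fp_rate: fraction of posts in other languages mislabelled as this language
def filter_precision(share:, recall:, fp_rate:)
  true_hits  = share * recall
  false_hits = (1.0 - share) * fp_rate
  true_hits / (true_hits + false_hits)
end

# Illustrative numbers only, not measurements from the thread.
# Unbiased case: the 1% of errors are spread evenly over 50 languages.
unbiased = filter_precision(share: 0.01, recall: 0.99, fp_rate: 0.01 / 50)
# Biased case: every error is attributed to the small language.
biased = filter_precision(share: 0.01, recall: 0.99, fp_rate: 0.01)

puts format("unbiased: %.2f, biased: %.2f", unbiased, biased)
# prints "unbiased: 0.98, biased: 0.50"
```

Under these assumptions the filter stays useful when errors are spread out, but half of what it shows for a small language is wrong once the errors concentrate there.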

szbalint on 3 May 2017

@szbalint How can we use those libraries in Ruby though? :/

Gargron on 3 May 2017

For a drop-in Ruby replacement, scylla would be a good option. Reposting some of the results I got with a dataset of ~66k tweets:

I tested an alternative language detection gem called scylla against a dataset of ~66k tweets that were manually categorized by language.

Results with scylla: https://gist.github.com/patf/482754e9353e8f3db1ac3c985ca4244d
Results with whatlanguage, the gem that's currently used: https://gist.github.com/patf/496c18cc4d568157803b68285c0da9b6

Overall, the accuracy with scylla is 84 % and 42 % with whatlanguage. whatlanguage lacks support for a number of languages (e.g. Japanese). Seems like a good drop-in replacement.

pfigel on 3 May 2017

@Gargron there seem to be a couple of Ruby bindings for the CLD C++ version, like this and this. I don't know enough about Ruby to evaluate which would be better, if any.

szbalint on 3 May 2017

CLD may or may not be good, but it's © Google.. :/

ghost on 3 May 2017

It's released under the Apache license and doesn't make use of any third-party APIs, so I don't see why that would be problematic.

I'll be testing cld2 against the Twitter dataset in a bit.

pfigel on 3 May 2017

Using the cld2 gem, I get the same accuracy as scylla when not using the reliable indicator (84%). Using only results returned as reliable, accuracy is at 94%. cld2 supports more languages, so the total number of tweets a language can be detected for is still higher than with scylla, even when using reliable.

Full results here. cld2 performance is significantly better (2s vs. 70s CPU time for the Twitter dataset).
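For readers who want to reproduce this kind of comparison, here is a rough sketch of how overall accuracy, reliable-only accuracy, and coverage can be computed from labelled samples. The `Sample` struct and the five rows are toy data invented for the example; they are not the tweet dataset or the benchmark script used above.

```ruby
# One row per labelled sample: the human label, the detector's guess, and
# whether the detector flagged that guess as reliable. Toy data only.
Sample = Struct.new(:expected, :detected, :reliable)

# Fraction of samples whose detected language matches the human label.
def accuracy(samples)
  return 0.0 if samples.empty?

  samples.count { |s| s.expected == s.detected }.fdiv(samples.size)
end

samples = [
  Sample.new(:en, :en, true),
  Sample.new(:ja, :ja, true),
  Sample.new(:en, :pt, false),
  Sample.new(:de, :de, true),
  Sample.new(:es, :it, false)
]

reliable = samples.select(&:reliable)

overall_accuracy  = accuracy(samples)                  # every guess counts
reliable_accuracy = accuracy(reliable)                 # only guesses flagged reliable
coverage          = reliable.size.fdiv(samples.size)   # share of samples that get a reliable guess
```

Reporting coverage next to reliable-only accuracy matters because discarding unreliable guesses raises accuracy while shrinking the set of posts that get a language at all.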

pfigel on 3 May 2017

👍1

Wow, okay. I guess cld2 is the way to go then?

Gargron on 3 May 2017

🎉1

We've replaced WhatLanguage with CLD, and the examples thus far from this thread are all fixed now:

[1] pry(main)> LanguageDetector.new("hello world").to_iso_s
=> :en
[2] pry(main)> LanguageDetector.new("こんにちは").to_iso_s
=> :ja
[3] pry(main)> LanguageDetector.new("The rain in Spain falls mainly on the plains.").to_iso_s
=> :en
[4] pry(main)> LanguageDetector.new("what are you saying?").to_iso_s
=> :en
[5] pry(main)> LanguageDetector.new("I'd rather not!").to_iso_s
=> :en
[6] pry(main)> LanguageDetector.new("a lot of people just want to feel righteous all the time and that's all that matters").to_iso_s
=> :en

This will presumably be deployed on instances that run master over the next day or so, and we can gather more feedback at that point.

I'm going to leave this issue open to continue to gather feedback on language detection, until we're confident enough that it's worth keeping in.

mjankowski on 3 May 2017

👍2

Oh, I should say - there was a nice side effect improvement, which is that CLD is better at returning an "I don't know" result than WL was -- and we fall back to either account setting or instance default in those cases.
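A sketch of that fallback order, assuming nil stands for "I don't know" or "not set". The method and argument names are made up for the example and are not Mastodon's actual code.

```ruby
# Picks the language to store for a status: the detected language if there
# is one, else the account's setting, else the instance default.
# All names here are hypothetical; nil means "not available".
def resolve_language(detected, account_language, instance_default)
  detected || account_language || instance_default
end

resolve_language(:ja, :en, :en)  # => :ja
resolve_language(nil, :fr, :en)  # => :fr
resolve_language(nil, nil, :en)  # => :en
```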

mjankowski on 3 May 2017

👍1

Sounds great. If we want to look into further accuracy improvements down the line, the C++ code for cld2 lets you provide a language hint, which could be set to the user(/browser?) locale. I'll look into changing the cld gem to make use of that, but let's see how this performs first.

pfigel on 3 May 2017

👍2

Language filtering in v1.4.0.2 seems to fail for Japanese.

I've no way to tell what it's detecting instead though, so I'm afraid I can't be of more help in finding what's going wrong ^_^;

ghost on 22 May 2017

👍1

The best way to help us debug is to provide specific URLs to statuses which are not behaving as you expect. Describe what you did, where you saw it, what you think should have happened, with as many URLs and references as possible.

mjankowski on 22 May 2017

Well, my local instance doesn't have a lot of Japanese on the federated timeline to begin with for some reason, but here are a few toots that still showed up just now when I tried filtering out 日本語:
https://mstdn.jp/users/hide104/updates/5967751
https://mastodon.cloud/users/PSINet/updates/887984
https://mstdn.jp/users/hide104/updates/5969606
https://mstdn.jp/users/hide104/updates/5970302
https://mastodon.art/users/Arugha_Satoru/updates/8930
(sorry for almost singling out one user, but like I said, there's not much Japanese here)

ghost on 22 May 2017

❤1

Thanks.

It looks like the .jp ones are all marked as ru (Russian) - which I'm guessing means that server has not upgraded to 1.4rc yet (the prior language detection tool had a lot of "wrong" guesses which came through as Russian). The .cloud one is marked as English, and I'd guess the same there - they are running 1.3.3 and it's marked it incorrectly.

My conclusion here is that the problem you see is not an issue with the local filtering mechanism itself - but it does expose that the local filtering mechanism is completely reliant on incoming content being marked appropriately by its publisher.

mjankowski on 22 May 2017

❤1

Oh? I thought the language detection was done on the receiving (server or client) end? Well, there's opportunity for improvement. Maybe redo the detection locally if the originating instance's version is <1.4?
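To make that suggestion concrete, here is a hedged sketch of the version check. It assumes the originating instance's version is available as a string, which nothing in this thread confirms; the method name and the constant are invented for the example. `Gem::Version` is part of RubyGems, which ships with Ruby.

```ruby
# Cut-off taken from the "<1.4" suggestion above; hypothetical constant name.
FIRST_TRUSTED_VERSION = Gem::Version.new('1.4')

# Returns true when the language tag sent by the originating instance should
# be ignored and detection redone locally. Missing or malformed versions are
# treated as untrusted.
def redetect_locally?(origin_version)
  return true unless origin_version.is_a?(String) && Gem::Version.correct?(origin_version)

  Gem::Version.new(origin_version) < FIRST_TRUSTED_VERSION
end

redetect_locally?('1.3.3')   # => true
redetect_locally?('1.4.0.2') # => false
redetect_locally?(nil)       # => true
```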

ghost on 22 May 2017

Have all languages but English filtered - still getting Japanese (?) and Spanish posts.

Desktop browser - Chrome - Mastodon.social - after @Gargron said the 1.4 update was up and running

Example toots in my Local Timeline:

https://mastodon.social/@felipetiza/7232522
https://mastodon.social/@mariano_bar/7232804
https://mastodon.social/@blaguesrandom/7232805
https://mastodon.social/@AliceDiNunno/7233053
https://mastodon.social/@u14269/7233095
https://mastodon.social/@ooiaee/7233107
https://mastodon.social/@Larabi/7233108

dixonge on 26 May 2017

Looking at the HTML, these are all correctly identified as non-English. If there's a bug, it's probably in the language filter rather than the language detection itself.

pfigel on 26 May 2017

I believe that language filtering is not applied (at all!) to the live streaming connection ... which may explain some of this.

mjankowski on 26 May 2017

@mjankowski is that by design? The feed is overwhelmingly non-English and unreadable to me without language filtering. I'd go so far as to say rather unusable, especially if the number of users continues to climb...

dixonge on 26 May 2017

No - it's a pretty glaring oversight! We'll get that resolved before 1.4.

mjankowski on 26 May 2017

👍1

I've been keeping track of my toots since yesterday, noting the language codes from the atom feed. https://docs.google.com/spreadsheets/d/1BexKpvslEWedQSdhCH4Htlm0ZTxVbRjG449mgsBvs24/edit?usp=sharing

I'm wondering if usernames are factored into language detection? I'm thinking that if they are, perhaps they shouldn't be?

Also, I think people should be able to select their language manually before clicking "toot", as suggested in #3478.

Cassolotl on 1 Jun 2017

That's a very useful data-set, thank you!

I'm wondering if usernames are factored into language detection? I'm thinking that if they are, perhaps they shouldn't be?

The username of the person who posted the status is not factored in ... but any at-messaged usernames which are part of the status body are factored - which I think is what you mean.

Right now we strip out URLs, but hashtags and usernames are left in. There was a suggestion somewhere to remove those, which I think makes sense.

I'll add some specs based on your data set here, and see if removing usernames would have fixed the detection.
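A rough sketch of that pre-processing step, stripping URLs, at-mentions, and hashtags before the text is handed to the detector. The patterns are simplified stand-ins written for this example and will differ from the regexes Mastodon actually uses; `text_for_detection` is a made-up name.

```ruby
# Simplified patterns for this sketch only.
URL_PATTERN     = %r{https?://\S+}
# @user or @user@domain, not preceded by a word character (skips e-mail addresses).
MENTION_PATTERN = /(?<![\w@])@\w+(?:@[\w.-]+)?/
# [[:word:]] is Unicode-aware, so non-Latin hashtags are covered too.
HASHTAG_PATTERN = /(?<![[:word:]])#[[:word:]]+/

# Returns the status text with URLs, mentions, and hashtags removed.
# URLs go first so that an "@name" inside a URL path disappears with it.
def text_for_detection(status_text)
  status_text
    .gsub(URL_PATTERN, ' ')
    .gsub(MENTION_PATTERN, ' ')
    .gsub(HASHTAG_PATTERN, ' ')
    .squeeze(' ')
    .strip
end

text_for_detection("@Shutsumon Indeed! :)")
# => "Indeed! :)"
text_for_detection("@tcql >:3 https://cybre.space/media/R9rNlCAUWWtq2bBDGw8")
# => ">:3"
```

What remains of the second status is only an emoticon, which is the kind of input where a detector is expected to report no reliable result.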

mjankowski on 1 Jun 2017

The username of the person who posted the status is not factored in ... but any at-messaged usernames which are part of the status body are factored - which I think is what you mean.

Right now we strip out URLs, but hashtags and usernames are left in. There was a suggestion somewhere to remove those, which I think makes sense.

Yeah, that's what I meant! And I agree, that makes sense.

I don't know if it helps but here's my atom thing: https://cybre.space/users/cassolotl.atom

Cassolotl on 1 Jun 2017

Two recent posts:

"@Shutsumon Indeed! :)" - https://cybre.space/@cassolotl/1298911 - XH
"@tcql >:3 https://cybre.space/media/R9rNlCAUWWtq2bBDGw8" - https://cybre.space/@cassolotl/1298750 - SV

Could emoticons and/or URLs be messing with it too?

Cassolotl on 1 Jun 2017

We already strip URLs ... but I'm sure that emoticons -- or I guess the broader category of "non core language characters" -- might also contribute to mis-detection.

For that last one - https://cybre.space/@cassolotl/1298750 - I would not expect something so short and lacking in words to be detected as a language. That whole status is a username, a URL, and a few extra characters (which we know to be an emoticon, but which I wouldn't expect language detection to understand). In those cases, we'd expect the language detection lib to throw up its hands and say it can't reliably detect the language, and then we'd just fall back to the instance default. I confirmed that for that string, that's what is happening.

mjankowski on 1 Jun 2017

So maybe I misread the atom feed? I just checked and it says SV, which is Swedish, right? :s

Cassolotl on 1 Jun 2017

Sorry, when I said "I confirmed that for that string, that's what is happening." - I was talking about doing this on my local dev environment, and only AFTER the "remove usernames and hashtags" changes had been applied.

mjankowski on 1 Jun 2017

Ahhhh I see! Thanks. :)

Cassolotl on 1 Jun 2017

A lot more accurate with CLD3 now, guess we forgot to reference/close this issue.

Gargron on 2 Jul 2017
