fastText: Language ID model: interpret softmax layer in case of mixed languages

Created on 12 Jul 2018  ·  6 Comments  ·  Source: facebookresearch/fastText

I'm using fastText as a language ID / detector with the pre-trained model provided by Facebook, which covers 176 languages. It works fine in most cases, with good accuracy that is acceptable on real-world examples. There are some cases, though, that should be investigated more, typically mixed-language text like this one:

{
"text": "Oh yeah! C'mon! Take your time 聖なる月が照らす Na na na na Na na na na So tonight 天空駆けるよ今夜は Yea yea yea yea Yea yea yea yea Just right! シートは上々 アクセルだって上機嫌 お出かけ月夜の invitation 王女さまこちらへ make it work (yeah) Shawty, imma party till the sun down 西へ東へどこへ行こうか What's your name? 聞かないよ Before the sun rise どこへでもボディガード run with you 心躍り出すrunway そうボクら出逢うべき milky way Just love me right (a ha!) Baby love me right (a ha!) Oh! 迷わずボクを見つけて そうキミは romantic universe Just love me right (a ha!) 魅せられてるのさ Just love me right Just love me right Just love me right ボクだけの宇宙さ Shine a light 銀河のダイヤモンドのよう Na na na na Na na na na 眩しすぎる ah yeah! 光放つ軌道たどって 触れるまで何 feet 何 yard I can do this all night long baby! 彷徨いながらキミのFieldの中 Oh whoo! 星屑の夢 散らばる花火 ときめいて煌めいて (煌めいて) 風が時空を揺らす 瞬間を complete Touchdown 地球ごと Love me right! Ow! 心躍り出すrunway そうボクら出逢うべき milky way Just love me right (a ha!) Baby love me right (a ha!) (Love me right) So come on baby, 導いて キミへの romantic universe (眩しい lady) Just love me right (a ha!) ボクだけの宇宙さ Just love me right Just love me right (can you love me right) Just love me right I just wanna make you love me (ボクだけの universe) 眠れぬ夜を染めよう 空に輝く星座 照らし続けて 僕らの永遠 願い 叶えて 心躍り出す runway そうボクら出逢うべき milky way Just love me right (a ha!) Baby love me right (a ha!) (Woo yeah) Oh! 迷わずボクを見つけて そうキミは romantic universe (my Lady) Just love me right (a ha!) 魅せられてるのさ Just love me right (oh oh oh yeah) Just love me right Just love me right Just love me right ボクだけの宇宙さ (美しいセカイ) Just love me right (Twilight cosmic ride, ボクらのセカイへ) Just love me right Just love me right I just wanna make you love me Yea (woo you got to love) Yea (you got to love me) Yea ボクだけの宇宙さ (ボクだけの宇宙さ) You love me Yea (you love me) Yea (you love me) Yea ボクだけの宇宙さ"
}

This text, which mixes Japanese and English, is detected as

{
    "label": "JA",
    "scores": {
        "JA": 0.446074,
        "ZH": 0.123243,
        "EN": 0.109917
    }
}

So ja scores below 0.5, which is the minimum value I typically require to be sure that the text is mostly in that language, while zh is a false positive (FP) with a probability of 0.12 and en is a false negative (FN) with a probability of 0.10.
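The 0.5 cut-off described above can be sketched as a small check. This is only an illustration: the function name `confident_label` is hypothetical, the scores are copied from the output above, and the remaining-mass calculation assumes the scores are probabilities over all 176 labels.

```python
from typing import Dict, Optional


def confident_label(scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the top-scoring label, or None if its score is below the threshold."""
    if not scores:
        return None
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# Top-3 scores reported above for the mixed Japanese/English lyrics.
scores = {"JA": 0.446074, "ZH": 0.123243, "EN": 0.109917}

# No label clears the 0.5 bar, so the detector abstains on this text.
result = confident_label(scores)

# Probability mass not covered by the three labels shown (about 0.32),
# i.e. what is spread over the remaining labels under the assumption above.
unassigned = 1.0 - sum(scores.values())
```

With these scores the check abstains rather than reporting JA, which is exactly the situation the question is about.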

I asked a similar question about CLD2 some time ago and got an interesting answer about mixed languages and prior probabilities here.

All 6 comments

Thanks Loreto for bringing this to my attention.

What you are getting at here is that if we are 100% sure that 50% of the text is in English and 50% is in Japanese, it is incorrect to flatten that to a 50% probability that the text is in English.

To solve that we could imagine something like this:

parts : [
    {portion: 0.6,  scores: {"JA": 0.946074, "ZH": 0.02...}},
    {portion: 0.4,  scores: {"EN": 0.996074, ...}}
]

But I can think of cases where it breaks a bit.

In my experience though it is hard to convince Bay Area folks that mixed language is more than esoteric, when in fact it is the norm in many markets.
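One way to obtain such parts is to cut the text wherever the writing system changes and measure each run's share of the text. The sketch below is an assumption, not something proposed in the thread: it buckets characters by their Unicode name, and all function names are hypothetical. Script is only a proxy for language (kanji alone cannot separate Japanese from Chinese, and Latin letters cannot separate English from French), so each run would still need to go through the model.

```python
import unicodedata
from typing import Dict, List, Tuple


def script_of(ch: str) -> str:
    """Coarse script bucket for one character: 'cjk', 'latin', or 'other'."""
    if not ch.isalpha():
        return "other"  # spaces, digits, punctuation
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
        return "cjk"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into consecutive (script, chunk) runs.

    Neutral characters stay attached to the run they follow.
    """
    runs: List[List[str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "other" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "other":
            # Leading neutral characters adopt the first real script.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


def portions(runs: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Attach to each run its share of the total character count."""
    total = sum(len(chunk) for _, chunk in runs)
    return [
        {"script": script, "portion": len(chunk) / total, "text": chunk}
        for script, chunk in runs
    ]


sample = "Take your time 聖なる月が照らす Na na"
runs = split_by_script(sample)
parts = portions(runs)
```

For the sample this yields a Latin run, a CJK run, and another Latin run, each with its portion; the model scores for each run would then fill in the `scores` field.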

@bittlingmayer thanks, I think taking those esoteric languages :) into account will further improve the language model.
For example, in the past I was using CLD2, which is a "_Naïve Bayesian classifier, trained on documents of mean size of 200 characters, trained on a corpus of 100M scraped and human expert selected web pages_...".

This classifier seems to take prior probabilities and mixed languages into account in some way. It still gets this example wrong, but the approach seems similar to what you are suggesting here regarding the split into parts.

The most interesting thing is that it does not have a FP on zh, but it ranks en first (49% versus 46% for ja), which seems to be the prior probability issue I mentioned above. I think it's a good starting point for improving the output of this fastText model.

{
    "reliable": true,
    "detection": {
        "name": "ENGLISH",
        "code": "en",
        "percent": 49,
        "score": 1108
    },
    "languages": [{
            "name": "ENGLISH",
            "code": "en",
            "percent": 49,
            "score": 1108
        },
        {
            "name": "Japanese",
            "code": "ja",
            "percent": 46,
            "score": 3358
        }
    ],
    "chunks": [{
            "name": "Japanese",
            "code": "ja",
            "offset": 31,
            "bytes": 25
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 91,
            "bytes": 28
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 163,
            "bytes": 72
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 246,
            "bytes": 25
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 271,
            "bytes": 57
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 328,
            "bytes": 34
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 416,
            "bytes": 34
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 485,
            "bytes": 31
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 516,
            "bytes": 68
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 584,
            "bytes": 47
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 631,
            "bytes": 45
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 676,
            "bytes": 25
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 701,
            "bytes": 57
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 758,
            "bytes": 25
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 797,
            "bytes": 37
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 883,
            "bytes": 47
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 939,
            "bytes": 40
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 979,
            "bytes": 27
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1027,
            "bytes": 110
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1210,
            "bytes": 31
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 1241,
            "bytes": 97
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1338,
            "bytes": 23
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 1390,
            "bytes": 33
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1423,
            "bytes": 25
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 1448,
            "bytes": 112
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1586,
            "bytes": 115
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1708,
            "bytes": 31
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 1739,
            "bytes": 79
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1818,
            "bytes": 47
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 1865,
            "bytes": 55
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 1920,
            "bytes": 25
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 1945,
            "bytes": 92
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 2037,
            "bytes": 46
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 2083,
            "bytes": 42
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 2125,
            "bytes": 26
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 2151,
            "bytes": 123
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 2274,
            "bytes": 52
        },
        {
            "name": "ENGLISH",
            "code": "en",
            "offset": 2326,
            "bytes": 52
        },
        {
            "name": "Japanese",
            "code": "ja",
            "offset": 2378,
            "bytes": 24
        }
    ]
}

For this reason I have used n-grams to analyze the input text (that is, chunks of text rather than the whole text). Hope this helps; feel free to ask for more real-world examples.
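The `chunks` array in the CLD2 output above can be tallied per language to get a rough byte share for each one. The sketch below uses only the first five chunks from that output; the function names are hypothetical, and it makes no claim about how CLD2 derives its own `percent` field.

```python
from collections import Counter
from typing import Dict, List


def bytes_per_language(chunks: List[dict]) -> Dict[str, int]:
    """Sum the 'bytes' field of each chunk by language code."""
    totals: Counter = Counter()
    for chunk in chunks:
        totals[chunk["code"]] += chunk["bytes"]
    return dict(totals)


def shares(totals: Dict[str, int]) -> Dict[str, float]:
    """Turn byte totals into fractions of all labelled bytes."""
    total = sum(totals.values())
    return {code: n / total for code, n in totals.items()} if total else {}


# First five chunks copied from the CLD2 output above.
chunks = [
    {"name": "Japanese", "code": "ja", "offset": 31, "bytes": 25},
    {"name": "Japanese", "code": "ja", "offset": 91, "bytes": 28},
    {"name": "Japanese", "code": "ja", "offset": 163, "bytes": 72},
    {"name": "Japanese", "code": "ja", "offset": 246, "bytes": 25},
    {"name": "ENGLISH", "code": "en", "offset": 271, "bytes": 57},
]

totals = bytes_per_language(chunks)
language_shares = shares(totals)
```

Note that these are byte counts, so Japanese text in UTF-8 weighs about three bytes per character while English weighs one.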

Right, you could write a function that splits the text into sentences, passes each one to the existing model, and builds a more fine-grained output.

But there could be subtle problems with training on labels for whole sentences or longer texts and then predicting labels for n-grams.
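A minimal sketch of such a function follows. The model is passed in as a `predict` callable so the example runs without the model file; `predict_per_segment` and `toy_predict` are hypothetical names, and `toy_predict` is a deliberately crude stand-in, not the fastText model. With the fasttext Python package, the callable could wrap `model.predict(segment, k=3)`, which returns a tuple of labels and probabilities.

```python
import re
from typing import Callable, Dict, List, Tuple

# A predictor maps one segment to {language_code: score}.
Predict = Callable[[str], Dict[str, float]]

# A segment is a run of non-terminator characters plus any closing terminators.
_SEGMENT = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def predict_per_segment(text: str, predict: Predict) -> List[Tuple[str, Dict[str, float]]]:
    """Split text on sentence punctuation and score each segment separately."""
    segments = [s.strip() for s in _SEGMENT.findall(text)]
    return [(seg, predict(seg)) for seg in segments if seg]


def toy_predict(segment: str) -> Dict[str, float]:
    """Stand-in predictor: any hiragana or katakana means Japanese."""
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in segment)
    return {"ja": 0.9, "en": 0.1} if has_kana else {"en": 0.9, "ja": 0.1}


results = predict_per_segment(
    "Just love me right! ボクだけの宇宙さ。 Oh yeah!", toy_predict
)
top_labels = [max(scores, key=scores.get) for _, scores in results]
```

Lyrics like the example rarely carry sentence punctuation, so in practice the splitting rule matters as much as the model; this is where the mismatch between training on sentences and predicting on fragments shows up.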

@bittlingmayer yes, exactly. Suppose I split the text into n-grams for a given N (let's say with N ranging from 3 to 7): there could be different results for different values of N. In this specific case the fastText model was trained on the Tatoeba and SETimes datasets, which typically have short sentences like these (from the Tatoeba training set here):

__label__por Tenho um gato e um cão. O gato é preto e o cão é branco.
__label__jpn 君はいまやエリート集団の一員だ。
__label__fin Laatikko oli tyhjä.
__label__por Como podemos evitar que isso volte a acontecer?

This means (my guess) that I could look at the mean n-gram size over the dataset and use it to segment the input document before running the model. The second problem would then be: how do I combine the fine-grained results, i.e. the different arrays of probabilities for the same class?
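The first idea, measuring the typical training example and segmenting the input to match, might look like the sketch below. It is an assumption-laden illustration: length is measured in characters, every row is assumed to carry exactly one `__label__` prefix, and the names `mean_text_length` and `windows` are hypothetical. The sample rows are the four Tatoeba lines quoted above.

```python
from statistics import mean
from typing import List


def mean_text_length(rows: List[str]) -> float:
    """Mean character length of the text part of '__label__xxx text' rows."""
    texts = [row.split(" ", 1)[1] for row in rows if " " in row]
    return mean(len(text) for text in texts)


def windows(text: str, size: int) -> List[str]:
    """Cut text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


rows = [
    "__label__por Tenho um gato e um cão. O gato é preto e o cão é branco.",
    "__label__jpn 君はいまやエリート集団の一員だ。",
    "__label__fin Laatikko oli tyhjä.",
    "__label__por Como podemos evitar que isso volte a acontecer?",
]

size = round(mean_text_length(rows))
segments = windows("Just love me right ボクだけの宇宙さ Shine a light", size)
```

Fixed character windows cut through words and across language boundaries, so a window can itself be mixed; splitting on whitespace or script changes first would reduce that.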

"This means (my guess) that I could look at the mean n-gram size over the dataset and use it to segment the input document before running the model."

Yes. But in that case you could run an analysis to find n-grams that occur with multiple labels, and then go back and re-label (i.e. multi-label) the rows. For example, if n=3 your script would find that Como podemos evitar is also valid Spanish, and it would transform the row:

__label__por  Como podemos evitar

to

__label__por __label__spa  Como podemos evitar

There will already be rows with that label, but this will change the balance and give you a model that can predict a high probability for both.

Similar languages, transliteration, named entities and so on cause all the headaches.
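The re-labelling pass described above could be sketched as follows. Assumptions are stated loudly: n-grams are word trigrams split on whitespace (which does not work for unsegmented text such as Japanese), the function names are hypothetical, the Spanish row is made up for illustration, and Spanish uses the ISO 639-3 code `spa` to match the three-letter codes in the Tatoeba rows.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def word_ngrams(text: str, n: int) -> List[str]:
    """All consecutive n-word sequences in a whitespace-tokenised text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def multilabel_ngram_rows(rows: List[Tuple[str, str]], n: int = 3) -> List[str]:
    """Expand (label, text) rows into n-gram rows carrying every label seen."""
    # First pass: collect all labels each n-gram occurs with.
    labels_by_ngram: Dict[str, Set[str]] = defaultdict(set)
    for label, text in rows:
        for gram in word_ngrams(text, n):
            labels_by_ngram[gram].add(label)

    # Second pass: emit one row per occurrence, so frequencies are preserved.
    out = []
    for _, text in rows:
        for gram in word_ngrams(text, n):
            prefix = " ".join(f"__label__{l}" for l in sorted(labels_by_ngram[gram]))
            out.append(f"{prefix} {gram}")
    return out


rows = [
    ("por", "Como podemos evitar que isso volte a acontecer?"),
    ("spa", "Como podemos evitar eso"),  # hypothetical row for illustration
]
relabelled = multilabel_ngram_rows(rows)
```

Emitting one row per occurrence rather than one per distinct n-gram is a design choice: it keeps the label frequencies of the original data, which is what shifts the balance during training.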

"The second problem would then be: how do I combine the fine-grained results, i.e. the different arrays of probabilities for the same class?"

As for the ideal output, I think the answer you will arrive at is that it depends on the application.

Many developers really do just want a single language code and maybe a single number (I won't call it probability, more like probability * chunk_length / total_length).
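That single number, probability * chunk_length / total_length summed over chunks, can be sketched like this. The function name `combine` is hypothetical, and the example input reuses the two illustrative parts from earlier in the thread (a 0.6 portion scored JA 0.946074 and a 0.4 portion scored EN 0.996074), keeping only the scores that were not elided there.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def combine(chunks: List[Tuple[int, Dict[str, float]]]) -> Dict[str, float]:
    """Length-weighted sum of per-chunk scores for each label.

    Each chunk is (length, {label: probability}).
    """
    total_length = sum(length for length, _ in chunks)
    if total_length == 0:
        return {}
    combined: Dict[str, float] = defaultdict(float)
    for length, scores in chunks:
        weight = length / total_length
        for label, probability in scores.items():
            combined[label] += probability * weight
    return dict(combined)


# Two chunks covering 60% and 40% of the text.
combined = combine([
    (60, {"JA": 0.946074}),
    (40, {"EN": 0.996074}),
])
best = max(combined, key=combined.get)
```

Here JA comes out at about 0.57 and EN at about 0.40. These values read more naturally as an estimate of how much of the text is in each language than as the probability that the whole text is in that language.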

CLD2 is very much focused on the browser use case, so it can make useful assumptions about length and desired behaviour. Twitter has its own language ID, built for a very different use case.

As far as I know, fastText language ID never specified a target use case. Presumably it is used internally by Facebook for posts and comments, so it is fundamentally different from CLD2, but it still serves to offer translation to users on user-generated content, and maybe other things, such as feeding into the input vector of the system that decides which posts or comments to show to a user.

Closing, since the previous reply was OK.
