Rasa: ner_crf + intent_classifier_tensorflow_embedding is dependent on locale settings

Created on 8 Aug 2018 · 11 Comments · Source: RasaHQ/rasa

Rasa NLU version: rasa-nlu==0.13.1

Operating system (windows, osx, ...): Ubuntu 16.04

Content of model configuration file:

language: "en"

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "ner_duckling"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  # flag if to tokenize intents
  intent_tokenization_flag: true
  intent_split_symbol: "_"
  # nn architecture
  num_hidden_layers_a: 1
  hidden_layer_size_a: [64]
  num_hidden_layers_b: 1
  hidden_layer_size_b: [32]
  batch_size: 5
  epochs: 60
  # embedding parameters
  embed_dim: 10
  mu_pos: 0.8  # should be 0.0 < ... < 1.0 for 'cosine'
  mu_neg: -0.4  # should be -1.0 < ... < 1.0 for 'cosine'
  similarity_type: "cosine"  # string 'cosine' or 'inner'
  num_neg: 10
  use_max_sim_neg: true  # flag which loss function to use
  # regularization
  C2: 0.002
  C_emb: 0.8
  droprate: 0.2

Issue:

When using ner_spacy together with intent_classifier_tensorflow_embedding, training may fail depending on the locale settings.
If the locale is configured to use a comma (,) instead of a period (.) as the decimal separator, training fails with a long trace ending in an error similar to this one:

google.protobuf.text_format.ParseError: 48:12 : Message type "tensorflow.AttrValue" has no field named "5".

The Google protobuf implementation uses a locale-dependent function to convert floats to strings, which I guess causes problems when ner_duckling extracts its dimensions.

So to summarize this happens (at least) when three things co-occur:

  • using ner_spacy
  • using intent_classifier_tensorflow_embedding
  • locale settings use comma (,) for decimal numbers.

I know this is not a Rasa-specific error, but maybe a warning could be raised or this could be stated in the documentation.

Hope at least this issue could help others in a similar situation.

All 11 comments

Thanks for raising this issue, @Ghostvv will get back to you about it soon.

@jmrf Thanks for pointing that out. Do you mind creating a PR for this?

Hey @Ghostvv, sure.

What is your preferred approach to this?
A warning at runtime when these two components are used in the pipeline, for instance?

Can we catch when locale settings use comma and then check if numbers were extracted incorrectly?

We can check the locale settings from python:

   import locale
   locale.setlocale(locale.LC_ALL, "")
   print(locale.getlocale(locale.LC_NUMERIC))    #   ('en_US', 'UTF-8') or ('es_ES', 'UTF-8') etc

Although many locales outside the US and UK use a comma as the decimal separator (radix), those two are not the only ones that use a period; for example, th_TH (Thailand) also uses a period.
(More details about locales: en_US or en_GB)

So far I am not aware of a method for checking the actual decimal point character of a given locale, so to know when the locale settings use a comma instead of a period we would need to know which locales define a comma as their separator, which doesn't look very practical.

The only way I can think of right now to check locale decimal point character is using the locale.atof function:
(Converts a string to a floating point number, following the LC_NUMERIC settings.)

import locale

# Under a period-decimal locale the comma is treated as a thousands
# separator and dropped, so the conversion is wrong:
locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
locale.atof("123,7")     # this gives 1237.0

# Under a comma-decimal locale the string is parsed as intended:
locale.setlocale(locale.LC_NUMERIC, 'es_ES.UTF-8')
locale.atof("123,7")    # this gives 123.7

That's regarding detecting locale settings.
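
As a possible alternative to the atof probe, Python's standard locale module also exposes locale.localeconv(), whose "decimal_point" entry holds the separator of the locale currently active for LC_NUMERIC. The sketch below combines that lookup with the runtime warning discussed earlier in this thread. The function names and the choice of which components count as risky are assumptions made for illustration; none of this is part of Rasa.

```python
import locale
import warnings

# Component names are taken from the pipeline in this issue. Treating exactly
# this pair as the risky combination is an assumption, not a confirmed cause.
RISKY_COMPONENTS = {"ner_duckling", "intent_classifier_tensorflow_embedding"}


def current_decimal_point():
    """Return the decimal separator of the active LC_NUMERIC locale.

    This reflects the locale of the running process. Python keeps LC_NUMERIC
    at "C" until something calls setlocale, so call
    locale.setlocale(locale.LC_ALL, "") first to pick up the user's settings.
    """
    return locale.localeconv()["decimal_point"]


def warn_if_locale_risky(component_names, decimal_point=None):
    """Warn if the risky components meet a non-period decimal separator.

    Returns True when a warning was issued, False otherwise.
    """
    if decimal_point is None:
        decimal_point = current_decimal_point()
    if decimal_point != "." and RISKY_COMPONENTS.issubset(component_names):
        warnings.warn(
            "The active locale uses {!r} as decimal separator; training with "
            "{} may fail.".format(decimal_point, sorted(RISKY_COMPONENTS))
        )
        return True
    return False


if __name__ == "__main__":
    pipeline = ["tokenizer_whitespace", "ner_crf", "ner_duckling",
                "intent_classifier_tensorflow_embedding"]
    print("decimal point:", repr(current_decimal_point()))
    print("warned:", warn_if_locale_risky(pipeline))
```

Passing decimal_point explicitly makes the check easy to exercise without switching the process locale. Note that a native library changing the locale after this check runs would not be caught.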

I need to dive deeper into the inner workings of the pipeline to see exactly what happens with ner_duckling and the protobuf part of intent_classifier_tensorflow_embedding.

Once I have more insight into this I'll share it here and we can take it from there for the PR.

Interesting! Thank you very much for doing this. Let me know if we can help

Hi @Ghostvv,

I've set up a clean environment to make sure this was not caused by something else in my setup and that I could reproduce the issue.

I am getting a similar outcome, which so far looks consistent with what was observed previously regarding the comma vs. dot matter.

Could you check if it also happens to you?

I replicated a simplified directory structure in this file:
rasa_issue_1297.zip

There's a bash script that exports all locale variables set to es_ES.UTF-8 and runs a Rasa training with the above-mentioned pipeline configuration and some dummy data.

You can run it just by doing:

./run.sh
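
For readers without the attachment, the same idea can be sketched in Python: copy the environment, point the locale variables at es_ES.UTF-8, and launch a command in a child process. This is not the content of run.sh; the helper names and the choice of LANG, LC_ALL and LC_NUMERIC as the variables to override are assumptions, and the command to run is left to the caller.

```python
import os
import subprocess
import sys


def locale_env(locale_name, base_env=None):
    """Return a copy of the environment with the locale variables overridden."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: overriding these three is enough to mimic the script.
    for var in ("LANG", "LC_ALL", "LC_NUMERIC"):
        env[var] = locale_name
    return env


def run_with_locale(command, locale_name="es_ES.UTF-8"):
    """Run `command` (a list of arguments) under the given locale settings."""
    return subprocess.run(
        command,
        env=locale_env(locale_name),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


if __name__ == "__main__":
    # Placeholder command: replace it with the actual training invocation.
    check = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
    print(run_with_locale(check).stdout.strip())
```

The locale only takes effect in the child process if es_ES.UTF-8 is actually installed on the machine.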

If you could check if you also see the same behaviour that would be great.

I'll continue looking into this in the following days as soon as I have some free time to dive deeper.

Cheers

Thank you very much for your script, but I ran it in a freshly created empty venv and training completed without any errors.

Very interesting!

I just noticed that in the pipeline documentation for versions 0.13+, ner_duckling_http appears among the built-in components instead of ner_duckling.

Running with ner_duckling_http works well for me too.

If the ner_duckling component is no longer used in versions 0.13 and above, I think we are good to close this issue.

Thanks both for your help :)

But I tried with duckling installed with pip and it was also working fine for me.

That's very interesting.
It is quite difficult to pinpoint the real cause, as there are several components involved: the Python Duckling wrapper, Duckling and the JVM, tensorflow and protobuf, ...

I am inclined to think that it might be specific to a particular version.

Nevertheless, with ner_duckling_http it looks solved, at least for the time being. If I run into the issue again we can reopen this and I will happily look into it further.

Cheers
