Rasa: Why are supported languages hardcoded?

Created on 3 Jan 2017 · 12 Comments · Source: RasaHQ/rasa

Hi all,

first of all, thank you for this awesome work!

I'm fairly new to NLP in general, so I've started studying as much as possible to get a grasp of it. Meanwhile, I've started tinkering with rasa_nlu and, after running the provided default examples, I tried to use it with Italian, but I can't specify that in the configuration file (or CLI) because the supported languages are hardcoded (en and de).

I'm aware it needs a total_word_feature_extractor (at least for the MITIE backend). I've generated one from a relatively small Italian corpus, but I still can't use it with rasa_nlu.

I've also read somewhere that it's possible to avoid a predefined language model at the cost of very low quality results, but at the point where I am, it's totally acceptable.

So, is it possible (or is it planned) to support more languages and/or work around the hardcoded languages?

Thanks in advance for the replies, keep up the good job!

andreapavoni

All 12 comments

thanks for creating this issue, andrea! Support for more languages is one of our most requested features.

The reason languages are currently hard coded is that rasa wouldn't work 'as is' just by changing the language. You're totally right that you need a feature extractor for MITIE, but you should also check that the tokenizer works correctly. I would be surprised if the basic one could handle Italian contractions like _dell'arte_ .
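To illustrate the tokenizer concern, here is a minimal sketch, not rasa's actual tokenizer. It compares plain whitespace splitting with a hypothetical rule that splits Italian elisions after the apostrophe. The function names and the regular expression are assumptions made up for this example.

```python
import re

# Hypothetical rule: a run of word characters ending in an apostrophe,
# followed directly by another word character (e.g. "dell'" in "dell'arte").
ELISION = re.compile(r"(\w+')(?=\w)")


def whitespace_tokenize(text):
    """Naive tokenizer: split on whitespace only."""
    return text.split()


def italian_tokenize(text):
    """Whitespace split, then separate elided articles/prepositions."""
    tokens = []
    for chunk in text.split():
        # re.split keeps the captured group; drop the empty strings it leaves.
        tokens.extend(part for part in ELISION.split(chunk) if part)
    return tokens


print(whitespace_tokenize("la commedia dell'arte"))
# ['la', 'commedia', "dell'arte"]
print(italian_tokenize("la commedia dell'arte"))
# ['la', 'commedia', "dell'", 'arte']
```

The naive version keeps _dell'arte_ as a single token, which is the failure mode described above. A real tokenizer would also need to handle punctuation and other edge cases that this sketch ignores.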

I would propose that you create a fork of rasa and add Italian support there, and then create a pull request so we can include it in the main project. It would be excellent if you could make a good feature_extractor available as well :)

Good luck with your project!

amn41 on 3 Jan 2017

Hi again!

thank you for the quick reply. I've spent some hours tinkering on my fork. I've added 'it' support (not sure if in the _right way™_ though) and created an Italian version of the restaurant search training data.

I've also created a MITIE feature_extractor based on a small corpus, its total size is just 25MB.

the commit is here. I won't create a PR until I get some feedback about it :P

I can also provide the Italian feature_extractor by uploading it somewhere along with the raw text corpus I've used.

cheers!

andreapavoni on 4 Jan 2017

Looks cool! Ok, what I have in mind is this.

We can remove the restriction of languages on the server args, so you can basically pass anything you like there. This works because the Trainer classes already check that they support the language, see https://github.com/golastmile/rasa_nlu/blob/master/src/trainers/spacy_sklearn_trainer.py#L15 and https://github.com/golastmile/rasa_nlu/blob/master/src/trainers/mitie_trainer.py#L11

That way you can just subclass the MITIETrainer and add 'it' as a supported language. Does that work?
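A minimal sketch of the subclassing idea, under stated assumptions: the classes below are hypothetical stand-ins written for this example, not the real rasa_nlu trainers, and the `SUPPORTED_LANGUAGES` attribute name is assumed rather than taken from the linked source.

```python
class BaseTrainer:
    """Hypothetical base class: rejects languages it does not list."""

    SUPPORTED_LANGUAGES = set()

    def __init__(self, language):
        if language not in self.SUPPORTED_LANGUAGES:
            raise NotImplementedError(
                "{} does not support language '{}'".format(
                    type(self).__name__, language
                )
            )
        self.language = language


class MitieLikeTrainer(BaseTrainer):
    """Stand-in for a MITIE-backed trainer that only knows English."""

    SUPPORTED_LANGUAGES = {"en"}


class ItalianMitieLikeTrainer(MitieLikeTrainer):
    """Subclass that adds 'it' on top of the parent's languages."""

    SUPPORTED_LANGUAGES = MitieLikeTrainer.SUPPORTED_LANGUAGES | {"it"}


trainer = ItalianMitieLikeTrainer("it")
print(trainer.language)
```

Because the language check lives in the trainer, the server arguments can accept any language code and leave validation to whichever trainer is selected.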

amn41 on 9 Jan 2017

hi @amn41, I would like to understand how I can get started on porting this beast to Danish. I'm a non-coder, so please be gentle when describing the steps I need to take to get started ;)

Thanks!

asssmidt on 19 Jan 2017

👍1

Hi @asssmidt - I'll add some instructions to the README :)

amn41 on 20 Jan 2017

👍6

@andreapavoni are you using the Italian model? how's that going?

amn41 on 16 Feb 2017

Hi @amn41 and thanks for Rasa, it looks great.

However, I'd like to use it in French, so I came by and read the README:

> If you want to add a new language, the key things you need are a tokenizer and a set of word vectors.
> Once you've found those, feel free to create an issue.

It looks like spaCy already has a French tokenizer.
Can I have more info on the set of word vectors?

Thanks a lot, and well done for your work on Rasa!

Gawtier on 20 Feb 2017

There is a group of people actively working on French support. Please email me & I'll introduce you!

amn41 on 20 Feb 2017

👍2

Hi @amn41

Thank you for your responses on these questions! I would like to help with French & Dutch support (since I am living in Belgium). Anything I can do as a non-developer?

DennisPeeters on 20 Feb 2017

Hi,

I sent you an email. Thanks for your quick answer.

Gawtier on 20 Feb 2017

Hi @amn41 I didn't receive it. Feel free to send to dennis.[email protected]

DennisPeeters on 20 Feb 2017

I removed the hard coded languages and replaced them with a warning. That should allow simpler integration of new languages. It doesn't rescue you from training the mentioned word vectors & tokenizers.
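Roughly, replacing a hard restriction with a warning could look like the sketch below. This is an illustration only, not the actual change: `KNOWN_LANGUAGES` and `check_language` are hypothetical names, and the set contains just the two languages mentioned in this thread.

```python
import warnings

# Assumption for this sketch: the languages that ship with ready-made models.
KNOWN_LANGUAGES = {"en", "de"}


def check_language(language):
    """Accept any language code, but warn when it is not a known one."""
    if language not in KNOWN_LANGUAGES:
        warnings.warn(
            "Language '{}' is not officially supported; you need to provide "
            "your own word vectors and tokenizer.".format(language)
        )
    return language


check_language("it")  # emits a UserWarning instead of failing
```

The caller still gets the language back and can proceed, so trying a new language no longer requires patching the argument parser.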

tmbo on 3 Mar 2017
