Rasa: Why are supported languages hardcoded?

Created on 3 Jan 2017 · 12 Comments · Source: RasaHQ/rasa

Hi all,

first of all, thank you for this awesome work!

I'm fairly new to NLP in general, so I've started studying as much as possible to get a grasp of it. Meanwhile, I've been tinkering with rasa_nlu. After running the default examples, I tried to use it with Italian, but I can't specify that language in the configuration file (or on the CLI) because the supported languages are hardcoded (en and de).

I'm aware it needs a total_word_feature_extractor (at least for the MITIE backend). I've generated one from a relatively small Italian corpus, but I still can't use it with rasa_nlu.

I've also read somewhere that it's possible to do without a predefined language model at the cost of very low quality results, but at this stage that's totally acceptable.

So, is it possible (or is it planned) to support more languages and/or work around the hardcoded languages?

Thanks in advance for the replies, keep up the good job!

All 12 comments

Thanks for creating this issue, Andrea! Support for more languages is one of our most requested features.

The reason languages are currently hardcoded is that rasa wouldn't work 'as is' just by changing the language. You're totally right that you need a feature extractor for MITIE, but you should also check that the tokenizer works correctly. I would be surprised if the basic one could handle Italian contractions like _dell'arte_.
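To illustrate the tokenizer concern, here is a minimal sketch in plain Python. Neither function is rasa_nlu's actual tokenizer; `whitespace_tokenize` and `italian_aware_tokenize` are hypothetical names, and the regex is only one assumed way to split an elided form such as _dell'_ from the word that follows. It handles the straight apostrophe only.

```python
import re

def whitespace_tokenize(text):
    """Naive baseline: split on runs of whitespace only."""
    return text.split()

# Assumed pattern: a word ending in an apostrophe (an elided article or
# preposition), or a plain word, or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+'|\w+|[^\w\s]")

def italian_aware_tokenize(text):
    """Split elided forms like "dell'" from the following word."""
    return _TOKEN_RE.findall(text)

phrase = "museo dell'arte"
print(whitespace_tokenize(phrase))      # ['museo', "dell'arte"]
print(italian_aware_tokenize(phrase))   # ['museo', "dell'", 'arte']
```

With the naive split, _dell'arte_ stays a single token, so it would be looked up as one unknown word instead of as _arte_ preceded by a contraction.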

I would propose that you create a fork of rasa and add Italian support there, and then create a pull request so we can include it in the main project. It would be excellent if you could make a good feature_extractor available as well :)

Good luck with your project!

Hi again!

Thank you for the quick reply. I've spent some hours tinkering on my fork. I've added Italian (`it`) support (not sure if in the _right way™_ though) and created an Italian version of the restaurant search training data.

I've also created a MITIE feature_extractor based on a small corpus; its total size is just 25MB.

The commit is here. I won't create a PR until I get some feedback about it :P

I can also provide the Italian feature_extractor by uploading it somewhere, along with the raw text corpus I used.

cheers!

Looks cool! Ok, what I have in mind is this.

We can remove the restriction of languages on the server args, so you can basically pass anything you like there. This works because the Trainer classes already check that they support the language, see https://github.com/golastmile/rasa_nlu/blob/master/src/trainers/spacy_sklearn_trainer.py#L15 and https://github.com/golastmile/rasa_nlu/blob/master/src/trainers/mitie_trainer.py#L11

That way you can just subclass the MITIETrainer and add 'it' as a supported language. Does that work?
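A rough sketch of that idea follows. The `MITIETrainer` class here is a stand-in written for this example, not the real rasa_nlu class; its constructor arguments, the `SUPPORTED_LANGUAGES` attribute, and the `ensure_language_support` method are all assumptions. Only the pattern matters: the trainer checks the language itself, and a subclass widens the allowed set.

```python
class MITIETrainer:
    """Stand-in for the real trainer (assumed shape, not rasa_nlu's code)."""

    SUPPORTED_LANGUAGES = {"en"}

    def __init__(self, fe_file, language_name):
        # The trainer, not the server argument parser, validates the language.
        self.ensure_language_support(language_name)
        self.fe_file = fe_file
        self.language_name = language_name

    def ensure_language_support(self, language_name):
        if language_name not in self.SUPPORTED_LANGUAGES:
            raise NotImplementedError(
                "{} does not support language '{}'".format(
                    type(self).__name__, language_name))


class ItalianMITIETrainer(MITIETrainer):
    """Hypothetical subclass that also accepts Italian."""

    SUPPORTED_LANGUAGES = MITIETrainer.SUPPORTED_LANGUAGES | {"it"}


# The file name below is a placeholder for a locally trained feature extractor.
trainer = ItalianMITIETrainer("total_word_feature_extractor_it.dat", "it")
print(trainer.language_name)
```

The feature extractor file still has to be trained on an Italian corpus; the subclass only lifts the language check.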

Hi @amn41, I would like to understand how I can get started with adapting this beast to Danish. I'm a non-coder, so please be gentle when describing the steps I need to take to get started ;)

Thanks!

Hi @asssmidt - I'll add some instructions to the README :)

@andreapavoni are you using the Italian model? How's that going?

Hi @amn41, and thanks for Rasa, it looks great.

However, I'd like to use it in French, so I came by and read the README:

> If you want to add a new language, the key things you need are a tokenizer and a set of word vectors.
> Once you've found those, feel free to create an issue.

It looks like spaCy already has a French tokenizer.
Can I get more info on the set of word vectors?

Thanks a lot, and well done for your work on Rasa!

There is a group of people actively working on French support. Please email me & I'll introduce you!

Hi @amn41

Thank you for your responses on these questions! I would like to help with French & Dutch support (since I am living in Belgium). Anything I can do as a non-developer?

Hi,

I sent you an email. Thanks for your quick answer.

Hi @amn41, I didn't receive it. Feel free to send it to dennis.[email protected]

I removed the hardcoded languages and replaced them with a warning. That should make it simpler to integrate new languages. It doesn't spare you from training the word vectors and tokenizers mentioned above.
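As an illustration of what "replaced them with a warning" could look like, here is a minimal sketch using Python's standard `warnings` module. The names `KNOWN_LANGUAGES` and `check_language` are hypothetical and are not taken from the rasa_nlu code base; the list of `en` and `de` comes from the languages mentioned at the top of this thread.

```python
import warnings

# Languages with ready-made support, per the original report (en and de).
KNOWN_LANGUAGES = ("en", "de")

def check_language(language_name):
    """Accept any language code, but warn when it is not a known one."""
    if language_name not in KNOWN_LANGUAGES:
        warnings.warn(
            "Language '{}' is not one of {}. You need to provide your own "
            "word vectors and tokenizer.".format(language_name, KNOWN_LANGUAGES),
            UserWarning,
        )
    return language_name
```

Calling `check_language("it")` returns `"it"` and emits a `UserWarning` rather than rejecting the value, so a custom backend further down can still decide whether it supports the language.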
