Rasa: what are the steps to add pt-BR (Brazilian Portuguese) to Rasa?

Created on 1 Mar 2017 · 7 comments · Source: RasaHQ/rasa

I intend to work with pt-BR =)
I have a tokenizer (nlpnet) for pt-BR o/


All 7 comments

Here is the process I followed for French (thanks to @amn41 and @davisking):

  1. Get a reasonably clean language corpus (a Wikipedia dump works) as a set of text files
  2. Build and run MITIE's wordrep tool on your corpus. This can take several hours/days depending on your dataset and your workstation. You'll need something like 128 GB of RAM for wordrep to run - yes, that's a lot: try to extend your swap -
  3. Set the path of your new total_word_feature_extractor.dat as the value of the mitie_file parameter in config.mitie.json

You are good to go using the MITIE backend. As for spaCy, you'll need a tokenizer (MITIE is all-inclusive), and the procedure seems more complex.

It seems to work for me at the moment but I need to test it more thoroughly.
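For the last step, the MITIE config file just needs to point at the new extractor. A minimal sketch (only the mitie_file key is confirmed by the comment above; the backend key and the path are assumptions based on early rasa NLU configs):

```json
{
  "backend": "mitie",
  "mitie_file": "./data/total_word_feature_extractor.dat"
}
```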

@PHLF If you don't mind, I'd like to create a new page in the documentation adding these notes as a reference.

Hey @PHLF, is the French total_word_feature_extractor.dat available somewhere? That would save me a lot of time...
Thanks!

I can't help you with that: as I built it with my current company's resources, the file doesn't belong to me. Moreover, I used the French Wikipedia as a corpus, so as I said, I doubt the language model is very robust to bad user input/typos. Nevertheless, if you want to follow the same process yourself, I can be more specific:

  1. Get a fresh dump of the French Wikipedia
  2. Use a tool to transform it into usable input for MITIE
    2.1 How to remove the doc tags
  3. Put all the files you obtained in the same folder: this will be MITIE's input folder (see below)
  4. If you don't have enough RAM (~128 GB), increase your swap. I had "as little as" 32 GB of RAM on my workstation, so even if swap is extremely slow, it should work.
  5. Compile and run MITIE's wordrep on your fresh wiki dump: wordrep -e /MITIE/input/folder. Be patient, as it may take a few days without outputting anything (it took 2.5 days for me on a high-end Intel i7 quad-core CPU).

Following this recipe, you'll get a total_word_feature_extractor.dat
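As a rough sketch, the recipe above might translate into a shell session like the following. The dump URL, the extraction tool (WikiExtractor), and all paths are assumptions, not from this thread; only the wordrep -e invocation is quoted from the steps above. Expect the final command to run for days.

```shell
# 1. Get a fresh dump of the French Wikipedia (URL is an assumption)
wget https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2

# 2. Turn the dump into plain-text files. WikiExtractor is one commonly
#    used tool for this (the thread does not name a specific tool);
#    it also strips the <doc> tags (step 2.1).
python WikiExtractor.py frwiki-latest-pages-articles.xml.bz2 -o corpus/

# 3. corpus/ is now the MITIE input folder.

# 4. If you have less than ~128 GB of RAM, extend your swap first.

# 5. Build MITIE's wordrep tool and run it on the corpus
#    (it may produce no output for days while it works).
git clone https://github.com/mit-nlp/MITIE.git
cd MITIE/tools/wordrep
mkdir build && cd build && cmake .. && cmake --build .
./wordrep -e ../../../../corpus
```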

Hope this helps

OK, I understand, thanks for the help!

Just to mention that I am using Rasa in French (spaCy model + duckling) and it works, apart from a minor issue mentioned in #376.
So French could be added to the documentation
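For reference, a French rasa NLU configuration from that era might have looked something like the sketch below. The key names varied across early versions (and the thread does not show a config), so treat this as a hypothetical example, not a confirmed file:

```json
{
  "language": "fr",
  "pipeline": "spacy_sklearn"
}
```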
