Rasa: what are the steps to add pt-BR (Brazilian Portuguese) to Rasa?

Created on 1 Mar 2017 · 7 comments · Source: RasaHQ/rasa

I intend to work with pt-BR =)
I have a tokenizer (nlpnet) for pt-BR o/


All 7 comments

Here is the process I followed for French (thanks to @amn41 and @davisking):

  1. Get a reasonably clean language corpus (a Wikipedia dump works) as a set of text files
  2. Build and run MITIE's wordrep tool on your corpus. This can take several hours/days depending on your dataset and your workstation. You'll need something like 128 GB of RAM for wordrep to run - yes, that's a lot: try to extend your swap -
  3. Set the path of your new total_word_feature_extractor.dat as the value of the mitie_file parameter in config.mitie.json

You are good to go using the MITIE backend. As for spaCy, you'll need a tokenizer (MITIE is all-inclusive), and the procedure seems more complex.

It seems to work for me at the moment but I need to test it more thoroughly.
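For the last step, the MITIE config file just needs to point at the new extractor. A minimal sketch (only the mitie_file key is confirmed by the comment above; the backend key and the path are assumptions based on early rasa NLU configs):

```json
{
  "backend": "mitie",
  "mitie_file": "./data/total_word_feature_extractor.dat"
}
```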

@PHLF If you don't mind, I'd like to create a new page in the documentation adding these notes as a reference.

Hey @PHLF, is the French total_word_feature_extractor.dat available somewhere? That would save me a lot of time...
Thanks!

I can't help you with that: as I built it with my current company's resources, the file doesn't belong to me. Moreover, I used the French Wikipedia as a corpus, so as I said, I doubt the language model is very robust to bad user input/typos. Nevertheless, if you want to follow the same process yourself, I can be more specific:

  1. Get a fresh dump of the French Wikipedia
  2. Use a tool to transform it into usable input for MITIE
    2.1 How to remove the doc tags
  3. Put all the files you obtained in the same folder: this will be MITIE's input folder (see below)
  4. If you don't have enough RAM (~128 GB), increase your swap. I had "as little as" 32 GB of RAM on my workstation, so even if swap is extremely slow, it should work.
  5. Compile and run MITIE's wordrep on your fresh wiki dump: wordrep -e /MITIE/input/folder. Be patient, as it may take a few days without outputting anything (it took 2.5 days for me on a high-end Intel i7 quad-core CPU).

Following this recipe, you'll get a total_word_feature_extractor.dat
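As a rough sketch, the recipe above might translate into a shell session like the following. The dump URL, the extraction tool (WikiExtractor), and all paths are assumptions, not from this thread; only the wordrep -e invocation is quoted from the steps above. Expect the final command to run for days.

```shell
# 1. Get a fresh dump of the French Wikipedia (URL is an assumption)
wget https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2

# 2. Turn the dump into plain-text files. WikiExtractor is one commonly
#    used tool for this (the thread does not name a specific tool);
#    it also strips the <doc> tags (step 2.1).
python WikiExtractor.py frwiki-latest-pages-articles.xml.bz2 -o corpus/

# 3. corpus/ is now the MITIE input folder.

# 4. If you have less than ~128 GB of RAM, extend your swap first.

# 5. Build MITIE's wordrep tool and run it on the corpus
#    (it may produce no output for days while it works).
git clone https://github.com/mit-nlp/MITIE.git
cd MITIE/tools/wordrep
mkdir build && cd build && cmake .. && cmake --build .
./wordrep -e ../../../../corpus
```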

Hope this helps

OK, I understand, thanks for the help!

Just to mention that I am using Rasa in French (spaCy model + duckling) and it works, apart from a minor issue mentioned in #376.
So French could be added to the documentation
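For reference, a French rasa NLU configuration from that era might have looked something like the sketch below. The key names varied across early versions (and the thread does not show a config), so treat this as a hypothetical example, not a confirmed file:

```json
{
  "language": "fr",
  "pipeline": "spacy_sklearn"
}
```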
