I intend to work with pt-br =)
I have a tokenizer (NLPNET) for pt-br o/
Here is the process I followed for French (thanks to @amn41 and @davisking):
You are good to go using the MITIE backend. As for spaCy, you'll need a separate tokenizer (MITIE is all-inclusive) and the procedure seems more complex.
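For reference, a MITIE-backed rasa_nlu setup is driven by a small JSON config along these lines (the exact key names such as `mitie_file` and the pipeline template name should be checked against your rasa_nlu version's docs; the path here is illustrative):

```json
{
  "language": "fr",
  "pipeline": "mitie",
  "mitie_file": "data/total_word_feature_extractor.dat"
}
```

The `mitie_file` is the feature extractor discussed below; without it the MITIE pipeline cannot train.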
It seems to work for me at the moment but I need to test it more thoroughly.
@PHLF If you don't mind, I'd like to create a new page in the documentation adding these notes as a reference.
The help page can be found here:
Hey @PHLF, is the French total_word_feature_extractor.dat available somewhere? That would save me a lot of time...
Thanks!
I can't help you with that: I built it with my current company's resources, so the file doesn't belong to me. Moreover, I used the French Wikipedia as a corpus, so as I said, I doubt the language model is very robust to bad user input or typos. Nevertheless, if you want to run the same process yourself, I can be more specific:
Run `wordrep -e /MITIE/input/folder`. Be patient, as it may take a few days without outputting anything (it took 2.5 days for me on a high-end Intel i7 quad-core CPU). Following this recipe, you'll get a total_word_feature_extractor.dat.
Hope this helps
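For anyone retracing this, the recipe can be sketched as a shell session. The dump URL, output paths, and the use of WikiExtractor to turn the XML dump into plain text are illustrative assumptions, not something specified in this thread:

```shell
# 1. Download a French Wikipedia dump (illustrative URL; pick a current dump)
wget https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2

# 2. Strip the XML markup down to plain text files, e.g. with WikiExtractor
python WikiExtractor.py -o /MITIE/input/folder frwiki-latest-pages-articles.xml.bz2

# 3. Build the word representations with MITIE's wordrep tool
#    (built from tools/wordrep in the MITIE repo; expect this to run for days)
./wordrep -e /MITIE/input/folder
```

The result is a total_word_feature_extractor.dat in the working directory, which you then point rasa_nlu's MITIE backend at.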
OK, I understand. Thanks for the help!
Just to mention that I am using Rasa in French (spaCy model + duckling) and it works. There's just a minor issue, mentioned in #376.
So French could be added to the documentation
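For reference, a spaCy-based French setup in rasa_nlu is configured roughly like this (the `spacy_sklearn` template name is an assumption based on the standard pipeline templates of that era; verify against your version's docs):

```json
{
  "language": "fr",
  "pipeline": "spacy_sklearn"
}
```

Duckling-based date/number extraction was typically enabled by adding its component to a custom pipeline list rather than via the template shorthand.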