Deeplearning4j: ngram vectors of FastText

Created on 7 Nov 2017  ·  14 Comments  ·  Source: eclipse/deeplearning4j

Issue Description

Hello,
I'm looking to use a FastText-generated model in my code, which is based on deeplearning4j. As you know,
FastText models support vector lookups for out-of-vocabulary words by summing the vectors of the character n-grams belonging to the word.
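
For readers unfamiliar with the subword idea, here is a minimal sketch in plain Java. It assumes fastText's default n-gram lengths (3 to 6) and its convention of wrapping a word in `<` and `>` before slicing it. The class `SubwordSketch`, the `table` map, and the toy vectors are hypothetical stand-ins for illustration; this is not deeplearning4j or fastText API, and the real fastText stores n-grams in a hashed bucket table rather than a string-keyed map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubwordSketch {

    // Collect every character n-gram of length minN..maxN from "<word>".
    static List<String> charNgrams(String word, int minN, int maxN) {
        String wrapped = "<" + word + ">";
        List<String> ngrams = new ArrayList<>();
        for (int start = 0; start < wrapped.length(); start++) {
            for (int n = minN; n <= maxN && start + n <= wrapped.length(); n++) {
                ngrams.add(wrapped.substring(start, start + n));
            }
        }
        return ngrams;
    }

    // Sum the vectors of all n-grams found in the (hypothetical) table.
    // N-grams missing from the table are skipped.
    static float[] sumNgramVectors(String word, Map<String, float[]> table,
                                   int dim, int minN, int maxN) {
        float[] sum = new float[dim];
        for (String ngram : charNgrams(word, minN, maxN)) {
            float[] v = table.get(ngram);
            if (v == null) {
                continue;
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional n-gram vectors, made up for illustration only.
        Map<String, float[]> table = new HashMap<>();
        table.put("mus", new float[] {1.0f, 2.0f});
        table.put("ic>", new float[] {0.5f, 0.5f});

        System.out.println(charNgrams("music", 3, 6));
        float[] vec = sumNgramVectors("music", table, 2, 3, 6);
        System.out.println(vec[0] + " " + vec[1]);
    }
}
```

Because the lookup goes through n-grams rather than whole words, a word that was never seen in training can still get a non-zero vector, as long as some of its n-grams were seen.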

The problem is that if I give it a word that is not in the model, deeplearning4j throws an exception.

So please, how can I get deeplearning4j to handle the subwords (n-grams)?

Thank you

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version: 0.9.1
Enhancement

Most helpful comment

As a workaround, if you know your vocabulary in advance and have access to pre-trained embeddings, you can extract the necessary subset from the .bin model and get a format readable by deeplearning4j via:

./fasttext print-word-vectors model.bin < vocabulary.txt > vectors.tsv
  • model.bin is the full model with subword information (trained via fasttext).
  • vocabulary.txt is a text file with your vocabulary (one token type per line).
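  • vectors.tsv is a space-separated text file (despite the .tsv extension) with one line per token in your vocabulary: the token followed by its vector, which fastText computes from the subword information.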

The file vectors.tsv can then be loaded into deeplearning4j via WordVectorSerializer.loadStaticModel(vectors.tsv) (tested on version 0.9.1).
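
If you want to sanity-check vectors.tsv before handing it to deeplearning4j, a small plain-Java reader is enough. The sketch below assumes the layout described above (token first, then the vector components, all separated by whitespace). The class `VectorFileSketch` and the sample lines are hypothetical and made up for illustration; it does not use or replace `WordVectorSerializer`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectorFileSketch {

    // Parse lines of the form "token v1 v2 ... vn" into a token -> vector map.
    // Blank lines and lines without any vector components are skipped.
    static Map<String, float[]> parse(List<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue;
            }
            float[] v = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                v[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Made-up sample lines; a real file could be read with
        // java.nio.file.Files.readAllLines(java.nio.file.Paths.get("vectors.tsv")).
        List<String> sample = Arrays.asList(
                "music 0.1 -0.2 0.3",
                "",
                "guitar 0.05 0.0 -0.1");
        Map<String, float[]> vectors = parse(sample);
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " dimensions");
        }
    }
}
```

Checking that every token has the same number of dimensions is a quick way to catch a truncated or malformed export.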

All 14 comments

Sorry, at this moment we don’t have FastText implementation.

@raver119,
Thank you for your reply. Do you plan to add this feature soon, or is it not on your roadmap?
Thank you again

FYI, it wouldn't be too hard getting a Java interface to the C++ library, if that's all you're looking for:
https://github.com/bytedeco/javacpp-presets/issues/346

We can import fast text from keras though. /cc @maxpumperla

Hi @TamouzeAssi, I also need this, and I'm about to try JFastText, a Java interface to the C++ library:
https://github.com/vinhkhuc/JFastText/

Thanks @namsor!

The JNI interface is built using javacpp.

Would be nice to add it to the presets!

/cc @vinhkhuc

Thank you, sir @namsor! I'm trying to install it but I'm getting some errors. I hope it will work :) Thank you again.

For me, it worked. Installation went fairly smoothly on Linux, but compiling FastText and the native wrapper on Windows 10 was a pain. What is your target build?

@namsor,
When running JFastText we get the following error: JFastText: jniFastTextWrapper in java.library.path

Did you face something like that? It seems that the JFastText project is not supported.

Hi! No, we don't get that error. All the native libs should be included in the JAR files under the JFastText\target directory: jfasttext-0.4-SNAPSHOT.jar or jfasttext-0.4-SNAPSHOT-jar-with-dependencies.jar

BTW, I was thinking of writing a native Java reader for FastText BIN files, but there already seems to be something at https://github.com/ivanhk/fastText_java

@namsor,

Thanks so much. I just copied the two JARs you mentioned and used them in my Jython project, which is based on deeplearning4j, and it works very well.

One last question, if you happen to know: we can get the embedding of a given word with jft.getVector(word), and if the word isn't in the vocabulary we get a zero vector. But I'm looking to get the embedding for an n-gram.

For example, if the given word is "music", how can I get the embedding for the n-gram "mus"? I tried it and also got a zero vector.

Has anything been developed in this direction, that is, using fastText to calculate embeddings for character n-grams?
