Deeplearning4j: ngram vectors of FastText

Created on 7 Nov 2017 · 14 Comments · Source: eclipse/deeplearning4j

Issue Description

Hello,
I'm looking to use a FastText-generated model in my code, which is based on deeplearning4j. As you know,
FastText models support vector lookups for out-of-vocabulary words by summing up the character n-grams belonging to the word.

The problem is that if I give a word that is not in the model, deeplearning4j throws an exception.

So please, how can I get deeplearning4j to handle the subwords (n-grams)?

Thank you
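
For readers unfamiliar with the mechanism described above, here is a minimal Java sketch of composing a vector for an out-of-vocabulary word from its character n-grams. It assumes the word is wrapped in the boundary markers < and > and that n-gram lengths run from 3 to 6 (fastText's documented defaults for minn and maxn). The class SubwordSketch, its method names, and the in-memory map of n-gram vectors are hypothetical illustrations; they are not part of deeplearning4j or fastText.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a deeplearning4j or fastText class.
class SubwordSketch {

    // Character n-grams of "<word>" with lengths minN..maxN (inclusive).
    // Assumption: works on Java chars, so it is only exact for text without surrogate pairs.
    static List<String> charNgrams(String word, int minN, int maxN) {
        String padded = "<" + word + ">";
        List<String> out = new ArrayList<>();
        for (int start = 0; start < padded.length(); start++) {
            for (int n = minN; n <= maxN && start + n <= padded.length(); n++) {
                out.add(padded.substring(start, start + n));
            }
        }
        return out;
    }

    // Sums the vectors of every n-gram found in the table; unknown n-grams are skipped.
    static float[] composeVector(String word, Map<String, float[]> ngramVectors, int dim) {
        float[] sum = new float[dim];
        for (String gram : charNgrams(word, 3, 6)) {
            float[] v = ngramVectors.get(gram);
            if (v == null) {
                continue;
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy two-dimensional n-gram table (made-up values).
        Map<String, float[]> table = new HashMap<>();
        table.put("<mu", new float[] {1f, 0f});
        table.put("mus", new float[] {0f, 2f});
        float[] v = composeVector("music", table, 2);
        System.out.println(v[0] + " " + v[1]);
    }
}
```

With such a scheme, a word that never appeared in training still gets a non-zero vector as long as at least one of its n-grams is known.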

Version Information

Please indicate relevant versions, including, if relevant:

Deeplearning4j version: 0.9.1

Enhancement

Source

ali3assi

👍3

Most helpful comment

As a workaround, if you know your vocabulary in advance and have access to pre-trained embeddings, you can extract the necessary subset from the .bin model and get a format readable by deeplearning4j via:

./fasttext print-word-vectors model.bin < vocabulary.txt > vectors.tsv

model.bin is the full model with subword information (trained via fasttext).
vocabulary.txt is a text file with your vocabulary (one token type per line).
vectors.tsv is a space-separated text file with sub-word embeddings for your vocabulary.

The file vectors.tsv can then be loaded into deeplearning4j via WordVectorSerializer.loadStaticModel(vectors.tsv) (tested on version 0.9.1).
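
To inspect the exported file before handing it to deeplearning4j, a plain-Java reader can help. The sketch below assumes each line holds a token followed by its space-separated vector components, as described above; the class VectorFileSketch and its read method are hypothetical and do not replace WordVectorSerializer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a deeplearning4j class.
class VectorFileSketch {

    // Reads lines of the assumed form "token v1 v2 ... vn" into a map.
    static Map<String, float[]> read(BufferedReader in) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            float[] v = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                v[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content standing in for vectors.tsv.
        String sample = "music 0.1 0.2\njazz 0.3 0.4 \n";
        Map<String, float[]> vectors = read(new BufferedReader(new StringReader(sample)));
        System.out.println(vectors.keySet());
    }
}
```

Note that a word missing from vocabulary.txt is also missing from the exported file, so this workaround only covers a vocabulary known in advance.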

michelole on 23 Apr 2019

👍2

All 14 comments

Sorry, at the moment we don't have a FastText implementation.

raver119 on 7 Nov 2017

@raver119,
Thank you for your reply. Do you plan to add this feature soon, or is it not in your plans?
Thank you again

ali3assi on 7 Nov 2017

FYI, it wouldn't be too hard to get a Java interface to the C++ library, if that's all you're looking for:
https://github.com/bytedeco/javacpp-presets/issues/346

saudet on 8 Nov 2017

We can import fastText from Keras though. /cc @maxpumperla

agibsonccc on 8 Nov 2017

Hi @TamouzeAssi, I also need this, and I'm about to try JFastText, a Java interface to the C++ library:
https://github.com/vinhkhuc/JFastText/

namsor on 8 Nov 2017

Thanks @namsor!

The JNI interface is built using javacpp.

Would be nice to add it to the presets!

/cc @vinhkhuc

saudet on 9 Nov 2017

Thank you, @namsor! I'm trying to install it but I'm getting some errors. Hope it will work :) Thank you again.

ali3assi on 9 Nov 2017

For me, it worked. Installing on Linux went rather well, but compiling FastText and the native wrapper on Windows 10 was a pain. What is your target build?

namsor on 9 Nov 2017

@namsor,
When running JFastText we get the following error: JFastText: jniFastTextWrapper in java.library.path

Did you face something like that? It seems that the JFastText project is not supported.
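
An error that mentions java.library.path usually means the JVM could not locate a native library. A small diagnostic sketch, using only standard JDK calls, is shown below; the class NativeLibCheck is hypothetical, and the library name jniFastTextWrapper is taken from the error message above. Keep in mind that a wrapper may also load its native library from inside its JAR, in which case this check can fail even though the wrapper itself works.

```java
// Hypothetical diagnostic helper; uses only standard JDK calls.
class NativeLibCheck {

    // Returns true if the JVM can load the named library from java.library.path.
    static boolean canLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "jniFastTextWrapper"; // name taken from the error message
        // Directories the JVM searches for native libraries.
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
        // Platform-specific file name the JVM looks for (differs per operating system).
        System.out.println("expected file     = " + System.mapLibraryName(name));
        System.out.println("loadable          = " + canLoad(name));
    }
}
```

If the expected file exists somewhere on disk, starting the JVM with -Djava.library.path pointing at its directory is the usual way to make it visible.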

ali3assi on 17 Nov 2017

Hi! No, we don't have that error. All the native libs should be included in the JAR files under the JFastText\target directory: jfasttext-0.4-SNAPSHOT.jar or jfasttext-0.4-SNAPSHOT-jar-with-dependencies.jar

namsor on 17 Nov 2017

BTW, I was thinking of writing a native Java reader for FastText BIN files, but there already seems to be something at https://github.com/ivanhk/fastText_java

namsor on 17 Nov 2017

@namsor,

Thanks so much. I just copied the two JARs you mentioned and used them in my Jython project based on deeplearning4j, and it works very well.

One last question, if you know: we can get the embedding of a given word with jft.getVector(word), and if the word isn't in the vocabulary we get a vector of zeros. But I'm looking to get the embedding for an n-gram.

For example, if the given word is "music", how can I get the embedding for the n-gram "mus"? I tried, and I also got a vector of zeros.
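
Some background that may explain why a word lookup does not return an n-gram's own vector: as far as I understand fastText's dictionary code, n-grams are not vocabulary entries. Each n-gram is hashed with an FNV-1a variant into one of a fixed number of buckets (2,000,000 by default), and its vector is the row at vocabulary size plus bucket index. A string passed in as a word is first wrapped as "<mus>" and split into its own n-grams, which differs from the single n-gram "mus" inside "<music>". The sketch below mirrors that hashing under those assumptions; the class NgramBucketSketch and its method names are hypothetical, and whether JFastText exposes the underlying rows is not something this sketch assumes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; mirrors my understanding of fastText's n-gram hashing.
class NgramBucketSketch {

    // 32-bit FNV-1a over the UTF-8 bytes. Each byte is sign-extended before the XOR,
    // which matches the int8_t cast I believe fastText uses.
    static int fnv1a(String s) {
        int h = 0x811C9DC5; // offset basis 2166136261 as a 32-bit pattern
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= b;
            h *= 16777619; // FNV prime; overflow wraps modulo 2^32
        }
        return h;
    }

    // Bucket index in [0, bucketCount), treating the hash as unsigned.
    static int bucketOf(String ngram, int bucketCount) {
        return Integer.remainderUnsigned(fnv1a(ngram), bucketCount);
    }

    // Row of the n-gram in the input matrix, assuming word rows come first.
    static int rowOf(String ngram, int vocabSize, int bucketCount) {
        return vocabSize + bucketOf(ngram, bucketCount);
    }

    public static void main(String[] args) {
        int bucketCount = 2_000_000; // fastText's default bucket setting
        int vocabSize = 100_000;     // made-up vocabulary size
        System.out.println("bucket of \"mus\": " + bucketOf("mus", bucketCount));
        System.out.println("row of \"mus\":    " + rowOf("mus", vocabSize, bucketCount));
    }
}
```

Because several n-grams can share a bucket, the row found this way is not guaranteed to belong to "mus" alone.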

ali3assi on 17 Nov 2017

Has anything been developed in this direction, i.e. using fastText to calculate embeddings for character n-grams?

avanco on 13 Apr 2018

👍1
