I am currently working on a sentiment analysis project. I heard about the fastText classifier and wanted to see how well it performs on my dataset.
So I went through the code proposed at https://pypi.python.org/pypi/fasttext:
Let's say my text data and labels are stored in 'data', a pandas DataFrame.
I split it into train/test sets and write them to two .txt files.
Each element of my dataset corresponds to one line of the .txt file, prefixed with its label, '__label__pos' or '__label__neg' (as in the examples from the fastText documentation).
import numpy as np
# sklearn.cross_validation in older scikit-learn releases
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    np.asarray(data['comment']), np.asarray(data['target']), test_size=0.2)

# One example per line: "<label> <text>".
# 'w' rather than 'a', so re-running does not append duplicate lines.
with open('train__Data.txt', 'w') as f:
    for i, X in enumerate(x_train):
        f.write(y_train[i] + ' ' + X)
        f.write('\n')
with open('test__Data.txt', 'w') as f:
    for i, X in enumerate(x_test):
        f.write(y_test[i] + ' ' + X)
        f.write('\n')
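fastText reads one example per line, so a comment that itself contains a line break would be split into several lines, some of them without a label. A slightly more defensive way to write the files could look like the sketch below. The helper names `to_fasttext_line` and `write_fasttext_file` are hypothetical, and the sketch assumes the labels already carry the `__label__` prefix.

```python
import io


def to_fasttext_line(label, text):
    """Return one '<label> <text>' line with inner whitespace and line breaks flattened."""
    flat = " ".join(str(text).split())
    return "{} {}".format(label, flat)


def write_fasttext_file(path, texts, labels):
    """Write one labelled example per line as UTF-8, overwriting any existing file."""
    with io.open(path, "w", encoding="utf-8") as f:
        for label, text in zip(labels, texts):
            f.write(to_fasttext_line(label, text) + "\n")
```

With the variables from the post, this would be called as `write_fasttext_file('train__Data.txt', x_train, y_train)`.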
So now, I should have everything prepared to run the training session.
import fasttext
classifier = fasttext.supervised('train__Data.txt', 'model')
That piece of code does run and takes a little while. It also successfully creates the model.bin file (it doesn't create model.vec, but I've read that this is normal since they removed it for supervised mode).
Now I should be ready to test it and/or get some predictions. Let's first run:
result = classifier.test('test__Data.txt')
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)
Which gives me the following output:
P@1: nan
R@1: nan
Number of examples: 0
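A result of zero examples means that no labelled lines were counted from the test file. One quick way to rule out a malformed file is to count how many non-empty lines start with the label prefix. This is only a sketch: the helper name `count_labelled_lines` is hypothetical and the default `__label__` prefix is assumed. A clean result here says nothing about the bindings themselves.

```python
import io


def count_labelled_lines(path, prefix="__label__"):
    """Return (labelled, total) counts over the non-empty lines of a UTF-8 file."""
    labelled = 0
    total = 0
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            total += 1
            if line.startswith(prefix):
                labelled += 1
    return labelled, total
```

If `count_labelled_lines('test__Data.txt')` returns two equal, non-zero numbers, the file layout itself is unlikely to be the cause.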
My .txt files are correctly encoded as UTF-8. I also tried fastText for other purposes, such as learning embeddings from a training file, and the same thing happened: the code seems to run just fine and takes a little while, but then it cannot output any vector representation.
train__Data.txt and test__Data.txt are about 70 and 20 KB, and the model.bin output is about 800 KB. So something does happen; I just can't figure out how to test it or print some results!
I've gone through the reported issues but couldn't find a similar case.
Am I missing something?
Hello @aylliote, thank you for your post. Since we don't own these bindings we won't be able to help you resolve this issue. Please follow up with the maintainers of the Python bindings that you are using. I'm going to close this issue for now, but please feel free to reopen it at any time if the issue can be traced back to the fastText binary itself.
Thank you for your reply.
I understand, but I don't suspect that the problem comes from the pandas library or the Python I/O functions I'm using, since my final files train__Data.txt and test__Data.txt have exactly the same structure and encoding as the files shown in the fastText tutorial examples.
So if someone has run into the same issue, I would be grateful for their help.
Thank you
Hey @aylliote, can you reproduce the issue using the fasttext binary itself?
I'll give it a try tomorrow and report back here on how it went; worth trying for sure.
It worked without any problem using the fasttext binary, so the problem must be with the Jupyter notebook environment, I guess.
And I understand this is not a 'fasttext' issue, so I'm closing it. Thanks for the replies!
Hi aylliote,
I ran into the same issue, like many other people. Would you care to explain the solution, or share any other ideas you may have on this matter? I want to use fastText in my Python project, but I cannot because of the issue described above.
regards,
Hi sibi123,
Unfortunately I couldn't solve the problem I reported here. The fact is, as cpuhrsch said, the problem doesn't come from the fastText project itself but from the Python wrapper we are using. So it would be better to look for a solution in the issues section of the Python wrapper (https://github.com/salestock/fastText.py/issues).
What I did to work around my problem was to write my own Python wrapper. Well, I just execute the commands of the fasttext binary from some Python code, which is not optimal, I guess, but was okay for the tests I wanted to do. Let me know if you are interested in that piece of code.
I hope you'll be able to solve your problem.
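For illustration, the approach described here (calling the fasttext binary from Python) could be sketched roughly as follows. This is not the wrapper mentioned in the post: the function names are hypothetical, and it assumes the `fasttext` executable is on the PATH. It relies only on the `supervised`, `test` and `predict` subcommands of the command-line tool and returns the raw output, since the exact output format can differ between versions.

```python
import subprocess

# Assumption: the compiled binary is reachable as "fasttext"; otherwise use its full path.
FASTTEXT_BIN = "fasttext"


def build_train_command(train_path, model_prefix, binary=FASTTEXT_BIN):
    """Command line that trains a supervised model and writes <model_prefix>.bin."""
    return [binary, "supervised", "-input", train_path, "-output", model_prefix]


def build_test_command(model_prefix, test_path, binary=FASTTEXT_BIN):
    """Command line that evaluates <model_prefix>.bin on a labelled test file."""
    return [binary, "test", model_prefix + ".bin", test_path]


def build_predict_command(model_prefix, input_path, binary=FASTTEXT_BIN):
    """Command line that prints the most likely label for each line of a file."""
    return [binary, "predict", model_prefix + ".bin", input_path]


def run(command):
    """Run a command and return its standard output as text.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    return subprocess.check_output(command).decode("utf-8")


# Example usage (requires the binary and the data files to exist):
#   run(build_train_command("train__Data.txt", "model"))
#   print(run(build_test_command("model", "test__Data.txt")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.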
Hi aylliote,
Yes, I am interested in the code (optimal or not); I have to check the performance of fastText for my project. I highly appreciate your help and contribution.
regards,
Sibtain
Here is the code I used. I tried to comment it a little, but I don't know if everything is clear. Feel free to ask if there's something you don't get. Also, the paths are hard-coded; you will have to replace them to suit the paths on your system.
Also, I saw you're using a pandas DataFrame. The code above works if the arguments are Python lists or arrays, so you will have to take care of this.
Thanks a lot aylliote, I will come back to you if I have any more questions.
I'm facing the same issue. My environment is Windows, Python 3.5.