spaCy: Feature request: pretrained English sentiment model

Created on 22 Jan 2017 · 10 comments · Source: explosion/spaCy

Having pretrained GloVe vectors easily accessible is great for quick experimentation. It would be great if there were a pretrained sentiment model I could use too. Right now, all the lexemes in the English vocabulary have their sentiment set to 0.

enhancement

All 10 comments

Hi Pete,

First, thanks for your work on React :). I remember seeing your talks right after it came out and thinking it really made sense.

Pre-trained models for a variety of languages, genres and use-cases are actually the main commercial offering we're working on for spaCy. You'll be able to download a data pack for a small one-time fee, and you'll get 12 months of upgrades as they're published.

After you download the data, it's yours — you can run the model however you like, without pinging an external service. Crucially, you'll also be able to backpropagate into it, something that no cloud provider will be able to offer you.

Timelines are always tricky, but think weeks, not months :).

Our data packs will have sentiment models you'll be able to use out-of-the-box. However, the model will get much, much better on your use case if you "fine-tune" it on your own data. There's not really any such thing as "sentiment" in general. The exact behaviours you need from the model will be specific to your application. The design we're going for is that the pre-trained model gives you the basic knowledge about the language and the world, and your own data programs the system to do what you need.

To get you moving for now, the code in this example is a pretty good sentiment model for long texts: https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py. It projects the labels down from the document level to the sentence level, and then uses a bidirectional LSTM to encode position-sensitive features onto the words. This means the model is capable of seeing that "charge" is positive in some contexts as a noun, but "charge back" is almost always negative. The position-sensitive features are then pooled, and a model predicts over the resulting vector for each sentence. The document prediction is a simple sum of the sentence predictions.

I would recommend using this "bag of sentences" approach in most long-document scenarios. Current models don't get useful signal between sentences, and predicting the sentences in parallel is a huge improvement for tractability.
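For reference, here's a condensed sketch of the per-sentence model in the spirit of that example. The layer sizes and the vocab_size/embedding_dim placeholders are illustrative; the real script loads the embedding weights from spaCy's vocab via get_embeddings:

    from keras.models import Sequential
    from keras.layers import Embedding, TimeDistributed, Dense, Bidirectional, LSTM

    max_length = 100                        # tokens per sentence
    nr_hidden = 64
    vocab_size, embedding_dim = 20000, 300  # placeholders; the example uses spaCy's vectors

    model = Sequential()
    # Frozen pre-trained vectors: the model reads them but doesn't update them.
    model.add(Embedding(vocab_size, embedding_dim, input_length=max_length,
                        trainable=False))
    model.add(TimeDistributed(Dense(nr_hidden, use_bias=False)))
    # The BiLSTM encodes position-sensitive features over the words and
    # pools them into a single vector per sentence.
    model.add(Bidirectional(LSTM(nr_hidden)))
    model.add(Dense(1, activation='sigmoid'))   # per-sentence sentiment score
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

Each sentence is scored independently, and the document score is just the sum of the sentence scores, which is what makes the sentences trivially parallelizable.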

Cool, thanks for letting me know! And thanks for the kind words. Sounds like we will be paying you sometime soon... :)

Thank you for your work @honnibal. Maybe you should add the URL of the dataset you used for the example (I guess it was the IMDB one).
I also noticed that you could open the file with UTF-8 encoding by default here:

with filename.open(encoding='utf-8') as file_:

I hope it helps :)

Cheers

After a brief search I haven't found any pre-trained sentiment models from you guys. Is there an update on this?

Cheers

@vcovo There's no update on this, sorry --- we haven't been able to find a publicly available dataset that we were happy with, and we didn't want to put out something we didn't think would be useful.

The same problem applied more generally to the idea of the data store mentioned above: for almost anything we wanted to do, we found we wanted to annotate fresh data. We therefore put our annotation tool, Prodigy, ahead of the data store in our work queue --- now that Prodigy's out and being used, the data store is back on the agenda.

We do have the text classifier in spaCy, so training it yourself on any of the publicly available datasets should be quite easy. See here for the example of training on IMDB: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py. Training should complete in a few hours on a CPU.
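The shape of that example, in brief --- a minimal sketch against the spaCy 2.x API, with a hypothetical two-example dataset standing in for IMDB:

    import random
    import spacy

    nlp = spacy.blank('en')
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat, last=True)
    textcat.add_label('POSITIVE')

    # Toy data for illustration; substitute a real dataset such as IMDB.
    train_data = [
        ('I loved this film', {'cats': {'POSITIVE': 1.0}}),
        ('Dreadful, a waste of time', {'cats': {'POSITIVE': 0.0}}),
    ]

    optimizer = nlp.begin_training()
    for epoch in range(10):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(epoch, losses)

    print(nlp('A surprisingly good movie').cats)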

Be sure to also benchmark the models you trained against the text classification functions in other open-source libraries, especially Vowpal Wabbit, scikit-learn and FastText. I've set up spaCy's text classifier in a way that I've found to be generally good on the problems I've been working on, and it's particularly well suited for short texts. However, one model isn't best across the board --- so you'll do well to check the other open-source solutions as well, which are faster due to different algorithmic choices.
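For instance, a quick scikit-learn baseline to compare against --- a standard linear bag-of-words setup, not spaCy's model, and assuming train_texts/train_labels and dev_texts/dev_labels are the same splits you trained the spaCy model on:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Bigram TF-IDF features into a logistic regression classifier.
    baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression())
    baseline.fit(train_texts, train_labels)
    print(baseline.score(dev_texts, dev_labels))  # accuracy on the dev split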

If you need to annotate data as well as use existing resources, do have a look at Prodigy --- it's very efficient at training a new model.

@honnibal No problem, I can totally understand. Thank you for the pointers. I'll make sure to check Prodigy out.

Regarding the deep learning with Keras example:

Why do you use two different 'en' models?

def evaluate(model_dir, texts, labels, max_length=100):
    def create_pipeline(nlp):
        '''
        This could be a lambda, but named functions are easier to read in Python.
        '''
        return [nlp.tagger, nlp.parser,
                SentimentAnalyser.load(model_dir, nlp, max_length=max_length)]

    nlp = spacy.load('en')
    nlp.pipeline = create_pipeline(nlp)
    correct = 0
    i = 0
    for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):


And at training time:

def train(train_texts, train_labels, dev_texts, dev_labels,
          lstm_shape, lstm_settings, lstm_optimizer, batch_size=100,
          nb_epoch=5, by_sentence=True):
    print("Loading spaCy")
    nlp = spacy.load('en_vectors_web_lg')
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    embeddings = get_embeddings(nlp.vocab)
    model = compile_lstm(embeddings, lstm_shape, lstm_settings)

I'm getting various errors with this code, such as "Dimension 0 in both shapes must be equal, but are 1070971 and 0", and lots of other errors from the same script.

Maybe the HuggingFace release could help with this? https://github.com/huggingface/pytorch-transformers
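For example, a minimal sketch with pytorch-transformers. Note the classification head here is randomly initialized, so it would still need fine-tuning on labelled sentiment data before the scores mean anything:

    import torch
    from pytorch_transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                          num_labels=2)
    model.eval()

    # Encode a single text and take the (still untrained) class logits.
    input_ids = torch.tensor([tokenizer.encode("This movie was great!")])
    with torch.no_grad():
        logits = model(input_ids)[0]
    print(torch.softmax(logits, dim=-1))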

Closing this, especially since transfer learning techniques make this even less relevant, as @chrisranderson suggests.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
