Hello there,
I've been repurposing the Sequence Tagger model. Basically, it involves stacking a few layers on top of the NER model: passing the labeled sentences to the next module using the _obtain_labels() method. I would then calculate the loss at the top and do a backward pass, which should propagate the errors through the whole ensemble, including the transition matrix of the CRF part, if I'm understanding it correctly.
However, _viterbi_decode(), the method that predicts the labels, uses NumPy ndarrays and operations instead of PyTorch tensors, thereby detaching them from the computation graph. So it looks like autograd won't be able to calculate the gradient for the transition matrix.
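To illustrate the concern, here is a minimal standalone sketch (not Flair code) showing how a round-trip through NumPy severs a tensor from the autograd graph:

```python
import torch

# A learnable transition matrix, as in a CRF layer.
transitions = torch.randn(3, 3, requires_grad=True)

# Staying in PyTorch keeps the computation graph intact:
score_pt = transitions.sum()
score_pt.backward()
assert transitions.grad is not None  # gradient flows back to `transitions`

# Converting to NumPy (as a NumPy-based decoder effectively does) detaches:
as_np = transitions.detach().numpy()
score_np = torch.tensor(as_np.sum())  # a fresh leaf tensor with no history
assert not score_np.requires_grad     # autograd cannot reach `transitions`
```

Any loss built on top of `score_np` would leave the transition matrix untrained.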
I noticed there's a method for calculating loss in the original model, and this one doesn't use _viterbi_decode() but instead _forward_alg(), which does use PyTorch tensors.
So my questions are:
- Could I simply use _viterbi_decode() and then use a plain loss function like log-likelihood?
- Why doesn't the loss calculation use _viterbi_decode()?

Thanks for your time!
Regards,
Santiago.
PS: I'm new to PyTorch...
Hello @s-glitch,
the loss of a CRF-NN is not only _forward_alg() but (as shown here) the difference between _forward_alg() and _score_sentence().
_score_sentence() calculates the energy of the labeled sentence, while _forward_alg() calculates the total energy over all possible labelings.
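As a toy sketch of that loss (assumed shapes and stand-in function names, not Flair's actual API), the CRF negative log-likelihood is the log partition function minus the score of the gold labeling, and gradients reach the transition matrix because everything stays in PyTorch:

```python
import torch

# emissions: (seq_len, num_tags) emission scores from the network
# transitions: (num_tags, num_tags) learned CRF transition matrix
# tags: (seq_len,) gold label indices

def score_sentence(emissions, transitions, tags):
    # Energy of one particular labeling.
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

def forward_alg(emissions, transitions):
    # Log-sum-exp over all labelings (log partition function).
    alpha = emissions[0]
    for t in range(1, emissions.size(0)):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

emissions = torch.randn(4, 3, requires_grad=True)
transitions = torch.randn(3, 3, requires_grad=True)
tags = torch.tensor([0, 2, 1, 1])

# Negative log-likelihood: log Z(x) - score(x, y)
loss = forward_alg(emissions, transitions) - score_sentence(emissions, transitions, tags)
loss.backward()  # gradients reach the transition matrix
```

Since the gold labeling is one term inside the log-sum-exp, this loss is always non-negative.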
On the other hand, _viterbi_decode() calculates the most likely sequence given the weights and input features.
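For completeness, a toy NumPy sketch of that Viterbi decoding (assumed shapes, not Flair's implementation):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions: (seq_len, num_tags), transitions: (num_tags, num_tags)
    # Dynamic programming: track the best score ending in each tag,
    # plus backpointers to recover the highest-scoring path.
    seq_len, num_tags = emissions.shape
    score = emissions[0]
    backpointers = []
    for t in range(1, seq_len):
        # cand[i, j] = best score ending in tag i, then transitioning to tag j
        cand = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    # Walk the backpointers from the best final tag.
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

emissions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
transitions = np.zeros((2, 2))
print(viterbi_decode(emissions, transitions))  # → [0, 1, 0]
```

Note the argmax/max at each step: decoding picks a single path rather than summing over all of them, which is why it is used for prediction, not for the loss.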
Yes, in theory you could simply use _viterbi_decode() and cross_entropy to get the same result; however, _viterbi_decode() simply takes longer to compute. It's much simpler to calculate only the total loss instead of individual losses for each token.
_viterbi_decode() uses NumPy instead of PyTorch because that algorithm is not parallelizable and runs faster on the CPU than on the (usually preferred) GPU. There is a really interesting blog post about the optimisation progress: https://towardsdatascience.com/why-we-switched-from-spacy-to-flair-to-anonymize-french-legal-cases-e7588566825f
Hello @helpmefindaname
I see. That is just what I needed to know. It was helpful. Thank you!
Have a nice day :)