Flair: FlairEmbeddings for long sequences

Created on 13 Jan 2019 · 18 comments · Source: flairNLP/flair

Several users have reported that if a paragraph consists of a very long sequence of words and we are training on GPU, we get CUDA out of memory issues (#332 #376). The problem is that a batch gets padded to the longest sequence and the full sequence x batch then put through the LSTM at the same time.

An alternative implementation would split sequences into chunks of a maximum length and put them through the LSTM sequentially, by always initializing its hidden state with the output state of the previous chunk. Basically, the same way we currently do it for training the language model.
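The chunking idea can be sketched in PyTorch roughly like this (a minimal illustration only, not flair's actual implementation; all names here are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical LSTM standing in for the one inside FlairEmbeddings.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

def forward_in_chunks(embeddings: torch.Tensor, chunk_size: int = 100) -> torch.Tensor:
    """Run a (batch, seq_len, dim) tensor through the LSTM chunk by chunk,
    carrying the hidden state across chunk boundaries so only one chunk
    is resident on the GPU at a time."""
    outputs = []
    hidden = None  # nn.LSTM initializes (h0, c0) to zeros when hidden is None
    for start in range(0, embeddings.size(1), chunk_size):
        chunk = embeddings[:, start:start + chunk_size, :]
        out, hidden = lstm(chunk, hidden)
        # Detach so backpropagation does not span the entire sequence.
        hidden = tuple(h.detach() for h in hidden)
        outputs.append(out)
    return torch.cat(outputs, dim=1)
```

This mirrors how truncated backpropagation through time is done when training the language model itself: the output state of one chunk initializes the next.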

enhancement

Most helpful comment

We'll make it a priority! :) Hoping to have this ready for 0.4.1!

All 18 comments

This would be a great fix! As a temporary workaround I tried using fold and sed to limit the number of characters per line. Tip: do not use fold, because it can break multi-byte UTF-8 characters!

In a recent experiment I limited lines to a maximum of 10,000 characters using: sed -e 's/.\{10000\}/&\n/g'. One disadvantage was that this took several hours for a dataset of 7 GB.

So a "native" solution in flair would be highly appreciated :)

We'll make it a priority! :) Hoping to have this ready for 0.4.1!

Long "words" could also be a problem, e.g. I had the following "words" in my training examples:

Format: number of characters, word

348 (10E,14E,16E,22Z)-2,6-Didesoxi-4-O-(2,6-didesoxi-3-O-metil-α-L-arabino-hexopiranosil)-3-O-metil-α-L-arabino-hexopiranósido(1R,4S,5'S,6S,6'R,8R,12S,13S,20R,21R,24S)-21,24-di-hidroxi-6'-isopropil-5',11,13,22-tetrametil-2-oxo-3,7,19-trioxatetraciclo[15.6.1.14,8.020,24]pentacosa-10,14,16,22-tetraeno-6-espiro-2'-(5',6'-di-hidro-2'H-piran)-12-ílico
288 N-acetil-O-terc-butil-L-tirosil-O-terc-butil-L-treonil-O-terc-butil-L-seril-L-leucil-L-isoleucil-N1-tritil-L-histidil-O-terc-butil-L-seril-L-leucil-L-isoleucil-α-L-glutamil-α-L-glutamil-O-terc-butil-L-seril-N-tritil-L-glutaminil-N-tritil-L-asparaginil-N-tritil-L-glutaminil-L-glutamina,
283 (2S,3aR,5aS,5bS,9S,13S,14R,16aS,16bS)-2-(6-deoxy-2,3,4-tri-O-methyl-α-l-mannopyranosyloxy)-13-(4-dimethylamino-2,3,4,6-tetradeoxy-β-d-erythropyranosyloxy)-9-ethyl-2,3,3a,5a,5b,6,7,9,10,11,12,13,14,15,16a,16b-hexadecahydro-4,14-dimethyl-1H-8-oxacyclododeca[b]as-indacene-7,15-dione;
278 (2R,3aS,5aR,5bS,9S,13S,14R,16aS,16bR)-2-(6-deoxy-2,3,4-tri-O-methyl-α-l-mannopyranosyloxy)-13-(4-dimethylamino-2,3,4,6-tetradeoxy-β-d-erythropyranosyloxy)-9-ethyl-2,3,3a,5a,5b,6,7,9,10,11,12,13,14,15,16a,16b-hexadecahydro-14-methyl-1H-8-oxacyclododeca[b]as-indacene-7,15-dione

That could also lead to out-of-memory errors.

Oh wow - yeah words like this could be problematic. We have to think about how best to handle this...
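One crude pre-processing workaround for such tokens (an assumption sketched here, not what flair itself ended up doing) is to truncate any over-long "word" before training:

```python
# Assumed cutoff; the IUPAC chemical names above exceed 250 characters.
MAX_TOKEN_LEN = 100

def truncate_long_tokens(line: str, max_len: int = MAX_TOKEN_LEN) -> str:
    """Shorten any whitespace-delimited token longer than max_len characters.

    Very long tokens blow up the character-level input to FlairEmbeddings,
    so clipping them bounds memory use at the cost of losing the tail of
    each clipped token.
    """
    return " ".join(tok[:max_len] for tok in line.split())
```

Clipping loses information, so for chemistry-heavy corpora a domain-aware tokenizer would be the cleaner fix.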

The just-merged PR hopefully addresses this problem, and it should also work for long words.

@alanakbik Still having the same problem. The CUDA gpu memory gets full and the training gets stuck. The data may have both long words and long sentences.

How can the long sentences be eliminated from the corpus? Directly modifying the corpus.train attribute is not allowed.


You might try this solution.

@alipetiwala I think you could also use:

from flair.data import Sentence

sent_1 = Sentence("That is a very very long sentence")
sent_2 = Sentence("Short sentence")
corpus = [sent_1, sent_2]

limit = 4

corpus = [x for x in corpus if len(x.tokens) <= limit]

to filter the corpus.train object.

@stefan-it
I had tried this, but it does not allow setting the train attribute:

corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(...)
corpus.train=[x for x in corpus if len(x.tokens) <= limit]

Throws an error.

What about this:

corpus.train = [x for x in corpus.train if len(x.tokens) <= limit]

This is the line that causes the error; you cannot set the attribute:

----> 1 corpus.train=filt_train

AttributeError: can't set attribute

@stefan-it

Sorry,

here's a code snippet that actually works:

from typing import List

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.data import TaggedCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharLMEmbeddings, CharacterEmbeddings
from flair.training_utils import EvaluationMetric
from flair.visual.training_curves import Plotter

# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
print(corpus)

max_tokens = 5

corpus._test = [x for x in corpus.test if len(x) <= max_tokens]

._test needs to be re-assigned, not .test, which is a kind of getter...
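For reference, the same trick can be applied to all three splits at once. This sketch assumes the private _train/_dev/_test attributes backing the read-only properties, as in flair 0.4.x; the attribute names are an assumption about a private API and may change between versions:

```python
def filter_splits(corpus, max_tokens: int) -> None:
    """Drop sentences longer than max_tokens from every split, in place.

    Assumes the corpus object exposes read-only train/dev/test properties
    backed by private _train/_dev/_test lists (flair 0.4.x's TaggedCorpus).
    """
    corpus._train = [s for s in corpus.train if len(s) <= max_tokens]
    corpus._dev = [s for s in corpus.dev if len(s) <= max_tokens]
    corpus._test = [s for s in corpus.test if len(s) <= max_tokens]
```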


Hi @stefan-it, I have a naive question: will the corpus now contain only the filtered sentences?
I didn't get how we reassign the values in _test to test.
🙈 🙈 🙈

Hi @nightlessbaron ,

so it's working because of the @property decorator:

https://github.com/flairNLP/flair/blob/2e3f6a0d29b56db72432beb97742dc024dc9c4fd/flair/data.py#L1022-L1024

You re-assign something to corpus._test, and the test property will then return _test :)
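The behaviour is easy to reproduce without flair. This minimal sketch (a hypothetical class, not flair's actual code) shows why assigning to a property with no setter fails while re-assigning the backing attribute works:

```python
class Corpus:
    """Toy stand-in for flair's TaggedCorpus property pattern."""

    def __init__(self):
        self._train = ["sentence 1", "sentence 2"]

    @property
    def train(self):
        # Read-only view: no setter is defined for this property.
        return self._train

c = Corpus()
try:
    c.train = []              # assigning to the property itself...
except AttributeError as e:
    print(e)                  # ...raises AttributeError (wording varies by Python version)

c._train = ["sentence 1"]     # re-assigning the backing attribute works
print(c.train)                # ['sentence 1']
```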

Ahh got it!
Thanks for the quick clarification!

I am still facing this issue.

