Several users have reported that if a paragraph consists of a very long sequence of words and we are training on GPU, we get CUDA out-of-memory errors (#332 #376). The problem is that each batch gets padded to its longest sequence, and the full (sequence length × batch size) tensor is then put through the LSTM at once.
An alternative implementation would split sequences into chunks of a maximum length and put them through the LSTM sequentially, always initializing the hidden state with the output state of the previous chunk, basically the same way we currently do it for training the language model.
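A minimal sketch of this idea in plain PyTorch (hypothetical names, not flair's actual API; the chunk size and dimensions are arbitrary examples):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

def forward_in_chunks(embeddings: torch.Tensor, chunk_size: int = 256) -> torch.Tensor:
    """embeddings: (batch, seq_len, input_size); returns the concatenated outputs."""
    hidden = None  # nn.LSTM initializes the state to zeros when hidden is None
    outputs = []
    for start in range(0, embeddings.size(1), chunk_size):
        chunk = embeddings[:, start:start + chunk_size, :]
        out, hidden = lstm(chunk, hidden)
        # detach so that backpropagation is truncated at the chunk boundary,
        # as in the truncated BPTT scheme used for language model training;
        # the state values still carry context into the next chunk
        hidden = tuple(h.detach() for h in hidden)
        outputs.append(out)
    return torch.cat(outputs, dim=1)

# a 1,000-step sequence is fed through the LSTM 256 steps at a time
out = forward_in_chunks(torch.randn(2, 1000, 64))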
This would be a great fix! As a temporary workaround I tried using fold and sed to limit the number of characters per line. Tip: do not use fold, because it can split multi-byte UTF-8 characters!
In a recent experiment I limited the maximum characters per line to 10,000 using: sed -e 's/.\{10000\}/&\n/g'. One disadvantage was that this took several hours for a 7 GB dataset.
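For what it's worth, a plain Python version avoids the broken-characters problem entirely, since Python string slicing works on code points rather than bytes (a sketch; the file names are placeholders):

max_chars = 10_000
with open("corpus.txt", encoding="utf-8") as src, \
        open("corpus_wrapped.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip("\n")
        if not line:
            dst.write("\n")  # preserve empty lines (e.g. document separators)
            continue
        for start in range(0, len(line), max_chars):
            dst.write(line[start:start + max_chars] + "\n")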
So a "native" solution in flair would be highly appreciated :)
We'll make it a priority! :) Hoping to have this ready for 0.4.1!
Long "words" could also be a problem, e.g. I had the following "words" in my training examples:
Format: number of characters, word
348 (10E,14E,16E,22Z)-2,6-Didesoxi-4-O-(2,6-didesoxi-3-O-metil-α-L-arabino-hexopiranosil)-3-O-metil-α-L-arabino-hexopiranósido(1R,4S,5'S,6S,6'R,8R,12S,13S,20R,21R,24S)-21,24-di-hidroxi-6'-isopropil-5',11,13,22-tetrametil-2-oxo-3,7,19-trioxatetraciclo[15.6.1.14,8.020,24]pentacosa-10,14,16,22-tetraeno-6-espiro-2'-(5',6'-di-hidro-2'H-piran)-12-ílico
288 N-acetil-O-terc-butil-L-tirosil-O-terc-butil-L-treonil-O-terc-butil-L-seril-L-leucil-L-isoleucil-N1-tritil-L-histidil-O-terc-butil-L-seril-L-leucil-L-isoleucil-α-L-glutamil-α-L-glutamil-O-terc-butil-L-seril-N-tritil-L-glutaminil-N-tritil-L-asparaginil-N-tritil-L-glutaminil-L-glutamina,
283 (2S,3aR,5aS,5bS,9S,13S,14R,16aS,16bS)-2-(6-deoxy-2,3,4-tri-O-methyl-α-l-mannopyranosyloxy)-13-(4-dimethylamino-2,3,4,6-tetradeoxy-β-d-erythropyranosyloxy)-9-ethyl-2,3,3a,5a,5b,6,7,9,10,11,12,13,14,15,16a,16b-hexadecahydro-4,14-dimethyl-1H-8-oxacyclododeca[b]as-indacene-7,15-dione;
278 (2R,3aS,5aR,5bS,9S,13S,14R,16aS,16bR)-2-(6-deoxy-2,3,4-tri-O-methyl-α-l-mannopyranosyloxy)-13-(4-dimethylamino-2,3,4,6-tetradeoxy-β-d-erythropyranosyloxy)-9-ethyl-2,3,3a,5a,5b,6,7,9,10,11,12,13,14,15,16a,16b-hexadecahydro-14-methyl-1H-8-oxacyclododeca[b]as-indacene-7,15-dione
That could also lead to out-of-memory errors.
Oh wow - yeah words like this could be problematic. We have to think about how best to handle this...
The just-merged PR hopefully addresses this problem and should also work for long words.
@alanakbik Still having the same problem: the CUDA GPU memory fills up and training gets stuck. The data may contain both long words and long sentences.
How can the long sentences be eliminated from the corpus? Directly modifying the corpus.train attribute is not allowed.
You might try this solution.
@alipetiwala I think you could also use:

from flair.data import Sentence

sent_1 = Sentence("That is a very very long sentence")
sent_2 = Sentence("Short sentence")
corpus = [sent_1, sent_2]

# keep only sentences with at most `limit` tokens
limit = 4
corpus = [x for x in corpus if len(x.tokens) <= limit]

to filter the corpus.train object.
@stefan-it
I had tried this, but it does not allow setting the train attribute:
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(...)
corpus.train=[x for x in corpus if len(x.tokens) <= limit]
This throws an error.
What about this:
corpus.train = [x for x in corpus.train if len(x.tokens) <= limit]
This is the line that causes the error. You cannot set the attribute:
----> 1 corpus.train=filt_train
AttributeError: can't set attribute
@stefan-it
Sorry, here's a code snippet that actually works:
from typing import List
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.data import TaggedCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharLMEmbeddings, CharacterEmbeddings
from flair.training_utils import EvaluationMetric
from flair.visual.training_curves import Plotter
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
print(corpus)
# 2. filter out long sentences by re-assigning the private _test attribute
max_tokens = 5
corpus._test = [x for x in corpus.test if len(x) <= max_tokens]
._test needs to be re-assigned, not .test, which is a kind of getter...
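The same pattern should cover the training split and the long-"words" case from earlier in this thread (a hedged sketch: the _train attribute mirrors the _test example above, and both thresholds are arbitrary examples):

# filter the training split the same way, additionally dropping
# sentences that contain an extremely long token
max_tokens = 100
max_chars_per_token = 100
corpus._train = [x for x in corpus.train
                 if len(x) <= max_tokens
                 and all(len(token.text) <= max_chars_per_token for token in x)]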
Hi @stefan-it, I have a trivial question. Will the corpus now contain only the filtered sentences?
I didn't get how the values assigned to _test are returned by test.
🙈 🙈 🙈
Hi @nightlessbaron,
it works because of the @property decorator:
https://github.com/flairNLP/flair/blob/2e3f6a0d29b56db72432beb97742dc024dc9c4fd/flair/data.py#L1022-L1024
When you re-assign something to corpus._test, accessing corpus.test will then return _test :)
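For illustration, the mechanism looks roughly like this (a minimal sketch of the pattern, not flair's actual class):

class Corpus:
    def __init__(self, test):
        self._test = test  # the underlying attribute, freely writable

    @property
    def test(self):
        # read-only accessor: corpus.test returns _test, but
        # corpus.test = ... raises "AttributeError: can't set attribute"
        # because no setter is defined
        return self._test

corpus = Corpus(test=["a very very long sentence", "short"])
corpus._test = [s for s in corpus.test if len(s.split()) <= 2]  # works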
Ahh got it!
Thanks for the quick clarification!
I am still facing this issue.