Several users have reported that if a paragraph consists of a very long sequence of words and we are training on GPU, we get CUDA out-of-memory errors (#332 #376). The problem is that each batch gets padded to its longest sequence, and the full (sequence length × batch size) tensor is then put through the LSTM at once.
An alternative implementation would split sequences into chunks of a maximum length and put them through the LSTM sequentially, always initializing the hidden state with the output state of the previous chunk, basically the same way we currently do it for training the language model.
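A minimal sketch of this idea in plain PyTorch (hypothetical names, not flair's actual API; the chunk size and dimensions are arbitrary examples):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

def forward_in_chunks(embeddings: torch.Tensor, chunk_size: int = 256) -> torch.Tensor:
    """embeddings: (batch, seq_len, input_size); returns the concatenated outputs."""
    hidden = None  # nn.LSTM initializes the state to zeros when hidden is None
    outputs = []
    for start in range(0, embeddings.size(1), chunk_size):
        chunk = embeddings[:, start:start + chunk_size, :]
        out, hidden = lstm(chunk, hidden)
        # detach so that backpropagation is truncated at the chunk boundary,
        # as in the truncated BPTT scheme used for language model training;
        # the state values still carry context into the next chunk
        hidden = tuple(h.detach() for h in hidden)
        outputs.append(out)
    return torch.cat(outputs, dim=1)

# a 1,000-step sequence is fed through the LSTM 256 steps at a time
out = forward_in_chunks(torch.randn(2, 1000, 64))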
This would be a great fix! As a temporary workaround I tried using fold and sed to limit the number of characters per line. Tip: do not use fold, because it can split multi-byte UTF-8 characters!
In a recent experiment I limited the maximum characters per line to 10,000 using: sed -e 's/.\{10000\}/&\n/g'. One disadvantage was that this took several hours for a 7 GB dataset.
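For what it's worth, a plain Python version avoids the broken-characters problem entirely, since Python string slicing works on code points rather than bytes (a sketch; the file names are placeholders):

max_chars = 10_000
with open("corpus.txt", encoding="utf-8") as src, \
        open("corpus_wrapped.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip("\n")
        if not line:
            dst.write("\n")  # preserve empty lines (e.g. document separators)
            continue
        for start in range(0, len(line), max_chars):
            dst.write(line[start:start + max_chars] + "\n")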
So a "native" solution in flair would be highly appreciated :)
We'll make it a priority! :) Hoping to have this ready for 0.4.1!
Long "words" could also be a problem, e.g. I had the following "words" in my training examples:
Format: number of characters, word
348 (10E,14E,16E,22Z)-2,6-Didesoxi-4-O-(2,6-didesoxi-3-O-metil-α-L-arabino-hexopiranosil)-3-O-metil-α-L-arabino-hexopiranósido(1R,4S,5'S,6S,6'R,8R,12S,13S,20R,21R,24S)-21,24-di-hidroxi-6'-isopropil-5',11,13,22-tetrametil-2-oxo-3,7,19-trioxatetraciclo[15.6.1.14,8.020,24]pentacosa-10,14,16,22-tetraeno-6-espiro-2'-(5',6'-di-hidro-2'H-piran)-12-ílico
288 N-acetil-O-terc-butil-L-tirosil-O-terc-butil-L-treonil-O-terc-butil-L-seril-L-leucil-L-isoleucil-N1-tritil-L-histidil-O-terc-butil-L-seril-L-leucil-L-isoleucil-α-L-glutamil-α-L-glutamil-O-terc-butil-L-seril-N-tritil-L-glutaminil-N-tritil-L-asparaginil-N-tritil-L-glutaminil-L-glutamina,
283 (2S,3aR,5aS,5bS,9S,13S,14R,16aS,16bS)-2-(6-deoxy-2,3,4-tri-O-methyl-α-l-mannopyranosyloxy)-13-(4-dimethylamino-2,3,4,6-tetradeoxy-β-d-erythropyranosyloxy)-9-ethyl-2,3,3a,5a,5b,6,7,9,10,11,12,13,14,15,16a,16b-hexadecahydro-4,14-dimethyl-1H-8-oxacyclododeca[b]as-indacene-7,15-dione;
278 (2R,3aS,5aR,5bS,9S,13S,14R,16aS,16bR)-2-(6-deoxy-2,3,4-tri-O-methyl-α-l-mannopyranosyloxy)-13-(4-dimethylamino-2,3,4,6-tetradeoxy-β-d-erythropyranosyloxy)-9-ethyl-2,3,3a,5a,5b,6,7,9,10,11,12,13,14,15,16a,16b-hexadecahydro-14-methyl-1H-8-oxacyclododeca[b]as-indacene-7,15-dione
That could also lead to out-of-memory errors.
Oh wow - yeah words like this could be problematic. We have to think about how best to handle this...
The just-merged PR hopefully addresses this problem and should also work for long words.
@alanakbik Still having the same problem: the CUDA GPU memory fills up and training gets stuck. The data may contain both long words and long sentences.
How can the long sentences be eliminated from the corpus? Directly modifying the corpus.train attribute is not allowed.
You might try this solution.
@alipetiwala I think you could also use:

from flair.data import Sentence

sent_1 = Sentence("That is a very very long sentence")
sent_2 = Sentence("Short sentence")
corpus = [sent_1, sent_2]

# keep only sentences with at most `limit` tokens
limit = 4
corpus = [x for x in corpus if len(x.tokens) <= limit]

to filter the corpus.train object.
@stefan-it
I had tried this, but it does not allow setting the train attribute:
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(...)
corpus.train=[x for x in corpus if len(x.tokens) <= limit]
This throws an error.
What about this:
corpus.train = [x for x in corpus.train if len(x.tokens) <= limit]
This is the line that causes the error. You cannot set the attribute:
----> 1 corpus.train=filt_train
AttributeError: can't set attribute
@stefan-it
Sorry, here's a code snippet that actually works:
from typing import List
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.data import TaggedCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharLMEmbeddings, CharacterEmbeddings
from flair.training_utils import EvaluationMetric
from flair.visual.training_curves import Plotter
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
print(corpus)
# 2. filter out long sentences by re-assigning the private _test attribute
max_tokens = 5
corpus._test = [x for x in corpus.test if len(x) <= max_tokens]
._test needs to be re-assigned, not .test, which is a kind of getter...
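The same pattern should cover the training split and the long-"words" case from earlier in this thread (a hedged sketch: the _train attribute mirrors the _test example above, and both thresholds are arbitrary examples):

# filter the training split the same way, additionally dropping
# sentences that contain an extremely long token
max_tokens = 100
max_chars_per_token = 100
corpus._train = [x for x in corpus.train
                 if len(x) <= max_tokens
                 and all(len(token.text) <= max_chars_per_token for token in x)]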
Hi @stefan-it, I have a trivial question. Will the corpus now contain only the filtered sentences?
I didn't get how the values assigned to _test are returned by test.
🙈 🙈 🙈
Hi @nightlessbaron,
it works because of the @property decorator:
https://github.com/flairNLP/flair/blob/2e3f6a0d29b56db72432beb97742dc024dc9c4fd/flair/data.py#L1022-L1024
When you re-assign something to corpus._test, accessing corpus.test will then return _test :)
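For illustration, the mechanism looks roughly like this (a minimal sketch of the pattern, not flair's actual class):

class Corpus:
    def __init__(self, test):
        self._test = test  # the underlying attribute, freely writable

    @property
    def test(self):
        # read-only accessor: corpus.test returns _test, but
        # corpus.test = ... raises "AttributeError: can't set attribute"
        # because no setter is defined
        return self._test

corpus = Corpus(test=["a very very long sentence", "short"])
corpus._test = [s for s in corpus.test if len(s.split()) <= 2]  # works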
Ahh got it!
Thanks for the quick clarification!
I am still facing this issue.