Flair: sqlite3.OperationalError: database or disk is full

Created on 26 Jan 2019 · 11 Comments · Source: flairNLP/flair

Hello and thanks for making this tool public.
So, I am using the example provided here to get document embeddings.
I am running this on an Amazon SageMaker ml.p2.8xlarge instance with an additional 500 GB volume.
I have 106779 documents and the error happens after 1892 have been computed. It is the following:

File "embeddings.py", line 168, in <module>
main()
File "embeddings.py", line 140, in main
document_embeddings.embed(sentence)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 1374, in embed
self.embeddings.embed(sentences)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 123, in embed
embedding.embed(sentences)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 56, in embed
self._add_embeddings_internal(sentences)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 612, in _add_embeddings_internal
embeddings = self.cache.get(key)
File "/usr/lib/python3.6/_collections_abc.py", line 660, in get
return self[key]
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 243, in __getitem__
item = self.conn.select_one(GET_ITEM, (key,))
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 515, in select_one
return next(iter(self.select(req, arg)))
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 504, in select
self.execute(req, arg, res)
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 481, in execute
self.check_raise_error()
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 475, in check_raise_error
reraise(e_type, e_value, e_tb)
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 71, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/sqlitedict.py", line 409, in run
cursor.execute(req, arg)
sqlite3.OperationalError: database or disk is full

The Code

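import progressbar
from flair.data import Sentence
from flair.embeddings import DocumentPoolEmbeddings, FlairEmbeddings, WordEmbeddings

# word-level embeddings that will be pooled into one document embedding
fastText_embedding = WordEmbeddings('news')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# initialize the document embeddings
document_embeddings = DocumentPoolEmbeddings([fastText_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward])

vecs = []

# df is assumed to be a pandas DataFrame with a 'documents' column, loaded elsewhere
for em in progressbar.progressbar(df['documents'].values):
    sentence = Sentence(em)
    document_embeddings.embed(sentence)
    vecs.append(sentence.get_embedding().squeeze().tolist())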

Environment:
The script runs in a Docker container created with tensorflow:latest-gpu

  • OS: Ubuntu
  • Version: pip version of flair
  • Python version: Python 3.6

Might this issue be related to #387?

bug

All 11 comments

Hello @shoegazerstella - I think the problem is due to an incorrect default behavior in release 0.4 that we have since fixed. In 0.4, FlairEmbeddings are stored on disk by default, so they can take up a lot of disk space in use cases such as yours. In current master and the upcoming 0.4.1, they are only stored on disk if specifically requested.

For now, you can initialize the embeddings with the use_cache parameter set to False so that they are not saved to disk, like this:

fastText_embedding = WordEmbeddings('news')
flair_embedding_forward = FlairEmbeddings('news-forward', use_cache=False)
flair_embedding_backward = FlairEmbeddings('news-backward', use_cache=False)

Also, you probably need to free up disk space. The default directory for storing embeddings is '~/.flair/embeddings/', so you should delete the big files there.
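
To see which files are worth deleting, the directory can be scanned for its largest entries. This is a minimal sketch using only the Python standard library; the helper name largest_files and the top_n parameter are illustrative and not part of flair:

```python
from pathlib import Path


def largest_files(directory, top_n=5):
    """Return (size_in_bytes, path) pairs for the top_n largest files under directory."""
    root = Path(directory).expanduser()
    if not root.is_dir():
        # Nothing to report if the cache directory does not exist
        return []
    sizes = [(p.stat().st_size, p) for p in root.rglob("*") if p.is_file()]
    return sorted(sizes, key=lambda item: item[0], reverse=True)[:top_n]


if __name__ == "__main__":
    # Path taken from the comment above; adjust if your cache lives elsewhere
    for size, path in largest_files("~/.flair/embeddings/"):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

The output lists the biggest files first, so any materialized embedding cache should stand out from the much smaller model files.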

Does this work?

Hi, I am trying what you suggested, but when the word embeddings are initialized flair still downloads files into the directory you mentioned. Some logs:

2019-01-28 10:32:04,892 copying /tmp/tmpl87klvw8 to cache at /root/.flair/embeddings/lm-news-english-forward-v0.2rc.pt
2019-01-28 10:32:04,958 removing temp file /tmp/tmpl87klvw8
2019-01-28 10:32:13,402 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-backward-v0.2rc.pt not found in cache, downloading to /tmp/tmp85oeg4kt 

It has been running on AWS for quite a while and seems to be hanging while downloading. I have been stuck at the following progress bar for almost 1.5 hours now.

2019-01-28 10:32:18,401 copying /tmp/tmp85oeg4kt to cache at /root/.flair/embeddings/lm-news-english-backward-v0.2rc.pt

2019-01-28 10:32:18,465 removing temp file /tmp/tmp85oeg4kt

0%| | 0/72405799 [00:00<?, ?B/s]
0%| | 52224/72405799 [00:00<03:59, 302052.08B/s]
0%| | 226304/72405799 [00:00<03:03, 393599.65B/s]
1%| | 608256/72405799 [00:00<02:15, 531521.16B/s]
2%|1 | 1183744/72405799 [00:00<01:38, 722149.58B/s]
4%|3 | 2819072/72405799 [00:00<01:09, 1006791.11B/s]
7%|6 | 4867072/72405799 [00:00<00:48, 1399786.51B/s]
10%|9 | 7141376/72405799 [00:00<00:33, 1948296.79B/s]
13%|#2 | 9126912/72405799 [00:01<00:23, 2652454.70B/s]
16%|#5 | 11224064/72405799 [00:01<00:17, 3540116.03B/s]
19%|#8 | 13494272/72405799 [00:01<00:12, 4740494.50B/s]
21%|##1 | 15483904/72405799 [00:01<00:09, 6046891.30B/s]
24%|##4 | 17581056/72405799 [00:01<00:07, 7443192.43B/s]
27%|##7 | 19833856/72405799 [00:01<00:05, 9314053.61B/s]
30%|### | 21840896/72405799 [00:01<00:04, 10775058.34B/s]
33%|###3 | 23938048/72405799 [00:01<00:04, 11967959.97B/s]
36%|###6 | 26231808/72405799 [00:01<00:03, 13972290.37B/s]
39%|###8 | 28197888/72405799 [00:02<00:03, 14698872.24B/s]
42%|####1 | 30295040/72405799 [00:02<00:02, 15119716.05B/s]
45%|####4 | 32578560/72405799 [00:02<00:02, 16825018.75B/s]
48%|####7 | 34554880/72405799 [00:02<00:02, 16816030.78B/s]
51%|##### | 36652032/72405799 [00:02<00:02, 16620747.04B/s]
54%|#####3 | 38851584/72405799 [00:02<00:01, 17935453.40B/s]
57%|#####6 | 40911872/72405799 [00:02<00:01, 17811073.73B/s]
59%|#####9 | 43009024/72405799 [00:02<00:01, 17261415.73B/s]
62%|######2 | 45159424/72405799 [00:03<00:01, 18347024.67B/s]
65%|######5 | 47268864/72405799 [00:03<00:01, 18243079.18B/s]
68%|######8 | 49366016/72405799 [00:03<00:01, 17558551.50B/s]
71%|#######1 | 51528704/72405799 [00:03<00:01, 18608520.78B/s]
74%|#######4 | 53625856/72405799 [00:03<00:01, 18376104.07B/s]
77%|#######6

Is this expected behaviour?

Just so I understand your last suggestion better: you are suggesting that I delete all the files downloaded and stored in the folder '~/.flair/embeddings/' before processing my text, is that right?

Thanks a lot for your help!

Hello @shoegazerstella, hm, no, that is strange. Is the disk full? Is something somehow blocking the download process?

Yes, if you delete '~/.flair/embeddings/' you should free up some space. Then next time you launch it should trigger a re-download of the models (which is ok), but this time it will not materialize the embeddings to disk so you should be ok disk-space wise.
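
To rule out a full disk before relaunching, the free space on the relevant volume can be checked from Python. A minimal sketch using only the standard library; the helper names free_gib and has_room and the 10 GiB threshold are illustrative assumptions, not flair requirements:

```python
import shutil


def free_gib(path="/"):
    """Return the free space, in GiB, on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_room(path, required_gib):
    """Return True if at least required_gib GiB are free on the filesystem holding path."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # "/" is a placeholder; point this at the volume that holds ~/.flair and /tmp
    print(f"free space: {free_gib('/'):.1f} GiB")
    if not has_room("/", 10):  # 10 GiB is an arbitrary example threshold
        print("less than 10 GiB free; model downloads may fail or stall")
```

Note that the logs above show downloads going to /tmp first and then being copied into the cache, so both locations need free space if they sit on different volumes.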

That is odd because, as you can see from my first post, the download and the embedding extraction were both working fine until the "database or disk is full" error.
Now it seems blocked: I have no log or error, and the instance is still running.
I am now adding more space (1 TB) to a new instance to see what happens and whether it hangs again.

After 4 hours of downloading I finally have some new logs, and it is extracting the embeddings properly now. I launched a new instance (exact same code and Docker container) and this time the download was faster, so maybe something odd was happening on Amazon's side at the time of the download.
I think the issue is solved, thanks a lot for your help @alanakbik :)

Sorry to reopen the issue but this error occurred:

% (203 of 844) |### | Elapsed Time: 8:13:59 ETA: 1 day, 13:24:58
24% (204 of 844) |##### | Elapsed Time: 8:15:29 ETA: 15:57:02
24% (205 of 844) |### | Elapsed Time: 8:24:13 ETA: 3 days, 20:59:38
24% (206 of 844) |### | Elapsed Time: 8:30:24 ETA: 2 days, 17:45:41
24% (207 of 844) |### | Elapsed Time: 8:30:26 ETA: 1 day, 9:00:04
24% (208 of 844) |### | Elapsed Time: 8:34:39 ETA: 1 day, 20:45:38
24% (209 of 844) |### | Elapsed Time: 8:39:28 ETA: 2 days, 2:59:21
24% (210 of 844) |### | Elapsed Time: 8:43:20 ETA: 1 day, 16:49:40
25% (211 of 844) |### | Elapsed Time: 8:47:46 ETA: 1 day, 22:47:05
25% (212 of 844) |### | Elapsed Time: 8:52:07 ETA: 1 day, 21:48:40
25% (213 of 844) |##### | Elapsed Time: 8:52:11 ETA: 0:42:22
25% (214 of 844) |### | Elapsed Time: 8:56:26 ETA: 1 day, 20:29:12
File "embeddings.py", line 172, in <module>
main()
File "embeddings.py", line 144, in main
document_embeddings.embed(sentence)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 1374, in embed
self.embeddings.embed(sentences)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 123, in embed
embedding.embed(sentences)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 56, in embed
self._add_embeddings_internal(sentences)
File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 647, in _add_embeddings_internal
all_hidden_states_in_lm = self.lm.get_representation(sentences_padded, self.detach)
File "/usr/local/lib/python3.6/dist-packages/flair/models/language_model.py", line 105, in get_representation
prediction, rnn_output, hidden = self.forward(batch, hidden)
File "/usr/local/lib/python3.6/dist-packages/flair/models/language_model.py", line 76, in forward
output, hidden = self.rnn(emb, hidden)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 179, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 12.64 GiB (GPU 0; 11.17 GiB total capacity; 783.32 MiB already allocated; 9.90 GiB free; 182.43 MiB cached)

Seems related to #248.

It's possible that this happens when you have a very large document. We're currently looking into a solution to produce embeddings for large documents without getting CUDA oom errors (see #387).

Could you check what the largest document in terms of number of characters is in your corpus?
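
A quick way to run this check is to take the maximum over the string lengths. A minimal sketch in plain Python; the helper name longest_document and the sample strings are made up for illustration:

```python
def longest_document(documents):
    """Return (index, length_in_characters) of the longest string in documents.

    Raises ValueError if documents is empty.
    """
    index, text = max(enumerate(documents), key=lambda pair: len(pair[1]))
    return index, len(text)


if __name__ == "__main__":
    # Hypothetical sample data; replace with your own list of documents
    docs = ["short text", "a somewhat longer document", "tiny"]
    index, length = longest_document(docs)
    print(f"longest document is #{index} with {length} characters")
```

With the DataFrame from the original post, assuming it is a pandas DataFrame with string values, the same number should be available as df['documents'].str.len().max().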

Hi @alanakbik,

So, I have 2 datasets:

  • In D1 there are large documents indeed. The len() of the longest string is 1656872.
  • In D2 there are a few phrases, and the longest has length 12586.

I ran the embeddings extraction on both, but I only succeeded with the data from D2.
I was using the GPU, but it took 2 days to complete the extraction.
I got the CUDA error while working with D1.

I guess I will be waiting for your solution then, thanks a lot for working on this!

A question: is it possible to choose the dimension of the embeddings? The default one is about 4k.

Thanks for the information! If you use:

flair_embedding_forward = FlairEmbeddings('news-forward-fast', use_cache=False)
flair_embedding_backward = FlairEmbeddings('news-backward-fast', use_cache=False)

instead, i.e. with "-fast" as part of the model name, you get embeddings that are half the typical size. So your current options are 4k and 2k dimensions. The embeddings are also produced much faster.

We just merged a PR into master branch that should make it possible to get flair embeddings for very long sequences.

Initialize like this:

flair_embedding_forward = FlairEmbeddings('news-forward-fast', use_cache=False, chars_per_chunk=512)
flair_embedding_backward = FlairEmbeddings('news-backward-fast', use_cache=False, chars_per_chunk=512)

The chars_per_chunk parameter controls the speed / memory tradeoff. If you set it high, the chunks are large, which means greater memory requirements but faster processing; a lower value uses less memory but is slower.
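
To get a feel for what the parameter means, the sketch below splits a long string into pieces of at most chars_per_chunk characters. It only illustrates the general idea of fixed-size chunking and is not Flair's actual implementation; the helper name split_into_chunks is hypothetical, and how Flair processes the chunks internally is not shown here:

```python
def split_into_chunks(text, chars_per_chunk=512):
    """Split text into consecutive pieces of at most chars_per_chunk characters."""
    if chars_per_chunk <= 0:
        raise ValueError("chars_per_chunk must be positive")
    return [text[i:i + chars_per_chunk] for i in range(0, len(text), chars_per_chunk)]


if __name__ == "__main__":
    # Made-up document of 1300 characters
    document = "x" * 1300
    chunks = split_into_chunks(document, chars_per_chunk=512)
    print(len(chunks), [len(chunk) for chunk in chunks])
```

Under this simple scheme, a 1,656,872-character document like the longest one in D1 would be split into 3,237 chunks at chars_per_chunk=512, so each pass handles a small piece instead of the whole sequence at once.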

@shoegazerstella @alanakbik I tried with chars_per_chunk=128 and still hit the same issue. I did not set use_cache=False, though.
