spaCy: Property for max doc length

Created on 5 Jul 2018  ·  6 Comments  ·  Source: explosion/spaCy

Feature description

While processing a large corpus of emails, I kept getting a memory error. I checked the size of the failing document and it was 31,888,895 characters, so it didn't surprise me that it failed with a memory error.

This failure could've been avoided with an upper bound on document size. The limit could ship with a reasonable default (10,000 characters?) and be user-modifiable depending on the system or situation.

With this upper bound in place, spaCy would perform a length check prior to processing. If a document exceeds the upper bound, spaCy could throw an exception. The caller could then either catch it, recovering and moving on to the next item, or let it propagate and halt.
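
The proposed check can be sketched in plain Python. This is only an illustration of the idea, not spaCy's implementation: `MAX_DOC_LENGTH`, `DocumentTooLongError`, `check_length`, and `process_corpus` are hypothetical names, and `process` stands in for whatever pipeline call is used.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical default taken from the suggestion above; not a spaCy setting.
MAX_DOC_LENGTH = 10_000


class DocumentTooLongError(ValueError):
    """Raised when a text exceeds the configured upper bound (hypothetical)."""


def check_length(text: str, max_length: int = MAX_DOC_LENGTH) -> None:
    """Raise before any expensive processing if the text is too long."""
    if len(text) > max_length:
        raise DocumentTooLongError(
            f"Text of length {len(text)} exceeds maximum of {max_length}."
        )


def process_corpus(
    texts: Iterable[str],
    process: Callable[[str], object],
    max_length: int = MAX_DOC_LENGTH,
) -> Tuple[List[object], List[Tuple[int, int]]]:
    """Process each text, skipping and recording the ones that are too long."""
    results: List[object] = []
    skipped: List[Tuple[int, int]] = []  # (index, length) of rejected texts
    for index, text in enumerate(texts):
        try:
            check_length(text, max_length)
        except DocumentTooLongError:
            # Recover and move on to the next item.
            skipped.append((index, len(text)))
            continue
        results.append(process(text))
    return results, skipped
```

A caller that prefers to halt can call `check_length` directly and let the exception propagate instead of going through `process_corpus`.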

Could the feature be a custom component or spaCy plugin?

If so, we will tag it as project idea so other users can take it on.

It doesn't seem that it needs to be a custom component.

enhancement feat / doc

All 6 comments

But isn't this already built in? I get the following ValueError if I try to parse a string longer than one million characters:

ValueError: [E088] Text of length 1000000 exceeds maximum of 1000000. The v2.x parser and NER
models require roughly 1GB of temporary memory per 100,000 characters in the input. This means
long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably 
safe to increase the 'nlp.max_length' limit. The limit is in number of characters, so you can check
whether your inputs are too long by checking 'len(text)'.
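
To put the rule of thumb from that message in perspective, here is a rough back-of-the-envelope estimate. It assumes the quoted figure of roughly 1GB per 100,000 characters scales linearly with input length, which the message does not state explicitly; `estimate_temp_memory_gb` is a hypothetical helper, not part of spaCy.

```python
# Rule of thumb quoted in the E088 message for the v2.x parser and NER.
GB_PER_100K_CHARS = 1.0


def estimate_temp_memory_gb(num_chars: int) -> float:
    """Estimate temporary memory in GB, assuming linear scaling."""
    return num_chars / 100_000 * GB_PER_100K_CHARS


# The 1,000,000-character limit shown in the message.
print(estimate_temp_memory_gb(1_000_000))

# The 31,888,895-character email from the original report.
print(estimate_temp_memory_gb(31_888_895))
```

Under that assumption, the limit in the message corresponds to about 10GB of temporary memory, while the email from the original report would need on the order of 319GB, which is consistent with the memory error.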

I created the EN nlp object disabling NER & textcat:
nlp = spacy.load("en_core_web_lg", disable=['ner', 'textcat'])

But I didn't get the error you described. I'm currently using 2.0.6. I'll upgrade to 2.0.11 and see if that changes things.

@craigpfeifer I'm pretty sure the explicit maximum length error was added in the most recent version (or one of the more recent ones, likely after v2.0.6), so this might be why!

Confirmed in 2.0.11, I get the ValueError when the string is too long. Thanks!

Thanks for updating and also thanks for the feature request – even though it was already implemented, it's nice to hear that you came to the same conclusion! 👍

