Hi, I'm currently trying to train a custom NER model with more than 125 labels, and I get the following error:
Process finished with exit code -1073740791 (0xC0000409)
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
There seems to be a limit: with fewer than 125 labels training works, and with more than that it crashes.
def __train_model(self, train_data, entity_types):
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for entity_type in list(entity_types):
        ner.add_label(entity_type)
    optimizer = nlp.begin_training()
    # Start training
    for i in range(20):
        losses = {}
        index = 0
        random.shuffle(train_data)
        for statement, entities in train_data:
            nlp.update([statement], [entities], sgd=optimizer, losses=losses, drop=0.5)
    return nlp
def test_train_with_max_supported_entity_types(self):
    train_data = TrainData()
    train_data.extend([("One sentence", {"entities": []})])
    entity_types = {i for i in range(125)}
    model = self.train_model_processor.train(train_data, entity_types)
    assert_is_not_none(model)
So in the unit test, whenever the length of entity_types goes beyond 125, it crashes.
Python version: 3.7.0
Environment Information:
16 GB RAM, CPU: i7-3630QM
Any idea whether there is a limit on the number of labels? If so, shouldn't it return an error message describing the problem instead of crashing?
~Trying to reproduce this now, but at first glance it looks like the problem is that your labels are integers, where they should be either strings, or the hash of those strings. The integer 125 is going to resolve to one of the reserved symbols, and I think that's what's confusing it.~
Edit: Aaah, nevermind. I found a place in the code where I'd lazily used a stack-allocated array during development, and had not replaced it. Apologies for the inconvenience, and thanks for the test case.
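For anyone stuck on an affected version, a caller-side guard along these lines can at least turn the hard crash into a readable error. This is only a sketch: `normalize_labels` and its `max_labels` parameter are hypothetical names, not part of spaCy, and 125 is just the threshold observed in this report, not a documented limit. It also converts the labels to strings, since that is the form the labels are expected to have.

```python
def normalize_labels(entity_types, max_labels=None):
    """Return sorted, de-duplicated string labels.

    Hypothetical helper, not a spaCy API. max_labels is a caller-chosen
    cap (assumption: 125 was the observed threshold in this report).
    """
    # Convert every label to a string and drop duplicates.
    labels = sorted({str(label) for label in entity_types})
    # Fail early with a descriptive message instead of crashing later.
    if max_labels is not None and len(labels) > max_labels:
        raise ValueError(
            f"{len(labels)} labels requested, but at most {max_labels} are allowed"
        )
    return labels


# Example: the integer labels from the unit test, converted to strings.
labels = normalize_labels(range(125), max_labels=125)
print(len(labels))
```

The returned list could then be passed to the `ner.add_label` loop in place of the raw `entity_types` set.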