Describe the bug
When training the NER model on CoNLL files with BIOUL NER tags and specifying the coding scheme as such, an error is thrown because to_bioul() is called on tags that are already formatted in BIOUL.
To Reproduce
Steps to reproduce the behavior:
"dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    ....
},
Run allennlp train on the config with a BIOUL-labeled CoNLL input file:
0it [00:00, ?it/s]2018-10-23 15:59:50,380 - INFO - allennlp.data.dataset_readers.conll2003 - Reading instances from lines in file at: ../data/train/ner-conll/320_extraction.conll
Traceback (most recent call last):
  File "/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/run.py", line 18, in <module>
    main(prog="allennlp")
  File "/anaconda3/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 70, in main
    args.func(args)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/commands/train.py", line 102, in train_model_from_args
    args.recover)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/commands/train.py", line 132, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/commands/train.py", line 261, in train_model
    all_datasets = datasets_from_params(params)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/commands/train.py", line 149, in datasets_from_params
    train_data = dataset_reader.read(train_data_path)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 73, in read
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/anaconda3/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 73, in <listcomp>
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/anaconda3/lib/python3.7/site-packages/tqdm/_tqdm.py", line 937, in __iter__
    for obj in iterable:
  File "/anaconda3/lib/python3.7/site-packages/allennlp/data/dataset_readers/conll2003.py", line 119, in _read
    yield self.text_to_instance(tokens, pos_tags, chunk_tags, ner_tags)
  File "/anaconda3/lib/python3.7/site-packages/allennlp/data/dataset_readers/conll2003.py", line 139, in text_to_instance
    encoding=self._original_coding_scheme) if ner_tags is not None else None
  File "/anaconda3/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_utils/span_utils.py", line 367, in to_bioul
    raise InvalidTagSequence(tag_sequence)
allennlp.data.dataset_readers.dataset_utils.span_utils.InvalidTagSequence: O O O O O B-DATA L-DATA O O O O O
Expected behavior
The trainer should run as expected.
System (please complete the following information):
Additional context
Not defining the coding scheme in the config solves the issue, but it isn't clearly documented (as far as I could tell) that this would be the behavior of specifying BIOUL as the coding scheme. It's probably not super important as I'm told the generally accepted format in CoNLL files is IOB1 anyway but @joelgrus recommended I open an issue for it to at least start the conversation.
This dataset reader assumes IOB1 input, and then re-codes it to BIOUL or whatever other tagging scheme is specified.
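To make the failure mode concrete, here is a minimal, self-contained sketch of an IOB1-to-BIOUL conversion. It is a simplified illustration and not the code in span_utils.py: the function name `iob1_to_bioul` and the local `InvalidTagSequence` class are stand-ins chosen for this example. Because the sketch accepts only `O`, `B-` and `I-` tags, input that is already BIOUL (for example `L-DATA`) is rejected, which is the same kind of error shown in the traceback above.

```python
from typing import List


class InvalidTagSequence(Exception):
    """Stand-in exception: raised when a tag is not valid IOB1."""


def iob1_to_bioul(tags: List[str]) -> List[str]:
    """Convert IOB1 tags to BIOUL (simplified sketch, assumes well-formed labels)."""
    # IOB1 only knows O, B-<label> and I-<label>; anything else is rejected.
    for tag in tags:
        if tag != "O" and tag[:2] not in ("B-", "I-"):
            raise InvalidTagSequence(" ".join(tags))

    bioul: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioul.append("O")
            continue
        label = tag[2:]
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # A span starts at an explicit B-, after O, or after a different label.
        starts = tag.startswith("B-") or prev_tag == "O" or prev_tag[2:] != label
        # A span continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if starts and continues:
            bioul.append("B-" + label)
        elif starts:
            bioul.append("U-" + label)
        elif continues:
            bioul.append("I-" + label)
        else:
            bioul.append("L-" + label)
    return bioul


# IOB1 input converts cleanly.
converted = iob1_to_bioul(["I-PER", "I-PER", "B-PER", "O", "I-LOC"])

# Input that is already BIOUL is rejected, as in the report above.
try:
    iob1_to_bioul("O O O O O B-DATA L-DATA O O O O O".split())
    already_bioul_rejected = False
except InvalidTagSequence:
    already_bioul_rejected = True
```

Under these assumptions, the converter has no way to interpret `L-` or `U-` prefixes, so a BIOUL-tagged file can only pass through the reader if no conversion is requested.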
It _should_ be OK to use IOB1 coding (the default) in this case, since it basically means that the labels are used unmodified. But it would probably be quite confusing for future readers of your code to see IOB1 coding used on a BIOUL-formatted file if they don't have this context...
Maybe the parameter should be named "convert_to_coding_scheme": "bioul"?
But then the default would be "convert_to_coding_scheme": "iob1", which implies that you'd convert from BIOUL to IOB1 (in this particular case, if you leave the default as-is), when in reality the labels are just used as-is.
After thinking about it, I think the current behavior is fine, since the conll2003 format is IOB. The issue is more about the documentation: we should be clearer that the dataset reader expects labels in IOB format, and that specifying 'BIOUL' will convert the labels from IOB to BIOUL.
@nelson-liu, if we change that parameter, we should also change the default to be None.
We think (discussed amongst the core team) we should rename the parameter to something like "convert_to_coding_scheme": "bioul" and improve the documentation.
Hello,
I'd like to work on this.
If I understand correctly, the following needs to be done:
- coding_scheme should be renamed to convert_to_coding_scheme, with the default value bioul.
- The Conll2003DatasetReader class documentation should be improved to mention that the reader expects only IOB1-formatted input.
@kushalchauhan98: thanks! I agree with your outlined steps, but my understanding of the discussion above is that the default should be None, and the only acceptable values are the allowed coding schemes except IOB1 (since we will be converting from IOB1 to something else if a value is supplied; otherwise no conversion is done).
@nelson-liu In that case, the only possible value will be 'BIOUL' (if not None) since there isn't any third coding scheme (allennlp/data/dataset_readers/dataset_utils/span_utils.py only implements functionality to convert IOB1 to BIOUL).
@schmmd @nelson-liu Is it okay if the reader has this behaviour?
@kushalchauhan98 yes, that sounds fine (though I'd probably recommend having a list of "allowed values" that is easily extended if a new iob1_to_x function is created). Thanks!
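A minimal sketch of what such an extensible list of allowed values could look like follows. Everything here is hypothetical and for illustration only: `IOB1_CONVERTERS`, `register_converter` and `recode_tags` are not AllenNLP names, and the small `iob1_to_bioul` function is a simplified stand-in without input validation. The parameter name `convert_to_coding_scheme` and its default of None follow the proposal in this thread.

```python
from typing import Callable, Dict, List, Optional

Converter = Callable[[List[str]], List[str]]

# Hypothetical registry: target scheme name -> function converting from IOB1.
IOB1_CONVERTERS: Dict[str, Converter] = {}


def register_converter(scheme: str) -> Callable[[Converter], Converter]:
    """Decorator that adds a new iob1_to_x function to the allowed values."""
    def decorator(func: Converter) -> Converter:
        IOB1_CONVERTERS[scheme.upper()] = func
        return func
    return decorator


@register_converter("BIOUL")
def iob1_to_bioul(tags: List[str]) -> List[str]:
    """Simplified stand-in converter; performs no validation of its input."""
    out: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        label = tag[2:]
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        starts = tag.startswith("B-") or prev_tag == "O" or prev_tag[2:] != label
        continues = next_tag == "I-" + label
        prefix = ("B-" if continues else "U-") if starts else ("I-" if continues else "L-")
        out.append(prefix + label)
    return out


def recode_tags(tags: List[str],
                convert_to_coding_scheme: Optional[str] = None) -> List[str]:
    """Return tags unchanged when no scheme is given, else convert from IOB1."""
    if convert_to_coding_scheme is None:
        return list(tags)
    key = convert_to_coding_scheme.upper()
    if key not in IOB1_CONVERTERS:
        allowed = sorted(IOB1_CONVERTERS)
        raise ValueError(
            f"unknown coding scheme {convert_to_coding_scheme!r}; "
            f"allowed values: {allowed} or None"
        )
    return IOB1_CONVERTERS[key](tags)
```

With the default of None, labels that are already BIOUL pass through untouched, and adding another target scheme only requires decorating one more function.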
Hi, apparently this hasn't been changed yet? I'm new to AllenNLP and trying to train a model on my input, which is annotated with BIOUL tags. Do I have to change my input to IOB1 so I can use it? Sorry, I'm a little confused by how everything works and just trying to get anything to run, so it's probably a stupid question...
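Per the discussion above, leaving the coding scheme at its default should let BIOUL labels be used unmodified. If IOB1 input is wanted instead, the conversion could look roughly like the sketch below. `bioul_to_iob1` is a hypothetical helper written for this example, not an AllenNLP function, and it assumes the BIOUL input is well formed.

```python
from typing import List, Optional


def bioul_to_iob1(tags: List[str]) -> List[str]:
    """Convert BIOUL tags to IOB1 (hypothetical helper, simplified sketch)."""
    iob1: List[str] = []
    prev_label: Optional[str] = None  # label of the previous token, None after O
    for tag in tags:
        if tag == "O":
            iob1.append("O")
            prev_label = None
            continue
        prefix, label = tag[:2], tag[2:]
        if prefix not in ("B-", "I-", "L-", "U-"):
            raise ValueError(f"not a BIOUL tag: {tag!r}")
        starts_span = prefix in ("B-", "U-")
        # IOB1 uses B- only to separate a new span from an adjacent span
        # of the same label; every other entity token becomes I-.
        if starts_span and prev_label == label:
            iob1.append("B-" + label)
        else:
            iob1.append("I-" + label)
        prev_label = label
    return iob1


# The sequence from the traceback in this issue, converted to IOB1.
example = bioul_to_iob1("O O O O O B-DATA L-DATA O O O O O".split())
```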