Hello again,
I'm here to bother you one more time.
I fine-tuned preloaded BioBERT weights on a custom dataset to run biomedical NER.
Now I want to use the model in inference mode on a 'raw' set of documents. I renamed this set 'test.txt' and formatted it the following way (documents are separated by '-DOCSTART- (num_doc)' lines):
to O
be O
referred O
to O
the O
location O
of O
the O
disease O
in O
the O
skeletal O
structures O
examined O
; O
unchanged O
the O
areas O
of O
bone O
rarefaction O
reported O
to O
the O
sternum O
as O
a O
result O
of O
median O
sternotomy O
. O
I had to add the 'fake' labels on the right and place a space " " between the two columns.
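(For reference, a minimal sketch of how a raw document could be converted into this two-column format; the naive whitespace tokenization and the write_conll helper name are my own assumptions, not part of run_ner.py:)

import io

# Sketch only: one "token O" pair per line, a blank line after each sentence,
# and a '-DOCSTART- (num_doc)' line before each document.
def write_conll(documents, path="test.txt"):
    with io.open(path, "w", encoding="utf-8") as f:
        for i, doc in enumerate(documents):
            f.write("-DOCSTART- ({})\n\n".format(i))
            for sentence in doc.split("\n"):
                tokens = sentence.split()  # assumption: naive whitespace tokenization
                if not tokens:
                    continue
                for token in tokens:
                    f.write("{} O\n".format(token))  # dummy 'O' label for inference
                f.write("\n")  # blank line ends the sentence

write_conll(["to be referred to the location of the disease ."])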
The error I now get is:
Traceback (most recent call last):
File "run_ner.py", line 531, in <module>
main()
File "run_ner.py", line 522, in main
output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
IndexError: list index out of range
Many thanks again.
Three questions:
- How did you format your test set?
- test_results.txt: do you see the content of this file, and is it correct?
- The predictions variable: have you seen the content of this variable? Maybe it is only a saving problem.
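One quick way to inspect it (a hypothetical snippet, pasted just before the line in run_ner.py that raises the IndexError):

# Hypothetical debugging snippet: compare how many label sequences were
# predicted and how long each one is, before the write loop runs out of labels.
print(len(predictions))
print([len(p) for p in predictions[:5]])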
Ciao @TheEdoardo93 ,
Thanks for your support!
- I formatted the test set following the indications from the tutorial on german-eval, with the first column being the token and the second the B-I-O tag (in this set it's just a pile of Os to fill the column). They are space-separated.
- test_results.txt is saved and shows precision, recall, F1, and loss. All are terrible of course, as the test set was actually filled with the dummy BIO tags.
- test_predictions.txt is truncated after about 50 lines of token + BIO prediction.
- I'm now trying to print the content of predictions, I'll let you know.
I'll wait for the content of your predictions variable :D
We can implement a saving method that works as we expect (instead of using the code lines in the run_ner.py script) and see what happens!
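For example, something along these lines (a sketch only; it assumes test.txt has one "token label" pair per line, blank lines separate sentences, and predictions[i] holds the label list for the i-th sentence in order; missing labels are padded with 'O'):

import io

def save_predictions(test_path, predictions, out_path="test_predictions.txt"):
    with io.open(test_path, encoding="utf-8") as fin, \
         io.open(out_path, "w", encoding="utf-8") as fout:
        sent_id, tok_id = 0, 0
        for line in fin:
            if not line.strip() or line.startswith("-DOCSTART-"):
                fout.write(line)
                if not line.strip() and tok_id > 0:  # a blank line closes a sentence
                    sent_id += 1
                    tok_id = 0
                continue
            token = line.split()[0]
            labels = predictions[sent_id] if sent_id < len(predictions) else []
            label = labels[tok_id] if tok_id < len(labels) else "O"  # pad if truncated
            fout.write("{} {}\n".format(token, label))
            tok_id += 1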
I'm back.
What I did: I changed the column separator from tab to space (I was wrong in the previous comment; I thought I had already changed it).
Now the code runs properly and test_predictions.txt is complete.
This is a snapshot of print(predictions):
[['O', 'O', 'B-Organism_subdivision', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Organ', 'O', 'B-Organ', 'O', 'B-Multi-tissue_structure', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Organism_subdivision', 'I-Organism_subdivision', 'O', 'O', 'O', 'O', 'O', 'B-Cancer', 'I-Cancer', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'B-Immaterial_anatomical_entity', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'B-Multi-tissue_structure', 'B-Multi-tissue_structure', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Immaterial_anatomical_entity', 'O', 'O'], ..., ['O', 'O', 'O', 'O', 'B-Organ', 'O', 'B-Organ', 'O', 'B-Organ', 'O', 'B-Organ', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Tissue', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Tissue', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Tissue', 'O', 'O', 'O', 'O', 'B-Multi-tissue_structure', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
There is another minor issue, I guess: a long series of warnings about missing predictions because the maximum sequence length was exceeded. The non-predicted tokens don't appear to be really relevant for my final goal, but I'd like to have a complete output nonetheless.
I will try to place a newline not only after the usual end-of-sentence punctuation (.!?), but also after semicolons and colons, in order to split each document into more pieces.
Does this strategy make sense, or have I misinterpreted the meaning of the maximum sequence length?
Do you have sentences longer than 512 tokens? BioBERT allows sentences of at most 512 tokens, as stated in the paper:
"The maximum sequence length was fixed to 512"
If you have sentences with more than 512 tokens, you have to apply a different workaround, e.g. splitting a sentence of length 1024 into two sentences of length 512 and combining their outputs in some manner.
However, the strategy you've proposed (e.g. splitting by comma, period, semicolon, etc.) works! Try this approach and share the results with us! I suggest doing a visual evaluation/comparison between the current output and the output you'll obtain with the strategy you highlighted.
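As a rough illustration of that splitting workaround (a sketch only; predict_fn stands for whatever call runs your fine-tuned BioBERT NER model on one sentence, and the chunk size counts whitespace-separated words, not WordPiece sub-tokens, so it is only approximate):

# Sketch: split an over-long token list into chunks the model can handle,
# run NER on each chunk, then concatenate the per-chunk label lists.
# max_len=510 leaves room for the [CLS] and [SEP] tokens BERT adds.
def chunk_tokens(tokens, max_len=510):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def predict_long_sentence(tokens, predict_fn, max_len=510):
    labels = []
    for chunk in chunk_tokens(tokens, max_len):
        labels.extend(predict_fn(chunk))  # assumed callable returning one label per token
    return labels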
Funnily enough, now a lot more tokens are without predictions.
What I did was just add a newline after each semicolon with sed.
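(For reference, the same splitting step can be done in Python instead of sed; this sketch assumes the token-per-line test.txt format above and inserts a blank line, i.e. a sentence break in CoNLL format, after every ';' token:)

import io

# Sketch: start a new sentence after every line whose token is ';'.
with io.open("test.txt", encoding="utf-8") as fin, \
     io.open("test_split.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line)
        if line.split() and line.split()[0] == ";":
            fout.write("\n")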
A question that I thought would be easy to answer: what constitutes a sequence in BERT relative to this task? Is it the sequence of tokens between empty lines? Or between defined punctuation marks?
Taken from the official BERT paper:
Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
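In this setup, a "sequence" should therefore correspond to one blank-line-separated example from test.txt, converted to WordPiece sub-tokens and truncated at max_seq_length. You can check how much a sentence expands after WordPiece splitting with something like this (hypothetical snippet; the model name is just an example, substitute your BioBERT vocabulary):

from transformers import BertTokenizer

# Truncation happens on WordPiece sub-tokens, not on whitespace-separated words,
# so a sentence can exceed the limit even with fewer than 512 words.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
words = "unchanged the areas of bone rarefaction reported to the sternum".split()
subtokens = tokenizer.tokenize(" ".join(words))
print(len(words), len(subtokens))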
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.