Hello again,
I'm here to bother you one more time.
I fine-tuned preloaded BioBERT weights on a custom dataset to run biomedical NER.
Now I want to use the model in inference mode on a 'raw' set of documents. I renamed this set 'test.txt' and formatted it the following way (documents are separated by '-DOCSTART- (num_doc)' lines):
to O
be O
referred O
to O
the O
location O
of O
the O
disease O
in O
the O
skeletal O
structures O
examined O
; O
unchanged O
the O
areas O
of O
bone O
rarefaction O
reported O
to O
the O
sternum O
as O
a O
result O
of O
median O
sternotomy O
. O
I had to add the 'fake' labels on the right and place a space " " between the two columns.
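(For reference, a minimal sketch of how a raw document could be converted into this two-column format; the naive whitespace tokenization and the write_conll helper name are my own assumptions, not part of run_ner.py:)

import io

# Sketch only: one "token O" pair per line, a blank line after each sentence,
# and a '-DOCSTART- (num_doc)' line before each document.
def write_conll(documents, path="test.txt"):
    with io.open(path, "w", encoding="utf-8") as f:
        for i, doc in enumerate(documents):
            f.write("-DOCSTART- ({})\n\n".format(i))
            for sentence in doc.split("\n"):
                tokens = sentence.split()  # assumption: naive whitespace tokenization
                if not tokens:
                    continue
                for token in tokens:
                    f.write("{} O\n".format(token))  # dummy 'O' label for inference
                f.write("\n")  # blank line ends the sentence

write_conll(["to be referred to the location of the disease ."])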
The error I now get is:
Traceback (most recent call last):
File "run_ner.py", line 531, in <module>
main()
File "run_ner.py", line 522, in main
output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
IndexError: list index out of range
Many thanks again.
Three questions:
- How did you format your test set?
- test_results.txt: do you see the content of this file, and is it correct?
- The predictions variable: have you seen the content of this variable? Maybe it is only a saving problem.
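One quick way to inspect it (a hypothetical snippet, pasted just before the line in run_ner.py that raises the IndexError):

# Hypothetical debugging snippet: compare how many label sequences were
# predicted and how long each one is, before the write loop runs out of labels.
print(len(predictions))
print([len(p) for p in predictions[:5]])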
Ciao @TheEdoardo93 ,
Thanks for your support!
- I formatted the test set following the indications from the tutorial on german-eval, with the first column being the token and the second the B-I-O tag (in this set it's just a pile of Os to fill the column). They are space-separated.
- test_results.txt is saved and shows precision, recall, F1, and loss. All are terrible of course, as the test set was actually filled with the dummy BIO tags.
- test_predictions.txt is truncated after about 50 lines of token + BIO prediction.
- I'm now trying to print the content of predictions, I'll let you know.
I'll wait for the content of your predictions variable :D
We can implement a saving method that works as we expect (instead of using the code lines in the run_ner.py script) and see what happens!
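For example, something along these lines (a sketch only; it assumes test.txt has one "token label" pair per line, blank lines separate sentences, and predictions[i] holds the label list for the i-th sentence in order; missing labels are padded with 'O'):

import io

def save_predictions(test_path, predictions, out_path="test_predictions.txt"):
    with io.open(test_path, encoding="utf-8") as fin, \
         io.open(out_path, "w", encoding="utf-8") as fout:
        sent_id, tok_id = 0, 0
        for line in fin:
            if not line.strip() or line.startswith("-DOCSTART-"):
                fout.write(line)
                if not line.strip() and tok_id > 0:  # a blank line closes a sentence
                    sent_id += 1
                    tok_id = 0
                continue
            token = line.split()[0]
            labels = predictions[sent_id] if sent_id < len(predictions) else []
            label = labels[tok_id] if tok_id < len(labels) else "O"  # pad if truncated
            fout.write("{} {}\n".format(token, label))
            tok_id += 1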
I'm back.
What I did: I changed the column separator from tab to space (I was wrong in the previous comment; I thought I had already changed it).
Now the code runs properly and test_predictions.txt is complete.
This is a snapshot of print(predictions):
[['O', 'O', 'B-Organism_subdivision', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Organ', 'O', 'B-Organ', 'O', 'B-Multi-tissue_structure', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Organism_subdivision', 'I-Organism_subdivision', 'O', 'O', 'O', 'O', 'O', 'B-Cancer', 'I-Cancer', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'B-Immaterial_anatomical_entity', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'B-Multi-tissue_structure', 'B-Multi-tissue_structure', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Immaterial_anatomical_entity', 'O', 'O'], ..., ['O', 'O', 'O', 'O', 'B-Organ', 'O', 'B-Organ', 'O', 'B-Organ', 'O', 'B-Organ', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Tissue', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Tissue', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Tissue', 'O', 'O', 'O', 'O', 'B-Multi-tissue_structure', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
There is another minor issue, I guess: a long series of warnings about missing predictions because the maximum sequence length was exceeded. The non-predicted tokens don't appear to be really relevant for my final goal, but I'd like to have a complete output nonetheless.
I will try to place a newline not only after the usual end-of-sentence punctuation (.!?), but also after semicolons and colons, in order to split each document into more pieces.
Does this strategy make sense, or have I misinterpreted the meaning of the maximum sequence length?
Do you have sentences longer than 512 tokens? BioBERT allows sentences of at most 512 tokens, as stated in the paper:
"The maximum sequence length was fixed to 512"
If you have sentences with more than 512 tokens, you have to apply a different workaround, e.g. splitting a sentence of length 1024 into two sentences of length 512 and combining their outputs in some manner.
However, the strategy you've proposed (e.g. splitting by comma, period, semicolon, etc.) works! Try this approach and share the results with us! I suggest doing a visual evaluation/comparison between the current output and the output you'll obtain with the strategy you highlighted.
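As a rough illustration of that splitting workaround (a sketch only; predict_fn stands for whatever call runs your fine-tuned BioBERT NER model on one sentence, and the chunk size counts whitespace-separated words, not WordPiece sub-tokens, so it is only approximate):

# Sketch: split an over-long token list into chunks the model can handle,
# run NER on each chunk, then concatenate the per-chunk label lists.
# max_len=510 leaves room for the [CLS] and [SEP] tokens BERT adds.
def chunk_tokens(tokens, max_len=510):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def predict_long_sentence(tokens, predict_fn, max_len=510):
    labels = []
    for chunk in chunk_tokens(tokens, max_len):
        labels.extend(predict_fn(chunk))  # assumed callable returning one label per token
    return labels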
Funnily enough, now a lot more tokens are without predictions.
What I did was just add a newline after each semicolon with sed.
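(For reference, the same splitting step can be done in Python instead of sed; this sketch assumes the token-per-line test.txt format above and inserts a blank line, i.e. a sentence break in CoNLL format, after every ';' token:)

import io

# Sketch: start a new sentence after every line whose token is ';'.
with io.open("test.txt", encoding="utf-8") as fin, \
     io.open("test_split.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line)
        if line.split() and line.split()[0] == ";":
            fout.write("\n")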
A question that I thought would be easy to answer: what constitutes a sequence in BERT relative to this task? Is it the sequence of tokens between empty lines? Or between defined punctuation marks?
Taken from the official BERT paper:
Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
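In this setup, a "sequence" should therefore correspond to one blank-line-separated example from test.txt, converted to WordPiece sub-tokens and truncated at max_seq_length. You can check how much a sentence expands after WordPiece splitting with something like this (hypothetical snippet; the model name is just an example, substitute your BioBERT vocabulary):

from transformers import BertTokenizer

# Truncation happens on WordPiece sub-tokens, not on whitespace-separated words,
# so a sentence can exceed the limit even with fewer than 512 words.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
words = "unchanged the areas of bone rarefaction reported to the sternum".split()
subtokens = tokenizer.tokenize(" ".join(words))
print(len(words), len(subtokens))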
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.