Model I am using (Bert, XLNet ...): bert
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The task I am working on is:
Steps to reproduce the behavior:
Running the run_tf_ner example raises the following exception:
Traceback (most recent call last):
File "run_tf_ner.py", line 282, in <module>
main()
File "run_tf_ner.py", line 213, in main
trainer.train()
File "venv/lib/python3.7/site-packages/transformers/trainer_tf.py", line 308, in train
logger.info("Epoch {} Step {} Train Loss {:.4f}".format(epoch, step, training_loss.numpy()))
TypeError: unsupported format string passed to numpy.ndarray.__format__
This underlying behavior has been reported multiple times on the numpy tracker:
https://github.com/numpy/numpy/issues/12491
https://github.com/numpy/numpy/issues/5543
I think the easiest solution is to avoid using the numpy format string this way in TFTrainer.
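For illustration, one way to do that is to reduce the loss to a plain scalar before formatting. This is just a sketch; the helper name and the .mean() reduction are my assumptions, not the actual trainer_tf.py code:

import logging
import numpy as np

logger = logging.getLogger(__name__)

def log_train_loss(epoch, step, loss_value):
    # loss_value is whatever training_loss.numpy() returned: a scalar in some
    # setups, or (as in the traceback above) an ndarray of per-example losses.
    if isinstance(loss_value, np.ndarray):
        loss_value = float(loss_value.mean())  # reduce to a plain Python float
    logger.info("Epoch {} Step {} Train Loss {:.4f}".format(epoch, step, loss_value))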
transformers version: 2.1.0
Same here.
How did you resolve it? Do we have to wait until the TFTrainer developers change the code in trainer_tf.py?
Is there a way around it?
Hello,
Can you give a sample of data and a command line with which I can reproduce the issue? Thanks!
I ran into the same problem with the exact setting I posted here: #4664 (comment)
After I fixed the TFTrainer parameter issue, the training started but it returned this error after the normal BERT log.
Sorry, I can't reproduce the issue with the command line in #4664; for me it works. Which dataset are you using? GermEval?
Yes, I followed the whole process here: https://github.com/huggingface/transformers/tree/master/examples/token-classification
Sorry, I still can't reproduce the issue :(
I tried with different NER datasets, including GermEval, and everything works fine.
I would suggest waiting for the next version of the TF Trainer to see whether it solves your problem. It should arrive soon. Sorry :(
I am trying to reproduce it to see where the glitch is. Unfortunately, the Colab GPU is too busy for me to get connected at the moment. I will post here once I locate the problem.
No worries. Thanks for bringing out the TFTrainer!
I experienced the same issue while trying the latest run_tf_ner.py. I had almost no problems with the old versions (from months ago) of run_tf_ner.py and utils_ner.py; I trained several models and got very good predictions. But after updating to the latest run_tf_ner.py, I ran into several problems: (1) logging_dir was None (already solved by passing the parameter); (2) the value of pad_token_label_id: in the old version I had it set to 0, but the latest run_tf_ner.py sets it to -1, and I get wrong prediction results with -1; (3) the third issue is this one.
To keep the training process moving, I created a new class that inherits from TFTrainer and modified the train method: I wrapped the logging call in a try/except so that on TypeError it falls back to logger.info("Epoch {} Step {} Train Loss {}".format(epoch, step, 'TypeError')).
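Roughly, the workaround looks like this (a simplified sketch of just the logging part, not the actual TFTrainer train loop):

import logging

logger = logging.getLogger(__name__)

def log_step(epoch, step, training_loss):
    try:
        # Works when training_loss.numpy() is a scalar.
        logger.info("Epoch {} Step {} Train Loss {:.4f}".format(epoch, step, training_loss.numpy()))
    except TypeError:
        # training_loss.numpy() came back as an array, so the float format spec
        # fails; log a placeholder so training keeps going.
        logger.info("Epoch {} Step {} Train Loss {}".format(epoch, step, 'TypeError'))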
Here are training_loss.numpy() and training_loss printed:
[3.86757078e-04 6.49182359e-04 1.50194198e-01 1.72556902e-03
7.37545686e-03 7.55832903e-03 2.59326249e-01 1.65126711e-01
1.45479038e-01 2.91670375e-02 1.02433632e-03 1.09142391e-03
7.45586725e-03 1.56116625e-03 6.97672069e-02 6.09296076e-02
1.59586817e-02 2.96084117e-02 3.36027122e-04 2.67877331e-04
2.72625312e-02 3.24607291e-03 2.79245054e-04 8.95933714e-04
1.38876194e-05 4.55974305e-06 7.18232468e-06 6.49688218e-06
4.67895006e-06 4.67895188e-06 4.08290907e-06 5.72202407e-06
5.99023815e-06 5.48360913e-06 1.09671510e-05 1.32022615e-05
7.30153261e-06 4.67895097e-06 4.88756723e-06 4.73855425e-06
4.70875511e-06 5.33459615e-06 4.35112906e-06 8.13599218e-06
4.14251372e-06 3.48686262e-06 7.68894461e-06 4.14251281e-06
4.55974168e-06 4.29152169e-06 9.68567110e-06 2.68220538e-06
3.63587583e-06 4.14251235e-06 3.18884304e-06 4.38093048e-06
4.52994209e-06 4.70875284e-06 3.30805187e-06 5.63261574e-06
3.15904026e-06 6.55648546e-06 5.87103386e-06 4.14251190e-06
3.81468908e-06 3.39745884e-06 4.47033653e-06 6.49688172e-06
6.25846224e-06 4.08290816e-06 4.08290680e-06 3.69548002e-06
4.35112725e-06 3.60607328e-06 4.97697329e-06 6.88430828e-06
5.72202634e-06 4.79816072e-06 5.75182776e-06 6.43727981e-06
3.78488676e-06 1.53479104e-05 6.70549389e-06 7.03331716e-06
3.18884258e-06 7.18232604e-06 5.27499060e-06 6.07965376e-06
3.72528302e-06 9.03003547e-06 5.03657793e-06 6.43727435e-06
5.33459661e-06 4.85776036e-06 9.38766698e-06 4.11270958e-06
3.36765652e-06 5.42400539e-06 5.18558409e-06 6.73529667e-06
9.03001182e-06 4.47033699e-06 3.51666586e-06 5.15578267e-06
3.87429282e-06 3.39745884e-06 4.08290725e-06 7.48034654e-06
7.71875875e-06 3.75508489e-06 3.60607396e-06 3.72528302e-06
5.84123518e-06 2.89082072e-06 4.32132674e-06 6.37766652e-06
4.64915001e-06 7.03332262e-06 3.99350029e-06 9.14925931e-06
4.32132583e-06 5.66242352e-06 3.75508489e-06 6.10945517e-06
4.85776673e-06 5.60281842e-06 4.70875375e-06 3.75508534e-06]
tf.Tensor(
[ ...the same 128 values as the numpy array above... ], shape=(128,), dtype=float32)
@xl2602 Thanks for your feedback; -1 was also the default value of pad_token_label_id in the previous version of the script.
@jx669 and @xl2602 Can you try to add the --mode token-classification parameter?
@jplu I think this has nothing to do with the context of the training script. To reproduce, just run logging.info("Here is an error example {:.4f}".format(np.array([1,2,3]))) in a Python console. Maybe this is related to the numpy version; I've tried 1.16.4 and 1.18 and they both failed.
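For reference, the failure does not need transformers at all; in a plain console:

import numpy as np
"{:.4f}".format(np.array([1, 2, 3]))  # TypeError: unsupported format string passed to numpy.ndarray.__format__
"{:.4f}".format(2.0)                  # fine: '2.0000' -- the format spec only works on a scalar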
Tested with 1.18.4 only; I'm going to try other versions to see if I get the same issue.
numpy 1.18.4 is the same as what I installed.
I just reproduced the same error message on a Colab GPU:
These are what I installed:
!pip install transformers
!pip install seqeval
!pip install wandb; wandb login
I did not install numpy or TF separately; I think they come with the transformers package.
I checked the numpy version:
'1.18.4'
TF version:
'2.2.0'
OK, I still don't get any error, even with different versions of Numpy.
@jx669 @xl2602 @VDCN12593 Can you please tell me if you are doing the exact same thing as in this Colab: https://colab.research.google.com/drive/19zAfUN8EEmiT4imwzLeFv6q1PJ5CgcRb?usp=sharing
It might have something to do with the mode parameter: --mode token-classification
If you remove that line in your Colab notebook, the same error message reappears.
Cool! Happy we found the problem.
When you run the TF Trainer you have to specify which task it will be trained on. Here, for example, it is token-classification; for text classification it would be text-classification (the default), and the same goes for the two other tasks, QA and MC.
This behavior will be removed in the next version of the TF Trainer.
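For example, the only change needed here is to append the flag to your existing run_tf_ner.py invocation (all other arguments stay as they were):

python run_tf_ner.py ... --mode token-classification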
I see. Good to learn. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.