Model I am using (Bert, XLNet ...): bert
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The task I am working on is:
Steps to reproduce the behavior:
Running the run_tf_ner example raises the following exception:
Traceback (most recent call last):
File "run_tf_ner.py", line 282, in <module>
main()
File "run_tf_ner.py", line 213, in main
trainer.train()
File "venv/lib/python3.7/site-packages/transformers/trainer_tf.py", line 308, in train
logger.info("Epoch {} Step {} Train Loss {:.4f}".format(epoch, step, training_loss.numpy()))
TypeError: unsupported format string passed to numpy.ndarray.__format__
This underlying behavior has been reported multiple times on the numpy tracker:
https://github.com/numpy/numpy/issues/12491
https://github.com/numpy/numpy/issues/5543
I think the easiest solution is to avoid using the numpy format string this way in TFTrainer.
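For illustration, one way to do that is to reduce the loss to a plain scalar before formatting. This is just a sketch; the helper name and the .mean() reduction are my assumptions, not the actual trainer_tf.py code:

import logging
import numpy as np

logger = logging.getLogger(__name__)

def log_train_loss(epoch, step, loss_value):
    # loss_value is whatever training_loss.numpy() returned: a scalar in some
    # setups, or (as in the traceback above) an ndarray of per-example losses.
    if isinstance(loss_value, np.ndarray):
        loss_value = float(loss_value.mean())  # reduce to a plain Python float
    logger.info("Epoch {} Step {} Train Loss {:.4f}".format(epoch, step, loss_value))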
transformers version: 2.1.0
Same here.
How did you resolve it? Do we have to wait until the TFTrainer developers change the code in trainer_tf.py?
Is there a way around it?
Hello,
Can you give a sample of data and a command line with which I can reproduce the issue? Thanks!
I ran into the same problem with the exact setting I posted here: #4664 (comment)
After I fixed the TFTrainer parameter issue, the training started but it returned this error after the normal BERT log.
Sorry, I can't reproduce the issue with the command line in #4664; for me it works. Which dataset are you using? GermEval?
Yes, I followed the whole process here: https://github.com/huggingface/transformers/tree/master/examples/token-classification
Sorry, I still can't reproduce the issue :(
I tried with different NER datasets, including GermEval, and everything works fine.
I would suggest waiting for the next version of the TF Trainer to see whether it solves your problem. It should arrive soon. Sorry :(
I am trying to reproduce it to see where the glitch is. Unfortunately, the Colab GPU is too busy for me to get connected at the moment. I will post here once I locate the problem.
No worries. Thanks for bringing out the TFTrainer!
I experienced the same issue while trying the latest run_tf_ner.py. I had almost no problems with the old versions (from months ago) of run_tf_ner.py and utils_ner.py; I trained several models and got very good predictions. But after updating to the latest run_tf_ner.py, I ran into several problems: (1) logging_dir was None (already solved by passing the parameter); (2) the value of pad_token_label_id: in the old version I had it set to 0, but the latest run_tf_ner.py sets it to -1, and I get wrong prediction results with -1; (3) the third issue is this one.
To keep the training process moving, I created a new class that inherits from TFTrainer and modified the train method: I wrapped the logging call in a try/except so that on TypeError it falls back to logger.info("Epoch {} Step {} Train Loss {}".format(epoch, step, 'TypeError')).
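Roughly, the workaround looks like this (a simplified sketch of just the logging part, not the actual TFTrainer train loop):

import logging

logger = logging.getLogger(__name__)

def log_step(epoch, step, training_loss):
    try:
        # Works when training_loss.numpy() is a scalar.
        logger.info("Epoch {} Step {} Train Loss {:.4f}".format(epoch, step, training_loss.numpy()))
    except TypeError:
        # training_loss.numpy() came back as an array, so the float format spec
        # fails; log a placeholder so training keeps going.
        logger.info("Epoch {} Step {} Train Loss {}".format(epoch, step, 'TypeError'))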
Here are training_loss.numpy() and training_loss printed:
[3.86757078e-04 6.49182359e-04 1.50194198e-01 1.72556902e-03
7.37545686e-03 7.55832903e-03 2.59326249e-01 1.65126711e-01
1.45479038e-01 2.91670375e-02 1.02433632e-03 1.09142391e-03
7.45586725e-03 1.56116625e-03 6.97672069e-02 6.09296076e-02
1.59586817e-02 2.96084117e-02 3.36027122e-04 2.67877331e-04
2.72625312e-02 3.24607291e-03 2.79245054e-04 8.95933714e-04
1.38876194e-05 4.55974305e-06 7.18232468e-06 6.49688218e-06
4.67895006e-06 4.67895188e-06 4.08290907e-06 5.72202407e-06
5.99023815e-06 5.48360913e-06 1.09671510e-05 1.32022615e-05
7.30153261e-06 4.67895097e-06 4.88756723e-06 4.73855425e-06
4.70875511e-06 5.33459615e-06 4.35112906e-06 8.13599218e-06
4.14251372e-06 3.48686262e-06 7.68894461e-06 4.14251281e-06
4.55974168e-06 4.29152169e-06 9.68567110e-06 2.68220538e-06
3.63587583e-06 4.14251235e-06 3.18884304e-06 4.38093048e-06
4.52994209e-06 4.70875284e-06 3.30805187e-06 5.63261574e-06
3.15904026e-06 6.55648546e-06 5.87103386e-06 4.14251190e-06
3.81468908e-06 3.39745884e-06 4.47033653e-06 6.49688172e-06
6.25846224e-06 4.08290816e-06 4.08290680e-06 3.69548002e-06
4.35112725e-06 3.60607328e-06 4.97697329e-06 6.88430828e-06
5.72202634e-06 4.79816072e-06 5.75182776e-06 6.43727981e-06
3.78488676e-06 1.53479104e-05 6.70549389e-06 7.03331716e-06
3.18884258e-06 7.18232604e-06 5.27499060e-06 6.07965376e-06
3.72528302e-06 9.03003547e-06 5.03657793e-06 6.43727435e-06
5.33459661e-06 4.85776036e-06 9.38766698e-06 4.11270958e-06
3.36765652e-06 5.42400539e-06 5.18558409e-06 6.73529667e-06
9.03001182e-06 4.47033699e-06 3.51666586e-06 5.15578267e-06
3.87429282e-06 3.39745884e-06 4.08290725e-06 7.48034654e-06
7.71875875e-06 3.75508489e-06 3.60607396e-06 3.72528302e-06
5.84123518e-06 2.89082072e-06 4.32132674e-06 6.37766652e-06
4.64915001e-06 7.03332262e-06 3.99350029e-06 9.14925931e-06
4.32132583e-06 5.66242352e-06 3.75508489e-06 6.10945517e-06
4.85776673e-06 5.60281842e-06 4.70875375e-06 3.75508534e-06]
tf.Tensor(
[ ...the same 128 values as the numpy array above... ], shape=(128,), dtype=float32)
@xl2602 Thanks for your feedback; -1 was also the default value of pad_token_label_id in the previous version of the script.
@jx669 and @xl2602 Can you try to add the --mode token-classification parameter?
@jplu I think this has nothing to do with the context of the training script. To reproduce, just run logging.info("Here is an error example {:.4f}".format(np.array([1,2,3]))) in a Python console. Maybe this is related to the numpy version; I've tried 1.16.4 and 1.18 and they both failed.
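For reference, the failure does not need transformers at all; in a plain console:

import numpy as np
"{:.4f}".format(np.array([1, 2, 3]))  # TypeError: unsupported format string passed to numpy.ndarray.__format__
"{:.4f}".format(2.0)                  # fine: '2.0000' -- the format spec only works on a scalar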
Tested with 1.18.4 only; I'm going to try other versions to see if I get the same issue.
numpy 1.18.4 is the same as what I installed.
I just reproduced the same error message on a Colab GPU:
These are what I installed:
!pip install transformers
!pip install seqeval
!pip install wandb; wandb login
I did not install numpy or TF separately; I think they come with the transformers package.
I checked the numpy version:
'1.18.4'
TF version:
'2.2.0'
OK, I still don't get any error, even with different versions of Numpy.
@jx669 @xl2602 @VDCN12593 Can you please tell me if you are doing the exact same thing as in this Colab: https://colab.research.google.com/drive/19zAfUN8EEmiT4imwzLeFv6q1PJ5CgcRb?usp=sharing
It might have something to do with the mode parameter: --mode token-classification
If you remove that line in your Colab notebook, the same error message reappears.
Cool! Happy we found the problem.
When you run the TF Trainer you have to specify which task it will be trained on. Here, for example, it is token-classification; for text classification it would be text-classification (the default), and the same goes for the two other tasks, QA and MC.
This behavior will be removed in the next version of the TF Trainer.
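For example, the only change needed here is to append the flag to your existing run_tf_ner.py invocation (all other arguments stay as they were):

python run_tf_ner.py ... --mode token-classification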
I see. Good to learn. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.