Tensor2tensor: How to measure the number of training epochs

Created on 13 Nov 2017 · 22 Comments · Source: tensorflow/tensor2tensor

In order to compare with other NMT frameworks, I would like to know how many training epochs (i.e. passes over the whole training data) have been completed so far.
I can see the number of training (global) steps and I guess epochs = steps * batch_size / training_subwords.
So the question boils down to: How to make T2T report (e.g. in the log) the number of subwords in the training data?

feature request

martinpopel

All 22 comments

Yeah, this seems like a reasonable thing to want, but unfortunately it is not simple to do currently. The batch size varies because examples are bucketed by sequence length, which complicates the picture.

Counting the number of subwords would need a pass through the data on disk, probably best done by a separate script.

rsepassi on 13 Nov 2017

👍2

@rsepassi Hi, is "batch_size" the number of subwords of both the source and target sentences in a batch,
or only the number of subwords of the source sentences in a batch?
Thanks a lot.

yuimo on 29 Nov 2017

@yuimo: it is the maximum of source and target subwords, for each sentence. See https://github.com/tensorflow/tensor2tensor/blob/92983eaaa457ec18729b1883ba5ae4a6614bdcb5/tensor2tensor/utils/data_reader.py#L145-L153

martinpopel on 29 Nov 2017
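
A minimal sketch of the separate counting script suggested above, assuming the per-sentence size is the maximum of the source and target subword counts, as described in this comment. The function name `count_training_subwords` and the `encode` parameter are hypothetical; `str.split` is only a whitespace stand-in for the real subword encoder, which you would need to plug in yourself to get counts that match T2T.

```python
from typing import Callable, Iterable, List, Tuple


def count_training_subwords(
    pairs: Iterable[Tuple[str, str]],
    encode: Callable[[str], List] = str.split,  # stand-in, NOT the T2T subword encoder
) -> int:
    """Sum max(source length, target length) over all sentence pairs."""
    total = 0
    for source, target in pairs:
        # Each sentence pair counts as the longer of its two sides.
        total += max(len(encode(source)), len(encode(target)))
    return total


# Toy data, purely illustrative.
pairs = [
    ("a b c", "x y"),      # max(3, 2) = 3
    ("hello", "u v w x"),  # max(1, 4) = 4
]
print(count_training_subwords(pairs))  # prints 7
```

The result is the `training_subwords` value used in the epoch formula discussed in this thread.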

@martinpopel I got it, thanks a lot.

yuimo on 30 Nov 2017

@martinpopel, you meant to write:

epochs = steps * batch_size * worker_gpu / training_subwords

Right?

ndvbd on 13 Feb 2018

> epochs = steps * batch_size * worker_gpu / training_subwords

Yes, exactly. In other words epochs = steps * effective_batch_size / training_subwords.

I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I have not had enough time to tidy it up, document it and send it as a PR.

martinpopel on 13 Feb 2018
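
The formula confirmed above can be sketched as a small helper. The function name `estimate_epochs` and all the numbers in the usage line are hypothetical and only illustrate the arithmetic; the estimate ignores zero-padding.

```python
def estimate_epochs(steps: int, batch_size: int, worker_gpu: int,
                    training_subwords: int) -> float:
    """epochs = steps * effective_batch_size / training_subwords."""
    if training_subwords <= 0:
        raise ValueError("training_subwords must be positive")
    # The effective batch size is the per-GPU batch size times the number of GPUs.
    effective_batch_size = batch_size * worker_gpu
    return steps * effective_batch_size / training_subwords


# Hypothetical run: 100k steps, batch_size 2048 subwords on 8 GPUs,
# 500M training subwords.
print(estimate_epochs(steps=100_000, batch_size=2048, worker_gpu=8,
                      training_subwords=500_000_000))
```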

@martinpopel It would be nice if T2T showed in TensorBoard how many epochs were done during training.

Do you know if there are any rules of thumb regarding how many epochs should be done for NMT tasks?

In addition, do you know whether T2T runs over the training data in a deterministic way or in a randomized way? (That is, should two training invocations yield exactly the same model?)

ndvbd on 13 Feb 2018

> It would be nice if T2T showed in TensorBoard how many epochs were done during training.

Yes, that would be nice, but there are two problems:

1. How to compute the number of epochs exactly? The formula above does not account for zero-padding, so it is just an upper bound on the number of epochs (I think). TensorBoard reports input_stats/targets_nonpadding_fraction and input_stats/inputs_nonpadding_fraction, so there is a way to compute the number of epochs. Ideally, t2t-datagen should report the number of subwords to stderr (as my script does) and t2t-trainer should report the number of epochs (or how many steps are in one epoch after the first epoch has ended).
2. How to present this number in TensorBoard? Currently, TensorBoard offers just "Step", "Relative" and "Wall" as the options for the x-axis and I doubt there is a way to provide other options (maybe plugins?). Also, I am not sure what is more helpful: epochs or number of training examples? For a given training data set, these two options don't change the curves, just the x-axis labels, but when comparing experiments with different training data sizes, I guess the number of training examples is more relevant.

> Do you know if there are any rules of thumb regarding how many epochs should be done for NMT tasks?

The standard and naive answer is "until converged on the dev set", but this is difficult to measure (how to set the early stopping parameters?) and to achieve. My training data has about half a gigaword and even 18 epochs (11 days of training on 8 GPUs) were not enough to reach the highest possible BLEU.

> In addition, do you know whether T2T runs over the training data in a deterministic way or in a randomized way? (That is, should two training invocations yield exactly the same model?)

It should be randomized but deterministic (thanks to the fixed random seed), but I am waiting for the ultimate answer from the T2T authors, see https://github.com/tensorflow/tensor2tensor/issues/556#issuecomment-364303805 and the posts below.

martinpopel on 13 Feb 2018
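
A rough sketch of how the nonpadding fraction could tighten the estimate, assuming the batch size counts padded positions so that only `nonpadding_fraction` of each batch is real subwords. The function name is hypothetical, and which of the two TensorBoard values to plug in (inputs, targets, or some combination) is an assumption you would have to settle for your own data; this is not something T2T computes for you.

```python
def estimate_epochs_padding_adjusted(steps: int, effective_batch_size: int,
                                     training_subwords: int,
                                     nonpadding_fraction: float) -> float:
    """Scale the subwords processed per step by the fraction that is not padding."""
    if not 0.0 < nonpadding_fraction <= 1.0:
        raise ValueError("nonpadding_fraction must be in (0, 1]")
    if training_subwords <= 0:
        raise ValueError("training_subwords must be positive")
    real_subwords_per_step = effective_batch_size * nonpadding_fraction
    return steps * real_subwords_per_step / training_subwords


# Hypothetical values; read nonpadding_fraction off the TensorBoard
# input_stats scalars for your own run.
unadjusted = estimate_epochs_padding_adjusted(100_000, 16_384, 500_000_000, 1.0)
adjusted = estimate_epochs_padding_adjusted(100_000, 16_384, 500_000_000, 0.9)
print(unadjusted, adjusted)
```

With a fraction of 1.0 this reduces to the unadjusted formula, which is why the unadjusted value acts as an upper bound.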

@martinpopel, why not simply go with the % of sentences (examples) completed, instead of subwords?

If we have completed 100% of the examples in the training data, we have reached 1.0 epochs, and so on.
I don't think we need to go down to subword resolution.

ndvbd on 21 Feb 2018

@nadavb T2T computes batch_size in subwords (for translation problems with variable length). One batch may contain a small number of long sentences or a high number of short sentences.
T2T does not report the number of sentences processed; it reports just the number of steps (batches).
Thus, we need to know the total number of subwords in the training data, in order to estimate the number of epochs.
Of course, if you know the total number of sentences in the training data, you could estimate that x % of sentences are processed when x % of subwords are processed.

martinpopel on 21 Feb 2018

I am probably misunderstanding something. Why do we care about subwords when we talk about epochs?
In the training data, we have input and output sentences.
During training, these sentences are converted to subwords and then sent to the different GPUs. The code that takes these sentences knows how many sentences it took (and converted to subwords) in each step. We can simply have a counter counting the number of sentences passed. It must exist somewhere anyway (in some data reader), in order not to process a sentence twice. That's it. I don't understand why it is so difficult to keep track of how many sentences we read from the training files.

I understand that "One batch may contain a small number of long sentences or a high number of short sentences", but we don't care how many sentences are in a batch. We only want to know how many sentences we took from the training data set before we converted them and sent them to the GPU, hold a counter, and that's it.

ndvbd on 27 Feb 2018

👍1

> We can simply have a counter counting the number of sentences passed.

Yes, you can implement such a counter and send a PR. That would be great (and more precise than my subword-based estimates, which are biased because they do not take zero-padding into account).

martinpopel on 27 Feb 2018
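
The sentence counter proposed above could look roughly like the following. This is a framework-independent sketch, not T2T code: the class `EpochCounter` and its methods are hypothetical, and where such a counter would hook into the T2T input pipeline is left open, as in the discussion.

```python
from typing import Sequence


class EpochCounter:
    """Tracks how many sentences have been consumed, regardless of batch shape."""

    def __init__(self, total_sentences: int) -> None:
        if total_sentences <= 0:
            raise ValueError("total_sentences must be positive")
        self.total_sentences = total_sentences
        self.sentences_seen = 0

    def update(self, batch: Sequence) -> None:
        # A batch may hold few long or many short sentences; only the count matters.
        self.sentences_seen += len(batch)

    @property
    def epochs(self) -> float:
        return self.sentences_seen / self.total_sentences


# Toy run over a 10-sentence data set with variable-size batches.
counter = EpochCounter(total_sentences=10)
batches = [["s1", "s2", "s3"], ["s4"], ["s5", "s6", "s7", "s8", "s9", "s10", "s1"]]
for batch in batches:
    counter.update(batch)
print(counter.sentences_seen, counter.epochs)
```

Counting sentences this way sidesteps the padding bias of the subword-based estimate, at the cost of needing the total number of training sentences up front.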

Where should I modify the code that sets the training steps?

prigioni on 21 Apr 2018

I don't know the exact location for adding the epoch counter (but I have not spent much time searching for it), otherwise I would do it myself. Maybe it is possible to solve it with a hook in utils/trainer_lib.py. Note that tf.contrib.learn.Experiment is deprecated and should be replaced soon, but it seems that tf.estimator does not support continuous_train_and_eval. As this schedule is not intended for distributed train&eval anyway, I would suggest getting rid of tf.contrib.learn.Experiment and reimplementing it in pure TensorFlow, where it is much easier to count the number of epochs.

martinpopel on 21 Apr 2018

@martinpopel By "number of training subwords", do you mean the sum of all source-text subwords plus all target-text subwords used for training?

DonPex on 6 Sep 2018

@DonPex: No. It is the maximum of source and target subwords, for each sentence. See the discussion above.

martinpopel on 6 Sep 2018

👍1

@martinpopel Thank you. I used your script to compute the maximum number of subwords, but you said that it is only an estimate because of padding tokens.
So should I check input_stats/inputs_nonpadding_fraction and input_stats/targets_nonpadding_fraction and multiply them by the number of subwords to obtain the real number of subwords without padding?

I am using Google Colab, so I would like to know whether it is possible to train a Transformer for at least one epoch in 12 hours (the maximum time allowed on Colab) with a custom dataset and a specific batch size.

DonPex on 7 Sep 2018

Yes, considering nonpadding_fraction should result in a more precise estimate.
I am not sure why "at least one epoch" is important in your use case. Usually you need more epochs anyway for good results (unless the task is simple and data large, in which case you may overfit well before reaching one epoch).
If you can store checkpoints and continue training in another Colab session, then you can try it anyway (T2T starts from a random part of the training data and shuffles the training files by default, I think).

martinpopel on 7 Sep 2018
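
For the 12-hour question, a back-of-the-envelope sketch: invert the epoch formula to get steps per epoch, then divide by a measured training speed. All names and numbers below are hypothetical; the steps-per-second figure has to come from your own training log, and the padding adjustment rests on the same assumption as above (the batch size counts padded positions).

```python
import math


def steps_per_epoch(training_subwords: int, effective_batch_size: int,
                    nonpadding_fraction: float = 1.0) -> int:
    """Number of steps needed to see training_subwords real subwords once."""
    real_subwords_per_step = effective_batch_size * nonpadding_fraction
    return math.ceil(training_subwords / real_subwords_per_step)


def hours_per_epoch(training_subwords: int, effective_batch_size: int,
                    steps_per_second: float,
                    nonpadding_fraction: float = 1.0) -> float:
    """Wall-clock hours for one epoch at a constant, measured training speed."""
    steps = steps_per_epoch(training_subwords, effective_batch_size,
                            nonpadding_fraction)
    return steps / steps_per_second / 3600.0


# Hypothetical data set and speed: 20M subwords, batch_size 4096 on one GPU,
# 2 steps per second, 90% of each batch non-padding.
hours = hours_per_epoch(20_000_000, 4096, steps_per_second=2.0,
                        nonpadding_fraction=0.9)
print(f"{hours:.2f} h per epoch, fits in a 12 h session: {hours <= 12}")
```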

My goal is just to feed the model as many of the subwords in the dataset as possible. If I cannot complete one epoch in less than 12 hours, I have to use another Colab session and continue training from another random part of the data, so I may skip some fractions of the dataset due to randomness.

DonPex on 7 Sep 2018

> epochs = steps * batch_size * worker_gpu / training_subwords
>
> Yes, exactly. In other words epochs = steps * effective_batch_size / training_subwords.
>
> I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I have not had enough time to tidy it up, document it and send it as a PR.

@martinpopel May I ask something regarding the above formula? When training on a single TPU (v2), is the effective_batch_size equal to batch_size, or to batch_size*8?
In other words, a single TPU has 8 cores. If batch_size is 2048, does it mean that each core handles 2048 (so effective_batch_size is 2048*8), or is this 2048 split between the cores?
Thank you!

coder1248 on 7 Jan 2020

@coder1248: I would guess batch_size*8, but I am not sure as I have never used TPUs for real training. I know that T2T treats TPUs differently from CPUs and GPUs in several respects (e.g. preferring/requiring a fixed number of sentences per batch and "packed" problems, which perhaps also influences the estimate of the number of epochs).

martinpopel on 7 Jan 2020

Thanks again for your help martinpopel!
@rsepassi @lukaszkaiser could you kindly verify martinpopel's answer? Thanks!

coder1248 on 7 Jan 2020
