In order to compare with other NMT frameworks, I would like to know how many training epochs (i.e. passes over the whole training data) are done at the moment.
I can see the number of training (global) steps and I guess epochs = steps * batch_size / training_subwords.
So the question boils down to: How to make T2T report (e.g. in the log) the number of subwords in the training data?
Yeah, this seems like a reasonable thing to want but unfortunately not simple to do currently. The variable batch size because of bucketing examples by sequence length complicates the picture.
Counting the number of subwords would need a pass through the data on disk, probably best done by a separate script.
@rsepassi hi, "batch_size" is the number of subwords of source and target sentences in a batch?
or only the number of subwords of source sentences in a batch?
thanks a lot.
@yuimo: it is the maximum of source and target subwords, for each sentence. See https://github.com/tensorflow/tensor2tensor/blob/92983eaaa457ec18729b1883ba5ae4a6614bdcb5/tensor2tensor/utils/data_reader.py#L145-L153
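The counting rule described above (maximum of source and target length, per sentence) could be sketched like this. It is a minimal illustration that assumes the sentences are already encoded as lists of subword IDs; the function name and the toy data are hypothetical and not part of T2T.

```python
from typing import Iterable, List, Tuple


def count_training_subwords(pairs: Iterable[Tuple[List[int], List[int]]]) -> int:
    """Sum max(len(source), len(target)) over all sentence pairs."""
    return sum(max(len(src), len(tgt)) for src, tgt in pairs)


# Toy data (hypothetical): three sentence pairs as subword ID lists.
toy_pairs = [
    ([1, 2, 3], [4, 5]),     # max(3, 2) = 3
    ([1], [2, 3, 4, 5]),     # max(1, 4) = 4
    ([1, 2], [3, 4]),        # max(2, 2) = 2
]

print(count_training_subwords(toy_pairs))  # 9
```

On real data the pairs would come from a pass over the encoded training files, which is the separate-script approach suggested earlier in the thread.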
@martinpopel i got it, thanks a lot
@martinpopel , you meant to write:
epochs = steps * batch_size * worker_gpu / training_subwords
Right?
epochs = steps * batch_size * worker_gpu / training_subwords
Yes, exactly. In other words epochs = steps * effective_batch_size / training_subwords.
I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I had not enough time to tidy it, document and send as a PR.
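As a small worked example of the formula above, here is a sketch in Python. The function name and all the numbers are made up for illustration, and the result is only an estimate because it ignores padding.

```python
def estimate_epochs(steps: int, batch_size: int, worker_gpu: int,
                    training_subwords: int) -> float:
    """epochs = steps * effective_batch_size / training_subwords."""
    if training_subwords <= 0:
        raise ValueError("training_subwords must be positive")
    # Effective batch size: per-GPU batch size (in subwords) times GPU count.
    effective_batch_size = batch_size * worker_gpu
    return steps * effective_batch_size / training_subwords


# Hypothetical values: 100k steps, batch_size=2000 subwords, 8 GPUs,
# and 500M training subwords.
epochs = estimate_epochs(steps=100_000, batch_size=2000, worker_gpu=8,
                         training_subwords=500_000_000)
print(f"{epochs:.2f} epochs")  # 3.20 epochs
```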
@martinpopel It would be nice if T2T showed in TensorBoard how many epochs were done during training.
Do you know if there are any rules of thumb in respect to how many epochs should be done during NMT tasks?
In addition, do you know if T2T runs on the training data in a deterministic way, or in a randomized way? (meaning if 2 training invocations should yield the exact same model?)
It would be nice if T2T showed in TensorBoard how many epochs were done during training.
Yes, that would be nice, but there are two problems:
input_stats/targets_nonpadding_fraction and input_stats/inputs_nonpadding_fraction, so there is a way to compute the number of epochs. Ideally t2t-datagen should report to stderr the number of subwords (as my script does) and t2t-trainer should report the number of epochs (or how many steps are in one epoch after the first epoch has ended).
Do you know if there are any rules of thumb in respect to how many epochs should be done during NMT tasks?
The standard and naive answer is "until converged on the dev set", but this is difficult to measure (how to set the early stopping parameters?) and to achieve. My training data has about half a gigaword and even 18 epochs (11 days of training on 8 GPUs) were not enough to reach the highest possible BLEU.
In addition, do you know if T2T runs on the training data in a deterministic way, or in a randomized way? (meaning if 2 training invocations should yield the exact same model?)
It should be randomized and deterministic (thanks to the fixed rand seed), but I am waiting for the ultimate answer from the T2T authors, see https://github.com/tensorflow/tensor2tensor/issues/556#issuecomment-364303805 and the posts below.
@martinpopel, why not simply go with the % of sentences (examples) completed, instead of subwords?
If we completed 100% of the sentences in the training data, we have reached 1.0 epochs, and so on?
I don't think we need to go into the subwords resolution.
@nadavb T2T computes batch_size in subwords (for translation problems with variable length). One batch may contain a small number of long sentences or a high number of short sentences.
T2T does not report the number of sentences processed, it reports just the number of steps (batches).
Thus, we need to know the total number of subwords in the training data, in order to estimate the number of epochs.
Of course, if you know the total number of sentences in the training data, you could estimate that x % of sentences are processed when x % of subwords are processed.
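To make the point about variable sentence counts concrete, here is a simplified illustration. It assumes that every sentence in a bucket is padded to the bucket length and that a batch holds at most batch_size subwords. It is not T2T's actual bucketing code, and the function name is hypothetical.

```python
def sentences_per_batch(batch_size: int, bucket_length: int) -> int:
    """How many sentences of a given padded length fit a subword budget."""
    if bucket_length <= 0:
        raise ValueError("bucket_length must be positive")
    # At least one sentence per batch, even if it exceeds the budget.
    return max(1, batch_size // bucket_length)


# Hypothetical budget of 4096 subwords per batch.
for length in (8, 64, 256):
    print(length, sentences_per_batch(4096, length))
# 8 512
# 64 64
# 256 16
```

Because the number of sentences per step varies this much, the step count alone does not tell how many sentences have been processed.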
I probably don't understand something. Why do we care about subwords when we talk about epochs?
In the training data, we have input and output sentences.
During training, these sentences are converted to subwords and then sent to the different GPUs. The code that takes these sentences knows how many sentences it took (and converted to subwords) in each step. We can simply have a counter counting the number of sentences passed. It must be somewhere anyhow (in some data reader), in order not to process a sentence twice. That's it.
I don't understand why it is so difficult to keep track of how many sentences we read from the training files. I understand that "One batch may contain a small number of long sentences or a high number of short sentences", but we don't care how many sentences are in a batch. We only want to know how many sentences we took from the training data set before we converted them and sent them to the GPU. We hold a counter, and that's it.
We can simply have a counter counting the number of sentences passed.
Yes, you can implement such counter and send a PR. That would be great (and more precise than my subword-based estimates that are biased because of not taking into account zero-padding).
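A sentence counter of the kind discussed here might look roughly like the sketch below. It is plain Python that wraps any iterable of examples. The class name is hypothetical, and where such a counter would hook into T2T's input pipeline is left open, as in the thread.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


class SentenceCounter:
    """Counts examples as they are read and reports epochs completed."""

    def __init__(self, total_sentences: int) -> None:
        if total_sentences <= 0:
            raise ValueError("total_sentences must be positive")
        self.total_sentences = total_sentences
        self.seen = 0

    def wrap(self, examples: Iterable[T]) -> Iterator[T]:
        # Pass examples through unchanged, counting each one that is consumed.
        for example in examples:
            self.seen += 1
            yield example

    @property
    def epochs(self) -> float:
        return self.seen / self.total_sentences


# Toy data (hypothetical): one full pass plus half of a second pass.
data = ["s1", "s2", "s3", "s4"]
counter = SentenceCounter(total_sentences=len(data))
for _ in counter.wrap(data):
    pass
for _ in islice(counter.wrap(data), 2):
    pass
print(counter.epochs)  # 1.5
```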
Where should I modify the code about set training steps?
I don't know the exact location for adding the epoch counter (but I have not spent much time searching for it), otherwise I would do it myself. Maybe it is possible to solve it with a hook in utils/trainer_lib.py. Note that tf.contrib.learn.Experiment is deprecated and should be replaced soon, but it seems that tf.estimator does not support continuous_train_and_eval. As this schedule is not intended for distributed train&eval anyway, I would suggest getting rid of tf.contrib.learn.Experiment and reimplementing it in pure TensorFlow, where it is much easier to count the number of epochs.
@martinpopel With "number of training subwords" you mean the sum of all source texts subwords plus all target texts subwords used for training, is that right?
@DonPex: No. It is the maximum of source and target subwords, for each sentence. See the discussion above.
@martinpopel Thank you. I used your script to compute the maximum number of subwords, but you said that it should be only an estimate because of padding tokens.
So I should check this input_stats/inputs_nonpadding_fraction and input_stats/targets_nonpadding_fraction, multiplying them with the number of subwords to obtain the real number of subwords without padding?
I am using Google Colab, so I would like to know whether it is possible to train a Transformer for at least one epoch in 12 hours (the maximum time allowed on Colab) with a custom dataset and a specific batch size.
Yes, considering nonpadding_fraction should result in a more precise estimate.
I am not sure why "at least one epoch" is important in your use case. Usually you need more epochs anyway for good results (unless the task is simple and data large, in which case you may overfit well before reaching one epoch).
If you can store checkpoints and continue training in another Colab session, then you can try it anyway (T2T starts from a random part of the training data and shuffles the training files by default, I think).
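The Colab question above comes down to simple arithmetic once the training speed is known. The sketch below assumes you have measured the steps per second from your own training run. The function names and all the numbers are hypothetical, and padding is ignored, so the result is only a rough estimate.

```python
import math


def steps_per_epoch(training_subwords: int, effective_batch_size: int) -> int:
    """Number of steps needed to see training_subwords subwords once."""
    return math.ceil(training_subwords / effective_batch_size)


def hours_per_epoch(training_subwords: int, effective_batch_size: int,
                    steps_per_second: float) -> float:
    """Estimated wall-clock hours for one epoch at a measured speed."""
    steps = steps_per_epoch(training_subwords, effective_batch_size)
    return steps / steps_per_second / 3600.0


# Hypothetical values: 30M training subwords, batch_size=2048 on one GPU,
# and a measured speed of 2 steps per second.
steps = steps_per_epoch(30_000_000, 2048)
hours = hours_per_epoch(30_000_000, 2048, steps_per_second=2.0)
print(steps)  # 14649
print(f"fits in a 12h session: {hours <= 12}")
```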
My goal is just to feed the model as many of the dataset's subwords as possible. So if I could not complete one epoch in less than 12 hours, I would have to use another Colab session and start the training from another random part, and in this way I might skip some fractions of the dataset due to randomness.
epochs = steps * batch_size * worker_gpu / training_subwords
Yes, exactly. In other words epochs = steps * effective_batch_size / training_subwords.
I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I had not enough time to tidy it, document and send as a PR.
@martinpopel may I ask something regarding the above formula? When training on a single TPU (v2), is the effective_batch_size equal to the batch_size, or to batch_size*8?
In other words, a single TPU has 8 cores. If batch_size is 2048, does it mean that each core handles 2048 (so effective_batch_size is 2048*8), or is this 2048 split between the cores?
Thank you!
@coder1248: I would guess batch_size*8, but I am not sure as I have never used TPUs for real training. I know that T2T treats TPUs differently from CPUs and GPUs in several aspects (e.g. preferring/requiring a fixed number of sentences per batch and "packed" problems, which perhaps also influences the estimate of the number of epochs).
Thanks again for your help martinpopel!
@rsepassi @lukaszkaiser could you kindly verify martinpopel's answer? Thanks!