We tried running language modeling with languagemodel_ptb10k and the transformer_small as recommended in the README. No errors / tensorboard training curves looked fine, but the decoder output is something like: "the the the the the the the" (and identical every time).
We looked through the code and found --hparams='sampling_method=random', but it still seems to be argmaxing instead of sampling (or maybe something else is wrong?). We have also tried with languagemodel_ptb_characters and with transformer_base and attention_lm with similar results (no sampling, same degenerate output every time).
Is there some flag that we are missing? Code below.
Thanks for the help in advance!
...
OS: Ubuntu 14.04
$ pip freeze | grep tensor
tensor2tensor==1.6.5
tensorboard==1.8.0
tensorflow==1.8.0
$ python -V
# Python 3.6.5 :: Anaconda, Inc.
...
PROBLEM=languagemodel_ptb10k
MODEL=transformer
HPARAMS=transformer_small
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
t2t-datagen \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR \
--problem=$PROBLEM
t2t-trainer \
--data_dir=$DATA_DIR \
--problem=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR
BEAM_SIZE=4
ALPHA=0.6
t2t-decoder \
--data_dir=$DATA_DIR \
--problem=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--hparams='sampling_method=random' \
--output_dir=$TRAIN_DIR \
--decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
--decode_from_file=input.txt \
--decode_to_file=output.txt
input.txt is a blank file with a dozen empty lines
If you trace the hparams through the various layers of modification, you'll see transformer_small -> transformer_base -> transformer_base_v2 -> transformer_base_v1 -> common_hparams.basic_params1. In basic_params1, sampling_method is set to argmax: https://github.com/tensorflow/tensor2tensor/blob/6969fab42200a7da11bc40c9537b76b0a204b46a/tensor2tensor/layers/common_hparams.py#L90 and it is never changed as the hparams set is modified into transformer_small. The same is true for transformer_base and for the preset hparams in attention_lm.py.
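To make the inheritance point concrete, here is a minimal sketch of how a default set at the bottom of the chain survives every layer unless something overrides it. The function names mirror the chain above, but the bodies are simplified stand-ins, the numeric values are purely illustrative, and apply_overrides is a hypothetical helper rather than the library's actual flag parser.

```python
# Simplified stand-ins for the hparams chain. The real functions set many
# more fields; the values below are illustrative only.
def basic_params1():
    # Base defaults, including the decoding default discussed above.
    return {"sampling_method": "argmax", "hidden_size": 64}


def transformer_base_v1():
    hparams = basic_params1()
    hparams["hidden_size"] = 512  # overrides an unrelated field
    return hparams


def transformer_base_v2():
    hparams = transformer_base_v1()
    hparams["dropout"] = 0.1  # again, sampling_method is untouched
    return hparams


def transformer_base():
    return transformer_base_v2()


def transformer_small():
    hparams = transformer_base()
    hparams["hidden_size"] = 256
    return hparams


def apply_overrides(hparams, overrides):
    """Hypothetical helper: apply a 'k1=v1,k2=v2' string on top of a set."""
    updated = dict(hparams)
    for item in overrides.split(","):
        key, value = item.split("=", 1)
        updated[key.strip()] = value.strip()
    return updated


small = transformer_small()
print(small["sampling_method"])  # the inherited default

overridden = apply_overrides(small, "sampling_method=random")
print(overridden["sampling_method"])  # the explicitly overridden value
```

The sketch only shows that the default is inherited; whether the decoder actually consults the overridden value is a separate question.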
Stanley, thanks for your response!
We saw that hyperparameter and tried to change it on t2t-decoder (we also tried it on t2t-trainer, but that didn't work, and we thought maybe it's not necessary since you don't sample at train time anyway).
I also tried the nuclear option: installing tensor2tensor from source and manually changing the default to sampling_method="random", # "argmax" or "random" in case the hyperparameter wasn't being passed through, but the results are all the same.
Have you tried logging/printing some things around here: https://github.com/tensorflow/tensor2tensor/blob/a4fa55a3f128753d006d26ba8691eb97d14fbcfc/tensor2tensor/utils/t2t_model.py#L1087
to see what the distribution you're sampling from looks like? Does the code get to this function?
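As a rough idea of what to print, the sketch below summarizes a single step's distribution: the top few tokens and the entropy. It is plain Python with made-up logits and a toy vocabulary, not code from t2t_model.py; summarize_distribution is a hypothetical helper. A sharply peaked distribution would produce the same token almost every time even when sampling works, while a flat one would not.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_distribution(logits, vocab, top_k=3):
    """Return the top_k (token, prob) pairs and the entropy in nats."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return ranked[:top_k], entropy


# Toy vocabulary and logits, purely for illustration.
vocab = ["the", "<pad>", "<EOS>", "cat"]
examples = {
    "peaked": [8.0, 0.0, 0.0, 0.0],
    "flat": [1.0, 1.0, 1.0, 1.0],
}

for name, logits in examples.items():
    top, entropy = summarize_distribution(logits, vocab)
    print(name, top, round(entropy, 3))
```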
I have found two minor issues when using a trained language model to decode a sentence.
First, the demo problem languagemodel_ptb10k generates a vocabulary file in which the word "the" has id 0, so <pad> has id 1 and <EOS> has id 2. As a result, this line passes the wrong eos_id to beam search decoding, which leads to a wrong terminal state: https://github.com/tensorflow/tensor2tensor/blob/1de75bda4bd4c98ca50bcdbcf5e94b388bf9a044/tensor2tensor/models/transformer.py#L812
Second, a language model problem has only targets, so when the model decodes those target words, they are stripped; see this line:
https://github.com/tensorflow/tensor2tensor/blob/57444300243f068bad88eb5ed51a9793c4bde172/tensor2tensor/models/transformer.py#L442 . However, preprocessing automatically appends <EOS> to the targets, so the model always decodes <pad> after <EOS>. Thus nothing is output.
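The first issue can be illustrated with a toy decoder. This is a sketch under assumptions: the vocabulary layout follows the ids reported above, the "model" is a scripted sequence rather than a real network, greedy_decode is a hypothetical helper, and the fixed EOS id of 1 stands in for whatever value the decoder assumes.

```python
def greedy_decode(step_fn, eos_id, max_len=10):
    """Collect ids from step_fn until eos_id is produced or max_len is hit."""
    output = []
    for position in range(max_len):
        token_id = step_fn(position)
        if token_id == eos_id:
            break
        output.append(token_id)
    return output


# Vocabulary layout as reported: "the" = 0, <pad> = 1, <EOS> = 2.
vocab = ["the", "<pad>", "<EOS>", "cat", "sat"]
token_to_id = {token: index for index, token in enumerate(vocab)}

# Scripted stand-in for a model: "the cat sat <EOS>" followed by padding.
scripted_ids = [0, 3, 4, 2, 1, 1, 1, 1, 1, 1]


def scripted_step(position):
    return scripted_ids[position]


assumed_eos_id = 1                        # assumption: decoder uses a fixed id
actual_eos_id = token_to_id["<EOS>"]      # id taken from the vocabulary itself

wrong = greedy_decode(scripted_step, assumed_eos_id)
right = greedy_decode(scripted_step, actual_eos_id)

print([vocab[i] for i in wrong])  # stops one step late, <EOS> leaks through
print([vocab[i] for i in right])  # stops at the real <EOS>
```

With the mismatched id the decoder treats <pad> as the terminal token, so the stopping point no longer lines up with the real end of the sequence.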
Quite strange -- could anyone figure out why "the" ends up with id = 0? We can look into it but would appreciate any help to fix it!
Thanks to everyone for the debugging.
@lukaszkaiser @rsepassi
I noticed that if I use a beam_size of 1 then it goes into the "greedy" decoding, however it will look at the sampling_temp hyperparameter and if I specify a value of 1.0, it seems to correctly sample random tokens (which is great). Am I correct that one needs to specify a beam_size of 1 and a non-zero sampling_temp to generate random text? If so, perhaps there should be a warning if the sampling_method is "random" but the beam_size is not 1 or if the sampling_temp is 0?
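To spell out the behaviour I am describing, here is a small sketch of temperature sampling together with the kind of sanity check I am proposing. It is not the tensor2tensor implementation: sample_with_temperature and check_decode_settings are hypothetical helpers, and the rule that a temperature of 0 falls back to argmax is my reading of the observed behaviour.

```python
import math
import random
import warnings


def sample_with_temperature(logits, temperature, rng):
    """Return an index: argmax if temperature is 0, otherwise a random draw."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def check_decode_settings(sampling_method, beam_size, sampling_temp):
    """Return messages for settings that would silently disable sampling."""
    problems = []
    if sampling_method == "random":
        if beam_size != 1:
            problems.append("sampling_method is 'random' but beam_size != 1")
        if sampling_temp == 0.0:
            problems.append("sampling_method is 'random' but sampling_temp is 0")
    return problems


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 0.1]

greedy = [sample_with_temperature(logits, 0.0, rng) for _ in range(5)]
sampled = [sample_with_temperature(logits, 1.0, rng) for _ in range(5)]
print(greedy)   # always the index of the largest logit
print(sampled)  # varies with the random state

# The settings from the original command: beam_size=4 with random sampling.
for message in check_decode_settings("random", 4, 1.0):
    warnings.warn(message)
```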