@schani Now I want to train my Transformer model on a Chinese-Japanese corpus. The corpus size is about 10 million.
1. To generate the training and dev data, I added my data in word2def.py, as follows:
""" Problem definition for word to dictionary definition.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import tarfile

from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.data_generators.translate import character_generator
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

import tensorflow as tf

EOS = text_encoder.EOS_ID

# Each entry: [archive, (source file, target file)].
_WORD2DEF_TRAIN_DATASETS = [["/training-parallel-jp-ch.tgz", ("training/translate_train_jpch.jp", "training/translate_train_jpch.ch")]]
_WORD2DEF_TEST_DATASETS = [["/dev-parallel-jp-ch.tgz", ("dev/translate_dev_jpch.jp", "dev/translate_dev_jpch.ch")]]


@registry.register_problem()
class word2def(translate.TranslateProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def targeted_vocab_size(self):
    return 2**16

  @property
  def vocab_name(self):
    return "vocab.jpch"

  def generator(self, data_dir, tmp_dir, train):
    # The subword vocabulary is always built from the training data.
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _WORD2DEF_TRAIN_DATASETS)
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(tmp_dir, datasets, "wmt_jpch_tok_%s" % tag)
    return translate.token_generator(
        data_path + ".lang1", data_path + ".lang2", symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR


# Registered hparams functions are module-level and take no arguments.
@registry.register_hparams
def word2def_hparams():
  hparams = transformer.transformer_base_single_gpu()  # Or whatever you'd like to build off.
  hparams.batch_size = 1024
  return hparams
from . import word2def
PROBLEM=word2def
MODEL=transformer
HPARAMS=transformer_base_single_gpu
DATA_DIR=t2t_data
TMP_DIR=t2t_datagen
TRAIN_DIR=t2t_train/$PROBLEM/$MODEL-$HPARAMS
t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --t2t_usr_dir=/dnn4/dnn4_added/zhangxiaolei/env150zxlpy36-980
Now I have a problem: why is the vocab size 2**16 in the code, while the generated vocab file ends up with only 47000 entries?
How can I expand the size of the vocab?
You need to increase the file_byte_budget.
Obviously, 1e6 bytes with your training data is not enough to get more than 47k subwords.
That said, I am not sure such a big vocabulary pays off: training and decoding are slower and need more memory (so you must use a smaller batch_size, which also seems to affect the quality, see #444).
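To picture why a small text sample caps the vocabulary, here is a minimal sketch in plain Python. It is not the T2T subword algorithm: it assumes the vocabulary is simply "every token seen at least min_count times" and searches for the largest min_count that still reaches the target size. The names `vocab_size_for` and `pick_min_count` are made up for this illustration.

```python
from collections import Counter


def vocab_size_for(counts, min_count):
    """Number of distinct tokens seen at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)


def pick_min_count(counts, target_size):
    """Return (min_count, size) for the largest min_count whose vocabulary
    still has at least target_size tokens.

    If even min_count=1 falls short, the sample holds too few distinct
    tokens for the target, and (1, size) is returned.
    """
    if not counts:
        return 1, 0
    lo, hi = 1, max(counts.values())
    if vocab_size_for(counts, lo) < target_size:
        return lo, vocab_size_for(counts, lo)
    # Invariant: vocab_size_for(counts, lo) >= target_size.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if vocab_size_for(counts, mid) >= target_size:
            lo = mid
        else:
            hi = mid - 1
    return lo, vocab_size_for(counts, lo)


# Toy sample: a appears 3 times, b twice, c once.
sample_counts = Counter("a b c a b a".split())
small_target = pick_min_count(sample_counts, 2)
large_target = pick_min_count(sample_counts, 5)
```

With the toy sample, a target of 2 is reachable, while a target of 5 bottoms out at min_count=1 with only 3 tokens. In this simplified picture that is the situation described above: the sample is exhausted before the requested size is reached, so only a larger sample helps.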
@martinpopel At first I did not change any parameters; I only added my dataset. However, after training the model, the BLEU score I calculated is very bad: it is lower than that of a GroundHog translation model trained on the same dataset. I do not know what to do to improve the performance.
Can you give some advice?
Thank you.
First, I see you use SpaceID.EN_CHR for both input and output, but actually you don't want character-based translation but rather subwords. I think for two-languages translation (non-multitask) the SpaceID does not matter, but I am not sure.
My T2T know-how: start with 32k vocabulary, make sure that the final min_count is not too low when building the subword vocab (otherwise increase file_byte_budget), set the batch_size as high as possible (without hitting OOM, but keep some reserve), use transformer_big_single_gpu, store checkpoints each hour (instead of each 10 minutes), which improves both the training speed and the final averaging. Check the training loss and test-metric (approx_bleu or real BLEU) in TensorBoard. If the training diverges, increase learning_rate_warmup_steps and start again from scratch. Increase training_steps e.g. to 1M - you can always kill the training when you see the BLEU curve is flat or even decreasing.
Finally, I am not sure this is the best place to discuss such general know-how. GitHub issues should be for reporting bugs, feature requests or very specific questions.
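The "kill the training when the BLEU curve is flat or even decreasing" rule can be written down as a small check. This is only a sketch in plain Python, not part of T2T: the function name `bleu_has_plateaued` and the defaults for `patience` and `min_delta` are assumptions chosen for illustration, and the scores would have to be read off TensorBoard or an evaluation log by hand.

```python
def bleu_has_plateaued(scores, patience=3, min_delta=0.1):
    """Return True if none of the last `patience` evaluations improved on
    the best earlier score by at least `min_delta` BLEU points.

    `scores` is a list of BLEU values, oldest first.
    """
    if len(scores) <= patience:
        # Not enough history to compare against.
        return False
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_delta


flat_curve = [10.0, 15.0, 18.0, 18.05, 18.0, 17.9]
rising_curve = [10.0, 15.0, 18.0, 19.0, 20.0, 21.0]
```

Here `flat_curve` counts as plateaued and `rising_curve` does not. A single noisy evaluation should not trigger the check, which is why it looks at several recent evaluations rather than only the last one.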
@martinpopel Thanks. If I want to increase the file_byte_budget, what should I do?
t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --t2t_usr_dir=/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14 --hparams='file_byte_budget=10000000'
Is this right?
Do you have an account on https://gitter.im/tensor2tensor/Lobby?
No. You'll need to modify the file_byte_budget yourself when making the vocab. See: https://github.com/twairball/t2t_wmt_zhen/blob/master/data_generators/utils.py#L130 for example.
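As a rough idea of what "modify the budget yourself" means, here is a sketch of a line generator with a configurable per-file byte budget. It is plain Python and not the T2T or linked-repo code: the name `lines_within_budget` and its parameter are assumptions for illustration, and how such a generator is wired into the vocabulary builder depends on the T2T version, so that part is not shown.

```python
import io


def lines_within_budget(paths, file_byte_budget=1e6):
    """Yield stripped lines from each file until roughly
    file_byte_budget bytes have been read from that file."""
    for path in paths:
        remaining = file_byte_budget
        with io.open(path, encoding="utf-8") as f:
            for line in f:
                if remaining <= 0:
                    break
                # Count UTF-8 bytes, not characters: CJK text uses
                # several bytes per character.
                remaining -= len(line.encode("utf-8"))
                yield line.strip()
```

Raising the budget, for example `lines_within_budget(paths, file_byte_budget=1e7)`, lets the vocabulary builder see ten times more text per file, at the cost of a slower vocabulary build.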