Tensor2tensor: t2t transformer model

Created on 28 Nov 2017 · 5 Comments · Source: tensorflow/tensor2tensor

@schani I want to train my Transformer model on a Chinese-Japanese corpus. The corpus has about 10 million entries.
1. To generate the training and dev data, I added my dataset in word2def.py as follows:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import tarfile

from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.data_generators.translate import character_generator
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

import tensorflow as tf

EOS = text_encoder.EOS_ID

# English Word2def datasets

LOCATION_OF_DATA='/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14/'

_WORD2DEF_TRAIN_DATASETS = [["/training-parallel-jp-ch.tgz",("training/translate_train_jpch.jp","training/translate_train_jpch.ch")]]

_WORD2DEF_TEST_DATASETS = [["/dev-parallel-jp-ch.tgz",("dev/translate_dev_jpch.jp","dev/translate_dev_jpch.ch")]]

@registry.register_problem()
class word2def(translate.TranslateProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def targeted_vocab_size(self):
    return 2**16 + 20000  # 85536

  @property
  def vocab_name(self):
    return "vocab.jpch"

  def generator(self, data_dir, tmp_dir, train):
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _WORD2DEF_TRAIN_DATASETS)
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(
        tmp_dir, datasets, "wmt_jpch_tok_%s" % tag)
    return translate.token_generator(
        data_path + ".lang1", data_path + ".lang2", symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

@registry.register_hparams
def word2def_hparams():  # Module-level function, so it takes no `self`.
  hparams = transformer.transformer_base_single_gpu()  # Or whatever you'd like to build off.
  hparams.batch_size = 1024
  return hparams
""" Problem definition for word to dictionary definition.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import tarfile

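from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.data_generators.translate import character_generator
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry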

EOS = text_encoder.EOS_ID

# English Word2def datasets

LOCATION_OF_DATA='/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14/'

_WORD2DEF_TRAIN_DATASETS = [["/training-parallel-jp-ch.tgz",("training/translate_train_jpch.jp","training/translate_train_jpch.ch")]]

_WORD2DEF_TEST_DATASETS = [["/dev-parallel-jp-ch.tgz",("dev/translate_dev_jpch.jp","dev/translate_dev_jpch.ch")]]

@registry.register_problem()
class word2def(translate.TranslateProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def targeted_vocab_size(self):
    return 2**16

  @property
  def vocab_name(self):
    return "vocab.jpch"

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

@registry.register_hparams
def word2def_hparams():  # Module-level function, so it takes no `self`.
  hparams = transformer.transformer_base_single_gpu()  # Or whatever you'd like to build off.
  hparams.batch_size = 1024
  return hparams

__init__.py is as follows:

# encoding:utf-8

from . import word2def

To generate the data:

PROBLEM=word2def
MODEL=transformer
HPARAMS=transformer_base_single_gpu

DATA_DIR=t2t_data
TMP_DIR=t2t_datagen
TRAIN_DIR=t2t_train/$PROBLEM/$MODEL-$HPARAMS
t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --t2t_usr_dir=/dnn4/dnn4_added/zhangxiaolei/env150zxlpy36-980

Now I have a problem: the vocab size is 2**16 in the code, but the generated vocab file ends up with only about 47000 entries. Why is that?
How can I increase the size of the vocab?

question

zll0000

Most helpful comment

First, I see you use SpaceID.EN_CHR for both input and output, but you don't actually want character-based translation; you want subwords. I think the SpaceID does not matter for two-language (non-multitask) translation, but I am not sure.
My T2T know-how: start with a 32k vocabulary; make sure the final min_count is not too low when building the subword vocab (otherwise increase file_byte_budget); set the batch_size as high as possible (without hitting OOM, but keep some reserve); use transformer_big_single_gpu; and store checkpoints every hour (instead of every 10 minutes), which improves both the training speed and the final averaging. Check the training loss and the test metric (approx_bleu or real BLEU) in TensorBoard. If the training diverges, increase learning_rate_warmup_steps and start again from scratch. Increase training_steps, e.g. to 1M; you can always kill the training when you see the BLEU curve is flat or even decreasing.

Finally, I am not sure this is the best place to discuss such general know-how. GitHub issues should be for reporting bugs, feature requests, or very specific questions.

martinpopel on 29 Nov 2017

👍3

All 5 comments

You need to increase the file_byte_budget.
Evidently, 1e6 bytes of your training data is not enough to get more than 47k subwords.
That said, I am not sure such a big vocabulary pays off: training and decoding are slower and need more memory (so you must use a smaller batch_size, which also seems to affect the quality; see #444).

martinpopel on 28 Nov 2017

👍2
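
Illustration for the comment above (not written by the thread participants): a rough back-of-the-envelope sketch of how little of a large corpus a fixed budget covers. It assumes the "10 million" in the question means lines per file and that a line averages 30 characters; both numbers, and the helper name `sampled_share`, are hypothetical and only serve the arithmetic.

```python
def sampled_share(num_lines, avg_line_chars, byte_budget):
    """Estimate how many lines a per-file budget covers, and their corpus share."""
    lines_sampled = min(num_lines, int(byte_budget // avg_line_chars))
    return lines_sampled, lines_sampled / num_lines


# Assumptions: 10 million lines per file, 30 characters per line on average.
for budget in (1e6, 1e7, 1e8):
    lines, share = sampled_share(10**7, 30, budget)
    print("budget=%.0e: ~%d lines, %.2f%% of the corpus" % (budget, lines, 100 * share))
```

Under these assumptions a 1e6 budget touches well under one percent of the lines, which is consistent with the subword builder running out of material before it reaches the targeted size.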

@martinpopel At first I did not change any parameters; I only added my dataset. However, after training the model, the BLEU score I calculated was very bad: it is lower than that of a GroundHog translation model based on the same dataset. I do not know what to do to improve the performance.
Can you give me some advice?
Thank you.

zll0000 on 29 Nov 2017

Finally, I am not sure this is the best place to discuss such general know-how. GitHub issues should be for reporting bugs, feature requests, or very specific questions.

martinpopel on 29 Nov 2017

👍3

@martinpopel Thanks. If I want to increase the file_byte_budget, what should I do?
t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --t2t_usr_dir=/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14 --hparams='file_byte_budget=10000000'

Is this right?

Do you have an account on Gitter (https://gitter.im/tensor2tensor/Lobby)?

zll0000 on 30 Nov 2017

No. You'll need to modify the file_size_budget yourself when making vocab. See: https://github.com/twairball/t2t_wmt_zhen/blob/master/data_generators/utils.py#L130 for example.
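
Illustration for the reply above (not written by the thread participants): a minimal sketch of what a self-maintained line sampler with an adjustable budget could look like. The function name `sample_vocab_lines`, its `file_byte_budget` argument, and the even-stride sampling scheme are assumptions made for this example; it does not quote T2T's or the linked repository's code, so compare it with the `generator_utils.py` in your installed version before relying on it.

```python
import os
import tempfile


def sample_vocab_lines(filepaths, file_byte_budget=1e6):
    """Yield evenly spaced, stripped lines from each file, up to a per-file budget.

    The budget is counted in characters of yielded text, which only
    approximates bytes for multi-byte scripts such as Chinese or Japanese.
    """
    for path in filepaths:
        size = os.path.getsize(path)
        # Lines to skip between samples so that samples span the whole file.
        stride = int(size / file_byte_budget / 2)
        remaining = file_byte_budget
        skipped = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                if skipped < stride:
                    skipped += 1
                    continue
                if remaining <= 0:
                    break
                line = line.strip()
                remaining -= len(line)
                skipped = 0
                yield line


# Tiny demonstration on a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        for i in range(1000):
            f.write("sentence number %d\n" % i)
    small = list(sample_vocab_lines([path], file_byte_budget=200))
    large = list(sample_vocab_lines([path], file_byte_budget=2000))

print(len(small), len(large))  # the larger budget yields more lines
```

The sampled lines would then be handed to whatever builds the subword vocabulary. How that wiring looks depends on the T2T version, so treat it as something to verify rather than a given.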