Assuming the source and target texts, both training and dev, are tokenized and cleaned, how do I register a problem that applies byte pair encoding and vocab building to the texts, and then trains on those texts with the same hyperparameters as in wmt_ende_8k? Thanks!
First of all, it's best if you have the text in non-tokenized form, otherwise you'll be throwing off the subword tokenizer. But that's a small thing.
If you have this, then just register a problem like the one in WMT here:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wmt.py#L366
You can see that the only function you need to implement is train_generator, and it's very simple:
symbolizer_vocab = generator_utils.get_or_generate_vocab(
    data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size)
datasets = _ENDE_TRAIN_DATASETS if train else _ENDE_TEST_DATASETS
tag = "train" if train else "dev"
data_path = _compile_data(tmp_dir, datasets, "wmt_ende_tok_%s" % tag)
return token_generator(data_path + ".lang1", data_path + ".lang2",
                       symbolizer_vocab, EOS)
In your case, you probably don't need to call _compile_data -- it just puts multiple files together in "data.lang1" and "data.lang2". If you already have these files, then just pass them to token_generator.
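To make the shape of that generator concrete, here is a minimal, self-contained sketch of what a token generator over two parallel files might look like. It is a simplified stand-in, not the library's code: ToyEncoder, parallel_token_generator and the EOS id of 1 are assumptions made for illustration, and a real problem would pass the vocabulary object returned by get_or_generate_vocab instead.

```python
import os
import tempfile

EOS_ID = 1  # assumed end-of-sentence id for this sketch


class ToyEncoder:
    """Hypothetical stand-in for a vocabulary object with an encode() method."""

    def __init__(self):
        self._ids = {}

    def encode(self, sentence):
        # Ids start at 2 so that 0 and 1 stay free for padding and EOS.
        return [self._ids.setdefault(tok, len(self._ids) + 2)
                for tok in sentence.split()]


def parallel_token_generator(source_path, target_path, vocab, eos=None):
    """Yield one {"inputs", "targets"} dict per pair of aligned lines."""
    eos_list = [] if eos is None else [eos]
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        for source, target in zip(src, tgt):
            yield {
                "inputs": vocab.encode(source.strip()) + eos_list,
                "targets": vocab.encode(target.strip()) + eos_list,
            }


def demo():
    """Write a one-sentence parallel corpus and run the generator on it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "data.lang1")
        tgt = os.path.join(tmp, "data.lang2")
        with open(src, "w", encoding="utf-8") as f:
            f.write("hello world\n")
        with open(tgt, "w", encoding="utf-8") as f:
            f.write("hallo welt\n")
        return list(parallel_token_generator(src, tgt, ToyEncoder(), EOS_ID))
```

The point of the sketch is only that each example is a dict of two integer lists, one per language, each terminated by the EOS id.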
Hope that helps, we're working on documenting this better, so I'll leave the issue open for now to track progress on that. Feel free to ask questions of course!
Thanks! It seems like method token_generator() still requires a vocab as input, which is supposedly generated by get_or_generate_vocab_inner(), which carries out the generation of the subword unit vocab, right? In this case, text_encoder.SubwordTextEncoder.build_to_target_size() is called inside get_or_generate_vocab_inner() and requires an "upper bound for the minimum token count" of 1e3. So my question boils down to: why does it require such a number? Does this number apply to translation tasks with other languages (untokenized, raw texts, according to your suggestion)?
It seems like method token_generator() still requires a vocab as input, which is supposedly generated by get_or_generate_vocab_inner(), which carries out the generation of the subword unit vocab, right?
Yes.
requires an "upper bound for the minimum token count" of 1e3.
This is an implementation detail. You decide that the BPE (or wordpiece) vocabulary should have about 8k (or 32k) items. So you need to find a min_count such that the resulting vocabulary size is as close as possible to 8k. The min_count is found by bisection over the interval 1...1000, where 1000 is a high enough upper estimate of the min_count, even for huge datasets and a small vocabulary size (like 8k).
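The bisection idea can be illustrated with a toy example. The sketch below is not the algorithm used by SubwordTextEncoder: build_subwords is a deliberately naive, hypothetical vocabulary builder (all substrings seen at least min_count times, plus single characters), and bisect_min_count is an illustrative name. It only shows why a search over 1...1000 converges: the vocabulary can only shrink as min_count grows.

```python
from collections import Counter


def build_subwords(token_counts, min_count):
    """Toy builder: single characters plus every substring whose total
    count across the corpus is at least min_count."""
    substring_counts = Counter()
    for token, count in token_counts.items():
        for start in range(len(token)):
            for end in range(start + 1, len(token) + 1):
                substring_counts[token[start:end]] += count
    vocab = {s for s, c in substring_counts.items() if c >= min_count}
    vocab.update(ch for token in token_counts for ch in token)
    return vocab


def bisect_min_count(token_counts, target_size, min_val=1, max_val=1000):
    """Search min_count in [min_val, max_val] for the vocabulary whose
    size is closest to target_size. Returns (min_count, vocab)."""
    best = None
    while min_val <= max_val:
        mid = (min_val + max_val) // 2
        vocab = build_subwords(token_counts, mid)
        gap = abs(len(vocab) - target_size)
        if best is None or gap < abs(len(best[1]) - target_size):
            best = (mid, vocab)
        if len(vocab) > target_size:
            min_val = mid + 1  # too many items: require a higher count
        elif len(vocab) < target_size:
            max_val = mid - 1  # too few items: allow rarer substrings
        else:
            break
    return best
```

For example, bisect_min_count({"low": 5, "lower": 2, "newest": 6, "widest": 3}, target_size=15) returns the min_count whose toy vocabulary is nearest to 15 items. The upper bound of 1000 plays the same role as in the explanation above: it only needs to be large enough that the vocabulary at that count is already smaller than the target.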
I have another suggestion for the Tensor2tensor authors to consider: for newcomers it seems misleading that the problem is called "tokens" (where tokens in NLP mostly means words/punctuation), when it actually means wordpieces, aka subwords. At the very least, this should be documented everywhere the term is used.
Dear all,
So I do not need to meddle with problem_hparams.py? I am actually looking for the hparams for the character model, but I cannot find them in problem_hparams.py; it would be great to know where to look. Many thanks!
Colman Tse
The best way is to subclass Text2TextProblem, then you don't need to change problem_hparams.py at all, you only need to provide your data generator. I think it's easiest to copy one of the WMT translation problems: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wmt.py
We're working on adding docs with an explanation, hopefully soon, for now I think wmt.py is the best example, and don't hesitate to ask!
Great, thanks! I was looking at problem_hparams because I was wondering: if I add a new model that subclasses the one in transformer.py, I don't see a way to add new params.
Thank you for pointing me to Text2TextProblem though.
@lukaszkaiser @martinpopel wmt problems use either a character-based vocab or a bpe-based vocab. For a text2text problem, is there an example to encode texts to a "regular" token-based vocab?
What I have tried is using TokenTextEncoder as in the penn tree bank example (take the top n most frequent words in the training set as the vocab). However, I don't know how to integrate the vocab created that way with the token_generator function from wmt, which takes a source file, a target file, a vocab object and an EOS symbol (index). My question basically is how to encode texts to a "regular" token-based vocab? Also, how to name the vocab when there are two different vocabularies for source and target?
@anglil:
For translation problems, I would recommend using the T2T internal wordpieces, that is the text_encoder.SubwordTextEncoder, which is used in TranslateEndeWmt8k and other problems in wmt.py. Then you can use raw (untokenized, unBPEd) sentences as inputs to training and decoding.
If you want to use an external vocabulary (either with words or subwords), it is possible, but it requires text_encoder.TokenTextEncoder. For example, see my own my_registration.py, where I defined a problem with the external vocabulary t2t_data/vocab.encs.bpe.33420 and training data t2t_tmp/czeng16/train.bpe.en and t2t_tmp/czeng16/train.bpe.cs (these files are not available online; I had to store them in these locations manually):
import os
from tensor2tensor.utils import registry
from tensor2tensor.data_generators.wmt import TranslateProblem, token_generator, EOS
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder


@registry.register_problem
class TranslateEncsCzeng30mBpe33k(TranslateProblem):
  """Problem spec for WMT English-Czech translation."""

  @property
  def targeted_vocab_size(self):
    return 33420

  @property
  def vocab_name(self):
    return "vocab.encs.bpe"

  def generator(self, data_dir, tmp_dir, train):
    input_path = os.path.join(tmp_dir, "czeng16", "train.bpe." if train else "dev.bpe.")
    vocab = text_encoder.TokenTextEncoder(vocab_filename=os.path.join(data_dir, self.vocab_file))
    return token_generator(input_path + "en", input_path + "cs", vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_BPE_TOK

  @property
  def target_space_id(self):
    return problem.SpaceID.CS_TOK  # TODO CS_BPE_TOK

  @property
  def use_subword_tokenizer(self):
    return False
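If the external vocabulary is word-based rather than BPE-based, the "top n most frequent words" approach mentioned in the question could be prepared along these lines. This is a sketch under assumptions: build_top_n_vocab and write_vocab are hypothetical helpers, and the one-token-per-line file layout (as well as how reserved ids such as padding and EOS are handled) should be checked against the text_encoder.py version you use.

```python
from collections import Counter


def build_top_n_vocab(lines, n):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic across runs.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:n]]


def write_vocab(tokens, path):
    """Write one token per line (assumed vocabulary file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for tok in tokens:
            f.write(tok + "\n")
```

The resulting file would then take the place of vocab.encs.bpe.33420 in the registration above. Words outside the top n need separate handling, which this sketch does not cover.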
Thanks! @martinpopel Looks like exactly what I was looking for; and that pointer to another thread is useful. Will try them out.
Hi @martinpopel
I followed your script to register a problem spec, using BPE-encoded versions of train and dev and an already built vocabulary. There is one issue that is taking some time to figure out. After the data generation I get:
ValueError: No data files found in ./t2t_data/translate_enit_bpe8k-train*
The main cause seems to be that the data set generated in "t2t_data/" looks like this:
Looking at the generator method, is there a way to have data shuffling, basically before or after the token_generator call below?
return token_generator(input_path + "en", input_path + "it", vocab, EOS)
Thanks!
The shuffling should be done immediately after generating the unshuffled files (and once the shuffling is done, the unshuffled files are deleted).
This is how it works for the internal subwords and I think it was the case also for the external BPEs (but I may be wrong).
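As an illustration of that shuffle-then-delete step, here is a minimal sketch that works on plain text files with one record per line. It is not the library's shuffling code: the real data files are binary record files, and the "-unshuffled" suffix and the shuffle_and_replace name are assumptions made for this example.

```python
import os
import random

UNSHUFFLED_SUFFIX = "-unshuffled"  # assumed naming convention for this sketch


def shuffle_and_replace(unshuffled_path, seed=None):
    """Shuffle the records of one file, write them to the path without
    the suffix, then delete the unshuffled file."""
    if not unshuffled_path.endswith(UNSHUFFLED_SUFFIX):
        # Refuse to run if input and output would be the same file.
        raise ValueError("expected a path ending in " + UNSHUFFLED_SUFFIX)
    with open(unshuffled_path, encoding="utf-8") as f:
        records = f.readlines()
    random.Random(seed).shuffle(records)
    shuffled_path = unshuffled_path[:-len(UNSHUFFLED_SUFFIX)]
    with open(shuffled_path, "w", encoding="utf-8") as f:
        f.writelines(records)
    os.remove(unshuffled_path)
    return shuffled_path
```

If a trainer looks for files matching a pattern such as translate_enit_bpe8k-train* and only suffixed, unshuffled files exist, a step like this having been skipped is one possible explanation worth checking.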