Hi everyone,
I am trying to generate text with the pre-trained Transformer-XL model, similar to how we do it with the GPT-2 model. I suspect there is a bug in my sample_sequence function after adapting it to the Transformer-XL architecture, because the generated text is essentially random, both in general and with respect to the context.
The core sampling loop looks very similar to the GPT-2 one:
with torch.no_grad():
    for i in trange(length):
        # Transformer-XL returns (logits, new_mems); carry the memory forward
        logits, past = model(prev, mems=past)
        # keep only the logits for the last position and apply temperature
        logits = logits[:, -1, :] / temperature
        logits = top_k_logits(logits, k=top_k)
        # softmax gives probabilities (despite the variable name), which is
        # what torch.multinomial expects
        log_probs = F.softmax(logits, dim=-1)
        if sample:
            prev = torch.multinomial(log_probs, num_samples=1)
        else:
            _, prev = torch.topk(log_probs, k=1, dim=-1)
        output = torch.cat((output, prev), dim=1)
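(The top_k_logits helper isn't shown here; it is presumably the same filtering function as in the GPT-2 sample script. A minimal sketch of what it does, assuming logits of shape [batch, vocab], could look like this:)

import torch

def top_k_logits(logits, k):
    # keep the k largest logits per row and push everything else to -inf,
    # so that softmax assigns those tokens zero probability
    if k == 0:
        return logits
    values, _ = torch.topk(logits, k)
    min_values = values[:, -1].unsqueeze(1)
    return torch.where(logits < min_values,
                       torch.full_like(logits, float('-inf')),
                       logits)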
What is the bug that I'm missing?
Here's an example of text generation that picks the second most likely word at each step:
import torch

# adjust the import to match your installed version of the Hugging Face library
# (e.g. transformers, pytorch_transformers, or pytorch_pretrained_bert)
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')

# pick a GPU if available and put the model in inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

line = "Cars were invented in"
line_tokenized = tokenizer.tokenize(line)
line_indexed = tokenizer.convert_tokens_to_ids(line_tokenized)
tokens_tensor = torch.tensor([line_indexed])
tokens_tensor = tokens_tensor.to(device)

max_predictions = 50
mems = None
for i in range(max_predictions):
    predictions, mems = model(tokens_tensor, mems=mems)
    # torch.topk returns (values, indices); [1] selects the indices and the
    # second [1] deliberately takes the second most likely token
    predicted_index = torch.topk(predictions[0, -1, :], 5)[1][1].item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    print(predicted_token)
    predicted_index = torch.tensor([[predicted_index]]).to(device)
    tokens_tensor = torch.cat((tokens_tensor, predicted_index), dim=1)
Should produce
and
America
,
but
the
first
two
cars
had
to
have
been
a
"
Turbo
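To get the generated text back as a single string instead of printing one token per line, something along these lines should work with the same tokenizer and tokens_tensor as above (WikiText-103 tokenization is word-level, so a plain space join is only a rough approximation of the original text):

# turn the accumulated ids back into tokens and join them
generated_ids = tokens_tensor[0].tolist()
generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids)
print(" ".join(generated_tokens))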
Yeah, figured it out. Thanks nevertheless, @yaroslavvb!
@yaroslavvb I think there is a bug in the code you shared:

predicted_index = torch.topk(predictions[0, -1, :], 5)[1][1].item()

Why is it not

predicted_index = torch.topk(predictions[0, -1, :], 5)[1][0].item()

or is it probably not a bug?
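(For reference, torch.topk returns a (values, indices) tuple, so [1][1] picks the index of the second most likely token, which matches the "picks second most likely word at each step" description above, while [1][0] would give greedy decoding. A tiny illustration:)

import torch

logits = torch.tensor([0.1, 3.0, 0.5, 2.0])
values, indices = torch.topk(logits, 3)
print(indices[0].item())  # 1, the index of the most likely entry
print(indices[1].item())  # 3, the index of the second most likely entry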
@yaroslavvb Why does text generation with Transformer-XL loop over the number of predictions requested (max_predictions)?
Given a fixed input such as line = "Cars were invented in", which is 21 characters or 4 words (depending on whether the model was trained for character or word output), why can't we generate the next 21 characters or 4 words directly from the T-XL output all at once, and then another 21 characters or 4 words in the next iteration?
I thought one advantage of the T-XL over the vanilla Transformer was the ability to predict a whole next sequence without having to loop, adding character by character or word by word at the input.
Isn't the T-XL trained by computing the loss over the whole input and the whole target (label) without looping?
So why would it be different during text generation? Is it to provide a more accurate context for the prediction by adding the previous predictions one by one?
@shashwath94 Could you please post your fix, so that we can learn by example? Thanks.
@gussmith you could do it this way, but empirically the results are very bad. The model is trained to maximize the probability of the next token only. What looks like a loss over the whole sequence is actually a parallelization trick to compute many "next token prediction" losses in a single pass.
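In other words, during training the targets are just the inputs shifted by one position, so all the next-token losses come out of a single forward pass. A rough sketch of the idea (using plain cross-entropy rather than the adaptive softmax the actual model uses):

import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    # logits: [batch, seq_len, vocab_size], input_ids: [batch, seq_len]
    # position t is trained to predict token t+1, so shift both by one
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

At generation time, however, token t+1 is not known in advance, which is why the sampling loop has to feed each prediction back in one step at a time.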