Hi everyone,
I am trying to generate text with the pre-trained Transformer-XL model, similar to how we do it with the GPT-2 model. I suspect there is a bug in my sample_sequence function after adapting it to the Transformer-XL architecture, because the generated text is essentially random, both in general and with respect to the context.
The core sampling loop looks very similar to the GPT-2 one:
with torch.no_grad():
    for i in trange(length):
        # Transformer-XL returns (logits, new_mems); carry the memory forward
        logits, past = model(prev, mems=past)
        # keep only the logits for the last position and apply temperature
        logits = logits[:, -1, :] / temperature
        logits = top_k_logits(logits, k=top_k)
        # softmax gives probabilities (despite the variable name), which is
        # what torch.multinomial expects
        log_probs = F.softmax(logits, dim=-1)
        if sample:
            prev = torch.multinomial(log_probs, num_samples=1)
        else:
            _, prev = torch.topk(log_probs, k=1, dim=-1)
        output = torch.cat((output, prev), dim=1)
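(The top_k_logits helper isn't shown here; it is presumably the same filtering function as in the GPT-2 sample script. A minimal sketch of what it does, assuming logits of shape [batch, vocab], could look like this:)

import torch

def top_k_logits(logits, k):
    # keep the k largest logits per row and push everything else to -inf,
    # so that softmax assigns those tokens zero probability
    if k == 0:
        return logits
    values, _ = torch.topk(logits, k)
    min_values = values[:, -1].unsqueeze(1)
    return torch.where(logits < min_values,
                       torch.full_like(logits, float('-inf')),
                       logits)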
What is the bug that I'm missing?
Here's an example of text generation that picks the second most likely word at each step:
import torch

# adjust the import to match your installed version of the Hugging Face library
# (e.g. transformers, pytorch_transformers, or pytorch_pretrained_bert)
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')

# pick a GPU if available and put the model in inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

line = "Cars were invented in"
line_tokenized = tokenizer.tokenize(line)
line_indexed = tokenizer.convert_tokens_to_ids(line_tokenized)
tokens_tensor = torch.tensor([line_indexed])
tokens_tensor = tokens_tensor.to(device)

max_predictions = 50
mems = None
for i in range(max_predictions):
    predictions, mems = model(tokens_tensor, mems=mems)
    # torch.topk returns (values, indices); [1] selects the indices and the
    # second [1] deliberately takes the second most likely token
    predicted_index = torch.topk(predictions[0, -1, :], 5)[1][1].item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    print(predicted_token)
    predicted_index = torch.tensor([[predicted_index]]).to(device)
    tokens_tensor = torch.cat((tokens_tensor, predicted_index), dim=1)
Should produce
and
America
,
but
the
first
two
cars
had
to
have
been
a
"
Turbo
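To get the generated text back as a single string instead of printing one token per line, something along these lines should work with the same tokenizer and tokens_tensor as above (WikiText-103 tokenization is word-level, so a plain space join is only a rough approximation of the original text):

# turn the accumulated ids back into tokens and join them
generated_ids = tokens_tensor[0].tolist()
generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids)
print(" ".join(generated_tokens))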
Yeah, figured it out. Thanks nevertheless, @yaroslavvb!
@yaroslavvb I think there is a bug in the code you shared:

predicted_index = torch.topk(predictions[0, -1, :], 5)[1][1].item()

Why is it not

predicted_index = torch.topk(predictions[0, -1, :], 5)[1][0].item()

or is it probably not a bug?
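(For reference, torch.topk returns a (values, indices) tuple, so [1][1] picks the index of the second most likely token, which matches the "picks second most likely word at each step" description above, while [1][0] would give greedy decoding. A tiny illustration:)

import torch

logits = torch.tensor([0.1, 3.0, 0.5, 2.0])
values, indices = torch.topk(logits, 3)
print(indices[0].item())  # 1, the index of the most likely entry
print(indices[1].item())  # 3, the index of the second most likely entry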
@yaroslavvb Why does text generation with Transformer-XL loop over the number of predictions requested (max_predictions)?
Given a fixed input such as line = "Cars were invented in", which is 21 characters or 4 words (depending on whether the model was trained for character or word output), why can't we generate the next 21 characters or 4 words directly from the T-XL output all at once, and then another 21 characters or 4 words in the next iteration?
I thought one advantage of the T-XL over the vanilla Transformer was the ability to predict a whole next sequence without having to loop, adding character by character or word by word at the input.
Isn't the T-XL trained by computing the loss over the whole input and the whole target (label) without looping?
So why would it be different during text generation? Is it to provide a more accurate context for the prediction by adding the previous predictions one by one?
@shashwath94 Could you please post your fix, so that we can learn by example? Thanks.
@gussmith you could do it this way, but empirically the results are very bad. The model is trained to maximize the probability of the next token only. What looks like a loss over the whole sequence is actually a parallelization trick to compute many "next token prediction" losses in a single pass.
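In other words, during training the targets are just the inputs shifted by one position, so all the next-token losses come out of a single forward pass. A rough sketch of the idea (using plain cross-entropy rather than the adaptive softmax the actual model uses):

import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    # logits: [batch, seq_len, vocab_size], input_ids: [batch, seq_len]
    # position t is trained to predict token t+1, so shift both by one
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

At generation time, however, token t+1 is not known in advance, which is why the sampling loop has to feed each prediction back in one step at a time.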