I would like to use the transformer architecture for a sequence-labeling problem. I have two files, one consisting of the input tokens and the other of the labels. The labels are short strings, and there are about 100 different types of them. I guess I only need the encoder and no decoder, since the number of input tokens and output tokens is identical. The output could be realized as a class prediction for each input token. My question might be trivial, but how do I do this in t2t? I have seen the transformer_encoder used for phrase classification, but I am not clear on how to use it to classify each individual token.
I'm working on a similar project, and I've found it helpful to think about this as a translation problem, in a way. Essentially you're "translating" from whatever original language you're working in to a new "language" of labels. In my case I used the transformer model with hparams transformer_big, but made a few customizations to fit my particular dataset. (Those customizations are pretty specific to my data/problem so I won't go into them).
I'm still using the decoder to make predictions. I don't think I understand how you can make predictions without using the decoder. I do use the flags --decode_hparams="extra_length=0,force_decode_length=True" in the decoder script to ensure that the prediction length matches the input length (and in practice it still sometimes predicts extra length, but that hasn't been affecting my results so I've just let that one go...).
The solution with a decoder and extra_length=0 seems suboptimal for sequence labeling tasks, where the number of output labels is guaranteed to be the same as the number of input tokens. It makes no sense to train the enc-dec attention. A better approach would be to just add a softmax layer on top of the encoder (or you can call it a decoder).
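To make the "softmax layer on top of the encoder" idea concrete, here is a minimal NumPy sketch. It is not tensor2tensor code: the function name label_tokens, the random stand-in for the encoder states, and the weight shapes are all assumptions chosen for illustration. The point is only that each position is projected to the label vocabulary independently, so the output length always equals the input length.

```python
import numpy as np

def label_tokens(encoder_output, weights, bias, lengths):
    """Per-token softmax classifier over encoder states (illustrative only).

    encoder_output: [batch, max_len, hidden] array, a stand-in for real encoder states
    weights:        [hidden, num_labels] projection matrix
    bias:           [num_labels]
    lengths:        true (unpadded) length of each sentence in the batch
    """
    # Project every position independently: [batch, max_len, num_labels]
    logits = encoder_output @ weights + bias
    # Subtract the row maximum for numerical stability before exponentiating
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    label_ids = probs.argmax(axis=-1)
    # Drop predictions at padded positions, so each output matches its input length
    predictions = [label_ids[i, :n] for i, n in enumerate(lengths)]
    return predictions, probs

# Hypothetical sizes: 2 sentences, padded to 5 tokens, 100 label types
rng = np.random.default_rng(0)
batch, max_len, hidden, num_labels = 2, 5, 8, 100
encoder_output = rng.normal(size=(batch, max_len, hidden))
weights = rng.normal(size=(hidden, num_labels))
bias = np.zeros(num_labels)
lengths = [5, 3]

predictions, probs = label_tokens(encoder_output, weights, bias, lengths)
print([len(p) for p in predictions])  # [5, 3]
```

In a real model the projection would be trained with a per-token cross-entropy loss, with padded positions excluded from the loss.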
> in practice it still sometimes predicts extra length
In practice, there are multiple sentences of different lengths in a batch. So even with extra_length=0, any output label sequence may be as long as the longest input sequence in a given batch. You could try decoding with batch_size=1, which should prevent the overly long outputs, but some outputs may still be too short.
Of course, a properly trained model will learn to predict the same length in most cases, but this cannot be guaranteed (the length also depends on the alpha beam-search parameter).
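If you stay with the decoder, one possible workaround is to force the lengths to match in post-processing. The sketch below is plain Python and makes two assumptions that are not from this thread: the helper name align_labels is made up, and "O" is used as a filler label for outputs that come back too short. It hides the length mismatch rather than fixing its cause.

```python
def align_labels(tokens, labels, pad_label="O"):
    """Truncate or pad a predicted label sequence to the input length.

    pad_label is a hypothetical filler; use whatever "no label" tag fits your data.
    """
    n = len(tokens)
    # Cut off extra predictions, then fill any shortfall with pad_label
    trimmed = list(labels[:n])
    return trimmed + [pad_label] * (n - len(trimmed))

tokens = ["the", "cat", "sat"]
too_long = align_labels(tokens, ["DET", "NOUN", "VERB", "NOUN", "NOUN"])
too_short = align_labels(tokens, ["DET", "NOUN"])
print(too_long)   # ['DET', 'NOUN', 'VERB']
print(too_short)  # ['DET', 'NOUN', 'O']
```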
That's a very good point. Using the decoder is working well enough in my case, but you're right that it's probably unnecessary and I could get the same result much more efficiently with a softmax in place of the decoder step. And I wouldn't have those pesky samples with extra length.
Which I guess gets back to the original question of how to actually do that. I'm struggling to see how this would be implemented as a problem within tensor2tensor, though I'm probably just not understanding the library well enough yet. I see the Text2Class problem, but as the original issue points out, that's a single classification for the entire input. Setting the targets as a series of labels seems like it just gets you back to using the decoder.
I did see a possible solution in this issue: https://github.com/tensorflow/tensor2tensor/issues/813, but it would be nice to figure out how to do this within tensor2tensor.
I am here looking for the answer too. I have a problem similar to sequence labeling, where for every word many labels are certainly wrong, and I need to exclude them before the prediction.
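With a per-token classifier, excluding labels could look roughly like the following NumPy sketch: set the logits of the disallowed labels to minus infinity before taking the argmax. The function name masked_argmax and the toy numbers are assumptions for illustration, not part of tensor2tensor.

```python
import numpy as np

def masked_argmax(logits, allowed):
    """Pick the best label per token among the allowed ones.

    logits:  [length, num_labels] float array of per-token scores
    allowed: [length, num_labels] boolean array, True where a label is permitted
    """
    if not allowed.any(axis=-1).all():
        raise ValueError("every token needs at least one allowed label")
    # Disallowed labels get -inf, so they can never win the argmax
    masked = np.where(allowed, logits, -np.inf)
    return masked.argmax(axis=-1)

# Toy example: 2 tokens, 3 labels
logits = np.array([[2.0, 1.0, 0.5],
                   [0.1, 3.0, 0.2]])
allowed = np.array([[False, True, True],
                    [True, False, True]])
print(masked_argmax(logits, allowed))  # [1 2]
```

The same mask could be applied to the logits before the softmax during training, so that no probability mass is spent on labels known to be wrong.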