Allennlp: Is there a way to compute loss on a batch of instances instead of per instance?

Created on 12 Apr 2019 · 15 Comments · Source: allenai/allennlp

System (please complete the following information):

OS: Linux
Python version: 3.6.6
AllenNLP version: v0.8.3
PyTorch version: 0.4.1

I am working on an SNLI-style sentence-pair labeling task. Essentially, I have one premise and arbitrarily many hypotheses, of which only one can be correct. My variable-size batch would look like this:

[(premise, hypothesis1)]
[(premise, hypothesis2)]
[(premise, hypothesis3)]
....
[(premise, hypothesisn)]

Again, n can vary. I was planning to write a custom iterator to generate such batches. I want to compute scores for all the instances in the batch and then apply a softmax along the batch axis.
For the example above, I expect a softmax vector of size n, which I want to compare with the batch labels to compute the batch loss. Is it possible to implement this sort of batch softmax and batch loss in AllenNLP/PyTorch?


abaheti95

All 15 comments

It sounds like you're structuring your data such that you have one Instance being a (premise, hypothesis) pair. It'd probably be simpler for you to structure your data such that each Instance is a (premise, [hypotheses]) pair. Then the loss computation is simple, and everything matches what you want. Does this make sense?
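A minimal sketch of that restructuring in plain Python (not AllenNLP code), assuming the data are rows of (premise, hypothesis, label) in which rows for the same premise are adjacent. The function name `group_by_premise` and the row layout are hypothetical, chosen only for illustration:

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str, int]  # (premise, hypothesis, label) -- assumed layout


def group_by_premise(rows: Iterable[Row]) -> Iterator[Tuple[str, List[str], List[int]]]:
    """Yield (premise, [hypotheses], [labels]) for each run of rows sharing a premise."""
    for premise, group in groupby(rows, key=lambda row: row[0]):
        members = list(group)
        hypotheses = [hypothesis for _, hypothesis, _ in members]
        labels = [label for _, _, label in members]
        yield premise, hypotheses, labels


rows = [
    ("a man eats", "someone is eating", 1),
    ("a man eats", "nobody is eating", 0),
    ("a dog runs", "an animal moves", 1),
    ("a dog runs", "a cat sleeps", 0),
    ("a dog runs", "the dog is still", 0),
]
grouped = list(group_by_premise(rows))
```

Each element of `grouped` then corresponds to one Instance: a single premise, a variable-length list of hypotheses, and one label per hypothesis.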

matt-gardner on 12 Apr 2019

The problem with that is that the number of hypotheses can be huge (as many as 2000+). Secondly, I'm using the decomposable attention model (decomposable_attention.py), and I am not sure how to update that model if I change the Instance to the (premise, [hypotheses]) format.

abaheti95 on 12 Apr 2019

Are you expecting to have a batch size of 2000 when doing it in the way that you're thinking?

My point is that, conceptually, your input appears to be a premise and a list of hypotheses, with the output being a choice over the hypotheses. If this is true, you really want your Instances to be structured this way. First figure out what the right way is to think about your problem, _then_ figure out how to structure your code to match that, instead of seeing what code is available and trying to shoehorn your task in that format. Once you come up with a format that you're happy with and that matches the structure of your problem, if you have questions about how to implement that format in AllenNLP, we can try to answer your questions.

matt-gardner on 12 Apr 2019

Hi,
Thank you for the response.
I thought about the problem. I want to use an architecture similar to the decomposable attention model; however, I am now arranging each Instance as you suggested, as (premise, [hypotheses]). This is the code I have so far:

class QuestionResponseSoftmaxReader(DatasetReader):
    def __init__(self,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 lazy: bool = False) -> None:
        super().__init__(lazy)
        self._tokenizer = tokenizer or WordTokenizer()
        self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}

    def _read(self, file_path: str):
        # if `file_path` is a URL, redirect to the cache
        file_path = cached_path(file_path)

        with open(file_path, 'r') as features_file:
            logger.info("Reading Generated Responses and questions instances from features file: %s", file_path)
            current_qa = None
            current_responses = list()
            current_labels = list()
            for i, line in enumerate(features_file):
                # TODO: remove this after debugging
                # if i==10000:
                #   break
                line = line.strip()
                row = re.split('\t|\\t', line)

                q = row[0].strip()
                q = q.lower()
                a = row[1].strip()
                a = a.lower()
                if current_qa != (q,a):
                    # send the previous batch
                    if len(current_responses) > 1:
                        yield self.text_to_instance(current_qa[0], current_responses, current_labels)
                    current_qa = (q,a)
                    current_responses = list()
                    current_labels = list()
                r = row[2].strip()
                r = r.lower()
                rule = row[3].strip()
                count = row[-1]
                if int(count) > 0:
                    label = "1"
                else:
                    label = "0"
                current_responses.append(r)
                current_labels.append(label)

            # yield the last batch
            if len(current_responses) > 1:
                yield self.text_to_instance(current_qa[0], current_responses, current_labels)
                current_qa = None
                current_responses = list()
                current_labels = list()

    def text_to_instance(self,  # type: ignore
                         premise: str,
                         hypotheses: List[str],
                         labels: List[str] = None) -> Instance:
        # pylint: disable=arguments-differ
        fields: Dict[str, Field] = {}
        premise_tokens = self._tokenizer.tokenize(premise)
        fields['premise'] = TextField(premise_tokens, self._token_indexers)
        all_hypotheses_fields = list()
        for hypothesis in hypotheses:
            hypothesis_tokens = self._tokenizer.tokenize(hypothesis)
            all_hypotheses_fields.append(TextField(hypothesis_tokens, self._token_indexers))
        fields['hypotheses'] = ListField(all_hypotheses_fields)
        if labels:
            all_labels_fields = list()
            for label in labels:
                all_labels_fields.append(LabelField(label))
            fields['labels'] = ListField(label)

        # metadata = {"premise_tokens": [x.text for x in premise_tokens],
        #           "hypothesis_tokens": [x.text for x in hypothesis_tokens]}
        # fields["metadata"] = MetadataField(metadata)
        return Instance(fields)

Am I creating the instance correctly here? Also I am getting the following error when I try to read my train data:

Traceback (most recent call last):
  File "decomposable_attention_model_softmax_training.py", line 149, in <module>
    vocab = Vocabulary.from_instances(train_dataset + val_dataset)
  File "/home/baheti/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/data/vocabulary.py", line 397, in from_instances
    instance.count_vocab_items(namespace_token_counts)
  File "/home/baheti/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/data/instance.py", line 57, in count_vocab_items
    field.count_vocab_items(counter)
  File "/home/baheti/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/data/fields/list_field.py", line 47, in count_vocab_items
    field.count_vocab_items(counter)
AttributeError: 'str' object has no attribute 'count_vocab_items'

What would be the correct Fields to use for the list hypotheses and the list labels? Kindly help me with this.

abaheti95 on 18 Apr 2019

you are creating a list of LabelFields but not using them, where you have

fields['labels'] = ListField(label)

it should be

fields['labels'] = ListField(all_labels_fields)

joelgrus on 18 Apr 2019


@joelgrus Thank you for the correction. I also want some help in the model development part. As I mentioned before I want to keep the components from Decomposable Attention Model intact and just change the objective and loss computation. I have copied the code from DecomposableAttention(Model) and tried to implement it in the forward() function like this:

class DecomposableAttentionSoftmax(Model):
    """
    This ``Model`` implements the Decomposable Attention model described in `"A Decomposable
    Attention Model for Natural Language Inference"
    <https://www.semanticscholar.org/paper/A-Decomposable-Attention-Model-for-Natural-Languag-Parikh-T%C3%A4ckstr%C3%B6m/07a9478e87a8304fc3267fa16e83e9f3bbd98b27>`_
    by Parikh et al., 2016, with some optional enhancements before the decomposable attention
    actually happens.  Parikh's original model allowed for computing an "intra-sentence" attention
    before doing the decomposable entailment step.  We generalize this to any
    :class:`Seq2SeqEncoder` that can be applied to the premise and/or the hypothesis before
    computing entailment.
    The basic outline of this model is to get an embedded representation of each word in the
    premise and hypothesis, align words between the two, compare the aligned phrases, and make a
    final entailment decision based on this aggregated comparison.  Each step in this process uses
    a feedforward network to modify the representation.
    Parameters
    ----------
    vocab : ``Vocabulary``
    text_field_embedder : ``TextFieldEmbedder``
        Used to embed the ``premise`` and ``hypothesis`` ``TextFields`` we get as input to the
        model.
    attend_feedforward : ``FeedForward``
        This feedforward network is applied to the encoded sentence representations before the
        similarity matrix is computed between words in the premise and words in the hypothesis.
    similarity_function : ``SimilarityFunction``
        This is the similarity function used when computing the similarity matrix between words in
        the premise and words in the hypothesis.
    compare_feedforward : ``FeedForward``
        This feedforward network is applied to the aligned premise and hypothesis representations,
        individually.
    aggregate_feedforward : ``FeedForward``
        This final feedforward network is applied to the concatenated, summed result of the
        ``compare_feedforward`` network, and its output is used as the entailment class logits.
    premise_encoder : ``Seq2SeqEncoder``, optional (default=``None``)
        After embedding the premise, we can optionally apply an encoder.  If this is ``None``, we
        will do nothing.
    hypothesis_encoder : ``Seq2SeqEncoder``, optional (default=``None``)
        After embedding the hypothesis, we can optionally apply an encoder.  If this is ``None``,
        we will use the ``premise_encoder`` for the encoding (doing nothing if ``premise_encoder``
        is also ``None``).
    initializer : ``InitializerApplicator``, optional (default=``InitializerApplicator()``)
        Used to initialize the model parameters.
    regularizer : ``RegularizerApplicator``, optional (default=``None``)
        If provided, will be used to calculate the regularization penalty during training.
    """
    def __init__(self, vocab: Vocabulary,
                 text_field_embedder: TextFieldEmbedder,
                 attend_feedforward: FeedForward,
                 similarity_function: SimilarityFunction,
                 compare_feedforward: FeedForward,
                 aggregate_feedforward: FeedForward,
                 premise_encoder: Optional[Seq2SeqEncoder] = None,
                 hypothesis_encoder: Optional[Seq2SeqEncoder] = None,
                 initializer: InitializerApplicator = InitializerApplicator(),
                 regularizer: Optional[RegularizerApplicator] = None) -> None:
        super().__init__(vocab, regularizer)

        self._text_field_embedder = text_field_embedder
        self._attend_feedforward = TimeDistributed(attend_feedforward)
        self._matrix_attention = LegacyMatrixAttention(similarity_function)
        self._compare_feedforward = TimeDistributed(compare_feedforward)
        self._aggregate_feedforward = aggregate_feedforward
        self._premise_encoder = premise_encoder
        self._hypothesis_encoder = hypothesis_encoder or premise_encoder

        self._num_labels = vocab.get_vocab_size(namespace="labels")

        check_dimensions_match(text_field_embedder.get_output_dim(), attend_feedforward.get_input_dim(),
                               "text field embedding dim", "attend feedforward input dim")
        check_dimensions_match(aggregate_feedforward.get_output_dim(), self._num_labels,
                               "final output dimension", "number of labels")

        self._accuracy = CategoricalAccuracy()
        self._loss = torch.nn.CrossEntropyLoss()

        initializer(self)

    def forward(self,  # type: ignore
                premise: Dict[str, torch.LongTensor],
                hypotheses: List[Dict[str, torch.LongTensor]],
                labels: List[torch.IntTensor] = None,
                metadata: List[Dict[str, Any]] = None) -> Dict[str, torch.Tensor]:
        # pylint: disable=arguments-differ
        """
        Parameters
        ----------
        premise : Dict[str, torch.LongTensor]
            From a ``TextField``
        hypothesis : Dict[str, torch.LongTensor]
            From a ``TextField``
        label : torch.IntTensor, optional, (default = None)
            From a ``LabelField``
        metadata : ``List[Dict[str, Any]]``, optional, (default = None)
            Metadata containing the original tokenization of the premise and
            hypothesis with 'premise_tokens' and 'hypothesis_tokens' keys respectively.
        Returns
        -------
        An output dictionary consisting of:
        label_logits : torch.FloatTensor
            A tensor of shape ``(batch_size, num_labels)`` representing unnormalised log
            probabilities of the entailment label.
        label_probs : torch.FloatTensor
            A tensor of shape ``(batch_size, num_labels)`` representing probabilities of the
            entailment label.
        loss : torch.FloatTensor, optional
            A scalar loss to be optimised.
        """
        embedded_premise = self._text_field_embedder(premise)
        premise_mask = get_text_field_mask(premise).float()
        if self._premise_encoder:
            embedded_premise = self._premise_encoder(embedded_premise, premise_mask)
        projected_premise = self._attend_feedforward(embedded_premise)

        all_label_logits = list()
        all_label_probs = list()
        all_h2p_attention = list()
        all_p2h_attention = list()
        for hypothesis in hypotheses:
            embedded_hypothesis = self._text_field_embedder(hypothesis)
            hypothesis_mask = get_text_field_mask(hypothesis).float()

            if self._hypothesis_encoder:
                embedded_hypothesis = self._hypothesis_encoder(embedded_hypothesis, hypothesis_mask)

            projected_hypothesis = self._attend_feedforward(embedded_hypothesis)

            # Shape: (batch_size, premise_length, hypothesis_length)
            similarity_matrix = self._matrix_attention(projected_premise, projected_hypothesis)

            # Shape: (batch_size, premise_length, hypothesis_length)
            p2h_attention = masked_softmax(similarity_matrix, hypothesis_mask)
            all_p2h_attention.append(p2h_attention)
            # Shape: (batch_size, premise_length, embedding_dim)
            attended_hypothesis = weighted_sum(embedded_hypothesis, p2h_attention)

            # Shape: (batch_size, hypothesis_length, premise_length)
            h2p_attention = masked_softmax(similarity_matrix.transpose(1, 2).contiguous(), premise_mask)
            all_h2p_attention.append(h2p_attention)
            # Shape: (batch_size, hypothesis_length, embedding_dim)
            attended_premise = weighted_sum(embedded_premise, h2p_attention)

            premise_compare_input = torch.cat([embedded_premise, attended_hypothesis], dim=-1)
            hypothesis_compare_input = torch.cat([embedded_hypothesis, attended_premise], dim=-1)

            compared_premise = self._compare_feedforward(premise_compare_input)
            compared_premise = compared_premise * premise_mask.unsqueeze(-1)
            # Shape: (batch_size, compare_dim)
            compared_premise = compared_premise.sum(dim=1)

            compared_hypothesis = self._compare_feedforward(hypothesis_compare_input)
            compared_hypothesis = compared_hypothesis * hypothesis_mask.unsqueeze(-1)
            # Shape: (batch_size, compare_dim)
            compared_hypothesis = compared_hypothesis.sum(dim=1)

            aggregate_input = torch.cat([compared_premise, compared_hypothesis], dim=-1)
            label_logit = self._aggregate_feedforward(aggregate_input)
            all_label_logits.append(label_logit)

        # How to apply softmax on all_label_logits
        label_probs = torch.nn.functional.softmax(label_logits, dim=-1)

        output_dict = {"label_logits": all_label_logits,
                       "label_probs": label_probs,
                       "h2p_attention": all_h2p_attention,
                       "p2h_attention": all_p2h_attention}

        # How to compute correct loss here?
        if labels is not None:
            loss = self._loss(label_logits, labels.long().view(-1))
            self._accuracy(label_logits, label)
            output_dict["loss"] = loss

        # if metadata is not None:
        #     output_dict["premise_tokens"] = [x["premise_tokens"] for x in metadata]
        #     output_dict["hypothesis_tokens"] = [x["hypothesis_tokens"] for x in metadata]

        return output_dict

Specifically, I have the following questions:

1. Are the data types of hypotheses and labels correct in the forward() function?
2. How do I apply softmax to the list all_label_logits? Is creating a list the right thing to do here?
3. How do I correctly compute the loss after taking the softmax (what is the correct syntax for using the list of labels)?
4. Can I modify the code so that it doesn't use a for-loop to iterate over all the hypotheses? I am guessing that a for-loop will make it very inefficient.

abaheti95 on 18 Apr 2019

Hmm, we definitely need better documentation on how putting things in lists works. I'm planning on making some better docs soon, and I'll be sure to include this.

When you make a ListField[TextField], you still get a Dict[str, Tensor] in the model, it's just that each Tensor has an extra dimension. You need to accompany this with passing the argument num_wrapping_dims=1 to the call to self._text_field_embedder(hypothesis). In the end, you'll have a tensor of shape (batch_size, num_hypotheses, num_hypothesis_tokens, embedding_dim), and you'll have to structure your operations to work with the extra dimension, and aggregate across the num_hypotheses dimension somehow when computing your loss.
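One possible way to work with the extra dimension without a for-loop, sketched in plain PyTorch: fold the num_hypotheses dimension into the batch dimension, repeat each premise once per hypothesis, score all pairs at once, and reshape back. The tensors here are random stand-ins for embedded text, and `toy_pair_score` is a hypothetical placeholder for the pairwise model (it is not the decomposable attention computation and not an AllenNLP function):

```python
import torch

batch_size, num_hypotheses, premise_len, hyp_len, dim = 2, 3, 5, 4, 8
torch.manual_seed(0)
premise = torch.randn(batch_size, premise_len, dim)
hypotheses = torch.randn(batch_size, num_hypotheses, hyp_len, dim)


def toy_pair_score(p: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Stand-in for a pairwise model: (N, P, D), (N, T, D) -> (N,)."""
    return (p.mean(dim=1) * h.mean(dim=1)).sum(dim=-1)


# Fold the hypotheses dimension into the batch dimension.
flat_hypotheses = hypotheses.reshape(batch_size * num_hypotheses, hyp_len, dim)
# Repeat each premise once per hypothesis so the pairs line up.
flat_premise = (premise.unsqueeze(1)
                .expand(-1, num_hypotheses, -1, -1)
                .reshape(batch_size * num_hypotheses, premise_len, dim))

# One score per (premise, hypothesis) pair, reshaped to (batch_size, num_hypotheses).
scores = toy_pair_score(flat_premise, flat_hypotheses).view(batch_size, num_hypotheses)

# Reference: the same scores computed by looping over hypotheses one at a time.
looped = torch.stack(
    [toy_pair_score(premise, hypotheses[:, i]) for i in range(num_hypotheses)], dim=1
)
```

The resulting `scores` tensor has one entry per hypothesis, so a softmax over its last dimension gives a distribution over the hypotheses of each premise. With thousands of hypotheses per premise, memory use grows accordingly, so this is a trade-off rather than a free win.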

matt-gardner on 21 Apr 2019

@joelgrus @matt-gardner I was busy with something else, so I couldn't work on this for a while. I am facing an issue with fields['labels']. Although I set it to a ListField, I get a single tensor of size 1 when I load the batch. Could you tell me what I'm doing wrong here?

def forward(self,  # type: ignore
                premise: Dict[str, torch.LongTensor],
                hypotheses: Dict[str, torch.LongTensor],
                labels: List[torch.IntTensor] = None,
                metadata: List[Dict[str, Any]] = None) -> Dict[str, torch.Tensor]:

        print(hypotheses.keys())
        print("Hypotheses")
        print(type(hypotheses["tokens"]))
        print(hypotheses["tokens"].shape)
        print("Premise")
        print(type(premise["tokens"]))
        print(premise["tokens"].shape)
        print("Labels")
        print(type(labels))
        print(len(labels))
        exit()

This gives the output:

dict_keys(['tokens'])
Hypotheses
<class 'torch.Tensor'>
torch.Size([1, 25, 11])
Premise
<class 'torch.Tensor'>
torch.Size([1, 13])
Labels
<class 'torch.Tensor'>
1

I was expecting labels to be a List[torch.IntTensor]. Kindly help me with this.

abaheti95 on 1 May 2019

I also tried to send the labels from the metadata field and it is able to send the labels like I want. Essentially in my text_to_instance() function of reader I added metadata as follows

metadata = {"labels": all_labels_fields}
fields["metadata"] = MetadataField(metadata)

In my model's forward() function, I found metadata to be of type list. It had only one element because I had set the batch size to 1. The first element was a dictionary with a labels key and a List[torch.IntTensor] as its value (which is exactly what I want in the labels field). Currently my forward() function looks like this:

    def forward(self,  # type: ignore
                premise: Dict[str, torch.LongTensor],
                hypotheses: Dict[str, torch.LongTensor],
                labels: List[torch.IntTensor] = None,
                metadata: List[Dict[str, Any]] = None) -> Dict[str, torch.Tensor]:

        print(hypotheses.keys())
        print("Hypotheses")
        print(type(hypotheses["tokens"]))
        print(hypotheses["tokens"].shape)
        print("Premise")
        print(type(premise["tokens"]))
        print(premise["tokens"].shape)
        print("Labels")
        print(type(labels))
        print(len(labels))
        print("Metadata")
        print(type(metadata[0]))
        print(len(metadata[0]['labels']))
        exit()

and it gives the following output:

dict_keys(['tokens'])
Hypotheses
<class 'torch.Tensor'>
torch.Size([1, 46, 17])
Premise
<class 'torch.Tensor'>
torch.Size([1, 11])
Labels
<class 'torch.Tensor'>
1
Metadata
<class 'dict'>
46

Here I received 46 labels for 46 different hypotheses of my input. Kindly inform me of the correct way to do this.

abaheti95 on 1 May 2019

in general, a ListField doesn't generate a list of tensors, it stacks all the tensors together into a single tensor with one added dimension
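A small illustration of that stacking in plain PyTorch, as a sketch rather than AllenNLP's actual batching code. The token ids are made up, and 0 is assumed to be the padding id:

```python
import torch

# Three hypotheses of different lengths, as token-id tensors (ids are made up).
hypotheses = [torch.tensor([4, 9, 2]), torch.tensor([7, 1]), torch.tensor([5, 3, 8, 6])]
max_len = max(h.size(0) for h in hypotheses)

# Pad every hypothesis with 0 up to the longest one, writing into a single tensor.
stacked = torch.zeros(len(hypotheses), max_len, dtype=torch.long)
for i, h in enumerate(hypotheses):
    stacked[i, :h.size(0)] = h           # (num_hypotheses, max_len)

# Batching then adds the batch dimension in front.
batched = stacked.unsqueeze(0)           # (batch_size=1, num_hypotheses, max_len)
```

The list of three tensors becomes one tensor of shape (1, 3, 4), which is the same pattern as the (1, 46, 17) hypotheses tensor printed earlier in this thread.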

joelgrus on 1 May 2019

Thank you for the update. However, I had provided the ListField with a list of labels, and for some reason it squashed them into a single tensor.

This is how I assign the 'labels' field in my instance creation

all_labels_fields = list()
for label in labels:
    all_labels_fields.append(LabelField(label))
fields['labels'] = ListField(all_labels_fields)

whereas I get a torch.Tensor of size 1 in my forward function (kindly refer to my previous two comments). Could you explain why this is happening?

abaheti95 on 2 May 2019

labels is a tensor; you shouldn't be printing len(labels), you should be printing labels.size(). You have batch size one, which is why you are seeing len(labels) == 1, because len(tensor) will tell you the size of the first dimension. If you were to print labels.size(), you should see something like torch.Size([1, 46]).
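The difference can be checked with a stand-in tensor of the same shape as the labels in this thread (the zeros are placeholder values only):

```python
import torch

# One instance (batch size 1) with 46 hypothesis labels.
labels = torch.zeros(1, 46, dtype=torch.long)

first_dim = len(labels)        # size of the first (batch) dimension only
full_shape = labels.size()     # every dimension
print(first_dim)               # 1
print(full_shape)              # torch.Size([1, 46])
```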

matt-gardner on 2 May 2019


Also, your type annotation is wrong - it should be labels: torch.IntTensor, not labels: List[torch.IntTensor]. We always batch things together into tensors for you, for efficient GPU computations, and the outer dimension is always the batch dimension. A ListField will add a dimension _after_ the batch dimension, as you see in the hypothesis field, which has shape (batch_size, num_hypotheses, hypothesis_length).
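With labels of shape (batch_size, num_hypotheses), a softmax over the hypotheses dimension and the matching loss could look roughly like the sketch below. It rests on several assumptions that are not confirmed in this thread: the model produces one score per hypothesis, exactly one hypothesis per premise is labeled 1, incorrect ones are labeled 0, and padded label slots hold -1. The score values are invented for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores, one per hypothesis: (batch_size=2, num_hypotheses=4).
scores = torch.tensor([[2.0, 0.5, -1.0, 0.0],
                       [0.1, 1.5, 0.3, 0.0]])
# 1 = correct hypothesis, 0 = incorrect, -1 = padding (assumed padding value).
labels = torch.tensor([[0, 1, 0, 0],
                       [1, 0, 0, -1]])

mask = labels >= 0                                  # real (non-padded) hypotheses
masked_scores = scores.masked_fill(~mask, -1e9)     # padded slots get ~zero probability
probs = F.softmax(masked_scores, dim=-1)            # softmax over the hypotheses dimension

target = (labels == 1).long().argmax(dim=-1)        # index of the correct hypothesis
loss = F.cross_entropy(masked_scores, target)       # mean negative log-likelihood
```

Here `F.cross_entropy` applies the log-softmax itself, so it takes the masked scores rather than `probs`. If the label strings are indexed through a vocabulary, the index assigned to the "correct" label may not be 1, so that mapping would need checking first.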

matt-gardner on 2 May 2019


@matt-gardner Thank you for the prompt responses. I deeply appreciate it. Your comments make a lot of sense. I will now try to figure out how to get the softmax working. Hopefully, I won't face too many obstacles now that I have figured out all the dimensions. Thank you again for the prompt clarification!

abaheti95 on 2 May 2019

Pretty sure the issue here has been solved; closing this issue.

matt-gardner on 13 Jun 2019
