Allennlp: Handling unbalanced datasets in the CRF tagger

Created on 1 Sep 2020 · 5 Comments · Source: allenai/allennlp

I'm training an NER model on an unbalanced corpus and would like to weight the entity classes during training, as nn.CrossEntropyLoss allows. Which .py file should I modify so that I can call my own library and pass the weights to the model?

Some tutorials point to: https://github.com/allenai/allennlp/blob/052353ed62e3a54fd7b39a660e65fc5dd2f91c7d/allennlp/nn/util.py#L628
AllenNLP version: 0.9.0
Script:

{
  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      },
      "elmo": {
        "type": "elmo_characters"
      }
    }
  },
  "train_data_path": "train.txt",
  "validation_data_path": "dev.txt",
  "test_data_path": "test.txt",
  "model": {
    "type": "crf_tagger",
    "label_encoding": "BIOUL",
    "calculate_span_f1": true,
    "dropout": 0.5,
    "include_start_end_transitions": false,
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {
          "type": "embedding",
          "embedding_dim": 300,
          "pretrained_file": "/model/glove/glove_s300.zip",
          "trainable": true
        },
        "elmo": {
          "type": "elmo_token_embedder",
          "options_file": "/model/elmo/options.json",
          "weight_file": "/model/elmo/elmo_pt_weights_dgx1.hdf5",
          "do_layer_norm": false,
          "dropout": 0.0
        },
        "token_characters": {
          "type": "character_encoding",
          "embedding": {
            "embedding_dim": 16
          },
          "encoder": {
            "type": "cnn",
            "embedding_dim": 16,
            "num_filters": 128,
            "ngram_filter_sizes": [3],
            "conv_layer_activation": "relu"
          }
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 1452,
      "hidden_size": 200,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    },
    "verbose_metrics": true,
    "regularizer": [
      [
        "scalar_parameters",
        {
          "type": "l2",
          "alpha": 0.1
        }
      ]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 32
  },
  "trainer": {
    "optimizer": {
      "type": "adam",
      "lr": 0.001
    },
    "validation_metric": "+f1-measure-overall",
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 25,
    "cuda_device": [0]
  },
}

Contributions welcome · Feature request

calusbr

All 5 comments

Hi @calusbr, the CrfTagger model doesn't currently have a way to specify these weights. I would recommend copying the code to your own repo and modifying it to add in the weighting that you want.
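For anyone taking the copy-and-modify route, it may help to see what class weighting looks like in the token-level setting the question refers to. The sketch below is plain Python with hypothetical helper names (`inverse_frequency_weights`, `weighted_cross_entropy`); neither is part of AllenNLP. It derives per-tag weights from tag counts and applies them in a cross-entropy that is averaged by the sum of the gold-tag weights, which is how `torch.nn.CrossEntropyLoss` treats its `weight` argument under mean reduction. Note that this is a per-token loss over a softmax classifier; the CRF objective is a sequence-level likelihood, which is why these weights cannot simply be passed through to CrfTagger.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(tag_sequences: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Weight each tag by total / (num_tags * count), so rare tags get larger weights.

    Hypothetical helper for illustration; the weighting scheme is an assumption.
    """
    counts = Counter(tag for sequence in tag_sequences for tag in sequence)
    total = sum(counts.values())
    return {tag: total / (len(counts) * count) for tag, count in counts.items()}


def weighted_cross_entropy(
    logits: List[List[float]],
    gold: List[str],
    labels: List[str],
    weights: Dict[str, float],
) -> float:
    """Token-level weighted cross-entropy, averaged by the sum of the gold weights."""
    loss, norm = 0.0, 0.0
    for scores, tag in zip(logits, gold):
        # Log of the softmax normalizer for this token.
        log_z = math.log(sum(math.exp(s) for s in scores))
        log_prob = scores[labels.index(tag)] - log_z
        loss -= weights[tag] * log_prob
        norm += weights[tag]
    return loss / norm


# Toy data: "O" dominates, entity tags are rare.
train_tags = [["O", "O", "B-PER", "O"], ["O", "B-LOC", "O", "O"]]
weights = inverse_frequency_weights(train_tags)
print(weights)
```

With these toy counts the entity tags receive a larger weight than "O", so errors on entity tokens contribute more to the averaged loss.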

matt-gardner on 4 Sep 2020

Let me add to this: In case you develop a general way of giving class weights to CrfTagger, I'd love to review the pull request for it :-)

dirkgr on 11 Sep 2020

👍3

@matt-gardner Thanks for the feedback! @dirkgr I believe it is possible to create a task for this problem. Do you have any idea how we can add weights?

@matt-gardner In case there is any tip on how we can add this weight without changing the official code, I await feedback!

calusbr on 22 Sep 2020

I did a quick Google of this problem, and unless I missed something, no major library has weighted CRFs. That tells me that a) it would be very cool if AllenNLP did, and b) it's not easy.

I did find this paper referenced a few times: https://perso.uclouvain.be/michel.verleysen/papers/ieeetbe12gdl.pdf

They give the math, but not code, on how to do it. I'd recommend copying the existing CrfTagger code, adding the math from the paper there, and submitting a PR to us. Or maybe do some more research first to find a more accessible paper, or maybe even an existing implementation somewhere that could be adapted.
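To make the discussion concrete, here is a rough sketch of one simple way weights could enter a linear-chain CRF loss: scaling each emission score by the weight of its label before computing the negative log-likelihood. This is a heuristic chosen for illustration and is not the formulation from the paper linked above. The code is plain Python with hypothetical function names, does not use the AllenNLP API, and omits start and end transitions (matching `"include_start_end_transitions": false` in the config from the question).

```python
import math
from itertools import product
from typing import List


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def scale_emissions(emissions: List[List[float]], label_weights: List[float]) -> List[List[float]]:
    """Multiply every emission score by the weight of its label (assumed heuristic)."""
    return [[w * s for w, s in zip(label_weights, row)] for row in emissions]


def sequence_score(emissions: List[List[float]], transitions: List[List[float]], tags: List[int]) -> float:
    """Unnormalized score of one tag sequence: emissions plus transitions."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions: List[List[float]], transitions: List[List[float]]) -> float:
    """Forward algorithm: log of the summed exponentiated scores of all tag sequences."""
    num_labels = len(transitions)
    alpha = list(emissions[0])
    for row in emissions[1:]:
        alpha = [
            row[j] + log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_labels)])
            for j in range(num_labels)
        ]
    return log_sum_exp(alpha)


def weighted_crf_nll(
    emissions: List[List[float]],
    transitions: List[List[float]],
    tags: List[int],
    label_weights: List[float],
) -> float:
    """Negative log-likelihood of `tags` under a CRF with weight-scaled emissions."""
    scaled = scale_emissions(emissions, label_weights)
    return log_partition(scaled, transitions) - sequence_score(scaled, transitions, tags)


# Toy example: 3 tokens, 2 labels; label 1 plays the role of the rare class.
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
transitions = [[0.1, -0.4], [-0.2, 0.3]]
gold = [0, 1, 0]
print(weighted_crf_nll(emissions, transitions, gold, [1.0, 1.0]))  # unweighted baseline
print(weighted_crf_nll(emissions, transitions, gold, [1.0, 3.0]))  # label 1 up-weighted

# Sanity check: the probabilities of all tag sequences still sum to 1.
total = sum(
    math.exp(-weighted_crf_nll(emissions, transitions, list(path), [1.0, 3.0]))
    for path in product(range(2), repeat=3)
)
print(total)
```

Because the weights change the scores the CRF normalizes over rather than reweighting individual tokens in the loss, whether this helps rare entities would need to be checked empirically. A proper implementation inside a copy of CrfTagger would also need batching and masking.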

dirkgr on 23 Sep 2020

I glanced over the paper and it seems relatively straightforward, but it would definitely take some refactoring of the CrfTagger, and I'm not sure what the performance implications are.

But this definitely interests me, so if no one else decides to work on it, I might eventually do it.