Models: How to include the lemma/stem in the output for universal_models?

Created on 25 Dec 2016 · 2 Comments · Source: tensorflow/models

Hi,
I would like to include the lemma/stem in the output for universal_models.

Now I have:

1   I           _         PRON    PRP   _   2   nsubj   _   _
2   got         _         VERB    VBD   _   0   ROOT    _   _
3   these       _         DET     DT    _   4   det     _   _
4   t-shirts    _         NOUN    NNS   _   2   dobj    _   _

but I would like to print the lemma/stem as well, like this:

1   I           I         PRON    PRP   _   2   nsubj   _   _
2   got         get       VERB    VBD   _   0   ROOT    _   _
3   these       this      DET     DT    _   4   det     _   _
4   t-shirts    t-shirt   NOUN    NNS   _   2   dobj    _   _

How can I do that?
I saw that the file "syntaxnet/text_formats.cc" documents the format like this:

...
namespace syntaxnet {
// CoNLL document format reader for dependency annotated corpora.
// The expected format is described e.g. at http://ilk.uvt.nl/conll/#dataformat
//
// Data should adhere to the following rules:
//   - Data files contain sentences separated by a blank line.
//   - A sentence consists of one or tokens, each one starting on a new line.
//   - A token consists of ten fields described in the table below.
//   - Fields are separated by a single tab character.
//   - All data files will contains these ten fields, although only the ID
//     column is required to contain non-dummy (i.e. non-underscore) values.
// Data files should be UTF-8 encoded (Unicode).
//
// Fields:
// 1  ID:      Token counter, starting at 1 for each new sentence and increasing
//             by 1 for every new token.
// 2  FORM:    Word form or punctuation symbol.
// 3  LEMMA:   Lemma or stem.
// 4  CPOSTAG: Coarse-grained part-of-speech tag or category.
// 5  POSTAG:  Fine-grained part-of-speech tag. Note that the same POS tag
//             cannot appear with multiple coarse-grained POS tags.
// 6  FEATS:   Unordered set of syntactic and/or morphological features.
// 7  HEAD:    Head of the current token, which is either a value of ID or '0'.
// 8  DEPREL:  Dependency relation to the HEAD.
// 9  PHEAD:   Projective head of current token.
// 10 PDEPREL: Dependency relation to the PHEAD.
//
// This CoNLL reader is compatible with the CoNLL-U format described at
//   http://universaldependencies.org/format.html
// Note that this reader skips CoNLL-U multiword tokens and ignores the last two
// fields of every line, which are PHEAD and PDEPREL in CoNLL format, but are
// replaced by DEPS and MISC in CoNLL-U.
//
...
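To make the column numbering in that comment concrete, here is a minimal standalone sketch (not SyntaxNet code) that splits one token line on tabs and picks out the LEMMA column. The names ConllField, SplitConllLine and LemmaOf are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Zero-based column positions matching the field table quoted above.
// These names are hypothetical; they do not come from text_formats.cc.
enum ConllField {
  kId = 0, kForm, kLemma, kCpostag, kPostag,
  kFeats, kHead, kDeprel, kPhead, kPdeprel
};
constexpr std::size_t kNumFields = 10;

// Splits a token line on single tab characters (hypothetical helper).
std::vector<std::string> SplitConllLine(const std::string &line) {
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type tab = line.find('\t', start);
    if (tab == std::string::npos) {
      fields.push_back(line.substr(start));
      break;
    }
    fields.push_back(line.substr(start, tab - start));
    start = tab + 1;
  }
  return fields;
}

// Returns the LEMMA column, or an empty string if the line does not have
// exactly ten fields or the lemma is the dummy value "_".
std::string LemmaOf(const std::string &line) {
  const std::vector<std::string> fields = SplitConllLine(line);
  if (fields.size() != kNumFields) return "";
  return fields[kLemma] == "_" ? "" : fields[kLemma];
}
```

With this layout, LEMMA is fields[2], which is the index used in the rest of this post.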

LEMMA is the third field (i.e. fields[2] in the code below), but it is never passed to the Token object:

...
// Get relevant fields.
  const string &word = fields[1];
  // NOTE: fields[2] (LEMMA) is not read here.
  const string &cpostag = fields[3];
  const string &tag = fields[4];
  const string &attributes = fields[5];
  const int head = utils::ParseUsing<int>(fields[6], 0, utils::ParseInt32);
  const string &label = fields[7];

  // Add token to sentence text.
  if (!text.empty()) text.append(" ");
  const int start = text.size();
  const int end = start + word.size() - 1;
  text.append(word);

  // Add token to sentence.
  Token *token = sentence->add_token();
  token->set_word(word);
  token->set_start(start);
  token->set_end(end);
  if (head > 0) token->set_head(head - 1);
  if (!tag.empty()) token->set_tag(tag);
  if (!cpostag.empty()) token->set_category(cpostag);
  if (!label.empty()) token->set_label(label);
  if (!attributes.empty()) AddMorphAttributes(attributes, token);
  if (join_category_to_pos_) JoinCategoryToPos(token);
  if (add_pos_as_attribute_) AddPosAsAttribute(token);

...

and then, when writing the output, the LEMMA column is hard-coded to "_":

...
for (int i = 0; i < sentence.token_size(); ++i) {
  Token token = sentence.token(i);
  if (join_category_to_pos_) SplitCategoryFromPos(&token);
  if (add_pos_as_attribute_) RemovePosFromAttributes(&token);
  vector<string> fields(10);
  fields[0] = tensorflow::strings::Printf("%d", i + 1);
  fields[1] = UnderscoreIfEmpty(token.word());
  fields[2] = "_";  // LEMMA is always written as "_" here.
  fields[3] = UnderscoreIfEmpty(token.category());
  fields[4] = UnderscoreIfEmpty(token.tag());
  fields[5] = GetMorphAttributes(token);
  fields[6] = tensorflow::strings::Printf("%d", token.head() + 1);
  fields[7] = UnderscoreIfEmpty(token.label());
  fields[8] = "_";
  fields[9] = "_";
  lines.push_back(utils::Join(fields, "\t"));
}
*value = tensorflow::strings::StrCat(utils::Join(lines, "\n"), "\n\n");
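What I am trying to achieve is roughly the round trip below. This is only a standalone sketch, not the real SyntaxNet code: SketchToken, its lemma member, ReadToken and WriteToken are all hypothetical, and UnderscoreIfEmpty is re-implemented locally because its body is not shown in the excerpt. FEATS is simply left as "_" to keep the sketch short.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Hypothetical stand-in for the Token message. The `lemma` member is an
// assumption made for this sketch only.
struct SketchToken {
  std::string word, lemma, category, tag, label;
  int head = -1;  // Zero-based index of the head token, -1 for the root.
};

// Local re-implementation for this sketch.
std::string UnderscoreIfEmpty(const std::string &s) {
  return s.empty() ? "_" : s;
}

// Reader side: copies the ten CoNLL fields into the token, including
// fields[2] (LEMMA), which the quoted reader skips.
SketchToken ReadToken(const std::vector<std::string> &fields) {
  SketchToken token;
  if (fields.size() < 10) return token;
  token.word = fields[1];
  if (fields[2] != "_") token.lemma = fields[2];
  token.category = fields[3];
  token.tag = fields[4];
  int head = 0;
  try {
    head = std::stoi(fields[6]);
  } catch (const std::exception &) {
    head = 0;  // Fall back to 0 on a non-numeric HEAD.
  }
  if (head > 0) token.head = head - 1;
  token.label = fields[7];
  return token;
}

// Writer side: emits the stored lemma instead of a hard-coded "_".
std::string WriteToken(int index, const SketchToken &token) {
  std::vector<std::string> fields(10, "_");
  fields[0] = std::to_string(index + 1);
  fields[1] = UnderscoreIfEmpty(token.word);
  fields[2] = UnderscoreIfEmpty(token.lemma);
  fields[3] = UnderscoreIfEmpty(token.category);
  fields[4] = UnderscoreIfEmpty(token.tag);
  fields[6] = std::to_string(token.head + 1);
  fields[7] = UnderscoreIfEmpty(token.label);
  std::string line;
  for (std::size_t i = 0; i < fields.size(); ++i) {
    if (i > 0) line += '\t';
    line += fields[i];
  }
  return line;
}
```

In the real code, the lemma would presumably need somewhere on the Token object to be stored between reading and writing, and the excerpts above do not show such a place. A round trip like this also only copies a lemma that is already present in the input; it does not produce one for raw text.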

I don't know why the lemma/stem is left out of the output even though it is present in the ud-treebanks files.
I tried several times to add it to the Token object, but without success.
Can anyone help me, please?
Thank you!
Best
