Hi,
I would like to include the lemma/stem in the output for universal_models.
Now I have:
1 I _ PRON PRP _ 2 nsubj _ _
2 got _ VERB VBD _ 0 ROOT _ _
3 these _ DET DT _ 4 det _ _
4 t-shirts _ NOUN NNS _ 2 dobj _ _
but I would like to print the lemma/stem as well, like:
1 I I PRON PRP _ 2 nsubj _ _
2 got get VERB VBD _ 0 ROOT _ _
3 these this DET DT _ 4 det _ _
4 t-shirts t-shirt NOUN NNS _ 2 dobj _ _
How can I do that?
I saw that the file "syntaxnet/text_formats.cc" documents the format like this:
...
namespace syntaxnet {
// CoNLL document format reader for dependency annotated corpora.
// The expected format is described e.g. at http://ilk.uvt.nl/conll/#dataformat
//
// Data should adhere to the following rules:
// - Data files contain sentences separated by a blank line.
// - A sentence consists of one or tokens, each one starting on a new line.
// - A token consists of ten fields described in the table below.
// - Fields are separated by a single tab character.
// - All data files will contains these ten fields, although only the ID
// column is required to contain non-dummy (i.e. non-underscore) values.
// Data files should be UTF-8 encoded (Unicode).
//
// Fields:
// 1 ID: Token counter, starting at 1 for each new sentence and increasing
// by 1 for every new token.
// 2 FORM: Word form or punctuation symbol.
// 3 LEMMA: Lemma or stem.
// 4 CPOSTAG: Coarse-grained part-of-speech tag or category.
// 5 POSTAG: Fine-grained part-of-speech tag. Note that the same POS tag
// cannot appear with multiple coarse-grained POS tags.
// 6 FEATS: Unordered set of syntactic and/or morphological features.
// 7 HEAD: Head of the current token, which is either a value of ID or '0'.
// 8 DEPREL: Dependency relation to the HEAD.
// 9 PHEAD: Projective head of current token.
// 10 PDEPREL: Dependency relation to the PHEAD.
//
// This CoNLL reader is compatible with the CoNLL-U format described at
// http://universaldependencies.org/format.html
// Note that this reader skips CoNLL-U multiword tokens and ignores the last two
// fields of every line, which are PHEAD and PDEPREL in CoNLL format, but are
// replaced by DEPS and MISC in CoNLL-U.
//
...
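To make the column layout above concrete, here is a minimal standalone sketch (not SyntaxNet code) that splits one token line on tabs and picks out the LEMMA column. The helper names SplitConllLine and LemmaOf are made up for this illustration, and it assumes a well-formed line with ten non-empty, tab-separated fields.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one CoNLL token line on tab characters.
std::vector<std::string> SplitConllLine(const std::string &line) {
  std::vector<std::string> fields;
  std::string field;
  std::istringstream stream(line);
  while (std::getline(stream, field, '\t')) fields.push_back(field);
  return fields;
}

// Hypothetical helper: returns the LEMMA column (third field, index 2),
// or "_" when the line does not have exactly ten fields.
std::string LemmaOf(const std::string &line) {
  const std::vector<std::string> fields = SplitConllLine(line);
  return fields.size() == 10 ? fields[2] : "_";
}
```

For the line `2<TAB>got<TAB>get<TAB>VERB<TAB>VBD<TAB>_<TAB>0<TAB>ROOT<TAB>_<TAB>_`, LemmaOf returns the third column, "get".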
LEMMA is the third field (i.e. fields[2] in the code below), but it is never passed to the Token object:
...
// Get relevant fields.
const string &word = fields[1];
// <-- NOTE: fields[2] (LEMMA) is never read here
const string &cpostag = fields[3];
const string &tag = fields[4];
const string &attributes = fields[5];
const int head = utils::ParseUsing<int>(fields[6], 0, utils::ParseInt32);
const string &label = fields[7];
// Add token to sentence text.
if (!text.empty()) text.append(" ");
const int start = text.size();
const int end = start + word.size() - 1;
text.append(word);
// Add token to sentence.
Token *token = sentence->add_token();
token->set_word(word);
token->set_start(start);
token->set_end(end);
if (head > 0) token->set_head(head - 1);
if (!tag.empty()) token->set_tag(tag);
if (!cpostag.empty()) token->set_category(cpostag);
if (!label.empty()) token->set_label(label);
if (!attributes.empty()) AddMorphAttributes(attributes, token);
if (join_category_to_pos_) JoinCategoryToPos(token);
if (add_pos_as_attribute_) AddPosAsAttribute(token);
...
and then, when the output is written, the LEMMA column is hard-coded to "_":
...
for (int i = 0; i < sentence.token_size(); ++i) {
Token token = sentence.token(i);
if (join_category_to_pos_) SplitCategoryFromPos(&token);
if (add_pos_as_attribute_) RemovePosFromAttributes(&token);
vector<string> fields(10);
fields[0] = tensorflow::strings::Printf("%d", i + 1);
fields[1] = UnderscoreIfEmpty(token.word());
fields[2] = "_";  // <-- LEMMA is always written as "_"
fields[3] = UnderscoreIfEmpty(token.category());
fields[4] = UnderscoreIfEmpty(token.tag());
fields[5] = GetMorphAttributes(token);
fields[6] = tensorflow::strings::Printf("%d", token.head() + 1);
fields[7] = UnderscoreIfEmpty(token.label());
fields[8] = "_";
fields[9] = "_";
lines.push_back(utils::Join(fields, "\t"));
}
*value = tensorflow::strings::StrCat(utils::Join(lines, "\n"), "\n\n");
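For illustration, this is roughly the behaviour I am after on the writer side, as a standalone sketch rather than a patch. SimpleToken is a hypothetical stand-in for the real Token message, and its lemma member is an assumption: the code quoted above never calls a lemma accessor, so a real change would presumably also need somewhere to store the lemma on the token. FormatConllLine is likewise a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the Token message; `lemma` is an assumed field.
struct SimpleToken {
  std::string word;
  std::string lemma;     // assumption: not visible in the quoted code
  std::string category;  // coarse-grained POS (CPOSTAG)
  std::string tag;       // fine-grained POS (POSTAG)
  int head;              // zero-based head index, -1 for the root
  std::string label;     // dependency relation (DEPREL)
};

std::string UnderscoreIfEmpty(const std::string &s) {
  return s.empty() ? "_" : s;
}

// Formats one token as a ten-column, tab-separated CoNLL line.
// `index` is the zero-based position of the token in the sentence.
std::string FormatConllLine(int index, const SimpleToken &token) {
  std::vector<std::string> fields(10, "_");
  fields[0] = std::to_string(index + 1);
  fields[1] = UnderscoreIfEmpty(token.word);
  fields[2] = UnderscoreIfEmpty(token.lemma);  // instead of a fixed "_"
  fields[3] = UnderscoreIfEmpty(token.category);
  fields[4] = UnderscoreIfEmpty(token.tag);
  fields[6] = std::to_string(token.head + 1);
  fields[7] = UnderscoreIfEmpty(token.label);
  // fields[5], fields[8] and fields[9] stay "_" in this sketch.
  std::string line;
  for (std::size_t i = 0; i < fields.size(); ++i) {
    if (i > 0) line += '\t';
    line += fields[i];
  }
  return line;
}
```

With a token whose word is "got" and whose lemma is "get", the third column of the result is "get"; when the lemma is empty it falls back to "_", which matches the current output.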
I don't know why the lemma/stem is left out of the output even though it is present in the ud-treebanks files.
I tried several times to add it to the Token object, but without success.
Can anyone help me, please?
Thank you!
Best
This kind of usage question is best asked on Stack Overflow. GitHub issues are for bug reports and installation problems.
It sounds to me more like a bug than a "usage question".