Models: Problem with using Parsey McParseface to read files.

Created on 5 Jul 2016 · 19 Comments · Source: tensorflow/models

I can use the pretrained parser with no issues. For example, typing in a sentence such as:

echo "I like playing tennis and reading books." | syntaxnet/demo.sh

provides a correct output of the different parts of speech in the sentence and such.

My goal now is to feed a text file (.txt extension) into the parser and have it POS tag the information within the file. Is there any way to do this?


All 19 comments

If you just want to do it from the shell, redirect the output to a file:
echo 'Hello This is Sparta' | syntaxnet/demo.sh >> 0095.txt

# here the file 0095.txt is created, storing (I think) the POS-tagged output

Satroan, thanks for the reply. However, that's not what I want. Suppose the file "0095.txt" already exists. I want to use that file as the input to Parsey McParseface (PMP): PMP should parse the contents of "0095.txt", and then I can send that output wherever I want. Does that make sense? Please let me know if you need further clarification!
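
[Editor's note] For what it's worth, if demo.sh is left with its default stdin input, an existing file can be fed in with plain shell redirection, with no context.pbtxt changes at all. A minimal sketch, assuming demo.sh keeps its default INPUT_FORMAT=stdin and writes to stdout:

```shell
# Assumption: demo.sh's default INPUT_FORMAT is stdin, so redirecting
# the file into it is equivalent to piping the text with echo.
syntaxnet/demo.sh < 0095.txt > 0095-parsed.conll
```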

@sumanyugupta
To do that, you will have to edit the 'context.pbtxt' file that 'demo.sh' points to.
Add the chunk below to context.pbtxt:

input {
  name: 'MAIN-IN'
  record_format: 'english-text'
  Part {
    file_pattern: '/path_to_0095.txt'
  }
}

You also need to make changes in 'demo.sh':

INPUT_FORMAT=MAIN-IN
--output=stdout-conll   # parser output

With these changes, the parser reads its input from the text file and prints the output to the terminal.
If you also want the output in a file, add another chunk to context.pbtxt specifying the path to the output file, and pass --output=MAIN-OUT.

Hope that solves your problem.
Thanks

@prakhar21
Thanks, Prakhar, for the reply.
I have attached a screenshot of my demo.sh file. My question is: where exactly should I put the following?

INPUT_FORMAT=MAIN-IN
--output=stdout-conll   # parser output

Looking at the picture, should it go after line 28? Also, once I place those lines in the file, what command should I run to feed Parsey the text file?

[screenshot: demo.sh, 6 Jul 2016 10:25 am]

@sumanyugupta On line 28, change the last INPUT_FORMAT in the row. The next modification would be on line 44, but that one is already set, so there is no need to change it.

Then run this: syntaxnet/demo.sh --input=MAIN-IN

It will take input from the file and print the tagged and parsed sentences to the terminal in CoNLL format.
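
[Editor's note] Concretely, the edited conditional on that line of demo.sh might look like this. A sketch only: your copy may differ, and the MAIN-IN name must match the input block added to context.pbtxt.

```shell
# Last INPUT_FORMAT in the row changed from stdin to MAIN-IN:
[[ "$1" == "--conll" ]] && INPUT_FORMAT=stdin-conll || INPUT_FORMAT=MAIN-IN
```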

Thanks

@prakhar21

./syntaxnet/proto_io.h:147] Check failed: input.record_format_size() == 1 (0 vs. 1)TextReader only
supports inputs with one record format: name: "main-in"
INFO:tensorflow:Total processed documents: 0

I did that and got the above message. Is it not reading the file correctly?

@sumanyugupta Are you sure you made the changes properly? These changes are enough to get this working.

Just verify your context.pbtxt against mine (posted above) once more, and your demo.sh as well. Watch out for case sensitivity, and use an absolute path in file_pattern.
The file should contain normal English text (one sentence per line). Try it with just one sentence first.
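
[Editor's note] A quick way to normalize a plain-text file into the one-sentence-per-line shape described above. This is a rough sketch, not a real sentence tokenizer: it assumes sentences end in '.', '!' or '?' followed by a space, and it relies on GNU sed accepting \n in the replacement. `raw.txt` is a hypothetical input file.

```shell
# Create a small sample, then split it into one sentence per line.
printf 'I like tennis. I read books! Do you?\n' > raw.txt
# GNU sed: insert a newline after ., ! or ? when followed by a space.
sed 's/\([.!?]\) /\1\n/g' raw.txt > one_per_line.txt
cat one_per_line.txt
```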

@prakhar21

Yes, I added this to the bottom of the context.pbtxt file:

input {
  name: 'main-in'
  record_format: 'english-text'
  Part {
    file_pattern: '/Users/guptas6/models/syntaxnet/sam.txt'
  }
}

The file I'm reading in has one sentence and is named "sam.txt"
The "demo.sh" file is the same as the picture above.

@sumanyugupta Maybe someone else can help you with this, because I got it working on my end.

Thanks

@shaxtell

I see from your issue #137 that you were having the same trouble I am. What did you do to get Parsey to read the file and POS-tag it? I can't figure out what your resolution was. Please let me know. Thanks!

@sumanyugupta you can see my context.pbtxt and demo.sh files on #137. Pay close attention to my --input and --output values (as they are not the same for both stages of the PARSE_EVAL).
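
[Editor's note] The point about the two stages can be sketched like this (the flag lists are elided; MAIN-IN, tagged_data, and tagged_data_parsed are example names that must each match an `input { ... }` entry in context.pbtxt):

```shell
# Stage 1 (tagger): read the text file, write tagged CoNLL to a file.
$PARSER_EVAL --input=MAIN-IN     --output=tagged_data ...
# Stage 2 (parser): read the tagger's output under a *different* name.
$PARSER_EVAL --input=tagged_data --output=tagged_data_parsed ...
```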

@shaxtell

Hey Shane, I've already done what you described in your answer, yet I still get the same error. I've attached pictures of my 'demo.sh' and 'context.pbtxt' files and of the error I'm getting. If you could take a look and see if I'm missing anything, please let me know. I appreciate it.
[screenshots, 7 Jul 2016: demo.sh, context.pbtxt, and the error output]

Also, on the third line of my demo.sh file, I've tried changing INPUT_FORMAT to stdin, as it was in your file, but that did not work either.

@sumanyugupta

It looks to me like there might be a problem with your input file. Is it formatted with one sentence per line (with punctuation at the end of each sentence)? Also, try giving your second output file a different name (something other than "tagged_data", e.g. "tagged_data_parsed").

I added this to context.pbtxt:

input {
  name: 'tagged_data_parsed'
  record_format: 'conll-sentence'
  Part {
    file_pattern: 'syntaxnet/sam-tagged-out.txt'
  }
}

And changed demo to this:
[screenshot: updated demo.sh, 7 Jul 2016]

Also, my text file, sam.txt, consists of just three sentences on three lines, each ending with a period. My other two text files, sam-tagged and sam-tagged-out, are both empty.
I still get the same error message.

Just curious, what OS are/were you using?

@sumanyugupta Hey man, I think the issue here is that you're modifying the wrong context.pbtxt. There are actually four of them within the syntaxnet directories. The path to the correct one is syntaxnet/syntaxnet/models/parsey_mcparseface/context.pbtxt. Let me know if that takes care of the issue. :)
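
[Editor's note] A quick way to list every copy and confirm which one you are editing:

```shell
# List every context.pbtxt under the current checkout; the copy under
# models/parsey_mcparseface is the one the pretrained demo uses.
find . -name context.pbtxt
```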

If the last piece of advice did not solve your problem, please let us know by reopening this issue. Otherwise we will assume it is working now.

@prakhar21 @shaxtell I am able to give a single file as input to syntaxnet. Could you please let me know if there is any way to give a folder of files as input? Thanks.

Does syntaxnet support processing multiple files? I can run it for a single file but not for multiple files.
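
[Editor's note] One way to handle multiple files, sketched here on the assumption that demo.sh reads stdin and writes CoNLL to stdout: drive it from a small shell loop. `batch_parse` and the directory names are hypothetical; point the third argument at your syntaxnet/demo.sh.

```shell
# Run a command over every .txt file in a directory, writing one
# output file per input file. batch_parse is a hypothetical helper;
# the default command assumes demo.sh reads stdin.
batch_parse() {
  in_dir=$1; out_dir=$2; cmd=${3:-syntaxnet/demo.sh}
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.txt; do
    [ -e "$f" ] || continue   # skip when the glob matched nothing
    "$cmd" < "$f" > "$out_dir/$(basename "$f" .txt).conll"
  done
}
# Example: batch_parse my_corpus parsed_out syntaxnet/demo.sh
```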

@slavpetrov It didn't work; I did the same as what is mentioned above.
Here is my demo.sh file:
PARSER_EVAL=bazel-bin/syntaxnet/parser_eval
MODEL_DIR=syntaxnet/models/parsey_mcparseface
[[ "$1" == "--conll" ]] && INPUT_FORMAT=stdin-conll || INPUT_FORMAT=stdin || INPUT_FORMAT=MAIN-IN

$PARSER_EVAL \
--input=$INPUT_FORMAT \
--output=stdout-conll //parser output\
--hidden_layer_sizes=64 \
--arg_prefix=brain_tagger \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/tagger-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr \
| \
$PARSER_EVAL \
--input=stdin-conll //parser output\
--output=stdout-conll \
--hidden_layer_sizes=512,512 \
--arg_prefix=brain_parser \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/parser-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr \
| \
bazel-bin/syntaxnet/conll2tree \
--task_context=$MODEL_DIR/context.pbtxt \
--alsologtostderr

Here is my context.pbtxt:

Parameter {
name: "brain_parser_embedding_dims"
value: "32;32;64"
}
Parameter {
name: "brain_parser_embedding_names"
value: "labels;tags;words"
}
Parameter {
name: 'brain_parser_scoring'
value: 'default'
}
Parameter {
name: "brain_parser_features"
value:
'stack.child(1).label '
'stack.child(1).sibling(-1).label '
'stack.child(-1).label '
'stack.child(-1).sibling(1).label '
'stack.child(2).label '
'stack.child(-2).label '
'stack(1).child(1).label '
'stack(1).child(1).sibling(-1).label '
'stack(1).child(-1).label '
'stack(1).child(-1).sibling(1).label '
'stack(1).child(2).label '
'stack(1).child(-2).label; '
'input.token.tag '
'input(1).token.tag '
'input(2).token.tag '
'input(3).token.tag '
'stack.token.tag '
'stack.child(1).token.tag '
'stack.child(1).sibling(-1).token.tag '
'stack.child(-1).token.tag '
'stack.child(-1).sibling(1).token.tag '
'stack.child(2).token.tag '
'stack.child(-2).token.tag '
'stack(1).token.tag '
'stack(1).child(1).token.tag '
'stack(1).child(1).sibling(-1).token.tag '
'stack(1).child(-1).token.tag '
'stack(1).child(-1).sibling(1).token.tag '
'stack(1).child(2).token.tag '
'stack(1).child(-2).token.tag '
'stack(2).token.tag '
'stack(3).token.tag; '
'input.token.word '
'input(1).token.word '
'input(2).token.word '
'input(3).token.word '
'stack.token.word '
'stack.child(1).token.word '
'stack.child(1).sibling(-1).token.word '
'stack.child(-1).token.word '
'stack.child(-1).sibling(1).token.word '
'stack.child(2).token.word '
'stack.child(-2).token.word '
'stack(1).token.word '
'stack(1).child(1).token.word '
'stack(1).child(1).sibling(-1).token.word '
'stack(1).child(-1).token.word '
'stack(1).child(-1).sibling(1).token.word '
'stack(1).child(2).token.word '
'stack(1).child(-2).token.word '
'stack(2).token.word '
'stack(3).token.word '
}
Parameter {
name: "brain_parser_transition_system"
value: "arc-standard"
}

Parameter {
name: "brain_tagger_embedding_dims"
value: "8;16;16;16;16;64"
}
Parameter {
name: "brain_tagger_embedding_names"
value: "other;prefix2;prefix3;suffix2;suffix3;words"
}
Parameter {
name: "brain_tagger_features"
value:
'input.digit '
'input.hyphen; '
'input.prefix(length="2") '
'input(1).prefix(length="2") '
'input(2).prefix(length="2") '
'input(3).prefix(length="2") '
'input(-1).prefix(length="2") '
'input(-2).prefix(length="2") '
'input(-3).prefix(length="2") '
'input(-4).prefix(length="2"); '
'input.prefix(length="3") '
'input(1).prefix(length="3") '
'input(2).prefix(length="3") '
'input(3).prefix(length="3") '
'input(-1).prefix(length="3") '
'input(-2).prefix(length="3") '
'input(-3).prefix(length="3") '
'input(-4).prefix(length="3"); '
'input.suffix(length="2") '
'input(1).suffix(length="2") '
'input(2).suffix(length="2") '
'input(3).suffix(length="2") '
'input(-1).suffix(length="2") '
'input(-2).suffix(length="2") '
'input(-3).suffix(length="2") '
'input(-4).suffix(length="2"); '
'input.suffix(length="3") '
'input(1).suffix(length="3") '
'input(2).suffix(length="3") '
'input(3).suffix(length="3") '
'input(-1).suffix(length="3") '
'input(-2).suffix(length="3") '
'input(-3).suffix(length="3") '
'input(-4).suffix(length="3"); '
'input.token.word '
'input(1).token.word '
'input(2).token.word '
'input(3).token.word '
'input(-1).token.word '
'input(-2).token.word '
'input(-3).token.word '
'input(-4).token.word '
}
Parameter {
name: "brain_tagger_transition_system"
value: "tagger"
}

input {
name: "tag-map"
Part {
file_pattern: "syntaxnet/models/parsey_mcparseface/tag-map"
}
}
input {
name: "tag-to-category"
Part {
file_pattern: "syntaxnet/models/parsey_mcparseface/fine-to-universal.map"
}
}
input {
name: "word-map"
Part {
file_pattern: "syntaxnet/models/parsey_mcparseface/word-map"
}
}
input {
name: "label-map"
Part {
file_pattern: "syntaxnet/models/parsey_mcparseface/label-map"
}
}
input {
name: "prefix-table"
Part {
file_pattern: "syntaxnet/models/parsey_mcparseface/prefix-table"
}
}
input {
name: "suffix-table"
Part {
file_pattern: "syntaxnet/models/parsey_mcparseface/suffix-table"
}
}
input {
name: 'MAIN-IN'
record_format: 'english-text'
Part {
file_pattern:'test/input/sample.txt'
}
}
input {
name: 'stdin'
record_format: 'english-text'
Part {
file_pattern: '-'
}
}
input {
name: 'stdin-conll'
record_format: 'conll-sentence'
Part {
file_pattern: '-'
}
}
input {
name: 'stdout-conll'
record_format: 'conll-sentence'
Part {
file_pattern: '-'
}
}

I ran this command:
root@38c6a725c0c2:~/models/syntaxnet# syntaxnet/demo.sh

This is my output:

I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack(1).child(2).label stack(1).child(-2).label; input.token.tag input(1).token.tag input(2).token.tag input(3).token.tag stack.token.tag stack.child(1).token.tag stack.child(1).sibling(-1).token.tag stack.child(-1).token.tag stack.child(-1).sibling(1).token.tag stack.child(2).token.tag stack.child(-2).token.tag stack(1).token.tag stack(1).child(1).token.tag stack(1).child(1).sibling(-1).token.tag stack(1).child(-1).token.tag stack(1).child(-1).sibling(1).token.tag stack(1).child(2).token.tag stack(1).child(-2).token.tag stack(2).token.tag stack(3).token.tag; input.token.word input(1).token.word input(2).token.word input(3).token.word stack.token.word stack.child(1).token.word stack.child(1).sibling(-1).token.word stack.child(-1).token.word stack.child(-1).sibling(1).token.word stack.child(2).token.word stack.child(-2).token.word stack(1).token.word stack(1).child(1).token.word stack(1).child(1).sibling(-1).token.word stack(1).child(-1).token.word stack(1).child(-1).sibling(1).token.word stack(1).child(2).token.word stack(1).child(-2).token.word stack(2).token.word stack(3).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;64
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [12 20 20] domain_sizes: [ 49 51 64038]
INFO:tensorflow:Created variable step:0 with shape () and init
INFO:tensorflow:Created variable embedding_matrix_0:0 with shape (49, 32) and init
INFO:tensorflow:Created variable embedding_matrix_1:0 with shape (51, 32) and init
INFO:tensorflow:Created variable embedding_matrix_2:0 with shape (64038, 64) and init
INFO:tensorflow:Created variable weights_0:0 with shape (2304, 512) and init
INFO:tensorflow:Created variable bias_0:0 with shape (512,) and init
I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.digit input.hyphen; input.prefix(length="2") input(1).prefix(length="2") input(2).prefix(length="2") input(3).prefix(length="2") input(-1).prefix(length="2") input(-2).prefix(length="2") input(-3).prefix(length="2") input(-4).prefix(length="2"); input.prefix(length="3") input(1).prefix(length="3") input(2).prefix(length="3") input(3).prefix(length="3") input(-1).prefix(length="3") input(-2).prefix(length="3") input(-3).prefix(length="3") input(-4).prefix(length="3"); input.suffix(length="2") input(1).suffix(length="2") input(2).suffix(length="2") input(3).suffix(length="2") input(-1).suffix(length="2") input(-2).suffix(length="2") input(-3).suffix(length="2") input(-4).suffix(length="2"); input.suffix(length="3") input(1).suffix(length="3") input(2).suffix(length="3") input(3).suffix(length="3") input(-1).suffix(length="3") input(-2).suffix(length="3") input(-3).suffix(length="3") input(-4).suffix(length="3"); input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: other;prefix2;prefix3;suffix2;suffix3;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 8;16;16;16;16;64
INFO:tensorflow:Created variable weights_1:0 with shape (512, 512) and init
INFO:tensorflow:Created variable bias_1:0 with shape (512,) and init
INFO:tensorflow:Created variable softmax_weight:0 with shape (512, 93) and init
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [2 8 8 8 8 8] domain_sizes: [ 5 10665 10665 8970 8970 64038]
INFO:tensorflow:Created variable step:0 with shape () and init
INFO:tensorflow:Created variable softmax_bias:0 with shape (93,) and init
INFO:tensorflow:Created variable embedding_matrix_0:0 with shape (5, 8) and init
INFO:tensorflow:Created variable embedding_matrix_1:0 with shape (10665, 16) and init
INFO:tensorflow:Created variable embedding_matrix_2:0 with shape (10665, 16) and init
INFO:tensorflow:Created variable embedding_matrix_3:0 with shape (8970, 16) and init
INFO:tensorflow:Created variable embedding_matrix_4:0 with shape (8970, 16) and init
INFO:tensorflow:Created variable embedding_matrix_5:0 with shape (64038, 64) and init
INFO:tensorflow:Created variable weights_0:0 with shape (1040, 64) and init
INFO:tensorflow:Created variable bias_0:0 with shape (64,) and init
INFO:tensorflow:Created variable softmax_weight:0 with shape (64, 49) and init
INFO:tensorflow:Created variable softmax_bias:0 with shape (49,) and init
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.digit input.hyphen; input.prefix(length="2") input(1).prefix(length="2") input(2).prefix(length="2") input(3).prefix(length="2") input(-1).prefix(length="2") input(-2).prefix(length="2") input(-3).prefix(length="2") input(-4).prefix(length="2"); input.prefix(length="3") input(1).prefix(length="3") input(2).prefix(length="3") input(3).prefix(length="3") input(-1).prefix(length="3") input(-2).prefix(length="3") input(-3).prefix(length="3") input(-4).prefix(length="3"); input.suffix(length="2") input(1).suffix(length="2") input(2).suffix(length="2") input(3).suffix(length="2") input(-1).suffix(length="2") input(-2).suffix(length="2") input(-3).suffix(length="2") input(-4).suffix(length="2"); input.suffix(length="3") input(1).suffix(length="3") input(2).suffix(length="3") input(3).suffix(length="3") input(-1).suffix(length="3") input(-2).suffix(length="3") input(-3).suffix(length="3") input(-4).suffix(length="3"); input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: other;prefix2;prefix3;suffix2;suffix3;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 8;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack(1).child(2).label stack(1).child(-2).label; input.token.tag input(1).token.tag input(2).token.tag input(3).token.tag stack.token.tag stack.child(1).token.tag stack.child(1).sibling(-1).token.tag stack.child(-1).token.tag stack.child(-1).sibling(1).token.tag stack.child(2).token.tag stack.child(-2).token.tag stack(1).token.tag stack(1).child(1).token.tag stack(1).child(1).sibling(-1).token.tag stack(1).child(-1).token.tag stack(1).child(-1).sibling(1).token.tag stack(1).child(2).token.tag stack(1).child(-2).token.tag stack(2).token.tag stack(3).token.tag; input.token.word input(1).token.word input(2).token.word input(3).token.word stack.token.word stack.child(1).token.word stack.child(1).sibling(-1).token.word stack.child(-1).token.word stack.child(-1).sibling(1).token.word stack.child(2).token.word stack.child(-2).token.word stack(1).token.word stack(1).child(1).token.word stack(1).child(1).sibling(-1).token.word stack(1).child(-1).token.word stack(1).child(-1).sibling(1).token.word stack(1).child(2).token.word stack(1).child(-2).token.word stack(2).token.word stack(3).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;64
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.

After this, it doesn't do anything.

Please let me know what the error is.
