Models: Textsum: How to generate the vocab file from the original data, and what's the format of the test data?

Created on 8 Nov 2016 · 3 comments · Source: tensorflow/models

I am trying to use textsum these days and have run into a couple of questions:

  1. How do I generate the vocab file from the original data file? Could you please open-source the code for this?
  2. What's the format of the test data? Does it still need to be a binary file?
     I can see the format of the training data from the toy "data" example, but the format of the test data isn't documented.
     Thank you.


All 3 comments

@licangqiong
Please skim through the open and closed issues first to see if your question has already been asked before opening a new one here. Both of these questions have already been answered. Regarding your first question, the vocab file is simply a collection of all the words in your dataset, each with an associated count of how often it occurs in the overall corpus. Some people use this count to reduce the size of the vocab to something like the top 200k words, but this is not necessary. So once you have the corpus you will be training against, simply write a script that adds each new word if it isn't there, or increments the counter next to it if it is. Then write this out to a file and you have your vocab file. I personally would also recommend sorting by count before writing the file, but that is totally up to you and will not affect your final results if you don't.
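The counting script described above can be sketched as follows. This is a minimal example, not the repo's own tooling; the file paths and the 200k cap are illustrative assumptions:

```python
# Hypothetical sketch of building a textsum-style vocab file:
# count every whitespace-separated token in the corpus, then write
# one "word count" pair per line, most frequent first.
from collections import Counter

def build_vocab(corpus_path, vocab_path, max_words=200000):
    counts = Counter()
    with open(corpus_path) as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w") as out:
        # Sorting by count is optional, as noted above, but makes
        # the file easier to inspect and truncate.
        for word, count in counts.most_common(max_words):
            out.write("%s %d\n" % (word, count))
```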

To answer your second question, please refer to data_convert_example.py. You can find a little more here: https://github.com/tensorflow/models/issues/373

It will allow you to perform a text-to-binary or binary-to-text conversion. If you take the toy dataset and perform the binary-to-text conversion on it, you will get the formatting you are looking for.
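For a rough idea of what the binary side of that conversion looks like, here is a hedged sketch of length-prefixed record framing of the kind `data_convert_example.py` uses: each record is an 8-byte integer length followed by that many payload bytes. In the real pipeline the payload is a serialized `tf.Example` proto holding the article and abstract features; plain bytes stand in here, and the little-endian byte order is an assumption of this sketch:

```python
# Sketch of length-prefixed record framing (assumed little-endian).
# Each record: 8-byte length header, then `length` payload bytes.
import struct

def write_records(path, payloads):
    with open(path, "wb") as f:
        for p in payloads:
            f.write(struct.pack("<q", len(p)))
            f.write(p)

def read_records(path):
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<q", header)
            records.append(f.read(length))
    return records
```

Running the binary-to-text direction on the toy dataset, as suggested above, is still the most reliable way to see the exact feature layout your version of the code expects.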

Hope this helps. Please close this ticket when you get the chance.

@xtr33me Thanks a lot.

Hi @xtr33me ,

I'm still confused about the vocab generation. I used your modified version of the vocab file (posted in a previous issue), but I got "InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [11654,128] rhs shape= [10003,128]".
The vocab shouldn't depend on the text data, is that correct?

Thanks.

