Models: textsum: How to train against my own data?

Created on 30 Aug 2016  ·  36 Comments  ·  Source: tensorflow/models

textsum model

Within the data folder (https://github.com/tensorflow/models/tree/master/textsum/data) there are two files: data and vocab. Is the following correct: data contains the article text to be summarised, and vocab is a word count based on the Gigaword dataset? Therefore, to summarise my own data, do I just need to replace the content of the data file in /data/data? Or do I need to use the licensed Gigaword dataset in order to train against my own news articles?

awaiting model gardener

Most helpful comment

I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.

All 36 comments

I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.

See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.

@panyx0718 After checking data_convert_example.py in the #379 files, I don't know what the input text looks like. Is there a specific input format expected by the text_to_binary function?

Hi @chenwangliangguo
python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

The output file text_data shows the expected input text format.

@neufang thanks. Have you run the model on your own data successfully? How did you generate your vocab?

Hi, @aronayne
Can you give an example of the text data for "python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data"?
Thank you!

I used text data and got an error:
Traceback (most recent call last):
File "data_convert_example.py", line 67, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "data_convert_example.py", line 63, in main
_text_to_binary()
File "data_convert_example.py", line 49, in _text_to_binary
(k, v) = feature.split('=')
ValueError: too many values to unpack

The text data is:

reopens (ROME)
3rd Rd
Michael Stich (Germany x2) bt Karim Alami (Morocco) 7-6 (7/5), 6-4
jpb/rw94



reopens (BERLIN)
3rd Rd
Anke Huber (Germany x7) bt Katarina Maleeva (Bulgaria) 5-7, 6-4, 6-4
Elena Makarova (Russia) bt Barbara Rittner (Germany x15) 6-2, 6-1
Ann Grossman (USA) bt Gabriela Sabatini (Argentina x4) 6-3, 6-4
Brenda Schultz (Holland) bt Silke Meier (Germany) 6-2, 6-4
vog/rw94



Tributes pour in for late British Labour Party leader


UNDATED, May 12 (AFP)

........

Hi, @aronayne
The HTML tags can't be shown here.
My data is from LDC, but it is text data.

If you have an example of the text data, can you share the file? Thank you!

Also, why must textsum read a binary data file?

I ran it again.

python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

head data/text_data
publisher=AFP
abstract= sri lanka closes schools as war escalates .
article= the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . the government will make alternate arrangements to hold end of term examinations , officials said . earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . .


...

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
diff data/binary_data data/data

Thanks
Xin
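For context, the binary file consumed by textsum appears to be a sequence of length-prefixed serialized tf.Example protos, one per article/abstract pair. The following is a rough sketch of what the text_to_binary conversion above does, reconstructed from the tracebacks in this thread; the authoritative version is in data_convert_example.py and may differ in detail.

import struct

from tensorflow.core.example import example_pb2

def text_to_binary(in_file, out_file):
    # Each input line holds tab-separated key=value fields such as
    # article=..., abstract=..., publisher=... (see the head output above).
    with open(in_file) as reader, open(out_file, 'wb') as writer:
        for line in reader:
            tf_example = example_pb2.Example()
            for feature in line.strip().split('\t'):
                # Assumes no '=' appears inside the text itself; see the
                # split('=', 1) fix discussed later in this thread.
                (k, v) = feature.split('=')
                tf_example.features.feature[k].bytes_list.value.extend([v.encode('utf-8')])
            example_str = tf_example.SerializeToString()
            # Each record: an 8-byte length followed by the serialized tf.Example.
            writer.write(struct.pack('q', len(example_str)))
            writer.write(example_str)

binary_to_text simply reverses this: read 8 bytes, unpack the record length, read that many bytes, and parse them with example_pb2.Example.FromString.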

@panyx0718
Thank you for your response.
I got the text data format from data/data and made the binary data from my own data.
I used more than 2.4 million Chinese Weibo posts for training, but I can't get a good result.
I have two questions, can you help?
1. The training data uses the tags "d", "p" and "s", with "s" marking sentences. Must every sentence in the training data be wrapped in "s" tags? If my own data has no sentence boundaries, can I wrap a whole article in a single pair of "s" tags, so that each article is treated as one sentence? Is that OK?
I made the vocab file from my own training data by counting word frequencies in a dict. But how should the "UNK" frequency be set? Can you explain how to build the vocab file?

2. I added the "s" tags for all sentences and trained the model with the default parameters, but the decoder only produces similar words for all articles. How should I test the model? Can you give more information?
Thank you.

I have the same problem as you, @UB1010: all the results generated by decode are the same.
I used more than one million Chinese news articles for training, but the result is very bad, like:
decode output=不 就 的 就
Have you solved the problem?

@doumoxiao
I haven't solved the problem. Someone said it needs a long time to train, but I trained the model on a CPU for more than one week and the decode results are still bad.

@UB1010
I tried with a vocab of 100K words and 0.3 million articles with corresponding abstracts, trained for 5 days on a single Tesla K40, and increased the beam size from 4 to 20. The results became more reasonable than the first rounds, but they still do not match the meaning of the test data, for example:
INFO:tensorflow:article: 全球 九大 警用 步枪
INFO:tensorflow:abstract: 世界 各国 步枪
INFO:tensorflow:decoded: 世界 客
INFO:tensorflow:article: 进化 计算 的 理论 和 方法 【 图片 价格 品牌 报价 】 - 京东 商城
INFO:tensorflow:abstract: 自动 化 书籍 图片
INFO:tensorflow:decoded: 吊顶 装修 亮化

@hphp
Yes, my results are similar.
My data is more than 2.4 million Weibo posts with abstracts, beam_size = 4.
I trained the model for 14 days on a CPU server.
The results are better, but they still can't be used. Example output:

origin_articles[i]: 本期 节目 的 嘉宾 , 我们 邀请 到 的 是 百度 的 创始人 , 董事长 兼 首席 执行官 李彦宏 先生 , 以及 通讯 和 软件 领域 世界级 的 科学家 , 也 是 新近 加盟 百度 的 总裁 博士 。 作为 站 在 中国 互联网 前沿 的 IT 人士 , 他们 会 如何 看待 迅猛发展 大 数据 科技 呢 ?
origin_abstracts[i]: 杨澜 访谈录 李彦宏 、 张亚勤 : 云端 的 大 数据
decoded_output words: 网络营销 要 要 要 什么

origin_articles[i]: 中秋节 当天 , 有 网友 爆出 刘翔 现身 上海 普陀 婚姻登记 中心 的 照片 。 新民晚报 记者 独家 拨通 刘翔 父亲 的 手机 , 翔 爸爸 告诉 记者 : “ 具体 等 明天 ( 9 月 9 日 ) 领证 了 再说 。 ” 据悉 , 两 人 是 在 伦敦 奥运会 后 认识 的 , 感情 很 好 。
origin_abstracts[i]: 新民晚报 独家 消息 : 刘翔 今天 领证 结婚 啦
decoded_output words: 湖南 一 “ 幼儿园 ” !

origin_articles[i]: 此次 公务员 工资 试点 改革 , 实行 不 升职 但 可 升级 , 只要 工作 的 时间 够 长 , 工资 就 能 逐级 上涨 。 但 对于 公务员 工资改革 , 没有 社会 共识 才 是 最 根本 的 问题 , “ 政府 不 缺钱 , 但 老百姓 不 同意 。 ” 下
origin_abstracts[i]: 公务员 不 升职 也 能涨 薪 工资 可比 局长 高
decoded_output words: 玛客 玛客 的 “ 热门 ” ”

I think the RNN is very inefficient; you can try it yourself, but it takes a long time.

How many steps have you trained? We used 10 machines, each with 4 GPUs, and trained for a week.
I also noticed that many words in the abstract do not appear in the original article.

@panyx0718
Hi panyx, my training steps so far:

summary write, step: 43300
running_avg_loss: 3.681036
running_avg_loss: 4.397063
running_avg_loss: 5.468384
.........

How many steps did you train?

Do you think I need to adjust some parameters for Chinese Weibo data?

We trained a few million steps. 43k is too small.


Thanks
Xin

@panyx0718 thank you. Given enough data and computing power, let's say 50 million Chinese articles and one week of training on 10 machines with 4 GPUs each, can we expect decent results from the textsum model?

It depends on the quality of your data. Also, I haven't tried such a large dataset (50M) before.


Thanks
Xin

@hphp @UB1010 Hi, I know you may have used the data_convert_example.py file to make the training data, but how did you get the vocab? Could you share your code for generating the vocab? Thanks

@SawyerW The vocab file is simply every word in the dataset you are using, with a count next to it for the number of times it has shown up across all the data files. I have seen some datasets take only the top 200K words or something along those lines, just to reduce the number of words in the file. This is completely up to you though.

@xtr33me Yes, you are right, but when I used the vocab file I created myself, some errors happened that were hard to track down. When I used the vocab file in /textsum/data to train on my own data, it worked, even though it could not produce the right answers. So I wonder if you have your own code to create the vocab file? Maybe you also used your own code to transform the training data into binary data.

@SawyerW The code I used to create my vocab file is pretty simple. You can get the gist of it below. Add it to a function or inline it in some other processor you have. It's of course important that you run this against your decoded data and not the dataset that has been converted to binary. I'm sure there is a better way of doing this, but it worked for me. The other important key is that your input data is as clean as possible; that can usually be handled in your web scraper.

from collections import Counter

# Count every whitespace-separated token in the plain-text (decoded) data and
# write one "word count" line per token to the vocab file.
with open("datadecoded") as datainput, open("vocab", "w") as v:
    wordcount = Counter(datainput.read().split())
    for word, count in wordcount.items():
        v.write("{} {}\n".format(word, count))
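Note that the textsum data reader also expects a few special tokens (for example <UNK>, <PAD>, <s> and </s>) to be present in the vocab file, so if training complains about a missing token, append a line for it with an arbitrary count. Check data.py in the textsum directory for the exact token names it requires.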

@xtr33me Thanks, I found the problem: the dataset was the issue, and it's already fixed.

@hphp @UB1010 @panyx0718 After reading all of the above, I still have some questions. First, do I have to transform my raw data into the text_data form with the <d><p><s> tags? Second, how about using gensim to generate the vocab? Thanks a lot!

I am working on a search engine. My inventory is movie names, actors' names, etc.
My dataset is: search_term and target_word (click) (from the inventory).
My question is that we only have one-word, or maybe two- or three-word, searches. I want predictions from 2-3 characters of input, so is textsum useful for me?

Hi all...
I created my own data and now I want to convert it to a binary file. When I run this:

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data

I encounter this error:

Traceback (most recent call last):
File "data_convert_example.py", line 65, in
tf.app.run()
File "/home/az/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "data_convert_example.py", line 61, in main
_text_to_binary()
File "data_convert_example.py", line 47, in _text_to_binary
(k, v) = feature.split('=')
ValueError: need more than 1 value to unpack

How can I fix this? What does it mean exactly?

@xtr33me @panyx0718
I want to train the textsum model on the Insight (mlg.ucd.ie/datasets/bbc.html) dataset.

I ran data_convert_example.py on the toy dataset and got this result:

abstract= sri lanka closes schools as war escalates .
article= the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . the government will make alternate arrangements to hold end of term examinations , officials said . earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . .
publisher=AFP

Can you please help me by providing a script, or some hints for writing one, to convert my training data to the abstract= ..., article= ...., publisher= .... format?

@vdevmcitylp feel free to check out my GitHub link below. I added some formatting scripts some time back; I haven't touched this code in a while, but I know it was working. Essentially I scraped articles, then had to format them for parsing by the referenced data_convert_example.py. Hope it helps some.

https://github.com/tensorflow/models/tree/ef9c156ca7802a5e60018fb0cc7d950ea54569de/textsum
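For anyone who only wants the general idea rather than the full scraping scripts, here is a rough, hypothetical sketch of that formatting step. It assumes the layout discussed above (<d>/<p>/<s> wrappers around lower-cased, tokenised sentences, and tab-separated article=/abstract=/publisher= fields); double-check the exact layout against the toy data and data_convert_example.py before relying on it.

# Hypothetical helper: turn one (article_sentences, abstract_sentences) pair into
# a single text_data line that data_convert_example.py can convert to binary.
def make_example_line(article_sentences, abstract_sentences, publisher='AFP'):
    # Wrap a list of sentences in the <d> <p> <s> ... </s> </p> </d> tags.
    def tag(sentences):
        body = ' '.join('<s> {} </s>'.format(s.strip().lower()) for s in sentences)
        return '<d> <p> {} </p> </d>'.format(body)

    # One tab-separated text_data line per document.
    return '\t'.join([
        'article=' + tag(article_sentences),
        'abstract=' + tag(abstract_sentences),
        'publisher=' + publisher,
    ])

with open('text_data', 'w') as out:
    out.write(make_example_line(
        ['first sentence of the article .', 'second sentence .'],
        ['a one sentence summary .']) + '\n')

The resulting text_data file can then be fed to data_convert_example.py --command text_to_binary.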

@xtr33me
It works!
I can't thank you enough for this.

@Ali-Zareie I have the same issue, did you solve it?

@Ali-Zareie @minaSamizade This is because there is an "=" in your text. Try (k, v) = feature.split('=', 1), i.e. add the parameter 1 so the string is split only at the first "=".
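Put differently, here is a small self-contained sketch of how the patched line fits in (the surrounding loop is paraphrased from the tracebacks above, not copied from the repository):

def parse_line(line):
    # Parse one tab-separated line of key=value fields from text_data.
    features = {}
    for feature in line.strip().split('\t'):
        # Splitting only on the first '=' fixes 'too many values to unpack',
        # which occurs when the article or abstract text itself contains '='.
        # The other error above, 'need more than 1 value to unpack', instead
        # means a field contained no '=' at all (e.g. an empty or malformed
        # line), which has to be fixed in the input data.
        (k, v) = feature.split('=', 1)
        features[k] = v
    return features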

Looks like the original issue was resolved.

@panyx0718 what was the configuration of the GPUs used in the experiment with 10 machines with 4 GPUs each?

@hphp @UB1010

I have been using this code recently and am also getting bad results; the decode output does not match the abstract.
I want to know whether you have solved this problem, and how. Thank you.
I want to know have you solved this problem? and how to do it ? Thank you.

Abstract = "78 亿 主力 资金 近 三日 大量 撤出 中小 创"
Decode = "储备面 早盘 银行"

@wengenihaoshuai I saw similar results when I didn't have enough source articles. Could this be the issue by chance? In the end I scraped around 1.3 million articles and after cleaning and filtering, I was left with about 900k articles that I was able to train on. When using this many, it was the first time I was happy with the results. Early on I had tried 40k and 200k articles and just wasn't happy with the results at all. Unsure if that is your problem, but it's something to look at.

When I try to train on my own dataset with the official vocab, there is no problem. However, once I train with my own vocab, "ValueError: Duplicated word: four." happens. Does anybody know why?
