Models: textsum: How to train against my own data?

Created on 30 Aug 2016  ·  36 Comments  ·  Source: tensorflow/models

textsum model

Within the data folder (https://github.com/tensorflow/models/tree/master/textsum/data) there are two files: data and vocab. Is the following correct: data contains the article text to be summarised, and vocab is a word count based on the Gigaword dataset? Therefore, to summarise my own data, do I just need to replace the content of the data file in /data/data? Or do I need to use the licensed Gigaword dataset in order to train against my own news articles?

awaiting model gardener

Most helpful comment

I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.

All 36 comments

I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.

See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.

@panyx0718 After checking data_convert_example.py in the #379 files, I don't know what the input text looks like. Is there a specific input format expected by the text_to_binary function?

Hi @chenwangliangguo
python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

The output file text_data shows the expected input text format.

@neufang thanks. Have you run the model on your own data successfully? How did you generate your vocab?

Hi, @aronayne
Can you give an example of the text data for "python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data"?
Thank you!

I used text data and got an error:
Traceback (most recent call last):
File "data_convert_example.py", line 67, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "data_convert_example.py", line 63, in main
_text_to_binary()
File "data_convert_example.py", line 49, in _text_to_binary
(k, v) = feature.split('=')
ValueError: too many values to unpack

The text data is:

reopens (ROME)
3rd Rd
Michael Stich (Germany x2) bt Karim Alami (Morocco) 7-6 (7/5), 6-4
jpb/rw94



reopens (BERLIN)
3rd Rd
Anke Huber (Germany x7) bt Katarina Maleeva (Bulgaria) 5-7, 6-4, 6-4
Elena Makarova (Russia) bt Barbara Rittner (Germany x15) 6-2, 6-1
Ann Grossman (USA) bt Gabriela Sabatini (Argentina x4) 6-3, 6-4
Brenda Schultz (Holland) bt Silke Meier (Germany) 6-2, 6-4
vog/rw94



Tributes pour in for late British Labour Party leader


UNDATED, May 12 (AFP)

........

Hi, @aronayne
The HTML tags can't be shown here.
My data is from LDC, but it is text data.

If you have an example of the text data, can you share the file? Thank you!

Also, why must textsum read a binary data file?

I ran it again.

python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

head data/text_data
publisher=AFP
abstract= sri lanka closes schools as war escalates .
article= the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . the government will make alternate arrangements to hold end of term examinations , officials said . earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . .


...

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
diff data/binary_data data/data

Thanks
Xin
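For context, the binary file consumed by textsum appears to be a sequence of length-prefixed serialized tf.Example protos, one per article/abstract pair. The following is a rough sketch of what the text_to_binary conversion above does, reconstructed from the tracebacks in this thread; the authoritative version is in data_convert_example.py and may differ in detail.

import struct

from tensorflow.core.example import example_pb2

def text_to_binary(in_file, out_file):
    # Each input line holds tab-separated key=value fields such as
    # article=..., abstract=..., publisher=... (see the head output above).
    with open(in_file) as reader, open(out_file, 'wb') as writer:
        for line in reader:
            tf_example = example_pb2.Example()
            for feature in line.strip().split('\t'):
                # Assumes no '=' appears inside the text itself; see the
                # split('=', 1) fix discussed later in this thread.
                (k, v) = feature.split('=')
                tf_example.features.feature[k].bytes_list.value.extend([v.encode('utf-8')])
            example_str = tf_example.SerializeToString()
            # Each record: an 8-byte length followed by the serialized tf.Example.
            writer.write(struct.pack('q', len(example_str)))
            writer.write(example_str)

binary_to_text simply reverses this: read 8 bytes, unpack the record length, read that many bytes, and parse them with example_pb2.Example.FromString.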

@panyx0718
Thank you for your response.
I got the text data format from data/data and made the binary data from my own data.
I used more than 2.4 million Chinese Weibo posts for training, but I can't get a good result.
I have two questions, can you help?
1. The training data uses the tags "d", "p" and "s", with "s" marking sentences. Must every sentence in the training data be wrapped in "s" tags? If my own data has no sentence boundaries, can I wrap a whole article in a single pair of "s" tags, so that each article is treated as one sentence? Is that OK?
I made the vocab file from my own training data by counting word frequencies in a dict. But how should the "UNK" frequency be set? Can you explain how to build the vocab file?

2. I added the "s" tags for all sentences and trained the model with the default parameters, but the decoder only produces similar words for all articles. How should I test the model? Can you give more information?
Thank you.

I have the same problem as you, @UB1010: all the results generated by decode are the same.
I used more than one million Chinese news articles for training, but the result is very bad, like:
decode output=不 就 的 就
Have you solved the problem?

@doumoxiao
I haven't solved the problem. Someone said it needs a long time to train, but I trained the model on a CPU for more than one week and the decode results are still bad.

@UB1010
I tried with a vocab of 100K words and 0.3 million articles with corresponding abstracts, trained for 5 days on a single Tesla K40, and increased the beam size from 4 to 20. The results became more reasonable than the first rounds, but they still do not match the meaning of the test data, for example:
INFO:tensorflow:article: 全球 九大 警用 步枪
INFO:tensorflow:abstract: 世界 各国 步枪
INFO:tensorflow:decoded: 世界 客
INFO:tensorflow:article: 进化 计算 的 理论 和 方法 【 图片 价格 品牌 报价 】 - 京东 商城
INFO:tensorflow:abstract: 自动 化 书籍 图片
INFO:tensorflow:decoded: 吊顶 装修 亮化

@hphp
Yes, my results are similar.
My data is more than 2.4 million Weibo posts with abstracts, beam_size = 4.
I trained the model for 14 days on a CPU server.
The results are better, but they still can't be used. Example output:

origin_articles[i]: 本期 节目 的 嘉宾 , 我们 邀请 到 的 是 百度 的 创始人 , 董事长 兼 首席 执行官 李彦宏 先生 , 以及 通讯 和 软件 领域 世界级 的 科学家 , 也 是 新近 加盟 百度 的 总裁 博士 。 作为 站 在 中国 互联网 前沿 的 IT 人士 , 他们 会 如何 看待 迅猛发展 大 数据 科技 呢 ?
origin_abstracts[i]: 杨澜 访谈录 李彦宏 、 张亚勤 : 云端 的 大 数据
decoded_output words: 网络营销 要 要 要 什么

origin_articles[i]: 中秋节 当天 , 有 网友 爆出 刘翔 现身 上海 普陀 婚姻登记 中心 的 照片 。 新民晚报 记者 独家 拨通 刘翔 父亲 的 手机 , 翔 爸爸 告诉 记者 : “ 具体 等 明天 ( 9 月 9 日 ) 领证 了 再说 。 ” 据悉 , 两 人 是 在 伦敦 奥运会 后 认识 的 , 感情 很 好 。
origin_abstracts[i]: 新民晚报 独家 消息 : 刘翔 今天 领证 结婚 啦
decoded_output words: 湖南 一 “ 幼儿园 ” !

origin_articles[i]: 此次 公务员 工资 试点 改革 , 实行 不 升职 但 可 升级 , 只要 工作 的 时间 够 长 , 工资 就 能 逐级 上涨 。 但 对于 公务员 工资改革 , 没有 社会 共识 才 是 最 根本 的 问题 , “ 政府 不 缺钱 , 但 老百姓 不 同意 。 ” 下
origin_abstracts[i]: 公务员 不 升职 也 能涨 薪 工资 可比 局长 高
decoded_output words: 玛客 玛客 的 “ 热门 ” ”

I think the RNN is very inefficient; you can try it yourself, but it takes a long time.

How many steps have you trained? We used 10 machines, each with 4 GPUs, and trained for a week.
I also noticed that many words in the abstract do not appear in the original article.

@panyx0718
Hi panyx, my training steps so far:

summary write, step: 43300
running_avg_loss: 3.681036
running_avg_loss: 4.397063
running_avg_loss: 5.468384
.........

How many steps did you train?

Do you think I need to adjust some parameters for Chinese Weibo data?

We trained a few million steps. 43k is too small.


Thanks
Xin

@panyx0718 thank you. Given enough data and computing power, let's say 50 million Chinese articles and one week of training on 10 machines with 4 GPUs each, can we expect decent results from the textsum model?

It depends on the quality of your data. Also, I haven't tried such a large dataset (50M) before.


Thanks
Xin

@hphp @UB1010 Hi, I know you may have used the data_convert_example.py file to make the training data, but how did you get the vocab? Could you share your code for generating the vocab? Thanks

@SawyerW The vocab file is simply every word in the dataset you are using, with a count next to it for the number of times it has shown up across all the data files. I have seen some datasets take only the top 200K words or something along those lines, just to reduce the number of words in the file. This is completely up to you though.

@xtr33me Yes, you are right, but when I used the vocab file I created myself, some errors happened that were hard to track down. When I used the vocab file in /textsum/data to train on my own data, it worked, even though it could not produce the right answers. So I wonder if you have your own code to create the vocab file? Maybe you also used your own code to transform the training data into binary data.

@SawyerW The code I used to create my vocab file is pretty simple. You can get the gist of it below. Add it to a function or inline it in some other processor you have. It's of course important that you run this against your decoded data and not the dataset that has been converted to binary. I'm sure there is a better way of doing this, but it worked for me. The other important key is that your input data is as clean as possible; that can usually be handled in your web scraper.

from collections import Counter

# Count every whitespace-separated token in the plain-text (decoded) data and
# write one "word count" line per token to the vocab file.
with open("datadecoded") as datainput, open("vocab", "w") as v:
    wordcount = Counter(datainput.read().split())
    for word, count in wordcount.items():
        v.write("{} {}\n".format(word, count))
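Note that the textsum data reader also expects a few special tokens (for example <UNK>, <PAD>, <s> and </s>) to be present in the vocab file, so if training complains about a missing token, append a line for it with an arbitrary count. Check data.py in the textsum directory for the exact token names it requires.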

@xtr33me Thanks, I found the problem: the dataset was the issue, and it's already fixed.

@hphp @UB1010 @panyx0718 After reading all of the above, I still have some questions. First, do I have to transform my raw data into the text_data form with the <d><p><s> tags? Second, how about using gensim to generate the vocab? Thanks a lot!

I am working on a search engine. My inventory is movie names, actors' names, etc.
My dataset is: search_term and target_word (click) (from the inventory).
My question is that we only have one-word, or maybe two- or three-word, searches. I want predictions from 2-3 characters of input, so is textsum useful for me?

Hi all...
I created my own data and now I want to convert it to a binary file. When I run this:

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data

I encounter this error:

Traceback (most recent call last):
File "data_convert_example.py", line 65, in
tf.app.run()
File "/home/az/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "data_convert_example.py", line 61, in main
_text_to_binary()
File "data_convert_example.py", line 47, in _text_to_binary
(k, v) = feature.split('=')
ValueError: need more than 1 value to unpack

How can I fix this? What does it mean exactly?

@xtr33me @panyx0718
I want to train the textsum model on the Insight (mlg.ucd.ie/datasets/bbc.html) dataset.

I ran data_convert_example.py on the toy dataset and got this result:

abstract= sri lanka closes schools as war escalates .
article= the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . the government will make alternate arrangements to hold end of term examinations , officials said . earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . .
publisher=AFP

Can you please help me by providing a script, or some hints for writing one, to convert my training data to the abstract= ..., article= ...., publisher= .... format?

@vdevmcitylp feel free to check out my GitHub link below. I added some formatting scripts some time back; I haven't touched this code in a while, but I know it was working. Essentially I scraped articles, then had to format them for parsing by the referenced data_convert_example.py. Hope it helps some.

https://github.com/tensorflow/models/tree/ef9c156ca7802a5e60018fb0cc7d950ea54569de/textsum
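For anyone who only wants the general idea rather than the full scraping scripts, here is a rough, hypothetical sketch of that formatting step. It assumes the layout discussed above (<d>/<p>/<s> wrappers around lower-cased, tokenised sentences, and tab-separated article=/abstract=/publisher= fields); double-check the exact layout against the toy data and data_convert_example.py before relying on it.

# Hypothetical helper: turn one (article_sentences, abstract_sentences) pair into
# a single text_data line that data_convert_example.py can convert to binary.
def make_example_line(article_sentences, abstract_sentences, publisher='AFP'):
    # Wrap a list of sentences in the <d> <p> <s> ... </s> </p> </d> tags.
    def tag(sentences):
        body = ' '.join('<s> {} </s>'.format(s.strip().lower()) for s in sentences)
        return '<d> <p> {} </p> </d>'.format(body)

    # One tab-separated text_data line per document.
    return '\t'.join([
        'article=' + tag(article_sentences),
        'abstract=' + tag(abstract_sentences),
        'publisher=' + publisher,
    ])

with open('text_data', 'w') as out:
    out.write(make_example_line(
        ['first sentence of the article .', 'second sentence .'],
        ['a one sentence summary .']) + '\n')

The resulting text_data file can then be fed to data_convert_example.py --command text_to_binary.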

@xtr33me
It works!
I can't thank you enough for this.

@Ali-Zareie I have the same issue, did you solve it?

@Ali-Zareie @minaSamizade This is because there is an "=" in your text. Try (k, v) = feature.split('=', 1), i.e. add the parameter 1 so the string is split only at the first "=".
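Put differently, here is a small self-contained sketch of how the patched line fits in (the surrounding loop is paraphrased from the tracebacks above, not copied from the repository):

def parse_line(line):
    # Parse one tab-separated line of key=value fields from text_data.
    features = {}
    for feature in line.strip().split('\t'):
        # Splitting only on the first '=' fixes 'too many values to unpack',
        # which occurs when the article or abstract text itself contains '='.
        # The other error above, 'need more than 1 value to unpack', instead
        # means a field contained no '=' at all (e.g. an empty or malformed
        # line), which has to be fixed in the input data.
        (k, v) = feature.split('=', 1)
        features[k] = v
    return features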

Looks like the original issue was resolved.

@panyx0718 what was the configuration of the GPUs used in the experiment with 10 machines with 4 GPUs each?

@hphp @UB1010

I have been using this code recently and am also getting bad results; the decode output does not match the abstract.
I want to know whether you have solved this problem, and how. Thank you.
I want to know have you solved this problem? and how to do it ? Thank you.

Abstract = "78 亿 主力 资金 近 三日 大量 撤出 中小 创"
Decode = "储备面 早盘 银行"

@wengenihaoshuai I saw similar results when I didn't have enough source articles. Could this be the issue by chance? In the end I scraped around 1.3 million articles and after cleaning and filtering, I was left with about 900k articles that I was able to train on. When using this many, it was the first time I was happy with the results. Early on I had tried 40k and 200k articles and just wasn't happy with the results at all. Unsure if that is your problem, but it's something to look at.

When I try to train on my own dataset with the official vocab, there is no problem. However, once I train with my own vocab, "ValueError: Duplicated word: four." happens. Does anybody know why?
