Hello everyone. I'm familiar with Fairseq for translation, but so far I haven't used it for language modeling. In translation, each line is treated as the basic unit, and source lines must be aligned with target lines. I don't quite understand what the basic units are in language modeling.
I was following some examples, like the one for training RoBERTa from scratch with your own data (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md). It says "Data should be preprocessed following the language modeling format.", but I haven't found a specification of this format.
I have downloaded the wikitext dataset and the training set file starts with:
= Valkyria Chronicles III =
Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 .
= = Gameplay = =
I infer that the format is something like:

`\n = < Document heading > = \n`

Is this what is meant by the language modeling format? Are headings ignored? Also, in TokenBlockDataset, I understand that text is treated as a 1D stream of data. If that is the case, and I have a set of different documents concatenated following the said format, will text from different documents be mixed? There is the argument '--sample-break-mode' with options '{none,complete,complete_doc,eos}', and I don't understand the exact meaning of the optional 'document_sep_len':

'document_sep_len (int, optional): document separator size (required for 'complete_doc' break mode). Typically 1 if the sentences have eos and 0 otherwise.'

To sum up, and please excuse me for the long message: suppose that I have a set of documents and I don't want them to be mixed in the same samples (i.e. text from one document should not be used to predict text in another). What would be the way to go? Finally, if I do want some documents to be mixed (i.e. using information from each other as context), would it be a good idea to put each of these documents after its respective title as a sub-heading, within the same heading? Many thanks in advance.
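To make my assumption concrete, here is a toy Python sketch (not fairseq's actual code) of the layout I am inferring: each document starts with an `= Heading =` line, body lines follow, and documents are separated by a blank line. The corpus text below is made up for illustration.

```python
# Toy corpus in the layout I am assuming: heading line, body lines,
# documents separated by one blank line.
corpus = (
    "= Valkyria Chronicles III =\n"
    "Senjo no Valkyria 3 is a tactical role-playing game .\n"
    "\n"
    "= Another Article =\n"
    "Text of the second document .\n"
)

# Split into documents on blank lines, which is how I expect
# document boundaries to be detected.
documents = [d.strip().split("\n") for d in corpus.split("\n\n") if d.strip()]

for doc in documents:
    print(doc[0])        # the heading line of each document
    print(len(doc) - 1)  # number of body lines in that document
```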
> but I haven't found the specification for this format.
If you click on the link, it should take you to the instructions for preprocessing data for the language modeling task [[1](https://github.com/pytorch/fairseq/tree/master/examples/language_model#training-a-transformer-language-model-with-the-cli-tools)].

If you'd like to ensure that text from different documents does not get merged into the same block, you should use `--sample-break-mode complete_doc`. Documents in your input dataset should be separated by an empty line.
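A minimal sketch of building such a file yourself (the file name `train.raw` and the document contents are illustrative, not prescribed by fairseq):

```python
# Write documents in the expected layout: one line per sentence,
# documents separated by an empty line.
docs = [
    ["= Doc One =", "First sentence of document one .", "Second sentence ."],
    ["= Doc Two =", "Only sentence of document two ."],
]

with open("train.raw", "w", encoding="utf-8") as f:
    f.write("\n\n".join("\n".join(lines) for lines in docs) + "\n")

# The raw file can then be binarized as in the language-model example, e.g.:
#   fairseq-preprocess --only-source --trainpref train.raw --destdir data-bin
# and trained with --sample-break-mode complete_doc, so blocks never
# cross the empty-line document boundaries.
```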
Many thanks for your answer, Matt. I did follow the instructions, but the dataset seemed to be already formatted that way (and I needed to build my own dataset anyway). I counted empty lines in the Wikitext dataset with grep and got 0. Is that possible?
Assuming you downloaded it using the `prepare-wikitext-103.sh` script, I don't think you should have 0 empty lines. Using `grep -c "^\s*$" wiki.train.tokens`, I count 636321 blank lines.
My bad, I was counting empty lines with `"^$"`, which misses lines containing only whitespace. Thanks!
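The difference between the two patterns can be sketched in Python (the sample lines are made up for illustration):

```python
import re

lines = ["= Heading =", "", " ", "text"]

# "^$" matches only a truly empty line; "^\s*$" also matches
# lines that contain nothing but whitespace.
strictly_empty = [l for l in lines if re.fullmatch(r"", l)]
blank = [l for l in lines if re.fullmatch(r"\s*", l)]

print(len(strictly_empty))  # counts only the empty string
print(len(blank))           # also counts the space-only line
```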
Does this clarify your issue? If so, do you mind closing the issue? Thanks!
Excuse me, just one last thing that I also asked in the initial post: how are headings and subheadings treated? E.g. '= Valkyria Chronicles III ='.
I think a lot of your questions about headings and subheadings come down to how WikiText-103, the dataset, presents them, which is simply as sentences like any other. So we treat headings and subheadings the same as paragraph text: as part of the stream of text.