Hello everyone. I'm familiar with Fairseq for translation, but so far I haven't used it for language modeling. In translation, each line is treated as the basic unit, and source lines must be aligned with target lines. I don't quite understand what the basic units are in language modeling.
I was following some examples, like the one for training RoBERTa from scratch with your own data (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md). It says "Data should be preprocessed following the language modeling format.", but I haven't found a specification of this format.
I have downloaded the wikitext dataset and the training set file starts with:
= Valkyria Chronicles III =
Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 .
= = Gameplay = =
I infer that the format is something like:

`\n = < Document heading > = \n`

Is this what is meant by the language modeling format? Are headings ignored? Also, in TokenBlockDataset, I understand that text is treated as a 1D stream of data. If that is the case, and I have a set of different documents concatenated following the said format, will text from different documents be mixed? There is the argument '--sample-break-mode' with options '{none,complete,complete_doc,eos}', and I don't understand the exact meaning of the optional 'document_sep_len':

'document_sep_len (int, optional): document separator size (required for 'complete_doc' break mode). Typically 1 if the sentences have eos and 0 otherwise.'

To sum up, and please excuse me for the long message: suppose that I have a set of documents and I don't want them to be mixed in the same samples (i.e. text from one document should not be used to predict text in another). What would be the way to go? Finally, if I do want some documents to be mixed (i.e. using information from each other as context), would it be a good idea to put each of these documents after its respective title as a sub-heading, within the same heading? Many thanks in advance.
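To make my assumption concrete, here is a toy Python sketch (not fairseq's actual code) of the layout I am inferring: each document starts with an `= Heading =` line, body lines follow, and documents are separated by a blank line. The corpus text below is made up for illustration.

```python
# Toy corpus in the layout I am assuming: heading line, body lines,
# documents separated by one blank line.
corpus = (
    "= Valkyria Chronicles III =\n"
    "Senjo no Valkyria 3 is a tactical role-playing game .\n"
    "\n"
    "= Another Article =\n"
    "Text of the second document .\n"
)

# Split into documents on blank lines, which is how I expect
# document boundaries to be detected.
documents = [d.strip().split("\n") for d in corpus.split("\n\n") if d.strip()]

for doc in documents:
    print(doc[0])        # the heading line of each document
    print(len(doc) - 1)  # number of body lines in that document
```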
> but I haven't found the specification for this format.
If you click on the link, it should take you to the instructions for preprocessing data for the language modeling task [[1](https://github.com/pytorch/fairseq/tree/master/examples/language_model#training-a-transformer-language-model-with-the-cli-tools)].

If you'd like to ensure that text from different documents does not get merged into the same block, you should use `--sample-break-mode complete_doc`. Documents in your input dataset should be separated by an empty line.
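A minimal sketch of building such a file yourself (the file name `train.raw` and the document contents are illustrative, not prescribed by fairseq):

```python
# Write documents in the expected layout: one line per sentence,
# documents separated by an empty line.
docs = [
    ["= Doc One =", "First sentence of document one .", "Second sentence ."],
    ["= Doc Two =", "Only sentence of document two ."],
]

with open("train.raw", "w", encoding="utf-8") as f:
    f.write("\n\n".join("\n".join(lines) for lines in docs) + "\n")

# The raw file can then be binarized as in the language-model example, e.g.:
#   fairseq-preprocess --only-source --trainpref train.raw --destdir data-bin
# and trained with --sample-break-mode complete_doc, so blocks never
# cross the empty-line document boundaries.
```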
Many thanks for your answer, Matt. I did follow the instructions, but the dataset seemed to be already formatted that way (and I needed to build my own dataset anyway). I counted empty lines in the Wikitext dataset with grep and got 0. Is that possible?
Assuming you downloaded it using the `prepare-wikitext-103.sh` script, I don't think you should have 0 empty lines. Using `grep -c "^\s*$" wiki.train.tokens`, I count 636321 blank lines.
My bad, I was counting empty lines with `"^$"`, which misses lines containing only whitespace. Thanks!
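The difference between the two patterns can be sketched in Python (the sample lines are made up for illustration):

```python
import re

lines = ["= Heading =", "", " ", "text"]

# "^$" matches only a truly empty line; "^\s*$" also matches
# lines that contain nothing but whitespace.
strictly_empty = [l for l in lines if re.fullmatch(r"", l)]
blank = [l for l in lines if re.fullmatch(r"\s*", l)]

print(len(strictly_empty))  # counts only the empty string
print(len(blank))           # also counts the space-only line
```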
Does this clarify your issue? If so, do you mind closing the issue? Thanks!
Excuse me, just one last thing that I also asked in the initial post: how are headings and subheadings treated? E.g. '= Valkyria Chronicles III ='.
I think a lot of your questions about headings and subheadings come down to how WikiText-103, the dataset, presents them, which is simply as sentences like any other. So we treat headings and subheadings the same as paragraph text: as part of the stream of text.