Fairseq: Should sentences be split for the (masked) language modeling task?

Created on 24 Feb 2020 · 3 comments · Source: pytorch/fairseq

❓ Questions and Help

What is your question?

In the wikitext dataset suggested in the language modeling task (and also used by the RoBERTa example), sentences are not split into separate lines. Instead, a newline denotes a new paragraph (and a blank line denotes a change of document, as mandated by the "language modeling format" mentioned in the docs).

My question is: is sentence splitting something we should consider when training our own language models? For BERT it is obviously a hard requirement (because of the NSP objective). For BART I'm not sure, since there are no examples of training BART from scratch, but I think it is necessary because of the sentence-permutation objective. For RoBERTa it is not a requirement and does not appear in the example, but is it something that would be beneficial? Did you use it when building your models? So far, I haven't found any mention of this in the original papers or in fairseq's documentation.

In summary: even if sentence splitting (one sentence per line) is not required for RoBERTa, is it something that would be beneficial? Did you do it? And for BART it is a hard requirement, right?

Many thanks in advance.


All 3 comments

Good question. For RoBERTa we always put a blank line between "documents", so for books there's a blank line between each book, for Wikipedia a blank line between articles, etc.

Within each "document", we split sentences for books and Wikipedia. STORIES also seems to be split on sentences. Both CC-NEWS and OpenWebText usually have one paragraph per line.

So for example, in Wikipedia we have one sentence per line, with blank lines between articles:

Jean Bernard Bossu (1720–1792) was a captain in the French navy, adventurer and explorer.
He travelled several times to New France, where he explored the regions along the Mississippi.
(...)

The long-tailed Talaud mosaic-tailed rat or the long-tailed Talaud melomys ("Melomys talaudium") is a species of rodent in the family Muridae.
It is endemic to Karakelong and Salebabu in the Talaud Islands in Indonesia where it occurs in forest habitats.
(...)

For OpenWebText we usually have one paragraph per line, with blank lines between articles:

St Columba Day: the Christianization of Scotland
Today is the feast day of St Columba, a Christian missionary known for the spread of Christianity in what is now known as Scotland. Columba was born in Ireland in 591 CE, and was a monk of some renown, and the story about him is interesting. He made a copy of the Psalms under the direction of another monk, intending to keep the copy. The dispute between ownership grew beyond Columba and the monk to their respective groups, and eventually led to an actual battle in 561. Later, Columba also induced another battle in violation of the King Ireland’s order.
(...)

Georgia Tech players expressed disappointment over not being able to play against Central Florida on Saturday after the game was canceled Monday because of effects of Hurricane Irma.
“We’re always ready to play,” quarterback TaQuon Marshall said Wednesday following the team’s practice. “We were looking forward to playing. I know a lot of the guys from Florida were looking forward to going down and playing in their hometown. It’s disappointing, but we’re happy we can get a break also and rest our bodies and move on to next week.”
(...)

I think the key is putting blank lines between articles, which gives the model an explicit separator. This also enables you to train with --sample-break-mode=complete_doc, which we found gives slightly better performance than complete.
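For anyone wanting to produce this format themselves, here is a minimal preprocessing sketch (not fairseq's actual pipeline; the file names and the use of NLTK for sentence splitting are my own assumptions). It writes one sentence per line within each document and a blank line between documents:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence-tokenizer models used by sent_tokenize

def write_lm_corpus(documents, out_path):
    """Write documents in the "language modeling format": one sentence
    per line, with a blank line separating consecutive documents."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in documents:
            for sent in sent_tokenize(doc):
                f.write(sent.strip() + "\n")  # one sentence per line
            f.write("\n")                     # blank line = document boundary

# Hypothetical usage: docs.txt holds one raw document per line.
with open("docs.txt", encoding="utf-8") as f:
    write_lm_corpus((line.strip() for line in f if line.strip()), "train.txt")

The resulting file can then be BPE-encoded and binarized as in the RoBERTa pretraining example, and trained with --sample-break-mode=complete_doc, since the blank lines give fairseq explicit document boundaries.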

@myleott Understood, many thanks for your answer.
