Fairseq: "Note that ASG loss currently doesn't do well with word-pieces"

Created on 19 May 2020 · 2 comments · Source: pytorch/fairseq

Hey guys, thanks for this awesome library!

I was reading the documentation for the ASG criterion here, and it mentions that "ASG loss currently doesn't do well with word-pieces". Are there any more details or references you can share on this?

Labels: needs triage, question

All 2 comments

ASG loss requires an NTOKEN x NTOKEN transition matrix. With English letters that's about 29 x 29, or 841 elements; with 10,000 word pieces it's 100 million. It's really slow or outright infeasible to meaningfully train a transition matrix that large, and it's also mostly useless, since most of the cells should be near zero. Basically, ASG makes a lot of sense for small letter-token vocabularies with small model strides, and no sense at all for large word-piece vocabularies and large strides.
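As a quick sanity check of the arithmetic above, here is a minimal sketch (the helper name is just for illustration) showing how the transition-matrix size grows quadratically with the token vocabulary:

```python
# ASG keeps one learnable transition score per (prev_token, next_token) pair,
# so the transition matrix has NTOKEN * NTOKEN entries.

def transition_matrix_size(ntoken: int) -> int:
    """Number of entries in an NTOKEN x NTOKEN transition matrix."""
    return ntoken * ntoken

# ~English letters plus a few auxiliary tokens: tiny and easy to train.
print(transition_matrix_size(29))      # 841

# A typical word-piece vocabulary: 100 million transition parameters.
print(transition_matrix_size(10_000))  # 100000000
```

The quadratic blow-up is why the comment above recommends ASG only for small letter-level vocabularies.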

@lunixbochs thanks :)
