ASG loss requires an NTOKEN x NTOKEN transition matrix. With English letters that's about 29 x 29, or 841 elements; with 10,000 word pieces it's 100 million. A transition matrix that large is very slow or impossible to train meaningfully, and it's also mostly useless, since most of the cells should be near zero. Basically, ASG makes a lot of sense for small letter-token vocabs with small model strides, and no sense at all for large word vocabs and large strides.
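For a rough sense of the scaling, here's a minimal sketch (the token counts are just the illustrative numbers from above) showing how the transition matrix parameter count grows quadratically with vocabulary size:

```python
# Rough sketch: the ASG transition matrix is NTOKEN x NTOKEN,
# so its parameter count grows quadratically with the vocabulary.
vocabs = {
    "English letters (~29 tokens)": 29,
    "10k word pieces": 10_000,
}

for name, ntoken in vocabs.items():
    params = ntoken * ntoken
    print(f"{name}: {ntoken} x {ntoken} = {params:,} transition parameters")
# English letters (~29 tokens): 29 x 29 = 841 transition parameters
# 10k word pieces: 10000 x 10000 = 100,000,000 transition parameters
```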
@lunixbochs thanks :)