Transformers: XLNetForQuestionAnswering - weight pruning

Created on 5 Aug 2019 · 8 comments · Source: huggingface/transformers

🚀 Feature

Hi guys, first of all, thanks a lot for the great API. I'm using pytorch-transformers a lot, and you're really doing a good job!

I have recently fine-tuned an XLNetForQuestionAnswering model on SQuAD 1.1 and the results look good; however, the model is taking ~2.0 seconds (on a MacBook Pro) to do a forward pass over a reasonably small facts/passage text.

I have done some weight pruning in the past (on a small network), and I was wondering if you have heard of any papers/ideas for weight pruning in transformer-based networks such as BERT or XLNet?
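For reference, a minimal sketch of what magnitude pruning could look like on the linear layers of a loaded model; it assumes PyTorch's torch.nn.utils.prune utilities (PyTorch >= 1.4), which are separate from pytorch-transformers, and the 30% pruning amount is an arbitrary placeholder:

    # Hedged sketch: unstructured L1 (magnitude) pruning of every nn.Linear weight in the model.
    # Requires PyTorch >= 1.4 for torch.nn.utils.prune; the pruning amount is arbitrary.
    import torch
    import torch.nn.utils.prune as prune
    from pytorch_transformers import XLNetForQuestionAnswering

    model = XLNetForQuestionAnswering.from_pretrained('xlnet-base-cased')

    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=0.3)  # zero the smallest 30% of weights
            prune.remove(module, 'weight')  # bake the mask in and drop the reparametrization

Note that zeroing weights alone does not speed up dense matmuls; structured pruning or sparse kernels are needed to turn the sparsity into wall-clock gains.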

Any other ideas to optimize the model's forward pass for inference? I'm planning to put these models in production, but ~1-2 seconds is still too high.

I'm willing to help and work on this issue, but it would be great if you could point me towards the best way to do this.

Motivation

Currently the forward times of the trained BertForQuestionAnswering and XLNetForQuestionAnswering models are too high. I'm searching for options to reduce forward time on the QA task for both networks (results below were measured on a MacBook Pro, 2.9 GHz Core i7, 16 GB RAM):

BertForQuestionAnswering: 1.48 s ± 52.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
XLNetForQuestionAnswering: 2.14 s ± 45.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
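For context, a rough sketch of how such numbers can be measured on CPU; the checkpoint name and the 384-token dummy input are placeholders rather than the exact setup above:

    # Rough CPU timing sketch using the pytorch-transformers API; 'bert-base-uncased'
    # stands in for a SQuAD fine-tuned checkpoint, and the input is a dummy 384-token batch.
    import time
    import torch
    from pytorch_transformers import BertForQuestionAnswering

    model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
    model.eval()

    input_ids = torch.randint(0, 30000, (1, 384), dtype=torch.long)  # fake question+passage ids
    with torch.no_grad():
        times = []
        for _ in range(3):
            start = time.time()
            start_logits, end_logits = model(input_ids)[:2]
            times.append(time.time() - start)
    print('mean forward time: %.2f s' % (sum(times) / len(times)))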

Additional context

wontfix

All 8 comments

I'm interested in this as well. I've seen similar inference times of nearly 1.5 seconds running a fine-tuned BERT classification model on TF Serving, and I would like to improve them without paying for a GPU.

I'm not associated with the following work, but found the paper interesting:
"tranformers.zip: Compressing Transformers with Pruning and Quantization"
http://web.stanford.edu/class/cs224n/reports/custom/15763707.pdf

The open-source code corresponding to the paper above has been published in a branch of OpenNMT here:
https://github.com/robeld/ERNIE

I think we could significantly speed up XLNet by refactoring the TensorFlow code to use embeddings instead of multiplications of static matrices with one-hot vectors, as is currently done in several places. We could also reduce the use of torch.einsum and replace it with matrix multiplications. We'll experiment with that in the coming months.
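To make the two suggestions concrete, here is a small illustrative sketch (not the actual XLNet code; shapes and names are made up) showing why an embedding lookup is equivalent to, and much cheaper than, a one-hot matmul, and how a torch.einsum contraction maps onto torch.matmul:

    # Illustrative only: toy shapes, not XLNet's real tensors.
    import torch
    import torch.nn.functional as F

    vocab, dim, seq = 32000, 1024, 128
    weight = torch.randn(vocab, dim)
    ids = torch.randint(0, vocab, (seq,))

    # 1) one-hot @ weight vs. a direct embedding lookup: identical result, far less work
    one_hot = F.one_hot(ids, vocab).float()
    out_matmul = one_hot @ weight            # O(seq * vocab * dim) dense multiply
    out_lookup = F.embedding(ids, weight)    # O(seq * dim) gather
    assert torch.allclose(out_matmul, out_lookup)

    # 2) a torch.einsum contraction and its torch.matmul equivalent
    q = torch.randn(8, seq, dim)
    k = torch.randn(8, seq, dim)
    scores_einsum = torch.einsum('bid,bjd->bij', q, k)
    scores_matmul = torch.matmul(q, k.transpose(1, 2))
    assert torch.allclose(scores_einsum, scores_matmul, atol=1e-4)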

Might even just dropping in opt_einsum as a substitute for torch.einsum be an easy speedup?
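For what it's worth, a minimal sketch of what that drop-in would look like; opt_einsum's contract() accepts the same subscript strings and dispatches to the torch backend when given torch tensors:

    # Minimal sketch: swapping opt_einsum's contract() in for torch.einsum (pip install opt-einsum).
    import torch
    from opt_einsum import contract

    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)

    out_torch = torch.einsum('ij,jk->ik', a, b)
    out_opt = contract('ij,jk->ik', a, b)  # same subscripts, backend inferred from the inputs
    assert torch.allclose(out_torch, out_opt, atol=1e-4)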

I'm doing some time profiling here, and it looks like the time bottleneck is in the forward loop of the transformer. In this case my overall forward loop for XLNetForQuestionAnswering is taking 2.5 s ± 310 ms per loop (mean ± std. dev. of 3 runs, 1 loop each). Please see below a breakdown for each forward step (in seconds). It looks like the largest chunk of time, ~2.33 seconds, is spent in the code chunk below. I will start doing some optimizations on XLNetRelativeAttention and XLNetFeedForward to see what happens.

Causal attention mask: 7e-05
Data mask: 3e-05
Word Embedding: 0.00073
Segment Embedding: 5e-05
___ Pos encoding - 1 : 0.0099
___ Pos encoding - 2 : 0.00012
**___ Pos encoding - 3: 2.33072**
Positional encoding: 2.34084
Prepare output: 0.00025
Transformer time: 2.3420751094818115

___ Pos encoding - 3 - Code chunk

        new_mems = ()
        if mems is None:
            mems = [None] * len(self.layer)

        attentions = []
        hidden_states = []
        for i, layer_module in enumerate(self.layer):
            # cache new mems for the next segment before running the layer
            new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
            if self.output_hidden_states:
                hidden_states.append((output_h, output_g) if output_g is not None else output_h)

            # run one XLNet block (relative attention + feed-forward) on the content
            # stream (output_h) and, when present, the query stream (output_g)
            outputs = layer_module(output_h, output_g, attn_mask_h=non_tgt_mask, attn_mask_g=attn_mask,
                                   r=pos_emb, seg_mat=seg_mat, mems=mems[i], target_mapping=target_mapping,
                                   head_mask=head_mask[i])
            output_h, output_g = outputs[:2]
            if self.output_attentions:
                attentions.append(outputs[2])

@MiroFurtado it looks like torch.einsum is already as optimized as opt_einsum - see attached an example of multiplying a 1024x1024 matrix using torch.einsum, torch.matmul, np.einsum and opt_einsum. It looks like np.einsum is in fact not optimized after all.
I modified the code to use opt_einsum's contract and it actually took ~3x longer! 5.79 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

[Attached image: Einsum comparison - torch.einsum, torch.matmul, np.einsum, opt_einsum contract]
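Since the original attachment is an image, here is a rough, hypothetical re-creation of that comparison; absolute numbers will of course vary by machine:

    # Rough timing comparison of the four variants on a 1024x1024 matmul; illustrative only.
    import timeit
    import numpy as np
    import torch
    from opt_einsum import contract

    a_t, b_t = torch.randn(1024, 1024), torch.randn(1024, 1024)
    a_n, b_n = a_t.numpy(), b_t.numpy()

    for name, fn in [
        ('torch.einsum', lambda: torch.einsum('ij,jk->ik', a_t, b_t)),
        ('torch.matmul', lambda: torch.matmul(a_t, b_t)),
        ('np.einsum',    lambda: np.einsum('ij,jk->ik', a_n, b_n)),
        ('opt_einsum',   lambda: contract('ij,jk->ik', a_t, b_t)),
    ]:
        print('%-12s %.4f s (10 runs)' % (name, timeit.timeit(fn, number=10)))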

Just FYI, here is a relevant blog post on this topic that I will investigate: https://blog.rasa.com/compressing-bert-for-faster-prediction-2/

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
