BERT: Why must the hidden size be a multiple of the number of attention heads?

Created on 28 Dec 2018 · 4 Comments · Source: google-research/bert

if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))
Why must the hidden size be a multiple of the number of attention heads?
The check above is from line 804 of modeling.py.

HaodaY

All 4 comments

The reason is in the next line of code:
attention_head_size = int(hidden_size / num_attention_heads)
The dimensions of the query and value can differ, but the dimensions of the query and key must be equal. attention_head_size is used in the attention layer, where it sets the dimension of the query and key vectors.
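
A minimal sketch of the arithmetic behind that line, using plain Python rather than the TensorFlow code in modeling.py. The helper names per_head_size and truncated_width are hypothetical, and 768 and 12 are used only as example values. If the check were skipped, int() would silently drop the fractional part and the heads would no longer add up to the hidden size:

```python
def per_head_size(hidden_size, num_attention_heads):
    """Return the width of each head, guarded by the check quoted in the question."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "The hidden size (%d) is not a multiple of the number of attention "
            "heads (%d)" % (hidden_size, num_attention_heads))
    return int(hidden_size / num_attention_heads)


def truncated_width(hidden_size, num_attention_heads):
    """Total width of all heads combined if the divisibility check were skipped."""
    head_size = int(hidden_size / num_attention_heads)  # int() drops the fraction
    return num_attention_heads * head_size


print(per_head_size(768, 12))    # 64
print(truncated_width(768, 12))  # 768, matches the hidden size
print(truncated_width(768, 10))  # 760, eight dimensions short of 768
```

With 10 heads the combined head width is 760 instead of 768, so the per-head vectors could not be laid out as an even split of a 768-wide vector.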

IBruceWayne on 9 Jan 2019

👍4

"dimension of query and key should be equal": is this because they need to be multiplied to calculate attention?
How are hidden_size, num_attention_heads, and the query vector size linked to each other? Can they not be independent?

ghost on 14 Mar 2019

1. The query and key are multiplied to calculate similarity, so their dimensions must be equal.
2. num_attention_heads is independent of the query vector size. I think the author tied them together for convenience, but it's not necessary.
3. You can find details on this website: https://jalammar.github.io/illustrated-transformer/
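
Point 1 can be sketched in plain Python as scaled dot-product attention for a single query. This is an illustration under stated assumptions, not the code from modeling.py: the helpers dot and attend and the toy vectors are hypothetical. The dot product forces the query and key to share a dimension, while the value vectors are free to use a different one:

```python
import math


def dot(a, b):
    """Dot product; only defined when both vectors have the same length."""
    if len(a) != len(b):
        raise ValueError("query and key must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a few key/value pairs."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    # Softmax over the scores (subtract the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors; the output width equals the value width.
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]


query = [1.0, 0.0, 1.0, 0.0]                          # dimension 4
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]   # dimension 4, same as the query
values = [[1.0, 2.0], [3.0, 4.0]]                     # dimension 2, differs from the query

output = attend(query, keys, values)
print(len(output))  # 2
```

The output has the width of the value vectors, not of the query, which is why only the query and key sizes are tied together.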

IBruceWayne on 15 Mar 2019

👍1

I had the same question, and after reading The Illustrated Transformer again, it seems to be resolved. I made some annotations on an image from the post.

I think it's now clear that the hidden size must be a multiple of the number of attention heads.

If I'm wrong, please correct me.
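
The annotated image is not reproduced here. As a rough stand-in, the following toy sketch shows the split-and-concatenate picture from the post: a vector of width hidden_size is cut into equal slices, one per head, and the slices are joined back to the original width. The helpers split_heads and concat_heads are hypothetical, and the toy list stands in for a projected query, key, or value vector:

```python
def split_heads(hidden_vector, num_attention_heads):
    """Cut one hidden vector into equally sized per-head slices."""
    hidden_size = len(hidden_vector)
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden size must be a multiple of the number of heads")
    head_size = hidden_size // num_attention_heads
    return [hidden_vector[i * head_size:(i + 1) * head_size]
            for i in range(num_attention_heads)]


def concat_heads(heads):
    """Join per-head slices back into one vector."""
    return [x for head in heads for x in head]


hidden = list(range(8))                # toy hidden_size of 8
heads = split_heads(hidden, 4)         # 4 heads of width 2
print(heads)                           # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(concat_heads(heads) == hidden)   # True
```

An even split is only possible when the width divides cleanly by the number of heads; split_heads(hidden, 3) raises ValueError.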

secsilm on 29 Aug 2019

👍6
