The reason is in the next line of code:
attention_head_size = int(hidden_size / num_attention_heads)
The dimensions of query and value can differ, but the dimensions of query and key must be equal. attention_head_size is what the attention layer uses as the dimension of the query and key vectors.
"dimension of query and key should be equal" => is this because they need to be multiplied to calculate attention?
May I know how hidden_size, num_attention_heads, and the query vector size are linked to each other? Can they not be independent?
1. Query and key are multiplied to calculate similarity, so the dimensions of query and key must be equal.
2. num_attention_heads is independent of the query vector size. I think the author tied them together for convenience, but it's not necessary.
3. You can find details on this website: https://jalammar.github.io/illustrated-transformer/
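To make point 1 concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. It is not the code from the repository. The function name `attention` and the toy vectors are made up for illustration. It shows that the dot product forces query and key to have the same length, while the value vectors can have any length.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (illustrative only)."""
    d_k = len(query)
    # The dot product is only defined when query and key have the same length.
    for key in keys:
        if len(key) != d_k:
            raise ValueError("query and key dimensions must match")
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Softmax over the scores (shifted by the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors; their length d_v is not tied to d_k.
    d_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]

query = [1.0, 0.0]                            # d_k = 2
keys = [[1.0, 0.0], [0.0, 1.0]]               # two keys, each of length d_k = 2
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # two values, each of length d_v = 3
output = attention(query, keys, values)
print(len(output))  # 3, the value dimension, not the query/key dimension
```

Passing a query of length 3 with the same keys raises the ValueError, which is the "dimensions must be equal" constraint in action.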
I had the same question, and after reading The Illustrated Transformer again, it seems to be solved. I made some annotations on an image from the post.

I think it's now clear that the hidden size must be a multiple of the number of attention heads.
If I'm wrong, please correct me.
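A rough way to check this, assuming the heads are formed by cutting the hidden vector into equal slices and concatenating them back afterwards (as the post illustrates). The helpers `split_heads` and `merge_heads` are hypothetical names for this sketch, not functions from the repository, and 768 / 12 are just example sizes.

```python
def split_heads(hidden_vector, num_attention_heads):
    """Cut one hidden vector into equal per-head slices (illustrative only)."""
    hidden_size = len(hidden_vector)
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    attention_head_size = hidden_size // num_attention_heads
    return [hidden_vector[i * attention_head_size:(i + 1) * attention_head_size]
            for i in range(num_attention_heads)]

def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [x for head in heads for x in head]

hidden_vector = list(range(768))          # example hidden_size = 768
heads = split_heads(hidden_vector, 12)    # example num_attention_heads = 12
print(len(heads), len(heads[0]))          # 12 64
assert merge_heads(heads) == hidden_vector

# If the sizes do not divide evenly, int(hidden_size / num_attention_heads)
# truncates and the heads no longer cover the whole hidden vector:
print(int(768 / 10) * 10)                 # 760, not 768
```

So with this slicing scheme, each head has hidden_size / num_attention_heads dimensions, and concatenating the heads only gives back a vector of hidden_size when the division is exact.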