Ax: Modulo parameter constraints

Created on 15 Sep 2019 · 4 Comments · Source: facebook/Ax

1. modulo

hidden_size % num_attention_heads == 0

use case

In transformer models, multi-head attention splits the hidden vector into n parts, where n is the number of attention heads,
so we usually want hidden_size to be divisible by num_attention_heads.

2. log2

math.log2(batch_size) % 1 == 0

use case

To make the batch size a power of two (2**n) so that it just fits in memory.

3. Why?

Can we just pass a function as a parameter constraint, with the parameters' names as its arguments?

enhancement wishlist

richarddwang

👍1
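
For illustration, the two conditions requested above can be written as plain Python predicates over a dictionary of parameter values. This is only a sketch of the kind of callable the issue asks about, not an Ax API; the function names and the dictionary layout are hypothetical. The power-of-two check uses an integer bit test instead of math.log2 so that it does not depend on floating-point results.

```python
def heads_divide_hidden(hidden_size: int, num_attention_heads: int) -> bool:
    """True when hidden_size splits evenly across the attention heads."""
    return hidden_size % num_attention_heads == 0


def is_power_of_two(batch_size: int) -> bool:
    """True when batch_size equals 2**n for some integer n >= 0."""
    # A positive power of two has exactly one bit set, so clearing the
    # lowest set bit (x & (x - 1)) leaves zero.
    return batch_size > 0 and batch_size & (batch_size - 1) == 0


def satisfies_constraints(params: dict) -> bool:
    """Hypothetical combined predicate over a dict of parameter values."""
    return heads_divide_hidden(
        params["hidden_size"], params["num_attention_heads"]
    ) and is_power_of_two(params["batch_size"])
```

A predicate like this can be used to check or filter candidate configurations in user code, whether or not the tuning library accepts it as a constraint.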

Most helpful comment

Can you just reparameterize your problem in this case? Like define integer parameters size_per_head and log_batch_size and then in your evaluation code have

hidden_size = size_per_head * num_attention_heads
batch_size = 2**log_batch_size

Or would that mean you need a joint constraint on size_per_head and num_attention_heads?

@sdsingh From an optimization/candidate generation perspective (other than difficulty of the problem) there is no issue with imposing non-linear constraints on the parameter space. I assume the linearity assumption is mainly imposing structure for representing the constraints in Ax?

Balandat on 18 Sep 2019

👍3 ❤1
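
A minimal sketch of the reparameterization suggested in this comment, written in plain Python without any Ax calls. The search ranges, the helper names (expand_parameters, sample_parameters) and the random sampling are assumptions made for the example; in practice the tuning library would propose the three integer parameters and only the expansion step would live in the evaluation code.

```python
import random

# Hypothetical integer ranges for the reparameterized search space.
SEARCH_SPACE = {
    "size_per_head": (16, 128),
    "num_attention_heads": (1, 16),
    "log_batch_size": (3, 8),
}


def expand_parameters(params: dict) -> dict:
    """Map the searched parameters back to the model's hyperparameters."""
    num_heads = params["num_attention_heads"]
    return {
        # Divisible by num_heads by construction.
        "hidden_size": params["size_per_head"] * num_heads,
        "num_attention_heads": num_heads,
        # A power of two by construction.
        "batch_size": 2 ** params["log_batch_size"],
    }


def sample_parameters(rng: random.Random) -> dict:
    """Stand-in for the tuner: draw each integer parameter uniformly."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(expand_parameters(sample_parameters(rng)))
```

Both conditions hold for every sampled point, so no constraint has to be declared. The trade-off is the one raised in the comment: a bound on hidden_size itself becomes a bound on the product size_per_head * num_attention_heads, which is a joint, non-linear condition on the new parameters.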

All 4 comments

All parameter constraints are implemented as linear constraints in the modeling layer. When we pass points into our GPs, they are all normalized to [0,1]^d, where our constraints are then applied. You can see how these are implemented with a simple matrix multiply in the BoTorch model. As a result, we can only support constraints that can be mapped into a linear constraint, however creatively that may be. Unfortunately, I don't see a way of doing that for either of these constraint types.

Both of these constraint types seem pretty useful, though. Let me think about what we can do here, but unfortunately I don't think this will be a quick fix.

sdsingh on 17 Sep 2019

👍2 ❤1
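
One way to see why the modulo condition resists a linear encoding: any set of linear inequalities A x <= b describes a convex region, so if two points satisfy them, every point on the segment between those points does too. The sketch below checks this with a small pure-Python matrix multiply. The matrix A, the vector b and the three points are made up for the example and are not taken from Ax or BoTorch, and the check works on raw values rather than the normalized [0,1]^d space (an affine rescaling does not change the argument).

```python
def satisfies_linear(x, A, b):
    """Return True if every row of A @ x is <= the matching entry of b."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
        for row, b_i in zip(A, b)
    )


def modulo_ok(point):
    """The requested condition: hidden_size % num_attention_heads == 0."""
    hidden_size, num_heads = point
    return hidden_size % num_heads == 0


# Illustrative linear constraints on x = (hidden_size, num_attention_heads):
#   hidden_size <= 128, num_attention_heads <= 16, 4 * num_heads <= hidden_size
A = [[1, 0], [0, 1], [-1, 4]]
b = [128, 16, 0]

p = (64, 4)    # satisfies the modulo condition
q = (64, 8)    # satisfies the modulo condition
mid = (64, 6)  # midpoint of p and q, violates it (64 % 6 != 0)

if __name__ == "__main__":
    for name, point in (("p", p), ("q", q), ("mid", mid)):
        print(name, satisfies_linear(point, A, b), modulo_ok(point))
```

The midpoint passes every linear constraint that both endpoints pass, yet fails the modulo check, so no choice of A and b can admit exactly the divisible pairs.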

Thank you, that's indeed a great alternative.
I just thought we could make it more elegant, but after reading the explanation I see that it's not as easy as I thought.
It's still a very useful library for tuning, thanks for your work!

richarddwang on 22 Sep 2019

👍1

Closing this issue for now as there is an easy workaround and adding modulo constraints is not on our roadmap for the foreseeable future.