Ax: Modulo parameter constraints

Created on 15 Sep 2019 · 4 Comments · Source: facebook/Ax

1. modulo

hidden_size % num_attention_heads == 0

use case

In transformer models, multi-head attention splits the hidden vector into n parts, where n is the number of attention heads,
so we usually want hidden_size to be divisible by num_attention_heads.

2. log2

math.log2(batch_size) % 1 == 0

use case

To make the batch size a power of two (2**n) so that it just fits in memory.

3. Why?

Can we just pass a function as a parameter constraint, with the parameters' names as its arguments?

enhancement wishlist

All 4 comments

All parameter constraints are implemented as linear constraints in the modeling layer. When we pass points into our GPs, they are all normalized to [0,1]^d, and our constraints are then applied in that space. You can see how these are implemented with a simple matrix multiply in the BoTorch model. As a result, we can only support constraints that can be mapped into a linear constraint, however creatively that may be. Unfortunately, I don't see a way of doing that for either of these constraint types.
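To illustrate the point about linearity, here is a minimal plain-Python sketch (not Ax or BoTorch code; the helper names `satisfies_linear` and `satisfies_modulo` and the example numbers are assumptions made for illustration). A linear constraint is a matrix multiply followed by a comparison, and any set of linear inequalities satisfied by two points is also satisfied by their midpoint. The modulo constraint breaks that property, which is one way to see why it cannot be mapped to linear constraints:

```python
def satisfies_linear(A, b, x):
    """Return True if A @ x <= b holds for every row (plain-Python matrix multiply)."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
        for row, b_i in zip(A, b)
    )


def satisfies_modulo(x):
    """The constraint requested in the issue: hidden_size % num_attention_heads == 0."""
    hidden_size, num_attention_heads = x
    return hidden_size % num_attention_heads == 0


# Hypothetical linear constraint: hidden_size - 64 * num_attention_heads <= 0
A = [[1.0, -64.0]]
b = [0.0]
print(satisfies_linear(A, b, [256, 8]))

# Two integer points (hidden_size, num_attention_heads) that satisfy the modulo constraint
p = [4, 4]
q = [8, 4]

# Their midpoint (6, 4) is also an integer point, but 6 % 4 != 0.
midpoint = [(p_i + q_i) // 2 for p_i, q_i in zip(p, q)]
print(satisfies_modulo(p), satisfies_modulo(q), satisfies_modulo(midpoint))
```

Because the normalization to [0,1]^d is an affine map, it preserves midpoints, so the same argument applies in the normalized space.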

Both of these constraint types seem pretty useful, though. Let me think about what we can do here, but unfortunately I don't think this will be a quick fix.

Can you just reparameterize your problem in this case? Like define integer parameters size_per_head and log_batch_size and then in your evaluation code have

hidden_size = size_per_head * num_attention_heads
batch_size = 2**log_batch_size

Or would that mean you need a joint constraint on size_per_head and num_attention_heads?

@sdsingh From an optimization/candidate generation perspective (other than difficulty of the problem) there is no issue with imposing non-linear constraints on the parameter space. I assume the linearity assumption is mainly imposing structure for representing the constraints in Ax?
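The reparameterization suggested above can be sketched as follows. This is a minimal illustration rather than a definitive setup: the bounds are made up, the search space is written as plain dicts in the style of Ax range parameters, and `decode` and `sample` are hypothetical helpers (random sampling stands in for whatever the optimizer would propose).

```python
import random

# Hypothetical integer search space; the bounds are illustrative assumptions.
search_space = [
    {"name": "size_per_head", "type": "range", "bounds": [16, 128], "value_type": "int"},
    {"name": "num_attention_heads", "type": "range", "bounds": [1, 16], "value_type": "int"},
    {"name": "log_batch_size", "type": "range", "bounds": [3, 8], "value_type": "int"},
]


def decode(params):
    """Map the optimizer's parameters to the values the model actually needs."""
    hidden_size = params["size_per_head"] * params["num_attention_heads"]
    batch_size = 2 ** params["log_batch_size"]
    return {
        "hidden_size": hidden_size,
        "num_attention_heads": params["num_attention_heads"],
        "batch_size": batch_size,
    }


def sample(space, rng):
    """Stand-in for the optimizer: draw one integer per parameter within its bounds."""
    return {p["name"]: rng.randint(*p["bounds"]) for p in space}


rng = random.Random(0)
for _ in range(3):
    model_params = decode(sample(search_space, rng))
    # Both original constraints hold by construction.
    assert model_params["hidden_size"] % model_params["num_attention_heads"] == 0
    print(model_params)
```

Every decoded point satisfies both constraints by construction, so no modulo or log2 constraint has to be expressed to the optimizer. The trade-off is the one raised in the comment: a bound on hidden_size itself becomes a joint, non-linear condition on size_per_head * num_attention_heads.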

Thank you, that's indeed a great alternative.
I thought we could make this more elegant, but after reading the explanation I see it's not as easy as I thought.
Still, it's a very useful library for tuning. Thanks for your work!

Closing this issue for now as there is an easy workaround and adding modulo constraints is not on our roadmap for the foreseeable future.
