PySyft design conventions discussion

Created on 27 Sep 2017 · 5 Comments · Source: OpenMined/PySyft

I created this issue as a discussion board about design, so please chip in with your views.
A lot of great code has come in recently, but I think we lack a solid design philosophy about what code should go where. That has resulted in inconsistent code organization, which could get worse if not addressed. For example:

There are some functions in syft/math.py that are not implemented in syft/tensor.py (e.g. cumsum) and vice versa (e.g. rsqrt).
There are functions in syft/math.py that are not necessarily math operations (e.g. transpose and unsqueeze).
There are functions like mv that feel like they belong in math.py but are implemented in tensor.py.
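One possible way to keep the two files from drifting apart, sketched here under assumptions: the tensor method owns the implementation and the module-level function only delegates to it. The `TensorBase` below is a simplified stand-in, not PySyft's actual class, and the wrapper functions are hypothetical.

```python
import numpy as np


class TensorBase:
    """Simplified stand-in for syft's TensorBase (assumption, not the real class)."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def cumsum(self, dim=0):
        # The method owns the implementation.
        return TensorBase(np.cumsum(self.data, axis=dim))

    def rsqrt(self):
        # Element-wise reciprocal of the square root.
        return TensorBase(1.0 / np.sqrt(self.data))


# Hypothetical math-module functions: thin wrappers that delegate to the method,
# so every operation exists in both places by construction.
def cumsum(tensor, dim=0):
    return tensor.cumsum(dim)


def rsqrt(tensor):
    return tensor.rsqrt()


t = TensorBase([1.0, 4.0, 16.0])
print(cumsum(t).data.tolist())  # [1.0, 5.0, 21.0]
print(rsqrt(t).data.tolist())   # [1.0, 0.5, 0.25]
```

With this layout, adding an operation means writing it once as a method and adding a one-line wrapper, which is only one of several conventions the project could settle on.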

I think some of the things that have a bearing on how we define our conventions are:

What will the future of encrypted tensors look like? Which methods do they inherit from TensorBase? I expect methods like transpose and reshape, basically anything without a math operation, will work for encrypted tensors as well, but I don't know much about cryptography. So it would be great to hear from someone with experience there.
Also should encrypted tensors implement their own math functions? If yes, do they get their own math.py files?
If we want to use PyTorch's Autograd, what bearing does that have on our design?
I think following PyTorch's conventions is good, but we should also think about how they mix with our encrypted tensors.
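The inheritance question above could look roughly like this. It is only a sketch under assumptions: `TensorBase` is a simplified stand-in, `EncryptedTensor` is a hypothetical subclass, and no real encryption scheme is involved.

```python
import numpy as np


class TensorBase:
    """Simplified stand-in (assumption): structural operations live here."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def transpose(self):
        # Only rearranges elements; never inspects their values.
        return type(self)(self.data.T)

    def reshape(self, *shape):
        return type(self)(self.data.reshape(shape))

    def __add__(self, other):
        return type(self)(self.data + other.data)


class EncryptedTensor(TensorBase):
    """Hypothetical subclass: inherits structural ops, overrides math ops."""

    def __add__(self, other):
        # A real scheme would combine ciphertexts here; this only marks the hook.
        raise NotImplementedError("encrypted addition depends on the scheme")


e = EncryptedTensor([[1, 2, 3], [4, 5, 6]])
print(e.transpose().data.shape)         # (3, 2)
print(type(e.reshape(3, 2)).__name__)   # EncryptedTensor
```

Using `type(self)(...)` in the base class keeps the result an `EncryptedTensor` without the subclass having to re-implement transpose or reshape.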

I think after we come to a consensus about design, we should:

Update our documentation to be more clear for contributors
Update current issues so as not to confuse new contributors. I know it was a cause of confusion for me.

Please share your thoughts.

siarez

👍1

Most helpful comment

I'm with @baldassarreFe in that we're "just" being less strict about some parameters, and that can't be breaking. However, I think it might be very confusing for a user to have functions whose usage is more or less "constrained" in an arbitrary way (which, as you have pointed out, is what we have now).

In general I prefer strict documentation: what you get is what is written. The user of the library shouldn't have to depend on "tricks" to do certain computations, and if they're really necessary/convenient they should be "officially" in the docs. I think it also helps a bit in debugging.

And about allowing or not implicit conversion of Tensor-like objects I'm not completely sure. If we want to really encourage using always our Tensors (like PyTorch does) it might make sense to avoid it.

aiorla on 6 Oct 2017

👍3

All 5 comments

I agree that this will generally be the case in the short run. The only exception will be for a thing called "packing" where, instead of encrypting each scalar individually, an entire vector is encrypted as a unit. While this makes it more challenging to index into specific values, it can increase performance significantly in the right setting.
This is a great question. In many cases, yes they should. This is particularly true for cases wherein there are different ways to accomplish the same thing. For example, when multiplying a vector by a scalar, one can either encrypt the scalar and perform multiplication or perform multiplication with the scalar decrypted. The latter is much faster. This kind of logic we want to bake into the Encrypted tensor (and is likely to be slightly different for each variant).

Whether that means it should justify its own encrypted_math.py is less certain. I don't personally feel that this is clear yet.
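A rough sketch of the dispatch logic described above, under heavy assumptions: `Cipher` is a toy wrapper that only tags a value and performs no real encryption, and `EncryptedVector` is a hypothetical class, not a PySyft type. It only shows where an encrypted tensor could choose between the two multiplication paths.

```python
class Cipher:
    """Toy ciphertext wrapper. NOT real encryption; it only tags a value."""

    def __init__(self, value):
        self.value = value


class EncryptedVector:
    """Hypothetical encrypted tensor that picks a multiplication path."""

    def __init__(self, values):
        self.cells = [Cipher(v) for v in values]
        self.last_path = None  # records which branch ran, for illustration only

    def __mul__(self, scalar):
        if isinstance(scalar, Cipher):
            # Encrypted scalar: ciphertext-by-ciphertext product (the slow path).
            factor, path = scalar.value, "cipher*cipher"
        else:
            # Plain scalar: ciphertext-by-plaintext product (the fast path).
            factor, path = scalar, "cipher*plain"
        result = EncryptedVector([c.value * factor for c in self.cells])
        result.last_path = path
        return result

    def decrypt(self):
        return [c.value for c in self.cells]


v = EncryptedVector([1, 2, 3])
print((v * 2).last_path)          # cipher*plain
print((v * Cipher(2)).decrypt())  # [2, 4, 6]
```

In a real scheme the two branches would call different primitives with very different costs; here they compute the same thing so that only the branching is visible.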

We probably need to extend the torch.Tensor() interface. While our function names should already be quite similar, I expect that some of the ways we're storing data might need some changing (not 100% on that yet). That being said, PyTorch itself is designed to interop freely with NumPy, so this is probably more about interface design than changing things under the hood (hopefully).
Agreed.

iamtrask on 28 Sep 2017

👍1

A question regarding our PySyft codebase.
If you look at many of the functions in _math.py_ or _tensor.py_, such as dot/sigmoid/tanh/relu/transpose (check _math.py_),
the inputs they accept are either a _numpy_ object or a _TensorBase_ object.

But the tickets associated with those functions clearly state that they should accept a tensor as input, following PyTorch conventions.

In PyTorch, the same functions do not accept a _numpy_ object; PyTorch throws a TypeError. Our functions, however, accept the _numpy_ object, convert it to a TensorBase object within the function, then do whatever the function was supposed to do and output a tensor.

Now, this is highly inconsistent. Or is it right to do this? Was this intentional?
What is the behavior we want our functions to follow?
Could this break anything?

What bothers me is the usage: _ensure_tensorbase(input) vs. input = _ensure_tensorbase(input).
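To make the difference between the two call styles concrete, here is a minimal sketch. It assumes `_ensure_tensorbase` returns a `TensorBase` rather than converting its argument in place, which the thread does not confirm; the class and both helper functions are simplified stand-ins.

```python
import numpy as np


class TensorBase:
    """Simplified stand-in for syft's TensorBase (assumption)."""

    def __init__(self, data):
        self.data = np.asarray(data)


def _ensure_tensorbase(value):
    # Assumed behaviour: return a TensorBase, wrapping anything else.
    if isinstance(value, TensorBase):
        return value
    return TensorBase(value)


def kind_after_discarding(x):
    _ensure_tensorbase(x)      # return value thrown away; x is still the original
    return type(x).__name__


def kind_after_rebinding(x):
    x = _ensure_tensorbase(x)  # x now refers to the converted object
    return type(x).__name__


arr = np.array([[1, 2], [3, 4]])
print(kind_after_discarding(arr))  # ndarray
print(kind_after_rebinding(arr))   # TensorBase
```

Under that assumption, the bare call has no effect on the caller's variable, so any code after it still operates on the raw NumPy array.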

@iamtrask @siarez @aiorla @baldassarreFe what do you guys feel about this?

bharathgs on 6 Oct 2017

I see the inconsistency and agree that if we want to stick 100% to PyTorch we should raise an error if the value passed is not in the TensorBase hierarchy. On the other hand, I can't think of a situation where being less strict with the requirements on a parameter would break the code. In the end, we are providing more functionality than what we are "promising" in the docstring, so that should be fine.
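The two policies being weighed here can be sketched side by side. This is an illustration under assumptions: `TensorBase` is a simplified stand-in and both `sigmoid_*` functions are hypothetical, not PySyft's actual implementations.

```python
import numpy as np


class TensorBase:
    """Simplified stand-in for syft's TensorBase (assumption)."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)


def sigmoid_lenient(x):
    # Lenient policy: silently wrap anything that is not already a TensorBase.
    if not isinstance(x, TensorBase):
        x = TensorBase(x)
    return TensorBase(1.0 / (1.0 + np.exp(-x.data)))


def sigmoid_strict(x):
    # Strict policy: reject non-tensor input, as the thread says PyTorch does.
    if not isinstance(x, TensorBase):
        raise TypeError(f"expected TensorBase, got {type(x).__name__}")
    return TensorBase(1.0 / (1.0 + np.exp(-x.data)))


print(sigmoid_lenient(np.zeros(2)).data.tolist())  # [0.5, 0.5]

try:
    sigmoid_strict(np.zeros(2))
except TypeError as err:
    print(err)  # expected TensorBase, got ndarray
```

Code written against the strict version keeps working if the library later becomes lenient, but not the other way round, which is worth weighing when choosing a default.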

baldassarreFe on 6 Oct 2017

aiorla on 6 Oct 2017

👍3

In general I prefer strict documentation: what you get is what is written. The user of the library shouldn't have to depend on "tricks" to do certain computations, and if they're really necessary/convenient they should be "officially" in the docs. I think it also helps a bit in debugging.

Agree with @aiorla, especially since this is a pure Python library, so it should naturally respect the Zen of Python.

And about allowing or not implicit conversion of Tensor-like objects I'm not completely sure. If we want to really encourage using always our Tensors (like PyTorch does) it might make sense to avoid it.

It would also be helpful for new contributors like me to understand things like:

The structure of TensorBase in explicit terms (how does it compare to PyTorch, what are the differences, if any...).
Is it supposed to always rely on ndarray for its storage?
What parts should TensorBase implement instead of reusing ndarray?
In the User Story issues, it would be helpful to include example usage showcasing the desired API for new features.