PyTorch: torch.lobpcg always breaks for autograd

Created on 23 May 2020 · 67 Comments · Source: pytorch/pytorch

🐛 Bug

It seems that torch.lobpcg (https://pytorch.org/docs/stable/torch.html?highlight=lobpcg#torch.lobpcg) just always breaks when trying to take gradients via backward.

To Reproduce

Here's a minimalist example showing lobpcg breaking.

# lob.py
import torch as T
T.autograd.set_detect_anomaly(True)

A = T.randn(10, 10)
A.requires_grad_()
S = A.matmul(A.t())
e, v = T.lobpcg(S, k=3)
S_hat = T.einsum('ij,j,kj->ik', v, e, v) # v * diag(e) * v^T
loss = S_hat.abs().sum()
loss.backward() # breaks here

Running that code produces the following error.

Warning: Error detected in MmBackward. Traceback of forward call that caused the error:
  File "lob.py", line 9, in <module>
    e, v = T.lobpcg(S, k=3)
  File "/usr/local/lib/python3.5/dist-packages/torch/_lobpcg.py", line 261, in lobpcg
    worker.run()
  File "/usr/local/lib/python3.5/dist-packages/torch/_lobpcg.py", line 408, in run
    self.update()
  File "/usr/local/lib/python3.5/dist-packages/torch/_lobpcg.py", line 343, in update
    self._update_ortho()
  File "/usr/local/lib/python3.5/dist-packages/torch/_lobpcg.py", line 498, in _update_ortho
    self.X[:, nc:] = mm(S_, Z[:, :n - nc])
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:60)
Traceback (most recent call last):
  File "lob.py", line 12, in <module>
    loss.backward() # breaks here
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10, 5]], which is output 0 of SliceBackward, is at version 14; expected version 11 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I have a feeling that the problem is that torch.lobpcg's implementation is using an in-place operation when it shouldn't be.

This happened with torch.__version__ == '1.5.0+cpu', installed with pip, on Windows 10 WSL (Windows Subsystem for Linux) with Python 3.5.2.

Can this be fixed, or is torch.lobpcg not meant to support autograd?

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @vincentqb @vishwakftw @jianyuh @mruberry @SsnL

Labels: autograd, linear algebra, triaged

All 67 comments

@pearu, could you please take a look at this?

The LOBPCG algorithm is iterative, and indeed, the torch.lobpcg implementation uses in-place operations which cannot be avoided. However, the current issue can be resolved by implementing autograd support for torch.lobpcg: the backward algorithm used for torch.eig ought to be a good starting point, though it needs to be modified to account for the k parameter that restricts the number of eigenpairs.

For reference: https://github.com/pytorch/pytorch/issues/32531 - gradient of torch.eig.

The paper https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf
derives the backward formula for the standard eigenvalue problem that involves the inverse of the eigenvectors matrix. Since lobpcg provides only the first k eigenvectors, the formula is not applicable here.

@nikitaved what do you think, is the problem of backward solvable with the restricted set of eigenpairs?

The torch.eig function has a similar limitation that if you set eigenvectors=False, you won't be able to compute the backward pass.

It is possible to derive the gradient for rank-k lobpcg, but it is not going to be unique because it will be a solution to an overdetermined system of linear equations. I will post my derivations a bit later so that you can tell whether you agree with them...

Ah, sorry, the system is underdetermined, here it goes...
For simplicity, let's consider the basic rank-k eigenvalue problem of the form AU = UD, where A is n by n, rank(U) = rank(D) = k. Then applying the diff operator:

dA U + A dU = dU D + U dD,
U^T dA^T = (dU D + U dD - A dU)^T.

Let K := U^T, X:= dA^T, L:= (dU D + U dD - A dU)^T. We want to find X by solving the following matrix equation:

K X = L, where K,L in (k, n), X in (n, n).

we can split it into n systems of linear equations of the form

K X[i] = L[i].

As you can see, we have k < n equalities with n unknowns, so it is underdetermined.
One possible solution could be this:

Let I = {i_1, ..., i_k} be column indices such that rank(K[I]) = k, then
X[i]_I = K[I]^{-1} L[i] for indices in I, and X[i]_{[n] \ I} = 0.

The non-uniqueness comes from the system being underdetermined...

But I guess we might run into an issue when both input matrices for lobpcg require gradients and they are not some function of each other...

But in this case, if you consider the rank-k problem, AU = UD is not true in general. There is no rank-k matrix U and diagonal D that verify this. So you can't base your gradient derivation on this equality, right?

Take any k eigenvectors of A as U and fill the diagonal of D with the corresponding eigenvalues, there you go...

@albanD I am not sure I am following you. We have

E, V = torch.lobpcg(A, k=k)

so that

torch.mm(A, V) == torch.mm(V, torch.diag(E))

holds.

Replacing k with n will produce the same first k eigenpairs as when using the initial k.

Yeah, exactly, lobpcg is a function A -> U, D, such that AU = UD, in the case of the basic eigenvalue problem. In the backward we receive dU, dD and propagate them to dA through this equality.

Replacing k with n will produce the same first k eigenpairs as when using the initial k.

This is not actually true as lobpcg works only when 3*k<=n. Sorry for the confusion. But lobpcg will return the first k eigenpairs.

One possible solution could be this:

Let I = {i_1, ..., i_k} be column indices such that rank(K[I]) = k, then
X[i]_I = K[I]^{-1} B[i] for indices in I, and zero otherwise.

This method does seem to produce an X such that K X = L holds only for the k columns but not for the others (as the corresponding columns in L are non-trivial).

UPDATED: B -> L

One possible solution could be this:

Let I = {i_1, ..., i_k} be column indices such that rank(K[I]) = k, then
X[i]_I = K[I]^{-1} B[i] for indices in I, and zero otherwise.

This method does seem to produce an X such that K X = B holds only for the k columns but not for the others (as the corresponding columns in B are non-trivial).

I am not sure I understand... In X only a submatrix of size (k, n) is non-zero, the rest is zero.
X[i][I] = K[I]^{-1} L[i], and X[i][[n] \ I] = 0. To be more precise, X has k non-zero rows, the rest is zero.

@pearu My concern is how can the k top eigenvalues be exact if A has rank > k:

import torch

A = torch.rand(10, 10) # Most likely rank > 3
k = 3

E, V = torch.lobpcg(A, k=k)

# This will print != 0
print(( torch.mm(A, V) - torch.mm(V, torch.diag(E)) ).abs().max())

@albanD, take k eigenvectors/eigenvalues of A, so that we have Au_i = lambda_i u_i. Now let U = (u_1, ..., u_k), D = diag(lambda_1, ..., lambda_k). Then you have AU = UD.

Oh right, this system has only k columns! Thanks!

My concern is how can the k top eigenvalues be exact if A has rank > k

The input to lobpcg must be a symmetric positive definite matrix. So try

A = torch.mm(A.transpose(-2, -1), A)

before calling lobpcg.
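
Putting the two snippets together, a quick sanity check could look like this (a minimal sketch; the residual magnitude depends on the solver tolerance):

import torch

torch.manual_seed(0)
A = torch.rand(10, 10)
S = A.t() @ A                       # symmetric positive definite
E, V = torch.lobpcg(S, k=3)
# With an SPD input the k computed eigenpairs satisfy S V ~ V diag(E)
# up to the solver tolerance.
print((S @ V - V @ torch.diag(E)).abs().max())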

In that case you get an error on the order of 1e-3 (with the same parameters, n=10, k=3), which is a worst case, so OK I think.
But my concern is solved by seeing that we only have k vectors for which we verify this equality. So it makes sense that k eigenpairs are enough.

I am not sure I understand... In X only a submatrix of size (k, n) is non-zero, the rest is zero.
X[i][I] = K[I]^{-1} L[i], and X[i][[n] \ I] = 0. To be more precise, X has k non-zero rows, the rest is zero.

@nikitaved, it seems to be ok (I had a little bug before). Here is a quick example of your method:

import torch as T

class LOBPCG2(T.autograd.Function):

    @staticmethod
    def forward(ctx, A):
        k = 2
        e, v = T.lobpcg(A, k=k)
        res = T.mm(A, v) - T.mm(v, T.diag(e))
        assert (res.abs() < 1e-5).all()
        ctx.save_for_backward(e, v, A)
        return e, v

    @staticmethod
    def backward(ctx, de, dv):
        """
        solve `dA v + A dv = dv diag(e) + v diag(de)` for `dA`
        """
        e, v, A = ctx.saved_tensors

        vt = v.transpose(-2, -1)
        print('vt=', vt)
        print('de=', de)
        print('dv=', dv)
        rhs = (T.mm(dv, T.diag(e)) + T.mm(v, T.diag(de)) - T.mm(A, dv)).transpose(-2, -1)
        print('rhs=', rhs)

        n, k = v.shape
        K = vt[:, :vt.shape[0]]
        print('K.det=', K.det())  # should be > 0
        iK = K.inverse()

        dAt = T.zeros((n, n))
        dAt[:k] = T.mm(iK, rhs)[:k]
        print('dAt=', dAt)
        dA = dAt.transpose(-2, -1)

        res = T.mm(dA, v) + T.mm(A, dv) - T.mm(dv, T.diag(e)) - T.mm(v, T.diag(de))
        print('res=', res)
        return dA

T.random.manual_seed(123)

A = T.randn(6, 6)
S = A.matmul(A.t())
S.requires_grad_()

e, v = LOBPCG2.apply(S)

S_hat = T.einsum('ij,j,kj->ik', v, e, v) # v * diag(e) * v^T
loss = S_hat.abs().sum()
loss.backward()


(the printed res is close to zero).

@pearu, so, does it work? :)

Yes, the backward call works, but the method requires verification against the torch.symeig case and understanding the consequences of the arbitrariness of dA.

You can compare it for the case k=n, for example. With torch.symeig I mean.

But it is analytic. I am sure there must be some linear equation solver somewhere in PyTorch..

We do have torch.solve.
Also, if you want to compare to finite differences, you can use torch.autograd.gradcheck (with double inputs) with:

S = S.double().detach().requires_grad_(True)

# You need to make sure in your backward that your call to torch.zeros() creates the proper dtype
# by passing `dtype=A.dtype`.
T.autograd.gradcheck(LOBPCG2.apply, S)

Interestingly, the Jacobian wrt the eigenvalues given by the finite differences doesn't match because it generates a "dense" gradient while your method gives gradients only for the first two columns of A :/
Couldn't see the Jacobian wrt the eigenvectors as the other one fails before :/

Well, the issue is exactly that we have an infinite number of solutions to the system of linear equations K X = L, where X in (n, n), K, L in (k, n). WLOG assume that the first k columns of K are linearly independent, and call this block K_k. We split X into two submatrices, X_k of size (k, n) and X_-k of size (n-k, n), and similarly we split K (column-wise); then we get:

K X = L => [K_k K_-k] [X_k / X_-k] = L => K_k X_k + K_-k X_-k = L. K_k is full rank, so
X_k = K_k^{-1} (L - K_-k * X_-k).

We decided to pick one solution with X_-k = 0, but you can set X_-k to an arbitrary matrix and get a different solution X = [K_k^{-1} (L - K_-k * X_-k) / X_-k].
So, I am back to my claim, we have an infinite number of valid gradients ;)
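
A tiny numerical illustration of this parameterization (arbitrary sizes and names; assumes a recent torch with torch.linalg):

import torch

torch.manual_seed(0)
n, k = 6, 2
K = torch.randn(k, n).double()
L = torch.randn(k, n).double()
K_k, K_rest = K[:, :k], K[:, k:]            # assume the first k columns are independent
for _ in range(2):                          # two different choices of X_-k ...
    X_rest = torch.randn(n - k, n).double()
    X_top = torch.linalg.solve(K_k, L - K_rest @ X_rest)
    X = torch.cat([X_top, X_rest], dim=0)
    print((K @ X - L).abs().max())          # ... both satisfy K X = L (residual ~1e-15)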

Are all of these (local) subgradients?

Oh wait, we just found the forward differential via X; it is not the backward formula, however :) Sorry for the confusion, my bad. I will write down the backward formula tomorrow...

Could it be like this? Again, assuming that

A is symmetric (n, n), U is (n, k) and orthogonal, i.e. U^T U = I, D = diag(d_1,...,d_k).
We have AU = UD. Applying diff we get:
dA U + A dU = dU D + U dD [multiply by U^T on the left, orthogonality of U]
U^T dA U + U^T A dU = U^T dU D + dD, [make the substitution dU = U dC, dC in (k, k)]
U^T dA U + U^T (AU) dC = (U^T U) dC D + dD, [AU = UD, U^T U = I]
U^T dA U + D dC = dC D + dD. (*)

Consider D dC - dC D; since D is diagonal, it can be represented as
D dC - dC D = E o dC, where o is the Hadamard product and E_ij = d_i - d_j.

Substituting it all back into (*) gives:
U^T dA U + E o dC = dD.
Note that E o dC has zero diagonal, while dD is diagonal, hence
dD = I o (U^T dA U) (**) and
dC = F o (U^T dA U) (***), where F_ij = (d_i - d_j)^{-1} for i != j, and F_ij = 0 if d_i = d_j.
Recalling our substitute dU = U dC we get:
dU = U (F o (U^T dA U)) (****)

Now we have all the ingredients for the backward AD:
Tr(D.grad^T dD + U.grad^T dU) =
Tr(D.grad^T ( I o (U^T dA U))) + Tr(U.grad^T dU) = 
Tr(D.grad U^T dA U) + Tr(U.grad^T dU) = [cyclic property of trace]
Tr(U D.grad U^T dA) + Tr(U.grad^T dU) =
[1] + [2],

[2] =  Tr(U.grad^T dU) = Tr(U.grad^T U (F o (U^T dA U))) = [Tr(A (B o C)) = Tr((A o B^T) C)]
Tr((U.grad^T U o F^T)(U^T dA U)) =
Tr(U (U.grad^T U o F^T) U^T dA)

In [1] and [2] we collect the terms to the left of dA and transpose them to get:
A.grad = U (D.grad + (U^T U.grad o F)) U^T.

gradcheck gives:

RuntimeError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[ 5.1314e-04,  1.3259e-01],
        [-2.1311e-02,  5.7138e-02],
        [-1.1582e-02, -5.3796e-01],
        [-2.1237e-02, -1.5204e-01],
        [-2.2161e-02,  3.7977e-01],
        [-2.2827e-02,  4.2838e-03],
        [ 0.0000e+00,  0.0000e+00],
        [ 2.2126e-01,  6.1558e-03],
        [ 2.4050e-01, -1.1592e-01],
        [ 4.4100e-01, -3.2760e-02],
        [ 4.6018e-01,  8.1831e-02],
        [ 4.7400e-01,  9.2305e-04],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 6.5354e-02,  5.4569e-01],
        [ 2.3968e-01,  3.0844e-01],
        [ 2.5010e-01, -7.7046e-01],
        [ 2.5761e-01, -8.6907e-03],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 2.1974e-01,  4.3584e-02],
        [ 4.5860e-01, -2.1774e-01],
        [ 4.7237e-01, -2.4561e-03],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 2.3927e-01,  2.7195e-01],
        [ 4.9292e-01,  6.1352e-03],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00],
        [ 2.5386e-01,  3.4603e-05]], dtype=torch.float64)
analytical:tensor([[ 5.1314e-04,  1.3259e-01],
        [-1.0655e-02,  2.8569e-02],
        [-5.7910e-03, -2.6898e-01],
        [-1.0619e-02, -7.6018e-02],
        [-1.1081e-02,  1.8989e-01],
        [-1.1413e-02,  2.1419e-03],
        [-1.0655e-02,  2.8569e-02],
        [ 2.2126e-01,  6.1558e-03],
        [ 1.2025e-01, -5.7958e-02],
        [ 2.2050e-01, -1.6380e-02],
        [ 2.3009e-01,  4.0916e-02],
        [ 2.3700e-01,  4.6153e-04],
        [-5.7910e-03, -2.6898e-01],
        [ 1.2025e-01, -5.7958e-02],
        [ 6.5354e-02,  5.4569e-01],
        [ 1.1984e-01,  1.5422e-01],
        [ 1.2505e-01, -3.8523e-01],
        [ 1.2881e-01, -4.3453e-03],
        [-1.0619e-02, -7.6018e-02],
        [ 2.2050e-01, -1.6380e-02],
        [ 1.1984e-01,  1.5422e-01],
        [ 2.1974e-01,  4.3584e-02],
        [ 2.2930e-01, -1.0887e-01],
        [ 2.3619e-01, -1.2281e-03],
        [-1.1081e-02,  1.8989e-01],
        [ 2.3009e-01,  4.0916e-02],
        [ 1.2505e-01, -3.8523e-01],
        [ 2.2930e-01, -1.0887e-01],
        [ 2.3927e-01,  2.7195e-01],
        [ 2.4646e-01,  3.0676e-03],
        [-1.1413e-02,  2.1419e-03],
        [ 2.3700e-01,  4.6153e-04],
        [ 1.2881e-01, -4.3453e-03],
        [ 2.3619e-01, -1.2281e-03],
        [ 2.4646e-01,  3.0676e-03],
        [ 2.5386e-01,  3.4602e-05]], dtype=torch.float64)


Notice that some elements of the numerical and analytical Jacobians match, while others do not.

Could you show your code? And could it be because of the approximation to the true eigenspace? We can test it by running lobpcg for longer...

@albanD is the following behavior of gradcheck expected?

import torch as T
T.random.manual_seed(123)
A = T.randn(2, 2).double()
S = A.matmul(A.t())
S = S.double().detach().requires_grad_(True)
def mysymeig(A):
    return T.symeig(A, eigenvectors=True)
T.autograd.gradcheck(mysymeig, S)

results in

RuntimeError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[ 0.9947,  0.0053],
        [-0.1449,  0.1449],
        [ 0.0000,  0.0000],
        [ 0.0053,  0.9947]], dtype=torch.float64)
analytical:tensor([[ 0.9947,  0.0053],
        [-0.0724,  0.0724],
        [-0.0724,  0.0724],
        [ 0.0053,  0.9947]], dtype=torch.float64)

@albanD, note that the numerical value is the sum of the lower and upper triangles of the analytic result, as if gradcheck took into account the symmetry of the input...

The issue here is that gradcheck does finite differentiation only.
But we do this in an unstructured way which breaks the symmetry of A.
The following works for me:

import torch as T
T.random.manual_seed(123)
A = T.randn(2, 2).double()
S = A.double().detach().requires_grad_(True)
def mysymeig(A):
    A = A.matmul(A.t())
    return T.symeig(A, eigenvectors=True)
T.autograd.gradcheck(mysymeig, S)

@albanD, this behavior is quite surprising as the modified mysymeig no longer does what it was supposed to do: solve the eigenproblem for A, not for A A^T.

Are there any references for this design of gradcheck or autograd in general?

For this specific case, you could modify the gradcheck implementation to make sure you follow the hyperplane of symmetric matrices.
Changing l153 of gradcheck:

            x_tensor = x_tensor.data
            for d_idx, x_idx in enumerate(product(*[range(m) for m in x_tensor.size()])):
                assert len(x_idx) == 2
                # Perturb the symmetric pair (i, j) and (j, i) together so the input
                # stays symmetric; the difference then picks up both (equal) partial
                # derivatives, hence the extra factor mul = 2 in the denominator.
                if x_idx[0] != x_idx[1]:
                    mul = 2
                else:
                    mul = 1
                orig = x_tensor[x_idx].item()
                x_tensor[x_idx] = orig - eps
                x_tensor[tuple(reversed(x_idx))] = orig - eps
                outa = fn(input).clone()
                x_tensor[x_idx] = orig + eps
                x_tensor[tuple(reversed(x_idx))] = orig + eps
                outb = fn(input).clone()
                x_tensor[x_idx] = orig
                x_tensor[tuple(reversed(x_idx))] = orig
                r = (outb - outa) / (mul * 2 * eps)
                d_tensor[d_idx] = r.detach().reshape(-1)

After this change, your original code passes the test.

Are there any references for this design of gradcheck or autograd in general?

Unfortunately no :/ It "simply" does finite differentiation along each basis vector e_i.
If you ignore the complex, sparse, xla stuff, it boils down to:

def get_numerical_jacobian(fn, input, target=None, eps=1e-3):
    if target is None:
        target = input
    output_size = fn(input).numel()
    # Make the empty jacobian with the proper size
    jacobian = make_jacobian(target, output_size)

    # It's much easier to iterate over flattened lists of tensors.
    # These are reference to the same objects in jacobian, so any changes
    # will be reflected in it as well.
    x_tensors = iter_tensors(target, True)
    j_tensors = iter_tensors(jacobian)

    for x_tensor, d_tensor in zip(x_tensors, j_tensors):
        # Use .data here to get around the version check
        x_tensor = x_tensor.data
        for d_idx, x_idx in enumerate(product(*[range(m) for m in x_tensor.size()])):
            # Original value
            orig = x_tensor[x_idx].item()
            # Value on left
            x_tensor[x_idx] = orig - eps
            outa = fn(input).clone()
            # value on right
            x_tensor[x_idx] = orig + eps
            outb = fn(input).clone()
            x_tensor[x_idx] = orig
            # Compute the ratio
            r = (outb - outa) / (2 * eps)
            d_tensor[d_idx] = r.detach().reshape(-1)

    return jacobian

Which, for a function f with a single input/output, does:
grad = (f(x + eps) - f(x - eps)) / (2 * eps),
with a for-loop for multiple inputs/outputs.
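
As a minimal illustration of that formula on a scalar function (example values only, not from gradcheck itself):

import torch

def f(x):
    return (x ** 3).sum()

x = torch.tensor(2.0, dtype=torch.float64)
eps = 1e-6
numerical = (f(x + eps) - f(x - eps)) / (2 * eps)
print(numerical.item(), (3 * x ** 2).item())  # both ~12.0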

Thanks for these details.

To resolve this issue, we'll need to implement lobpcg_backward. The formula for backward is (using the notation from NA-08-01.pdf):

A_bar = U (D_bar + ((U^T U_bar) o F)) U^T

The main question is what lobpcg_backward ought to return to match autograd expectations.
I would assume that it should return A_bar.

Note that the symeig_backward implementation returns (A_bar + A_bar^T)/2.

It seems that to use gradcheck in tests, one has to use your modification (that is also in test_autograd.py::test_symeig).
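
Written out as code, a minimal sketch of this A_bar formula (hypothetical names; assumes distinct eigenvalues and uses F_ij = 1/(d_j - d_i), the convention from the corrected derivation further below) could be:

import torch

def lobpcg_backward_sketch(D, U, D_bar, U_bar):
    # D: the k computed eigenvalues (1-D tensor), U: (n, k) eigenvectors,
    # D_bar, U_bar: incoming gradients wrt D and U.
    F = D.unsqueeze(-2) - D.unsqueeze(-1)            # F_ij = d_j - d_i
    F.diagonal().fill_(float('inf'))
    F = F.reciprocal()                               # 1 / (d_j - d_i), zero diagonal
    A_bar = U @ (torch.diag(D_bar) + F * (U.t() @ U_bar)) @ U.t()
    return (A_bar + A_bar.t()) / 2                   # symmetrize, as symeig_backward does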

Let me add my 2 cents. In

A_bar = U (D_bar + ((U^T U_bar) o F)) U^T

U and D uniquely determine the n-by-n matrix A and their perturbations uniquely determine the perturbed A, and vice versa. The lobpcg setup operates with just a small subset of k eigenvectors, columns of the n-by-n matrix U, which cannot possibly reconstruct the n-by-n matrix A uniquely without imposing constraints on A. One possibly reasonable constraint is assuming that A has rank k (with all non-zero eigenvalues), which would allow for a unique and possibly numerically stable backward formula, I guess.

@pearu, sorry for the delay. So, the backward for the rank-k eigenproblem from above seems fine; I only updated the definition of E, which was incorrect: it has to be E_ij = d_i - d_j, and similarly for F_ij. And, as we already checked, it is the same formula as in symeig_backward.

Now, consider the generalized version AU = BUD. We know that B is positive definite, so it is full-rank, hence B^{-1} exists. Let C = B^{-1}A, then the generalized problem can be rewritten as CU = UD. For this problem we already know how to get the backward; in other words, during the backward AD:

Tr(U_grad^T dU) + Tr(D_grad^T dD) = Tr(C_grad^T dC), where
C_grad is the backward result of an eig problem with the input C=B^{-1}A and the output (U, D), i.e.
C_grad = symeig_backward(U_grad, D_grad, U, D, C).

So, we can find C_grad, now we need to find A_grad and B_grad.

C = B^{-1}A => dC = dB^{-1}A + B^{-1}dA                                               (1)

Recall that BB^{-1} = I => dB B^{-1} + B dB^{-1} = 0 => dB^{-1} = -B^{-1} dB B^{-1}   (2)

Plug (2) into (1):

dC = -B^{-1} dB B^{-1} A + B^{-1} dA

Back to the backward AD:

Tr(C_grad^T dC) = Tr(-C_grad^T B^{-1} dB B^{-1} A) + Tr(C_grad^T B^{-1} dA)
= Tr(-B^{-1} A C_grad^T B^{-1} dB) + Tr(C_grad^T B^{-1} dA), hence [B^{-1}^T = B^{-1}]

A_grad = B^{-1} C_grad,
B_grad = -B^{-1} C_grad C^T.

Please, check the math.

Also, to improve numerical stability,

C = B^{-1} A is the solution to [BX = A],
A_grad = B^{-1} C_grad is the solution to [B X = C_grad],
B_grad = -B^{-1} C_grad C^T is the X solution to [solve for Y: B Y = -C_grad A, solve for X: XB = Y]

All the systems involve B, so it is possible to compute some factorization (Cholesky) once and reuse it for the systems of linear matrix equations.
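
A sketch of that reuse (hypothetical helper name; assumes a recent torch with torch.linalg, symmetric A, and SPD B):

import torch

def generalized_grads_sketch(C_grad, A, B):
    L = torch.linalg.cholesky(B)                 # B = L L^T, computed once
    A_grad = torch.cholesky_solve(C_grad, L)     # solves B X = C_grad
    Y = torch.cholesky_solve(-C_grad @ A, L)     # solves B Y = -C_grad A  (A = A^T)
    B_grad = torch.cholesky_solve(Y.t(), L).t()  # X B = Y  <=>  B X^T = Y^T  (B = B^T)
    return A_grad, B_grad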

@pearu, @albanD, lobpcg is not defined in native_functions.yaml. It is not going to be a problem with derivatives.yaml, right? How can I extract the relevant schema which could be used in derivatives.yaml?
Or, in other words, how can I implement the backward function of any function not defined in native_functions.yaml?

Hey,

The function is implemented here.
So I think the simplest way to go about this is to actually create a custom autograd.Function (https://pytorch.org/docs/stable/notes/extending.html) with the forward that already exists and your new backward.
Then you can make torch.lobpcg point to LOBPCGFunction.apply.

@nikitaved , see https://github.com/Quansight/pearu-sandbox/blob/master/pytorch/issue38948.py for implementing lobpcg backward in Python. Be warned, it is a dirty script.

Ah, sorry, my comment got deleted by mistake. derivatives.yaml allows specifying grads for any subset of inputs. How do I do that with the autograd.Function approach? There are 3 cases: grad wrt the joint (A, B), wrt A only, wrt B only.

You can check the example in the doc I sent. But you have the ctx.needs_input_grad list that tells you what you need to compute the gradient for. You can use that to condition your backward code to only do the work that is needed (and you should return None for things that don't need gradients).
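
A minimal sketch of that structure (hypothetical wrapper name; the gradient formulas themselves are placeholders here):

import torch

class GEigK(torch.autograd.Function):

    @staticmethod
    def forward(ctx, A, B, k):
        e, v = torch.lobpcg(A, k=k, B=B)
        ctx.save_for_backward(A, B, e, v)
        return e, v

    @staticmethod
    def backward(ctx, e_grad, v_grad):
        A, B, e, v = ctx.saved_tensors
        A_grad = B_grad = None
        if ctx.needs_input_grad[0]:
            A_grad = torch.zeros_like(A)   # placeholder for the A_grad formula derived above
        if ctx.needs_input_grad[1]:
            B_grad = torch.zeros_like(B)   # placeholder for the B_grad formula derived above
        # k is a plain int, so it never needs a gradient
        return A_grad, B_grad, None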

I get what you say, so I can explicitly check which inputs require grad. Thank you!

If I understand correctly the above, the backward computation is algorithm agnostic, i.e. it comes from the perturbation theory of eigenvalues and eigenvectors that are assumed to be computed exactly. LOBPCG only computes approximately, so the results of the backward computation are going to match only quite approximately. E.g., in an extreme case where LOBPCG performs just one iteration, the results of the forward and backward computations are unlikely to match each other well. This is very different from what is typically meant. Please confirm or correct me.

If I understand correctly the above, the backward computation is algorithm agnostic

yes

LOBPCG only computes approximately, so the results of the backward computation are going to match only quite approximately. E.g., in an extreme case where LOBPCG performs just one iteration, the results of the forward and backward computations are unlikely to match each other well. This is very different from what is typically meant. Please confirm or correct me.

yes, that is all correct, however, I don't see problems with the above in the sense that if a user asks for less accuracy from lobpcg, s/he cannot expect forward/backward to be exact either.

Thanks for the confirmation! I hope that this distinction is made clear to avoid confusion. There are simple iterative methods where the forward/backward pair can be implemented quite accurately, for a fixed initial approximation.


@albanD , the lobpcg implementation accepts 14 parameters. Is it possible to implement forward/backward as a class method, and not as a static method, to make forward/backward accept and return only 2 parameters?

You can do anything you want :D
You can make the whole thing a custom Function that takes all 14 parameters (and will return None gradients for most of them as they are not differentiable).
Or you can have some pre/post processing that is autodiffed, and keep only the core as a simpler custom Function with a limited number of arguments for which you implement the forward/backward.

@nikitaved Since the proposed backward algorithm is generic and thus unrelated to lobpcg, its interface should not mimic that of lobpcg.

Nothing I can do about it: the forward replicates the interface of lobpcg, and the backward requires grads for each input of forward. I could probably use the old autograd.Function style with a constructor to save the lobpcg-relevant data and implement forward/backward as for an exact generalized eigenvalue solver, but the new style, with static methods, is probably better. I could be wrong, however.

Maybe it actually makes sense to create a method with a name something like geigk (generalized eigenvalue problem of rank k, exactly the autograd.Function wrapper), then implement backward for it, and then state in the documentation that its forward is implemented with lobpcg. This method could use fewer parameters by ignoring most of the optional ones from lobpcg's interface.

I could probably use some old autograd.Function style

Nope, most of that code has been removed. This doesn't work anymore.

This method could be using less parameters by ignoring most optional ones from the lobpcg's interface.

Does such a generalized solver exist in another Python lib like scipy? If so, we can try to replicate its API.
If none exists, we have a bit more freedom here.

Does such a generalized solver exist in another Python lib like scipy? If so, we can try to replicate its API.
If none exists, we have a bit more freedom here.

Yes, scipy's eig does support the generalized interface. What about eigk, to distinguish it from the "full-rank" eig?
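
For reference, the symmetric variant scipy.linalg.eigh already takes a second matrix for the generalized problem A v = w B v (example values only):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = A + A.T                        # symmetric
B = rng.standard_normal((6, 6))
B = B @ B.T + 6 * np.eye(6)        # symmetric positive definite
w, v = eigh(A, B)                  # generalized problem
print(np.abs(A @ v - B @ v @ np.diag(w)).max())  # ~1e-14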

Well, the formula above for the symmetric case, as autograd shows, is incorrect. Upon a closer look I realized I cannot justify the substitution dU = U dC, as there is no guarantee that span(dU) is a subspace of span(U).

There is an alternative derivation below in which I am stuck at one point.

U^T U = I => dU^T U + U^T dU = 0 => du := U^T dU is skew-symmetric, so diag(du) = 0.
AU = UD => 
dA U + A dU = dU D + U dD [left-multiply by U^T]               (*)
U^T dA U + U^T A dU = du D + dD [AU = UD => U^T A = D U^T]
U^T dA U + D du = du D + dD
U^T dA U = (du D - D du) + dD [because du is skew symmetric, diag(du D) = diag(D du) = 0]

so we get
dD = I o (U^T dA U), and, similar to the derivation from above (**)
du = F o (U^T dA U), where F_ij = (d_j - d_i)^{-1}, F_ii = 0   (***)

Here goes the critical part

Let U_ortho be an orthonormal basis of the subspace orthogonal to span(U).
Then we can write
dU = U du + U_ortho dX, where dX is (n-k) x k. This is true for some dX because <U, U_ortho> form a basis of R^n.

By left-multiplying (*) by U_ortho^T, and using the decomposition of dU and orthogonality, we get:
U_ortho^T dA U + U_ortho^T A U_ortho dX = dX D. (x)

(x) is a Sylvester equation wrt dX and we need to solve it explicitly, so that dA enters this solution as a part of matrix products. Do you know how to solve it? D is diagonal.

All problems disappear, of course, once we know the whole eigenspace of A, because then the Sylvester equation can be written as B = dX o C for some matrices B and C.

The Sylvester equation from above can be written as

C dX + dX D = E.

The symeig_backward assumes that A has distinct eigenvalues, so under this assumption we have spec(C) intersect spec(D) = the empty set, and the system has a unique solution.

This paper

Hu, Q., & Cheng, D. (2006). 
The polynomial solution to the Sylvester matrix equation.
Applied mathematics letters, 19(9), 859-864.

states that the equation from above can be solved explicitly as:

dX = p_D(C)^{-1} \sum_{i=1}^k b_i \sum_{j=1}^{i-1} C^{i-1-j} E D^j,
where p_D is a characteristic polynomial of D with coefficients b_i.

Since D is diagonal, its characteristic polynomial is simple and its coefficients can be found via Horner's rule.
Since E = U_ortho^T dA U, we can see that the explicit solution for dX contains dA only to the first power and only as part of matrix products, which is exactly what we need to be able to gather the terms multiplied with dA in the backward AD.
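
As a quick numerical illustration of the column-wise structure (not the polynomial route above): with D = diag(d_1, ..., d_k), the equation C X + X D = E decouples into one linear solve (C + d_j I) x_j = e_j per column, assuming -d_j is not an eigenvalue of C:

import torch

torch.manual_seed(0)
n, k = 6, 2
C = torch.randn(n, n).double()
d = torch.rand(k).double() + 10.0               # keep -d_j away from spec(C)
E = torch.randn(n, k).double()
X = torch.stack(
    [torch.linalg.solve(C + d[j] * torch.eye(n, dtype=C.dtype), E[:, j]) for j in range(k)],
    dim=1,
)
print((C @ X + X @ torch.diag(d) - E).abs().max())   # ~1e-15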

A PR that implements support for the symmetric case with B=I has been merged and is available in master. After a short break I plan to implement the remaining two cases, that is

  1. Grad wrt A for arbitrary B when B.requires_grad == False and
  2. Grad wrt A and B when both require grads.

There could be some numerical issues popping up once k is large enough. Any feedback on such behaviors is much appreciated!

Hi @nikitaved, I've implemented the backward for symmetric A & B in xitorch for your reference. I have to admit it was not an easy task. I can write down the math derivation if you want.

@mfkasim1, hi! The symmetric case is already in PyTorch. The case with a non-trivial B requires the matrix square root; there is an issue I am assigned to. Maybe it is possible to avoid it. I am about to make a write-up. We could compare the math or become co-authors, up to you! You can check the code in torch/_lobpcg.py.

Looks like you propagate through the whole eigenspace, right? Here the issue is that you only have a k-subspace, where k is generally much less than the dimension n.

@nikitaved Ah, I thought you hadn't implemented it for non-identity B.
I'm using implicit matrices/linear operators for A and B in xitorch (I only need to know Av and Bv), so it works for sparse matrices, if you're interested in implementing that too. However, it involves iterative solves (e.g. cg, bicgstab, etc.).

Looks like you propagate over the whole eigenspace, right? Here the issue is that you only have a k-subspace, where k is generally much less than the dimension n

The xitorch.linalg.symeig function actually looks very similar to lobpcg in that it only retrieves k eigenpairs for k << n and is intended for applications where you can't store the n-by-n matrix.

The write-up sounds interesting. Where do you plan to write it?

I see, thank you! Then you are right! B is only the identity as of now. It would be cool to see alternatives which involve different methods.

Well, at least arXiv would be great for starters...

The math trick avoiding the square root of B is to notice that inv(B)A is symmetric in the B-based scalar product, i.e. x'Bx. So one can apply whatever standard theory or algorithms for generalized eigenvalue problems, as soon as B is SPD (our case). After the substitution, in many places one gets inv(B)B so it just goes away. But in a few key spots one still needs implementing vector multiplication by inv(B)A - that is where a linear solve for B is needed.
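
A quick numerical check of that observation (example values; M = inv(B) A is computed via a solve rather than an explicit inverse):

import torch

torch.manual_seed(0)
n = 5
A = torch.randn(n, n).double()
A = A + A.t()                                   # symmetric
B = torch.randn(n, n).double()
B = B @ B.t() + n * torch.eye(n).double()       # SPD
M = torch.linalg.solve(B, A)                    # M = inv(B) A
x = torch.randn(n).double()
y = torch.randn(n).double()
lhs = x @ B @ (M @ y)                           # <x, M y>_B
rhs = (M @ x) @ B @ y                           # <M x, y>_B
print((lhs - rhs).abs())                        # ~1e-14: M is self-adjoint in <., .>_B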

Upon further inspection, I can see how to extend the already implemented method to the generalized problem. And, yes, it does not require the matrix square root, although it does become more computationally heavy as I can no longer exploit simultaneous diagonalization in the explicit solution to the Sylvester equation (check needed).

I will try and compare this solution to the solution with the matrix square root. Not sure which one is going to be faster.

@nikitaved for your reference, I have written down the symeig derivative in xitorch: https://xitorch.readthedocs.io/en/latest/notes/deriv_symeig.html
