What steps did you take and what happened:
During restore, Velero skips restoring a service if the service already exists in the namespace. However, for nodePort services that require node ports to be preserved on restore, restores always fail if the same nodePort service already exists in the target namespace. It appears that during the restore Velero tries to create the service in the target namespace, but the API server rejects the operation because the node port to be restored is already allocated to the preexisting service. The steps below reproduce this issue:
$ kubectl create ns nginx
$ cat <<EOF | kubectl apply -n nginx -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    nodePort: 32555
  selector:
    app: nginx
  type: NodePort
EOF
$ kubectl create deployment nginx --image=nginx -n nginx
$ velero backup create nginx-backup --include-namespaces nginx
$ kubectl delete deployment nginx -n nginx
$ velero restore create --from-backup nginx-backup
Restores fail consistently with the error "provided port is already allocated":
{"errors":{"namespaces":{"nginx":["error restoring services/nginx/nginx: Service \"nginx\" is invalid: spec.ports[0].nodePort: Invalid value: 32555: provided port is already allocated"]}},"warnings":{"namespaces":{"nginx":["could not restore, endpoints \"nginx\" already exists. Warning: the in-cluster version is different than the backed-up version."]}}}
What did you expect to happen:
Velero should skip restoring the nodePort service if the service already exists in the target namespace. This keeps restore behavior consistent with how Velero handles other objects it finds already present in the namespace.
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
- kubectl logs deployment/velero -n velero
- velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
- velero backup logs <backupname>
- velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
- velero restore logs <restorename>

Anything else you would like to add:
This issue is closely related to #1950.
Environment:
- Velero version (use velero version): v1.2.0
- Velero features (use velero client config get features):
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):

I think the issue is at this line. We expect the API server to say that the object already exists if it looks the same, but in the case of NodePorts that are re-using the same port, it gives us a different error - that it's invalid.
I confirmed this by printing out the error, and this is what I saw:
restoreErr was: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"Status", APIVersion:"v1"}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"Service \"nginx\" is invalid: spec.ports[0].nodePort: Invalid value: 32555: provided port is already allocated", Reason:"Invalid", Details:(*v1.StatusDetails)(0xc0007c1680), Code:422}}

time="2020-03-04T20:50:34-05:00" level=info msg="error restoring nginx: Service \"nginx\" is invalid: spec.ports[0].nodePort: Invalid value: 32555: provided port is already allocated" logSource="pkg/restore/restore.go:1200" restore=velero/nginx-backup-20200304205033
Digging into the StatusDetails a little more with the following patch:
if apierrors.IsInvalid(restoreErr) {
	// Print the StatusError's Reason and Details to see exactly what the API server returned.
	statusErr := restoreErr.(*apierrors.StatusError)
	fmt.Printf("Err Reason: %#v: ", statusErr.Status().Reason)
	fmt.Printf("Err Details: %#v", statusErr.Status().Details)
}
Err Reason: "Invalid": Err Details: &v1.StatusDetails{Name:"nginx", Group:"", Kind:"Service", UID:"", Causes:[]v1.StatusCause{v1.StatusCause{Type:"FieldValueInvalid", Message:"Invalid value: 32555: provided port is already allocated", Field:"spec.ports[0].nodePort"}}, RetryAfterSeconds:0}time="2020-03-04T20:57:21-05:00" level=info msg="error restoring nginx: Service \"nginx\" is invalid: spec.ports[0].nodePort: Invalid value: 32555: provided port is already allocated" logSource="pkg/restore/restore.go:1204" restore=velero/nginx-backup-20200304205719
So, to fix this for nodePort services, I think we could do the following, which I have not yet tested.
- In addition to checking for apierrors.AlreadyExists, check for apierrors.IsInvalid.
- Check the restoreError.Status().Details.Causes slice for nodePort in Field and already allocated in Message.
- If the Causes match this case, then the Service is the same and log it as such instead of erroring.

I'm not 100% happy with this as it relies on a text string in the error message that could change, but I haven't yet checked on the API server side to see what exactly it generates, so maybe there's a better solution.
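For illustration, here is a minimal, untested sketch of that check, based on the Details.Causes output shown above. The helper name isNodePortAlreadyAllocated is invented for this example and is not existing Velero code:

```go
package restore

import (
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isNodePortAlreadyAllocated is a hypothetical helper (not in Velero today).
// It reports whether an Invalid error from the API server is specifically the
// "nodePort ... provided port is already allocated" conflict, so the restore
// could log the Service as already existing instead of failing.
func isNodePortAlreadyAllocated(restoreErr error) bool {
	if !apierrors.IsInvalid(restoreErr) {
		return false
	}
	statusErr, ok := restoreErr.(*apierrors.StatusError)
	if !ok || statusErr.Status().Details == nil {
		return false
	}
	for _, cause := range statusErr.Status().Details.Causes {
		// Fragile, as noted above: this matches on the error text itself.
		if strings.Contains(cause.Field, "nodePort") && strings.Contains(cause.Message, "already allocated") {
			return true
		}
	}
	return false
}
```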
More generally, chasing down each case like Service Accounts and now Services in the core restore code doesn't seem sustainable. I'm wondering if some sort of RestoreConflict plugin type might make sense, which would allow us to specify criteria by which instances of a given type would be considered the same. That's a very rough idea at the moment and would require some more design work.
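To make the rough idea slightly more concrete, a plugin contract along these lines might work. The RestoreConflictResolver name and its methods are invented purely for illustration; no such plugin type exists in Velero:

```go
package restore

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// RestoreConflictResolver is purely illustrative. The idea is that a
// per-resource plugin could decide whether an error returned on create really
// means "an equivalent object already exists".
type RestoreConflictResolver interface {
	// AppliesTo reports whether this resolver handles the given resource type.
	AppliesTo(gr schema.GroupResource) bool

	// IsSameObject inspects the create error and the backed-up object, and
	// returns true if the conflict should be logged as "already exists"
	// rather than counted as a restore error.
	IsSameObject(createErr error, desired *unstructured.Unstructured) (bool, error)
}
```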
@skriss and @ashish-amarnath does my rough algorithm seem workable to you? I'll try to track down what the API server does to return the invalid error to see if there's something more reliable than string matching that we can do.
@ajohnstone and @jenting I'm tagging you too as you both commented on the other issue, and I'm able to reproduce, with detailed error structs.
Ok, so the error seems to originate in the port allocator, which is pretty core to Kubernetes. The error is passed directly through the REST API if the port is already allocated, which is why this doesn't show up as AlreadyExists.
Looking around here, I see it uses a RangeRegistry, which is similar to how IP address allocation is exposed. My guess is that objects whose allocated IPs are preserved on restore would hit similar errors, although in my experience those have been more dynamic. NodePorts have been a much more frequent problem for Velero users.
@nrb that approach sounds reasonable, though I agree that adding lots of special cases here is not going to be sustainable.
Another option we could consider is trying to Get the resource before trying to restore it, and using that to determine whether the resource already exists. That means an additional API call per resource we're restoring, though, so it potentially has some runtime impact.
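A rough, untested sketch of that Get-before-Create flow using the dynamic client; the function name and signature are invented for illustration and are not Velero's actual restore code:

```go
package restore

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
)

// restoreItem sketches the Get-before-Create idea: if the object is already in
// the cluster, skip it (optionally after comparing it to the backed-up copy)
// instead of relying on the Create call to fail with AlreadyExists.
func restoreItem(ctx context.Context, client dynamic.ResourceInterface, obj *unstructured.Unstructured) error {
	existing, err := client.Get(ctx, obj.GetName(), metav1.GetOptions{})
	if err == nil {
		// Already present in the target namespace; log and skip, or diff
		// existing vs. obj to decide whether to warn about drift.
		_ = existing
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return err // a real error, not just "doesn't exist yet"
	}
	_, err = client.Create(ctx, obj, metav1.CreateOptions{})
	return err
}
```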
@nrb @skriss How about trying a GET only on the services iff we find the last-applied-config annotation on the service during restore? This way we only make the API call if we're restoring a service that needs node port preservation. If a service does not have the last-applied-config annotation, we simply delete the node ports as we're doing currently and let the API server auto-assign the node ports.
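Something like the following check could gate the extra GET. The helper name is invented for illustration; the annotation key is the standard kubectl one, which Velero already looks at to decide whether node ports were set explicitly:

```go
package restore

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// The kubectl last-applied-configuration annotation key.
const lastAppliedConfigAnnotation = "kubectl.kubernetes.io/last-applied-configuration"

// shouldCheckForExistingService is a hypothetical helper: only pay for the
// extra GET when the Service carries the last-applied-config annotation,
// i.e. when its node ports should be preserved. Otherwise, clear the node
// ports as today and let the API server assign new ones.
func shouldCheckForExistingService(svc *unstructured.Unstructured) bool {
	_, ok := svc.GetAnnotations()[lastAppliedConfigAnnotation]
	return ok
}
```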
> How about trying a GET only on the services iff we find the last-applied-config annotation on the service during restore?

That works for Services applied by kubectl, but I think it still has the problem of adding special cases as they come up.
Reworking the entire logic to be based on a Get, rather than doing a Create and expecting an AlreadyExists error, may be useful in more cases. It may also serve us better for things like trying to overwrite items that differ from an existing backup, because then we'd have much more information about the exact differences rather than just knowing that the item already exists.
The problem is still present in 1.5.2.