Netbox: Nondeterministic ordering of interfaces with a "zero" channel or subinterface

Created on 9 Mar 2020 · 13Comments · Source: netbox-community/netbox

Environment

Python version: 3.7.5
NetBox version: 2.7.9

Steps to Reproduce

Create a device to which you can attach interfaces
Create 2 interfaces on this device, named ge-0/0/26 and ge-0/0/26.0 filling in any other required fields.
The sub-interface (ge-0/0/26.0) should be listed after the main interface (ge-0/0/26), and it may be on your screen.
In the postgres database, the _name column of the dcim_interface table for both entries will be the same.

Note: the problem that I had was trickier to find originally, and resulted in inconsistent behavior when requesting a multi-page group of interfaces via the API. If the ge-0/0/26 and ge-0/0/26.0 interfaces were to be the 50th and 51st items to be returned by the API, there was a very consistent behavior where 'ge-0/0/26' was returned both as the last item in the first call (limit=50 no offset), and the first item in the second call (limit=50&offset=50) and that the ge-0/0/26.0 item would not be returned at all. By running a larger limit, I could see that the problem didn't happen, which indicated a variance in what was being returned by the database query, which led me to the _name field ordering in the dcim_interface table.

Expected Behavior

Each of the interfaces should have a unique _name key in the database.

Observed Behavior

The _name key for an item with no .0 and an item with a .0 were identical. This is due to the ordering algorithm using 000000 as the empty placeholder as well as the value for 0. To the left of the interface name, this is not a problem, because the slot numbers use 9999 as placeholders instead of 0, and thus do not collide with the 0's.

I've got a patch for this which I'll be submitting in a pull request forthwith.

accepted bug

Source

gaige

👍1

Most helpful comment

@gaige Thanks but there's no need. I've made the changes and adapted the tests accordingly. Will be pushed soon; I'm trying to work around an issue with pycodestyle atm.

jeremystretch on 10 Mar 2020

👍2

All 13 comments

Note that I do have a pull request ready to go, but in accordance with how I'm interpreting the policy, I won't submit it until after the bug has been accepted. For reference, my proposed mechanism is to use a 4-space semaphore instead of a 4-0 semaphor to indicate a missing element. This both preserves order and differentiates between 0 and missing elements.

My branch containing the fix is available at:
https://github.com/cluetrust/netbox/tree/4366-interface-collision

gaige on 9 Mar 2020

Yeah, the old regex approach would treat a missing ID/channel as NULL and order appropriately. We didn't fully replicate this in #3799, instead setting a zeroized value.

For reference, my proposed mechanism is to use a 4-space semaphore instead of a 4-0 semaphor to indicate a missing element.

This is the right track, though I'd prefer to use a non-whitespace character such as a dot for better readability. Also, I want to avoid adding random examples to the tests as they become very difficult to manage over time.

jeremystretch on 10 Mar 2020

I'll go ahead and put a dot in there. Although I'm half-tempted to suggest a - (HYPHEN-MINUS) to prevent problems with . potentially being a pain in regular expressions. But, I'm happy to go either way.

How about the following for the test additions (with the appropriate changes as above):
('vlan0', '9999999999999999vlan000000 '), ('vlan0.0', '9999999999999999vlan000000 000000'), ('ge-0/0/1', '0000000099999999ge-000001 '), ('ge-0/0/1.0', '0000000099999999ge-000001 000000'), ('ge-0/0/1.2', '0000000099999999ge-000001 000002'), ('ge-1/2/3.4', '0001000299999999xe-000003 000004'),

In particular, this tests the two sub-interface types (hyphen-separated and not hyphen-separated) for Junos and sticks with the formula used previously in the Cisco-style interfaces for increasing digits.

I'd also replace the not equal verification to be (which I realized was incorrect):

self.assertNotEqual(naturalize_interface('ge-0/0/1', max_length=100), naturalize_interface('ge-0/0/1.0', max_length=100))

Let me know and I'll make the changes, test, and do the pull request.

gaige on 10 Mar 2020

@gaige Thanks but there's no need. I've made the changes and adapted the tests accordingly. Will be pushed soon; I'm trying to work around an issue with pycodestyle atm.

jeremystretch on 10 Mar 2020

👍2

Unfortunately this is not going to be such a simple fix. Because PosgtreSQL is using localized ordering (e.g. en_US for me), we cannot assume deterministic ordering of non-alphanumeric characters. For instance, PostgreSQL ignores the period and hyphen characters when ordering _name values.

Worse, this doesn't seem to be something we can fix with a schema migration: The entire database (or at least the table?) would likely need to be rebuilt in order to change the locale, which would not be practical. IMO we have two options:

Identify a "placeholder" character indicating null values which will _always_ be sorted before zero; or
Force collation to plain ASCII ordering on the column when sorting.

Option 2 is possible but we'll need to look more into the implementation details of enforcing the collation for the model's default ordering. IIRC there's a bug when using F() to wrap firelds in Meta.order_by that isn't fixed until Django 3.0, so this _might_ need to wait for NetBox v2.8.

For now, I've reverted my earlier patch, but kept the modified tests in place, which appear to be passing anyway.

jeremystretch on 10 Mar 2020

Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file). However, there was a false-positive in my original conflict test. I fixed it in the version that I did later but haven't pushed (see the comment earlier in this thread).

However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API. If you have 2 devices with the same _name the order they are returned can be different based on the specific query to the database. In this case, it causes 2 copies of a single interface and no copies of the second interface to be returned in the interface list retrieval if they happen to sit astride the page boundary. Since the python API doesn't allow you to adjust the page boundary, there's no way to guarantee that these interfaces will be returned (and not duplicated) in the interface list. Had this not shown up, I never would have bothered to create the fix (the ordering in the visuals isn't that important to me). It's made worse by the fact that any Junos-based switch with VLANs on it will have the "XX-NN/MM/OO" and "XX-NN/MM/OO.0" interfaces, and thus potentially trigger the issue.

gaige on 10 Mar 2020

Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file)

I injected a similar condition into the existing test instead. No need for a separate test.

However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API

The same ordering logic is used for both.

Since the python API doesn't allow you to adjust the page boundary

What are you talking about, the REST API or the Django ORM? Both allow you to adjust the page size.

jeremystretch on 10 Mar 2020

Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file)

I injected a similar condition into the existing test instead. No need for a separate test.

I'm not sure I'm seeing the test condition you're referring to. I did see these:

('Gi1', '9999999999999999Gi000001000000000000'), ('Gi1.0', '9999999999999999Gi000001000000000000'), ('Gi1.1', '9999999999999999Gi000001000000000001'), ('Gi1:0', '9999999999999999Gi000001000000000000'), ('Gi1:0.0', '9999999999999999Gi000001000000000000'),
But, they show that the .0 interfaces and the non-.0 interfaces have the same normalized value.

My separate test was designed to verify that the two items with different interface names created different normalized values. I'm not sure how that's represented here.

However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API

The same ordering logic is used for both.

Right, but as a user, you only see one page at a time. If you have 2 pages of interfaces, you probably: a) wouldn't notice the missing interface; or b) would search for it using the search box and you would find it.

However, as an API user, when I need to retrieve the list of all interfaces on an individual device (we've got an auditing program that validates the interfaces and bridge interfaces against the devices via SNMP), when I request items 0-49 and 50-99, I expect to get 100 different items representing the entire data set. And, if I set the limit in the API query to 0 or some sufficiently large number (using Paw or the Swagger tools), I get a good list. However, if I page the list (which is enforced at 50 in the pynetbox package), then I get 2 of one and 0 of the other.

Since the python API doesn't allow you to adjust the page boundary

What are you talking about, the REST API or the Django ORM? Both allow you to adjust the page size.

The Python library for accessing the API (pynetbox). Sorry if that wasn't clear. I could patch that to allow retrieving of all of the items, but that seemed the wrong approach given that the API was returning errant information.

Alternatively, the API bug could be fixed by returning the items in a "less pretty" order, such as by the id or the name, but right now, with the _name field being not unique, the database won't necessarily return the same data every time. Just an alternate thought. It would also fix any problems that we don't know about in the algorithm.

gaige on 10 Mar 2020

But, they show that the .0 interfaces and the non-.0 interfaces have the same normalized value.

Yes, because as noted we're still using the original zero-based approach until we come up with a solution. However, the test also evaluates the actual ordering of the two interfaces.

I will come back to this issue when I have more time to dedicate to it.

jeremystretch on 10 Mar 2020

I will come back to this issue when I have more time to dedicate to it.

I respect that. You need to prioritize items as you must. In my case, this bug in the API (duplicate and missing items) cost me quite a bit of time in diagnosing other issues, so we'll have to keep a separate branch until this gets addressed in some fashion that makes the data returned by the API correct.

Thanks,
-Gaige

gaige on 10 Mar 2020

Can we just add the PK as the last ordering field to act as a tie-breaker? Sure, it might not order these collisions as you'd want, but at least it wouldn't break pagination by introducing a duplicate record and omitting the other when the collision is at a page break.

I imagine changes to natural ordering will only reduce the likelihood of non-deterministic ordering; some combination of input will eventually raise the same issue as this one.

hSaria on 11 Mar 2020

👍1

Good thought! It also protects against future potential collisions. Although it'd be nice for the display to show the "shorter" interfaces first, my primary concern is the API contract violation, which this would definitely solve.

gaige on 11 Mar 2020