Visidata: [unintuitive] selecting negative values includes errors and nulls

Created on 19 Nov 2019  路  7Comments  路  Source: saulpw/visidata

I have a CSV that looks like this:

id,amount
1,0
2,15
3,-5
4,-2
5,
6,
7,2

In visidata I mark the amount column as an integer with #. Visidata gives rows 5 and 6 the ! notation since they are not valid integers.

I want to select the rows with a negative amount. Using z| I get to the select by expression prompt and enter amount < 0. This results in rows 3, 4, 5, and 6 being selected. I only expect rows 3, 4. The null values in 5 and 6 are not less than zero.

This seems to be somewhat reminiscent of the Python 2 behaviour where None < 0 evaluates to True, where in Python 3 it raises a TypeError.

Similarly, if I have the following sheet, I also end up with rows 3, 4, 5, and 6 selected using the same expression:

id,amount
1,0
2,15
3,-5
4,-2
5,foo
6,bar
7,2

I'm not sure if this a bug or the expected behaviour for invalid values like this. How should I go about selecting the valid negative values only?

Most helpful comment

Yeah, I think that will provide me a way to more easily identify unexpected data and then, if desirable, remove them from the sheet before proceeding with further select/sort operations on the valid data.

All 7 comments

Hi @pigmonkey a way is to start form z| and then amount < 0. A second step is to add g\ to unselect by regex, using ^$

Thanks. That does work for my first example. For my second example I could use a regex like [a-z] to unselect the foo and bar rows.

Is this the expected behaviour for the amount < 0 expression?

Hi @pigmonkey, this is an interesting case, thanks for filing an issue. I agree that this behavior is surprising and unintuitive, and unfortunately I can't see a way around it.

Errors (and nulls) need to exist somewhere specific on the number line so that sorting is stable. Artbitrarily, I chose negative infinity, so that when you sort ascending, the errors would come first. It could very easily be the other way (just change the True to False in TypedWrapper.__lt__).

So this is deliberate behavior, even if it is undesired in your case. You could do as @aborruso suggests and unselect ^$, or you could select all rows and then unselect amount >= 0.

I can think of some possible fixes for your issue, but we'd have to give them a lot of thought, as they would all affect existing use cases:

  • explicitly remove errors and nulls from consideration with selection operations, like we do with aggregators; but then you wouldn't be able to select them easily at all, and there would be other surprise cases that may need workarounds;
  • make TypedWrapper return False for all relative comparisons; but then sort would be unstable;
  • make TypedWrapper behave differently whether you're sorting or selecting or calculating (or what else); but this seems like a really bad idea.

Intuitively, the behaviour I expected was for invalid values to be treated as nullish, such that they would not be selected by an expression like amount < 0 but would be selected by an expression like amount == None. (When I noticed the undesired rows were being selected by amount < 0, my first response was to try amount < 0 and amount != None.)

I do think that the current sort behaviour, where invalid values are treated as the "most negative", is intuitive.

make TypedWrapper return False for all relative comparisons; but then sort would be unstable;

Is it possible to do this, but then make sort a two step process where valid values are first sorted normally, and then these wrapped invalid objects are appended or prepended to the list depending on the sort direction?

This also makes me curious if there is currently a canonical way to select all rows with invalid values. The regex selection can be used if I know what specific invalid things I want to (un)select -- like ^$ for empty values, [a-z] for values with lowercase letters where I expect digits, etc -- but it is not a great solution if I don't know what the errors may look like. Maybe I expect an integer and one of the values is a float like 3.14.

Having some way to select all invalid things would be useful. I could see myself often using this and copying the rows into a new sheet with " to see if the data in all the rows is as I expect it to be.

And if there were a way to do this, then "explicitly remove errors and nulls from consideration with selection operations" might be more attractive, since I could always use this alternative command when I did want to include the invalids.

Okay @pigmonkey, we're going to address this in two ways:

1) documenting this behavior on a new page visidata.org/limits and in other places in the docs, so that other people understand what's going on;
2) adding two new commands (unbound for now), select-error-col and select-error, which select all rows with errors in the current column or any column, respectively.

I think (2) this should at least give you a mechanism to accomplish what you originally wanted, even if it's a little clunky. Does this work for you?

Yeah, I think that will provide me a way to more easily identify unexpected data and then, if desirable, remove them from the sheet before proceeding with further select/sort operations on the valid data.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

aborruso picture aborruso  路  3Comments

zaiste picture zaiste  路  4Comments

aborruso picture aborruso  路  3Comments

pigreco picture pigreco  路  3Comments

polm picture polm  路  4Comments