Pandas: The input column name in query contains special characters

Created on 24 Jun 2019 · 20 comments · Source: pandas-dev/pandas

The input column name in pandas.DataFrame.query() contains special characters.
I saw the change in 0.25, but there are still problems with characters such as . \ /
And the main problem is that I can't restore these characters after converting them to "_", which is a very serious problem.

I have an idea:

input: pandas.dataframe
Intermediate processing: encode the special character as hex, e.g. `.` -> `x2E`, so pandas.dataframe becomes pandas`x2E`dataframe, then add a backslash after processing the data -> pandas\x2Edataframe
output: pandas\x2Edataframe
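A minimal sketch of this round-trip idea (the `encode_name`/`decode_name` helpers are hypothetical, not pandas API): replace each forbidden character with its hex code point so the original name can be restored losslessly afterwards.

```python
import re

def encode_name(name):
    # Replace every non-word character with an escape like "\x2E"
    # so the original name can be recovered exactly later.
    return re.sub(r'\W', lambda m: '\\x%02X' % ord(m.group()), name)

def decode_name(name):
    # Reverse the encoding: turn "\x2E" back into ".".
    return re.sub(r'\\x([0-9A-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), name)

encoded = encode_name('pandas.dataframe')   # 'pandas\x2Edataframe'
assert decode_name(encoded) == 'pandas.dataframe'
```

Note the encoded result still contains a backslash, so it is not a valid Python identifier either, which is part of why the discussion below moves toward backtick quoting instead.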

Labels: Enhancement, expressions

Most helpful comment

@TomAugspurger To give you little background. Using non-python identifiers was solved in https://github.com/pandas-dev/pandas/pull/24955 . However, it was only solved for spaces. This issue talks about extending that functionality. I actually already implemented it here: https://github.com/hwalinga/pandas/tree/allow-special-characters-query

Shall I make a PR? (I will then also make the change to allow numbers in the beginning.)

All 20 comments

Do you have a minimally reproducible example for this?

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#bug-reports-and-enhancement-requests

This is what I mean, something like this @WillAyd @hwalinga

import re
import pandas as pd

a = pd.DataFrame({'foo.bar': [11111, 2222, 333333]})

def in_columns_data(col):
    # Find the first non-word character in the first column name.
    special = re.findall(r'(\W)', list(col)[0])[0]
    if special == '.':
        # Encode "." as its hex code point so query() accepts the name.
        new = list(col)[0].replace(special, 'x2E')
        print(new)
        return new, 1
    elif special == ' ':
        new = list(col)[0].replace(special, 'x20')
        print(new)
        return new, 2

name, control = in_columns_data(a.columns)
a.columns = [name]
print(a)
a = a.query("{name} > 2222".format(name=name))

def out_columns_data(col, control):
    # Restore the original column name after the query.
    if control == 1:
        new = list(col)[0].replace('x2E', '.')
        print(new)
        return [new]
    elif control == 2:
        new = list(col)[0].replace('x20', ' ')
        print(new)
        return [new]

a.columns = out_columns_data(a.columns, control)
print(a)

the result is

foox2Ebar
   foox2Ebar
0      11111
1       2222
2     333333
foo.bar
   foo.bar
0    11111
2   333333

Can you try and make the example minimal? This link might help:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@WillAyd

This is what he means:

import pandas as pd
a = pd.DataFrame({'foo.bar':[11111,2222,333333],})
a.query("foo.bar > 2222")
# fails 

(NB, Chinese just works fine in Python3, and since new releases don't support Python2 anyway, that issue can be dropped.)

He is coming from https://github.com/pandas-dev/pandas/issues/6508 which I solved for spaces in https://github.com/pandas-dev/pandas/pull/24955

The question that is now on the table: Do we want to support other forbidden characters (next to space) as well?

The problem is that we need to work around the tokenizer, but since the tokenizer recognizes python operators that can be dealt with. (So . / - + *) However, \ is an escape character, which is probably a lot harder, I don't know if even possible.
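To illustrate the tokenizer constraint, the stdlib tokenize module (the same kind of tokenizer that query's expression handling builds on) treats the dot as an operator, so foo.bar is split into three tokens rather than kept as one name:

```python
import io
import tokenize

# List the NAME / OP / NUMBER tokens the Python tokenizer produces
# for the expression "foo.bar > 2222".
tokens = [
    (tokenize.tok_name[t.type], t.string)
    for t in tokenize.generate_tokens(io.StringIO('foo.bar > 2222').readline)
    if t.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
]
print(tokens)
# [('NAME', 'foo'), ('OP', '.'), ('NAME', 'bar'), ('OP', '>'), ('NUMBER', '2222')]
```

Because `.`, `/`, `-`, `+`, and `*` all come out as OP tokens, they can be recognized and rejoined after tokenization; a backslash, by contrast, is not a valid token at all and makes the tokenizer fail outright.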

So, shall I make a pull request for this, or isn't this necessary enough to pollute the code base further with? (I estimate I need to add one new function and alter two existing ones, from what I remember from the last PR I did on this.)

@jreback has reviewed the previous PR and may want to have a word on this before I start.

I am probably -1 on this; we by definition only support python tokens

you can always not use query which is a convenience method

I think the string input to query and eval is a convenient way to provide input, and it can improve the speed of processing data. I have been stuck on this problem because I found that strings can use regular expressions, format(), and other methods, which is very convenient. @jreback

this does not improve processing at all unless you are doing very simple operations

changing to non python semantics is cause for confusion

@danielhrisca @dgua

These people might also have a word on this, since they requested the same in the issue about the spaces.

I thought you would be skeptical @jreback, but let's view it this way: we already have a way to distinguish between valid Python names and invalid Python names. That is the backtick quoting I implemented to allow spaces. So there isn't much of a barrier left to extend from allowing `this kind of names` to also allowing `this.kind.of.names` or `this-kind-of-names`.

Not everybody in the world is used to snake_case, and the dots in the names would probably cater to the people coming from R. (In which having dots in your identifiers is basically the equivalent of underscores in python.) And who knows, maybe there is a webdev very happy to write his/her names as CSS class names.

If you are worried about how horrible the code will look, it is not that bad. I implemented it already. Take a look:
https://github.com/hwalinga/pandas/tree/allow-special-characters-query

excellent job @hwalinga
eval and query are the two methods I use most.
Looking forward to this update in 0.25; I will download it and test it first.

@jreback Do you want me to open a PR, or will you close this as a 'wontfix'?

@hwalinga I second that. One more use case besides this.kind.of.name is when a column name starts with a digit (which is not a valid Python variable name), like 60d or 30d.

Currently (as of 0.25.1) even using backticks doesn't work with columns starting with a digit.

df.eval("`column 60d` + `column 30d`") # works
df.eval("`60d` + `30d`") # fails

It seems that under the hood, Pandas replaces spaces by underscores and removes the backticks like this:

`hello there` + `foo bar`
# becomes
hello_there_BACKTICK_QUOTED_STRING + foo_bar_BACKTICK_QUOTED_STRING
# which is a valid Python expression

and

`30d` + `60d`
# becomes
30d_BACKTICK_QUOTED_STRING + 60d_BACKTICK_QUOTED_STRING
# which is NOT a valid Python expression

As a solution, one might suggest always prepending a _ in front of a backticked-column (instead of only appending _BACKTICK_QUOTED_STRING like now), like the following:

`30d` + `60d`
# would become
_30d_BACKTICK_QUOTED_STRING + _60d_BACKTICK_QUOTED_STRING
# which is a valid Python expression

@jreback

changing to non python semantics is cause for confusion

I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names.

Another way to put it is that the analytics tool (here, Pandas) has to adapt to real-life datasets, instead of the other way around. The "other way around" will simply never happen, and if it doesn't happen either on Pandas' side, then it's up to the Pandas analyst to deal with the nitty-gritty hoops and bumps of dealing with exotic column names. Which is far from ideal.

Thanks for the feedback.

I can understand jreback's perspective. Allowing all these kinds of edge cases brings quite some ugly code into the codebase. Unless you have your own language parser built in, you aren't really solving it very elegantly.

Even if we allow these kinds of edge cases to be valid with the aforementioned hacks, there will still be all kinds of ways to break it.

A different approach that has also been discussed is to use uuids instead. That would probably be most elegant, but would make exceptions harder to read.

Since there are now multiple people having trouble (or expecting too much) from the backticks, maybe the backtick approach is something worth reimplementing, even though the exceptions become harder to read?
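In the meantime, the uuid idea can also be used as a user-level workaround: rename columns to safe temporary identifiers, run the query, then restore the original names. A sketch (the `query_any_columns` helper is hypothetical, not pandas API):

```python
import uuid
import pandas as pd

def query_any_columns(df, expr):
    # Map each column to a temporary, guaranteed-valid identifier.
    mapping = {col: '_c' + uuid.uuid4().hex for col in df.columns}
    # Substitute backtick-quoted original names in the expression.
    for col, tmp in mapping.items():
        expr = expr.replace('`%s`' % col, tmp)
    result = df.rename(columns=mapping).query(expr)
    # Restore the original column names afterwards.
    return result.rename(columns={v: k for k, v in mapping.items()})

df = pd.DataFrame({'foo.bar': [11111, 2222, 333333]})
out = query_any_columns(df, '`foo.bar` > 2222')
```

The naive string replacement here is exactly the kind of hack that breaks on edge cases (e.g. backticks inside string literals), which is why doing this properly inside pandas' tokenizer handling is preferable.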

@zhaohongqiangsoliva maybe this new activity makes it worth reopening again?

@JivanRoquet Are non-python identifiers even feasible with how query / eval works? I'm not too familiar with it, but my understanding is that we use the ast module to build up an expression that's passed to numexpr.

@hwalinga I don't know how to reopen a closed issue. I searched using Google, but I didn't find a way.

"I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that Pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names."

100% agree with @JivanRoquet. This needs to work for all of us who use real life data files.
This issue needs to be resolved.

@dgua do you have a design proposal?

@TomAugspurger To give you little background. Using non-python identifiers was solved in https://github.com/pandas-dev/pandas/pull/24955 . However, it was only solved for spaces. This issue talks about extending that functionality. I actually already implemented it here: https://github.com/hwalinga/pandas/tree/allow-special-characters-query

Shall I make a PR? (I will then also make the change to allow numbers in the beginning.)

@hwalinga your implementation looks ok; even though I am basically -0 on using query / eval generally (as it's very opaque to readers), I would be ok with this change
