I think it would be great to require column names to be valid symbols (this is related to https://github.com/JuliaData/CSV.jl/issues/158).
So for instance, we could still have a variable named v1_2, but not v1 2 or v1.2 anymore.
There are a number of reasons for this restriction: it makes sense for column names to obey the same syntax as Julia variables, stricter names will make it easier to develop for data frames going forward, and it potentially frees special characters to denote certain operations on columns in the future.
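To make the proposed rule concrete, here is a minimal sketch that checks the three example names against Julia's identifier syntax. It assumes `Base.isidentifier`, which is an unexported helper in Base, is an acceptable stand-in for "obeys the same syntax as Julia variables"; the variable names `candidates` and `valid` are just for illustration.

```julia
# Base.isidentifier is not exported, so it is called with the Base prefix.
candidates = ["v1_2", "v1 2", "v1.2"]

# Map each candidate column name to whether it parses as a Julia identifier.
valid = Dict(name => Base.isidentifier(name) for name in candidates)

for name in candidates
    status = valid[name] ? "valid identifier" : "not a valid identifier"
    println(rpad(repr(name), 8), " => ", status)
end
```

Only `v1_2` passes; the names with a space or a dot are rejected.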
NOTE: This is also related to the discussion on issue #1200
I don't really see the advantage of being that strict. What matters IMHO is that we don't create invalid names by default, but if people want to do something special, why forbid it? If they use spaces or dots in their variable names, of course they won't be able to access them using df.col, and they will have similar issues inside Query or DataFramesMeta macros. But as long as that doesn't hurt the standard use case, I'd rather allow them, which could be useful e.g. if you need to read a file and write it again with the exact same column names.
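For readers unfamiliar with the access problem mentioned here, a small sketch follows. It uses a plain `NamedTuple` as a stand-in for a data frame (an assumption made so the example needs no packages); the name `row` is purely illustrative.

```julia
# A NamedTuple stands in for a data frame here (an assumption for illustration).
row = NamedTuple{(:v1_2, Symbol("v1 2"))}((1, 2))

# Dot syntax works when the name is a valid identifier.
a = row.v1_2

# A name containing a space cannot be written after the dot,
# so it has to be passed as an explicit Symbol instead.
b = getproperty(row, Symbol("v1 2"))
c = row[Symbol("v1 2")]
```

The unusual name stays reachable, just not through the convenient `row.name` form.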
Aren't you concerned with people having issues with Query/DataFramesMeta and not understanding why / complaining about it? What about silent errors?
I had not thought about the use case you are talking about, but is it really that common? And wouldn't you be able to do it by writing something other than a DataFrame in this case?
Whether or not DataFrames ends up imposing stricter column names, you agree to convert column names to valid symbols in CSV by default, right? I do think it would be a step in the right direction.
But then should we also do that with Feather.jl? ReadStat.jl? I would say so — but then maybe at some point it becomes just easier to forbid invalid names.
> Aren't you concerned with people having issues with Query/DataFramesMeta and not understanding why / complaining about it? What about silent errors?
I don't know, we haven't received any complaints so far (except from people who know what is happening and who would like a way to support weird names in macros). R works the same and I'm not aware of complaints in that regard either. We can deprecate it at any point if we realize users are confused, but I don't see the point of doing this preemptively.
> I had not thought about the use case you are talking about, but is it really that common? And wouldn't you be able to do it by writing something other than a DataFrame in this case?
Create a completely new data structure just to allow for non-standard names?
> Whether or not we end up imposing stricter column names, it seems that you agree to convert column names to valid symbols in CSV by default, right? I think it would be great.
Yes, we should really fix this as it can be much more confusing.
It seems that supporting such names requires a lot of work for developers in R. See this dplyr issue:
https://github.com/tidyverse/dplyr/issues/2243 (and all its linked issues)
(For writing a CSV file with non-standard column names, I was just thinking of writing a dictionary instead of a DataFrame, but I am not that familiar with CSV.jl.)
I think at least for Query.jl there would be a fairly straightforward way to support complicated column names via something like i[sym"my crazy column name"]. This would require a string macro sym that we discussed somewhere previously that seems really harmless, and then I think it actually is quite a usable syntax. Plus, I think this would be type stable, if I understand constant folding properly.
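A rough sketch of what such a string macro could look like is below. This is an assumption about the idea, not code from Query.jl or DataFrames.jl: `sym_str` is a hypothetical definition, and a `NamedTuple` named `record` stands in for the `i` in the example above.

```julia
# Hypothetical string macro: sym"..." is replaced by a Symbol literal
# when the macro is expanded, so no Symbol is built at run time.
macro sym_str(s)
    return QuoteNode(Symbol(s))
end

# A NamedTuple stands in for a query row (an assumption for illustration).
record = NamedTuple{(sym"my crazy column name",)}((42,))

# Index with the string-macro form instead of Symbol("...").
value = record[sym"my crazy column name"]
```

Because the macro expands to a literal symbol, the compiler sees a constant index, which is what the type-stability hope rests on; whether that holds inside a given query macro would need to be checked.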
Related: StatsModels#35. To be able to build statistical models, it is best to force column names to be symbol-compatible.
I think enforcing dotless col names would be a good idea because it would make many downstream issues obsolete and therefore reduce effort there.
I came across this issue when trying to plot a DataFrame with the StatPlots package and accessing col names with dots is currently rather unwieldy.
I think in R during creation of a data frame (either during import or conversion) dots are replaced by underscores.
That sounds like a good option to me.
Probably need to catch dots in renaming cols as well.
> I came across this issue when trying to plot a DataFrame with the StatPlots package and accessing col names with dots is currently rather unwieldy.
How was that data frame created? That's where the problem should be fixed (rather than preventing people from using dots if they explicitly ask for them).
I think the main question is: Should people be allowed to use dots in col names?
So far, I've not come across any good argument for them, only many arguments against.
The data was a legacy dataset from a file, containing the dotted col names.
> I think the main question is: Should people be allowed to use dots in col names?
> So far, I've not come across any good argument for them, only many arguments against.
The main argument in favor of allowing them is that it's sometimes necessary to be able to preserve names as they are, e.g. to write a CSV file with the same names as the input. Making this impossible is problematic for robust programming.
> The data was a legacy dataset from a file, containing the dotted col names.
But how was it loaded?
Even if enforcing some rules would stabilise a whole ecosystem?
That would be closer to robust programming, as far as I understand it.
But I see your original point: Fix it at the root/generation
I agree, that would be preferable.
It doesn't necessarily help, though, if you have to deal with legacy stuff.
Loading: I tested it via the feather format file import, using the read function.
I also tested a smaller subset, using a csv file (the full set gave me a stack overflow error).
> Loading: I tested it via the feather format file import, using the read function.
> I also tested a smaller subset, using a csv file (the full set gave me a stack overflow error).
Then file bugs against Feather.jl and CSV.jl/CSVFiles.jl so that they replace dots with underscores by default.
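As a sketch of what "replace dots with underscores by default" could mean in a reader, here is a hypothetical helper. `sanitize_name` is not part of CSV.jl, Feather.jl or CSVFiles.jl, and restricting names to ASCII letters, digits and underscores is a simplifying assumption (Julia identifiers also allow many Unicode characters).

```julia
# Hypothetical helper, not part of CSV.jl or Feather.jl.
function sanitize_name(name::AbstractString)
    # Replace every character outside [A-Za-z0-9_] (dots, spaces, ...) with "_".
    cleaned = replace(name, r"[^A-Za-z0-9_]" => "_")
    # Identifiers cannot be empty or start with a digit, so prefix those with "x".
    if isempty(cleaned) || isdigit(first(cleaned))
        cleaned = "x" * cleaned
    end
    return Symbol(cleaned)
end

dotted = sanitize_name("v1.2")   # :v1_2
spaced = sanitize_name("v1 2")   # :v1_2
```

Note that both inputs map to the same symbol, so a real default would also have to deduplicate names, and it would need an opt-out for people who must keep the original names.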
@nalimilan Do you know what is left to be done for this issue? Or can it be closed?