DataFrames.jl: Handling of strings for column indexing

Created on 21 Aug 2019 · 59 comments · Source: JuliaData/DataFrames.jl

Something @StefanKarpinski pointed out on Slack, which I was not aware of, is that we could overload getproperty and setproperty! to accept a string as an argument, so that things like:

df."first column"

and

for n in names(df)
    df."$n" = some_vector
end

would work.
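
A minimal sketch of the idea, using only Base Julia. `ToyTable` and `cols` are hypothetical names made up for illustration; this is not how DataFrames.jl is implemented, only a demonstration that string-typed property access can be supported by adding `AbstractString` methods:

```julia
# ToyTable is a hypothetical stand-in for a data frame; it is not DataFrames.jl code.
struct ToyTable
    columns::Dict{Symbol,Vector}
end

# Read the field with getfield, because getproperty is overloaded below.
cols(t::ToyTable) = getfield(t, :columns)

# Symbol access is the primary path; the String methods only convert and forward.
Base.getproperty(t::ToyTable, name::Symbol) = cols(t)[name]
Base.getproperty(t::ToyTable, name::AbstractString) = getproperty(t, Symbol(name))

function Base.setproperty!(t::ToyTable, name::Symbol, v::AbstractVector)
    cols(t)[name] = v
    return v
end
Base.setproperty!(t::ToyTable, name::AbstractString, v::AbstractVector) =
    setproperty!(t, Symbol(name), v)

t = ToyTable(Dict{Symbol,Vector}())
t."first column" = [1, 2, 3]   # calls setproperty!(t, "first column", [1, 2, 3])
t."first column"               # calls getproperty(t, "first column")
```

The String methods do nothing but convert to Symbol, so the stored names stay Symbols either way.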

@nalimilan - do you think we want to allow this?

decision non-breaking

bkamins

All 59 comments

Actually I have just realized that this works:

julia> df = DataFrame(rand(2,3))
2×3 DataFrame
│ Row │ x1       │ x2       │ x3       │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.296424 │ 0.134759 │ 0.704889 │
│ 2   │ 0.827267 │ 0.977076 │ 0.173992 │

julia> for n in names(df)
           @show df.:($n)
       end
df.:($(Expr(:$, :n))) = [0.29642448322674375, 0.8272667427942673]
df.:($(Expr(:$, :n))) = [0.13475891130026074, 0.9770757374452936]
df.:($(Expr(:$, :n))) = [0.7048890856238188, 0.17399213873640806]

bkamins on 21 Aug 2019

That sounds tempting to support names that cannot be typed as literal symbols. OTOH we don't allow strings anywhere else, so that would be a bit inconsistent. So maybe we should recommend using df.:var"x1" from https://github.com/JuliaLang/julia/pull/32408 instead (I've just checked and it works).

nalimilan on 21 Aug 2019

Actually df.var"first column" works, and this is what I have discussed initially on Slack.
In response @StefanKarpinski suggested that we could also consider handling strings (we do not have to).

bkamins on 21 Aug 2019

OK. Then maybe not allow this and see whether var"x1" is enough once it lands? I guess we could see df."x1" as a shorthand for df.:var"x1", but since we don't get complaints from users about this there's no hurry to add a convenience syntax.

nalimilan on 21 Aug 2019

Agreed. I will close it for now then (we can reopen if we find it is needed).

bkamins on 21 Aug 2019

👍1

-1 to adding df."x1". I think df.var"x1" is enough.

ararslan on 21 Aug 2019

I agree that var"x1" is a good solution here. We should think about compatibility though; I know that just having macro var_str(str); Symbol(str); end doesn't quite give the same results as the parser-supported var"str", but I think the cases that are different are edge cases that are weird. Maybe @c42f can comment on whether conditionally defining our own var_str macro would probably be good enough for previous Julia versions.

quinnj on 21 Aug 2019

I was thinking of having this for users on Julia 1.3 and later.

bkamins on 21 Aug 2019

Yeah there's no hurry. People can use a longer syntax that works on 1.0 if they need it (e.g. in packages), otherwise they can require 1.3.

nalimilan on 21 Aug 2019

In terms of compatibility, x.var"y" is unfortunately one of the cases where the string macro doesn't work — there are quite a few which is why we had to make it syntax.

I think the best we can do is to have @compat x.var"y" for this case which isn't great.

c42f on 22 Aug 2019

It doesn't seem so bad to me to just allow df."y"—or for that matter to allow columns to be accessed by symbol or by string. But I'm just doing a drive-by suggestion here, so take it with a grain of salt. The additional convenience of being able to write df."first column" and have it work on any 1.x version of Julia seems pretty worthwhile.

StefanKarpinski on 22 Aug 2019

I am reopening this, as the issue seems to get more attention (and thank you all for discussing that) than most open issues 😄 in this package.

bkamins on 22 Aug 2019

I don't have a strong opinion. All I'm saying is that there's no hurry since that's a feature that can be added later.

(We've refrained from allowing strings in getindex because they are too appealing for users, and yet much slower than symbols for dict lookup.)

nalimilan on 22 Aug 2019

👍1

That's a fair point, although this PR would probably help with that: https://github.com/JuliaLang/julia/pull/32437. At least that would probably fix the overhead in the case of a literal string column name. In other cases, if the user has a string, the conversion to symbol needs to happen anyway, so I'm not sure there's that much difference.

StefanKarpinski on 22 Aug 2019

For the getindex case the companion of var would be the sym macro suggested at https://github.com/JuliaLang/julia/pull/32707 (or equivalent syntax).

However, constant propagating the string into an inlined getindex, combined with https://github.com/JuliaLang/julia/pull/32437 could also remove any overhead of allowing strings.

c42f on 23 Aug 2019

OK, good to know!

Then I suggest we wait until https://github.com/JuliaLang/julia/pull/32437 is merged and then check that it suits our needs. I wouldn't like df."x" to be equivalent to, but slower than df.var"x": that would be a real trap.

In the longer term we may consider allowing strings in indexing if literals are as fast as symbols.

nalimilan on 23 Aug 2019

👍1

This is a benchmark on Julia 1.3:

julia> f(df, s) = getproperty(df, Symbol(s))
f (generic function with 1 method)

julia> @benchmark f($df, :x1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.928 ns (0.00% GC)
  median time:      19.840 ns (0.00% GC)
  mean time:        20.855 ns (0.00% GC)
  maximum time:     243.588 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark f($df, "x1")
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.444 ns (0.00% GC)
  median time:      54.040 ns (0.00% GC)
  mean time:        58.893 ns (0.00% GC)
  maximum time:     317.374 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

In my opinion the only problem with var is that things like df[:, var"x1"] do not work with it.

bkamins on 1 Dec 2019

> Things like df[:, var"x1"] do not work with it.

Well, var"x1" is a variable name so you can quote it: df[:, :var"x1"]. The less ugly equivalent would be something like @sym_str from https://github.com/JuliaLang/julia/pull/32707

c42f on 2 Dec 2019

👍1
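
A short sketch of the `var"..."` forms mentioned in the comments above, shown on a plain NamedTuple so it needs no packages. It assumes Julia 1.3 or later, where `var"..."` is parser-supported syntax; the names `row` and `sym` are made up for illustration:

```julia
# Requires Julia 1.3 or later, where var"..." is parser-supported syntax.
row = (var"first column" = [1, 2, 3], x1 = [4, 5, 6])

# Property access with a name that is not a valid plain identifier.
row.var"first column"

# var"first column" is an identifier, so quoting it with `:` yields a Symbol.
sym = :var"first column"
sym === Symbol("first column")

# The resulting Symbol can be used wherever a Symbol is accepted.
getproperty(row, sym)
```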

It's not clear how df."x1" solves any problems users have. If they are capable of writing out their column name as a literal, then they can just write df.x1. The situations where users want to work with strings are those where column names are variables and they need to perform operations on them, like occursin, startswith, etc. In that scenario, df."x1" won't help them; they still need to do df[!, Symbol(varname)].

A better proposal would be to have df."$varname".

pdeffebach on 16 Mar 2020

You could make column names strings and still allow df.x1 to work. There's no real reason why column names have to be symbols in order to use that syntax.

StefanKarpinski on 16 Mar 2020

My point is that adding the syntax df."x1" will still only be useful in the context where the user is willing to write out the string. I am more concerned with a situation like:

varname = function_making_complicated_string()
df[!, Symbol(varname)] # works
df."varname" # will look for column `:varname`
df."$varname" # possibility

In situations where the column name is complicated or created by a function, they won't be willing to write it out and thus won't be able to take advantage of df."x1".

And if they are willing to write it out, they might as well just do df.x1.

pdeffebach on 16 Mar 2020

👍1

I renamed the issue, as getproperty/setproperty! are special cases of a more general question: do we want to allow a data frame to be indexed using AbstractString (in essentially all functions that currently accept Symbol)? Our internal design makes this change relatively easy.

Pros:

people will find it more convenient when doing string processing to produce/check column names

Cons:

I am afraid that users will be confused if they should use Symbol or AbstractString as their primary column access method (this is my chief concern)
The chief usefulness of accepting strings is to allow comparison against propertynames(df) and names(df), but these will keep returning collections of Symbol. Of course string.(names(df)) is easy enough to write, but maybe we should define a special function returning names of columns of a data frame as a collection of Strings?

I would not be afraid of performance issues as they would not be a bottleneck I think.

bkamins on 16 Mar 2020

👍2

> I am afraid that users will be confused if they should use Symbol or AbstractString as their primary column access method (this is my chief concern)

I think that there are a lot of merits to using strings. But in terms of keeping the surface of the API manageable and to make users less confused, it should be either strings or symbols and that decision should happen for 2.0.

string.(names(df)) is easy enough to write that there doesn't need to be a special method. Our current regex functionality is pretty robust and meets the needs of many users.

pdeffebach on 16 Mar 2020

So this was exactly my thinking, but I was afraid that just switching from symbols to strings (and dropping symbols) seems "too breaking" even for 2.0, and would require really serious consideration (to avoid problems like Python had moving from 2 to 3).

Some additional comments for a decision:

string lookup vs symbol lookup is ~ 2.5x slower (for typical column name length); this is not hugely problematic, but still I wanted to note this
we should consider other tabular data formats (and Tables.jl in general), where Tables.jl strictly assumes that column names should be Symbols

Essentially our current design is:

fast to lookup, and consistent with Tables.jl
at the cost that if the user wants to work with strings one needs to use Symbol and string for conversions in both ways.

And the additional question is in what cases this is really problematic (i.e. worth considering to change) given that:

in rename we already handle strings
we allow using regex for column name matching

(as maybe it is enough to just add 1-2 convenience functions to cover 90% of use cases where strings are needed)

bkamins on 16 Mar 2020

👍3

Another consideration is that Symbols are interned by the compiler, so if you're programmatically generating a lot of unique column names and transforming them to Symbols that slowly leaks memory. Not a problem for analytical batch or interactive work, but for a long running server process it could become problematic. Just a thought :-)

c42f on 17 Mar 2020

Do you have any ideas on when that might be a problem? My workflow in Stata using survey data regularly had 20,000+ variables and a lot of for loops where I programmatically construct temporary names.

pdeffebach on 17 Mar 2020

In practice I imagine this would become a noticeable problem only when you had say 100_000_000 variables over the lifetime of the session. This would likely amount to several GB of memory which can only be reclaimed when the process exits.

c42f on 17 Mar 2020

So the topic is back again in Slack. To sum up the options we have:

do nothing and require wrapping strings in Symbol or prepending var (no work to be done; people find it inconvenient when they have e.g. spaces or characters like + in variable names)
allow strings as column names, but immediately convert them to Symbol internally (easy to implement; people might be confused - but probably we can explain this; non-breaking)
switch from Symbol to string for indexing at some point (breaking and will have a negative performance impact; the benefit is that it will be consistent)

Given the discussion in this thread, can you please voice your opinion on this?

CC @oxinabox

bkamins on 30 Mar 2020

Caveat, I am only a very occasional DataFrames user, but I thought I'd offer my analysis of the design options here. From a UI perspective, I would describe the matrix of options in terms of two orthogonal choices:

1. When asking to list column names return a collection of symbols or strings;
2. When using a column name, accept only the type (strict) in 1 or either type (permissive).

Current behavior is symbol/strict. Allowing indexing column names as strings as well would move to symbol/permissive. One of the issues that has come up, however, is that when trying to reflectively work with column names, strings are more convenient to work with since operations like startswith and regex operations work with strings but not symbols and there is reluctance to extend string operations to symbols. So it might make sense to not only allow strings for getting columns, but to list column names as strings for the sake of easier filtering. However, moving to string/strict would be the most breaking option since all code that currently uses symbols for column names would break, and not just code that expects the list of column names to be returned as symbols.

All told, with this design matrix, there are four options from least to most breaking:

1. symbol / strict
   - not breaking
   - current behavior
2. symbol / permissive
   - not breaking
   - allow indexing with strings as well as symbols
   - doesn't address the inconvenience of using reflection to filter lists of symbols.
3. string / permissive
   - breaks some reflective code but not normal indexing code
   - list columns as strings rather than symbols but continue to allow symbol indexing
4. string / strict
   - very breaking: any usage of symbols for column names breaks
   - only allow listing and accessing column names as strings

Option 1 is the status quo, so that's well understood. I'd say that 4 should be off the table—it's way too disruptive at this point.

Option 2 fixes a mild annoyance and adds a bit of convenience. Sometimes you have a string and you want to get a column by that string; why do you have to spell out the conversion to symbol? Sometimes this will be a convenience for advanced users, sometimes this will make life a bit easier for novices since something will "just work". Indeed, I have a hard time seeing why one wouldn't just allow this to work. After all, if the computer can tell me what I meant to do, why not just do it for me? I'm not sure why this would be all that confusing: data frame column names are symbols, but strings are also allowed and are converted to symbols for you.

Option 3 is an interesting one. In addition to what 2 does, it addresses the issues with inconvenience of reflecting on column names. I'm not sure that alone is really worth the breaking change though. Writing string.(names(df)) is pretty clear and then you can do your reflection and with option 2, you don't even need to convert back to symbols at the end.

StefanKarpinski on 30 Mar 2020

❤2

Just to finish the picture:

Option 3 is more than names as it would affect also rename and unstack functions which have methods that transform column names into other column names. Both functions allow string or Symbol as a return value (so we are permissive here already) but they are passed a Symbol as column name now. Still it is easy enough in them to write string(colname) to have what is desired as with names.

bkamins on 30 Mar 2020

Can you please vote up :+1: or down :-1: under this post if you are for implementing Option 2 (allow strings as input, but always return Symbols as output)? If it gets enough support (and no serious drawbacks) I will implement it after we are done with the major API changes we are now finishing (as this PR will affect almost all functionality we provide).

bkamins on 10 Apr 2020

👎2

I actually like 3.
Once you've seen a few dozen instances of filter(col -> startswith(string(col), "LON_"), names)
and its variants
you find yourself writing a package to make it stop.

But I am not opposed to Option 2

oxinabox on 10 Apr 2020
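
For reference, the pattern complained about above looks roughly like this in Base Julia. The column names are made up for illustration, and a plain vector of Symbols stands in for the result of names(df):

```julia
colnames = [:LON_temp, :LON_rain, :NYC_temp]   # hypothetical column names

# Predicate style: each Symbol is converted to a String before calling startswith.
lon_pred = filter(c -> startswith(string(c), "LON_"), colnames)

# Regex style on the same data; the conversion to String is still needed.
lon_regex = filter(c -> occursin(r"^LON_", string(c)), colnames)
```

Both forms need the string(c) conversion because the names are Symbols, which is the friction being discussed.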

As I have noted - 3 differs from 2 only in what we return in names, rename and unstack. Internally we will keep Symbols anyway.

Now as I think about it - I do not see a big risk of 3 vs 2, as mostly when you use names you probably do not rely on Symbols being returned and in rename and unstack actually having a string passed to a function is more convenient.

However, if we wanted to go for 3 we should do it now, as I would prefer not to make such changes after 1.0 release.

EDIT
If we go for Option 3 then I think it is best that names returns Vector{String} but propertynames will still return Symbols, so both options will be available.

TODO: remember to add a method for Tables.columnindex(::AbstractDataFrame, ::AbstractString) then.

bkamins on 10 Apr 2020

@oxinabox one more thing (I am sorry for pushing things here but this is the last "grand" issue that requires decision before 1.0 API is frozen).

Note that with https://github.com/JuliaData/DataFrames.jl/pull/2177 instead of:

filter(col -> startswith(string(col), "LON_"), names(df))

you will write:

names(df, r"^LON_")

to get a Vector{Symbol} matching the pattern.

Given this do we really feel it would be better to return strings rather than Symbols in names?

Essentially it could still be needed for things that cannot be handled by Regex but in my experience 99% of things can be handled by Regex.

Even if we keep names to return Symbol we could still change unstack and rename to pass string to the transformation functions (for convenience).

So essentially when discussing option 2 vs option 3 in what @StefanKarpinski listed we have three options actually:

Option 2: names returns Symbols; unstack and rename pass Symbol
Option 3: names returns Strings; unstack and rename pass String
Option "between 2 and 3": names returns Vector{Symbol}, but unstack and rename pass String

bkamins on 11 Apr 2020

> Essentially it could still be needed for things that cannot be handled by Regex but in my experience 99% of things can be handled by Regex.

A great many people hate Regex.
I would much rather a function using boolean logic around startswith and endswith and contains

Overall my preference is not too strong.
I have already written the symbol manipulation in Wrangling.jl and will be using it anyway for other things.

In the long term I think the real solution is to be able to attach metadata to columns.
Then it will be more like:

filter(column_meta(df)) do col
    col.city=="LON"
end

And so how the columns are named doesn't matter.

oxinabox on 11 Apr 2020

> A great many people hate Regex.
> I would much rather a function using boolean logic around startswith and endswith and contains

We could have startswith and endswith return regular expressions just as easily and keep the logic.

I am coming around to option 3: have everything return Strings by default but still allow symbol indexing so that getproperty can work. I like the division between code and data that using Symbols gives, but I agree with @oxinabox that we are underestimating the amount of automated name generation that goes on in a large code base.

pdeffebach on 11 Apr 2020

OK - given the comments we will likely go with option 3, unless someone raises a flag during the next few days.

bkamins on 11 Apr 2020

👍3 👎1

I feel like that's going to be the most pleasant to use in the long term at the cost of some minor breakage to reflection code in the short term.

StefanKarpinski on 12 Apr 2020

I may be late to this, but a point of interest: I thought Julia tended to embrace the "don't offer too many ways to do the same thing" design philosophy (e.g. using " for strings, and not allowing multiple options like ' and " like Python)? Or am I making that up?

I will confess (particularly if there's a 2x performance advantage) I'm inclined to stick with the current Option 1. Option 3 is more flexible, but also a potential invitation for confusion among new users?

I suspect that if we go to option 3, then over time Strings will become what everyone uses (similar to pandas / R), people will basically forget about the symbol functionality most of the time, and people will end up with column accesses that are 2x slower than they could be. But we'll still have to support two use cases instead of one.

EDIT: Typo.
EDIT 2: Add note about eventual shift in use tendencies.

nickeubank on 12 Apr 2020

👍4

I think this is exactly the moment to discuss it. In one month 0.21 will be out implementing whatever decision we make here.

Let me phrase roughly what we would say to users if we go for Option 3.

A column of a data frame has a name which is a String, and a property name which is Symbol. These two values follow the contract Symbol(name) == property_name.

And some more comments:

Supporting both is not a problem. It will be a big (as we have to update signatures of many functions) but relatively simple PR.
Yes - column access with Symbol is faster, but I expect that it should not be a bottleneck in the code (probably what you do with the column is much more expensive, also you will probably have one dynamic dispatch anyway after you extract the column)
I assume that people still will write df.columnname as this is easier to type and will be given by autocompletion
Finally - if we see people switching to strings in other ecosystems, then probably there is a reason for this (from my experience a chief one is that you can add spaces in column names which is very common in practice, e.g. when you read in data from human-made Excel files).

So in summary - I think that this will be the case - people will switch to strings, but then Symbol can be thought of as a faster way to get things for more advanced users (as for novices they otherwise have to learn e.g. Symbol("a b c ") or :var"a b c" which is not that obvious).

I see the point in purity of Option 1 (and initially I preferred this), but I think that if we see many people want strings to be allowed (we get such comments) and this is not a problem to support them (I confirm that it is not a problem) then why not go for it.

EDIT

One serious drawback is generic code. If people will want to write code against Tables.jl interface (so that DataFrame can be easily replaced by some other tabular type without changing the code) they probably should use Symbol not strings.

bkamins on 12 Apr 2020

❤2
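
The name / property-name contract stated in the comment above, Symbol(name) == property_name, can be sketched on a toy type. `NameContractFrame`, `column` and `column_names` are hypothetical names invented for this illustration and are not part of DataFrames.jl:

```julia
# NameContractFrame is a hypothetical toy type, not the DataFrames.jl implementation.
struct NameContractFrame
    colnames::Vector{Symbol}          # names are stored as Symbols internally
    columns::Vector{AbstractVector}
end

# Lookup by Symbol is the primary path.
function column(f::NameContractFrame, name::Symbol)
    i = findfirst(isequal(name), getfield(f, :colnames))
    i === nothing && throw(ArgumentError("column $name not found"))
    return getfield(f, :columns)[i]
end

# Lookup by String converts to Symbol and forwards.
column(f::NameContractFrame, name::AbstractString) = column(f, Symbol(name))

# "Names" are Strings, "property names" are Symbols.
column_names(f::NameContractFrame) = string.(getfield(f, :colnames))
Base.propertynames(f::NameContractFrame) = copy(getfield(f, :colnames))
Base.getproperty(f::NameContractFrame, name::Symbol) = column(f, name)

f = NameContractFrame([:x1, Symbol("first column")],
                      AbstractVector[[1, 2], ["a", "b"]])

# The contract: converting each name to a Symbol gives the property name.
Symbol.(column_names(f)) == propertynames(f)
```

In this sketch, String names and Symbol property names are two views of the same stored Symbols, so both kinds of access return the same column object.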

cc @quinnj for input with regards to Tables.jl

StefanKarpinski on 13 Apr 2020

In Tables.jl, we opted for Symbol/strict mode just to keep things simpler, _and_ because it's more of a "developer's" API, as opposed to something casual users may encounter. That said, I wouldn't be opposed to supporting more string operations w/ column names if it would help or be more convenient for sources/sinks in some way. For example, I wouldn't mind officially having something like: Tables.stringcolumnnames(x) = [string(n) for n in Tables.columnnames(x)] and specific sources could overload the definition themselves if they already store names as strings. I'm not too worried about what we call it either, since, as I mentioned, this is mainly a package-developer API anyway.

quinnj on 14 Apr 2020

👍3

OK - unless there is some strong opposition a PR for Option 3 will be opened in a few days.

bkamins on 14 Apr 2020

I don't have a lot to add here but FWIW I prefer option 1. I think https://github.com/JuliaData/DataFrames.jl/issues/1926#issuecomment-612663592 summarizes my thoughts on this pretty well. It seems unnecessary to have two ways to do it when we can just document what's expected. Seems I'm in the minority here and I'm not going to stonewall changes on this front, just thought I'd put in my $0.02 USD.

ararslan on 14 Apr 2020

👍1

I am still fighting with myself over which is better.

@kleinschmidt - what do you think about this PR in the context of StatsModels.jl and the formula interface (which now requires Symbols)?

The alternative is to stick to Option 1 + recommend Wrangling.jl (it is more powerful than what we would have anyway) in the manual if someone wants more flexibility (possibly we would change rename! and unstack API for convenience).

Now the report from the field (I have started preparing for the PR): almost all functions would need some minor update with this change (this is not super bad, but just shows how big this PR is).

Maybe let me also ask - is there someone who "strongly" wants strings accepted? (just to hear the other side, as it seems that most people do not have a strong preference but only mild one in one way or the other).

I am sorry for possible bikeshedding here, but this is a fundamental design decision for me that will have very long lasting consequences.

bkamins on 14 Apr 2020

👍3 ❤1

I'm a fairly new Julia user, and just to share my experience, for the first week or two I was definitely a bit confused between Symbol("col_1") and :col_1. I was also a little thrown off by why Symbol("col 1") worked but not :col 1. I obviously figured these things out, but for someone just starting Julia, and if the assumption is that many new Julia users are coming from Python like myself, I think switching to string indexing instead of symbol would be great.

sbut1992 on 14 Apr 2020

❤2 👍2

Thanks for the feedback, @sbut1992, it's really helpful to have your perspective.

StefanKarpinski on 14 Apr 2020

👍1

Indeed one of the main advantages of strings is to allow accessing columns with spaces in their names via df[!, "col 1"] or df."col 1" instead of having to use Symbol("col 1"). And since e.g. CSV.jl preserves spaces in column names, this isn't just a corner case.

nalimilan on 15 Apr 2020

Just to complete the discussion: internally DataFrame will store column names as Symbols. This means that when indexing with strings, the only overhead will be the conversion of the string to a Symbol. I assume that:

in cases when constant propagation happens and we have a static string - the compiler will optimize-out the conversion
in case in which we have a dynamic string the overhead is not more than 100ns for normal lengths of column names

Now, given that DataFrame is not type stable this overhead is probably not the end of the world (and one will have an option to use Symbol if performance matters here).

bkamins on 15 Apr 2020

👍1
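
A rough way to see the string-to-Symbol conversion cost described above, using only Base Julia. `index`, `colindex` and `seconds_per_lookup` are hypothetical names for this sketch; the timings are machine-dependent and BenchmarkTools.jl would give more reliable numbers:

```julia
# Hypothetical name => position map with Symbol keys, as a data frame might keep internally.
index = Dict(Symbol("x", i) => i for i in 1:100)

# Symbol lookup is direct; String lookup converts to Symbol first.
colindex(index, name::Symbol) = index[name]
colindex(index, name::AbstractString) = colindex(index, Symbol(name))

# Average seconds per lookup over n repetitions (a crude timing loop).
function seconds_per_lookup(index, name, n)
    acc = 0
    elapsed = @elapsed for _ in 1:n
        acc += colindex(index, name)
    end
    return elapsed / n, acc
end

t_sym, _ = seconds_per_lookup(index, :x42, 10^6)
t_str, _ = seconds_per_lookup(index, "x42", 10^6)

println("Symbol lookup: ", round(t_sym * 1e9, digits = 1), " ns")
println("String lookup: ", round(t_str * 1e9, digits = 1), " ns")
```

The String method differs from the Symbol method only by the Symbol(name) call, so the gap between the two printed numbers approximates the conversion overhead for a dynamic string.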

> Now, given that DataFrame is not type stable this overhead is probably not the end of the world (and one will have an option to use Symbol if performance matters here).

They aren't type stable yet, but we are aspiring towards that, no? Does this have any bearing on that? (I can't think of a way it would, but want to check!)

nickeubank on 15 Apr 2020

> but we are aspiring towards that, no?

I do not want to say "no", but at least for now I do not see how to achieve this.

The current design is the following:

1. we are type unstable
2. we have all the benefits of type instability - we can add/remove columns, we can change column types, we can change column names, we can have thousands of heterogeneous columns without huge compilation cost (even recently we changed some bits of code to be type unstable, as otherwise CSV.jl was very slow when saving files)
3. all exposed methods are type stable internally - i.e. they process things fast and only input and output is type unstable - roughly there is at most one dynamic dispatch per column processed (unless we explicitly want to be unstable or we have forgotten to fix something); in particular purposefully by default we drop column names when processing data to avoid constant recompilation even in type-stable branches (as passing around column names would trigger recompilation each time names change)
4. if you want type stability for your own methods then call Tables.columntable or Tables.namedtupleiterator to have a no-copy type-stable object (or Tables.rowtable - at the cost of performing a copy)
5. Hopefully one day Julia will be able to cache compiled functions better than it does now between sessions, so given the points 1-4 DataFrames.jl will have a very fast loading and response time (as many things will already be compiled in cache)
6. There are other packages in the ecosystem that focus on type stability, so if despite points 1-5 one still needs type stability there are other options

bkamins on 15 Apr 2020

👍5
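
The "one dynamic dispatch per column" design described in the comment above is an instance of the function-barrier pattern. A minimal sketch in Base Julia, with `UntypedFrame`, `colsum` and `total` as hypothetical names that do not come from DataFrames.jl:

```julia
# UntypedFrame is a hypothetical container whose column types are not known to the compiler.
struct UntypedFrame
    columns::Vector{AbstractVector}
end

# Kernel: specialised and fast for each concrete column type it is called with.
colsum(col::AbstractVector{<:Number}) = sum(col)

# Outer loop: the element type of each column is only known at run time, so each
# call to colsum is a dynamic dispatch, but the work inside the kernel is type stable.
function total(frame::UntypedFrame)
    acc = 0.0
    for col in frame.columns
        acc += colsum(col)
    end
    return acc
end

frame = UntypedFrame(AbstractVector[[1, 2, 3], [0.5, 1.5]])
total(frame)
```

Columns of different element types live in one container without recompiling `total` for every combination of column types. The cost is one dynamic dispatch per column instead of one per element.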

> There are other packages in the ecosystem that focus on type stability

Yes, my experience trying to use TypedTables.jl (circa ~1 year ago) is that it's really great for small numbers of columns (say 10-20 and maybe more if you don't have any columns containing missing data). But for lots of columns the design easily hits the limits of the compiler and there's not really any recourse. On the other hand, DataFrames can easily handle thousands of columns.

Typed columns can make a lot of sense if you're manipulating data with well known structure (TypedTables was designed partly with point clouds in mind). But for arbitrary data files like wide CSVs the untyped approach is just a lot more reliable.

c42f on 16 Apr 2020

👍1

I spoke to a bunch of people at Invenia who use names a lot, and the overall responses that came back were:

"I don't really mind. We have Wrangling.jl now anyway, so it doesn't matter so much"
"As long as it works with DataFramesMeta.jl etc"

oxinabox on 16 Apr 2020

👍1

> As long as it works with DataFramesMeta.jl

There you just use @with(df, :var"x y" .+ 1) syntax if you have spaces in your column names (strings would not be allowed there as this would be ambiguous - you can always write :var in front to get what you want in such cases).

> "I don't really mind. We have Wrangling.jl now anyway, so it doesn't matter so much"

@oxinabox Do you think that if we started returning strings from names and accepting strings for indexing, Wrangling.jl would no longer be needed (or would you use it anyway)?

@sbut1992 - I wanted to check one thing with you to make sure we make the right decision here (just to summarize the discussion - most (not all) "old" users of DataFrames.jl feel that only allowing Symbol is enough, most "new" users of DataFrames.jl feel it would be nice to allow strings).

Do you think that if we:

highlighted more what Symbol means and what is its purpose, and why it is preferred, in the introduction to DataFrames.jl manual (one will need to learn what Symbol is at some point anyway if one uses Julia), in particular highlighted that you can write :var"x x" and do not have to write Symbol("x x") which is admittedly cumbersome
explained that one can use https://github.com/invenia/Wrangling.jl to conveniently work with Symbols

then you would be comfortable with not allowing strings, or still you would prefer to have them accepted? Thank you for the feedback.

bkamins on 16 Apr 2020

> @oxinabox Do you think that if we started returning strings from names and accepting strings for indexing, Wrangling.jl would no longer be needed (or would you use it anyway)?

Would still be using it anyway as it has other functions like contains_any and such

oxinabox on 16 Apr 2020

The very existence of Wrangling.jl seems to indicate that we should use strings as names: guys, you found working with data frames and symbols so painful that you created a package defining string operations on symbols! :-D

However there's the additional issue of currying startswith(s) and endswith(s) methods. I wonder whether these could be defined in Base. At any rate, it's a problem that working conveniently with DataFrames would require type piracy.

nalimilan on 16 Apr 2020

👍3

Thank you all for the feedback. I think we got enough evidence what to do (still if @sbut1992 would want to add something in response to my question - you are welcome).

Now I start implementing the PR allowing AbstractString indexing.

bkamins on 16 Apr 2020

> However there's the additional issue of currying startswith(s) and endswith(s) methods. I wonder whether these could be defined in Base. At any rate, it's a problem that working conveniently with DataFrames would require type piracy.

They are in 1.5

oxinabox on 16 Apr 2020

❤3
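
A small sketch of the one-argument (curried) startswith and endswith methods referred to above, assuming Julia 1.5 or later. The column names are made up, and they are Strings here because these methods operate on strings:

```julia
# Requires Julia 1.5 or later for the one-argument startswith/endswith methods.
colnames = ["LON_temp", "LON_rain", "NYC_temp"]   # hypothetical names, as Strings

# startswith("LON_") returns a predicate equivalent to s -> startswith(s, "LON_").
lon = filter(startswith("LON_"), colnames)
temp = filter(endswith("_temp"), colnames)

# Predicates combine with ordinary boolean logic, with no regex involved.
lon_temp = filter(n -> startswith(n, "LON_") && endswith(n, "_temp"), colnames)
```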

If you are interested, you can test/review #2199, which implements this (and then we tag 0.21).

bkamins on 18 Apr 2020

🚀1 ❤1
