DataFrames.jl 🚀 - Policy regarding in-place operations

Today I hit a problem with dropmissing! related to this. I created a dataframe with a subset of two columns to perform pairwise deletion of missings and it corrupted the original wide DataFrame. I knew what to do (at least I hope I did 😄), but probably newcomers to DataFrames.jl will be caught off-guard here unless we properly document it.

bkamins on 1 Feb 2019

A possible idea: use a ! suffix for operations that mutate the data frame but not the column vectors (the majority), and a !! suffix for operations that also mutate the vectors (which are not that common, and often trap users who don't want to mutate column vectors). That would be both explicit, consistent are still relatively convenient.

nalimilan on 11 Feb 2019

In general silently mutating vector type except for df.x = something (in this case it is obvious it can be mutating) should be signaled to the user so !! seems reasonable. As I have said in #1716 an implicit type mutation without signaling can lead to nasty bugs IMO.

bkamins on 11 Feb 2019

Continuing from #1716

So is your proposal to make push! to do autopromotion (and create a new vector) and push!! to mutate the vector in place?

This kind of confusion shows why this kind of syntax-y solution is less than ideal. Though I do think it's interesting to think of a syntax that shows clearly the nested structure of a DataFrame

pdeffebach on 11 Feb 2019

So the idea was that a single ! only mutates the data frame, so it can do anything to the object (including replacing column vectors), but !! can also mutate the column vectors (so it can't change their type). That's not really in contradiction with Base, since the single ! version still doesn't change the type of the data frame, just like push(Int[], 1.0) converts 1.0 to Int since the type cannot change.

That said, I'm not saying that's the best solution.

nalimilan on 11 Feb 2019

I agree that there are pros and cons so my idea was to list all the affected functions and then see how many changes we would have to make.

bkamins on 11 Feb 2019

@nalimilan Given the discussion we had yesterday with @oxinabox I think I can explain both arguments behind !! briefly:

what I and @oxinabox initially thought: is that !! signals a more dangerous thing than !; with this reasoning a more dangerous thing is to mutate vectors withing a data frame than to replace them by some other vectors (why: because these vectors can be used somewhere else and mutating them might break something)
what @nalimilan proposes is ! mutates only data and !! mutates data and metadata

Probably both reasonings are valid, but I just wanted to spell out the difference between them.

Now - regarding mutating vectors in place. Here we have a double-risk after the recent changes allowing to propagate @inbounds in getindex. Why - because we do check vector length only at adding/creation time. Later we only check first column for its length and assume that all else matches. Without propagating @inbounds this is OK, but when we started to propagate them you can start getting segfaults (and I guess @oxinabox can confirm this 😄 - although this is probably not the case he has encountered) in strange places (of course due to the error on the programmer side - but the error might be tricky to spot).

bkamins on 13 Feb 2019

I guess really there are 4 kinds of things

0 Operations that just add columns. These are always safe
1 basic mutating options that just change the values in existing columns. These are always safe
2 operations that change the size of existing columns
3 operations that replace columns with new columns at same index

2 and 3 are both dangerous.

I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames.

So noone can actually mutate columns except via intended interfaces.

oxinabox on 13 Feb 2019

* what I and @oxinabox initially thought: is that `!!` signals a more dangerous thing than `!`; with this reasoning a more dangerous thing is to mutate vectors withing a data frame than to replace them by some other vectors (why: because these vectors can be used somewhere else and mutating them might break something)

* what @nalimilan  proposes is `!` mutates only data and `!!` mutates data and metadata

@bkamins I don't think that's accurate. At least in my comment above I (meant to) propose the first interpretation. But I think @oxinabox said the reverse on Slack.

I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames.

@oxinabox I'm not sure that would be a good protection nor convenient.

Good protection: even if we wrap all columns in a special array type, users can always resize the original array. And actually resizing the vectors isn't the worst possible operation, since at least you're likely to get an error at some point: but if you sort the values and still use that vector in another data frame, the consistency of observations will be silently lost.
Convenient: you would never be able to push! or append! to a data frame, except using special functions; but we don't need to use a custom array type for that, we can just do what we want with/in our DataFrame methods. And returning these custom array types from e.g. df.col would be weird for users.

I've had a look at our current mutating methods. Here are a few thoughts about what would happen if we required a !! for methods which mutate column vectors:

For allowmissing!, disallowmissing! and categorical!, this is perfectly fine since by definition these methods replace some columns with new vectors.
For deletecols!, insertcols!, permutecols!, names! and rename! it's also fine since these affect the data frame but not the vectors.
For push! and append! it would be both a good and a bad thing. Good, since it would allow changing the type of the columns if needed using promote_type, like vcat. Bad, since by default we would make a copy on each addition. Since that's not consistent with what happens for vectors, people won't necessarily think about push!! and append!!.
For filter!, deleterows!, unique! and dropmissing! this convention would also be a bit annoying but safer, since they would replace column vectors with copies before mutating them. Variants with !! would be needed to actually mutate vectors.
For sort! it would be quite weird to allocate new vectors before sorting them. I guess we could say that ! functions are allowed to mutate column vectors as long as they don't resize them. But even that is very dangerous if these vectors are used in another data frame (and you may well not notice it).

Overall I guess the main argument in favor of changing the current behavior would be to make it less appealing to use in-place operations (in the strong sense of operations that mutate column vectors). Indeed currently it looks nicer to do sort!(df, :x) than df = sort(df, :x). OTOH it's one of the strengths of Julia that you can easily avoid copies (and worse, the compiler isn't as good as R currently if you don't use in-place operations since it doesn't do copy-on-write to avoid copies where possible).

Crazy idea: maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a DataFrame finalizer). When doing operations which would mutate column vectors used elsewhere, we could throw an error, or at least print a warning to protect again data corruption (safety matters a lot, and in R it's kind of implicit due to copying semantics). Of course, that wouldn't protect you from things like sort!(df.x), but it's more obvious that this kind of thing is a bad idea anyway.

nalimilan on 13 Feb 2019

👍1

First let me reply to @oxinabox:

0 Operations that just add columns. These are always safe

The problem is that these operations are chiefly setindex! and setproperty and they can either add a column or mutate an existing column and you only know which thing they do at run time.

1 basic mutating options that just change the values in existing columns. These are always safe

They are not always safe if you have created as GroupedDataFrame based on some AbstractDataFrame.

I wonder if we don't actually want a new Array type that is guarded against mutation

Actually it was recently proposed, https://github.com/bkamins/ReadOnlyArrays.jl, but did not get much traction (so for the time being this idea is on hold). Although we intended to use it for a bit different purpose.

Now @nalimilan:

At least in my comment above I (meant to) propose the first interpretation.

I think it is best to discuss concrete methods (what you do below), as this avoids ambiguity.

Looking at your list I think we can stay with only ! (as most of the time the conclusion is that !! would be annoying) and concentrate on improving the documentation and add methods without ! (so that there is always an option to do a super-safe thing). In particular:

For allowmissing!, disallowmissing! and categorical!: add information in the documentation that they create new vectors ONLY IF NEEDED; add a methods for allowmissing, disallowmissing, categorical that support AbstractDataFrame as argument (and decide if they should do copy when this operation would be no-op or not).
For deletecols!, insertcols!, permutecols!, names! and rename!: decide if we want to have deletecols, insertcols, permutecols (we already have rename).
For push! and append!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; I would promote using vcat for other use-cases (EDIT we should add required functionalities to vcat and reference it in the documentation; the decision is if we want to add push method that is like vcat but adds one row)
For filter!, deleterows!, unique! and dropmissing!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; I would add deleterows method for completeness (all other methods we already have)
For sort!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; we have sort variant already

Whenever I say ADD A STRONG WARNING this should include:

explaining the risks
pointing at methods without ! as safe alternatives (that is why I proposed to add them everywhere where they are missing)

maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a DataFrame finalizer).

I do not think this would work well as there is no guarantee when GC runs the finalizer.

bkamins on 13 Feb 2019

Unfortunately docs can only help mitigating issues, but they are not really a fix (and many users don't read docstrings until they have a problem). The fact that you've been trapped by dropmissing! shows that even knowing the problem doesn't fully protect from it. I'm particularly concerned by sort!, which could give a bad reputation to Julia if some people screw up their analyses with that.

Another crazy idea: change df[cols] to copy the column vectors by default, and require @view/view to get a SubDataFrame. That would be consistent with what getindex does with arrays (remember that the design of DataFrames predates the existence of array views in Julia).

nalimilan on 13 Feb 2019

I agree and we have two ways to go here:

what I have proposed above: define functions without ! for every operation and advocate using them; have functions with ! and warn users not to use them unless they know what they are doing
the !! option: make functions without ! and with ! safe (i.e. would not mutate vectors, ! works in the same DataFrame but creates new vectors) and !! variants would mutate vectors

Actually looking at this and thinking about it more - I start to like !! option after all as !! is a really clear sign of something dangerous. So +1 for your proposal, even if it would be unwieldy in some cases.

We still have another issue here (that I have sprinkled in several places in my comments): some functions (like e.g. EDIT: ~dropmissing~ disallowmissing) return the original vector when no action is needed and create a new vector in other cases (this is a convert vs constructor distinction in Base). Any thoughts about this issue (apart from the fact that it should be clearly documented). If we go the way I feel we started moving towards - they should always create a new vector if they create a new data frame.

Another crazy idea: change df[cols] to copy the column vectors by default

If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):

In the first step disallow df[col] and df[cols] indexing altogether (and retain only df.col to get a source vector and df[:, col] and df[:, cols] and @view)
In the second step (probably after at least 1 year) reintroduce them as df[row] or df[rows] (so in the long term single index would select rows not columns which would be consistent with row-based interpretation of a DataFrame)

bkamins on 13 Feb 2019

I agree and we have two ways to go here:
1. what I have proposed above: define functions without `!` for every operation and advocate using them; have functions with `!` and warn users not to use them unless they know what they are doing

2. the `!!` option: make functions without `!` and with `!` safe (i.e. would not mutate vectors, `!` works in the same `DataFrame` but creates new vectors) and `!!` variants would mutate vectors
Actually looking at this and thinking about it more - I start to like !! option after all as !! is a really clear sign of something dangerous. So +1 for your proposal, even if it would be unwieldy in some cases.

I didn't say I'm completely convinced by my proposal. ;-) Let me think about it...

We still have another issue here (that I have sprinkled in several places in my comments): some functions (like e.g. disallowmissing) return the original vector when no action is needed and create a new vector in other cases (this is a convert vs constructor distinction in Base). Any thoughts about this issue (apart from the fact that it should be clearly documented). If we go the way I feel we started moving towards - they should always create a new vector if they create a new data frame.

Yes, disallowmissing should probably do that by default when we introduce it (https://github.com/JuliaData/DataFrames.jl/issues/1720).

If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):

1. In the first step disallow `df[col]` and `df[cols]` indexing altogether (and retain only `df.col` to get a source vector and `df[:, col]` and `df[:, cols]` and `@view`)

2. In the second step (probably after at least 1 year) reintroduce them as `df[row]` or `df[rows]` (so in the long term single index would select rows not columns which would be consistent with row-based interpretation of a `DataFrame`)

That's indeed appealing. But we need to provide a syntax to access a vector without copy when it's not a literal symbol: is getproperty(x, col) OK? Or should we export another function?

nalimilan on 13 Feb 2019

getproperty(x, col) OK? Or should we export another function?

We could add getcol for example, but I am not sure if actually we need it as we have getproperty as you have noticed.

This is needed also in cases when you want to have a programmatic access to the column (e.g. in a function when column name is passed as a Symbol via a parameter).

bkamins on 13 Feb 2019

If we introduce a rule that transformation of AbstractDataFrame that produces a DataFrame never reuses the column from the source actually the need for !! is a bit lower I think as the probability that two data frames will share the same underlying vectors is low.

Note that you still will be able to reuse columns, but this will require a special treatment, in which case I guess the user will know what one is doing, e.g. this would still reuse the columns from source:

DataFrame(eachol(df, false), names(df))

bkamins on 13 Feb 2019

If we introduce a rule that transformation of AbstractDataFrame that produces a DataFrame never reuses the column

In particular we should remember to decide about copy(::DataFrame) and DataFrame(::DataFrame).

bkamins on 13 Feb 2019

Just to recap my understanding of the desirable policy from Slack:

we introduce the concept of column ownership by a DataFrame which means that DataFrame assumes that it can safely mutate the object as long as it stays internally consistent;
we do not perform copy-on-create of vectors passed to a DataFrame for performance reasons; from the moment of its creation an DataFrame object considers itself as an owner of the columns
when SubDataFrame, DataFrameRow or GroupedDataFrame is created they are not owners and may be corrupted if the owner is mutated (conclusions: users should be warned not to mutate parent DataFrame if they use one of these views). In practice this means the views should be used only for short-term operations and not passed around; also view creation is now fully flexible and fast in DataFrames.jl so they are the preferred method to subset a DataFrame if you care about performance for short-term operations.
If one uses eachcol one has to be careful not to mutate the columns one gets (especially not to resize them). The same with direct column access via getproperty
finally if an operation that takes an AbstractDataFrame creates a fresh DataFrame it always allocates fresh columns (so that the newly returned DataFrame can assume that it has the ownership of its columns) (edited)
we could drop the df[col] and df[cols] column getindex/setindex! methods as they are dangerous (and leave only getproperty and setproperty!) - ADDITION after thinking about it I think we can leave them as they might be useful in some cases, as long as they behave as expected (i.e. df[cols] performs a copy of the columns)
ADDINION - if we go this way then we do not need a ! and !! distinction I think.

I am summarizing this again, because I think we will not get much new input in this discussion and we should decide if we should go this way (then I will, in particular, update https://github.com/JuliaData/DataFrames.jl/pull/1646 and a PR cleaning up whole code-base to follow "ownership" rule should be implemented).

bkamins on 17 Feb 2019

Regarding point 6:

6. we could drop the df[col] and df[cols] column getindex/setindex! methods as they are dangerous (and leave only getproperty and setproperty!) - ADDITION after thinking about it I think we can leave them as they might be useful in some cases, as long as they behave as expected (i.e. df[cols] performs a copy of the columns)

I think we should either deprecate df[cols] (I admit I can get used to df[:, col]), or keep it, but I think having df[col] and df.col having different behaviors, with one copying and one not, would be confusing for the user.

pdeffebach on 17 Feb 2019

If we left df[col] I think it would do the same as df.col - i.e. return the vector without copying it (except that then col in df[col] can be an integer and any symbol). This is consistent with the rules above as df[col] returns a vector not a DataFrame.

bkamins on 17 Feb 2019

If we go towards the "ownership" rule we should make sure how we handle stackdf and meltdf.

bkamins on 20 Feb 2019

Regarding earlier issues to sum up wat I think. We would have:

df[col] that returns a vector without a copy
df[cols] that returns a DataFrame making a copy (because it returns a DataFrame) (change)
sub_df[col] that returns a view
sub_df[col] that returns a SubDataFrame (so no copy is made as SubDataFrame does not claim to have an ownership)

And we would have an inconsistency between points 2 and 4. If this is OK, then we can leave this syntax. Any thoughts on this?

If in the future GroupedDataFrame would be <:AbstractDataFrame we will have to make similar decisions there.

bkamins on 20 Feb 2019

We also have some wiggle room with df[:, cols] vs. df[cols]. Perhaps one could be a copy and one would not? Or one be a sub-dataframe and one not?

pdeffebach on 20 Feb 2019

Thank you for the comments (this issue is bugging all of us for some time with no ideal idea what to do 😄).

Perhaps one could be a copy and one would not?

This is the current design and it is fully mentally consistent. The problem is that it is safe and even me (who has implemented this) keep forgetting that df[cols] is dangerous.

Or one be a sub-dataframe and one not?

Making df[cols] return a SubDataFrame is a possible solution to our problem. This would be a second exception in pair with df[row, cols] which also returns a view. Then only df[rows, cols] would return a DataFrame. However, I think that this is more risky than defining that df[cols] is equivalent to df[:, cols]. @nalimilan - what is your opinion?

bkamins on 20 Feb 2019

I'd find it a bit weird to have df[cols] return a SubDataFrame. The df[row, cols] exception is really just because technically we cannot make it fast with named tuple, better not add more exceptions. Also we already have view to get a SubDataFrame.

nalimilan on 20 Feb 2019

So do we make df[cos] make a copy or better deprecate it altogether? (making a copy is safe and convenient, but deprecating it will give us more freedom for the future after 1.0 release). I would vote for making a copy.

bkamins on 20 Feb 2019

Deprecate it all together.
Keep df[:, cols]

oxinabox on 20 Feb 2019

👍1

OK, so the conclusion is to:

deprecate df[cols];
I would leave df[col] though after trying to work without using it for some time. The problem is that writing getproperty(df, col) and setproperty(df, col, something) is really clumsy and this operation is often needed when col is a variable.
we make any operation on DataFrame that creates a new DataFrame make copies of columns (we introduce a column ownership concept);
the only exceptions are stackdf and meltdf which operate on views of the source DataFrame, but if someone uses these functions I guess one knows the consequences;

I will ask on discourse to vote up/down this proposal here and if gets accepted implement a PR.

bkamins on 23 Feb 2019

👍2

I still haven't been able to make up my mind on this... I can't help thinking we will regret this wart if we keep df[cols] and df[col] in 1.0 despite deciding that data frames are collections of rows, but they are so convenient... At least I think that if we remove df[cols] we should also remove df[col] (and vice-versa).

You have a point about setproperty! being inconvenient. Of course there's df[:, col] = ... but it makes a copy. Or maybe we could see it as a feature that the safe behavior has the most convenient syntax? Or should we make df[:, col] not copying?

I agree about the rest of the plan. stackdf and meltdf have always looked like sub-optimal API to me, we should probably find another syntax before 1.0 (https://github.com/JuliaData/DataFrames.jl/issues/1736).

nalimilan on 23 Feb 2019

👍1

Or should we make df[:, col] not copying?

It should copy IMHO.

I can't help thinking we will regret this wart if we keep df[cols] and df[col] in 1.0 despite deciding that data frames are collections of rows

Personally I would drop them both, but I am afraid that people that do not know the details of DataFrames.jl might find it problematic.

We need the syntax for the following operations if col is a variable:

get the column col from the DataFrame as is; currently it is df[col] and getproperty(df, col) (direct access - no copy)
set the column col in the DataFrame without copying; currently it is df[col] = val and setproperty!(df, col, val)`.

These operations are quite common in practice. I am also not sure what is best. Maybe having only setproperty! and getproperty is good enough?

bkamins on 23 Feb 2019

IndexedTables has a specific select function to select columns: select(df, col) or select(df, cols) (with the extra advantage that one can pass some fancier object for example select(df, ::Regex) would keep all columns that match the Regex).

The main concern that I see with using select is that in IndexedTables the table is immutable and you can't reassign a column to another array, so it's not clear how one would combine the two things.

Keeping that in mind the main alternatives to get/setproperty! seem either some get/setcol! or asking to wrap a DataFrame into a collection of columns object on which getindex selects the column. Either some cols(df)[:mycol] or even transpose (to go from row object to column object and vice-versa) would make some sense in my mind: df'[:mycol] though the latter feels a bit more experimental.

piever on 23 Feb 2019

Yeah, whether we call the function select, getcol, columns or just use the existing getproperty method, the result is more or less the same. Using a columns wrapper is interesting since it would allow assignment via cols(df)[:mycol] = v or df'[:mycol] = v, but I'm not sure that would really make things more convenient than sdetproperty(df, :mycol, v). In the same vein, we could define a custom object similar to : to allow df[custom, :mycol] to behave like df[col] currently, but that wouldn't be obvious either.

The main concern that I see with using select is that in IndexedTables the table is immutable and you can't reassign a column to another array, so it's not clear how one would combine the two things.

What do you mean?

nalimilan on 23 Feb 2019

The main concern that I see with using select is that in IndexedTables the table is immutable and you can't reassign a column to another array, so it's not clear how one would combine the two things.

What do you mean?

I wrote things in a very confusing way. I simply meant that select is a good name to select one or more columns from a DataFrame but it can't be used as a replacement of df[:mycol] = v: this was never a concern for IndexedTables as replacing a column with another vector is not allowed there (you'd need to create a new table that shares some columns with the old one).

piever on 23 Feb 2019

I agree select is a good name. Here the only issue is where we should "source" it from to extend methods for it.

Maybe we could use mutate! or transform! for setproperty!. This should be non-conflicting AFAICT (but we should confirm it and choose a more non-conflicting option).

If we agree to this then select could use the API of select from JuliaDB for consistency (or a subset of this API). With the rule that select(df, col) returns a vector in col as is, and all other forms return a DataFrame and perform a copy.

Then mutate! could have the following API (that would be reasonably close what we have with by):

mutate!(df; col=value...)
mutate!(df, col=>value...)
mutate!(df, ::Vector{Pair{Symbol,<:AbstractVector}) (here )
mixture of option 2 and 3

(possibly also added Tuple versions of options 3 and 4 - depending if we want to keep tuples in #1620 - in general I would try to be consistent here)

bkamins on 23 Feb 2019

I agree select is a good name. Here the only issue is where we should "source" it from to extend methods for it.

Tables.jl has a select function to select columns so I guess that's a natural candidate (JuliaDB does not extend it though, maybe some coordination is required here).

In terms of signatures I like mutate!(df, col => value...), I think it is the same signature as setcol in JuliaDB.

piever on 23 Feb 2019

I always forget that Tables is not re-exported and exports nothing 😢.

As a side note select would be copying while Tables.select:

does not perform a copy, but passes columns as-is;
Tables.select(df, col) after materializing is a table;

Of course we do not have to be 100% consistent here, but just to remember.

EDIT Just the question is if we find that select(df, col) returns a raw vector not a one column data frame.

bkamins on 23 Feb 2019

Tables.jl has a select function to select columns so I guess that's a natural candidate (JuliaDB does not extend it though, maybe some coordination is required here).

Having thought about that it is exactly what we should not do either, like JuliaDB.jl. The reason is that Tables.select is catch-all so we would override its current behavior. Maybe simply we should define select as stand alone in DataFrames.jl?

bkamins on 23 Feb 2019

Why not support select, but that discussion sounds quite orthogonal to the df[col] issue, as select(df, col) isn't really more convenient than getproperty(df, col). mutate! is also interesting, but mutate!(df, col=>v) isn't simpler than setproperty!(df, col, v) either.

nalimilan on 24 Feb 2019

In my opinion the difference is that select and mutate! would do this job and be much more flexible at the same time. So user learns to use select and mutate! in general and does not have to bother about getproperty and setproperty! functions (if one wants one can still use them as they are now defined anyway).

Or put it differently: I am afraid that everything except leaving df[col] defined will be at least equally complex as getproperty and setproperty! syntaxwise.

bkamins on 24 Feb 2019

👍1

Of all of these I like cols(df)[:x] and cols!(df)[:x] a decent amount. One for copying and one for not copying. We could have setindex for each of them too.

pdeffebach on 24 Feb 2019

I have just looked through internals of DataFrames.jl. We actually use df[col] quite a bit (both to set and to get). df[cols] is not used much.

bkamins on 27 Feb 2019

@nalimilan, @oxinabox:

Maybe we can make a two-step approach here if we are not clear what to do:

for now implement the "ownership" concept for DataFrame (I can make a PR) and leave df[col] and df[cols] working?
later decide if we want to deprecate df[col] and df[cols] before 1.0 release?

bkamins on 2 Mar 2019

Yes, I feel like the first and second are unrelated.
Or at least related only by the question of which ops release ownership and which don't.

oxinabox on 2 Mar 2019

OK - I will go ahead with 1, and let us see how we like it.

bkamins on 2 Mar 2019

Ownership PR is in https://github.com/JuliaData/DataFrames.jl/pull/1742

bkamins on 2 Mar 2019

Given the recent discussion #1742 should be closed, and I see the target functionality the way I have described in #1749 for @kescobo:

DataFrames are allowed to share underlying vectors (so we drop column "ownership" concept)
all ! functions do not modify underlying vectors but allocate new vectors if needed
the exception is setindex!, which may modify underlying vectors in some cases (we should add a clear documentation of this behavior)
We add functions that are allowed to modify underlying vectors in-place, but they end with !!; push!! and append!! the most important functions here
we add copycolumns kwarg to some functions (e.g. DataFrame constructor, hcat) to force copying of columns on operation (by default it will be false) - this is an optional feature, but can be handy in some situations.
Deprecate df[col] and df[cols] in favor of getcol(df, col) and select(df, cols) (this is indirectly related as both functions expose inner vectors)
In the future df[idx] will select a row and df[idxs] will select rows (but we should probably wait with this at least 6 months)

bkamins on 11 Mar 2019

If we were to accept safety-first policy then this could be a frame of the target functionality:

DataFrame can optionally take ownership of the columns it contains
this means that DataFrame constructor and insert_single_column! (and all methods that call these two methods) shall have added additional copycolumns::Bool kwarg that is true by default, but can be changed to false to avoid copying (the kwarg name is tentative)
if we went this way no other things in DataFrame.jl need to be changed - all functions in DataFrames.jl assume that they can safely modify the data frame in any way they like as copycolumns=false has to be explicit, so a user has to opt-in for it.

The only problem is with DataFrame(kwargs...) constructor, so we should choose a name for copycolumns kwarg that has low-conflict probability and at the same time it is not very verbose.
Maybe copycolumns should be a positional argument?

bkamins on 14 Mar 2019

The only problem is with DataFrame(kwargs...) constructor, so we should choose a name for copycolumns kwarg that has low-conflict probability and at the same time it is not very verbose.
Maybe copycolumns should be a positional argument?

We could also deprecate DataFrame(a = 1, b = 2) so scalars don't work, ensuring no conflict about copycolumns = true.

pdeffebach on 14 Mar 2019

Yes. The problem is that would still be a conflict if some wanted to write copycolumns=[1,2,3].
Also we would have to think if we would want to disallow scalar broadcasting in this case (I am not sure).

bkamins on 14 Mar 2019

We could use a different name for the copying and noncopying constructor.
Outer constructors are basically just functions.

DataFrame!! is a legal function name.

As is unsafe_DataFrame
as is DataFrameView
(I think we can hook into @views even?)

oxinabox on 14 Mar 2019

The keyword argument constructor is mostly intended to build data frames from literal names, so it could be OK to prevent creating columns called copycolumns. People should use DataFrame(cols, names) for generic operations. (Anyway copycolumns isn't a very likely column name.)

DataFrame! is an interesting option, but that pattern isn't used anywhere AFAIK. See also https://github.com/JuliaLang/julia/issues/24388#issuecomment-341908329.

nalimilan on 14 Mar 2019

The DataFrame!! suggestion is also opposite what Bugamil was saying with having the default for copycolyumns = true. Now the more complicated function name is the only one that matches dplyr in terms of performance.

pdeffebach on 14 Mar 2019

Given what we say maybe this is a general rule we could use for functions operating on DataFrames:

function without !, including DataFrame: source DataFrame is not modified and return value does not reuse columns form source DataFrame;
function with a single !: may modify a source DataFrame but do not create a new DataFrame (a pure in-place operation); even under this rule we may still decide that filter! and sort! etc. do create new verctors;
function with a double !!: may modify a DataFrame and return a new DataFrame that reuses the columns.

In particular we would have then DataFrame!! (reusing vectors), hcat!! (as opposed to hcat).

For this to work we have to deprecate df[cols] (which we want to do anyway) and add select, select! and select!!, similarly we can have mutate, mutate! and mutate!!.

EDIT The particular cases without !! returning original vectors would be getcol and eachcol (but they do not return a DataFrame but something else and the rule of no ! and !! is for the case when a DataFrame is returned).

The decision we would have to make is if:

df.x returns a column or copies it (probably we want a column returned without copying);
should df.x = y do a copy of y at assignment or not (this is relevant if we write df1.x = df2.y in combination with the rule above); also probably it should not copy - so technically it will be possible to have columns reused in two data frames - but probably this is an extreme case

(sorry for giving different options all the time but I want to explore all possibilities before we maka a decision as it is a hard one)

bkamins on 14 Mar 2019

Just to add - the logic behind the proposal above is that we make sure that two data frames can reuse the same column only if:

a function with !! is called (and then we assume that the user knows what one is doing)
a column from a data frame is extracted manually and manually added to another data frame (this is a bit more risky as it can go through unnoticed, but is unavoidable unless we totally ruin the workflow of DataFrames.jl usage); also I would assume that such an operation is not very frequent (you would have to work with two data frames "logically" sharing the same rows (so it requires a kind of branch & merge workflow which is probably rare); most of the time people work "lineary" with data frames)

bkamins on 14 Mar 2019

a function with !! is called (and then we assume that the user knows what one is doing)

I am coming around to the !! syntax for mutating the underlying vectors. maybe we should benchmark with and without copying. I'm worried about a flood of help posts asking "why is this slower than R" and we have to introduce this scary syntax to people who are just starting Julia from a less technical language.

pdeffebach on 14 Mar 2019

I am coming around to the !! syntax for mutating the underlying vectors.

Actually this is what I (and I guess @nalimilan) wanted to propose in the first place.

But this would mean that we drop the "safety first" assumptions (and this was something a lot of people were opposing strongly). But I think it is good to discuss such things without rushing, as we can clearly understand the consequences of different design decisions (e.g. here - if we "copy by default" then things will be slow by default).

bkamins on 14 Mar 2019

👍2

Along the same lines, I found this behavior very surprising:

julia> df = DataFrame(:x => [1,2,3])
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> copy(df)[:x] = 4
4

julia> df
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 4     │
│ 3   │ 4     │

My mental model was that df[:x] = 4 was a shortcut for df[:x] = fill(4, size(df, 1)), not if haskey(df, :x) df[:x] .= 4 else df[:x] = fill(4, size(df, 1)) end.

cstjean on 15 Mar 2019

Yes, that's a problem but it's surprising even without copy: we just need to implement .=/broadcast! and deprecate this (to have it later create/replace with a new column). See https://github.com/JuliaData/DataFrames.jl/issues/1645.

nalimilan on 15 Mar 2019

👍1

Exactly, but I am waiting with #1645 (and also #1737 and #1735) till we decide how we handle the issue in this thread.

@nalimilan - there were several possible design options considered in this thread (and several other places). Could I please ask you to give your current thinking about it? (my general thinking is that whatever we choose to do we should minimize exceptions or number of rules one has to remember; like the list in the answer to https://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy - hopefully our could be simpler)

bkamins on 15 Mar 2019

👍1

OK, let's make copies by default and follow your plan at https://github.com/JuliaData/DataFrames.jl/issues/1695#issuecomment-472941450. I'm not sure we want to introduce !! functions in that scenario, as they wouldn't be in-place, but that can be decided later anyway. I agree df.col shouldn't make a copy (it would kill performance in most common operations and it's easy to reason about). The case of df.x = y is more difficult, but if it makes copies, there will be no short alternative syntax to avoid copying; I guess it's fine if it doesn't copy, as it will be consistent with df.x not copying.

nalimilan on 15 Mar 2019

Given https://discourse.julialang.org/t/dataframe-how-to-change-value-of-a-cell-without-knowing-the-row-number/22024/7 my opinion is that when we copy then we should not detect aliases (this is probably what we would have done without thinking about it, but probably it is better to have an explicit decision here).

bkamins on 19 Mar 2019

What do you mean exactly? In that issue AFAICT the problem was due to the fact that insertcols! doesn't make a copy of the vector, right? We could make it copy by default, but then it wouldn't be consistent with df[col] = v, unless we also change that...

nalimilan on 20 Mar 2019

What I want to say (it is a corner case):

let df have columns a and b that are aliases (this will still be possible in the future), i.e. df.a === df.b
we copy df (by copy I mean we also copy columns) to create df2, then we unalias a and b, i.e. isequal(df2.a, df2.b) but df2a. !== df2.b
if we create df3 from df2 without copying columns then still df3.a === df3.b.

(this is probably obvious but I wanted to make it clear - in short we unalias aliases when copying)

EDIT I have edited my comment to make it more precise.

bkamins on 20 Mar 2019

👍1

Sorry to be super late to this conversation, but saw it pinned and so (assuming this is the purpose of pinning), wanted to throw in a couple thoughts on "goals".

First, thank you so much for thinking seriously about. I've long felt that this is one of the biggest problems with tools like pandas, which has very inconsistent rules around this and has just tried to solve it by throwing up (apparently random, given how complicated the rules are for when problems emerge) warnings.
Early in this discussion there was talk of adding strong warnings in the docs where problems arose. It seems like that's been abandoned, but wanted to throw in that I agree that's deeply problematic -- is a language doesn't have a coherent logic you can keep in your head while programming, you're doomed to make mistakes in practice.
I agree with safety / simplicity over performance in the DataFrames package. People who really need in-place mutations and views for performance always have the option of reverting to working with simple arrays in base Julia. I'm a social scientist and educator, and I just find the idea that there are ways my students (not computer scientists) could accidentally corrupt their data through small understandings (which are inevitable) a little horrifying.
- One partial solution to the performance problem is the copy-on-write philosophy used in R. It can be very tricky to properly implement (I tried with pandas at one point and eventually gave up), but if someone with more software development skill than I can solve it, it can ameliorate a lot of the performance hits to what looks like "copy by default" to users.
So as not to overwhelm users, I'd love for it to be the case that a beginning user (especially one coming from R who is new to these ideas (most of my students)) could get along fine by just not using !! function. Maybe they'd get lower performance, but they could just ignore this issue all together. If we can structure this so people don't have to understand all these nuances, it'll make the package much easier for newbies. I think that's where the current proposals stand, but I will admit even I'm a little confused at the moment :)

nickeubank on 20 Mar 2019

👍1

Thank you for the comment. I think we intentionally wait to get all the feedback we can.

copy-on-write

We would have to have a global pool of vectors stored in DataFrames to do it. It is doable, but at least I was against it as it is not 100% safe anyway (a vector can be extracted out of a DataFrame and manually modified which we are not able to intercept).

In general I read your voice as safety first / speed as an option - and yes I think we are converging towards this decision.

bkamins on 20 Mar 2019

👍1

OK, let's make copies by default and follow your plan

In summary this is what I understand we go (I will start planning to implement it when we are in agreement with some details I mention below):

function without ! (including DataFrame) by default have copycols kwarg set to true: source DataFrame is not modified and return value does not reuse columns form source DataFrame;
function without ! (including DataFrame) with copycols kwarg set to ~true~ false (we add this kwarg only for the functions that it makes sense - for some functions it is obvious that they have to copy so they will not have this kwarg): source DataFrame is not modified and return value does reuse columns form source DataFrame; ~~(this is a decision point so please comment if we go this way before I start implementation)~~
In particular all DataFrame constructors take copycols kwarg and do copy by default (i.e. we only ensure that by default data frames do not share vectors between them but they can share vectors with some other variables pointing to them) (this is a decision point so please comment if we go this way before I start implementation)
function with a single !: may modify a source DataFrame but do not create a new DataFrame (a pure in-place operation); even under this rule we may still decide that filter! and sort! etc. do create new verctors;
df.x does not copy, nor df.x = y does copy if y is a vector
we deprecate df[col] and df[cols] and introduce getcol, setcol! and select and mutate! functions (this is a tentative list - the details will be worked out in a separate PR); in general the idea is that the first two work on single columns (and return a vector) and the latter on multiple columns (and return a DataFrame)

The general rule is that if we:

extract a single column from a DataFrame we do not copy by default (this is essentially exposed by getindex, getproperty, select and eachcol functions, and indirectly by setproperty! function)
extract possible multiple columns (including a collection of one element) then we do copy by default

If we have this settled then #1737 and #1735 can be finished.

The starting point of implementation of these rules will be an update of https://github.com/JuliaData/DataFrames.jl/pull/1742.

After all this is finished the next step will be work on #1645.

bkamins on 22 Mar 2019

👍1

function without ! (including DataFrame) with copycols kwarg set to true: source DataFrame is not modified and return value does reuse columns form source DataFrame;

Rather than copycols, I think the keyword should be reusecols or sharecols. Normally "copy" implies making a new, distinct copy of something, and this is the opposite.

I'm agnostic on Point 3. Feels a little strange to allow that, but I'm ok with it.

Could you please clarify point 5?

nickeubank on 22 Mar 2019

Actually, more general point, can we define terms? I would propose:

"copy": make a NEW copy. like what copy does.
"reuse" or "share": use the same object again (so modifications to columns in one dataframe will also change the data in the other).

nickeubank on 22 Mar 2019

Normally "copy" implies making a new, distinct copy of something, and this is the opposite.

It was a typo. Fixed. I meant that copy is a copy. But we can use reusecols or sharecols if people prefer (I am not a native speaker).

Actually, more general point, can we define terms? I would propose:

With "copy" we are on the same page. As for "reuse" or "share" I think of this as "alias" (but this might be again the issue of me not being a native speaker, so I am OK with any of these, but aliasing is what we actually want to avoid by the design).

I'm agnostic on Point 3. Feels a little strange to allow that, but I'm ok with it.

We can change this to force make copies (again - this is something I am not sure what is best). But you can put a "reused/shared/aliased" column in a DataFrame any way by other means. Therefore the idea of the proposal above is that we ensure that DataFrames do not "reuse/share/alias" columns between themselves by default, but they can do with other objects. (I thought it was a general consensus we go this way).

Could you please clarify point 5?

This means that by writing df.x = df2.y you actually "reused/shared/aliased" column between two DataFrames as now :x in df and :y in df2 are the same object. In the future to avoid such sharing you will write df.x = df2[:, :y] or df.x = copy(df2.y) or df.x = copy(getcol(df2, :y)).

bkamins on 22 Mar 2019

Great, ok. I’m fine with copycols now. :)

And copycols defaults to true for non-! Functions?

nickeubank on 22 Mar 2019

Yes - the default is true (I also add it for clarity above).

bkamins on 22 Mar 2019

👍1

Here is a case to consider.

Let's say setcol(df, c::Symbol) = a becomes a thing and this is used to guarantee a copy when a gets placed in a dataframe. And we have, intuitively,

setcol: copy
setcol!: no copy
getcol: copy
getcol!: no copy

This means that setcol(df1, :x) = getcol(df2, :x) .+ 1 does a redundant copying, right? They should use setcol(df, :x) = getcol!(df2, :x) to get equally safe code without an extra copy.

This seems like a confusing 2 by 2 for the user to remember.

pdeffebach on 23 Mar 2019

@pdeffebach Thank you for the comment.

I was considering something similar. The problem with such a proposal is exactly the mental overhead + the fact that by default users will use getproperty and setproperty! (the getcol function is for the cases when column name is a "weird symbol" that cannot be used with df.x form). Also setcol would break the rule that we add ! to functions that mutate the object.

So I think it is simpler to think that we "guarantee copy by default" if we do X to Y operation where both X and Y are DataFrame and in all other cases we do not make this guarantee (the rest of the cases are mostly when X or Y is a vector).

bkamins on 23 Mar 2019

I'm fine with that plan. Two remarks, though:

Regarding 3, it may be too complex to have constructors make a copy when a DataFrame is passed, but not when vectors are passed directly. So maybe we should always copy or never, unless we can find a simple way to express the rule (e.g. "we never copy when vectors are passed or extracted individually, i.e. separately from a data frame").
Regarding 6, I'm still not sure we need getcol in addition to getproperty and select. I guess getcol could be useful if select(df, col) returns a copy by default. But IIRC in IndexedTables it doesn't make a copy, which kind of makes sense since that would be consistent with df.col.

Regarding setcol: setcol(df, c::Symbol) = a isn't valid syntax (it defines a setcol method), so we have to use setindex! and setproperty!.

nalimilan on 23 Mar 2019

"we never copy when vectors are passed or extracted individually, i.e. separately from a data frame"

This is exactly the rule I have in mind - and 3 follows this rule. Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did). Also - the general thinking should be that DataFrame constructor normally does not copy. The constructor DataFrame(::DataFrame) in practice will be almost never used in "standard" data analysis workflows I think.

Regarding 6

You are right, we can live with select only given select(df, col) returns a vector.

There was some discussion to make select(df, col) return a single column DataFrame (so that select always returns a DataFrame and in this way users can safely use it in their data pipelines). That is why I initially proposed a separate function. But if we agree that select(df, col) returns a single column I will update my post above to reflect this. (my only objection was that then user has to learn that select returns different objects depending on the type of its argument and also depending on this type it either makes a copy or not - this might be confusing for newcomers).

Regarding setcol

setcol(df, col) = a is an invalid syntax, but one could potentially write getcol(df, col) .= a in the future and call setcol(df, col, a) (but again - I think I am against introducing setcol function).

bkamins on 23 Mar 2019

"we never copy when vectors are passed or extracted individually, i.e. separately from a data frame"

This seems like a really nice rule. Clean, and simple. And because we're all (hopefully) familiar for when individual columns can be mutated (base julia rules), that feels relatively safe.

Constructors

Even if we adopt a "when working with vectors individually, we don't copy" rule, it's not quite clear how that applies in a constructor (where one may be passing several vectors at one). Is this really a "working with an individual column" situation: DataFrame(x=1:20, y=30:40) ?

I am inclined to agree that maybe constructors should always copy be default. Making a special case for DataFrame(df::DataFrame) feels a little like a little gotcha...

nickeubank on 23 Mar 2019

Sorry, overlapping comments. But re: @bkamins

Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did).

As I was trying to say but maybe wasn't coherent in saying, I don't think that a constructor feels like "working with individual columns", so I don't think it's a violation to make the copy by default. One could do a one column constructor (DataFrame(x=1:20)), but in general constructors are for many-column operations, so I don't think they violate the rule.

Or if we framed differently, maybe the rule can be "functions that can only get or set a single column never copy". Since DataFrame can set more than one column at once, we're good.

nickeubank on 23 Mar 2019

Regarding setcol: setcol(df, c::Symbol) = a isn't valid syntax (it defines a setcol method), so we have to use setindex! and setproperty!.

Right, sorry about that. I get that setcol is an unnecessary function. (Though if we eventually add a mutate function we will have to revisit this.

I think my point still stands, though, about unnecessary intermediate copying. I wonder if

select(df, col) should never make a copy.
setindex! should always make a copy.

This way , the only time a user would need to worry about messing up a data frame is when they venture out of a data frame. In which case we assume they know what they are doing.

x = df.a
filter!(x)

pdeffebach on 23 Mar 2019

setindex! should always make a copy.

I say no - this would be a performance problem ()

select(df, col) should never make a copy.

This is I guess what the consensus is now (i.e. it returns a single vector that is not a copy)

"functions that can only get or set a single column never copy"

this is a good rule.

So I update my summary above to reflect @pdeffebach and @nalimilan recommendations (i.e. DataFrame copies by default, and we drop getcol idea)

bkamins on 23 Mar 2019

setindex! should always make a copy.
I say no - this would be a performance problem ()

I agree -- it's a ! function, so I think it's ok to not copy.

nickeubank on 23 Mar 2019

I have updated https://github.com/JuliaData/DataFrames.jl/issues/1695#issuecomment-475763711.
The only thing we still need to decide is how do we want users to express the operation:

df.x = y

if x is not a valid identifier (so that df.x will not parse properly). The operation should make column :x in df get value y without copying it.

It could be either:

setcol!(df, :x, y) (essentially insert_single_column! function exposed - maybe with some tweaks)
defer this to a more general mutate! function (that will probably allow to insert multiple columns using e.g. Pairs)

bkamins on 23 Mar 2019

Just to be clear, we need a substitute for df.x = y in case x is not a valid identifier because we will deprecate df[:x] = y.

bkamins on 23 Mar 2019

I would really like it to look as close to df.x = y as possible. I don't want users to be like "well, I would put this in a function, but then I have to change and use some different notation". Going from df$col to df[[x]] is a big annoyance for me in R.

I propose df.sym(x) where sym(c::Symbol) does makes a custom type we can dispatch on. Or use ^ instead of sym or some other valid unicode character.

Ultimately this will make the logic a lot easier as well, as you have dot syntax for non-copying and then some other function syntax for copying. And the logic is preserved with or without literals.

pdeffebach on 23 Mar 2019

👎1

It's also import to realize how much of a pain parentheses are. something like setcol!(df, x, y) makes lengthy map and array comprehensions more annoying.

pdeffebach on 23 Mar 2019

Ugh bummer. I really hate that the df.x behavior will depend on whether :x is already defined. pandas has that problem and drives me nuts, though at least here it won’t be a silent failure (in pandas it “works” in that it just sticks y into a property, but doesn’t make a column).

No hopes of preserving that behavior with the df.x=y syntax with some clever syntactic sugar behind the scenes?

nickeubank on 23 Mar 2019

df.x = y will work also when :x is not defined and then the column will be added. Of course df.x will fail if :x is undefined for reading.

We are only talking about the case when you have a column name that is an invalid identifier. So here all stays unchanged, but because we want to deprecate df[:x] syntax for reading and writing we should define some other means to get column and set column using a function.

For getting a column we have already settled for select(df, :x) syntax.

What is left is the syntax for setting the column. Up to my understanding we have two natural choices:

function working on single column like setcol!(df, v, :x)
a more general function possibly changing many columns, e.g. mutate!(df, :x=>v2, :y=>v2), where single column operation is a special case

bkamins on 23 Mar 2019

Oh -- ok, sorry, please excuse my ignorance. What is an example where we have an invalid identifier? Do you mean something like Symbol("this has space") where you can't use the df-dot (df.) notation because julia can't parse df.this has a space?

In that case, I'm fine with setcol!. Seems descriptive and simple.

[edited for clarity]

nickeubank on 23 Mar 2019

Your understanding is correct 😄.

In fact setcol! would be defined very simply (if we decided to add it):

setcol!(df::DataFrame, col, v) = insert_single_column!(df, col, v)

so we are essentially exposing an already existing inner function.

My only hesitation is related to the decision that we might want to introduce mutate! function, in which case setcol! would be redundant (@nalimilan - any opinion here)?

bkamins on 23 Mar 2019

This is exactly the rule I have in mind - and 3 follows this rule. Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did). Also - the general thinking should be that DataFrame constructor normally does not copy. The constructor DataFrame(::DataFrame) in practice will be almost never used in "standard" data analysis workflows I think.

It would be even simpler to say "constructors do not copy columns". If we don't have common uses cases for DataFrame(::DataFrame), maybe that would be better.

There was some discussion to make select(df, col) return a single column DataFrame (so that select always returns a DataFrame and in this way users can safely use it in their data pipelines). That is why I initially proposed a separate function. But if we agree that select(df, col) returns a single column I will update my post above to reflect this. (my only objection was that then user has to learn that select returns different objects depending on the type of its argument and also depending on this type it either makes a copy or not - this might be confusing for newcomers).

Yes, it may be unexpected for select to return a single column given that it's a term taken from SQL. Another argument for select(df, col) to return a DataFrame would be to also allow select(df, col1, col2) and things like select(df, :col1, z = :col2 => -), for consistency with by/combine and a possible mutate!/transform!.

Anyway this issue is relatively self-contained, as it just concerns convenience methods for getproperty and setproperty!. We should probably start without these, and discuss this in a dedicated issue (this one is already quite long), trying to synchronize with other table packages if possible.

nalimilan on 24 Mar 2019

👍1

If we don't have common uses cases for DataFrame(::DataFrame)

DataFrame(adf) is the function one calls, when adf is a SubDataFrame and you want to get back something that you can mutate safely without changing the original.
It would be weird if that was copying for some AbstractDataFrames, and not copying for others.

oxinabox on 24 Mar 2019

Anyway this issue is relatively self-contained

Agreed. I have written down in the "main" post my current thinking, but let us discuss it in a separate thread (I will open it when we have finished underlying infrastructure implementation)

It would be weird if that was copying for some AbstractDataFrames, and not copying for others.

This is what is a current consensus so 👍. A nice thing is that catch-all DataFrame(x) from other/tables.jl will inherit this behavior automatically 😄.

OK - I will start implementing the agreed functionality. The fist step is copying columns behavior ("data ownership" PR update).

bkamins on 24 Mar 2019

https://github.com/JuliaData/DataFrames.jl/pull/1772 adds column(df, col) as an equivalent of getproperty(df, col). That would replace getcol(df, col), with the advantage that it's consistent with the JuliaDB API.

If we do that, adding setcol!(df, col, v) wouldn't be very consistent. Should it be setcolumn!(df, col, v) or column!(df, col, v) instead? As noted above, we could also use mutate!(df; col => v) or transform!(df; col => v) for that, though the syntax is a bit subtle (; with pair to generate a keyword argument).

nalimilan on 15 Apr 2019

column(df, col) as an equivalent of getproperty(df, col)

Just to be precise column also accepts integer indexing (at least under current design, but I think we should keep it).

I think that, if we add such a function then column! would be a good name. But given we are not sure if we want to deprecate df[col] = v maybe we do not need it. Do you know what name JuliaDB uses for this operation?

bkamins on 15 Apr 2019

I think that, if we add such a function then column! would be a good name. But given we are not sure if we want to deprecate df[col] = v maybe we do not need it.

If we don't deprecate df[col] and df[col] = v, maybe we don't need columns either? If that's only for consistency with JuliaDB, we can add this later once we have stabilized our own API.

Do you know what name JuliaDB uses for this operation?

JuliaDB uses setcol(t, col, v) (there's no in-place equivalent). That function also supports a broader interface which is so flexible that it's equivalent to mutate!/transform!. But I don't really like that approach: the name is singular even if you can set multiple columns, it's not consistent with columns, and the function mixes setcol(t, col, v), setcol(t, col => f) and setcol(t, col => f, col2 => f2) which have very different behaviors despite their similarity.

nalimilan on 15 Apr 2019

That I why in #1772 I wanted to give a reference implementation so that it is clear that column is essentially getindex with the restriction to allow only a single column. I am OK to revert adding column as in my view it was added only for consistency with Julia DB (until we stabilize API we do not know if it is needed or not).

bkamins on 15 Apr 2019

JuliaDB uses setcol(t, col, v) (there's no in-place equivalent). That function also supports a broader interface which is so flexible that it's equivalent to mutate!/transform!. But I don't really like that approach: the name is singular even if you can set multiple columns, it's not consistent with columns, and the function mixes setcol(t, col, v), setcol(t, col => f) and setcol(t, col => f, col2 => f2) which have very different behaviors despite their similarity.

Yes, I also think that's not ideal. Would a better approach be to deprecate setcol(t, col, v) in favor of setcol(t, col => v) and rename it to mutate? I think originally it only supported setting one column, but then it became more feature-rich to the point that the original name no longer makes so much sense.

From what I understand mutate is the correct term here as it keeps the remaining columns, unlike transform which drops them and is actually a special case of IndexedTables.select (which means I should also rename the JuliaDBMeta macro to @mutate I guess).

piever on 15 Apr 2019

Why do you think transform should drop columns? Is that in reference to some implementation? Contrary to "select", nothing in the term "transform" indicates that remaining columns should be dropped (it's essentially a synonym of "mutate" AFAICT).

nalimilan on 15 Apr 2019

From this discourse post, but my intuition agrees with yours, which is why I named the macro @transform in JuliaDBMeta. Maybe a better pair of functions would be:

transform (keep columns)
select (drop columns)

piever on 15 Apr 2019

👍2

I would close this issue. Is there anything left from it on the table? (maybe the transform function, but if we feel it should be added we should open a separate PR/Issue for this)

bkamins on 2 Sep 2019

👍1

So our current policy is that ! functions may either replace or mutate the column vectors, depending on what makes sense: when possible, they mutate the vectors. This has the advantage of being efficient (when introducing !! variants for mutation could have trapped users into using slow functions, especially for push! and append!). See the list I wrote at https://github.com/JuliaData/DataFrames.jl/issues/1695#issuecomment-463204540.

nalimilan on 3 Sep 2019

Dataframes.jl: Policy regarding in-place operations

Most helpful comment

All 99 comments

Related issues