Some in-place operations like append! operate in-place on column vectors, but others like allowmissing! replace vectors. We should check whether this difference is justified and whether it is the most useful behavior.
For example, it might not be very useful for append! to resize column vectors in-place, since that does not save any memory and prevents it from changing the column eltypes. OTOH that could be a useful difference from vcat if one wants to preserve the original eltypes.
Today I hit a problem with dropmissing! related to this. I created a dataframe with a subset of two columns to perform pairwise deletion of missings and it corrupted the original wide DataFrame. I knew what to do (at least I hope I did 😄), but probably newcomers to DataFrames.jl will be caught off-guard here unless we properly document it.
A possible idea: use a ! suffix for operations that mutate the data frame but not the column vectors (the majority), and a !! suffix for operations that also mutate the vectors (which are not that common, and often trap users who don't want to mutate column vectors). That would be both explicit, consistent are still relatively convenient.
In general silently mutating vector type except for df.x = something (in this case it is obvious it can be mutating) should be signaled to the user so !! seems reasonable. As I have said in #1716 an implicit type mutation without signaling can lead to nasty bugs IMO.
Continuing from #1716
So is your proposal to make
push!to do autopromotion (and create a new vector) andpush!!to mutate the vector in place?
This kind of confusion shows why this kind of syntax-y solution is less than ideal. Though I do think it's interesting to think of a syntax that shows clearly the nested structure of a DataFrame
So the idea was that a single ! only mutates the data frame, so it can do anything to the object (including replacing column vectors), but !! can also mutate the column vectors (so it can't change their type). That's not really in contradiction with Base, since the single ! version still doesn't change the type of the data frame, just like push(Int[], 1.0) converts 1.0 to Int since the type cannot change.
That said, I'm not saying that's the best solution.
I agree that there are pros and cons so my idea was to list all the affected functions and then see how many changes we would have to make.
@nalimilan Given the discussion we had yesterday with @oxinabox I think I can explain both arguments behind !! briefly:
!! signals a more dangerous thing than !; with this reasoning a more dangerous thing is to mutate vectors withing a data frame than to replace them by some other vectors (why: because these vectors can be used somewhere else and mutating them might break something)! mutates only data and !! mutates data and metadataProbably both reasonings are valid, but I just wanted to spell out the difference between them.
Now - regarding mutating vectors in place. Here we have a double-risk after the recent changes allowing to propagate @inbounds in getindex. Why - because we do check vector length only at adding/creation time. Later we only check first column for its length and assume that all else matches. Without propagating @inbounds this is OK, but when we started to propagate them you can start getting segfaults (and I guess @oxinabox can confirm this 😄 - although this is probably not the case he has encountered) in strange places (of course due to the error on the programmer side - but the error might be tricky to spot).
I guess really there are 4 kinds of things
2 and 3 are both dangerous.
I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames.
So noone can actually mutate columns except via intended interfaces.
* what I and @oxinabox initially thought: is that `!!` signals a more dangerous thing than `!`; with this reasoning a more dangerous thing is to mutate vectors withing a data frame than to replace them by some other vectors (why: because these vectors can be used somewhere else and mutating them might break something) * what @nalimilan proposes is `!` mutates only data and `!!` mutates data and metadata
@bkamins I don't think that's accurate. At least in my comment above I (meant to) propose the first interpretation. But I think @oxinabox said the reverse on Slack.
I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames.
@oxinabox I'm not sure that would be a good protection nor convenient.
push! or append! to a data frame, except using special functions; but we don't need to use a custom array type for that, we can just do what we want with/in our DataFrame methods. And returning these custom array types from e.g. df.col would be weird for users.I've had a look at our current mutating methods. Here are a few thoughts about what would happen if we required a !! for methods which mutate column vectors:
allowmissing!, disallowmissing! and categorical!, this is perfectly fine since by definition these methods replace some columns with new vectors.deletecols!, insertcols!, permutecols!, names! and rename! it's also fine since these affect the data frame but not the vectors.push! and append! it would be both a good and a bad thing. Good, since it would allow changing the type of the columns if needed using promote_type, like vcat. Bad, since by default we would make a copy on each addition. Since that's not consistent with what happens for vectors, people won't necessarily think about push!! and append!!.filter!, deleterows!, unique! and dropmissing! this convention would also be a bit annoying but safer, since they would replace column vectors with copies before mutating them. Variants with !! would be needed to actually mutate vectors.sort! it would be quite weird to allocate new vectors before sorting them. I guess we could say that ! functions are allowed to mutate column vectors as long as they don't resize them. But even that is very dangerous if these vectors are used in another data frame (and you may well not notice it).Overall I guess the main argument in favor of changing the current behavior would be to make it less appealing to use in-place operations (in the strong sense of operations that mutate column vectors). Indeed currently it looks nicer to do sort!(df, :x) than df = sort(df, :x). OTOH it's one of the strengths of Julia that you can easily avoid copies (and worse, the compiler isn't as good as R currently if you don't use in-place operations since it doesn't do copy-on-write to avoid copies where possible).
Crazy idea: maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a DataFrame finalizer). When doing operations which would mutate column vectors used elsewhere, we could throw an error, or at least print a warning to protect again data corruption (safety matters a lot, and in R it's kind of implicit due to copying semantics). Of course, that wouldn't protect you from things like sort!(df.x), but it's more obvious that this kind of thing is a bad idea anyway.
First let me reply to @oxinabox:
- 0 Operations that just add columns. These are always safe
The problem is that these operations are chiefly setindex! and setproperty and they can either add a column or mutate an existing column and you only know which thing they do at run time.
- 1 basic mutating options that just change the values in existing columns. These are always safe
They are not always safe if you have created as GroupedDataFrame based on some AbstractDataFrame.
I wonder if we don't actually want a new Array type that is guarded against mutation
Actually it was recently proposed, https://github.com/bkamins/ReadOnlyArrays.jl, but did not get much traction (so for the time being this idea is on hold). Although we intended to use it for a bit different purpose.
Now @nalimilan:
At least in my comment above I (meant to) propose the first interpretation.
I think it is best to discuss concrete methods (what you do below), as this avoids ambiguity.
Looking at your list I think we can stay with only ! (as most of the time the conclusion is that !! would be annoying) and concentrate on improving the documentation and add methods without ! (so that there is always an option to do a super-safe thing). In particular:
allowmissing!, disallowmissing! and categorical!: add information in the documentation that they create new vectors ONLY IF NEEDED; add a methods for allowmissing, disallowmissing, categorical that support AbstractDataFrame as argument (and decide if they should do copy when this operation would be no-op or not).deletecols!, insertcols!, permutecols!, names! and rename!: decide if we want to have deletecols, insertcols, permutecols (we already have rename).push! and append!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; I would promote using vcat for other use-cases (EDIT we should add required functionalities to vcat and reference it in the documentation; the decision is if we want to add push method that is like vcat but adds one row)filter!, deleterows!, unique! and dropmissing!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; I would add deleterows method for completeness (all other methods we already have)sort!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; we have sort variant alreadyWhenever I say ADD A STRONG WARNING this should include:
! as safe alternatives (that is why I proposed to add them everywhere where they are missing)maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a
DataFramefinalizer).
I do not think this would work well as there is no guarantee when GC runs the finalizer.
Unfortunately docs can only help mitigating issues, but they are not really a fix (and many users don't read docstrings until they have a problem). The fact that you've been trapped by dropmissing! shows that even knowing the problem doesn't fully protect from it. I'm particularly concerned by sort!, which could give a bad reputation to Julia if some people screw up their analyses with that.
Another crazy idea: change df[cols] to copy the column vectors by default, and require @view/view to get a SubDataFrame. That would be consistent with what getindex does with arrays (remember that the design of DataFrames predates the existence of array views in Julia).
I agree and we have two ways to go here:
! for every operation and advocate using them; have functions with ! and warn users not to use them unless they know what they are doing!! option: make functions without ! and with ! safe (i.e. would not mutate vectors, ! works in the same DataFrame but creates new vectors) and !! variants would mutate vectorsActually looking at this and thinking about it more - I start to like !! option after all as !! is a really clear sign of something dangerous. So +1 for your proposal, even if it would be unwieldy in some cases.
We still have another issue here (that I have sprinkled in several places in my comments): some functions (like e.g. EDIT: ~dropmissing~ disallowmissing) return the original vector when no action is needed and create a new vector in other cases (this is a convert vs constructor distinction in Base). Any thoughts about this issue (apart from the fact that it should be clearly documented). If we go the way I feel we started moving towards - they should always create a new vector if they create a new data frame.
Another crazy idea: change
df[cols]to copy the column vectors by default
If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):
df[col] and df[cols] indexing altogether (and retain only df.col to get a source vector and df[:, col] and df[:, cols] and @view)df[row] or df[rows] (so in the long term single index would select rows not columns which would be consistent with row-based interpretation of a DataFrame)I agree and we have two ways to go here:
1. what I have proposed above: define functions without `!` for every operation and advocate using them; have functions with `!` and warn users not to use them unless they know what they are doing 2. the `!!` option: make functions without `!` and with `!` safe (i.e. would not mutate vectors, `!` works in the same `DataFrame` but creates new vectors) and `!!` variants would mutate vectorsActually looking at this and thinking about it more - I start to like
!!option after all as!!is a really clear sign of something dangerous. So +1 for your proposal, even if it would be unwieldy in some cases.
I didn't say I'm completely convinced by my proposal. ;-) Let me think about it...
We still have another issue here (that I have sprinkled in several places in my comments): some functions (like e.g.
disallowmissing) return the original vector when no action is needed and create a new vector in other cases (this is aconvertvs constructor distinction in Base). Any thoughts about this issue (apart from the fact that it should be clearly documented). If we go the way I feel we started moving towards - they should always create a new vector if they create a new data frame.
Yes, disallowmissing should probably do that by default when we introduce it (https://github.com/JuliaData/DataFrames.jl/issues/1720).
If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):
1. In the first step disallow `df[col]` and `df[cols]` indexing altogether (and retain only `df.col` to get a source vector and `df[:, col]` and `df[:, cols]` and `@view`) 2. In the second step (probably after at least 1 year) reintroduce them as `df[row]` or `df[rows]` (so in the long term single index would select rows not columns which would be consistent with row-based interpretation of a `DataFrame`)
That's indeed appealing. But we need to provide a syntax to access a vector without copy when it's not a literal symbol: is getproperty(x, col) OK? Or should we export another function?
getproperty(x, col)OK? Or should we export another function?
We could add getcol for example, but I am not sure if actually we need it as we have getproperty as you have noticed.
This is needed also in cases when you want to have a programmatic access to the column (e.g. in a function when column name is passed as a Symbol via a parameter).
If we introduce a rule that transformation of AbstractDataFrame that produces a DataFrame never reuses the column from the source actually the need for !! is a bit lower I think as the probability that two data frames will share the same underlying vectors is low.
Note that you still will be able to reuse columns, but this will require a special treatment, in which case I guess the user will know what one is doing, e.g. this would still reuse the columns from source:
DataFrame(eachol(df, false), names(df))
If we introduce a rule that transformation of
AbstractDataFramethat produces aDataFramenever reuses the column
In particular we should remember to decide about copy(::DataFrame) and DataFrame(::DataFrame).
Just to recap my understanding of the desirable policy from Slack:
DataFrame which means that DataFrame assumes that it can safely mutate the object as long as it stays internally consistent;DataFrame for performance reasons; from the moment of its creation an DataFrame object considers itself as an owner of the columnsSubDataFrame, DataFrameRow or GroupedDataFrame is created they are not owners and may be corrupted if the owner is mutated (conclusions: users should be warned not to mutate parent DataFrame if they use one of these views). In practice this means the views should be used only for short-term operations and not passed around; also view creation is now fully flexible and fast in DataFrames.jl so they are the preferred method to subset a DataFrame if you care about performance for short-term operations.eachcol one has to be careful not to mutate the columns one gets (especially not to resize them). The same with direct column access via getpropertyAbstractDataFrame creates a fresh DataFrame it always allocates fresh columns (so that the newly returned DataFrame can assume that it has the ownership of its columns) (edited) df[col] and df[cols] column getindex/setindex! methods as they are dangerous (and leave only getproperty and setproperty!) - ADDITION after thinking about it I think we can leave them as they might be useful in some cases, as long as they behave as expected (i.e. df[cols] performs a copy of the columns)! and !! distinction I think.I am summarizing this again, because I think we will not get much new input in this discussion and we should decide if we should go this way (then I will, in particular, update https://github.com/JuliaData/DataFrames.jl/pull/1646 and a PR cleaning up whole code-base to follow "ownership" rule should be implemented).
Regarding point 6:
6. we could drop the
df[col]anddf[cols]columngetindex/setindex!methods as they are dangerous (and leave onlygetpropertyandsetproperty!) - ADDITION after thinking about it I think we can leave them as they might be useful in some cases, as long as they behave as expected (i.e.df[cols]performs a copy of the columns)
I think we should either deprecate df[cols] (I admit I can get used to df[:, col]), or keep it, but I think having df[col] and df.col having different behaviors, with one copying and one not, would be confusing for the user.
If we left df[col] I think it would do the same as df.col - i.e. return the vector without copying it (except that then col in df[col] can be an integer and any symbol). This is consistent with the rules above as df[col] returns a vector not a DataFrame.
If we go towards the "ownership" rule we should make sure how we handle stackdf and meltdf.
Regarding earlier issues to sum up wat I think. We would have:
df[col] that returns a vector without a copydf[cols] that returns a DataFrame making a copy (because it returns a DataFrame) (change)sub_df[col] that returns a viewsub_df[col] that returns a SubDataFrame (so no copy is made as SubDataFrame does not claim to have an ownership)And we would have an inconsistency between points 2 and 4. If this is OK, then we can leave this syntax. Any thoughts on this?
If in the future GroupedDataFrame would be <:AbstractDataFrame we will have to make similar decisions there.
We also have some wiggle room with df[:, cols] vs. df[cols]. Perhaps one could be a copy and one would not? Or one be a sub-dataframe and one not?
Thank you for the comments (this issue is bugging all of us for some time with no ideal idea what to do 😄).
Perhaps one could be a copy and one would not?
This is the current design and it is fully mentally consistent. The problem is that it is safe and even me (who has implemented this) keep forgetting that df[cols] is dangerous.
Or one be a sub-dataframe and one not?
Making df[cols] return a SubDataFrame is a possible solution to our problem. This would be a second exception in pair with df[row, cols] which also returns a view. Then only df[rows, cols] would return a DataFrame. However, I think that this is more risky than defining that df[cols] is equivalent to df[:, cols]. @nalimilan - what is your opinion?
I'd find it a bit weird to have df[cols] return a SubDataFrame. The df[row, cols] exception is really just because technically we cannot make it fast with named tuple, better not add more exceptions. Also we already have view to get a SubDataFrame.
So do we make df[cos] make a copy or better deprecate it altogether? (making a copy is safe and convenient, but deprecating it will give us more freedom for the future after 1.0 release). I would vote for making a copy.
Deprecate it all together.
Keep df[:, cols]
OK, so the conclusion is to:
df[cols];df[col] though after trying to work without using it for some time. The problem is that writing getproperty(df, col) and setproperty(df, col, something) is really clumsy and this operation is often needed when col is a variable.DataFrame that creates a new DataFrame make copies of columns (we introduce a column ownership concept);stackdf and meltdf which operate on views of the source DataFrame, but if someone uses these functions I guess one knows the consequences;I will ask on discourse to vote up/down this proposal here and if gets accepted implement a PR.
I still haven't been able to make up my mind on this... I can't help thinking we will regret this wart if we keep df[cols] and df[col] in 1.0 despite deciding that data frames are collections of rows, but they are so convenient... At least I think that if we remove df[cols] we should also remove df[col] (and vice-versa).
You have a point about setproperty! being inconvenient. Of course there's df[:, col] = ... but it makes a copy. Or maybe we could see it as a feature that the safe behavior has the most convenient syntax? Or should we make df[:, col] not copying?
I agree about the rest of the plan. stackdf and meltdf have always looked like sub-optimal API to me, we should probably find another syntax before 1.0 (https://github.com/JuliaData/DataFrames.jl/issues/1736).
Or should we make
df[:, col]not copying?
It should copy IMHO.
I can't help thinking we will regret this wart if we keep
df[cols]anddf[col]in 1.0 despite deciding that data frames are collections of rows
Personally I would drop them both, but I am afraid that people that do not know the details of DataFrames.jl might find it problematic.
We need the syntax for the following operations if col is a variable:
col from the DataFrame as is; currently it is df[col] and getproperty(df, col) (direct access - no copy)col in the DataFrame without copying; currently it is df[col] = val and setproperty!(df, col, val)`.These operations are quite common in practice. I am also not sure what is best. Maybe having only setproperty! and getproperty is good enough?
IndexedTables has a specific select function to select columns: select(df, col) or select(df, cols) (with the extra advantage that one can pass some fancier object for example select(df, ::Regex) would keep all columns that match the Regex).
The main concern that I see with using select is that in IndexedTables the table is immutable and you can't reassign a column to another array, so it's not clear how one would combine the two things.
Keeping that in mind the main alternatives to get/setproperty! seem either some get/setcol! or asking to wrap a DataFrame into a collection of columns object on which getindex selects the column. Either some cols(df)[:mycol] or even transpose (to go from row object to column object and vice-versa) would make some sense in my mind: df'[:mycol] though the latter feels a bit more experimental.
Yeah, whether we call the function select, getcol, columns or just use the existing getproperty method, the result is more or less the same. Using a columns wrapper is interesting since it would allow assignment via cols(df)[:mycol] = v or df'[:mycol] = v, but I'm not sure that would really make things more convenient than sdetproperty(df, :mycol, v). In the same vein, we could define a custom object similar to : to allow df[custom, :mycol] to behave like df[col] currently, but that wouldn't be obvious either.
The main concern that I see with using
selectis that in IndexedTables the table is immutable and you can't reassign a column to another array, so it's not clear how one would combine the two things.
What do you mean?
The main concern that I see with using
selectis that in IndexedTables the table is immutable and you can't reassign a column to another array, so it's not clear how one would combine the two things.What do you mean?
I wrote things in a very confusing way. I simply meant that select is a good name to select one or more columns from a DataFrame but it can't be used as a replacement of df[:mycol] = v: this was never a concern for IndexedTables as replacing a column with another vector is not allowed there (you'd need to create a new table that shares some columns with the old one).
I agree select is a good name. Here the only issue is where we should "source" it from to extend methods for it.
Maybe we could use mutate! or transform! for setproperty!. This should be non-conflicting AFAICT (but we should confirm it and choose a more non-conflicting option).
If we agree to this then select could use the API of select from JuliaDB for consistency (or a subset of this API). With the rule that select(df, col) returns a vector in col as is, and all other forms return a DataFrame and perform a copy.
Then mutate! could have the following API (that would be reasonably close what we have with by):
mutate!(df; col=value...)mutate!(df, col=>value...)mutate!(df, ::Vector{Pair{Symbol,<:AbstractVector}) (here )(possibly also added Tuple versions of options 3 and 4 - depending if we want to keep tuples in #1620 - in general I would try to be consistent here)
I agree
selectis a good name. Here the only issue is where we should "source" it from to extend methods for it.
Tables.jl has a select function to select columns so I guess that's a natural candidate (JuliaDB does not extend it though, maybe some coordination is required here).
In terms of signatures I like mutate!(df, col => value...), I think it is the same signature as setcol in JuliaDB.
I always forget that Tables is not re-exported and exports nothing 😢.
As a side note select would be copying while Tables.select:
Tables.select(df, col) after materializing is a table;Of course we do not have to be 100% consistent here, but just to remember.
EDIT Just the question is if we find that select(df, col) returns a raw vector not a one column data frame.
Tables.jl has a
selectfunction to select columns so I guess that's a natural candidate (JuliaDB does not extend it though, maybe some coordination is required here).
Having thought about that it is exactly what we should not do either, like JuliaDB.jl. The reason is that Tables.select is catch-all so we would override its current behavior. Maybe simply we should define select as stand alone in DataFrames.jl?
Why not support select, but that discussion sounds quite orthogonal to the df[col] issue, as select(df, col) isn't really more convenient than getproperty(df, col). mutate! is also interesting, but mutate!(df, col=>v) isn't simpler than setproperty!(df, col, v) either.
In my opinion the difference is that select and mutate! would do this job and be much more flexible at the same time. So user learns to use select and mutate! in general and does not have to bother about getproperty and setproperty! functions (if one wants one can still use them as they are now defined anyway).
Or put it differently: I am afraid that everything except leaving df[col] defined will be at least equally complex as getproperty and setproperty! syntaxwise.
Of all of these I like cols(df)[:x] and cols!(df)[:x] a decent amount. One for copying and one for not copying. We could have setindex for each of them too.
I have just looked through internals of DataFrames.jl. We actually use df[col] quite a bit (both to set and to get). df[cols] is not used much.
@nalimilan, @oxinabox:
Maybe we can make a two-step approach here if we are not clear what to do:
DataFrame (I can make a PR) and leave df[col] and df[cols] working?df[col] and df[cols] before 1.0 release?Yes, I feel like the first and second are unrelated.
Or at least related only by the question of which ops release ownership and which don't.
OK - I will go ahead with 1, and let us see how we like it.
Ownership PR is in https://github.com/JuliaData/DataFrames.jl/pull/1742
Given the recent discussion #1742 should be closed, and I see the target functionality the way I have described in #1749 for @kescobo:
DataFrames are allowed to share underlying vectors (so we drop column "ownership" concept)! functions do not modify underlying vectors but allocate new vectors if neededsetindex!, which may modify underlying vectors in some cases (we should add a clear documentation of this behavior)!!; push!! and append!! the most important functions herecopycolumns kwarg to some functions (e.g. DataFrame constructor, hcat) to force copying of columns on operation (by default it will be false) - this is an optional feature, but can be handy in some situations.df[col] and df[cols] in favor of getcol(df, col) and select(df, cols) (this is indirectly related as both functions expose inner vectors)df[idx] will select a row and df[idxs] will select rows (but we should probably wait with this at least 6 months)If we were to accept safety-first policy then this could be a frame of the target functionality:
DataFrame can optionally take ownership of the columns it containsDataFrame constructor and insert_single_column! (and all methods that call these two methods) shall have added additional copycolumns::Bool kwarg that is true by default, but can be changed to false to avoid copying (the kwarg name is tentative)copycolumns=false has to be explicit, so a user has to opt-in for it.The only problem is with DataFrame(kwargs...) constructor, so we should choose a name for copycolumns kwarg that has low-conflict probability and at the same time it is not very verbose.
Maybe copycolumns should be a positional argument?
The only problem is with
DataFrame(kwargs...)constructor, so we should choose a name forcopycolumnskwarg that has low-conflict probability and at the same time it is not very verbose.
Maybecopycolumnsshould be a positional argument?
We could also deprecate DataFrame(a = 1, b = 2) so scalars don't work, ensuring no conflict about copycolumns = true.
Yes. The problem is that would still be a conflict if some wanted to write copycolumns=[1,2,3].
Also we would have to think if we would want to disallow scalar broadcasting in this case (I am not sure).
We could use a different name for the copying and noncopying constructor.
Outer constructors are basically just functions.
DataFrame!! is a legal function name.
As is unsafe_DataFrame
as is DataFrameView
(I think we can hook into @views even?)
The keyword argument constructor is mostly intended to build data frames from literal names, so it could be OK to prevent creating columns called copycolumns. People should use DataFrame(cols, names) for generic operations. (Anyway copycolumns isn't a very likely column name.)
DataFrame! is an interesting option, but that pattern isn't used anywhere AFAIK. See also https://github.com/JuliaLang/julia/issues/24388#issuecomment-341908329.
The DataFrame!! suggestion is also opposite what Bugamil was saying with having the default for copycolyumns = true. Now the more complicated function name is the only one that matches dplyr in terms of performance.
Given what we say maybe this is a general rule we could use for functions operating on DataFrames:
!, including DataFrame: source DataFrame is not modified and return value does not reuse columns form source DataFrame;!: may modify a source DataFrame but do not create a new DataFrame (a pure in-place operation); even under this rule we may still decide that filter! and sort! etc. do create new verctors;!!: may modify a DataFrame and return a new DataFrame that reuses the columns.In particular we would have then DataFrame!! (reusing vectors), hcat!! (as opposed to hcat).
For this to work we have to deprecate df[cols] (which we want to do anyway) and add select, select! and select!!, similarly we can have mutate, mutate! and mutate!!.
EDIT The particular cases without !! returning original vectors would be getcol and eachcol (but they do not return a DataFrame but something else and the rule of no ! and !! is for the case when a DataFrame is returned).
The decision we would have to make is if:
df.x returns a column or copies it (probably we want a column returned without copying);df.x = y do a copy of y at assignment or not (this is relevant if we write df1.x = df2.y in combination with the rule above); also probably it should not copy - so technically it will be possible to have columns reused in two data frames - but probably this is an extreme case(sorry for giving different options all the time but I want to explore all possibilities before we maka a decision as it is a hard one)
Just to add - the logic behind the proposal above is that we make sure that two data frames can reuse the same column only if:
!! is called (and then we assume that the user knows what one is doing)
- a function with
!!is called (and then we assume that the user knows what one is doing)
I am coming around to the !! syntax for mutating the underlying vectors. maybe we should benchmark with and without copying. I'm worried about a flood of help posts asking "why is this slower than R" and we have to introduce this scary syntax to people who are just starting Julia from a less technical language.
I am coming around to the
!!syntax for mutating the underlying vectors.
Actually this is what I (and I guess @nalimilan) wanted to propose in the first place.
But this would mean that we drop the "safety first" assumptions (and this was something a lot of people were opposing strongly). But I think it is good to discuss such things without rushing, as we can clearly understand the consequences of different design decisions (e.g. here - if we "copy by default" then things will be slow by default).
Along the same lines, I found this behavior very surprising:
julia> df = DataFrame(:x => [1,2,3])
3×1 DataFrame
│ Row │ x │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> copy(df)[:x] = 4
4
julia> df
3×1 DataFrame
│ Row │ x │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 4 │
│ 3 │ 4 │
My mental model was that df[:x] = 4 was a shortcut for df[:x] = fill(4, size(df, 1)), not if haskey(df, :x) df[:x] .= 4 else df[:x] = fill(4, size(df, 1)) end.
Yes, that's a problem but it's surprising even without copy: we just need to implement .=/broadcast! and deprecate this (to have it later create/replace with a new column). See https://github.com/JuliaData/DataFrames.jl/issues/1645.
Exactly, but I am waiting with #1645 (and also #1737 and #1735) till we decide how we handle the issue in this thread.
@nalimilan - there were several possible design options considered in this thread (and several other places). Could I please ask you to give your current thinking about it? (my general thinking is that whatever we choose to do we should minimize exceptions or number of rules one has to remember; like the list in the answer to https://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy - hopefully our could be simpler)
OK, let's make copies by default and follow your plan at https://github.com/JuliaData/DataFrames.jl/issues/1695#issuecomment-472941450. I'm not sure we want to introduce !! functions in that scenario, as they wouldn't be in-place, but that can be decided later anyway. I agree df.col shouldn't make a copy (it would kill performance in most common operations and it's easy to reason about). The case of df.x = y is more difficult, but if it makes copies, there will be no short alternative syntax to avoid copying; I guess it's fine if it doesn't copy, as it will be consistent with df.x not copying.
Given https://discourse.julialang.org/t/dataframe-how-to-change-value-of-a-cell-without-knowing-the-row-number/22024/7 my opinion is that when we copy then we should not detect aliases (this is probably what we would have done without thinking about it, but probably it is better to have an explicit decision here).
What do you mean exactly? In that issue AFAICT the problem was due to the fact that insertcols! doesn't make a copy of the vector, right? We could make it copy by default, but then it wouldn't be consistent with df[col] = v, unless we also change that...
What I want to say (it is a corner case):
df have columns a and b that are aliases (this will still be possible in the future), i.e. df.a === df.bdf (by copy I mean we also copy columns) to create df2, then we unalias a and b, i.e. isequal(df2.a, df2.b) but df2a. !== df2.bdf3 from df2 without copying columns then still df3.a === df3.b.(this is probably obvious but I wanted to make it clear - in short we unalias aliases when copying)
EDIT I have edited my comment to make it more precise.
Sorry to be super late to this conversation, but saw it pinned and so (assuming this is the purpose of pinning), wanted to throw in a couple thoughts on "goals".
pandas, which has very inconsistent rules around this and has just tried to solve it by throwing up (apparently random, given how complicated the rules are for when problems emerge) warnings. !! function. Maybe they'd get lower performance, but they could just ignore this issue all together. If we can structure this so people don't have to understand all these nuances, it'll make the package much easier for newbies. I think that's where the current proposals stand, but I will admit even I'm a little confused at the moment :)Thank you for the comment. I think we intentionally wait to get all the feedback we can.
copy-on-write
We would have to have a global pool of vectors stored in DataFrames to do it. It is doable, but at least I was against it as it is not 100% safe anyway (a vector can be extracted out of a DataFrame and manually modified which we are not able to intercept).
In general I read your voice as safety first / speed as an option - and yes I think we are converging towards this decision.
OK, let's make copies by default and follow your plan
In summary this is what I understand we go (I will start planning to implement it when we are in agreement with some details I mention below):
! (including DataFrame) by default have copycols kwarg set to true: source DataFrame is not modified and return value does not reuse columns form source DataFrame;! (including DataFrame) with copycols kwarg set to ~true~ false (we add this kwarg only for the functions that it makes sense - for some functions it is obvious that they have to copy so they will not have this kwarg): source DataFrame is not modified and return value does reuse columns form source DataFrame; DataFrame constructors take copycols kwarg and do copy by default !: may modify a source DataFrame but do not create a new DataFrame (a pure in-place operation); even under this rule we may still decide that filter! and sort! etc. do create new verctors;df.x does not copy, nor df.x = y does copy if y is a vectordf[col] and df[cols] and introduce getcol, setcol! and select and mutate! functions (this is a tentative list - the details will be worked out in a separate PR); in general the idea is that the first two work on single columns (and return a vector) and the latter on multiple columns (and return a DataFrame)The general rule is that if we:
DataFrame we do not copy by default (this is essentially exposed by getindex, getproperty, select and eachcol functions, and indirectly by setproperty! function)If we have this settled then #1737 and #1735 can be finished.
The starting point of implementation of these rules will be an update of https://github.com/JuliaData/DataFrames.jl/pull/1742.
After all this is finished the next step will be work on #1645.
function without ! (including DataFrame) with copycols kwarg set to true: source DataFrame is not modified and return value does reuse columns form source DataFrame;
Rather than copycols, I think the keyword should be reusecols or sharecols. Normally "copy" implies making a new, distinct copy of something, and this is the opposite.
I'm agnostic on Point 3. Feels a little strange to allow that, but I'm ok with it.
Could you please clarify point 5?
Actually, more general point, can we define terms? I would propose:
copy does. Normally "copy" implies making a new, distinct copy of something, and this is the opposite.
It was a typo. Fixed. I meant that copy is a copy. But we can use reusecols or sharecols if people prefer (I am not a native speaker).
Actually, more general point, can we define terms? I would propose:
With "copy" we are on the same page. As for "reuse" or "share" I think of this as "alias" (but this might be again the issue of me not being a native speaker, so I am OK with any of these, but aliasing is what we actually want to avoid by the design).
I'm agnostic on Point 3. Feels a little strange to allow that, but I'm ok with it.
We can change this to force make copies (again - this is something I am not sure what is best). But you can put a "reused/shared/aliased" column in a DataFrame any way by other means. Therefore the idea of the proposal above is that we ensure that DataFrames do not "reuse/share/alias" columns between themselves by default, but they can do with other objects. (I thought it was a general consensus we go this way).
Could you please clarify point 5?
This means that by writing df.x = df2.y you actually "reused/shared/aliased" column between two DataFrames as now :x in df and :y in df2 are the same object. In the future to avoid such sharing you will write df.x = df2[:, :y] or df.x = copy(df2.y) or df.x = copy(getcol(df2, :y)).
Great, ok. I’m fine with copycols now. :)
And copycols defaults to true for non-! Functions?
Yes - the default is true (I also add it for clarity above).
Here is a case to consider.
Let's say setcol(df, c::Symbol) = a becomes a thing and this is used to guarantee a copy when a gets placed in a dataframe. And we have, intuitively,
setcol: copysetcol!: no copygetcol: copygetcol!: no copyThis means that setcol(df1, :x) = getcol(df2, :x) .+ 1 does a redundant copying, right? They should use setcol(df, :x) = getcol!(df2, :x) to get equally safe code without an extra copy.
This seems like a confusing 2 by 2 for the user to remember.
@pdeffebach Thank you for the comment.
I was considering something similar. The problem with such a proposal is exactly the mental overhead + the fact that by default users will use getproperty and setproperty! (the getcol function is for the cases when column name is a "weird symbol" that cannot be used with df.x form). Also setcol would break the rule that we add ! to functions that mutate the object.
So I think it is simpler to think that we "guarantee copy by default" if we do X to Y operation where both X and Y are DataFrame and in all other cases we do not make this guarantee (the rest of the cases are mostly when X or Y is a vector).
I'm fine with that plan. Two remarks, though:
DataFrame is passed, but not when vectors are passed directly. So maybe we should always copy or never, unless we can find a simple way to express the rule (e.g. "we never copy when vectors are passed or extracted individually, i.e. separately from a data frame").getcol in addition to getproperty and select. I guess getcol could be useful if select(df, col) returns a copy by default. But IIRC in IndexedTables it doesn't make a copy, which kind of makes sense since that would be consistent with df.col.Regarding setcol: setcol(df, c::Symbol) = a isn't valid syntax (it defines a setcol method), so we have to use setindex! and setproperty!.
"we never copy when vectors are passed or extracted individually, i.e. separately from a data frame"
This is exactly the rule I have in mind - and 3 follows this rule. Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did). Also - the general thinking should be that DataFrame constructor normally does not copy. The constructor DataFrame(::DataFrame) in practice will be almost never used in "standard" data analysis workflows I think.
Regarding 6
You are right, we can live with select only given select(df, col) returns a vector.
There was some discussion to make select(df, col) return a single column DataFrame (so that select always returns a DataFrame and in this way users can safely use it in their data pipelines). That is why I initially proposed a separate function. But if we agree that select(df, col) returns a single column I will update my post above to reflect this. (my only objection was that then user has to learn that select returns different objects depending on the type of its argument and also depending on this type it either makes a copy or not - this might be confusing for newcomers).
Regarding
setcol
setcol(df, col) = a is an invalid syntax, but one could potentially write getcol(df, col) .= a in the future and call setcol(df, col, a) (but again - I think I am against introducing setcol function).
"we never copy when vectors are passed or extracted individually, i.e. separately from a data frame"
This seems like a really nice rule. Clean, and simple. And because we're all (hopefully) familiar for when individual columns can be mutated (base julia rules), that feels relatively safe.
Constructors
Even if we adopt a "when working with vectors individually, we don't copy" rule, it's not quite clear how that applies in a constructor (where one may be passing several vectors at one). Is this really a "working with an individual column" situation: DataFrame(x=1:20, y=30:40) ?
I am inclined to agree that maybe constructors should always copy be default. Making a special case for DataFrame(df::DataFrame) feels a little like a little gotcha...
Sorry, overlapping comments. But re: @bkamins
Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did).
As I was trying to say but maybe wasn't coherent in saying, I don't think that a constructor feels like "working with individual columns", so I don't think it's a violation to make the copy by default. One could do a one column constructor (DataFrame(x=1:20)), but in general constructors are for many-column operations, so I don't think they violate the rule.
Or if we framed differently, maybe the rule can be "functions that can only get or set a single column never copy". Since DataFrame can set more than one column at once, we're good.
Regarding setcol: setcol(df, c::Symbol) = a isn't valid syntax (it defines a setcol method), so we have to use setindex! and setproperty!.
Right, sorry about that. I get that setcol is an unnecessary function. (Though if we eventually add a mutate function we will have to revisit this.
I think my point still stands, though, about unnecessary intermediate copying. I wonder if
select(df, col) should never make a copy. setindex! should always make a copy.This way , the only time a user would need to worry about messing up a data frame is when they venture out of a data frame. In which case we assume they know what they are doing.
x = df.a
filter!(x)
setindex!should always make a copy.
I say no - this would be a performance problem ()
select(df, col)should never make a copy.
This is I guess what the consensus is now (i.e. it returns a single vector that is not a copy)
"functions that can only get or set a single column never copy"
this is a good rule.
So I update my summary above to reflect @pdeffebach and @nalimilan recommendations (i.e. DataFrame copies by default, and we drop getcol idea)
setindex! should always make a copy.
I say no - this would be a performance problem ()
I agree -- it's a ! function, so I think it's ok to not copy.
I have updated https://github.com/JuliaData/DataFrames.jl/issues/1695#issuecomment-475763711.
The only thing we still need to decide is how do we want users to express the operation:
df.x = y
if x is not a valid identifier (so that df.x will not parse properly). The operation should make column :x in df get value y without copying it.
It could be either:
setcol!(df, :x, y) (essentially insert_single_column! function exposed - maybe with some tweaks)mutate! function (that will probably allow to insert multiple columns using e.g. Pairs)Just to be clear, we need a substitute for df.x = y in case x is not a valid identifier because we will deprecate df[:x] = y.
I would really like it to look as close to df.x = y as possible. I don't want users to be like "well, I would put this in a function, but then I have to change and use some different notation". Going from df$col to df[[x]] is a big annoyance for me in R.
I propose df.sym(x) where sym(c::Symbol) does makes a custom type we can dispatch on. Or use ^ instead of sym or some other valid unicode character.
Ultimately this will make the logic a lot easier as well, as you have dot syntax for non-copying and then some other function syntax for copying. And the logic is preserved with or without literals.
It's also import to realize how much of a pain parentheses are. something like setcol!(df, x, y) makes lengthy map and array comprehensions more annoying.
Ugh bummer. I really hate that the df.x behavior will depend on whether :x is already defined. pandas has that problem and drives me nuts, though at least here it won’t be a silent failure (in pandas it “works” in that it just sticks y into a property, but doesn’t make a column).
No hopes of preserving that behavior with the df.x=y syntax with some clever syntactic sugar behind the scenes?
df.x = y will work also when :x is not defined and then the column will be added. Of course df.x will fail if :x is undefined for reading.
We are only talking about the case when you have a column name that is an invalid identifier. So here all stays unchanged, but because we want to deprecate df[:x] syntax for reading and writing we should define some other means to get column and set column using a function.
For getting a column we have already settled for select(df, :x) syntax.
What is left is the syntax for setting the column. Up to my understanding we have two natural choices:
setcol!(df, v, :x)mutate!(df, :x=>v2, :y=>v2), where single column operation is a special caseOh -- ok, sorry, please excuse my ignorance. What is an example where we have an invalid identifier? Do you mean something like Symbol("this has space") where you can't use the df-dot (df.) notation because julia can't parse df.this has a space?
In that case, I'm fine with setcol!. Seems descriptive and simple.
[edited for clarity]
Your understanding is correct 😄.
In fact setcol! would be defined very simply (if we decided to add it):
setcol!(df::DataFrame, col, v) = insert_single_column!(df, col, v)
so we are essentially exposing an already existing inner function.
My only hesitation is related to the decision that we might want to introduce mutate! function, in which case setcol! would be redundant (@nalimilan - any opinion here)?
This is exactly the rule I have in mind - and 3 follows this rule. Do you have any other relatively simple rule that we could define? (I initially wanted
DataFrameconstructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did). Also - the general thinking should be thatDataFrameconstructor normally does not copy. The constructorDataFrame(::DataFrame)in practice will be almost never used in "standard" data analysis workflows I think.
It would be even simpler to say "constructors do not copy columns". If we don't have common uses cases for DataFrame(::DataFrame), maybe that would be better.
There was some discussion to make
select(df, col)return a single columnDataFrame(so thatselectalways returns aDataFrameand in this way users can safely use it in their data pipelines). That is why I initially proposed a separate function. But if we agree thatselect(df, col)returns a single column I will update my post above to reflect this. (my only objection was that then user has to learn thatselectreturns different objects depending on the type of its argument and also depending on this type it either makes a copy or not - this might be confusing for newcomers).
Yes, it may be unexpected for select to return a single column given that it's a term taken from SQL. Another argument for select(df, col) to return a DataFrame would be to also allow select(df, col1, col2) and things like select(df, :col1, z = :col2 => -), for consistency with by/combine and a possible mutate!/transform!.
Anyway this issue is relatively self-contained, as it just concerns convenience methods for getproperty and setproperty!. We should probably start without these, and discuss this in a dedicated issue (this one is already quite long), trying to synchronize with other table packages if possible.
If we don't have common uses cases for DataFrame(::DataFrame)
DataFrame(adf) is the function one calls, when adf is a SubDataFrame and you want to get back something that you can mutate safely without changing the original.
It would be weird if that was copying for some AbstractDataFrames, and not copying for others.
Anyway this issue is relatively self-contained
Agreed. I have written down in the "main" post my current thinking, but let us discuss it in a separate thread (I will open it when we have finished underlying infrastructure implementation)
It would be weird if that was copying for some
AbstractDataFrames, and not copying for others.
This is what is a current consensus so 👍. A nice thing is that catch-all DataFrame(x) from other/tables.jl will inherit this behavior automatically 😄.
OK - I will start implementing the agreed functionality. The fist step is copying columns behavior ("data ownership" PR update).
https://github.com/JuliaData/DataFrames.jl/pull/1772 adds column(df, col) as an equivalent of getproperty(df, col). That would replace getcol(df, col), with the advantage that it's consistent with the JuliaDB API.
If we do that, adding setcol!(df, col, v) wouldn't be very consistent. Should it be setcolumn!(df, col, v) or column!(df, col, v) instead? As noted above, we could also use mutate!(df; col => v) or transform!(df; col => v) for that, though the syntax is a bit subtle (; with pair to generate a keyword argument).
column(df, col)as an equivalent ofgetproperty(df, col)
Just to be precise column also accepts integer indexing (at least under current design, but I think we should keep it).
I think that, if we add such a function then column! would be a good name. But given we are not sure if we want to deprecate df[col] = v maybe we do not need it. Do you know what name JuliaDB uses for this operation?
I think that, if we add such a function then
column!would be a good name. But given we are not sure if we want to deprecatedf[col] = vmaybe we do not need it.
If we don't deprecate df[col] and df[col] = v, maybe we don't need columns either? If that's only for consistency with JuliaDB, we can add this later once we have stabilized our own API.
Do you know what name JuliaDB uses for this operation?
JuliaDB uses setcol(t, col, v) (there's no in-place equivalent). That function also supports a broader interface which is so flexible that it's equivalent to mutate!/transform!. But I don't really like that approach: the name is singular even if you can set multiple columns, it's not consistent with columns, and the function mixes setcol(t, col, v), setcol(t, col => f) and setcol(t, col => f, col2 => f2) which have very different behaviors despite their similarity.
That I why in #1772 I wanted to give a reference implementation so that it is clear that column is essentially getindex with the restriction to allow only a single column. I am OK to revert adding column as in my view it was added only for consistency with Julia DB (until we stabilize API we do not know if it is needed or not).
JuliaDB uses setcol(t, col, v) (there's no in-place equivalent). That function also supports a broader interface which is so flexible that it's equivalent to mutate!/transform!. But I don't really like that approach: the name is singular even if you can set multiple columns, it's not consistent with columns, and the function mixes setcol(t, col, v), setcol(t, col => f) and setcol(t, col => f, col2 => f2) which have very different behaviors despite their similarity.
Yes, I also think that's not ideal. Would a better approach be to deprecate setcol(t, col, v) in favor of setcol(t, col => v) and rename it to mutate? I think originally it only supported setting one column, but then it became more feature-rich to the point that the original name no longer makes so much sense.
From what I understand mutate is the correct term here as it keeps the remaining columns, unlike transform which drops them and is actually a special case of IndexedTables.select (which means I should also rename the JuliaDBMeta macro to @mutate I guess).
Why do you think transform should drop columns? Is that in reference to some implementation? Contrary to "select", nothing in the term "transform" indicates that remaining columns should be dropped (it's essentially a synonym of "mutate" AFAICT).
From this discourse post, but my intuition agrees with yours, which is why I named the macro @transform in JuliaDBMeta. Maybe a better pair of functions would be:
transform (keep columns)select (drop columns)I would close this issue. Is there anything left from it on the table? (maybe the transform function, but if we feel it should be added we should open a separate PR/Issue for this)
So our current policy is that ! functions may either replace or mutate the column vectors, depending on what makes sense: when possible, they mutate the vectors. This has the advantage of being efficient (when introducing !! variants for mutation could have trapped users into using slow functions, especially for push! and append!). See the list I wrote at https://github.com/JuliaData/DataFrames.jl/issues/1695#issuecomment-463204540.
Most helpful comment
From this discourse post, but my intuition agrees with yours, which is why I named the macro
@transformin JuliaDBMeta. Maybe a better pair of functions would be:transform(keep columns)select(drop columns)