Dataframes.jl: Provide invert from SplitApplyCombine

Created on 22 Aug 2020  Â·  34Comments  Â·  Source: JuliaData/DataFrames.jl

We get a question about splitting single column into multiple columns in a data frame repeatedly. This is essentially invert from SplitApplyCombine.jl. I think it would be good to guide users how to efficiently use invert (which would imply that just docs update is needed), alternatively we can start depending on SplitApplyCombine.jl and re-export some of its functions (as soon in joins probably we will start depending on SplitApplyCombine.jl anyway).

doc question

All 34 comments

Here's the context:

The tidyr library provides two functions - separate and unite. The separate function splits up a single string column by regular expression and turn it into multiple columns. The unite function combines multiple string columns into a single one.

https://tidyr.tidyverse.org/reference/separate.html
https://tidyr.tidyverse.org/reference/unite.html

I don't really care about the invert function itself although it's quite useful once I understand how it works. However, I think it's more important to implement a higher level feature like separate. With Julia, we don't have to restrict to string columns either. It can be a function that takes a value and turn into a tuple. Then the separate function just do the invert magic and turn that into multiple columns.

In fact, I cooked up a sample implementation for these the other day. Thoughts?
https://gist.github.com/tk3369/f87d1de0f2dda8480a370c47c4d85f8b

One issue regarding separate and unite is that I frequently see them used in very hacky implementations of multi-column stack and unstack. Something like this.

So if we implement this with the goal of making stack easier, we should think of a different API. Nonetheless this is a good feature to have.

Couldn't unite and separate be implemented using transform? For separate, we will need to allow returning a named tuple and split it across columns.

Maybe separate is a bit more convenient than that, e.g. to fill with missing values when needed, but I wonder whether we should add this kind of utility function to DataFrames itself or to a companion package.

I do not think we need unite - it is very easy do do now using Base Julia functionalities.

Regarding separate - I think it should be added, but it should be more generic I think:

function separate(column, target_names, splitter=identity)
    tmp = splitter === identity ? tmp : splitter.(column)
    return (;[Symbol(n) => getindex.(tmp, i) for (i, n) in enumerate(target_names)]...)
end

then such a named tuple can be worked with later as needed.

the question is where it should live?

as an idea:

we could add insertcols!(df, Pair{<:AbstractVector{<:SymbolOrString}, <:AbstractVector}) that would add multiple columns at once (assuming that the last of Pair is vector of collections)?

we could add insertcols!(df, Pair{<:AbstractVector{<:SymbolOrString}, <:AbstractVector}) that would add multiple columns at once (assuming that the last of Pair is vector of collections)?

We rely on vectors of pairs instead of pairs of vectors in transform so we should do that here, too.

transform(df, [[:a, :b]] => [[fun1, fun2]]

Isn't valid in favor of

transform(df, [:x1 => fun1, :x2 => fun2])

But insertcols! has a different syntax in general it is column => data right now also [column => data1, column => data2] is disallowed currently.

Anyway - so you would drop it or you have some other proposal for insertcols!?

I woould have insertcols! be vec => name and allow [vec1 => name1, vec2 => name2]

vec => name

it would be inconsistent with DataFrame constructor (and I know we could also change it, but it is too late for that I think - we were aware of this inconsistency unfortunately)

[vec1 => name1, vec2 => name2]

But this is the point of this proposal that this is now what we want, as then you would have to create vec1 and vec2 separately, which is exactly what is problematic.

Ah sorry, I see now.

I wonder if unzip were in Base, would it also work for vectors of vectors? That way we would do ["firstname", "lastname"] .=> unzip(split.(df.fullname))

I like this syntax for functions that output multiple columns. Isn't separate just a special case? E.g.

df_out = transform(df, :incol1, :incol2 => x->separate(x, args...) => [:outcol1, outcol2, outcol3])

Isn't separate just a special case? E.g.

So this is a bit challenging as we can have a collection of columns vs collection of rows to be spread. In insertcols! collection of columns is handled easily by broadcasting and splatting, so collection of rows is left for a special treatment. In the case of transform we are lacking both, so more thought is required.

I think separate should be used with ByRow in select/transform, as it could allow for optimizations like multithreading. Do we really need insertcols! to support this?

i think this feature is very useful because one use-case is the word embedding to vector space for categorical values with huge number of categories such that hot-bit encoding will produce lots of column which is not scalable. also, there are many cases where a table column contains multiple information that needs to be represented as separate features or some multiple columns just represent a single feature to be concatenated as primary key.

Do we really need insertcols! to support this?

No, we do not

I think separate should be used with ByRow in select/transform, as it could allow for optimizations like multithreading

How do you envision this exactly (I mean in terms of syntax)? There are two scenarios:

  1. I have a column containing eg. generators that I just want to separate (i.e. no operation needs to be done)
  2. I have a column that still some fun should be called to do the separation

How about:

transform(df, :fullname=> ByRow(split) => [:firstname, :lastname])

How do we envision supporting the fill functionality as in this prototype?

Do we expect the generator function to return the exact number of output elements as required by the number of target columns? Perhaps we can wrap a function that guarantees that as follows:

transform(df, :fullname=> ByRow(fill(split, 2; fill = :left, fit = :right, unite = *)) => [:firstname, :lastname])

So the fill function wraps the generator function and guarantee the number of output elements to match the provided value. If there are fewer elements, then it fills with missing or the previous value from either side. If there are too many elements, then it either drops the beginning or ending elements, or find a way to re-unite the extra values using the function provided by the unite argument.

can we an option of automatic naming in cases where the columns are huge?

What we discuss here seems to converge with https://github.com/JuliaData/DataFrames.jl/issues/2410.

So here is a design that would fulfill all the needs and is relatively easy to implement (which is important given the complexity we have):

src_cols => fun => dst_cols

and for automatic column naming:

src_cols => fun => AsTable(n)

where n is number of columns to create (and passing n is crucial as then we can statically generate column names before processing fun).

This would solve all the problems and allow a quick implementation of multiple columns in the output.

The design would be:

  • if fun returns DataFrame, NamedTuple, DataFrameRow or AbstractDict then column names must match
  • if fun returns something else then it should be an iterable of a given length

The two decisions to make are then:

  • combine(fun, df/gdf) and combine(cols => fun) currently allow creation of tables without specifying the third argument; do we want to keep it?
  • there is a little problem that if you write cols => fun => col (a single column in the output) or cols => fun (no columns in the output) there are two corner cases if fun is NOT ByRow:

    • if a vector is returned it produces multiple rows (and with cols => fun => cols it would produce multiple columns which could be tricky for the users to understand)

    • if a table-type is returned it is currently an error (a lesser problem)

However, maybe it is enough to have this multiple column creation in ByRow context only? (in which case it is not problematic as ByRow does not do pseudo broadcasting)

  • if fun returns DataFrame, NamedTuple, DataFrameRow or AbstractDict then column names must match
  • if fun returns something else then it should be an iterable of a given length

I understand the design difficulties here, but requiring users to give the number of new columns in advance strikes me as very limiting. The convenience is returning a Table output is so you don't have to worry about the number of columns you have. Say you have a function

function mysumstats(x)
    (mean = mean(x), std = std(x))
end
transform(df, :val => mysumstats => AsTable(2))

Now in order to add, say, median, you have to remember all the places where you apply that transformation and update each AsTable call individually.

OK, this is just a convenience as you say (so we could drop it later).

The major design consideration is what to do with non-ByRow scenario.

I think we can get around this by requiring multiple column returns to be tuples, right?

We could - but this would be a restriction of similar type you have just objected :smile: - for instance your example would not be allowed if I understand things correctly.

also, given a column name with values to be separated, can we have an option that the new columns will carry the source column name together with number sequence like xname_1, xname_2, xname_3..., xname_n. this is to be able to identify that the new features where derived from the original xname column feature.

I would benefit from this functionality.

Do we really need to specify AsTable(n) to indicate that n columns should be generated? If split returns a named tuple, it should work without it, right?

Then one question is whether split applies to a single row or applies to the full column. If it's row-wise (and used via ByRow) as I suggested, the question of whether the compiler is able to infer the return type is essential. Maybe something like transform(df, :x => ByRow(x -> split(x, n=2))) could automatically generate column names and still be inferred correctly by the compiler thanks to inlining? I.e. the 2 would be treated as a constant and the fact that the function returns a named tuple with names :x1 and :x2.

Though I realize that one limitation of the row-wise approach is that the function has no way to know how many columns were generated for the previous rows. So for example it cannot automatically generate more columns when it encounters a row that has more occurrences of the separator than the previous ones. Not sure whether that's really useful: you can always say you expect up to 10 or 100 columns (with automatic filling if needed), and only use the ones that actually contain something. dplyr's separate doesn't support that AFAICT.

This is a special case of #2410, so I would discuss it there.
As for automatic filling of missing rows - this is something that is problematic, as you would have to do it both forward and backward. I would expect that splitting function should be responsible for handling properly the number of columns in the output.

This is done with the new syntax for multiple column output from transformations.

I would like to have this functionality, perhaps enriched with various splitting options, such as those provided by Power Query.
Meanwhile, for the problem which is covered here:
https://www.juliabloggers.com/separating-a-column-into-multiple-columns-in-dataframes-jl/

I would also use a solution of the following type:
insertcols! (df, ([: first,: last]. => collect. (zip (split. (df.name) ...))) ...)

@sprmnt21 - could you please elaborate what you are missing? I have closed this issue as in 0.22 release splitting is supported.

Hi,
I did not realize that the discussion had been closed. I had only read part of the first messages.
I had seen, in the blog from which I started, that it was proposed to use the invert function that is part of a specific package.
I tried to simulate the invert function with the zip function of the base library and I saw that it worked for the purpose.
When I arrived, following the link, at the discussion of implementing a split function in DataFrames, in a post from this discussion, I read that it would be helpful if there was an unzip function in the base library.
At this point I submitted my script, without reading the rest.

I saw in the other message you sent me that the split function is already available in version 0.22.

If I may, I would like to ask if there is a feature of the groupby function that does a grouping using a user-defined function as a comparison criterion.
II would like to know if there is already an option or is it planned to decide whether the grouping should be local or global.
Thanks again for your attention.
Ciao Rocco

----Messaggio originale----

Da: [email protected]

Data: 15-nov-2020 18.41

A: "JuliaData/DataFrames.jl"DataFrames.jl@noreply.github.com

Cc: "sprmnt21"sprmnt21@alice.it, "Mention"mention@noreply.github.com

Ogg: Re: [JuliaData/DataFrames.jl] Provide invert from SplitApplyCombine (#2375)

@sprmnt21 - could you please elaborate what you are missing? I have closed this issue as in 0.22 release splitting is supported.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

If I may, I would like to ask if there is a feature of the groupby function that does a grouping using a user-defined function as a comparison criterion.

No, you have to materialize a column you want to group-by on.

II would like to know if there is already an option or is it planned to decide whether the grouping should be local or global.
Thanks again for your attention.

I am not clear what you mean by "local" and "global" distinction in grouping. Could you please elaborate on this. Thank you!

I saw this option by practicing microsoft's power query system for a few months.
As I am not familiar with other systems, I thought it was a general concept in the context of databases.
(now I would like to try to do using Julia - which I really like - environment, some of the things I did with power query)
The idea is to build groups by comparing only the consecutive rows.
When a row with different key values is found, a new group starts even if a group with those keys already exists.

I'll give an example to be clearer.

code valuea 1a 2a 3b 2b 5a 3a 2

In this case, if you groupby on the first "code" column and use, for example, the sum function on the value column, the groups, with the "local" option set, are the following:
a 6b 7a 5

----Messaggio originale----

Da: [email protected]

Data: 16-nov-2020 10.29

A: "JuliaData/DataFrames.jl"DataFrames.jl@noreply.github.com

Cc: "sprmnt21"sprmnt21@alice.it, "Mention"mention@noreply.github.com

Ogg: Re: [JuliaData/DataFrames.jl] Provide invert from SplitApplyCombine (#2375)

If I may, I would like to ask if there is a feature of the groupby function that does a grouping using a user-defined function as a comparison criterion.

No, you have to materialize a column you want to group-by on.

II would like to know if there is already an option or is it planned to decide whether the grouping should be local or global.
Thanks again for your attention.

I am not clear what you mean by "local" and "global" distinction in grouping. Could you please elaborate on this. Thank you!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

If you want something like this you have to first create a column that has a proper encoding.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

CameronBieganek picture CameronBieganek  Â·  6Comments

davidanthoff picture davidanthoff  Â·  4Comments

bkamins picture bkamins  Â·  7Comments

gustafsson picture gustafsson  Â·  6Comments

jangorecki picture jangorecki  Â·  7Comments