Dataframes.jl: More intuitive functions

Created on 19 Sep 2017 · 3 Comments · Source: JuliaData/DataFrames.jl

Hello,
I have been using DataFrames for over a year now and I really like it. My only complaint is that I find myself going back to the documentation more often than I would like. Would it be possible to implement a set of more intuitive functions to wrap all the magic achieved with maps and weird calls such as:

julia> ptable[:Liquid] = map((x,y) -> y - x, ptable[:MP], ptable[:BP])

julia> ptable[ptable[:, :Melt] .< 100, :]

It would be very useful to have functions like find, filter, pivot, addcolumn and so on...

thanks!

bbrunaud

Most helpful comment

I'm sorry that you find the current interface unintuitive and too full of "magic" and "weird calls". It is not designed to be magical or weird at all; it's just designed to reuse as many features of the base Julia language as possible. If you're coming from R or Python (and probably other languages too), you're likely to find the interface confusing because the data frame libraries in those languages (data.frame, data.table, and tibble in R, pandas in Python) implement thorough libraries of functions that call more efficient compiled code under the hood. They require you to learn expansive, package-specific APIs that bypass the core language in order to use the library effectively, which, to play devil's advocate, sounds more magical, less intuitive, and weirder than what's going on in the functions you've shown here. And because languages and packages disagree on what functions are called, there has been an active effort to remove the aspects of the API inspired by those packages: it became clear that what some users found helpful and intuitive, others found painful and overly preferential to the design choices of other languages.

julia> # generate some fake data similar to your example
       using DataFrames

julia> ptable = DataFrame(Melt = 0:100:1000, MP = rand(1:1000, 11), BP = rand(1001:2000, 11))
11×3 DataFrames.DataFrame
│ Row │ Melt │ MP  │ BP   │
├─────┼──────┼─────┼──────┤
│ 1   │ 0    │ 485 │ 1089 │
│ 2   │ 100  │ 929 │ 1968 │
│ 3   │ 200  │ 131 │ 1264 │
│ 4   │ 300  │ 484 │ 1911 │
│ 5   │ 400  │ 286 │ 1528 │
│ 6   │ 500  │ 76  │ 1279 │
│ 7   │ 600  │ 168 │ 1835 │
│ 8   │ 700  │ 595 │ 1885 │
│ 9   │ 800  │ 951 │ 1451 │
│ 10  │ 900  │ 854 │ 1043 │
│ 11  │ 1000 │ 110 │ 1488 │

If you don't like using map, you don't have to.

julia> # what you wrote here
       ptable[:Liquid] = map((x,y) -> y - x, ptable[:MP], ptable[:BP])
11-element Array{Int64,1}:
  604
 1039
 1133
 1427
 1242
 1203
 1667
 1290
  500
  189
 1378

julia> # can be written more concisely with element-wise subtraction, which you may like better
       ptable[:Liquid] = ptable[:BP] .- ptable[:MP]
11-element Array{Int64,1}:
  604
 1039
 1133
 1427
 1242
 1203
 1667
 1290
  500
  189
 1378
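The same equivalence holds for plain vectors, with no DataFrames involved. A minimal sketch in base Julia, using the first three MP and BP values from the table above (the names `mp`, `bp`, `via_map`, and `via_broadcast` are made up for illustration):

```julia
# First three MP and BP values from the example table
mp = [485, 929, 131]
bp = [1089, 1968, 1264]

# map applies the anonymous function to paired elements of both vectors
via_map = map((x, y) -> y - x, mp, bp)

# the dotted operator broadcasts subtraction element by element
via_broadcast = bp .- mp

println(via_map == via_broadcast)  # prints "true"
```

Both forms return a new `Vector{Int64}`; the dotted version just avoids writing the anonymous function.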

There isn't an addcolumn function, but you've shown only one of the three ways I can think of to add a column to a DataFrame:

julia> # Assignment of the new column to the column name you want, as you showed in your example
       ptable[:Liquid] = ptable[:BP] .- ptable[:MP]
11-element Array{Int64,1}:
  604
 1039
 1133
 1427
 1242
 1203
 1667
 1290
  500
  189
 1378

The above method calls setindex! behind the scenes, just as it would if you were setting the values of specific indices in a vector or array in the base Julia language. If you'd prefer to have an explicit function to call, you can call setindex! directly:

julia> setindex!(ptable, ptable[:BP] .- ptable[:MP], :Liquid)
11-element Array{Int64,1}:
  604
 1039
 1133
 1427
 1242
 1203
 1667
 1290
  500
  189
 1378
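To see that this is ordinary base-Julia behavior rather than something specific to DataFrames, here is a small sketch on a plain vector and a plain `Dict` (the variables `v`, `w`, and `cols` are hypothetical and exist only for this illustration). Note the argument order of `setindex!`: the value comes before the index or key.

```julia
# Indexing-assignment syntax on a vector...
v = [10, 20, 30]
v[2] = 99

# ...is the same operation as an explicit setindex! call: (collection, value, index)
w = [10, 20, 30]
setindex!(w, 99, 2)

# The same pattern works for keyed containers such as Dict
cols = Dict{Symbol,Vector{Int}}()
cols[:Liquid] = [604, 1039]                 # assignment syntax
setindex!(cols, [485, 929], :MP)            # explicit call

println(v == w)                             # prints "true"
```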

julia> # you can also use insert! if you'd like to specify an index other than the end
       insert!(ptable, 1, ptable[:BP] .- ptable[:MP], :Liquid_insert)
11×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
│ 2   │ 1039          │ 100  │ 929 │ 1968 │ 1039   │
│ 3   │ 1133          │ 200  │ 131 │ 1264 │ 1133   │
│ 4   │ 1427          │ 300  │ 484 │ 1911 │ 1427   │
│ 5   │ 1242          │ 400  │ 286 │ 1528 │ 1242   │
│ 6   │ 1203          │ 500  │ 76  │ 1279 │ 1203   │
│ 7   │ 1667          │ 600  │ 168 │ 1835 │ 1667   │
│ 8   │ 1290          │ 700  │ 595 │ 1885 │ 1290   │
│ 9   │ 500           │ 800  │ 951 │ 1451 │ 500    │
│ 10  │ 189           │ 900  │ 854 │ 1043 │ 189    │
│ 11  │ 1378          │ 1000 │ 110 │ 1488 │ 1378   │
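For readers on a recent release: the calls above reflect the 2017 API. The sketch below assumes DataFrames.jl 1.x, where columns are assigned with `ptable.Liquid` (or `ptable[!, :Liquid]`) and a column is placed at a given position with `insertcols!`. It uses three fixed rows instead of random data so the result is reproducible, and it would not run on the version used in this thread.

```julia
using DataFrames  # assumes DataFrames.jl 1.x is installed

# Three fixed rows taken from the example table
ptable = DataFrame(Melt = [0, 100, 200],
                   MP   = [485, 929, 131],
                   BP   = [1089, 1968, 1264])

# Add a column at the end by assignment
ptable.Liquid = ptable.BP .- ptable.MP

# Add a column at position 1 with insertcols! (position, then name => values)
insertcols!(ptable, 1, :Liquid_insert => ptable.BP .- ptable.MP)

println(names(ptable))
```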

Subsetting and find

julia> # I'm not sure what is confusing about this
       ptable[ptable[:, :Melt] .< 100, :]
1×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │

julia> # It's creating a boolean vector to subset on, which is also how you'd subset in base Julia
       my_subset = ptable[:, :Melt] .< 100
11-element BitArray{1}:
  true
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false

julia> ptable[my_subset, :]
1×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │

julia> # find already works, and you can use that instead if you'd like
       my_subset = find(melt -> melt < 100, ptable[:, :Melt])
1-element Array{Int64,1}:
 1

julia> # another way
       my_subset = find(ptable[:, :Melt] .< 100)
1-element Array{Int64,1}:
 1

julia> ptable[my_subset, :]
1×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
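The same two subsetting styles can be tried on a plain vector. This sketch assumes Julia 1.0 or later, where `find` was renamed to `findall`; on the Julia version used in this thread the call would be `find`. The names `melt`, `mask`, and `idx` are made up for the example.

```julia
# A plain vector standing in for the Melt column
melt = collect(0:100:1000)

# Style 1: a boolean mask, one true/false entry per element
mask = melt .< 100

# Style 2: the integer positions where the predicate holds
idx = findall(x -> x < 100, melt)

# Both select the same elements
println(melt[mask] == melt[idx])  # prints "true"
```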

It looks like filtering is a problem. I would have expected this to return the same output as above, but it doesn't.

julia> # This isn't actually filtering anything
       filter(row -> row[:Melt] < 100, eachrow(ptable))
Base.Iterators.Filter{##11#12,DataFrames.DFRowIterator{DataFrames.DataFrame}}(#11, DataFrames.DFRowIterator{DataFrames.DataFrame}(11×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
│ 2   │ 1039          │ 100  │ 929 │ 1968 │ 1039   │
│ 3   │ 1133          │ 200  │ 131 │ 1264 │ 1133   │
│ 4   │ 1427          │ 300  │ 484 │ 1911 │ 1427   │
│ 5   │ 1242          │ 400  │ 286 │ 1528 │ 1242   │
│ 6   │ 1203          │ 500  │ 76  │ 1279 │ 1203   │
│ 7   │ 1667          │ 600  │ 168 │ 1835 │ 1667   │
│ 8   │ 1290          │ 700  │ 595 │ 1885 │ 1290   │
│ 9   │ 500           │ 800  │ 951 │ 1451 │ 500    │
│ 10  │ 189           │ 900  │ 854 │ 1043 │ 189    │
│ 11  │ 1378          │ 1000 │ 110 │ 1488 │ 1378   │))
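The `Base.Iterators.Filter` printed above is a lazy wrapper around the predicate and the row iterator. Nothing is evaluated until something iterates over it, which is why the whole table shows up in the output. A minimal base-Julia sketch of that behavior on a plain range (assuming Julia 1.x, with `lazy` and `kept` as illustrative names; it does not check how the DataFrames row iterator from this thread behaves under `collect`):

```julia
# Build a lazy filter: this only stores the predicate and the range
lazy = Iterators.filter(x -> x < 100, 0:100:1000)

# Iterating (here via collect) is what actually applies the predicate
kept = collect(lazy)

println(lazy isa Base.Iterators.Filter)  # prints "true"
println(kept)                            # prints "[0]"
```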

I hope those examples take some of the confusion out of what's going on. If the methods for interacting with DataFrames are still unpalatable in light of that brief explanation, Query implements the LINQ API, which many users prefer to learn and use instead of concepts from the base language. If you think any of the above examples would be helpful to have in the documentation, please open a PR with the changes and we'd be happy to help you get those examples merged.

I'll open a new issue to track the filter problem. If you're unable to achieve the functionality you're looking for with unstack and would like a pivot function, please consider helping to finish this PR: https://github.com/JuliaData/DataFrames.jl/pull/1181. Because the other aspects of this issue are based on preferences rather than functionality, and the intuitiveness of function names varies widely from person to person, I don't expect those to be actionable, but I'll leave this open for others to comment before closing.

cjprybol on 19 Sep 2017

❤2 👍2

All 3 comments

And in addition to the LINQ-like syntax, I recently started to add another API to Query.jl that is more dplyr-inspired: https://discourse.julialang.org/t/query-jl-v0-7x-released/5847. That API is not complete yet, but it is already quite useful.

davidanthoff on 19 Sep 2017

🎉1

Closing then. @bbrunaud Feel free to file new issues if you have specific requests about missing pieces in DataFrames or Query. If you're unsure, you can also ask on Discourse first.