Hello,
I have been using DataFrames for over a year now and I really like it. My only complaint is that I find myself going back to the documentation more often than I would like. Would it be possible to implement a set of more intuitive functions to wrap all the magic achieved with maps and weird calls such as:
julia> ptable[:Liquid] = map((x,y) -> y - x, ptable[:MP], ptable[:BP])
ptable[ptable[:, :Melt] .< 100, :]
It would be very useful to have functions like find, filter, pivot, addcolumn and so on...
thanks!
I'm sorry that you find the current interface unintuitive and too full of "magic" and "weird calls". It is not designed to be magical or weird at all; it's just designed to reuse as many features of the base Julia language as possible. If you're coming from R or Python (and probably other languages too), you're likely to find the interface confusing because the data frame libraries in those languages (data.frame, data.table, and tibble in R, pandas in Python) implement thorough libraries of functions that call more efficient compiled code under the hood. They require you to learn expansive, package-specific APIs that bypass the core language in order to use the library effectively, which, to play devil's advocate, sounds more magical, less intuitive, and weirder than what's going on in the calls you've shown here. And because languages and packages are inconsistent about what functions are called, there has been an active effort to remove aspects of the API inspired by those packages: it became clear that what some users found helpful and intuitive, others found painful and overly preferential to the design choices of other languages.
julia> # generate some fake data similar to your example
using DataFrames
julia> ptable = DataFrame(Melt = 0:100:1000, MP = rand(1:1000, 11), BP = rand(1001:2000, 11))
11×3 DataFrames.DataFrame
│ Row │ Melt │ MP  │ BP   │
├─────┼──────┼─────┼──────┤
│ 1   │ 0    │ 485 │ 1089 │
│ 2   │ 100  │ 929 │ 1968 │
│ 3   │ 200  │ 131 │ 1264 │
│ 4   │ 300  │ 484 │ 1911 │
│ 5   │ 400  │ 286 │ 1528 │
│ 6   │ 500  │ 76  │ 1279 │
│ 7   │ 600  │ 168 │ 1835 │
│ 8   │ 700  │ 595 │ 1885 │
│ 9   │ 800  │ 951 │ 1451 │
│ 10  │ 900  │ 854 │ 1043 │
│ 11  │ 1000 │ 110 │ 1488 │
If you don't like using map, you don't have to.
julia> # what you wrote here
ptable[:Liquid] = map((x,y) -> y - x, ptable[:MP], ptable[:BP])
11-element Array{Int64,1}:
604
1039
1133
1427
1242
1203
1667
1290
500
189
1378
julia> # can be written more concisely with element-wise subtraction, which you may like better
ptable[:Liquid] = ptable[:BP] .- ptable[:MP]
11-element Array{Int64,1}:
604
1039
1133
1427
1242
1203
1667
1290
500
189
1378
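The equivalence of the two spellings above does not depend on DataFrames. As a minimal sketch on plain vectors (the names `mp` and `bp` are illustrative stand-ins for the first three rows of the MP and BP columns), `map` with a two-argument function and broadcast subtraction should give the same result:

```julia
# Plain vectors standing in for the first three rows of the MP and BP columns.
mp = [485, 929, 131]
bp = [1089, 1968, 1264]

# map applies the anonymous function to each pair of elements.
via_map = map((x, y) -> y - x, mp, bp)

# The dotted operator broadcasts subtraction element by element.
via_broadcast = bp .- mp

println(via_map == via_broadcast)  # true
println(via_broadcast)             # [604, 1039, 1133]
```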
There isn't an addcolumn function, but you've shown only 1 of the 3 ways I can think of for adding a column to a DataFrame:
julia> # Assignment of the new column to the column name you want, as you showed in your example
ptable[:Liquid] = ptable[:BP] .- ptable[:MP]
11-element Array{Int64,1}:
604
1039
1133
1427
1242
1203
1667
1290
500
189
1378
The above method calls setindex! behind the scenes, just as it would if you were setting the values of specific indices in a vector or array in the base Julia language. If you'd prefer to have an explicit function to call, you can call setindex! directly:
julia> setindex!(ptable, ptable[:BP] .- ptable[:MP], :Liquid)
11-element Array{Int64,1}:
604
1039
1133
1427
1242
1203
1667
1290
500
189
1378
julia> # you can also use insert! if you'd like to specify an index other than the end
insert!(ptable, 1, ptable[:BP] .- ptable[:MP], :Liquid_insert)
11×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
│ 2   │ 1039          │ 100  │ 929 │ 1968 │ 1039   │
│ 3   │ 1133          │ 200  │ 131 │ 1264 │ 1133   │
│ 4   │ 1427          │ 300  │ 484 │ 1911 │ 1427   │
│ 5   │ 1242          │ 400  │ 286 │ 1528 │ 1242   │
│ 6   │ 1203          │ 500  │ 76  │ 1279 │ 1203   │
│ 7   │ 1667          │ 600  │ 168 │ 1835 │ 1667   │
│ 8   │ 1290          │ 700  │ 595 │ 1885 │ 1290   │
│ 9   │ 500           │ 800  │ 951 │ 1451 │ 500    │
│ 10  │ 189           │ 900  │ 854 │ 1043 │ 189    │
│ 11  │ 1378          │ 1000 │ 110 │ 1488 │ 1378   │
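To make the "just like base Julia" point concrete, here is a small sketch using only a plain Vector (no DataFrames; the variable names are illustrative). It assumes nothing beyond Base: indexed assignment and an explicit setindex! call are two spellings of the same operation, and insert! on a vector likewise takes a position and a value.

```julia
# Two identical starting vectors.
v = [10, 20, 30]
w = [10, 20, 30]

# Indexed assignment syntax...
v[2] = 99

# ...is equivalent to calling setindex! with the value first, then the index.
setindex!(w, 99, 2)

println(v == w)  # true

# insert! puts a value at a chosen position and shifts later elements right.
insert!(w, 1, 5)
println(w)       # [5, 10, 99, 30]
```

The argument order differs between the vector methods and the DataFrame methods shown above, so this is an analogy for the mechanism rather than a drop-in recipe.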
Subsetting and find
julia> # I'm not sure what is confusing about this
ptable[ptable[:, :Melt] .< 100, :]
1×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
julia> # It's creating a boolean vector to subset on, which is also how you'd subset in base Julia
my_subset = ptable[:, :Melt] .< 100
11-element BitArray{1}:
true
false
false
false
false
false
false
false
false
false
false
julia> ptable[my_subset, :]
1×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
julia> # find already works, and you can use that instead if you'd like
my_subset = find(melt -> melt < 100, ptable[:, :Melt])
1-element Array{Int64,1}:
1
julia> # another way
my_subset = find(ptable[:, :Melt] .< 100)
1-element Array{Int64,1}:
1
julia> ptable[my_subset, :]
1×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
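The same two subsetting styles can be tried on a plain vector. This is a sketch under one assumption: it targets Julia 1.0 or later, where the function spelled find in the session above is named findall. The vector `melt` is an illustrative stand-in for the Melt column.

```julia
# Stand-in for the Melt column: 0, 100, ..., 1000.
melt = collect(0:100:1000)

# Style 1: a boolean mask, one true/false entry per element.
mask = melt .< 100

# Style 2: integer positions of the matching elements.
# (findall in Julia >= 1.0; the thread's session used the older name find.)
idx_from_mask = findall(mask)
idx_from_pred = findall(m -> m < 100, melt)

# Both styles select the same elements when used as an index.
println(melt[mask])           # [0]
println(melt[idx_from_mask])  # [0]
println(idx_from_pred)        # [1]
```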
It looks like filtering is a problem. I would have expected this to return the same output as above, but it doesn't.
julia> # This isn't actually filtering anything
filter(row -> row[:Melt] < 100, eachrow(ptable))
Base.Iterators.Filter{##11#12,DataFrames.DFRowIterator{DataFrames.DataFrame}}(#11, DataFrames.DFRowIterator{DataFrames.DataFrame}(11×5 DataFrames.DataFrame
│ Row │ Liquid_insert │ Melt │ MP  │ BP   │ Liquid │
├─────┼───────────────┼──────┼─────┼──────┼────────┤
│ 1   │ 604           │ 0    │ 485 │ 1089 │ 604    │
│ 2   │ 1039          │ 100  │ 929 │ 1968 │ 1039   │
│ 3   │ 1133          │ 200  │ 131 │ 1264 │ 1133   │
│ 4   │ 1427          │ 300  │ 484 │ 1911 │ 1427   │
│ 5   │ 1242          │ 400  │ 286 │ 1528 │ 1242   │
│ 6   │ 1203          │ 500  │ 76  │ 1279 │ 1203   │
│ 7   │ 1667          │ 600  │ 168 │ 1835 │ 1667   │
│ 8   │ 1290          │ 700  │ 595 │ 1885 │ 1290   │
│ 9   │ 500           │ 800  │ 951 │ 1451 │ 500    │
│ 10  │ 189           │ 900  │ 854 │ 1043 │ 189    │
│ 11  │ 1378          │ 1000 │ 110 │ 1488 │ 1378   │))
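The output above is a lazy Base.Iterators.Filter wrapper rather than a filtered result, which is why nothing appears to have been removed. A sketch of that distinction on a plain vector, assuming Julia 1.0 or later and using an illustrative `melt` vector in place of the DataFrame rows:

```julia
# Stand-in for the Melt column: 0, 100, ..., 1000.
melt = collect(0:100:1000)

# Iterators.filter only wraps the predicate and the source; nothing is evaluated yet.
lazy = Iterators.filter(m -> m < 100, melt)
println(lazy isa AbstractVector)  # false

# collect walks the lazy iterator and materializes the matching elements.
kept = collect(lazy)
println(kept)                     # [0]

# On a Vector, Base.filter is eager and returns the filtered vector directly.
eager = filter(m -> m < 100, melt)
println(eager == kept)            # true
```

This only illustrates lazy versus eager filtering in Base; it does not show how filter behaves on a DataFrame or its row iterator in any particular DataFrames release.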
I hope those examples take some of the confusion out of what's going on. If the methods for interacting with DataFrames are still unpalatable in light of that brief explanation, Query implements the LINQ API which many users prefer to learn and use instead of concepts from the base language. If you think any of the above examples would be helpful to have in the documentation, please open a PR with the changes and we'd be happy to help you get those examples merged.
I'll open a new issue to track the filter issue. If you're unable to achieve the functionality you're looking for with unstack and would like a pivot function, please consider helping to finish this PR: https://github.com/JuliaData/DataFrames.jl/pull/1181. Because the other aspects of this issue are based on preferences rather than functionality, and the intuitiveness of function names varies widely from person to person, I don't expect those to be actionable, but I'll leave this open for others to comment before closing.
And in addition to the LINQ-like syntax, I recently started to add another API to Query.jl that is more dplyr-inspired: https://discourse.julialang.org/t/query-jl-v0-7x-released/5847. That API is not complete yet, but it is already quite useful.
Closing then. @bbrunaud Feel free to file new issues if you have specific requests about missing pieces in DataFrames or Query. If you're unsure, you can also ask on Discourse first.