I'd like to have a single issue that identifies every issue that's blocking a release of DataFrames that replaces DataArrays.
At a high-level, my understanding is the following:
NullableArrays.jl could use some love before the new release -- either https://github.com/JuliaStats/NullableArrays.jl/issues/148 or some subset of it.
EDIT: Though I suppose this isn't strictly blocking, it would be good to settle where lifted operators should live before making public any APIs that depend on them.
Would this be a good place to link to issues with other packages that rely on DataFrames or should that be handled in a separate issue?
@IainNZ -- I saw you wrote a tool to look at package dependencies. Would you mind running it on DataFrames and DataArrays so we can look for actively-maintained packages that might be affected? (If there's a way I should be able to do this let me know). Thanks -A
Thanks for your help getting the word out on the changes, @amellnik!
If I'm not mistaken, Pkg.dependents("DataFrames") can be used to get a list of packages that currently rely on DataFrames. And I think Jacob Quinn had requested that Tony Kelman run the package evaluator on the ecosystem but with DataFrames on master rather than release to see what catches fire. That will be immensely helpful.
Ah thanks Alex -- I didn't know about that Pkg command. I'll start looking through the list and trying to submit PRs for the low-hanging fruit now.
If I'm not mistaken, Pkg.dependents("DataFrames") can be used to get a list of packages that currently rely on DataFrames. And I think Jacob Quinn had requested that Tony Kelman run the package evaluator on the ecosystem but with DataFrames on master rather than release to see what catches fire. That will be immensely helpful.
Only for packages that you have installed though, right?
I think it scans your local copy of METADATA so it should find all dependents if METADATA is reasonably up to date.
You can also use JuliaPkgDb and do select * from versions where dependency = 'DataFrames';
It won't look at test dependencies though. Those are worth checking as well. I don't recall being asked explicitly to run pkgeval, I was asked how one would go about doing it which I answered.
Hey @tkelman, would you be willing to explicitly run PkgEval? :slightly_smiling_face:
latest master here then? anyone have a particular metadata fork+branch with a provisional test tag? skip the git tag and pr part of it for now.
latest master here then?
Yep.
anyone have a particular metadata fork+branch with a provisional test tag?
I don't but I can make one (much) later today if no one has beaten me to it.
Does this look right, @tkelman? https://github.com/JuliaLang/METADATA.jl/pull/6735
May want to rebase that right before I fire off the run. Been sick the last few days, bug me again if I haven't run this by Monday.
Looks like you had "enable edits" checked, so I did the rebase and fired off PkgEval. Hopefully it'll work, check https://github.com/JuliaCI/pkg.julialang.org/branches in 10-12ish hours to see if it creates a new "dataframes" branch with the results.
I should have run that differently since this version doesn't support Julia 0.4 and there's a bunch of unrelated SSL certificate failures on 0.6-dev right now, but focus on the 0.5 pulse changes here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/JuliaCI/pkg.julialang.org/dataframes/pulse.html
Many of these, but not all, look like packages just making the now-wrong assumption that depending on DataFrames is enough to also ensure DataArrays will be installed. The easiest short term fix for those cases would be adding DataArrays to each of these packages' REQUIRE files (could even be done directly in METADATA without having to wait for a new tag), though there could easily still be remaining failures after doing that.
Publish a release of DataFrames that uses deprecations or some other mechanism to push people towards our new API's
What exactly is the plan here? We are not going to say that some query framework is the _only_ way to interact with a DataFrame, right? I had always assumed that Query.jl and StructuredQueries will add another API to interact with DataFrames, but not replace existing APIs. Is that not the plan?
I don't think we're going to remove the current API (at least not immediately, and not all of it), but even with lifted operators it's significantly more cumbersome to use after the port to Nullable. So I'm not sure anybody will want to use it anymore. (Yes, evil plan... :-)
Just as an aside and follow up to @ararslan's post on how you can query the package database with his JuliaPkgDb repo:
I took his code and turned it into a data source that can be queried with Query.jl. For now you need the code here, but I'll probably move that into Query.jl at some point. His example could then be written like this in pure julia terms, without savings things into a SQL database first:
using DataFrames, Query
include("packagedb.jl")
db = pkgdb()
@from i in pkgdb() begin
@from j in i.versions
@from k in j.requires
@where k.name=="DataFrames"
@select {i.name, j.version}
@collect DataFrame
end
Should be fairly simple to also read information about test requirements and add them to the mix.
@ararslan Let me know if you have any thoughts about this code, but maybe here davidanthoff/Query.jl#51. I just posted this info in the current issue in case it might be helpful for going through packages, any follow up discussion about the code we should probably have somewhere else.
@davidanthoff That's neat, thanks for sharing. 馃檪
I may have been premature in opening issues related to this change in other packages -- except in trivial cases like Mustache.jl, I don't anticipate any progress on those until we have a clear strategy for queries and lifting/conversion.
If we do go with either StructuredQueries or Queries, I think we also need to make sure there's still a way to do very simple operations inline.
I'd just like to provide some feedback here in regard to the query methods.
I personally rather dislike Query.jl (no offense to the developers of course); though it's cleverly done, it involves a rather elaborate interface which gets too far away from the normal structure of the language. To me, the great thing about pandas is that it is NOT like SQL, but is instead normal python code, although it can certainly be simpler in a number of cases, as it naturally would be in Julia.
By contrast, I really quite like DataFramesMeta.jl. It's beautifully elegant and (aesthetically) efficient, and fits very comfortably and uniformly into the existing style and structure of the Julia language.
To me, using macros to splice SQL into Julia doesn't make sense. Julia is a wonderful, incredibly well-thought-out language (things that I would never say about SQL), any query API for dataframes should take advantage of this, and embrace all the things that make Julia so much better than SQL. Furthermore, a query package is much more suitable for general, custom use if it doesn't hide the underlying structure of the dataframes or the queries themselves.
This is just my opinion of course, hopefully a few others will agree with me. Of course I'm not trying to discourage work on Query.jl, those who prefer that style should go with it, and certainly there will be some cases where it is much easier to use that DataFramesMeta.jl, I'm just hoping it won't become a paradigm. It's likely that at some point I'll start working on DataFramesMeta.jl, or something like it.
Both Query.jl and StructuredQueries.jl are new. As such, this is a good time to request features and describe use cases. For example, David has started coding some shortcuts in Query.jl that make it look more like DataFramesMeta and pandas. Feedback on that might be helpful. Unfortunately, I don't have a link handy where he showed an example.
@ExpandingMan I totally agree that something like Query.jl shouldn't become the only way to interact with DataFrames. It is really powerful is you want to do something complicated, but for the small, simple manipulations is is overkill and too verbose. I've started to add another alternative user-facing API that is a little less heavy weight, and the underlying architecture is flexible enough to potentially allow additions of more alternative user facing APIs (e.g. something that looks more like dplyr could be added as another option). But that is all future talk.
The more I think about it, the more convinced I am that the current transition plan for DataFrames to Nullable is a major mistake. Giving completely up on trying to support things like df[:a] = log.(df[:b]) seems a major step back to me. That is actually a _really_ good API for doing such a simple manipulation from the REPL. This is not a "I told you so" comment, I was actually strongly in favor of merging the NullableArrays PR into DataFrames. But I changed my mind lately, I think completely breaking the existing API is not a good strategy and we should have come up with a way to preserve the old API.
@ExpandingMan and @tshort Here is an issue discussing this new API davidanthoff/Query.jl#72. Feedback welcome!
Was there ever any real choice about transitioning to Nullable? Type stability is crucial for performance (and probably also stability) reasons, period. I don't see what the alternative would be. I don't know, I think with a good enough API most users will be happy with it, but it may take a lot of work to get there.
For example, df[:a] = log(df[:b]) should work out-of-the-box, because all basic math functions should be overloaded for NullableArray and eventually even Nullable (so that df[:a] = log.(df[:b]) would work). There should also be a macro like df[:a] = @lift userfunc.(df[:b]).
I think implementations of basic math functions for Nullable is one of the things that is missing right now that would make this process seem much less painful. If there were enough functionality already built into the language for Nullable this wouldn't seem like such a huge change.
I know it'll be a lot of work, and I haven't been involved in this, so I don't want to sound like a back-seat-driver, but it seems to me like you guys made the right decision.
I think there is a zero chance that we will be allowed to add a lifted version of things like log to base. Adding it in a package is type piracy, and we are also not allowed to do that (rightfully so!). If I'm right, then there is no chance that log.(df[:b]) will ever work, short of some major language level lifting support (which I also don't see happening any time soon). See also my general skepticism re a general lifting strategy here davidanthoff/Query.jl#71.
If we don't want to give up on this kind of API, I think the only way to get there is to use a different type for missing values, see JuliaData/Roadmap.jl#3. I've actually migrated Query.jl to that kind of design, and everything works just fine: I get type stability, I can provide lifted versions of all the functions I like etc.
I've now read many of these issues, and I am starting to see the point about why lifted methods should not be defined for Nullable.
First, let me just say that I still think type-stability is absolutely crucial. Even if performance were the only issue (and it isn't), there seems to be reason enough not to use something like NA. If you are going to write code specifically to handle NAs, you might as well write code specifically to handle Nullable.
Here's how I see it: there are two versions of null values which are currently in widespread use which should serve as pototypes: C++'s nullptr and IEEE floating point NaN. The default behavior for nullptr is not to propagate, but instead to throw errors. The default behavior of NaN is to propagate. For data purposes, we want a nullable type which behaves like NaN.
I think I agree with those who are saying that we need another type which resembles Nullable, but is specifically designed for cases which resemble NaN.
There's two other things I'd like to point out about lifting strategies:
f(df[:a]) is to return Nullable values, it doesn't necessarily follow that, for example Nullable(a) == Nullable(b) should return a Nullable{Bool} and not a Bool. In this case one doesn't necessarily expect a return value in the same space of a and b. It is a case in which it seems perfectly acceptable for it to return a completely different type. I don't really think that choosing a convention for this sort of thing is a huge obstacle for lifting schemes. Again, following the example of NaN seems like a reasonable choice. false == (NaN == NaN) has been true since its inception, and it works fine.@where(df, x .> a) behave somewhat differently than [x for x in df[:x] if x > a] doesn't seem like a big problem to me. The whole idea of having special query functions is for them to behave somewhat differently, and I think that it is entirely transparent to the user that this is the case. If it weren't, the user would probably have avoided using the query methods in the first place.df[:a] = @lift f.(df[:b]) is such a bad thing.Also, let's try not to lose sight of something: at the end of the day, when what's being stored in a dataframe is ultimately fed into a machine learning method, integrator, whatever, the null values _must_ be dealt with. I still haven't entirely given up on the idea of having a new type down-graph from AbstractDataFrame which has no null values, and uses ordinary Arrays for storage, because this will always be useful. The mathematical heavy-lifting, so to speak, is going to get done on good ol' floats, and I don't see that ever changing. I think that's reason for optimism.
I vote to reign this thread back to it's original purpose, which is to plan for the release of the new DataFrames.jl master. The foregoing comments suggest that we might want to offer at least a minimal native API that doesn't rely on a query framework. I think we can accommodate as much. Higher-order functions such as map, reduce and broadcast are easy enough to treat fairly generally, e.g.
julia> using NullableArrays
julia> immutable Lifted{F}
f::F
end
julia> lift(f) = Lifted(f)
lift (generic function with 1 method)
julia> @inline function (_f::Lifted{F}){F}(x)
f = _f.f
if Base.null_safe_op(f, typeof(x))
return Nullable(f(x.value), !isnull(x))
else
U = Core.Inference.return_type(f, Tuple{eltype(typeof(x))})
if isnull(x)
return Nullable{U}()
else
return Nullable(f(unsafe_get(x)))
end
end
end
julia> Base.broadcast(f::Function, X::NullableArray) = broadcast(Lifted(f), X)
julia> digamma.(NullableArray(rand(5), rand(Bool, 5)))
5-element Array{Nullable{Float64},1}:
#NULL
#NULL
#NULL
-1.6254
#NULL
So perhaps we may as well offer those, even if I would encourage a query framework over their use in general. In any case, we should make very clear which APIs will work, which will not, and why we're moving away from the one's that we're breaking. I can see a compromise where we deprecate the old API by including a _minimal_ set of lifted methods, to ease the transition.
I think that's a great plan/idea.
It also occurred to me that, due to potential reach of such a change, we could provide a limited-time "upgrade message" that is printed in the DataFrames __init__() function when the package is first loaded. It could spell out that the package has gone through a big internals change and point out a few key APIs that may not work like before and offer alternatives. We could also point out that the user can do Pkg.checkout("DataFrames", "0.8.3") (or whatever the current version is) with a Pkg.pin("DataFrames") to ensure they stay on the legacy DataFrames code that will still work with any scripts/code they have written against. As long as we make sure existing DataFrame dependency packages also have proper upper bounds, this also ensures their package environment should be stable.
Just to be clear, @davidagold's suggestion would have the following consequences:
isnull.(df[:a])
would not work in the way anyone would expect it to work. That case could probably be fixed by something more elaborate, but I believe nothing could be done about the following case, which would also not work as one would expect:
foo(x) = isnull(x) ? 0 : x
foo.(df[:a])
I seem to be in the minority on this, but I think both cases are really bad. The dot syntax is conceptually really simple, except that with this it all of a sudden it gets really complicated. I actually would much prefer if the difficulties with the DataFrame API didn't end up making the dot syntax more unpredictable...
I'm pretty sure that David's solution allows you to provide a custom definition for how isnull works as an argument to higher-order functions.
Yes, isnull.(X) could be made to work easily. But David Anthoff is right that
f(x) = isnull(x) ? foo : bar
f.(X)
cannot. Even broadcast(x -> isnull(x) ? foo : bar, X) cannot, though @select isnull(a) ? foo : bar can.
I agree that such inconsistencies with the behavior for arrays is not ideal. Neither lifting approach is ideal. By any means. What would be ideal (from my limited perspective) would be to have two versions of Nullable{T} -- one that propagates (A) and one that doesn't (B) -- and then have any function f called on an argument signature xs that contains any A arguments automatically lowered to lift(f, xs...). I have no idea how difficult this would be to implement, but I also can't think of a problem that this doesn't solve. I'd like to hear @JeffBezanson 's take on whether or not this is possible/feasible. If it is, then we should view whatever lifting strategy we adopt presently as a stop-gap until somebody with the know-how has the time to work on it. Even if that takes a year, or two years, I think it would be worth it. In the meantime, things won't be perfect.
In terms of timeline, I think we should be explicit about what we're thinking here and commit to something publicly. It gives the best chance for others to be aware and make changes and it also helps this package keep moving forward without being in this "limbo" state for too long (i.e. the difference between the latest release tag and master).
Of course, I'd say the bulk of what needs to happen before a new release depends on non-DataFrames things: NullableArarys, Nullable operations in general, StructuredQueries.jl/Query.jl, etc. I'd like to propose (and include in John's original post at the top), that we aim to have the bulk of the non-DataFrames package work done by the first of January. That gives us over two months to help update the ecosystem and iron out any remaining design decisions, as well as allowing StructuredQueries/Query to mature. We can think of the first of January as being a the moment we want to aim for a "release-candidate" of sorts, allowing a few weeks of final bugs to sort out before the final public release and announcement. We should obviously try to come up with some kind of announcement that can be released soonish letting everyone know about the the plan.
Another point in favor of @davidagold 's comment above:
Metaprogramming tools are unbelievably complicated unless we have a null type that propagates. That seems to me like the only good solution to that particular problem. I've now spent some time looking at DataFramesMeta with the intention of making it work for NullableArrays. The task of making it work at the same level of generality with any lifting strategy is quite daunting, even if it would be very simple for a few specific use cases.
@ExpandingMan so you're aware, I've been working on a query representation framework that automatically transforms user-specified calls f(xs...) to lift(f, xs...), for generic null propagation, with lift defined here. It's still under heavy development. But maybe the general strategy will be useful for whatever it is you want to do.
EDIT: That's by way of saying, the metaprogramming involved in a generic, macro-based lifting approach is not intractable. That's good, because I highly doubt that the solution I proposed above will happen any time soon. (It certainly won't happen in the form I suggested above, in which the transformation is conditional upon the types of the arguments.)
If I may add something to the API discussion, as someone who wasn't involved in its development: I'm currently getting a DataFrame as a result of an SQL query. I would rather not be forced to use another Query API to get my values out of that DataFrame a second time. And I totally agree that imitating SQL in Julia would be a step backwards.
The Nullable-based backend has been moved to the DataTables package. I'll close this issue here; further discussion should take place in an issue in DataTables. If you continue this particular discussion, please include a link to this issue for the sake of continuity.
Thanks, everyone!
Most helpful comment
I vote to reign this thread back to it's original purpose, which is to plan for the release of the new DataFrames.jl master. The foregoing comments suggest that we might want to offer at least a minimal native API that doesn't rely on a query framework. I think we can accommodate as much. Higher-order functions such as
map,reduceandbroadcastare easy enough to treat fairly generally, e.g.So perhaps we may as well offer those, even if I would encourage a query framework over their use in general. In any case, we should make very clear which APIs will work, which will not, and why we're moving away from the one's that we're breaking. I can see a compromise where we deprecate the old API by including a _minimal_ set of lifted methods, to ease the transition.