DataFrames.jl: DataFrames seems a little bloated with modeling-specific functionality

Created on 21 Jul 2016  ·  49 Comments  ·  Source: JuliaData/DataFrames.jl

There's lots of functionality here that's not specific to data frames, like the formula language, contrast coding (if #870 is merged), model matrix construction, etc. Is it time to refactor the modeling-oriented bits out into one or more JuliaStats packages? I'd suggest something like

  • Formulas: model specification DSL
  • Contrasts: converting categorical data into numerical matrices for modeling, in a way that's agnostic to the underlying data type (e.g., PooledDataArray, CategoricalArray, etc.)
  • DataFramesModels: ModelFrame, ModelMatrix, DataFrameRegressionModel, etc. types.

Alternatively, we could just have one big DataFramesModels package that has all the modeling stuff that's currently in DataFrames.jl. It's not immediately clear to me how to cleanly separate the ModelMatrix and ModelFrame logic from the formula and contrasts logic, but that might be doable.

decision


All 49 comments

Let's move these to StatsBase?

Then wouldn't StatsBase have a dependency on DataFrames? That seems a little weird to me; IMO StatsBase should be agnostic of how the underlying data is stored. Unless I'm misunderstanding you, @nalimilan.

(Welcome back from vacation, by the way! ☀️)

What we should really do is build the Formulas, Contrasts, and Models stuff to all be based on an AbstractDataFrame type (or the as-yet-unannounced AbstractTable).

With all the "AbstractTable" code able to live on its own, it makes things like Formulas/Models much easier to split out because they only have a dependency on the AbstractTable (small, simple definitions of the Table interface) instead of all DataFrames (which would be a full implementation of the AbstractTable interface).

So if someone has data stored in an actual matrix, as in Array{Whatever,2}, they'll have to convert their data to a table type for use in modeling? Hm, I guess that does make sense, otherwise you don't have a clear way of referring to specific columns in the data for modeling purposes. (Well, you _do_; you have the column's position. But specifying a model in terms of matrix column positions sounds like a disaster.)

Btw @quinnj, I'd love to help out with JuliaData stuff. 😄

If someone already has data stored in a Matrix they don't need any of the formulas etc. stuff to use in modeling. The essential function of the modeling bits of DataFrames.jl is to convert data in a tabular form to a matrix suitable for, e.g. regression.

(please pardon my thumb-typing)
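
As a rough illustration of the transformation described in this comment, the sketch below builds a regression-ready matrix by hand from a toy table. The table, the column names, and the `dummy_columns` helper are all made up for the example; this is not the DataFrames.jl or StatsModels.jl implementation.

```julia
# Hypothetical toy table: a NamedTuple of equal-length column vectors.
table = (y = [1.0, 2.0, 3.0, 4.0],
         x = [0.5, 1.5, 2.5, 3.5],
         g = ["a", "b", "a", "c"])

# Dummy-code a categorical column: one indicator column per non-reference level.
# (Hypothetical helper, not part of any package.)
function dummy_columns(col)
    levels = sort(unique(col))
    # The first level is the reference and gets no column of its own.
    return [Float64(v == l) for v in col, l in levels[2:end]]
end

# Assemble intercept, numeric predictor, and indicator columns into one matrix.
X = hcat(ones(length(table.x)), table.x, dummy_columns(table.g))
```

Here `X` has one row per observation and one column each for the intercept, `x`, and the non-reference levels `"b"` and `"c"` of `g`. A user who already holds such a matrix has no need for the formula machinery.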


@quinnj I agree. I think these packages should provide a backend-agnostic modeling API.

Really glad to see people trying to think about this from a data-source agnostic approach. This is probably obvious, but moving to a backend-agnostic API will eventually create some tensions between categorical arrays and working with data sources that have no representation of categorical data (which covers most SQL databases in their practical use in my experience).

Bumping this discussion now that #870 is merged and since interest in revising the interface for tabular data types seems to be at a high.

EDIT: I'm still working through modeling logic and don't have strong opinions yet, but my initial sense is that it would be handy to have things break down according to a StatsBase, AbstractTables, DataFrames, StatsModels package structure, where the latter includes ModelMatrix, Formula and contrasts logic.

@tbreloff is working on some general learning stuff in JuliaML. Pinging him here in the event he has any opinions on this, since the goal is to have abstractions that subsume both conventional stats and machine learning.

+1 to slimming down DataFrames in a major way. -1000000 to adding all the modeling stuff into StatsBase.

I really like the AbstractTable concept, and I hope everyone can start getting behind @quinnj's efforts there. DataFrames should be just one implementation of an AbstractTable, and all that modeling stuff should be based on AbstractTable (and it should be in a separate package).

@tbreloff is working on some general learning stuff in JuliaML.

I'm not the only one!! But yes we're actively working on experimental designs for general learning tools. If the project is a success I'll try to convince everyone to switch their workflow, but until then, carry on.

I think the modeling stuff that's currently in DataFrames is best thought of as transforming tabular data into a matrix-like format that's suitable for ingesting into models. As such, building it as a separate package that depends on an AbstractTable interface seems like the right way to go. @quinnj, have you made any progress on defining such an interface (even just for data frames)?

The AbstractTable progress has been coming along, though somewhat informally. I'm almost done with another round of updates to the DataStreams framework, where Sources and Sinks are actually decoupled; to avoid the combinatorial explosion of required Data.stream! methods for new Sources/Sinks. Once the update is done, a new Source/Sink will just have to "register" the kind of streaming it supports (row/field-based, and/or column-based) and it will get the rest of the DataStreams ecosystem for free (and fast!). Right now, this interface work is happening in DataStreams, but my plan is to move some of that abstraction into the AbstractTables.jl package that could then be depended on downstream (and improved upon).
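
The decoupling idea can be sketched in a few lines. Every name below (`VectorSource`, `ColumnSink`, `produce_rows`, `accept_row!`, `stream!`) is hypothetical and chosen only for illustration; this is not the DataStreams API. The point is that each source and each sink implements one small method, and a single generic function connects them, so N sources and M sinks need N + M methods rather than N × M.

```julia
# All names here are hypothetical; this is not the DataStreams API.

# A source only declares how it yields rows.
struct VectorSource
    rows::Vector
end
produce_rows(s::VectorSource) = s.rows

# A sink only declares how it accepts a single row.
struct ColumnSink
    x::Vector{Float64}
    g::Vector{String}
end
function accept_row!(s::ColumnSink, row)
    push!(s.x, row.x)
    push!(s.g, row.g)
    return s
end

# One generic method then connects any such source to any such sink.
function stream!(source, sink)
    for row in produce_rows(source)
        accept_row!(sink, row)
    end
    return sink
end

sink = stream!(VectorSource([(x = 0.5, g = "a"), (x = 1.5, g = "b")]),
               ColumnSink(Float64[], String[]))
```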

Once the update is done, a new Source/Sink will just have to "register" the kind of streaming it supports (row/field-based, and/or column-based) and it will get the rest of the DataStreams ecosystem for free (and fast!).

GitHub needs a "yaaaas" reaction.

So what would this transformation mean to an out of core db? Chunks? Dagger lazyarray?

Out of curiosity, @quinnj, have you done any benchmarks? Would Julia be up there with Go for this kind of ETL stuff?

I think the first priority should be to port the existing modeling stuff in DataFrames to the DataStreams/AbstractTable interface, but still producing an in-memory model matrix. Ultimately I think it should be possible to create a streaming model matrix replacement (e.g., a Source/Sink) that can stream rows/columns/chunks (including out of core). But that depends first on decoupling the formulas/modelmatrix stuff from DataFrame.

Is the idea that ModelMatrix, or something like it, would be a viable type of Sink, with the logic for filling it in encoded in a Formula? And the creation of a ModelMatrix would fall back on Data.Stream?


Yes, I think that's one possibility, especially since, as far as I know, DataFrames can be seen as a Source. The central logic for constructing a model matrix at the moment involves iterating over columns of the dataframe (which is one of the streaming modalities).

Given the discussions in #1025, I think we might want to consider the possibility of moving towards a tuple-based model. The model matrix transformation will need some column-level invariants, but it should be possible to formulate a plan based on those invariants that can be applied row-by-row given a source of tuples.

Yes, I think that's right. All you really need at the column level (if I understand correctly) is the type and (for categorical data) the levels.
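
A minimal sketch of that two-stage plan, assuming dummy coding with the first level as reference and a made-up source of named tuples (the `model_row` helper is hypothetical): one pass collects the column-level invariant, here the levels of `g`, and after that each row can be converted independently.

```julia
# Hypothetical row source: any iterable of NamedTuples will do.
rows = [(x = 0.5, g = "a"), (x = 1.5, g = "b"), (x = 2.5, g = "a"), (x = 3.5, g = "c")]

# Pass 1: collect the only column-level invariant needed here, the levels of `g`.
levels = sort(unique(r.g for r in rows))

# Pass 2: given the levels, each row maps to a model-matrix row on its own.
# The first level is treated as the reference (dummy coding).
model_row(r, levels) = vcat(1.0, r.x, [Float64(r.g == l) for l in levels[2:end]])

# Stack the per-row vectors into an in-memory matrix, one row per observation.
X = permutedims(reduce(hcat, [model_row(r, levels) for r in rows]))
```

Because `model_row` needs nothing but the row and the precomputed levels, the second pass could just as well run over a stream or in chunks instead of materializing `X`.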

Given that we're aiming for something that generalizes to other tabular-like data stores, what about calling the top-level package TableModels.jl?

I guess the question is whether we also want to move general definitions like AIC and BIC from StatsBase to the new package or not. I would think centralizing all modeling functions in a single package would be a good idea.

That seems like a good compromise to me (vs. putting all the table models stuff in StatsBase). Then the idea is that GLM.jl etc. would then re-export these?

Given that we're aiming for something that generalizes to other tabular-like data stores, what about calling the top-level package TableModels.jl?

I will plug for StatsModels.jl.

Then the idea is that GLM.jl etc. would then re-export these?

Yes.

StatsModels also sounds more standard to me than TableModels, which could be understood as a special class of models at first.

I've taken a very crude first stab at pulling all the modeling-related code out and putting it in a StatsModels.jl package, which passes all the modeling tests from DataFrames.jl. I've also confirmed that the remaining DataFrames.jl tests still pass. Since I've based it on the master branch, I can't figure out how to get the tests to pass on Travis.

Still need to see about the modeling-related StatsBase stuff, too, and documentation.

Awesome, thanks for taking the initiative, @kleinschmidt! For Travis, you can modify the YAML to do Pkg.checkout("DataFrames", "master"). You're also welcome to transfer that to JuliaStats if you'd like.

Cool. Though please preserve git history, it shouldn't be too hard. See for example http://gbayer.com/development/moving-files-from-one-git-repository-to-another-preserving-history/. Then moving it to JuliaStats sounds logical.

Preserving history is a good idea. It gets slightly tricky to do it for both the tests and the src (since the tests are not in their own subdirectory). But it should at least be easy to preserve the src/statsmodels history (although it doesn't look to me like it handles renames and reorganizations of files in that directory...)

(On the off chance this will help anyone in the future, this SO answer is a good way to filter history for any arbitrary subset of files/subfolders.)

I've updated that repo with the history, and the tests pass (at least on linux, still waiting on the mac builds). Transferring ownership to JuliaStats sounds good now that things are reasonably stable. What's the procedure for that?

Repo settings -> Transfer ownership

Perhaps we should start a roadmap issue for systematically decoupling functionality from the DataFrame type? Time to dust off the ol' Roadmap.jl repo??

Would be nice to add StatsModels to the JuliaStats webpage.

I agree that it would be good to have it on the website, but maybe after it's registered?

Just wanted to say that I also love the AbstractTable idea and hope it hasn't disappeared meanwhile. I would love to support DataFrames in the new MLDataUtils refactor for eachobs, eachbatch, splitobs, etc., but am a bit hesitant to require the full DataFrames.jl repo, since all I would need is nrow, getindex, and a type to dispatch on.

@Evizero "Meanwhile" what? We're rather making progress on that front. I think the plan is to replace JuliaData/AbstractTables with davidagold/AbstractTables.jl soon.

I think he's just saying exactly what I'm thinking, which is that we can't wait for tabular data to be as flexible and extendable as AbstractArray. Looking forward to the results of AbstractTables!


I'd like to suggest another, super simple interface for tabular data: an iterator of immutables, where the convention is that you iterate through the rows of a table. Each row would be represented by an immutable type. This kind of interface would not require any abstract base types, it can purely work based on utilizing existing julia base conventions.

A function that consumes such an iterable and wants to explore the schema of the data source can simply use the following standard julia methods for doing so:

  • to get the names of the columns use fieldnames(eltype(iter)).
  • to get the types of the columns use eltype(iter).types.
  • to get the number of columns use length(eltype(iter).types) or length(fieldnames(eltype(iter)))
  • to get the number of rows use length(iter) (but check first whether that is supported by using the usual julia base methods)
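
The schema queries in the list above can be tried on a small vector of named tuples. This is only a sketch: the data is made up, and `fieldtypes` and `fieldcount` are the accessor functions available in current Julia versions for what the list spells as `eltype(iter).types` and its length.

```julia
# Hypothetical data source: a plain Vector of NamedTuples, one per row.
iter = [(x = 0.5, g = "a"), (x = 1.5, g = "b")]

T = eltype(iter)              # the row type
colnames = fieldnames(T)      # column names as a tuple of Symbols
coltypes = fieldtypes(T)      # column types as a tuple of types
ncols = fieldcount(T)         # number of columns

# Not every iterator knows its length; check the trait before calling `length`.
has_len = Base.IteratorSize(iter) isa Union{Base.HasLength, Base.HasShape}
nrows = has_len ? length(iter) : nothing
```

Nothing here depends on a table package: every call is a Base function applied to the iterator or its element type.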

We could of course define some helper functions that wrap these, but they would be helpers, not part of the interface contract. It might also make sense to define a SimpleTrait that indicates that a type supports this interface, so that one can dispatch on it.

This kind of interface is pervasive in Query.jl, so if for example ModelFrame could work with this interface, there would be a really, really nice integration between the query framework and the model estimation framework. I stumbled over this yesterday when trying to do everything in this chapter using Query.jl, and literally the only thing that is missing right now is the ability to pass an iterable of immutables to something like lm in GLM.jl to make everything work (and I actually think in a more natural syntax than the R original).

This kind of interface might be in addition to something like AbstractTables, which might provide additional benefits. But it would be really great if the most simple interface in this universe did not require types to inherit from some base type, because that doesn't square at all with e.g. the design of Query.jl.

I don't think the idea was ever to require types to subtype a certain abstract type. AbstractTable would indeed be an abstract type, but interface methods would be carefully defined around it so that as long as types implement the required interface methods, they'd be able to participate in any provided interface functionality. I think David Gold has been making some great progress on the required interface methods in his AbstractTables.jl package and it'd be great to coordinate with Query.jl as well. I think the kind of RowIterator interface you're talking about would definitely fit with what David's already put together.

I'm suggesting we don't define new functions for this most basic interface, but instead just rely on what is in base already, i.e. plug into the existing iterator and type interface in julia. The major benefit would be that any iterator in base is automatically a data source and could e.g. be passed to ModelFrame or plot etc. If we define new functions for this interface, like the ones in davidagold/AbstractTables.jl, then we would have to implement these new functions for all the iterators in base for them to work in this framework. That seems unnecessary, right?

No, in my mind, we would make something like the helper functions you mentioned above a part of the "official" AbstractTable interface. Downstream packages will then code to the AbstractTable interface. I don't think it would require re-implementing anything for all iterators anyway; if needed, we define the helper functions once that take any iterable and then it's good to go.

Ah, yes, I don't mind if we make the helper functions the API for clients. But it would be great if sources don't even have to depend on AbstractTable for everything to work.

I'm with @quinnj on this one. I think packages that want to support tabular data formats should write things in terms of AbstractTables, then DataFrames and whatever else will "just work" when plugged in. While I understand the appeal, I don't think we should have a table type masquerading as something from Base, or bloat Base with something specific to tables. I think the use of a tabular data structure should be explicit and having it come from a package seems 👌 to me.

My proposal would add nothing to base. All I'm suggesting is that the most basic way we think about a table is as an iterator of named tuples. For that interface, there is no need to define any new methods or types; one can just use the standard methods in base to inquire about the complete schema of a data source.

Oh sorry, I misunderstood.

@davidanthoff, making ModelMatrix work for something like an iterator for named tuples is on my TODO list for StatsModels.

I agree with David that a subtype declaration is overly restrictive, and that an interface contract would be more appropriate. I also agree that an iterator over row-like objects should be able to satisfy this interface, hence the interface oughtn't to require anything over and above what such an iterator can provide. Indeed, the most basic AbstractTable interface should really just allow you to extract schema information from a "table". Here's some (constructive, I hope) criticism of the proposal:

  • The interface David suggests essentially requires that the immutables be named tuples. This seems overly restrictive. It seems as though one ought to be able to satisfy the interface contract with an iterator over plain tuples that stores the field -> column index mapping in the iterator itself.
  • The selector methods David describes are kind of verbose and unclear. For instance, it seems preferable to just be able to do ncol(iter) as opposed to length(eltype(iter).types).
  • Immutable-returning iteration perhaps shouldn't be part of the most basic tabular interface, since this would preclude a table type that just wraps a database connection.
  • An abstract AbstractTable type is useful for hooking into generic functionality that relies on dispatch, e.g. show.
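
To make the first two points concrete, here is a minimal sketch of an iterator over plain tuples that keeps the name-to-column mapping on the iterator itself and exposes short schema helpers. `TupleTable`, `ncol`, and `colindex` are hypothetical names defined only for this example; they are not taken from AbstractTables.jl.

```julia
# Hypothetical wrapper: rows are plain tuples, column names live on the iterator.
struct TupleTable{T<:Tuple}
    names::Vector{Symbol}
    rows::Vector{T}
end

# Plug into the standard iteration protocol by delegating to the row vector.
Base.iterate(t::TupleTable, state...) = iterate(t.rows, state...)
Base.length(t::TupleTable) = length(t.rows)
Base.eltype(::Type{TupleTable{T}}) where {T} = T

# Hypothetical schema helpers, in the spirit of the `ncol` suggested above.
ncol(t::TupleTable) = length(t.names)
colindex(t::TupleTable, name::Symbol) = findfirst(==(name), t.names)

t = TupleTable([:x, :g], [(0.5, "a"), (1.5, "b")])

# Look a column up by name, then read it row by row through plain iteration.
gs = [row[colindex(t, :g)] for row in t]
```

A consumer written against `ncol` and `colindex` never inspects the row type, so the same calls could in principle be implemented by a table that wraps something other than in-memory tuples.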

I think that AbstractTables.jl would be an appropriate place to formalize and document this interface contract and house the AbstractTable type, for whatever the latter may be useful for. I also realize I house backend support for SQ there. I'm happy to move that code elsewhere and make the AbstractTables package more neutral.

DataFrames master no longer has modeling functionality; that's been moved to StatsModels.
