OpenRefine: Add new rows to an existing project from an external file

Created on 2 May 2013 · 18 Comments · Source: OpenRefine/OpenRefine

I have been using OpenRefine to clean data, collected across several years, in Excel/CSV files. The ability to load multiple files at once upon project creation has been very valuable. Now that I have a cleaned dataset, I anticipate the need to do this again with new data. I would like to be able to load a new Excel/CSV file into the existing OpenRefine dataset and do cleaning on it. Ideally, this new dataset would come in as new rows that I can then start manipulating.

The only way I can see to do this currently is by exporting the entire project, doing the addition outside OpenRefine, then importing the data anew. Of course, this would lose all of the valuable editing history of the project.
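The export-append-reimport workaround described above can at least be scripted so the append step is less error-prone. A minimal sketch (the function name and sample data are illustrative), assuming the exported file and the new file share the same header row:

```python
import csv
import io

def append_rows(existing_csv: str, new_csv: str) -> str:
    """Append rows from new_csv onto existing_csv, assuming both
    share the same header; the new file's header row is dropped."""
    existing = list(csv.reader(io.StringIO(existing_csv)))
    new = list(csv.reader(io.StringIO(new_csv)))
    if existing[0] != new[0]:
        raise ValueError("headers differ; align columns first")
    merged = existing + new[1:]  # skip the duplicated header row
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(merged)
    return out.getvalue()

combined = append_rows("id,name\n1,Ada\n", "id,name\n2,Grace\n")
```

The header check matters: silently appending files whose columns are ordered differently would corrupt the dataset on re-import.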

enhancement Medium


All 18 comments

This is closely related to #556

Only somewhat related. #556 is looking to make new blank rows that can then be edited directly. I want to make new rows that come from an external dataset. That's different from #556.

:+1:

:+1:

As a use case, I have a single huge file of 8.6 GB that is almost impossible to parse as a whole with my computer. I have split the file into smaller ones, and it would be nice to import them into the same project.
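Splitting a large export into importable chunks can itself be scripted. A small sketch (chunk size and file contents here are made up for illustration) that repeats the header in every chunk so each piece imports cleanly on its own:

```python
import csv
import io

def split_csv(text: str, rows_per_chunk: int) -> list:
    """Split one large CSV into smaller CSV strings, repeating the
    header in each chunk so every piece is independently importable."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerows([header] + body[i:i + rows_per_chunk])
        chunks.append(out.getvalue())
    return chunks

parts = split_csv("id,v\n1,a\n2,b\n3,c\n", 2)
```

For a real 8.6 GB file you would stream line by line rather than read it into memory, but the chunking logic is the same.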

There is this trick (though I didn't get it to work): http://kb.refinepro.com/2011/12/add-extra-rows-records-in-google-refine.html?m=1

And there is this solution: https://code.google.com/archive/p/google-refine/issues/556
That works if you combine the values into a single cell/row. You can fake a single cell's content and use a different splitting character to do so. Be careful with the default comma as the splitting character, because it will split any row that contains one; I had a hard time figuring out what had happened when I split a cell into two rows with the wrong character. If you split correctly, with a filter applied you should see your new row pop up with the content of the previously split row. And, as mentioned earlier, you can trick OpenRefine into creating a new "empty" row by appending a special character to a row and invoking cell -> split values, then entering your special character in the dialog box.

:)

Is there a more recent solution to this issue?

@biiip The short answer is no. I have released an extension which allows you to add new blank rows to an existing project, but there is no way to import a file into an existing OpenRefine project.
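A common partial workaround is to create a brand-new project from the new file and then replay the operation history extracted from the old project's Undo/Redo tab (Extract...) against it. A hedged sketch of building that request with OpenRefine's HTTP API; the `apply-operations` endpoint path follows my reading of the API, while the port, project id, and the minimal operation JSON shown are assumptions (real extracted operations carry more fields, such as `engineConfig`):

```python
import json
import urllib.parse
import urllib.request

# Assumed local OpenRefine instance; 3333 is the usual default port.
BASE = "http://127.0.0.1:3333"

def build_apply_request(project_id: str, operations: list) -> urllib.request.Request:
    """Build the POST that replays an exported operation history
    (the JSON from Undo/Redo -> Extract) against a project id."""
    url = BASE + "/command/core/apply-operations?project=" + project_id
    body = urllib.parse.urlencode({"operations": json.dumps(operations)}).encode()
    return urllib.request.Request(url, data=body, method="POST")

# Illustrative, stripped-down operation; a real extract has more fields.
ops = [{"op": "core/text-transform",
        "columnName": "name",
        "expression": "value.trim()"}]
req = build_apply_request("1234567890", ops)
# urllib.request.urlopen(req)  # run only against a live OpenRefine instance
```

This replays the cleaning steps but still leaves you with two separate projects, which is exactly the limitation this issue asks to remove.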

@biiip There has been discussion before from folks who work with this sort of workflow, but the team really needs a better understanding of what it involves. Here is our current understanding of this issue (i.e. what we could do in OpenRefine to make things easier against the following needs; we can follow up, but first we need an understanding of the workflow and the needs)...

  1. User wants to add or append additional data to an existing OpenRefine project (perhaps because the data was too large and was split into multiple CSVs, or because it's new data for a project whose data set grows over time for various reasons).
  2. The user might like to perform, on the newly added rows, the Undo/Redo set of operations (the set of batch operations) that they applied to the previously existing rows in the OpenRefine project.

Now the approach we might take to make things easier for the above...

  1. We could add an Append importer. This new importer would add new rows and tag them with a NEW flag (distinct from the star or flag markers) to keep track of them for further operations, if the user so desires, in 2.
  2. Have a new function or button option in the Undo/Redo dialog (or elsewhere) that allows the user to "Redo all operations on only NEW rows". By default, this would run all checked operations on the NEW rows.
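The two-step proposal above can be sketched in miniature: tag appended rows with a NEW flag, then replay row-wise operations only over the flagged rows. All names here are illustrative, not OpenRefine internals:

```python
def append_rows(table, new_rows):
    """Step 1: append rows tagged NEW=True so later steps can target
    them (distinct from OpenRefine's star/flag markers)."""
    table.extend({**row, "NEW": True} for row in new_rows)

def redo_on_new(table, operations):
    """Step 2: replay a list of row-wise operations, but only on rows
    carrying the NEW flag."""
    for op in operations:
        for row in table:
            if row.get("NEW"):
                op(row)

table = [{"name": " ada ", "NEW": False}]          # already-cleaned row
append_rows(table, [{"name": " grace "}])          # freshly imported row
redo_on_new(table, [lambda r: r.update(name=r["name"].strip())])
```

Note that this only works for operations that are purely row-wise, which is exactly the constraint @wetneb raises below about OpenRefine's data model.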

I think adding rows to the project as an operation is against the philosophy of OpenRefine's data model, because that's not a row-wise (or record-wise) operation. However, the request makes a lot of sense from a practical perspective. It should be possible to apply OpenRefine's operations to a stream of data that is gradually discovered in chunks. That's something to consider in our long term architectural changes.

So I prefer approach 2 of Thad's alternative.

@wetneb Just to be clear, it's a whole approach. That is, 1 and 2 are considered a sequence of actions (1 and 2 are inclusive: we add an Append importer AS WELL AS provide a new function to redo operations on the tagged NEW or appended rows). Is that OK?

UPDATE: and of course there might be other operations to perform or apply to those NEW rows. The important part is that when adding them we mark them as NEW, so that we (or an extension) have the context to do whatever we want with those NEW rows.

Oh okay, I misread your proposal. Yeah, I think your two points make a good proposal, although I would not even give the user the choice of whether to perform 2: it should be performed without asking any questions. Otherwise the edit history is going to be quite messy.

But again, a real solution would be to properly allow running OpenRefine operations on streams, and that requires a big backend change.

@thadguidry Yes, the main issue is that the database is growing all the time and I need to compare new data with the already existing data. To be clear: there are some operations that need to be done only on new data, and there are operations (e.g. deduplication) where I need to compare old and new data. It looks like your solution might help with that.

@thadguidry Do you know approximately how much time this kind of development might take, and when this update might be launched?

@biiip There is other, more important work that we have scoped out. This would be a Medium priority... unless we get, like, 50 folks who +1 this issue :) which doesn't seem to have been the case in 4 years. The other work surrounding this would be our planned data model alterations (some already started with our metadata support and data package support), but as @wetneb says, it's better to wait and let us separate our backend and frontend first, which we have planned for this year (hoping for summer).

The work estimate is probably about 40-80 hours for a single developer, once we have the backend separated out nicely and are comfortable with the operations (this would essentially be an Update operation in CRUD).

Hi all
I've just hit this same issue. Have there been any further developments since the last message in 2018?
Thanks

Ditto.

We don't all work with a complete set. I am often working with a stream of data that is constantly being updated; yesterday's data set is out of date today. The data set I'm looking at is bank transactions. I really don't want to redo all my work and run separate projects. I want to import new data into the existing project.

(@thadguidry this is completely unrelated to "separating backend and frontend" or CRUD, and I am not sure where your work estimate is coming from)

@fyodorovich I would be interested to understand your use case better: at the stage where you want to import new rows into your project, would you want these rows to go through the same cleaning process as the ones you have in your project already? In other words, should the operations you have in your history be reapplied to the new data when you import it? What sorts of operations are you using in your workflow? (They would all need to work row-wise for this to be feasible)

@wetneb Hi Antonin, that work estimate came from Jacky. It was for him to add the ability for users to use the CSV/TSV or line-based importer and add the rows read from it to a user-specified existing OpenRefine project. In 2018, we were talking about separating the frontend and backend (that also initially came from Jacky, as you know) as a priority, and I mentioned it to let users know what our priority was at that time.
Of course, priorities changed as we understood more of our roadmap and our users' frustrations in various areas, so we changed a lot of priorities based on all of us working together, and I'm happy now with our current priorities, largely based on your research and our user surveys, so thanks for that!
