Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro
The initial Excel ingest implementation is a proof of concept: it supports rectangular data (columns and rows) of string and numeric types.
More functionality may be required (for example, dates, missing-value support, and variable labels), but it is unclear what researchers who use Excel to store their data typically expect.
This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.
As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities out there that may answer some questions and provide a starting point.
@posixeleni Any update on this?
@eaquigley Will need to review the documentation mentioned above to see if it can help.
@landreev have you had any luck with this?
Will need to move this to another milestone.
@kcondon @landreev Found some best practices on how we can ingest Excel files https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf but not sure if these would be applied across the board by all researchers and how we would handle them exactly in Dataverse:
@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.
Probably best to just move this to 4.1, but first checking with @landreev if there are any low hanging fruit here
@posixeleni :
Eleni, I've spoken to Gustavo and Liz about this; we decided that we'll attend to putting together any best practices/guidelines documentation in 4.1.
In the short run though (for 4.0), any chance you know of any real-life Excel files that any of our partners would actually want to ingest as tabular data?
I'll try and run the ingest thingy on more XLSX files from production; but if you have any extra leads - please let me know.
@landreev unfortunately my leads for getting real-life Excel files from our partners fell through. They never sent me any after several follow-up attempts. Sorry I couldn't be of more assistance. But I will send some your way if I ever do get Excel data from in the wild.
I mentioned we're looking for XLSX files in the #sciencelab channel at irc.mozilla.org and @kaythaney from @mozillascience tweeted about it! Thanks @kaythaney! She also suggested emailing http://lists.software-carpentry.org/mailman/listinfo/discuss_lists.software-carpentry.org which is an organization @devbioinfoguy is involved in, I believe.
Any specific type of excel files you're looking for? I'm sure I've got (anonymized) UV-Vis data (absorbance vs. wavelength) kicking around if you want them.
Greetings! I have a ton (ok, quite a #) of Excel files I've generated over the past 10 years doing evolution and developmental biology research. Happy to pass a few along.
Interestingly, DataCarpentry (sister group to Software Carpentry (SWC)) is looking to roll out workshops in subject-matter areas: biology (genomics) and the social sciences first. Structured data/Excel is going to be one of their modules, and it's possible that SWC may talk about some of this in their SQL module (http://software-carpentry.org/lessons.html).
Thanks so much @pbulsink and @devbioinfoguy!
We would love any anonymized XLSX files, but @landreev can let you know if he needs anything specific to test with. Please email your XLSX files to supportATdataverseDOTorg and, if you don't mind, we may bug you to confirm with us that these files are being ingested correctly.
@pbulsink
@devbioinfoguy
Philip, Bob,
Yeah, what Eleni said - thanks so much!
As long as they are XLSX files (i.e., the "new", XML-based Excel format, as opposed to the old, proprietary MS binary format)... We are not looking for any specific type of files; as in, we don't know yet what to specifically look for. Support for the Excel format is a brand new feature in Dataverse, because historically the social sciences have been more of an SPSS/Stata/R shop. Getting data out of XLSX _seems_ like a relatively straightforward thing: as long as it follows the basic pattern of a rectangular data matrix, with variables (columns) and observations (rows), we should be able to turn it into our data table. But it's reasonable to expect that once we try the whole thing with files from real researchers, there will be some issues: some bad formatting, something stripped or lost, etc. So any real-life examples of real research data would be of much help, to try to ingest, to try to run our data exploration and analysis tools on, etc.
And it sounds like what you have is as "real life" as it gets.
Thanks again,
-Leonid
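The "rectangular data matrix" pattern described above can be sketched as a simple validation pass. This is purely illustrative Python, not Dataverse's actual (Java) ingest code; the function name and sample data are hypothetical:

```python
def is_ingestable(rows):
    """Check that rows form a rectangular matrix of string/numeric cells,
    mirroring what the proof-of-concept Excel ingest supports."""
    if not rows:
        return False
    width = len(rows[0])
    return all(
        len(row) == width
        and all(isinstance(cell, (str, int, float)) for cell in row)
        for row in rows
    )

# A well-formed sheet: a header row plus uniform data rows.
good = [["name", "score"], ["alice", 3.5], ["bob", 4]]
# A ragged sheet: row lengths differ, so this ingest would reject it.
ragged = [["name"], ["alice", 3.5]]
```

Real-world spreadsheets with merged cells, notes in the margins, or mixed types in a column would fail a check like this, which is exactly the kind of issue real researchers' files are expected to surface.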
People with data in Excel aren't usually thinking about exporting it to another tool and rarely have it formatted in a way that would make it easy or even possible to ingest. As @devbioinfoguy mentions, that's why in Data Carpentry, we have a 'data organization in spreadsheets' lesson (https://github.com/datacarpentry/datacarpentry/tree/master/lessons/excel/ecology-examples). So, it sounds like you're looking for Excel spreadsheets, from domain scientists, that would attempt to conform to the guidelines @posixeleni lays out, rather than just some standard wild Excel data? If you had a blog post on a call for this type of data, or whatever else you need, we'd be happy to post it at datacarpentry.org Thanks for all the work you're doing!
Ticket in RT w/ excel files for @landreev to test out:
Queue: DVN_support
Subject: Anonymized Excel Files
Ticket URL: https://help.hmdc.harvard.edu/Ticket/Display.html?id=192570
I didn't have time to finish writing up an update for this, for beta 12. I'll get it done for 13, and I'll push in a couple of small ingest fixes I've added since then.
Based on comments in this ticket and additional research, @mcrosas will create requirements for this ingest.
Related: #2626
We use .csv files as a 'lowest common denominator' filetype for a project aimed largely at in-country partners. They are well-structured/formatted (no spaces in header rows, no blanks, no mixing of datatypes in the same column, etc.) but still not compatible with TwoRavens. One option is to upload them as Rdata files and let users download the tsv created during ingest, but ideally we would like to use our files as-is. What about an option to add field-specific metadata during ingest that's not picked up from the file?
@tracykteal thanks for linking to https://github.com/datacarpentry/archive-datacarpentry/tree/master/lessons/excel/ecology-examples which seems like good advice for researchers using Excel.
@zward I'm not exactly sure what you mean by "field specific metadata" but in DVN 3.x we used to have the idea of a "control card" file (see http://guides.dataverse.org/en/3.6.2/dataverse-user-main.html#control-card-based-data-ingest ) but from what I understand we got away from this concept in Dataverse 4 because it was too confusing.
Another example of an XLSX file that failed ingest is Harvard_Dataverse_homepage_survey_results_assessment.xlsx from https://dataverse.harvard.edu/file.xhtml?fileId=2969155&version=2.0 so perhaps we can work with @jggautier on figuring out either how to improve ingest to accommodate that file or how to improve the docs to give users more help in formatting their files so that ingest can succeed.
Just ran into a guide from Cornell about preparing tabular data for archiving: https://data.research.cornell.edu/content/tabular-data. (Looked but couldn't find anything similar at Harvard.) Could be helpful if we decided to improve the docs.
This is a nice documentation piece, but for what we need it's written too much from the perspective of archiving alone. I agree that the docs should include this type of documentation on working with files, and files to upload, but we should write from the point of view of a researcher or data author: a researcher working with quantitative data will usually already work with R, Stata, or a similar data package. For those who don't use any of those, recommendations on what to do when creating a CSV file can be helpful.
Let's discuss this; it would be a very useful exercise to write this as part of the docs.
Merce
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard
University
@mercecrosas mercecrosas.com
In a Slack conversation with @thegaryking about better support for CSV ingest, @landreev wrote:
the current implementation does not allow new lines embedded in text fields, because we expect every line to contain the same number of comma-separated fields. In that particular file [that failed ingest] there are opening and closing quotes, before and after the chunk of text with new lines... So yes, it can still be parsed unambiguously.
Should be a straightforward fix. We may have just assumed that new lines would not be meaningful to anybody in tabular data. So stripping them and/or replacing them with spaces would make an ingestible CSV file.
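As an illustration of why such files are still unambiguous: a quote-aware parser treats a newline inside a quoted field as part of the value, not as a record separator. Python's standard `csv` module is used here purely as a sketch (Dataverse's ingest is Java); the sample data is made up:

```python
import csv
import io

# One quoted field contains an embedded newline.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# A quote-aware reader keeps the embedded newline inside the field
# instead of starting a new record at it, so all three records parse.
rows = list(csv.reader(io.StringIO(raw)))
```

This is the behavior a third-party parser (such as Apache Commons CSV, mentioned below in the thread) would provide out of the box.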
Have you considered delegating CSV support to external libraries like Apache Commons CSV? I see it is listed as a dependency for Dataverse, but the ingest apparently still depends on @landreev's own code.
@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would.
Yesterday @raprasad was showing us how Python's Pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107
There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.
There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
Also, when that original implementation was put together, we did not anticipate that anyone would use CSV ingest for real, full text, with rich punctuation, extra tabs, commas, and newlines. Since this appears to be a requirement now, yes, switching to an existing third-party parser that takes care of all these things seems like the right thing to do.
I'm opening a new issue for some concrete incremental improvements to the CSV ingest, to be estimated and included in a sprint in the short term. (This one is just too generic.)
Does #3767 close this?
I have two additions on this -- there are a couple of tickets on Excel ingest, and this seemed the most appropriate one, but please redirect me if I missed something.
@adam3smith thanks, these are related:
Great. #2720 addresses what I want with 1., but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to get the Excel file I uploaded back out after ingest (at least I'm not finding it). That may actually be a bug?
Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format?
This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), but I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22.
I just uploaded an .xlsx file on Harvard Dataverse that ingested and was able to download the original file format.

Sorry for the false alarm -- this works correctly for me not just on demo.dataverse but also in our dev instance on 4.8. It does seem broken in the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?
@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for improved functionality between demo (4.8.4) and your 4.6.2.
Since this issue was opened, it has been addressed for the CSV ingest plugin, which was much improved.
I propose we close this issue, but open a new, separate issue for the Excel plugin, and try to address it ASAP (there are a bunch of open Excel issues for various things the current plugin does wrong).
@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk and @qqmyers made pull request #5891 for it yesterday. (Thank you!! 🎉 ) Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects.
@adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above?
Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.
@pdurbin @landreev I opened this issue originally because the initial implementation was very basic: we did not have a good idea of what set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then become something specific. We lucked out that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific issue sounds reasonable, and I'm happy it has specifics, since when I opened this there was not an agreed-upon set of specifics. So, 1. open a new ticket with specifics, then 2. close this ticket ;)
@pdurbin - I just had two issues here: One was resolved, the other one is fully captured by #2720 , so I'm all good with closing this.