Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro
The initial Excel ingest implementation is a proof of concept: it supports rectangular data (columns and rows) of string and numeric types.
More functionality may be required (for example, dates, missing-value support, and variable labels), but it is unclear what researchers who use Excel to store their data typically expect.
This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.
As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities out there that may answer some questions and provide a starting point.
@posixeleni Any update on this?
@eaquigley Will need to review the documentation mentioned above to see if it can help.
@landreev have you had any luck with this?
Will need to move this to another milestone.
@kcondon @landreev Found some best practices on how we can ingest Excel files https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf but not sure if these would be applied across the board by all researchers and how we would handle them exactly in Dataverse:
@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.
Probably best to just move this to 4.1, but first checking with @landreev if there are any low hanging fruit here
@posixeleni :
Eleni, I've spoken to Gustavo and Liz about this; we decided that we'll attend to putting together any best practices/guidelines documentation in 4.1.
In the short run though (for 4.0), any chance you know of any real-life Excel files that any of our partners would actually want to ingest as tabular data?
I'll try and run the ingest thingy on more XLSX files from production; but if you have any extra leads - please let me know.
@landreev unfortunately my leads for getting real-life Excel files from our partners fell through. They never sent me any after several follow-up attempts. Sorry I couldn't be of more assistance. But I will send some your way if I ever do get Excel data from in the wild.
I mentioned we're looking for XLSX files in the #sciencelab channel at irc.mozilla.org and @kaythaney from @mozillascience tweeted about it! Thanks @kaythaney! She also suggested emailing http://lists.software-carpentry.org/mailman/listinfo/discuss_lists.software-carpentry.org which is an organization @devbioinfoguy is involved in, I believe.
Any specific type of excel files you're looking for? I'm sure I've got (anonymized) UV-Vis data (absorbance vs. wavelength) kicking around if you want them.
Greetings! I have a ton (ok, quite a #) of Excel files I've generated over the past 10 years doing evolution and developmental biology research. Happy to pass a few along.
Interestingly, DataCarpentry (sister group to Software Carpentry (SWC)) is looking to roll out workshops in subject-matter areas: biology (genomics) and the social sciences first. Structured data/Excel is going to be one of their modules, and it's possible that SWC may talk about some of this in their SQL module (http://software-carpentry.org/lessons.html).
Thanks so much @pbulsink and @devbioinfoguy!
We would love any anonymized XLSX files, but @landreev can let you know if he needs anything specific to test with. Please email your XLSX files to supportATdataverseDOTorg and, if you don't mind, we may bug you to confirm with us that these files are being ingested correctly.
@pbulsink
@devbioinfoguy
Philip, Bob,
Yeah, what Eleni said - thanks so much!
As long as they are XLSX files (i.e., the "new", XML-based Excel format, as opposed to the old, proprietary MS binary format)... We are not looking for any specific type of files; as in, we don't know yet what to specifically look for. Support for the Excel format is a brand new feature in Dataverse, because historically the social sciences have been more of an SPSS/Stata/R shop. Getting data out of XLSX _seems_ like a relatively straightforward thing: as long as it follows the basic pattern of a rectangular data matrix, with variables (columns) and observations (rows), we should be able to turn it into our data table. But it's reasonable to expect that once we try the whole thing with files from real researchers, there will be some issues: some bad formatting, something stripped or lost, etc. So any real-life examples of real research data would be of much help, to try to ingest, to try to run our data exploration and analysis tools on, etc.
And it sounds like what you have is as "real life" as it gets.
Thanks again,
-Leonid
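The "rectangular data matrix" pattern described above can be sketched as a simple validation pass. This is purely illustrative Python, not Dataverse's actual (Java) ingest code; the function name and sample data are hypothetical:

```python
def is_ingestable(rows):
    """Check that rows form a rectangular matrix of string/numeric cells,
    mirroring what the proof-of-concept Excel ingest supports."""
    if not rows:
        return False
    width = len(rows[0])
    return all(
        len(row) == width
        and all(isinstance(cell, (str, int, float)) for cell in row)
        for row in rows
    )

# A well-formed sheet: a header row plus uniform data rows.
good = [["name", "score"], ["alice", 3.5], ["bob", 4]]
# A ragged sheet: row lengths differ, so this ingest would reject it.
ragged = [["name"], ["alice", 3.5]]
```

Real-world spreadsheets with merged cells, notes in the margins, or mixed types in a column would fail a check like this, which is exactly the kind of issue real researchers' files are expected to surface.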
People with data in Excel aren't usually thinking about exporting it to another tool and rarely have it formatted in a way that would make it easy or even possible to ingest. As @devbioinfoguy mentions, that's why in Data Carpentry, we have a 'data organization in spreadsheets' lesson (https://github.com/datacarpentry/datacarpentry/tree/master/lessons/excel/ecology-examples). So, it sounds like you're looking for Excel spreadsheets, from domain scientists, that would attempt to conform to the guidelines @posixeleni lays out, rather than just some standard wild Excel data? If you had a blog post on a call for this type of data, or whatever else you need, we'd be happy to post it at datacarpentry.org Thanks for all the work you're doing!
Ticket in RT w/ excel files for @landreev to test out:
Queue: DVN_support
Subject: Anonymized Excel Files
Ticket URL: https://help.hmdc.harvard.edu/Ticket/Display.html?id=192570
I didn't have time to finish writing up an update for this, for beta 12. I'll get it done for 13, and I'll push in a couple of small ingest fixes I've added since then.
Based on comments in this ticket and additional research, @mcrosas will create requirements for this ingest.
Related: #2626
We use .csv files as a 'lowest common denominator' filetype for a project aimed largely at in-country partners. They are well-structured/formatted (no spaces in header rows, no blanks, no mixing of datatypes in the same column, etc.) but still not compatible with TwoRavens. One option is to upload them as Rdata files and let users download the tsv created during ingest, but ideally we would like to use our files as-is. What about an option to add field-specific metadata during ingest that's not picked up from the file?
@tracykteal thanks for linking to https://github.com/datacarpentry/archive-datacarpentry/tree/master/lessons/excel/ecology-examples which seems like good advice for researchers using Excel.
@zward I'm not exactly sure what you mean by "field specific metadata" but in DVN 3.x we used to have the idea of a "control card" file (see http://guides.dataverse.org/en/3.6.2/dataverse-user-main.html#control-card-based-data-ingest ) but from what I understand we got away from this concept in Dataverse 4 because it was too confusing.
Another example of an XLSX file that failed ingest is Harvard_Dataverse_homepage_survey_results_assessment.xlsx from https://dataverse.harvard.edu/file.xhtml?fileId=2969155&version=2.0 so perhaps we can work with @jggautier on figuring out either how to improve ingest to accommodate that file or how to improve the docs to give users more help in formatting their files so that ingest can succeed.
Just ran into a guide from Cornell about preparing tabular data for archiving: https://data.research.cornell.edu/content/tabular-data. (Looked but couldn't find anything similar at Harvard.) Could be helpful if we decided to improve the docs.
This is a nice documentation piece, but for what we need it's written too much from the perspective of archiving alone. I agree that the docs should include this type of documentation on working with files, and files to upload, but we should write from the point of view of a researcher or data author: a researcher working with quantitative data will usually already work with R, Stata, or a similar data package. For those who don't use any of those, recommendations on what to do when creating a CSV file can be helpful.
Let's discuss this; it would be a very useful exercise to write this as part of the docs.
Merce
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard
University
@mercecrosas mercecrosas.com
In a Slack conversation with @thegaryking about better support for CSV ingest, @landreev wrote:
the current implementation does not allow new lines embedded in text fields, because we expect every line to contain the same number of comma-separated fields. In that particular file [that failed ingest] there are opening and closing quotes, before and after the chunk of text with new lines... So yes, it can still be parsed unambiguously.
Should be a straightforward fix. We may have just assumed that new lines would not be meaningful to anybody in tabular data. So stripping them and/or replacing them with spaces would make an ingestible CSV file.
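As an illustration of why such files are still unambiguous: a quote-aware parser treats a newline inside a quoted field as part of the value, not as a record separator. Python's standard `csv` module is used here purely as a sketch (Dataverse's ingest is Java); the sample data is made up:

```python
import csv
import io

# One quoted field contains an embedded newline.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# A quote-aware reader keeps the embedded newline inside the field
# instead of starting a new record at it, so all three records parse.
rows = list(csv.reader(io.StringIO(raw)))
```

This is the behavior a third-party parser (such as Apache Commons CSV, mentioned below in the thread) would provide out of the box.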
Have you considered delegating CSV support to external libraries like Apache Commons CSV? I see it is listed as a dependency for Dataverse, but the ingest apparently still depends on @landreev's own code.
@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would.
Yesterday @raprasad was showing us how Python's Pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107
There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.
There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
Also, when that original implementation was put together, we did not anticipate that anyone would use CSV ingest for real, full text, with rich punctuation, extra tabs, commas, and newlines. Since this appears to be a requirement now, yes, switching to an existing third-party parser that takes care of all these things seems like the right thing to do.
I'm opening a new issue for some concrete incremental improvements to the CSV ingest, to be estimated and included in a sprint in the short term. (This one is just too generic.)
Does #3767 close this?
I have two additions on this -- there are a couple of tickets on Excel ingest, and this seemed the most appropriate one, but please redirect me if I missed something.
@adam3smith thanks, these are related:
Great. #2720 addresses what I want with 1., but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to get the Excel file I uploaded back out after ingest (at least I'm not finding it). That may actually be a bug?
Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format?
This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), but I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22.
I just uploaded an .xlsx file on Harvard Dataverse that ingested and was able to download the original file format.

Sorry for the false alarm -- this works correctly for me not just on demo.dataverse but also in our dev instance on 4.8. It does seem broken in the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?
@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for improved functionality between demo (4.8.4) and your 4.6.2.
Since this issue was opened, it has been addressed for the CSV ingest plugin, which was much improved.
I propose we close this issue, but open a new, separate issue for the Excel plugin, and try to address it ASAP (there are a bunch of open Excel issues for various things the current plugin does wrong).
@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk and @qqmyers made pull request #5891 for it yesterday. (Thank you!! 🎉 ) Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects.
@adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above?
Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.
@pdurbin @landreev I opened this issue originally because the initial implementation was very basic: we did not have a good idea of what set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then become something specific. We lucked out that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific issue sounds reasonable, and I'm happy it has specifics, since when I opened this there was not an agreed-upon set of specifics. So, 1. open a new ticket with specifics, then 2. close this ticket ;)
@pdurbin - I just had two issues here: One was resolved, the other one is fully captured by #2720 , so I'm all good with closing this.