Dataverse: replace download option "RData" with RDS

Created on 25 Feb 2020 · 6 Comments · Source: IQSS/dataverse

In https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX each of the 4 data files provides the option to download the individual file in RData format.
This is nice for R users, but in this example the downloaded data get inserted as an object called "x" in R, so loading several files repeatedly overwrites object x.

As an alternative to the RData format, I suggest using R's RDS format (https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html), which

  • saves exactly one object per file
  • can be assigned to any name when read back, fully under the programmer's control.
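
The difference can be reproduced locally. The following is a minimal sketch, not the Dataverse export code: the temporary files and data frames are made up for illustration, and each RData file is assumed to hold a single object named x, as described above.

```r
# Two hypothetical "downloads", each saved under the object name "x".
f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".RData")
x <- data.frame(id = 1:2)
save(x, file = f1)
x <- data.frame(id = 3:5)
save(x, file = f2)
rm(x)

load(f1)  # creates x in the current environment
load(f2)  # silently replaces x with the second file's content
nrow(x)   # 3: only the second file's data remain

# The same data as RDS: the caller picks the name on reading.
r1 <- tempfile(fileext = ".rds")
r2 <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:2), r1)
saveRDS(data.frame(id = 3:5), r2)
first  <- readRDS(r1)
second <- readRDS(r2)
```

With RDS both tables are available side by side as first and second; with RData the first one is lost unless it is copied to another name before the second load() call.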
File Upload & Handling Suggestion

All 6 comments

I think an rds is preferable to an RData file, for the reasons that @kuriwaki and I have discussed in the r client repo. Please let us know if you'd like to discuss it more.

@wibeasley thanks for jumping in.

@reikoch thanks for opening this issue. I'm not a very good R developer and I'm ignorant about these formats but my first thought is... are you sure you want to replace the ability to download RData format with RDS format? I'm concerned about scripts that may rely on the older format (I assume it's older) for reproducibility. I would think adding RDS support would be safer, more backward compatible. So we'd offer both formats, I'm saying.

Backward compatibility is probably necessary.
I would still echo @reikoch's point that the way the RData download is currently implemented is error-prone. Because these download options only apply to ingestible data (i.e., a single table, not a bundle or environment) to begin with, there is no real reason to prefer .rda over .rds in this setting.

By the way, for this particular Dataverse file, I think downloading the original .csv file and reading it in as csv is preferable to transforming it into RData/RDS.

Well, generally I think it is bad to use a mechanism like the RData format, where you cannot determine the target's name when loading. load('xy.RData') can silently overwrite existing objects in the R session, whereas myvar <- readRDS('xy.rds') lets me decide under which name the content of xy.rds is brought in. A nice essay on this topic is at https://yihui.org/en/2017/12/save-vs-saverds/.
If you feel backwards compatibility is needed for a while, what about deprecating RData downloads first? No problem with RData for uploads.
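
While RData downloads remain available, a consumer can avoid the overwrite by loading into a separate environment. This is a minimal sketch under assumptions: the file is a stand-in created locally, and the names e, loaded and vitals are made up for illustration.

```r
# Stand-in for a downloaded RData file that contains one object named "x".
f <- tempfile(fileext = ".RData")
x <- data.frame(id = 1:3)
save(x, file = f)

# An existing object in the session that must not be clobbered.
x <- "something I do not want to lose"

# Load into a fresh environment instead of the current one.
e <- new.env()
loaded <- load(f, envir = e)  # load() returns the names of the restored objects
vitals <- e[[loaded[1]]]      # pull the object out under a name of my choosing
```

The session's own x is untouched, and the loaded table is available as vitals. This is more ceremony than a single readRDS() call, which is the point being made in the thread.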

True, R can read pretty much any file format, but rds and RData are type safe (dates are stored as dates, etc.), whereas csv is not. In addition, plain csv does not specify the encoding of the data. http://frictionlessdata.io/ might be a way out, as data packages store these metadata; xlsx does so too.

As a consumer I love type safe data formats in specified encoding!
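
The type-safety point can be checked with a small round trip. This is a sketch with a made-up data frame (the column name visit is illustrative), using default read.csv() settings:

```r
# A made-up table with a proper Date column.
df <- data.frame(visit = as.Date(c("2020-02-25", "2020-03-01")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

class(from_csv$visit)  # not "Date": character in R >= 4.0, factor in earlier versions
class(from_rds$visit)  # "Date"
```

The csv reader has to guess the column type from text, so the Date class is lost unless the reader is told about it (for example via colClasses); the RDS file carries the class with the data.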

> rds and RData are type safe (dates are noted as such etc), csv is not.

  • I agree; my point was that for the particular dataset you linked to, the authors uploaded their data in csv originally. So there's no loss of info to download it as csv.

OK, that means the RData file is derived from the csv file, making some assumptions about encoding. Looking at the variable VSORRESU in CSC305ABC_VS, it seems that the csv file was encoded in Latin1, which the derivation did not pick up: see the unit for temperature measurements.
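
The kind of damage described here can be illustrated in isolation. The sketch below assumes, purely for illustration, that the affected unit contains a degree sign; the actual bytes in the dataset were not inspected.

```r
# Bytes as they would appear in a Latin-1 encoded file: 0xB0 is the degree sign
# in Latin-1, 0x43 is "C". (The degree sign is an assumed example.)
raw_unit <- rawToChar(as.raw(c(0xB0, 0x43)))

# A lone 0xB0 byte is not valid UTF-8, a hint that the file is not UTF-8.
validUTF8(raw_unit)  # FALSE

# Declaring the true source encoding converts the bytes correctly.
fixed <- iconv(raw_unit, from = "latin1", to = "UTF-8")
utf8ToInt(fixed)     # code points 176 (degree sign) and 67 ("C")
```

When reading a whole file, the same declaration can be made with the fileEncoding argument of read.csv(), e.g. fileEncoding = "latin1".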

Maybe just provide original file and a quick analysis of encoding and csv dialect for data uploaded as csv?
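
A quick dialect check of the kind suggested here could look roughly like the following. This is a naive sketch, not an existing Dataverse feature: guess_sep is a hypothetical helper, the candidate separators are assumptions, and real files would need more care (quoting, embedded newlines, ragged rows).

```r
# Hypothetical helper: guess the field separator of a delimited text file.
guess_sep <- function(path, candidates = c(",", ";", "\t")) {
  for (sep in candidates) {
    n <- utils::count.fields(path, sep = sep, quote = "\"", comment.char = "")
    # Accept a separator that splits every line into the same number (> 1) of fields.
    if (!anyNA(n) && length(unique(n)) == 1 && n[1] > 1) return(sep)
  }
  NA_character_
}

# Made-up example: semicolon-separated file with decimal commas.
path <- tempfile(fileext = ".csv")
writeLines(c("id;value", "1;3,5", "2;4,0"), path)
guess_sep(path)  # ";"
```

The first candidate that yields a consistent field count wins, so the order of candidates matters; reporting the guess alongside the original file would let the consumer confirm or override it.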
