Issue #4665 would help us in our work to integrate Dataverse with Binderhub.
Our goal with the integration is to allow anyone browsing datasets in Dataverse to instantly launch a Binderhub-based reproducible compute environment. Binder is a Jupyter project that lets researchers easily specify software requirements, which are automatically built into Docker containers spawned into a Kubernetes-based Jupyterhub environment.
Our exploration during Spring 2018 of this new feature/integration included two undergraduate students installing and modifying Dataverse; see this example repo: https://github.com/sean-dooher/binderverse
First, we would like to be able to launch Binder from an existing Dataverse instance by adding a button to the Dataverse UI itself. The first iteration hacked the Dataverse code directly to add the button, and further discussion on the Dataverse forum about the External Tools Dataset Extension resulted in this final proof-of-concept at the end of the semester.
We would also like to enable mybinder.org to accept a DOI as input (instead of a GitHub URL) and automatically find a valid Dataverse instance that contains the data and a requirements.txt (and other related Binder files):

For this latter DOI functionality we've also explored an integration of Binderhub + Dataverse + OSF (Open Science Framework).
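To make the DOI idea a bit more concrete, here's a minimal sketch of the resolution step a DOI-aware launcher would need: follow the doi.org redirect to a landing page, then recover the hosting Dataverse's base URL and the dataset's persistent identifier from it. The helper names are our own, not an existing Binder API.

```python
from urllib.parse import urlparse, parse_qs

def doi_resolution_url(doi):
    # doi.org redirects to the dataset's landing page on whichever
    # repository registered the DOI
    return f"https://doi.org/{doi}"

def parse_dataverse_landing(url):
    """Split a Dataverse dataset landing URL into the two pieces a
    launcher would need: the installation's base URL and the dataset's
    persistent identifier."""
    parts = urlparse(url)
    pid = parse_qs(parts.query).get("persistentId", [None])[0]
    return {"siteUrl": f"{parts.scheme}://{parts.netloc}",
            "persistentId": pid}
```

For example, the dev1.dataverse.org dataset linked later in this thread parses into `siteUrl` `https://dev1.dataverse.org` and `persistentId` `doi:10.5072/FK2/QR840C`.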
If we resume this work it would be helpful to have an easy-to-deploy, Dockerized version of Dataverse to simplify rapid prototyping initially, and in the long run we'd also like to run it in production on top of Kubernetes.
@aculich For prototyping, conf/docker-aio might be helpful (it's designed for integration tests, but can also be deployed in a standard, non-production way, which is mostly how it's used in conf/docker-dcm). It's not as rapid as I'd like yet, but https://github.com/pameyer/dataverse/blob/exp-docker_it_iterating/conf/docker-aio/prep_it.bash is intended to be another incremental step in that direction.
@aculich thanks for opening this issue! @wrathofquan just gave me access to the "Spring 2018 CloudWG: Dataverse + Binderhub" doc at https://docs.google.com/document/d/1ZR9AhhAAqCRmXjFkqAm7UAJlxEUe4ooyOgWOILwOIXw/edit?usp=sharing and I'm having a look around.
When you created this issue I shared it at https://groups.google.com/d/msg/dataverse-community/VG6gTMEd_Ps/9walKsGQCQAJ so people could see the mockup of BinderHub operating on a DOI, which is really cool.
This is great, exciting stuff. How can the Dataverse team help? You mentioned #4665 and there's been some discussion there. (I mentioned you at https://github.com/IQSS/dataverse/issues/4665#issuecomment-392296715 .) Anything else you need? We could create separate smaller issues about specific tasks such as letting external tools operate at the dataset level. Thanks!
@aculich hi! Any news? I just checked the "Spring 2018 CloudWG: Dataverse + Binderhub" doc above and it looks like there haven't been any changes since my last comment. No pressure, I'm just checking in. At the Dataverse Community Meeting the other week people were definitely still interested in Jupyter notebook integration!
@pdurbin some movement is happening on this front at the moment: https://github.com/whole-tale/whole-tale/issues/35
Please feel free to join in the discussions there, as the work around federation could use movement on several reference implementations to help binderhub evolve in ways that suit multiple repositories, including Dataverse in the mix. See also these two:
@aculich thanks for linking to https://github.com/whole-tale/whole-tale/issues/35 and I'm glad to see @craig-willis created that issue but I don't think I have enough context to contribute to the discussion. That issue links to https://docs.google.com/document/d/16kL6TPMqNgpiZ-H9LLk1bqm3zc3t_lhKRUpGeK-yq8I/edit?usp=sharing and I'm glad to see you added "and Dataverse" as a suggested edit to indicate that the Binder team has expressed interest in integrating with Dataverse. Are you blocked on anything? Do you need anything from the Dataverse team?
@pdurbin we are not blocked and there is not a specific request (yet) from the Dataverse team. Just wanted to reply to you on this issue thread to give you a sense of what's going on. We'll keep you posted and let you know if we have a specific request.
Thanks @aculich for linking -- I wasn't aware of the progress here. @pdurbin I've added a couple of sentences to the document to provide a little more context (maybe). The short version is that we are trying to better align WT with Binder, particularly concerning how to publish and execute these composite research objects (code + data + environment) from remote repositories. WT has been working mainly with DataONE, and it's great to see the effort here.
I don't have access yet to the "Spring 2018 CloudWG: Dataverse + Binderhub" document, but I'm curious how Binder repos are published to Dataverse. Do I download a zip of my GitHub repo and upload it to Dataverse? Or is there a better way?
@craig-willis I updated the permissions on that document so you should be able to read it _and_ also comment on it, so feel free to comment in the gdoc itself if you have questions or thoughts (and assign the comment to me so I see it).
Also mentioning @wrathofquan, our resident Data Lab Librarian at UC Berkeley, to loop him in, since he is the one who spearheaded these efforts along with the two students who did all of the hands-on work.
Our focus was primarily on the UX integration, rather than the under-the-hood plumbing for the data itself. With a semester-long project where we had only 2 hrs/week of the students' time, we wanted to keep the scope achievable for the time available. The students did a great job helping us advance our thinking in addition to sorting out some concrete implementation details.
I'm curious how Binder repos are published to Dataverse.
@craig-willis Maybe I'm confused about what a Binder repo even is but my understanding is that @aculich et al. want to add a button called something like "Launch Binder" to the dataset pages in Dataverse. Then the DOI of the dataset is sent to Binder automatically rather than having to enter the DOI manually like in the screenshot above. Below I'm cropping just the part where you enter the DOI in Binder:

I don't know what happens to the Binder repo after it's done executing. I've been thinking of Binder as an exploration tool for data that has already been deposited and published in Dataverse. Currently, the "Explore" button in Dataverse can be a dropdown that launches either Data Explorer or TwoRavens, and Binder would be a third "external tool": http://guides.dataverse.org/en/4.9.2/installation/external-tools.html . For more discussion on Binder some day being an external tool for Dataverse, please see https://groups.google.com/d/msg/dataverse-community/VG6gTMEd_Ps/Xy7jDhVoBwAJ
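To make the "third external tool" idea concrete, here's a hedged sketch of what a Binder entry might look like in the external tools manifest format described in the 4.9.x guides linked above. The `toolUrl` and display text here are made up for illustration; `{fileId}` and `{siteUrl}` are the reserved words Dataverse substitutes at click time, and tools in that version operated at the file level rather than the dataset level.

```python
import json

# Hypothetical manifest for registering Binder as a Dataverse external
# tool; an admin would POST this JSON to /api/admin/externalTools.
# The toolUrl is an assumption, not a real Binder endpoint.
binder_tool = {
    "displayName": "Binder",
    "description": "Launch an interactive compute environment.",
    "type": "explore",
    "toolUrl": "https://mybinder.example.org/launch",
    "toolParameters": {
        "queryParameters": [
            {"fileId": "{fileId}"},    # substituted with the file's database id
            {"siteUrl": "{siteUrl}"},  # substituted with the Dataverse base URL
        ]
    },
}

manifest_json = json.dumps(binder_tool, indent=2)
print(manifest_json)
```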
Do I download a zip of my GitHub repo and upload it to Dataverse?
This is somewhat beside the point but Dataverse is planning to allow code deposit from GitHub (similar to Zenodo) as part of #2739. Again, I hope it doesn't matter how the data got into Dataverse. It would be great if the "Launch Binder" button could be enabled for datasets that were deposited and published long ago.
@aculich thanks for tweaking the permissions on that Google doc. I thought it was a good read!
Thanks @aculich, makes perfect sense.
@pdurbin I should probably defer to @aculich on this one, but here's my outsider perspective.
If you haven't played with BinderHub, it's quite cool and worth a look. Given a well-structured Git repository, they build and run a Docker image based on supported interactive environments (e.g., Jupyter, RStudio).
They have many examples, including:
https://github.com/minrk/ligo-binder
Just click "Launch in Binder" and you should be taken to an interactive Jupyter session via Binderhub.
The repos typically contain code/notebooks + environment (e.g., installed packages) + sometimes data to facilitate reuse and reproducibility. From my understanding, the original focus was on Git repos, but there are now several initiatives to integrate with research repositories. A basic use case would be a researcher with an existing Binder-formatted Git repo (code/notebook + environment + data) who wants to publish it to Dataverse. Once published, a reader can access the Dataverse entry (via DOI) and "Launch in Binder", or plug the DOI into the above search box, to get a full interactive environment to reproduce some aspect of the work. In this case, it's important to know how the "binder" repository is published to Dataverse (and can be resolved/executed by BinderHub). I expect #2739 is quite useful in this scenario.
No doubt there will be ways to launch some baseline Binder environments on historical entries as well, but publishing these sorts of composite research objects (code + data + environment) intended for re-execution in services like Binder is of interest to me for the WT project right now.
@pdurbin In some sense, Binder is a third "external tool" for Dataverse, but a fundamental difference from Data Explorer and TwoRavens is that it gives you both a tool for interacting with the data and a full compute environment separate from your laptop, whereas the other two apps are intended to run in-browser or as an application on a laptop that talks to the Dataverse data repository.
The important thing about Binder is that it enables what we call _Mobility of Compute_ and... we hope in the future... _Mobility of Data_.
What we'd like to see is data repositories become aware of the _Binder Pattern_ and provide a way for users to curate their data and software at deposit time, so that the repository can let an end user browse any data and launch a compute environment wherever it is suitable to do so. For example, if the Harvard Dataverse were to partner with Harvard's Research Computing group to provide compute resources, people could run their compute environments directly on Harvard's infrastructure, _and_ also have the option of using third-party infrastructure to run the same environment. If that option were chosen, the compute environment would launch at the third-party location, and all the data from that collection would be copied to the remote compute environment (e.g. via Globus).
So that would allow people to work with much larger datasets than is possible on their laptops.
And then another step, which @craig-willis points to in the example above, is that in addition to being able to "launch binder" directly from a data repo like Dataverse itself, we'd also like to be able to drop a DOI into any compute environment that supports the _Binder Pattern_ and have it figure out where the data lives and let the user choose where to run the compute environment, whether that's on the data repo itself (if that capability is provided there) or by transferring the data to a third party hosting the compute environment.
As far as historical entries: yes, it would be possible to launch a Binder for existing data/code. However, the code may not run correctly if the software requirements were not captured. For environments that use a common set of software packages this could be solved trivially, and for more complex situations we could work something out on a case-by-case basis (especially if the person who wants the data is willing to put in the sweat-and-tears labor to get it working and contribute a software-requirements update back to the repository owner).
@aculich thanks! I'm chatting with @craig-willis at https://gitter.im/jupyterhub/binder?at=5b7ae5fccda86f5fb28091fb if you're interested.
I jumped in there to ask if "Binder Pattern" was a thing and it sounds like the idea is to follow https://mybinder.readthedocs.io/en/latest/using.html#preparing-a-repository-for-binder (which @craig-willis linked to already above as a "well-structured Git repository").
@craig-willis and I will talk more at https://wholetale.org/2018/06/26/working-group-workshop.html I'm sure. If others reading this will be there (September 13-14 in Chicago), please let me know!
@aculich Your comments about having Binder figure out where data files are currently available make me think that the locality APIs will need to be user-facing (at least the read portions). Do you have pointers to what Binder expects those APIs to look like, and what protocols it supports for storage and transfer?
@pdurbin @craig-willis @choldgraf there is (as far as I am aware) no official "Binder Pattern", however what I meant by using that term generically is two things: 1) that people (researchers in particular) with code & data should explicitly declare their dependencies according to the Using Binder docs: https://mybinder.readthedocs.io/en/latest/using.html#preparing-a-repository-for-binder
and that 2) infrastructure providers (e.g. data repositories and research computing providers) should implement BinderHub https://binderhub.readthedocs.io/en/latest/
and there are still a few key missing pieces, such as a federation interface and an extended set of instructions for how to prepare a binder that includes data as a component.
I will continue the conversation with y'all over on Gitter, but wanted to leave a note here that "Binder Pattern" is just a term that I made up and isn't (yet?) in official usage.
@pameyer at the moment there is _no expectation_ from Binder, in fact, it is entirely ignorant of it! So here we have a chance to define what that ought to be to make Binder aware of the universe of DOIs and related repositories, as well as how to make repositories aware of Binder.
@aculich I would love to set up a meeting to talk more about this. We have a newly funded Sloan project to develop a journal replication environment, and we should coordinate the interaction with Dataverse. This is something the new Global Dataverse Community Consortium would be interested in too.
@aculich in case you aren't aware @jonc1438 is a great person to know because he's been running Dataverse for almost as long as Harvard at UNC ( https://dataverse.unc.edu ), has been a strong advocate for Dataverse for many years, is the primary contact for the Global Dataverse Community Consortium ( https://dataverse.org/blog/global-dataverse-community-consortium-announcement ), and is an awesome guy.
@jonc1438 I'd love to hear more about your new project!
@pdurbin indeed, I do know him (we originally met at the 2017 Binder workshop at UC Davis last October) and I absolutely agree that @jonc1438 is an awesome guy... in fact, as you were writing here on GitHub your ears must have been burning, since he and I were on a Zoom call saying the same thing about you! :) Among other things we were discussing Confirmable Reproducible Research (CoRe2) Environment: Linking Tools to Promote Computational Reproducibility (congrats, @jonc1438!) and how we may continue to align with efforts in the Jupyter/Binderhub ecosystem, with @craig-willis and WholeTale, etc. Looking forward to coordinating with you all to help these tools co-evolve.
I see that @jonc1438 already weighed in on https://github.com/whole-tale/whole-tale/issues/43 but heads up to @aculich and others interested in reproducible research publishing and interoperability that @whole-tale is hosting a meeting on Wednesday October 24th, organized by @craig-willis (thanks!). Details are in that issue.
We had a second meeting today (I'm not sure what to call the group but see https://github.com/whole-tale/whole-tale/issues/50 ... from my perspective these are "computation" meetings but "reproducibility infrastructure" was suggested) and @craig-willis demo'ed https://github.com/craig-willis/dataverse-binder where if you click a link you're taken to https://mybinder.org/v2/gh/craig-willis/dataverse-binder/master?urlpath=%2Fnotebooks%2Fdataverse.ipynb%3FfileId%3D2865473%26siteUrl%3Dhttps%3A%2F%2Fdataverse.harvard.edu to spin up a Jupyter notebook based on data in Harvard Dataverse. Neat!
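For anyone curious how that long URL is put together, it's just the standard mybinder.org GitHub launch pattern with a percent-encoded `urlpath`; a quick sketch:

```python
from urllib.parse import quote

def binder_launch_url(user, repo, ref, urlpath):
    """Build a mybinder.org launch URL for a GitHub repo, percent-encoding
    the urlpath (which can itself carry query parameters for the notebook)."""
    return (f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}"
            f"?urlpath={quote(urlpath, safe='')}")

# Reconstructs the exact link from @craig-willis's demo above
url = binder_launch_url(
    "craig-willis", "dataverse-binder", "master",
    "/notebooks/dataverse.ipynb?fileId=2865473&siteUrl=https://dataverse.harvard.edu")
```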
Thanks, @pdurbin. We can also use this toy example to highlight two different general use cases reflected in the discussion in this issue.
The first is the ability to do ad-hoc/exploratory analysis on any data stored in Dataverse. This external tool/Binder example or the similar external tool/WT example gets at that -- they provide a generic Jupyter/Rstudio environment that could be used to do ad-hoc analysis on any dataset stored in Dataverse.
The second is the ability to re-run a specific analysis published in Dataverse, such as one of the AJPS exemplars listed in https://github.com/whole-tale/whole-tale/issues/49. From my understanding, one vision of the CoRe2 implementation is to put a Dockerfile in the replication study dataset at the end of the verification process. In this case, something like Binder/WT or related systems could build and run the specific environment created by the researcher to reproduce the study.
The initial WT dataverse integration (https://github.com/whole-tale/girder_wholetale/pull/180) addresses the first case. Our next step will be to support the second. If the dataset contains a Dockerfile or other Binder/repo2docker-compatible configuration, we'll not only mount the dataset, but also build the environment to allow you to re-run the analysis. Ideally, the work done in both WT and CoRe2 will contribute to this broader "Binderverse" vision.
p.s. -- this also relates to your question to @davclark in the meeting notes about how Gigantum integrates with Dataverse. If I'm understanding, they publish a zipfile of a GitHub repo. This could easily be done with Binder repos or WT tales as well, but doesn't seem like the ideal approach from the repository perspective. Curious to hear your thoughts on the approach of publishing a zip that contains code/data/environment description.
From my understanding, one vision of the CoRe2 implementation is to put a Dockerfile in the replication study dataset at the end of the verification process.
@craig-willis I don't know enough about CoRe2 to comment on what technologies will be used but I'd be remiss if I didn't point out that Code Ocean is highly oriented toward producing a Dockerfile for datasets via their graphical interface. As I mentioned in my brain dump at https://github.com/IQSS/dataverse/issues/5028#issuecomment-433185375 I was able to try this myself during a Code Ocean workshop and I was very pleased that at the end I was able to download a Dockerfile from Code Ocean rather than an image. That is to say, I'm much more interested in the ability to reproduce the image from a Dockerfile than the image itself. It's smaller and you get the recipe. I guess this relates to your p.s. in the sense that I think small is beautiful. It seems a bit wasteful of disk space for data repositories to store Ubuntu over and over in the form of Docker images. I don't know. I'll defer to the Dataverse community on what they want. The software is for them, not me.
@pdurbin Good point of clarification -- I wasn't suggesting that the image was or would be stored in Dataverse. In the CodeOcean case, I believe they are running their own Docker registry which is used to preserve the images today. For WT, DataONE has asked that we not push images for now, only the description of the environment via Dockerfile, etc.
I just learned something new about Dataverse -- I didn't realize that Zip archives were automatically expanded during submission. I thought that it would be possible to upload a Zipfile containing, for example, the contents of https://github.com/craig-willis/dataverse-binder. But I now see that I was wrong -- https://dev1.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/QR840C&version=DRAFT -- and in a good way.
@craig-willis yeah, the workaround is to "double zip" to prevent the expansion of files (and flattening into a single directory), as mentioned in #3439. A lot of people are surprised by this behavior. So please feel free to double zip. There's also a concept of a "package file", which you can read about at http://guides.dataverse.org/en/4.9.4/user/find-use-data.html#downloading-a-dataverse-package-via-rsync
Phil
Yes you are correct about CodeOcean and we have been in conversations with them
We hope Core2 can provide a very similar yet open sourced option
I do not imagine we will make our Binder implementation as slick or nice as CodeOcean's given the resource differences
I would like to imagine a world where several outside tools could be used for verification of datasets in Dataverse
What we need is a smooth path out of Dataverse and the ability of Dataverse to recognize the Dockerfile that is put back for preservation
I talked to CodeOcean and they want to coordinate a plan
One thing I also need to build with Core2 is a communications workflow for authors, editors, and verifiers
We are thinking about automating some GitLab requests but I am really not sure what to use yet
Sorry, I have been traveling in Brazil, Africa, and now China and it has kept me behind in this discussion
Does this make sense?
Jon
Heh, yes, it does make sense @jonc1438. Thanks! We've been following your travel via tweets like https://twitter.com/Odum_Institute/status/1068161451670626309
You might want to check in with @tlchristian and @donsizemore who were on the call on Wednesday ( https://github.com/whole-tale/whole-tale/issues/50 ). I attempted to give a demo of clicking a "Whole Tale" button (right under "Data Explorer") and being sent to an installation of Whole Tale with Jupyter notebooks and RStudio and other tools. I wasn't logged in to Whole Tale at the time so it wasn't the best demo, but I just tried again from https://dev1.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/EPNUKP&version=1.0 and it worked better this time. Here are some screenshots. Great stuff! Thank you @craig-willis and @Xarthisius! These screenshots probably belong in #5097 too.






Yes cool work
Craig and I talked last week and I saw it
We for sure need to get everyone on the same path and decide how to get things out of Dataverse and what the UI on Dataverse needs to look like
If we get WT, Core2, and CodeOcean on the same path it would be great
We also need to include your new Encapsulator work along with Dockerfiles as recognized file types in Dataverse
Jon
how to get things out of Dataverse
Well, here's the code @Xarthisius just wrote for Whole Tale in case it helps anyone else (this is the Binder issue, after all): https://github.com/whole-tale/girder_wholetale/pull/175 . He had a question or two in IRC but mostly he just looked at the Dataverse API Guide.
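For context on how clients like that get files out of Dataverse: the native API returns dataset metadata as JSON that includes each file's database id, and file contents are then fetched via the access API. A hedged sketch (the sample JSON is trimmed to just the fields used, and the filename is illustrative):

```python
def file_download_urls(site_url, dataset_json):
    """Given a dataset's native-API JSON, build a download URL for each
    file via the access API (/api/access/datafile/{id})."""
    files = dataset_json["data"]["latestVersion"]["files"]
    return {f["dataFile"]["filename"]:
            f"{site_url}/api/access/datafile/{f['dataFile']['id']}"
            for f in files}

# Trimmed example of the shape returned by
# GET /api/datasets/:persistentId/?persistentId=doi:...
sample = {"data": {"latestVersion": {"files": [
    {"dataFile": {"id": 2865473, "filename": "dataverse.ipynb"}},
]}}}
```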
I'm still a little fuzzy on what Encapsulator is but I know there's code at https://github.com/ProvTools/encapsulator and http://provtools.org/analyzes/ says, "Create capsules of analyses that contain the entire environment needed to reproduce a computational process". Probably a discussion for a different issue. Or a future call. I don't know.
Phil
I will take a look when I get back but I was thinking more on conceptual level not programming
Things like how users choose which code in the dataset to use, with what data, and which manuscripts they match
If the Dockerfile exists it is easy, but often datasets have many files and no way to automatically know what goes in the container
I think we need to understand the use case for all the approaches so we can plan the path forward
Cool code is great and needs to happen but we want to provide adoptable solutions that all groups can use
That planning before we all sink programming time into individual solutions I think is needed
Might also allow us to leverage others code better
Jon
how do users choose which code to use in the dataset
Yes. I hear you. The Code Ocean perspective on this seems to be "Create a master script to execute your code." In the workshop I attended, the script was called run.sh. Here's the concept from the slides ( http://bit.ly/harvard-oa-week ):

Here's how "run.sh" looked in a Code Ocean capsule (see https://bit.ly/2NanJLc ):

If you're logged in to Code Ocean you can click "Run" and get the output, the figures created from the code and the data, like this:

Anyway, I hope this helps.
One thing that I haven't seen (or don't remember seeing) is a standardized way of indicating inputs, processing code, outputs, and evaluation/comparison code. I don't know if this is something that fits into stock dockerfiles (or other containerization specifications).
It would seem to make sense to me if all the groups working in this area to agree on a convention for this - although given that I'm not directly involved, it's very possible that a convention or standard already exists.
I have not seen it either and that is what I think we need for Dataverse
Jon
Phil
Yep I have seen that too
How many of these scripts have you seen in Dataverse?
How many researchers know how to write them?
If we each have our own "script language" it will complicate things
I was imagining a button on the Dataverse dataset page that said something like
"Run replication environment" or "Open capsule" or "Open computation environment" etc
When clicked, it would allow the user to select which code to use, which data to use it with, etc
The user could also select the environment to use
Core2 wt CodeOcean etc
The output could be a script to run or push to the environment
At end of process the environment saves the dockerfile/script/capsule/etc. back to original Dataverse Dataset
On second click of button User allowed to choose previous docketfile etc
The code for these might be easy
Agreeing on workflow together is needed though
I imagine most of this outside Dataverse too
Use API just like WT is doing
These are just rough thoughts on my iPhone while I can not sleep due to time difference in Shanghai
More formal planning is needed as group I think
Jon
From: Philip Durbin notifications@github.com
Sent: Saturday, December 1, 2018 5:28 AM
To: IQSS/dataverse
Cc: Crabtree, Jonathan David; Mention
Subject: Re: [IQSS/dataverse] Binderverse: integrating Binderhub with Dataverse (using docker+kubernetes) (#4714)
how do users choose which code to use in the dataset
Yes. I hear you. The Code Ocean perspective on this seems to be "Create a master script to execute your code." In the workshop I attended, the script was called run.sh. Here's the concept from the slides ( http://bit.ly/harvard-oa-week ):
[screen shot 2018-11-30 at 4 20 49 pm]https://user-images.githubusercontent.com/21006/49315648-704e2a80-f4bc-11e8-9080-aca8fcb8e84c.png
Here's how "run.sh" looked in a Code Ocean capsule (see https://bit.ly/2NanJLc ):
[screen shot 2018-11-30 at 4 23 22 pm]https://user-images.githubusercontent.com/21006/49315649-70e6c100-f4bc-11e8-9c04-9034186e1571.png
If you're logged in to Code Ocean you can click "Run" and get the output, the figures created from the code and the data, like this:
[screen shot 2018-11-30 at 4 26 55 pm]https://user-images.githubusercontent.com/21006/49315761-daff6600-f4bc-11e8-8084-b4af152d742e.png
Anyway, I hope this helps.
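For concreteness, here is what such a "master script" might look like. This is a minimal sketch with hypothetical step names (the echo placeholders stand in for real commands); it is not the run.sh from the capsule shown above:

```shell
#!/usr/bin/env bash
# Sketch of a Code Ocean-style "master script" (run.sh), assuming a capsule
# whose analysis is split into three hypothetical steps.
set -e  # abort on the first failing step

run_all() {
    echo "step 1: preparing data"                 # e.g. python prepare_data.py
    echo "step 2: running the analysis"           # e.g. python analysis.py
    echo "step 3: writing figures to ../results"  # e.g. python make_figures.py
}

run_all
```

The point of the convention is simply that there is one agreed-upon entry point, so the platform never has to guess which script to execute.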
p.s. -- this also relates to your question to @davclark in the meeting notes about how Gigantum integrates with Dataverse. If I'm understanding, they publish a zipfile of a GitHub repo. This could easily be done with Binder repos or WT tales as well, but doesn't seem like the ideal approach from the repository perspective. Curious to hear your thoughts on the approach of publishing a zip that contains code/data/environment description.
A little late to this very active party... but I'll point out that the git repo is fairly intrinsic to the Gigantum project - in this sense, it's difficult to package something safely for a repository. Zip files make sense to me, as it's a highly standardized format that preserves structure and is easy enough to get a single file out of.
Is there any easy way to put a git repository on Dataverse such that you'd trust it'd come back to you well-formed?
My suspicion is that it'd be easier to simply allow Dataverse to peer into zip archives. But honestly, I don't know and would love to hear what folks think about storing things like git repos on Dataverse.
@davclark if you'd love to hear what people think about git repos in Dataverse, you're in luck because we're discussing it tomorrow (Tuesday) during our community call if you'd like to join us! Please see https://dataverse.org/community-calls for the call-in information. It'll be noon, Boston time. The "Code Deposit - Github Integration" issue we're discussing is #2739.
Oy - I'm not currently able to keep up well with GitHub notifications, so I'm thankful that there are good notes posted on the community call page!
@aculich Happy New Year! What's the status of this issue, please? Are you blocked? Do you need anything? Thanks!
Yesterday I stumbled upon this awesome drawing about Jupyter Notebooks and reproducibility:

It's apparently from https://opendreamkit.org/2017/11/02/use-case-publishing-reproducible-notebooks/ and some day it would be neat to see Dataverse in there as a repository.
I love that drawing; it was clever of them to pay somebody to do this... we should totally do this with the Jupyter project too :-)
+1 on adding any kind of Jupyter integration! Through a notebook with limited functionality in a virtual environment, this could even run on sensitive data. But in general, it would make understanding the data and subsequent replication so much easier.
@RightInTwo your best bet for launching Jupyter notebooks from Dataverse at the moment is the new Whole Tale integration. You can learn how to set it up at http://guides.dataverse.org/en/4.11/installation/external-tools.html
I posted some screenshots when this was being developed above at https://github.com/IQSS/dataverse/issues/4714#issuecomment-443334208
I just noticed that Binder/BinderHub/repo2docker has a newish "content provider" framework: https://github.com/jupyter/repo2docker/tree/0.8.0/repo2docker/contentproviders
My understanding is that the default provider is git.
Other providers such as Dataverse could probably be added?
Whole Tale has already implemented "download files via a dataset's DOI or Handle" in Python at https://github.com/whole-tale/girder_wholetale/pull/175 so I would think it would be a good starting point for anyone who's interested in picking this up.
+1 on adding any kind of jupyter integration!
@RightInTwo (and everyone) as I just mentioned at http://irclog.iq.harvard.edu/dataverse/2019-04-08#i_90314 at the moment you can play around with launching Jupyter notebooks and RStudio by clicking "Explore" and then "Whole Tale" over at https://dev2.dataverse.org/file.xhtml?fileId=29

This is a bit beta. We are working through some issues. More screenshots at https://github.com/IQSS/dataverse-ansible/issues/59 (we've resolved some of the errors mentioned there).
I just noticed that Binder/BinderHub/repo2docker has a newish "content provider" framework: https://github.com/jupyter/repo2docker/tree/0.8.0/repo2docker/contentproviders
My understanding is that the default provider is git.
Other providers such as Dataverse could probably be added?
Thanks to efforts by @choldgraf and @betatim in https://github.com/jupyter/repo2docker/pull/692, the repo2docker ContentProviders have now been documented at https://repo2docker.readthedocs.io/en/latest/architecture.html#contentproviders
Here's a screenshot of how the docs look:

repo2docker is written in Python and I am happy to help mentor any developer who is interested in contributing a dataverse.py ContentProvider next to the Zenodo ContentProvider that was recently added at https://github.com/jupyter/repo2docker/blob/dce6c1e8d731b4846d722ef0701745321c3a694c/repo2docker/contentproviders/zenodo.py
I suspect that this will mostly be a copy-and-paste effort from the Python code Whole Tale wrote at https://github.com/whole-tale/girder_wholetale/pull/175 (I wrote something similar at https://github.com/SwissDataScienceCenter/renku-python/issues/536 ). Alternatively, the shiny new pyDataverse package could be used instead; it's available at https://pypi.org/project/pyDataverse/ from pip, and yesterday I started using it at https://github.com/IQSS/dataverse-sample-data
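To make the shape of that work concrete, here is a standalone sketch of what a Dataverse ContentProvider might look like. The detect/fetch method names mirror the repo2docker ContentProvider interface documented above; the regex, class name, and everything else are my own hypothetical illustration, not the code that was eventually merged:

```python
import re

class DataverseContentProviderSketch:
    """Hypothetical sketch; method names mirror repo2docker's ContentProvider."""

    def detect(self, spec, ref=None):
        # Return a "record" dict if spec looks like a DOI, else None so that
        # repo2docker falls through to the next provider (git is the default).
        m = re.match(r"(?:doi:|https?://doi\.org/)?(10\.\d+/\S+)", spec)
        return {"doi": m.group(1)} if m else None

    def fetch(self, record, output_dir):
        # A real implementation would (1) resolve the DOI to a Dataverse
        # installation, (2) list the dataset's files via
        # /api/datasets/:persistentId/?persistentId=doi:<doi>, and
        # (3) download each via /api/access/datafile/<id> into output_dir,
        # yielding log lines along the way -- much as the Whole Tale code does.
        yield f"Fetching doi:{record['doi']} into {output_dir}\n"
```

The Zenodo provider linked above follows this same detect/fetch pattern, which is why a copy-and-paste starting point seems plausible.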
I'll add the "Mentor: pdurbin" label to this issue.
If you're interested in launching Jupyter Notebooks from Dataverse and think you can help contribute some code, please comment below or find me at http://chat.dataverse.org ! Thanks!
@choldgraf did you see https://github.com/jupyter/repo2docker/pull/739 by @Xarthisius ?!?!
This is the backend support we need on the repo2docker/binder side to eventually add a "Launch Binder" button on datasets in Dataverse, right?
I'd love to chat with you about how best to write up Binder as an "external tool" to be listed on a future version of http://guides.dataverse.org/en/4.15.1/installation/external-tools.html#inventory-of-external-tools . Maybe we can chat in https://gitter.im/jupyterhub/binder or http://chat.dataverse.org sometime? Please let me know! Thanks!
very cool! I hadn't seen that (away at SciPy right now so haven't been checking issues etc closely). Know if anybody from the dataverse world is here?
@choldgraf well, from looking at https://www.scipy2019.scipy.org/speaker-directory
That was a quick scroll through so I may have missed some.
While I'm writing I'm going to ask you to help me figure out the best repo to have this conversation about mybinder URLs. Originally I was going to post this at https://github.com/jupyter/repo2docker/pull/739 but I'll just throw it in here for now.
@Xarthisius thank you so much for kicking off this effort!!
I'd love to chat with the Binder developers about how best to allow users to launch Binder from a Dataverse dataset.
Dataverse supports the concept of an "external tool" so that from a file, you can click "Explore" and then the name of the tool to launch that tool.
For example, a few weeks ago at the 5th annual Dataverse meeting people were very excited to see me click "Explore" and then "Whole Tale", which launched me into Whole Tale with the Jupyter Notebook and data from the dataset in Dataverse. I transcribed my talk and added screenshots of this demo here: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
An installation of Dataverse can add a similar button under "Explore" for Binder today like in the screenshot below from a test server:

In this case, Dataverse will open a new tab and send the user to this URL:
https://mybinder.org/v2/dataverse/?siteUrl=https://dev2.dataverse.org&datasetId=18&fileId=30
Below I'm including the definition of the external tool that I'm using. It's in JSON format and we call it a "manifest".
My questions:
Here's how I loaded the external tool for Binder in my installation of Dataverse:
curl http://localhost:8080/api/admin/externalTools -X POST --upload-file mybinder.json
Here's the contents of mybinder.json:
{
  "displayName": "MyBinder",
  "description": "Analyze in MyBinder",
  "type": "explore",
  "toolUrl": "https://mybinder.org/v2/dataverse/",
  "contentType": "application/x-ipynb+json",
  "toolParameters": {
    "queryParameters": [
      {
        "siteUrl": "{siteUrl}"
      },
      {
        "datasetId": "{datasetId}"
      },
      {
        "fileId": "{fileId}"
      }
    ]
  }
}
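To illustrate how the manifest drives the URL that Dataverse opens, here's a small sketch. This is my own approximation of the observable substitution behavior ({siteUrl} etc. are the manifest's reserved words), not Dataverse's actual implementation:

```python
# Sketch: expand an external-tool manifest into a launch URL like the one above.
manifest = {
    "toolUrl": "https://mybinder.org/v2/dataverse/",
    "toolParameters": {
        "queryParameters": [
            {"siteUrl": "{siteUrl}"},
            {"datasetId": "{datasetId}"},
            {"fileId": "{fileId}"},
        ]
    },
}

def launch_url(manifest, values):
    # Substitute each reserved word with the value for the current dataset/file.
    pairs = []
    for qp in manifest["toolParameters"]["queryParameters"]:
        for name, template in qp.items():
            pairs.append(f"{name}={template.format(**values)}")
    return manifest["toolUrl"] + "?" + "&".join(pairs)
```

For example, `launch_url(manifest, {"siteUrl": "https://dev2.dataverse.org", "datasetId": 18, "fileId": 30})` reproduces the URL shown earlier in this comment.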
Over at http://irclog.iq.harvard.edu/dataverse/2019-07-11#i_100587 @Xarthisius and I seem to be converging on https://github.com/jupyterhub/binderhub being the right place for the conversation above about mybinder.org URLs but I'm also conscious that at https://blog.jupyter.org/binder-with-zenodo-af68ed6648a6 you are directing people to open issues in https://github.com/jupyter/repo2docker . Please advise.
For URLs we should discuss that in an issue on https://github.com/jupyterhub/binderhub (or even better a PR in that repo that adds a RepoProvider for the dataverse provider).
My initial reaction is that we should keep URLs as uniform and "creatable by a human" as possible. So something like mybinder.org/v2/dataverse/<doi_here>. Ideally we would have /v2/doi/<doi_here> instead of having to make a new URL scheme for each "host". However I've given up on that for the near term as it seems tricky to do :-/
@pdurbin I think that tracking conversation about Binder integration would be best-done in the repo2docker repository just because more folks might have a chance to see it there. You can probably copy/paste a bunch of your questions from here and add them there (also fine to open up another issue there if the conversation will be more general than that PR specifically) :-)
My initial reaction is that we should keep URLs as uniform and "creatable by a human" as possible. So something like mybinder.org/v2/dataverse/<doi_here>. Ideally we would have /v2/doi/<doi_here> instead of having to make a new URL scheme for each "host". However I've given up on that for the near term as it seems tricky to do :-/
This is potentially something that @whole-tale could help with. We have an endpoint that takes a DOI and returns basic info about it:
$ curl -s -H 'Accept: application/json' 'https://data.wholetale.org/api/v1/repository/lookup?dataId=%5B%2210.7910%2FDVN%2F6ZXAGT%22%5D' | jq
[
  {
    "dataId": "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT",
    "doi": "10.7910/DVN/6ZXAGT",
    "name": "ArchaeoGLOBE Repository",
    "repository": "Dataverse",
    "size": 14157617
  }
]
$ curl -s -H 'Accept: application/json' 'https://data.wholetale.org/api/v1/repository/lookup?dataId=%5B%2210.5065%2FD6862DM8%22%5D' | jq
[
  {
    "dataId": "resource_map_doi:10.5065/D6862DM8",
    "doi": "10.5065/D6862DM8",
    "name": "Humans and Hydrology at High Latitudes: Water Use Information",
    "repository": "DataONE",
    "size": 28856295
  }
]
Pending our addition of Zenodo as a provider, it could serve as a DOI -> ContentProvider translator.
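As a sketch of how a client might construct a call to this lookup endpoint (my own illustration, not Whole Tale code): note that dataId is a URL-encoded JSON list of DOIs, which is why the query strings above look the way they do.

```python
import json
import urllib.parse

def lookup_url(doi):
    # The endpoint expects dataId to be a URL-encoded JSON list of DOIs.
    data_id = urllib.parse.quote(json.dumps([doi]), safe="")
    return "https://data.wholetale.org/api/v1/repository/lookup?dataId=" + data_id

# Fetching the result needs network access, e.g.:
# import urllib.request
# with urllib.request.urlopen(lookup_url("10.7910/DVN/6ZXAGT")) as resp:
#     info = json.load(resp)  # repository, name, size, as in the output above
```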
https://github.com/jupyter/repo2docker/pull/739 was merged a few hours ago! Thanks, @betatim, for merging!!! Thank you to @Xarthisius for creating the pull request and continuing to iterate on it, and to @nuest for helping!!!
For adding Dataverse to the Binder UI, I created https://github.com/jupyterhub/binderhub/issues/900
wahooo! that's fantastic!
This just in!! The indefatigable @Xarthisius just created a pull request called "Add Dataverse to UI" over at https://github.com/jupyterhub/binderhub/pull/969
Don't look now, but it's actually possible to launch a Binder from a Dataverse DOI if you use the "Zenodo" DOI provider. Lots of screenshots over at https://github.com/jupyterhub/binderhub/pull/969#issuecomment-549987945
@atrisovic is my witness.
In other news, this incredible article just hit my radar:
Make code accessible with these cloud services: Container platforms let researchers run each other's software and check the results: https://www.nature.com/articles/d41586-019-03366-x
Tools mentioned in the article include
@atrisovic and I had a great chat with the Binder team today. Notes at https://hackmd.io/u2ghJJUCRWK-zRidCFid_Q?view
I believe the video will be available at https://www.youtube.com/user/ipythondev/videos
Unfortunately, the video won't be available. See https://github.com/jupyterhub/team-compass/issues/241 . Oh well.
The big news is that one can now use https://mybinder.org with Dataverse DOIs! That's because https://github.com/jupyterhub/binderhub/pull/969 was merged yesterday!!
Lots of likes and retweets at https://twitter.com/philipdurbin/status/1204472858413371392

I just opened pull request #6453 to add Binder to the list of integrations in the Admin Guide.
But when is this issue finished? What is the definition of done? Originally @aculich seemed to want an external tool, a button called "Binder" that you can click from a dataset in Dataverse. As we discussed in tech hours yesterday, the Dataverse external tool framework doesn't currently support the URLs required by Binder. That is to say, Dataverse always wants to send DOIs in query parameters, and Binder wants DOIs (not Handles, by the way) in the path like this:
https://mybinder.org/v2/dataverse/10.7910/DVN/TJCLKP/
So maybe the next step is to make the Dataverse external tool framework more flexible about the URLs it can create so that datasets can have a Binder button.
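A minimal sketch of the two URL shapes in question (the helper names are hypothetical, for illustration only):

```python
def binder_url(doi):
    # What Binder wants: the DOI in the path.
    return f"https://mybinder.org/v2/dataverse/{doi}/"

def external_tool_url(site_url, dataset_id):
    # The kind of URL Dataverse's external tool framework can produce today:
    # everything goes into query parameters.
    return f"https://mybinder.org/v2/dataverse/?siteUrl={site_url}&datasetId={dataset_id}"
```

Closing that gap (letting the framework place values in the path) is exactly the flexibility being proposed.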
@atrisovic just opened #6807 to track improvements to Dataverse's external tool framework to allow for an Explore button for Binder. I'm closing this issue. Thanks, everyone! Great feedback on the Binder/Dataverse integration, the Binderverse, continues to come in. Here's a recent tweet: https://twitter.com/Duhem_/status/1245682563617820672
