Related: #2739
We're currently working on designing a workflow that will allow Dataverse users to connect their GitHub accounts using GitHub's webhooks, and then import a GitHub repo for deposit into Dataverse in the form of a dataset containing a .zip with all files from the repo. Users will be able to configure a "sync" whereby whenever a GitHub repo is updated in a specific way (probably when a release is minted), the dataset in Dataverse will be automatically updated.
This spike will help us learn more about what's possible when using GitHub webhooks with Dataverse.
Some questions we have that will help inform the design of this feature:
Cool! FWIW - if your workflow involves the existing workflow mechanism in Dataverse, you might want to start from #5048, which is winding its way through the process. It lets you send any Dataverse setting you want to the workflow and fixes some transaction issues (at least in local workflows; I'm not sure the problem existed for workflows that do callbacks). As an example, #5049 is a workflow that submits a zipped bag to the Digital Preservation Network, and it gets the hostname, port, etc. from Dataverse settings.
After doing some research, here are some answers to our initial questions:
Using GitHub's webhooks, what events could we set to trigger syncing with Dataverse? (E.g. releases, commits, pull requests)
When connecting your Dataverse account to your GitHub account, how does that initial connection get made?
How can we determine which GitHub repositories a user can select from in Dataverse? (E.g. can the Dataverse user select from a list of GitHub repos they own, or repos on which they have certain other permissions?)
Other interesting info I found:
- As far as I can tell, releases are not baked into Git itself; if we supported this, maybe we'd tap into archive?
Releases in GitHub are based on Git tags. See https://help.github.com/articles/about-releases/
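Since releases map to Git tags, the source archive for a tagged release can be fetched from a predictable URL. A minimal sketch of building that URL (the tag `v5.1.6` is just an example value, not necessarily a real Zelig release):

```python
def release_zip_url(owner: str, repo: str, tag: str) -> str:
    """Build the GitHub source-archive URL for a tagged release."""
    return f"https://github.com/{owner}/{repo}/archive/refs/tags/{tag}.zip"

print(release_zip_url("IQSS", "Zelig", "v5.1.6"))
# → https://github.com/IQSS/Zelig/archive/refs/tags/v5.1.6.zip
```

Downloading that URL would give the .zip of the repo at that tag, which is exactly the artifact we'd deposit into the dataset.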
I'll probably have a number of questions but one that's top of mind is this:
It looks like GitHub webhooks do not retry. Some remediation options I can see: provide a way through the UI for the user to publish a release from GitHub manually, poll at regular intervals on top of the webhook, or just poll. A good thing to discuss in our next group meeting.
If we decide to do polling, there are rate limits to take into account. These are per user, with a limit of 5,000 requests per hour for that user across all applications.
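If we poll, the `X-RateLimit-*` response headers tell us how much budget remains before we need to back off. A rough sketch of that check, using simulated header values rather than a live API call:

```python
def seconds_to_wait(headers: dict, now: float) -> float:
    """Return 0 if requests remain, else seconds until the rate-limit window resets."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset is a Unix timestamp for when the window resets.
    reset = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset - now)

# Simulated headers: 0 requests left, window resets 120 seconds from "now".
headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000120"}
print(seconds_to_wait(headers, now=1000000.0))  # → 120.0
```

A polling loop would sleep for whatever this returns before its next request, so one busy user can't burn their whole per-hour budget on our sync.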
It looks like deep in the GitHub UI there is a way to see whether your webhooks succeeded. It's not obvious to the user on GitHub; we can get the info via the API, but it looks to involve sifting through a lot of junk (link).
Thanks for the research, @matthew-a-dunlap! Later today I plan to really dive into the answers you gave and start developing new mockups. If I have any followup questions, I'll post them here!
@dlmurphy No problem! We _may_ want to have some discussion before we dig too deep, especially in light of @pdurbin 's question about retrying webhooks.
RE: Question 3, How can we determine which GitHub repositories a user can select from in Dataverse?
In addition to the info Matthew posted above, I want to include this info I found from one of the pages he linked:
The authenticated user has explicit permission to access repositories they own, repositories where they are a collaborator, and repositories that they can access through an organization membership.
Any dropdown or type-ahead selector we include can allow the Dataverse user to select any repos that fit those criteria.
A follow-up question for @matthew-a-dunlap or anyone else in the know, RE: Question 5, "What metadata can we get from the GitHub repo?"
Could you please list more specifically what metadata we can pull from a repo, or link to a page with that info? I'm having a hard time finding that info.
@dlmurphy Regarding Q5: This page has info on all the objects that can be queried via the API, and their attributes, including release, user, repository, and organization.
@dlmurphy if you look at the dataset in https://dataverse.harvard.edu/dataverse/open-source-at-harvard you can get a sense of the metadata you can get from GitHub for one of their repos. Here's a handy link to Data Explorer: https://scholarsportal.github.io/Dataverse-Data-Explorer/?fileId=3040230&siteUrl=https://dataverse.harvard.edu and to the JSON GitHub exposed for our "dataverse" repo back in July last year: https://github.com/pdurbin/open-source-at-harvard-primary-data/blob/master/2017-07-31/IQSS-dataverse.json
Thanks, guys. That answers that question pretty well! I'm happy with the answers we've gathered, but I want to let this issue simmer until we can go over these answers in a design team meeting, perhaps this Wednesday.
Following today's design meeting, we decided that we'd like the next step for this spike to include:
Creation of a working prototype that demonstrates a basic Dataverse/GitHub webhook connection that can pull a repo from GitHub and create a .zip of it in a Dataverse dataset.
Decisions on which metadata fields would be appropriate for Dataverse to use in a software metadata block, and then a mapping of which of those can be autopopulated from GitHub.
@dlmurphy - FYI, in standup today, there was some discussion regarding the prototype, mostly related to whether or not it includes a front end. Some folks may check in with you.
Just talked about this with the design team -- we don't need a UI for this prototype.
To be more specific, we're looking for a prototype that:
Can pull a repo from GitHub into Dataverse as a .zip file when a user manually requests it (pull a specific release, or, if there's no release, pull the latest commit).
Can pull a repo from GitHub into Dataverse as a .zip file via GitHub webhooks when a release is published. (How do you set up the webhook, and how do you maintain it? What can we do if the webhook fails?)
Can pull metadata from a GitHub repo into Dataverse (which specific fields don't really matter, just want to demonstrate that this is doable).
The prototype doesn't need a frontend.
@mheppler, please feel free to weigh in on this, you might have a better idea of what's helpful here.
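On setting up and maintaining the webhook: GitHub signs each delivery with the shared secret configured on the hook, so the receiving endpoint should verify the `X-Hub-Signature` header (HMAC-SHA1 of the raw payload) before acting on it. A sketch of that check; the secret and payload here are placeholders, not real values:

```python
import hashlib
import hmac

def signature_is_valid(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Verify GitHub's X-Hub-Signature header against the raw request body."""
    expected = "sha1=" + hmac.new(secret, payload, hashlib.sha1).hexdigest()
    # Constant-time comparison avoids leaking the expected digest via timing.
    return hmac.compare_digest(expected, signature_header)

# Placeholder secret and release-event payload, just to show the call shape.
secret = b"webhook-secret"
payload = b'{"action": "published", "release": {"tag_name": "v1.0"}}'
header = "sha1=" + hmac.new(secret, payload, hashlib.sha1).hexdigest()
print(signature_is_valid(secret, payload, header))  # → True
```

Rejecting deliveries that fail this check means a forged POST to the endpoint can't trigger an import.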
Yesterday I demo'ed some code I hacked together as of b9305c0 to @djbrooke @scolapasta @TaniaSchlatter @mheppler @dlmurphy @jggautier and @kcondon
All functionality is API only for now. There are two steps:
```
curl -H "X-Dataverse-key: $API_TOKEN" http://localhost:8080/api/datasets/31/github
```

which returns:

```
{"status":"OK","data":{"datasetId":31,"githubUrl":"https://github.com/IQSS/Zelig"}}
```

followed by:

```
curl -H "X-Dataverse-key: $API_TOKEN" -X POST http://localhost:8080/api/datasets/31/github/import
```

The result is that a file is created that looks like the screenshot below from https://dev1.dataverse.org/file.xhtml?persistentId=doi:10.5072/FK2/FS7M3O/EBNKNB

I had to leave to pick up my kids before any decisions were made about next steps.
Below is a more readable version of the output from https://api.github.com/repos/IQSS/Zelig that I shoved into the file description above. Please note that I believe this is only the tip of the iceberg in terms of metadata we could pull out of GitHub for a repo. The content is mostly URLs for pulling out additional information. The items I find interesting are:
{
"stargazers_count": 65,
"pushed_at": "2018-02-27T13:49:49Z",
"subscription_url": "https://api.github.com/repos/IQSS/Zelig/subscription",
"language": "R",
"branches_url": "https://api.github.com/repos/IQSS/Zelig/branches{/branch}",
"issue_comment_url": "https://api.github.com/repos/IQSS/Zelig/issues/comments{/number}",
"labels_url": "https://api.github.com/repos/IQSS/Zelig/labels{/name}",
"subscribers_url": "https://api.github.com/repos/IQSS/Zelig/subscribers",
"releases_url": "https://api.github.com/repos/IQSS/Zelig/releases{/id}",
"svn_url": "https://github.com/IQSS/Zelig",
"subscribers_count": 25,
"id": 14958190,
"forks": 32,
"archive_url": "https://api.github.com/repos/IQSS/Zelig/{archive_format}{/ref}",
"git_refs_url": "https://api.github.com/repos/IQSS/Zelig/git/refs{/sha}",
"forks_url": "https://api.github.com/repos/IQSS/Zelig/forks",
"statuses_url": "https://api.github.com/repos/IQSS/Zelig/statuses/{sha}",
"network_count": 32,
"ssh_url": "[email protected]:IQSS/Zelig.git",
"license": null,
"full_name": "IQSS/Zelig",
"size": 115034,
"languages_url": "https://api.github.com/repos/IQSS/Zelig/languages",
"html_url": "https://github.com/IQSS/Zelig",
"collaborators_url": "https://api.github.com/repos/IQSS/Zelig/collaborators{/collaborator}",
"clone_url": "https://github.com/IQSS/Zelig.git",
"name": "Zelig",
"pulls_url": "https://api.github.com/repos/IQSS/Zelig/pulls{/number}",
"default_branch": "master",
"hooks_url": "https://api.github.com/repos/IQSS/Zelig/hooks",
"trees_url": "https://api.github.com/repos/IQSS/Zelig/git/trees{/sha}",
"tags_url": "https://api.github.com/repos/IQSS/Zelig/tags",
"private": false,
"contributors_url": "https://api.github.com/repos/IQSS/Zelig/contributors",
"has_downloads": true,
"notifications_url": "https://api.github.com/repos/IQSS/Zelig/notifications{?since,all,participating}",
"open_issues_count": 26,
"description": "A statistical framework that serves as a common interface to a large range of models",
"created_at": "2013-12-05T15:57:10Z",
"watchers": 65,
"keys_url": "https://api.github.com/repos/IQSS/Zelig/keys{/key_id}",
"deployments_url": "https://api.github.com/repos/IQSS/Zelig/deployments",
"has_projects": true,
"archived": false,
"has_wiki": false,
"updated_at": "2018-10-30T16:47:25Z",
"comments_url": "https://api.github.com/repos/IQSS/Zelig/comments{/number}",
"stargazers_url": "https://api.github.com/repos/IQSS/Zelig/stargazers",
"git_url": "git://github.com/IQSS/Zelig.git",
"has_pages": true,
"owner": {
"gists_url": "https://api.github.com/users/IQSS/gists{/gist_id}",
"repos_url": "https://api.github.com/users/IQSS/repos",
"following_url": "https://api.github.com/users/IQSS/following{/other_user}",
"starred_url": "https://api.github.com/users/IQSS/starred{/owner}{/repo}",
"login": "IQSS",
"followers_url": "https://api.github.com/users/IQSS/followers",
"type": "Organization",
"url": "https://api.github.com/users/IQSS",
"subscriptions_url": "https://api.github.com/users/IQSS/subscriptions",
"received_events_url": "https://api.github.com/users/IQSS/received_events",
"avatar_url": "https://avatars2.githubusercontent.com/u/675237?v=4",
"events_url": "https://api.github.com/users/IQSS/events{/privacy}",
"html_url": "https://github.com/IQSS",
"site_admin": false,
"id": 675237,
"gravatar_id": "",
"node_id": "MDEyOk9yZ2FuaXphdGlvbjY3NTIzNw==",
"organizations_url": "https://api.github.com/users/IQSS/orgs"
},
"commits_url": "https://api.github.com/repos/IQSS/Zelig/commits{/sha}",
"compare_url": "https://api.github.com/repos/IQSS/Zelig/compare/{base}...{head}",
"git_commits_url": "https://api.github.com/repos/IQSS/Zelig/git/commits{/sha}",
"blobs_url": "https://api.github.com/repos/IQSS/Zelig/git/blobs{/sha}",
"git_tags_url": "https://api.github.com/repos/IQSS/Zelig/git/tags{/sha}",
"merges_url": "https://api.github.com/repos/IQSS/Zelig/merges",
"downloads_url": "https://api.github.com/repos/IQSS/Zelig/downloads",
"has_issues": true,
"url": "https://api.github.com/repos/IQSS/Zelig",
"contents_url": "https://api.github.com/repos/IQSS/Zelig/contents/{+path}",
"mirror_url": null,
"milestones_url": "https://api.github.com/repos/IQSS/Zelig/milestones{/number}",
"teams_url": "https://api.github.com/repos/IQSS/Zelig/teams",
"fork": false,
"issues_url": "https://api.github.com/repos/IQSS/Zelig/issues{/number}",
"events_url": "https://api.github.com/repos/IQSS/Zelig/events",
"issue_events_url": "https://api.github.com/repos/IQSS/Zelig/issues/events{/number}",
"organization": {
"gists_url": "https://api.github.com/users/IQSS/gists{/gist_id}",
"repos_url": "https://api.github.com/users/IQSS/repos",
"following_url": "https://api.github.com/users/IQSS/following{/other_user}",
"starred_url": "https://api.github.com/users/IQSS/starred{/owner}{/repo}",
"login": "IQSS",
"followers_url": "https://api.github.com/users/IQSS/followers",
"type": "Organization",
"url": "https://api.github.com/users/IQSS",
"subscriptions_url": "https://api.github.com/users/IQSS/subscriptions",
"received_events_url": "https://api.github.com/users/IQSS/received_events",
"avatar_url": "https://avatars2.githubusercontent.com/u/675237?v=4",
"events_url": "https://api.github.com/users/IQSS/events{/privacy}",
"html_url": "https://github.com/IQSS",
"site_admin": false,
"id": 675237,
"gravatar_id": "",
"node_id": "MDEyOk9yZ2FuaXphdGlvbjY3NTIzNw==",
"organizations_url": "https://api.github.com/users/IQSS/orgs"
},
"assignees_url": "https://api.github.com/repos/IQSS/Zelig/assignees{/user}",
"open_issues": 26,
"watchers_count": 65,
"node_id": "MDEwOlJlcG9zaXRvcnkxNDk1ODE5MA==",
"homepage": "http://zeligproject.org",
"forks_count": 32
}
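For the metadata-mapping part of the prototype, a handful of the fields above translate naturally into dataset metadata. A sketch of that cherry-picking step; the target field names on the Dataverse side (`dsDescription`, `softwareLanguage`, `alternativeURL`) are hypothetical, not an agreed-upon metadata block:

```python
# Subset of the /repos/IQSS/Zelig response shown above.
repo = {
    "name": "Zelig",
    "full_name": "IQSS/Zelig",
    "description": "A statistical framework that serves as a common interface to a large range of models",
    "language": "R",
    "homepage": "http://zeligproject.org",
    "html_url": "https://github.com/IQSS/Zelig",
}

def to_dataset_metadata(repo: dict) -> dict:
    """Map selected GitHub repo attributes onto (hypothetical) dataset fields."""
    return {
        "title": repo["name"],
        "dsDescription": repo.get("description") or "",
        "softwareLanguage": repo.get("language"),
        # Prefer the project homepage; fall back to the repo page itself.
        "alternativeURL": repo.get("homepage") or repo["html_url"],
    }

print(to_dataset_metadata(repo)["title"])  # → Zelig
```

The actual field list would come out of the "software metadata block" decision mentioned above; the point here is only that the mapping is a straightforward dictionary transform once those decisions are made.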
Upon discussing this in more detail this morning with @djbrooke @TaniaSchlatter @pdurbin , an update on expected results of this spike:
Also, scrap the "cherry pick and format the metadata" suggestion. That can be done when this full feature moves to development. We know that can be done and don't need a spike to prove it.
The short answer to the question above is "I don't know" because I struggle mightily with JSF. We can try to get both working so we have options. It will take time.
Meanwhile, below is my todo list from the dev perspective. This is the logical order in which to work on the code.
@pdurbin - Let's put the brakes on this for now (except #1 IMHO). We are verifying a proposed approach with @mercecrosas and I'd like to revisit the technical architecture after that. Let's pick this up when you're back next week. Apologies for the confusion.