dataverse 🚀 - Implement Backend Support for Make Data Count use and citation metrics

Feel free to ask any questions DataCite staff can help with here.

mfenner on 18 Oct 2018

🎉1

Here are some notes taken during a meeting on 2018-10-18: https://docs.google.com/document/d/1eM4rAuhmR4ZQxJC_PTE0rq2x7N3aNEjMN7QVvpkY1os/edit?usp=sharing

pdurbin on 30 Oct 2018

We'll determine how these metrics appear on the page as #3404 moves through our design process, but there's an opportunity to get the backend pieces in place. Some proposed steps for discussion and estimation:

Sending the logs to DataCite (and any processing we need to do beforehand)
Receiving data back from DataCite
Storing the data that's sent back
Getting the information onto the page (but not displaying it)

This will position us well for implementation once we have the designs further along and validated.

djbrooke on 31 Oct 2018

I met @mbjones at Whole Tale Workshop on Tools and Approaches for Publishing Reproducible Research and he mentioned he'd be happy to field technical questions we have about DataONE's implementation of Make Data Count.

Meanwhile, DataONE put out a blog post at https://www.dataone.org/news/new-usage-metrics that has some nice screenshots of a dataset at https://search.dataone.org/view/doi:10.5063/F1Z899CZ which I'll put below:

dataone_implements_new_usage_and_citation_metrics_to_make_your_data_count_dataone_-_2018-11-13_14 47 21

pdurbin on 13 Nov 2018

Happy to help, @pdurbin. The time series graphs you cited were made much faster by caching results locally and then enabling group by at various levels of aggregation. The d3-charts we build and other visualizations are all part of our open source MetacatUI data portal frontend, so you might find some of that reusable.

mbjones on 15 Nov 2018

👍1

@mbjones thanks. Is there any reusable Java we might interested in as well?

All, at standup today I said I was close to pushing some docs that capture my understanding of what we're trying to implement. These docs are in 4dd10bd but I'll add them as a screenshot below as well. I also stubbed out some API tests but nothing has been implemented yet. It's all just stubs. Feedback is welcome.

make_data_count_ _dataverse org_-_2018-11-19_16 31 47

pdurbin on 19 Nov 2018

Here's a to do list of tasks that are top of mind for me.

[x] Read https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md
[x] Read "COUNTER Code of Practice for Research Data": https://doi.org/10.7287/peerj.preprints.26505v1
[x] Decide if we will be parsing logs that we generate or extending our guestbook feature to record views as well. Or other approaches. Discuss. Updated: On 2018-11-27 we decided to parse logs rather that writing each view and download to the database but we didn't consider multiple Glassfish servers and may need to think some more about this.
[x] Decide how we will store the metrics in Dataverse. Which database tables? Should the JSON be cached? Update: on standup on 2018-12-03 I explained that the DataONE interface shows more than just a number for views, for example. It shows a time series chart of views per month. Is this what we want?
[x] Ask @mbjones is there is any re-usable Java code from DataONE's Make Data Count implementation. Could also ask in https://dataoneorg.slack.com which I joined. Update: No Java code. See below.
[x] Decide if we can make use of https://github.com/CDLUC3/counter-processor and follow up on decision at https://github.com/CDLUC3/Make-Data-Count/issues/99 . Update: Apache logs cannot simply be parsed as-is as explained in https://github.com/CDLUC3/counter-processor/issues/3 . Dataverse must emit logs in a particular format to make use of counter-processor. On 2018-11-27 we decided that we probably won't use counter-processor because it introduces a dependency and because it doesn't "just work" with our logs: https://github.com/CDLUC3/Make-Data-Count/issues/99#issuecomment-441779513
[x] Is there value in me (and/or others on the Dataverse team) joining https://www.rd-alliance.org/groups/data-usage-metrics-wg ? Is there a mailing list with public archives? The most recent item under "Recent Activity" is a blog post from June. Update: A "pdurbin" account applied for on 2018-12-03 and the response back was "Your request will be approved by RDA Secretariat and your account activated within 1 business day." 2018-12-07 update, still word on a "pdurbin" account and we were told in the meeting by Martin that there are no implementation details in there.
[x] Question: Can Dataverse express data citations? Can "Related Publications" be used? Update: yes, Dataverse can express citations but "Related Dataset" should be used. See 17cbf37
[x] If Dataverse can express data citations, can the DataCite hub receive them? In 4dd10bd I only talk about sending views/investigations and downloads/requests. Update: Yes, DataCite can receive data citations (make sense, I guess 😄 ). See 17cbf37 and discussion below.
[x] Is "DataCite hub" the right name for the service that Dataverse installations will be sending data to? Update: "DataCite hub" is what's shown at https://makedatacount.org/roadmap/ so we'll go with that.
[x] Make sure @mheppler @TaniaSchlatter @dlmurphy and @jggautier know that there is some potentially reusable front end code for #5253 from DataONE as @mbjones indicated above. A good starting point may be https://github.com/NCEAS/metacatui/issues/594 . Update: discussed at a standup before Thanksgiving.

I also wanted to note that I set up a Jenkins job to build the guides from the branch I'm using to http://guides.dataverse.org/en/4821-make-data-count/admin/make-data-count.html

I asked the Dataverse community for feedback at https://groups.google.com/d/msg/dataverse-community/rQWNllAyTu0/RMD0GEFzAgAJ

pdurbin on 20 Nov 2018

@pdurbin We didn't implement this in Java, so no Java code to share there. We have an index processor and metrics service in python if you have interest in that.

Also, reading your document, I see one little difference from the DataONE interpretation when defining Views and Downloads. Like us, you are using the terminology “Views” and “Downloads” over “Investigations” and “Requests”. So, we should be sure we are using those the same way. I think our implementation is Downloads == Requests, and Views = Investigations - Requests, whereas you seem to state that Views == Investigations. We made Views be the difference so that they were independent metrics -- Views basically represent how many times the landing page has been looked at or the metadata was accessed, whereas Downloads is how many times all or part of the data was accessed. Does that make sense to you?

mbjones on 20 Nov 2018

@mbjones I'm not sure what an index processor is but I know we both use Solr so I guess I'll take a link to that code as well as the metrics service if it's not too much trouble.

I understand what you're saying about the meaning of "views" but let me read the specs and talk to others on my team before I respond. I'm also curious what DASH does. We've always shown downloads in Dataverse but the idea of showing views and citations is new to us. Thanks for the feedback!

pdurbin on 20 Nov 2018

Sorry about the confusion over 'index processor' -- that is our component that takes our raw usage logs from apache and other sources and processes them to insert usage events into our ElasticSearch index, which we then use to send stats to DataCite. Its pretty well customized to DataONE so probably not a lot of general utility except as an example.

mbjones on 20 Nov 2018

👍1

Thanks @pdurbin - the process in the doc makes sense to me. I added a small comment/question and I'm interested in the thoughts from @mfenner and the rest of MDC team and also @scolapasta and others on the technical implementation.

@mbjones thanks for the feedback here as well!

djbrooke on 21 Nov 2018

@pdurbin @djbrooke in the list above, you asked:

If Dataverse can express data citations, can the DataCite hub receive them?

The answer is yes, but not the same way as usage metrics. DataCite already supports linking to publications in the DOI-related metadata that you submit with your DOIs using the <relatedIdentifier> element. See the DataCite EventData Guide. These publication linkages are parsed and added to the EventData source. So, the "hub" is used for reporting Investigations and Requests as counts, whereas every individual citation event is reported in the DOI metadata.

The only problem we have with this approach is that it is completely DOI-centric, and we have many data sets that are not identified with DOIs. I think you also have some with Handles, right? In any case, I'd love to have an API for collating citations for any identifier type, including Handles, ARKs, UUIDs, CURIEs, etc.

mbjones on 21 Nov 2018

Question: Can Dataverse express data citations? Can "Related Publications" be used?

Dataverse has a related dataset field (mentioned in https://github.com/IQSS/dataverse/issues/5277), although DataCite expects identifiers, and the related dataset field is a free text field. We expect "Related Publication" to be used mostly for text-based publications (and that's how it's mapped to the DDI exports).

If we use Related Publications for datasets as well:

The Related Dataset field becomes redundant
Does DataCite hub care which related thing is an article and which is a dataset? If it does, how will it or the Make Data Count infrastructure know, especially if the related identifier isn't a DOI and it can't count on the DataCite schema's resourceTypeGeneral attribute? Either way, if "Related Publications" fields were used, I think it's important that Dataverse knows what type of thing the related thing is (e.g. for it's own metadata mapping).

If Dataverse can express data citations, can the DataCite hub receive them? In 4dd10bd I only talk about sending views/investigations and downloads/requests. Ask @mfenner

Discussions about Dataverse sending <relatedIdentifier> metadata to DataCite are in https://github.com/IQSS/dataverse/issues/2917 and https://github.com/IQSS/dataverse/issues/2778.

(I think these important questions are more about Dataverse being able to contribute to the quality of Make Data Count's citation metrics, and less about implementing backend support for sending/receiving usage counts and receiving citation counts.)

jggautier on 21 Nov 2018

Before Thanksgiving I was telling @djbrooke that I hoped the "counter-processor" Python code from CDL would "just work" with our Apache logs, but it turns out that the logs much contain data repository-specific fields like title, publisher, author, etc. Example logs look like this on a single line:

2018-05-08T00:00:40-07:00 128.195.188.234 - - - http://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B doi:10.7280/D1H01B - - uci-google-search-appliance (Enterprise; T4-CGX5LF9EL8JCP; [email protected],[email protected],[email protected]) Mustard Removal Experiment at Bayview Slope UC Irvine grid.266093.8 Riley Pratt|Jessica Pratt|Jenny Talbot|Stephanie Kivlin|Margaret Royall-Reed|Steven D. Allison 2015-04-14T11:00:46Z 1 - https://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B 2015

And they look like this as key/value pairs:

event_time: '2018-05-08T00:00:40-07:00'
client_ip: 128.195.188.234
session_cookie_id: '-'
user_cookie_id: '-'
user_id: '-'
request_url: http://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B
identifier: doi:10.7280/D1H01B
filename: '-'
size: '-'
user-agent: uci-google-search-appliance (Enterprise; T4-CGX5LF9EL8JCP; [email protected],[email protected],[email protected])
title: Mustard Removal Experiment at Bayview Slope
publisher: UC Irvine
publisher_id: grid.266093.8
authors: Riley Pratt|Jessica Pratt|Jenny Talbot|Stephanie Kivlin|Margaret Royall-Reed|Steven
  D. Allison
publication_date: '2015-04-14T11:00:46Z'
version: '1'
other_id: '-'
target_url: https://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B
publication_year: '2015'

I opened https://github.com/CDLUC3/counter-processor/issues/3 to provide a brain dump of my thinking as a potential new user and left a new comment at https://github.com/CDLUC3/Make-Data-Count/issues/99#issuecomment-441779513 and another to do item above to capture what I said at standup this morning, that we need to decide on the approach we'd like to take. The assumption is that we'll be parsing logs in a specific format that we teach Dataverse to write. Another approach could be to extend our guestbook model of recording downloads to also record views in the database but I'm concerned about how much space that would take up.

By the way, thanks to all who have made comments above. I've read them but I'm a little focused on understanding counter-processor at the moment. 😄

pdurbin on 26 Nov 2018

👍1

@scolapasta @landreev @sekmiller @kcondon @mheppler @matthew-a-dunlap and I just had a nice discussion in tech hours about Make Data Count. Here's what I drew on the whiteboard and I'm sorry for the chicken scratch:

img_20181127_160337515

img_20181127_160330387

The most important thing to me was getting consensus on some decisions:

We won't log views in our database. Rather we will create a dedicated log file for views and downloads for Make Data Count. I'm saying views and downloads because I think it will be easier and cleaner to put both views and downloads in the log even though I'm sure we'll continue to track downloads in our database like we always have.
We don't like the idea of introducing a dependency on third party software like counter-processor so we don't plan to use it. We would rather have the Make Data Count feature "just work" when you install Dataverse. Long live the monolith. When creating the dedicated log above, however, we will consider using a format compatible with what counter-processor needs so we don't necessarily close the door to ever using that software. It seems useful and we're glad it's around.

Other items:

Views will only be counted after the Dataverse release with Make Data Count support has been deployed. If we someday want to try to mine Google Analytics or Piwik data for historical views, it will be a separate effort. Out of scope.
We should talk to @pameyer @akio-sone and @jonc1438 about how we don't expect to be able to count downloads from rsync or TRSA. We're only going to be able to count downloads if they pass through Glassfish. Out of scope.

Oh, I found a typo in the COUNTER Code of Practice for Research Data an emailed @mfenner about it. I would have made a pull request but it isn't on GitHub. 😄 I'm on page 8. Still reading.

pdurbin on 27 Nov 2018

I just had a nice chat with @pameyer after standup and I'll make it clear that downloads via rsync will not be reported. He reminded me that in his installation of Dataverse, the download count (part of the "metrics" block) is not present. (This is tied to the :DownloadMethods database setting.)

Pete also reminded me that direct access to data (bypassing Glassfish) is also available in Swift installations. I believe that in the screenshot below, the download count only reflects downloads via Glassfish, not via Swift.

screen shot 2018-11-28 at 12 10 29 pm

In short, I'll update the docs in my branch to indicate the downloads won't be counted for rsync or direct access via Swift.

pdurbin on 28 Nov 2018

What I understand from this is that Dataverse won’t be able to track download activity for files accessed via rsync or other methods that bypass Glassfish. This is fine for Pete/SBGrid, which isn’t concerned with tracking anyway, and doesn’t plan to display metrics. When datasets have files that can be accessed via rsync and http, only access via http (or via methods that go through Glassfish?) will be counted.

If so, this implies that we need to be careful about displaying metrics in scenarios where not all access activity is counted equally. The representation of the activity will need to communicate clearly which activity is being counted. This adds UI complexity - more to explain/display, and implies the system needs to track activity in a way that is granular enough that it can be displayed clearly. I don’t know if putting views and downloads together as @phildurbin describes will enable the capture of the information needed.

On Nov 28, 2018, at 12:18 PM, Philip Durbin notifications@github.com wrote:

I just had a nice chat with @pameyer after standup and I'll make it clear that downloads via rsync will not be reported. He reminded me that in his installation of Dataverse, the download count (part of the "metrics" block) is not present. (This is tied to the :DownloadMethods database setting.)

Pete also reminded me that direct access to data (bypassing Glassfish) is also available in Swift installations. I believe that in the screenshot below, the download count only reflects downloads via Glassfish, not via Swift.

In short, I'll update the docs in my branch to indicate the downloads won't be counted for rsync or direct access via Swift.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

TaniaSchlatter on 29 Nov 2018

I don’t know if putting views and downloads together as @pdurbin describes will enable the capture of the information needed.

@TaniaSchlatter sorry, I wasn't being clear. I'm really only talking about where to record each view or download in the database or one the filesystem. The decision was to use the filesystem in a dedicated log. From this log we will be aggregating views and downloads into reports (JSON format) that we send to the DataCite hub. The dedicated log will be rotated, deleted after a year or whatever, so they don't take up too much space on disk. What I'm trying to say is that you shouldn't have to worry to much about what goes in this log. I hope this makes sense. I'll swing by to make sure we're on the same page.

You're absolutely correct that we'll only be able to track downloads that are initiated through Dataverse/Glassfish. We won't be tracking rsync downloads. We won't be tracking direct downloads from Swift ("Cloud Storage Access" in the screenshot above). I talked to @jonc1438 in #5213 and it sound like TRSA downloads will be initiated through Dataverse so we should be able to track them once the TRSA pull request gets merged.

pdurbin on 29 Nov 2018

@pdurbin Any traction on the idea of an API for remote downloads to record downloads back to Dataverse?

kcondon on 29 Nov 2018

@kcondon I chatted with @pameyer about it and it doesn't sounds like he's personally interested in implementing it for his rsync server but we talked about how customer number two of all the rsync stuff might want to work on this. I wasn't planning on adding the API to the Dataverse side until we have someone who's interested in using it. It would basically involved parsing rsync logs, from what I understand. I guess the same would be true of Swift logs? I don't know.

pdurbin on 29 Nov 2018

After discussion with @mheppler and @jggautier and the comment from @mbjones above, I pushed 17cbf37ee to clarify that we plan to send citations to DataCite as part of the Make Data Count effort. It has been emphasized that citations are the most important thing and since we can express these in Dataverse under our "Related Dataset" field and DataCite is ready to receive them, we should try to send them. While were in this part of the code we should also endeavor to send citations for publications as well (details in #2917 and #2778).

pdurbin on 29 Nov 2018

What we think of as citations seems to be one of the three relationship types that the Event Data service is collecting, called "linking events". The other two relationship types are versioning (this dataset I'm depositing IsNewVersionOf/IsPreviousVersionOf another dataset) and granularity (this file is part of this dataset, which Dataverse already sends to DataCite).

Will the "citation counts" that Dataverse receives include all three types of relationships or only certain types? Can Dataverse determine which types of counts it displays? For example, when Dataverse reports "citation counts," it shouldn't include the number of links between a dataset and its files. Could Dataverse exclude that? (I tried seeing what Dash and DataONE do, but haven't found a dataset with a citation count, yet.)

Update: It looks like you can filter certain relationTypes (https://support.datacite.org/v1.1/docs/eventdata-query-api-guide#section-filtering-events-links-by-type), and it recommends certain types to exclude from a citation count.

jggautier on 1 Dec 2018

At standup yesterday I indicated that I've finished reading the "COUNTER Code of Practice for Research Data" which brings more questions to my mind. Below I'll put the latest to do list. At standup I talked through open items from the original to do list above at https://github.com/IQSS/dataverse/issues/4821#issuecomment-440327789 and I just updated that comment to indicate the latest status. I'm going to duplicate any open items in the list above so that we can have a single new list in this comment. Here goes.

Questions for Make Data Count:

[x] Does Dataverse really need to become a harvesting server for reports in SUSHI format? https://www.projectcounter.org/code-of-practice-rd-sections/5-delivery-reports/ says, "Reports MUST be available for harvesting via the SUSHI protocol within 1 month of the end of the reporting period."
[x] Does Dataverse really need to become a harvesting server for reports in TSV format? https://www.projectcounter.org/code-of-practice-rd-sections/5-delivery-reports/ says, "Tabular reports MUST be made available through a website."
[x] Why does the CoP refer to SUSHI (JSON) and TSV formats but the "getting started" guide links to DataONE examples in XML? I'm talking about https://releases.dataone.org/online/api-documentation-v2.0.1/design/UsageStatistics.html#statistics-service-usage and "formatType=DATA" vs "formatType=METADATA", for example.
[x] I've emailed Martin about two typos in the CoP and they've been fixed (thanks!) but what's the process for giving more extensive feedback on the CoP?
[x] How do you plan to measure non-HTTP downloads such as via rsync?
[x] What's the likelihood that there will be audits in the future? Is this something we should warn Dataverse installations about?
[x] Should Table 3.2 say "Master Dataset Report" for report name for consistency with Table 3.1?
[x] Should each Dataverse installation report as their "brand" such as "Harvard Dataverse" or "Scholars Portal" under "Created_By" in Table 3.2?
[x] In Table 3.3 I'm wondering if a "Data Repository" becomes just a "Repository" if it starts hosting software along side data (#2739).
[x] Is there a published JSON Schema for SUSHI?
[x] Do I need to read the whole SUSHI spec? (I already printed it out.)
[x] Table 4.2 talks about exception reporting. Is it possible to report multiple exceptions?
[x] In Table 4.3 does "M" stand for "mandatory" and does "O" stand for "optional"?
[x] Why does 6.1 talk about the processing of standard logs but counter-processor does not operate on standard logs? Please see https://github.com/CDLUC3/counter-processor/issues/3
[x] In 7.4 is it accurate to say that what the spec calls a "database" we would call "a Dataverse installation"?
[x] What about DOS attacks? Section 8.5 about limits says no limits.

Questions for Dataverse tech hours (or sooner):

[x] Should our database store metrics (views, downloads, citations) at the granularity of months for retrieval via API and to show in the GUI? DataOne shows a time series plot for each metric by month. DASH and Zenodo just show a total number. Zenodo shows different totals per version. Examples from each:
- DataOne: https://search.dataone.org/view/doi:10.5063/F1Z899CZ
- DASH: https://dash.ucmerced.edu/stash/dataset/doi:10.6071/M3RP49
- Zenodo: https://zenodo.org/record/1453307#.XAWHAidOnUI
[ ] What do others think of DataONE's "Views = Investigations - Requests" interpretation of the spec? Let's make sure we're super clear in our docs on our definitions of views, downloads, investigations, and requests.
[x] Are we interested in looking at DataONE's implementation? "our component that takes our raw usage logs from apache and other sources and processes them to insert usage events into our ElasticSearch index, which we then use to send stats to DataCite."

To do:

[ ] Read "DataONE Metrics Service" linked from https://makedatacount.org/roadmap/ as "Implementation guide by DataONE".
[ ] Read all three documents mentioned as "See examples of logging and log processing infrastructure at a large scale, check out documents from DataONE (Log Aggregation Overview, Event Logging and Reporting, DataONE Usage Statistics)" at https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md
[ ] In Datavese guides, consider using similar language about Make Data Count (and COUNTER) that Zenodo uses at https://help.zenodo.org/#statistics . See also their blog post: http://blog.zenodo.org/2018/07/18/2018-07-18-usage-statistics/

pdurbin on 4 Dec 2018

@pdurbin happy to help answer these questions, but would it make sense to break this done into several issues? Not one issue per item, but the list is so long that it might become difficult to track the responses.

mfenner on 5 Dec 2018

@mfenner hi! If @djbrooke hasn't emailed you and @dlowenberg already, he plans to do so soon. I'm sorry that I didn't have a lot of questions back during our meeting on 2018-10-18 (notes) but back then I hadn't watched two webinars, hadn't read the CoP, hadn't read the "getting started" guide. (I still haven't read the SUSHI spec and I suspect that I should.) Now that I'm feeling more up to speed, I think the next meeting will be more productive. You've seen that I now have a list of questions above. 😄 If it's easier for you and I to have a quick separate call, that's fine too. Please let me know what makes the most sense to you. Thanks!

pdurbin on 5 Dec 2018

A call makes a lof sense and can happen soon (schedule via email). Be aware that we don't have the answers to all your questions.

mfenner on 5 Dec 2018

Does Dataverse really need to become a harvesting server for reports in SUSHI format?

In the ideal world yes, but the MDC pilot partners are also not doing this yet. So nothing to worry about right now, but keep this in the back of your head.

mfenner on 5 Dec 2018

Does Dataverse really need to become a harvesting server for reports in TSV format?

Again, this can happen at some point in the future. DataCite will do a CSV conversion of the reports sent to us in JSON format.

mfenner on 5 Dec 2018

Why does the CoP refer to SUSHI (JSON) and TSV formats but the "getting started" guide links to DataONE examples in XML?

SUSHI reporting is in JSON and/or TSV. The XML is specific to DataONE, they can explain the reasoning behind it.

mfenner on 5 Dec 2018

I've emailed Martin about two typos in the CoP and they've been fixed (thanks!) but what's the process for giving more extensive feedback on the CoP?

We haven't sorted out the formal process since COUNTER officially took over maintenance of the Code of Practice a few months ago. It is a good question, we will get back to you about this.

mfenner on 5 Dec 2018

How do you plan to measure non-HTTP downloads such as via rsync?

The Code of Practice is really agnostic about the protocol, as long as you have log entries with timestamp and useragent information.

mfenner on 5 Dec 2018

What's the likelihood that there will be audits in the future? Is this something we should warn Dataverse installations about?

This is something that is central to COUNTER for journal articles, but nothing is planned yet for dataset usage stats. My guess is that we will not have that discussion until there is more uptake of the Code of Practice, and until we have figured out a way to do audits that are not too resource-intensive.

mfenner on 5 Dec 2018

@mfenner thanks for all the answers above! Sorry, but we have even more questions that we'd like to go over with you and @dlowenberg during the call that starts in 10 minutes. Here's the updated list:

https://docs.google.com/document/d/1MlJqQmPMUJyJn_fGMzmL2WjvcJQeu7146FfqFkzEAlg/edit?usp=sharing

pdurbin on 7 Dec 2018

We just had a meeting after the meeting and here are the notes: https://docs.google.com/document/d/16zURrRqNVdMQ3hQHc3MrcxNq7dRDQC3lSNWMX8t28WM/edit?usp=sharing

pdurbin on 7 Dec 2018

I have generated two stories to move us forward on the work supporting Make Data Count:

https://github.com/IQSS/dataverse/issues/5384 - Generate verbose data usage logs for processing and communication with Make Data Count
https://github.com/IQSS/dataverse/issues/5385 - Spike: Test Counter Processor for use by Dataverse in implementation of Make Data Count

Furthermore, there are some additional bits of investigation that can be done to support this work:

Get Make Data Count test accounts
Start trying to download information from DataCite / Make Data Count

matthew-a-dunlap on 7 Dec 2018

The architecture drawing on our whiteboard is so messy I decided to make a diagram of our current direction:

make-data-count

ac5e29d7b is the initial commit of this diagram but I'm sure we'll iterate on it.

Also, sent an email to DataCite report with a subject of "Make Data Count, SUSHI, dashboard, JSON Web tokens" to ask for the JSON Web Token we need to start sending SUSHI JSON to the test DataCite hub.

pdurbin on 21 Dec 2018

👍1

For what its worth, I tried switching over our publisher from "client-id" to "dataverse" when creating raw logs for processing by counter processor. The records were processed fine by counter-processor but rejected by Make Data Count. @mfenner Do you have any guidance on what we should put for our records?

Dataverse:
... "dataverse", "publisher-id": [{"type": "", "value": ""}], ...
error:
422 {'Date': 'Fri, 18 Jan 2019 22:03:20 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Status': '422 Unprocessable Entity', 'Cache-Control': 'no-cache', 'Vary': 'Accept-Encoding, Origin', 'Content-Encoding': 'gzip', 'X-Runtime': '0.045917', 'X-Credential-Username': 'datacite.harvard', 'X-Request-Id': 'd8e44670-f640-43ef-bb3b-8f6d7de23658', 'X-Powered-By': 'Phusion Passenger 6.0.0', 'Server': 'nginx/1.15.7 + Phusion Passenger 6.0.0'} application/json; charset=utf-8 {"errors":[[{"#/report-datasets/0/publisher-id/0/type":"The property '#/report-datasets/0/publisher-id/0/type' value \"\" did not match one of the following values: isni, orcid, grid, urn, client-id in schema 7757177d-ae02-5888-8cdf-d748b3fb8616#"}]]}

client-id
... "publisher": "client-id", "publisher-id": [{"type": "", "value": ""}], ...

matthew-a-dunlap on 18 Jan 2019

Hot off the PIDapalooza presses, I had a good conversation with @kjgarza from DataCite answering a few questions outstanding about our MDC integration.

publisher / publisher-id :
The client-id option for publisher is the datacite client id, which each installation should have (tho I'm not sure how we store this in dataverse). We should be able to pass the same publisher info for each dataset. That being said, this information is not actually used at this point (MDC gets the info out of its own system instead of trusting ours) so we could spoof this as well.

Updating sushi logs during the month
Currently MDC needs the information for all days that have passed in the month on each submission, even if you are submitting daily. In other words, on Jan 9th we have to pass info from Jan 1-9, and then on Jan 10th from Jan 1-10 . This is how counter-processor supports the calls.
The issues with MDC not updating the the log entry each day seems to be due to the how we are hacking counter-processor for testing. The first submission to MDC per month should be a POST, and each subsequent call should be a PUT. When we wipe out the state of Counter Processor to test it (likely) sees the log as a new one and does a POST which MDC takes but does not actually use to update. The sashimi readme has more info: https://github.com/datacite/sashimi/blob/master/README.md

Log Processing info
There is another available processor for raw logs, written by members of the DataCite team https://github.com/datacite/shiba-inu . It looks like we could pipe our logs into this system as well. I think Counter Processor is a better choice for our production flavor and requires less infrastructure.

matthew-a-dunlap on 24 Jan 2019

I just pushed 916bd87a7 to stub out a new "Dataset Metrics" heading in the User Guide (and various other reorg): http://guides.dataverse.org/en/4821-make-data-count/user/dataset-management.html#dataset-metrics

Here's a screenshot:

screen shot 2019-01-28 at 1 32 50 pm

I showed a draft to @dlmurphy and he and others are welcome to make improvements.

pdurbin on 28 Jan 2019

Latest todo list after some whiteboarding with @matthew-a-dunlap and @sekmiller

[x] ~~Test and fix publisher logging in Dataverse~~ (will not fix, unneeded)
[x] Confirm Handles appearing in sushi (required) and hub (optional)
- [ ] Convert all Handles to DOI in production. https://github.com/IQSS/dataverse/issues/4832
- [x] Write documentation for other installations around what functionality they'll get depending on their PID.
[x] ~~Fork Counter Processor~~ Code we wanted is part of main branch now
- [x] Size logging (just uncommenting code)
- [x] ~~Not having to create files for logging mid-month~~ Will not fix, should document for testing
[x] Investigate the PUT/POST request being sent by Counter Processor to hub
- [x] Can we delete a record?
  - Yes (e.x. curl -H "Authorization: Bearer $JSON_WEB_TOKEN" -X DELETE https://api.test.datacite.org/reports/0e6be407-411f-4764-acdd-53a9dffa4ff5)
    - [x] Document this
  - This can be used along with clearing out counter processor state to allow repeated tests of the same data
[x] Look at other log processor created by DataCite (Counter Processor alternative)
[x] Test logging
[x] Investigate MDC api so we can test better. Think about what will be needed by our partners as well.
[ ] hook up citations API, update view/downloads API to not require country
[ ] size db column, etc. (depends on https://github.com/CDLUC3/counter-processor/pull/5 or our fork)
[x] machine vs human db column, etc.
[ ] finalize API Guide (looks good to @matthew-a-dunlap )
[ ] finalize Admin/Installation Guide (looks good to @matthew-a-dunlap )
[ ] finalize User Guide (looks good to @matthew-a-dunlap )
[x] "Related Dataset" out of scope for citations, right? (i.e. citation counts won't include any "related dataset" metadata in datasets published in Dataverse repositories)

pdurbin on 28 Jan 2019

Note: this issue is blocked by https://github.com/IQSS/dataverse/issues/4832 or whatever new issue we find to capture the need to convert all PIDs in Harvard Dataverse to DOIs.

matthew-a-dunlap on 5 Feb 2019

I just moved pull request #5329 to code review. Kudos to @sekmiller and @matthew-a-dunlap for all the great work on it!

Questions for reviewers to ponder:

Are you happy with the names of the API endpoints and how they are "branded" with "makeDataCount"?
Are you happy with process for populating datasetsmetrics table, a cron job that calls main.py and then a curl to the new Dataverse API to parse the SUSHI file?
Are you happy with process for populating datasetexternalcitations table, a cron job that requires you to interate over each published dataset, calling curl for each one?
When building the GUI, do you think the new API endpoints are suitable to call into as-is or will we want different endpoints?
Should the API output for a given metric show the metric requested ("viewsTotalRegular":5)? Or should it just return the number (5)?

Documentation to review:

pdurbin on 12 Feb 2019

I noticed something on http://guides.dataverse.org/en/4821-make-data-count/developers/make-data-count.html that looks like someone meant to go back later and add more detail

Under "Testing Make Data Count and Dataverse":

"The first thing to fix is to clear two files from Counter Processor ..."

Was the idea to mention which two files?

dlmurphy on 14 Feb 2019

👍1

Issues found:
[x] 1. Missing single quote at end of command:
http://guides.dataverse.org/en/4821-make-data-count/admin/make-data-count.html

curl -X POST 'http://localhost:8080/api/admin/makeDataCount/:persistentId/addUsageMetricsFromSushiReport?reportOnDisk=/tmp/sushi_sample_logs.json

[x] 2, Should use actual report file name in example above,

curl -X POST 'http://localhost:8080/api/admin/makeDataCount/:persistentId/addUsageMetricsFromSushiReport?reportOnDisk=/tmp/make-data-count-report.json'

[x] 3. Clean up Counter Processor installation instructions to suggest installation directory to agree with admin/dev guide suggestion, and potentially correct geoip db location instructions, since both reference counter-processor-0.0.1 sub dir:
http://guides.dataverse.org/en/4821-make-data-count/installation/prerequisites.html
Change to the Counter Processor directory.
cd /home/counter/counter-processor-0.0.1

[x] 4. Following prereq instructions, pip3 was not installed, needed to be installed separately
http://guides.dataverse.org/en/4821-make-data-count/installation/prerequisites.html
Decision was this would work for many and an admin would figure it out if not.

[x] 5. Dataverse API to extract citation is not working, per Phil.

[x] 6. In some cases, ip addresses (aws) that cannot be resolved to a country by counter processor and by this site: https://www.ip2location.com/demo result in a blank country code in cp report. It appears this can happen when requests are made from the same machine such as on an aws box that has a private, non routable ip address, not the same as loopback 127.0.0.1.
Decision was we can ignore this.

[x] 7. Machine access stats are not imported into db from json report, when no country code present in cp json report.
Decision was we can ignore this.

[x] 8. Multiple views or downloads in a short time either via browser or curl get counted only 1ce, though 3 were performed. This includes total and unique.
This is due to the 30second double-click detection threshold. All were considered one click.

[x] 9. File downloads are counted both as dataset view and file downloads.
This is by design, as described in the spec:
The dataset (a collection of data published or curated by a single agent) is the content item for which we report usage in terms of investigations (i.e. how many times metadata are accessed) and requests (i.e. how many times data are retrieved, a subset of all investigations).

[x] 10. Multiple different file downloads from the same dataset results in correct total but a single unique download. This is because a file download, even if a different file, are all considered coming from the same dataset with respect to uniqueness.
Discussed with Matthew, as designed.

[x] 11. Multiple different file downloads from the same dataset results in a single file download uri in the json report file. It appears to be grabbing the first url and is not meaningful to us we can ignore it.

[x] 12. Downloading exported metadata from the ui (metadata tab on dataset page) is not logged as an event.

[x] 13. Fetching dataset export metadata via api does not log as an event, using any download metadata api.

[x] 14. Datacite accepts counts that have empty country-counts, dv db does not.
We've decided to record events in dv that do not have an identifiable country to be consistent with Datacite.

[x] 15. Download multiple files by checkbox when processed by cp throws error and does not complete. I'm told this path uses a different api:
processing sample_logs/counter_2019-02-27.log
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/peewee.py", line 2484, in execute_sql
cursor.execute(sql, params or ())
sqlite3.IntegrityError: metadataitem.identifier may not be NULL

[x] 16. Export metadata, either via UI or API logs events but does not result in counts in cp json report.

[x] 17. Export UI metadata, formats OAI_ORE and schema.org JASON-LD are not logged.

[x] 18. Download Dataset Metadata as Json native api endpoint is not logged.

[x] 19. Native API List files in a dataset endpoint does not log event.

Cannot upload json to hub, get expected to upload, but got code 500
@pdurbin @matthew-a-dunlap suggested you might help on this one?
Tried again today, still 500 on connect to datacite using secrets.yaml. Tried command line and env variable method too but fails silently. Here is cp trace:

Writing JSON report to /tmp/make-data-count-report.json
expected to upload, but got code 500
expected to upload, but got code 500
expected to upload, but got code 500
expected to upload, but got code 500
expected to upload, but got code 500
expected to upload, but got code 500
expected to upload, but got code 500
expected to upload, but got code 500
^CTraceback (most recent call last):
  File "main.py", line 45, in <module>
    upload.send_to_datacite()
  File "/home/counter/counter-processor-0.0.1/upload/upload.py", line 50, in send_to_datacite
    response = retry_if_500(method='post', url=my_url, data=data, headers=headers)
  File "/home/counter/counter-processor-0.0.1/upload/upload.py", line 33, in retry_if_500
    time.sleep(1)
KeyboardInterrupt

[x] 21. Calling export api using localhost logs twice, both as regular and as machine. If use dns name, logs correctly as only machine. This appears to be a bug in Counter Processor. Note: I did not see this behavior when calling download dataset metadata as json with localhost. Update: Cannot reproduce. Appears to work correctly.

[x] 22. Need cron jobs for operational config, with notification on error and steps to fix. This may be out of scope for this issue and need a separate ticket.
Update, opened as a separate issue: https://github.com/IQSS/dataverse.harvard.edu/issues/3

Left to test:
[x] -Check whether unpublished access, or draft access is counted
No on both
[x] -Check whether blank country sent to datacite fails entire report
yes, works and is country/present in report from datacite.
[x] -Check whether blank country counts toward total count in dv db metrics
no, they are not, tested alone and mixed with country entries
[x] -Test hdls with post to hub off
[x] -Load test data flow with lots of data
[x] -Check metrics api options, eg. country, date, other
[x] -Retest: ui vs api, view vs. download, regular vs machine, for dataset, metadata, file, multi file, export. Also account for double click (30sec), unique user (1hr), country/no country, localhost/private ip

kcondon on 25 Feb 2019

@kcondon I made some of the doc improvements we talked about in 44acd971d

I'm still not quite sure where the json SUSHI report should be saved so I didn't change anything there. Also, as we discussed the log files are written by glassfish but need to be read by counter. I'm open to suggestions about which directories to use. Maybe we can chat a bit more about it.

pdurbin on 25 Feb 2019

@kcondon ok in 51cfde87e I tried to reconcile the config with the guides so they match.

pdurbin on 25 Feb 2019

@djbrooke you asked me to leave a couple code comments of decisions made during tech hours and I just did in 1b527aa09 . Can you please take a look?

pdurbin on 26 Feb 2019

Looks good, thanks. @kcondon was re-verifying a few things and doing some further testing so I'm moving this back to QA. Thanks all for the discussion at tech hours.

djbrooke on 26 Feb 2019

As discussed with @kcondon @sekmiller and @djbrooke we plan to revert 1b527aa and start views and downloads even when Counter Processor can determine a country based on the IP address (127.0.0.1, 192.168.0.1, 172.16.0.1, 10.0.0.1, etc.). Primarily we decided this because the DataCite hub accepts reports without countries and we don't want the metrics we store in Dataverse to be out of sync with the DataCite hub.

pdurbin on 27 Feb 2019

We are waiting on a new api token to complete testing of this story.

matthew-a-dunlap on 11 Mar 2019

Turns it out was a dev box issue, not an api token (https://github.com/datacite/sashimi/issues/56) . I was able to submit to the test box with our current api token, so this is unblocked.

matthew-a-dunlap on 15 Mar 2019

Dataverse: Implement Backend Support for Make Data Count use and citation metrics

All 51 comments

Does Dataverse really need to become a harvesting server for reports in SUSHI format?

Does Dataverse really need to become a harvesting server for reports in TSV format?

Why does the CoP refer to SUSHI (JSON) and TSV formats but the "getting started" guide links to DataONE examples in XML?

I've emailed Martin about two typos in the CoP and they've been fixed (thanks!) but what's the process for giving more extensive feedback on the CoP?

How do you plan to measure non-HTTP downloads such as via rsync?

What's the likelihood that there will be audits in the future? Is this something we should warn Dataverse installations about?

Related issues