Dataverse: Metrics APIs: Dataverses, Datasets, Files

Created on 20 Mar 2018 · 28 Comments · Source: IQSS/dataverse

Following the discussion in #4478, we decided to add some new metrics endpoints to Dataverse based on the model that's currently provided by Miniverse:

https://services.dataverse.harvard.edu/static/swagger-ui/index.html?url=/miniverse/metrics/v1/swagger.yaml

Let's estimate this as a group (all three endpoints) and if the estimate comes back high we can discuss breaking it up.

Metrics + Reports

All 28 comments

Today I stubbed out some code at fa8b601

In order to make this a little more real, I created a repo at https://github.com/IQSS/metrics.dataverse.org to start switching the front end code from pulling data from miniverse to pulling data from an installation of Dataverse. For now that installation of Dataverse is my server at https://dev1.dataverse.org running c62e7bb.

I started with what I suspect is the easiest endpoint to migrate, which is called "Dataverses by Category". Below you can see https://dataverse.org/metrics on the left and my implementation from the new repo above on the right. As I migrate the other five metrics over, they'll appear in the version on the right:

[Screenshot: "Dataverses by Category" on dataverse.org/metrics (left) and the new implementation (right)]

I guess I'll add a todo list. There are 6 metrics to implement:

  • [ ] Dataverses Added Over Time. Started in 4e3c958
  • [x] Dataverses by Category. Done in fa8b601
  • [ ] Datasets Added Over Time. Started in fdd609b
  • [ ] Datasets by Subject. Started in 40ae4b1
  • [ ] Files Added Over Time
  • [ ] File Downloads Over Time. Started in 9352882 but needs work.

From the perspective of Swagger, these are the 6 from https://services.dataverse.harvard.edu/static/swagger-ui/index.html?url=/miniverse/metrics/v1/swagger.yaml

[Screenshot: the six endpoints listed in the miniverse Swagger UI]

Discussed with @pdurbin (and got a demo!). Adding the six endpoints above to the Dataverse application is the correct scope of this issue. There are some other endpoints that are interesting (bytes used, for example), but we can develop iteratively. We'll work on an updated aggregator/reporting application in a later issue, but we want to get these initial endpoints into a release sooner rather than later.

I just chatted with @dlmurphy and it sounds like he and @jggautier may have already written some of the remaining queries we need as part of #4169. I can, of course, also refer to the miniverse code and maybe even run miniverse on my laptop to have it show me the raw SQL queries. The miniverse code seems to use the Django ORM, which is an abstraction.

I just made pull request #4612 and moved this issue into code review at https://waffle.io/IQSS/dataverse

I'm not sure that the definition of done for the issue is especially clear but we'll see.

We've figured out that our query for total number of uploaded files is inaccurate because it's including deleted files in the count. Pulling this back into develop while we figure out a better query.

Here's how the JSON output looks as of 256d388 for each of the six endpoints. I'm trying to keep it compatible with the d3 and d3plus code currently in use on the metrics page. I'm also preserving inconsistencies (variations in display_name, for example) that may not be readily apparent if you don't have the UI of the metrics page in front of you. Again, that's why I went ahead and created https://github.com/IQSS/metrics.dataverse.org: the front end and the back end need to work together, and it probably makes sense to hook them up before merging the back end code to make sure the front end looks the way we want.

The JSON has what I consider to be some oddities. For example, keys of JSON objects have a wide variety of formats...

  • camelCase (monthNum)
  • snake_case (month_sort)
  • Title Case with spaces (Number of Dataverses)

... and I wonder if it's possible to reduce the number of formats, which would have an impact on the front end code.

Again, if we make changes in this area, we need to make sure that the front end and back end work together. I've already made some tweaks to the front end code in the other repo because all of the output from the Dataverse native API is wrapped in a JSON object called "data" while miniverse doesn't do this wrapping.
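Since the native API wraps its output in a "data" object and miniverse doesn't, the front end has to cope with both shapes during the migration. Here's a hypothetical Python helper sketching that unwrapping (the sample payloads are abbreviated, and the "status" key is an assumption based on other native API responses):

```python
import json

def unwrap(payload: str):
    """Return the metrics payload whether or not it is wrapped in "data"."""
    parsed = json.loads(payload)
    if isinstance(parsed, dict) and "data" in parsed:
        return parsed["data"]
    return parsed

# Abbreviated payloads: native API (wrapped) vs. miniverse (bare).
wrapped = '{"status": "OK", "data": {"Month": "April 2018", "monthNum": 4}}'
bare = '{"Month": "April 2018", "monthNum": 4}'
assert unwrap(wrapped) == unwrap(bare)
```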

dataverses/byMonth

{
  "Month": "April 2018",
  "monthNum": 4,
  "name": "Total Dataverses",
  "Number of Dataverses": 3,
  "running_total": 2373,
  "month_sort": "2018-04",
  "display_name": "April 2018: 3 New Dataverses"
}

datasets/byMonth

{
  "Month": "November 2017",
  "monthNum": 11,
  "name": "Total Datasets",
  "Number of Datasets": 15,
  "running_total": 23972,
  "month_sort": "2017-11",
  "display_name": "November 2017: 15 new Datasets; Total of 23,972"
}

files/byMonth

{
  "Month": "November 2017",
  "monthNum": 11,
  "name": "Total Files Added",
  "Number of Files": 111,
  "running_total": 257262,
  "month_sort": "2017-11",
  "display_name": "November 2017: 111 added / total: 257,262"
}

downloads/byMonth

{
  "Month": "November 2017",
  "monthNum": 11,
  "name": "Total File Downloads",
  "Number of File Downloads": 3165,
  "running_total": 2144725,
  "month_sort": "2017-11",
  "display_name": "November 2017: 3,165 downloads / total: 2,144,725"
}

dataverses/byCategory

Note that in eaa3e18 a subsequent fix went in so Research Projects is returned rather than RESEARCH_PROJECTS.

{
  "dataverse count": 723,
  "name": "RESEARCH_PROJECTS (30.5%)",
  "percent_label": "30.5%"
}

datasets/bySubject

{
  "value": 22583,
  "weight": 0.7411065697669983,
  "totalDatasets": 30472,
  "type": "Social Sciences",
  "label": "Social Sciences (74.1%)"
}

@djbrooke and I talked through the JSON output this morning, and we agreed that we should not be tied to the current d3/d3plus implementation used on the existing metrics page. Rather, it should be more "just the facts, ma'am." To that end, I've greatly simplified the JSON output and removed the oddities (spaces in key names, etc.) and inconsistencies.

Here's how the JSON looks as of 1963f20

dataverses/byMonth

{
  "yearMonth": "2018-04",
  "newDataverses": 3,
  "runningTotal": 2373
}

datasets/byMonth

{
  "yearMonth": "2017-11",
  "newDatasets": 15,
  "runningTotal": 23972
}

files/byMonth

{
  "yearMonth": "2017-11",
  "newFiles": 111,
  "runningTotal": 257262
}

downloads/byMonth

{
  "yearMonth": "2017-11",
  "downloads": 3165,
  "runningTotal": 2144725
}

dataverses/byCategory

{
  "category": "Research Project",
  "count": 723
}

datasets/bySubject

{
  "subject": "Social Sciences",
  "count": 22583
}

We should still decide whether all the "byMonth" endpoints should continue to be limited to 12 months (the way miniverse works) or should show all time.
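If we do keep a 12-month limit, the "yearMonth" strings make it cheap to apply on either side, since they sort lexicographically ("2017-11" < "2018-04"). A hypothetical Python sketch:

```python
def last_n_months(records, n=12):
    """Keep only the most recent n monthly records, sorted by yearMonth."""
    ordered = sorted(records, key=lambda r: r["yearMonth"])
    return ordered[-n:]

records = [{"yearMonth": "2017-11"}, {"yearMonth": "2018-04"}]
assert last_n_months(records, n=1) == [{"yearMonth": "2018-04"}]
```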

We've been talking here and there about the need to cache the results of these database queries. They're slow enough that they could be used to pull off a denial of service attack if we don't do something to cache results. I was just telling @matthew-a-dunlap that I've been meaning to give JSR 107 (JCache - Java Temporary Caching API) a try after watching the video I linked to at #2100

@pameyer also mentioned Varnish but I don't think we want to introduce yet another installation dependency for all the installations out there.

@scolapasta and I talked briefly about caching results in the database. We plan to talk it out soon.

@pdurbin the discussion yesterday convinced me that Varnish isn't a good fit here, even setting aside the question of adding another installation dependency.

@pameyer my understanding is that @openscholar gets a lot of mileage out of Varnish. I've talked to @rbran100 about it and I see it all over their repo on GitHub. It was a good thought (thanks!) but yeah, I'm not sure we want to force it on our users. Maybe we'll keep it in our back pocket.

We do get a lot of mileage out of Varnish, but it is a good reverse proxy, not a good object store. JCache is for sure the way to go, but I think the reference implementation that comes with it will not be suitable for a production environment. You will end up having to choose an implementation (https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html), and that may introduce yet another dependency. Of the options listed there, I am most familiar with Redisson, but I am sure most are good.

I just discussed with @djbrooke and we're good to go forward with our new Total Number of Files query:

select count(filemetadata.datafile_id)
from filemetadata
join datasetversion on datasetversion.id = filemetadata.datasetversion_id
where concat(datasetversion.dataset_id,':', datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber)) in
(
select concat(datasetversion.dataset_id,':', max(datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber))) as max
from datasetversion
join dataset on dataset.id = datasetversion.dataset_id
where versionstate='RELEASED'
and dataset.harvestingclient_id is null
group by dataset_id
);
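To spell out what the query does: the inner select picks, per released, non-harvested dataset, the highest version, ordering versions by versionnumber + 0.1 * minorversionnumber; the outer count then tallies the filemetadata rows attached to those latest versions. Here's a rough Python rendering of the same idea over in-memory rows (the sample data is invented, and the harvested flag is folded onto the version rows for brevity):

```python
def count_latest_released_files(versions, filemetadata):
    """versions: dicts with dataset_id, id, versionnumber, minorversionnumber,
    versionstate, harvested. filemetadata: dicts with datasetversion_id."""
    # Per dataset, keep the highest released, non-harvested version,
    # ordering by versionnumber + 0.1 * minorversionnumber as in the SQL.
    latest = {}
    for v in versions:
        if v["versionstate"] != "RELEASED" or v["harvested"]:
            continue
        key = v["versionnumber"] + 0.1 * v["minorversionnumber"]
        ds = v["dataset_id"]
        if ds not in latest or key > latest[ds][0]:
            latest[ds] = (key, v["id"])
    wanted = {vid for _, vid in latest.values()}
    # Count file rows belonging to those latest versions only.
    return sum(1 for fm in filemetadata if fm["datasetversion_id"] in wanted)

versions = [
    {"dataset_id": 1, "id": 10, "versionnumber": 1, "minorversionnumber": 0,
     "versionstate": "RELEASED", "harvested": False},
    {"dataset_id": 1, "id": 11, "versionnumber": 1, "minorversionnumber": 1,
     "versionstate": "RELEASED", "harvested": False},
]
filemetadata = [
    {"datasetversion_id": 10, "datafile_id": 100},
    {"datasetversion_id": 11, "datafile_id": 100},
    {"datasetversion_id": 11, "datafile_id": 101},
]
# Only version 1.1 (id 11) counts, so the result is 2, not 3.
assert count_latest_released_files(versions, filemetadata) == 2
```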

We tested this query in both Harvard Dataverse and Demo Dataverse. When querying Demo, we got a number matching the number in Demo's search UI. With HDV, however, the number from our query was about 1,000 lower than the number reported by the search UI.

Our hypothesis is that this discrepancy is actually caused by an indexing error in HDV. The query is returning the correct number; it's HDV's search UI that's wrong. There may be orphan records in HDV's index that are driving up the total, as previously seen in #4205.

It's not clear how we could make this query work retroactively to display results by month... It might not be possible. Instead we're thinking the way to go is to run this query once a month and store the result from each month.

This does mean that the first month that we switch over to this query will show a dramatic drop in the total number of files from the previous month (since the old query counted harvested files). But that's a hit we'll have to take; we'll explain that we've changed to more reliable and accurate ways of measuring our metrics.

@dlmurphy thanks, that query is so different from the others (it's not by month) that I added it as a 7th metric/query in d202c03

Let's keep talking about what to do for files. I don't like the idea that the app has to capture the metric on a certain day (what if it's down?) and can't simply look in the database to see how many files there were months ago, but I guess that's due to our complex data model.

I have started looking into using an implementation of JCache to do caching for metrics (and maybe other things later). I do not have a working example yet but will keep on this tomorrow.

@djbrooke @scolapasta @dlmurphy and I met and decided to make many changes to the API, which I implemented in 786f226. Below is how the JSON output for the time-based endpoints looks now. Note that you pass in the month in YYYY-MM format and that the "runningTotal" is gone:

dataverses/byMonth/2018-03

{
  "count": 2370
}

datasets/byMonth/2018-03

{
  "count": 22722
}

files/byMonth/2018-03

{
  "count": 215966
}

downloads/byMonth/2018-03

{
  "count": 2158430
}

I added quite a few FIXMEs and TODOs in that commit but I at least deployed to a server with lots of data and the endpoints seem to be working as expected.
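Since "runningTotal" is gone and each call now returns a cumulative count, per-month figures can still be recovered client-side by differencing consecutive months. A small Python sketch using the counts above (the 2018-04 figure comes from the earlier example output):

```python
def deltas(cumulative: dict) -> dict:
    """cumulative: {"YYYY-MM": count}. Returns new items per month."""
    months = sorted(cumulative)  # yearMonth strings sort chronologically
    return {m: cumulative[m] - cumulative[prev]
            for prev, m in zip(months, months[1:])}

# 2018-03 returned 2370 total dataverses; 2018-04 returned 2373.
assert deltas({"2018-03": 2370, "2018-04": 2373}) == {"2018-04": 3}
```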

To do list:

As I mentioned at standup this issue is probably ready for code review for the bulk of the work. I implemented the design @scolapasta and @dlmurphy and I talked about last week. As explained in the checklist above, I still haven't added JCache to the branch. If someone wants to pick this up, I'd recommend reading through #2100 and taking a peek at 913242a. Here are some links I was looking at yesterday that may or may not be helpful:

I finally got back to this today. After digging into JCache and implementations such as Ehcache and Hazelcast, as far as I can tell they require deploying an external application to handle caching. Payara bundles Hazelcast out of the box, but GlassFish does not.

I tried digging deeper into Hazelcast anyway because it has the most complete and current documentation. Reading over the documentation, I tried following along with the GlassFish v3 section and deploying hazelcast-jca-rar to autodeploy, but it spit out a file saying the deployment failed.

Stepping back, I'm unsure whether we want another external application dependency alongside Dataverse. There is a fair chance I'm misunderstanding how this caching can be set up, but I'm in over my head in terms of GlassFish server knowledge.

Note: Using this code requires running the 4.8.6 to 4.9 update script. There is also an admin setting, MetricsCacheTimeoutMinutes, that defines how long a cached result is held before the query is allowed to rerun.
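To illustrate the timeout semantics only (the real implementation stores results in the database, not in memory): a cached value is reused until it is older than the configured number of minutes, after which the query reruns. A minimal in-memory Python sketch:

```python
import time

class MetricsCache:
    """Reuse a stored query result until it is older than the timeout."""

    def __init__(self, timeout_minutes: int, clock=time.time):
        self.timeout = timeout_minutes * 60
        self.clock = clock          # injectable for testing
        self.store = {}             # name -> (timestamp, value)

    def get(self, name, run_query):
        now = self.clock()
        entry = self.store.get(name)
        if entry is not None and now - entry[0] < self.timeout:
            return entry[1]         # fresh enough: skip the query
        value = run_query()         # expired or missing: rerun and store
        self.store[name] = (now, value)
        return value

# Fake clock to exercise expiry without sleeping.
t = [0]
cache = MetricsCache(timeout_minutes=1, clock=lambda: t[0])
calls = []
cache.get("filesTotal", lambda: calls.append(1) or 42)
cache.get("filesTotal", lambda: calls.append(1) or 42)  # served from cache
t[0] = 61                                               # past the timeout
cache.get("filesTotal", lambda: calls.append(1) or 42)  # query reruns
assert len(calls) == 2
```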

@matthew-a-dunlap and I went over the code and here's my feedback:

  • [x] typo: "seperator"
  • [x] no need for CREATE TABLE in SQL update script
  • [x] Document new admin endpoints unless they're only there for testing.
  • [x] Remove "FIXME: Remove SQL injection vector" comment (see sanitizeYearMonthUserInput)
  • [x] Add more non-null constraints on JPA side (@Column( nullable = false ))
  • [x] Please format code with Netbeans: http://guides.dataverse.org/en/4.8.6/developers/coding-style.html#formatting-code

Questions: If I'm calling /clearMetricsCache/{name} how am I supposed to know what to provide for {name}? Do we plan to maintain a list in our documentation? How much value does this endpoint have over clearing all metrics (/clearMetricsCache)?

I have done the fixes @pdurbin pointed out for the caching portion of Metrics. I am moving this back into code review; someone other than the two of us should still take a look!

There were minor merge conflicts (methods added in the same place in Admin.java) so I just resolved them in 6cdad11

OK, generally works well but found a few items:

  • [x] Might want to mention the default cache timeout in the docs.
  • [x] Many of the query endpoints use concat(), which is incompatible with Postgres 8.4
  • [ ] Datasets by subject counts seem correct in a small test env but are way off on the prod db (9k in the UI versus 22k from the API)
  • [x] datasetsBySubject and dataversesByCategory do not update their lastcalleddate value when accessed and do not seem to clear the value if the cache times out.

The caching issue should be fixed with the latest commit.

(sorry for the delay in this comment)

@kcondon - @matthew-a-dunlap volunteered to take on the remaining issues in standup. I left it assigned to you in case you wanted to test the fixes in parallel, but moving it back to development is OK too.

While fixing the issues found during QA, a few extra issues were found and resolved:

  • The datasetsBySubject query no longer has a date component. The new query originally had a check on the month.
  • The 'byMonth' queries were renamed to be 'toMonth', as this is more descriptive of their cumulative nature.

    • The metrics API documentation describes this better as well

All but one of the queries were touched during these changes, and they all should probably be re-tested.
