For some, but not all, Dataverse repositories running Dataverse 4.9.1-4.9.2, the count of all file downloads we get from the new metrics API doesn't match the download count of all files in an installation that's displayed in the metrics bar on the homepage:
vs. {"status":"OK","data":{"count":11120}}
Dataverses running 4.9.1-4.9.2 with different counts
UAL Dataverse
Qualitative Data Repository
Dataverses running 4.9.1-4.9.2 with matching acounts
I thought the difference might be caused by the query I think the API uses (or some version of it) and the number of entries in the database's guestbookresponse table that don't have timestamps (responsetimes). But is that likely for newer Dataverse installations like QDR's that probably don't have entries in its guestbookresponse table with no timestamps?
Here's the query:
_select to_char(date_trunc('month', guestbookresponse.responsetime), 'Mon YYYY') as months, count(guestbookresponse.id) AS new_datasets,
sum(count(guestbookresponse.id)) over (order by date_trunc('month', guestbookresponse.responsetime)) as cumulative
from guestbookresponse
join dvobject on dvobject.id = guestbookresponse.datafile_id
where dvobject.publicationdate is not null
and guestbookresponse.responsetime is not null
group by date_trunc('month', responsetime)
order by date_trunc('month', responsetime) desc
limit 12;_
For Harvard Dataverse, if you remove and guestbookresponse.responsetime is not null, the cumulative total includes the entries with no (null) responsetimes, which is closer to the count shown in the front page's metrics bar.
@jggautier For QDR, it looks like the GUI counts entries where dvobject.publicationdate is null. Removing that clause from the query above makes the counts match.
Thanks @qqmyers. Does that mean that for QDR, downloading files that are in unpublished datasets increments the download count in the UI? That would be a bug :(
@jggautier - maybe one that's fixed though - #4637 . I suspect we have some legacy counts from before that was fixed.
Oh right! Can we remove from the guestbookresponse table the download counts created by the bug @qqmyers described in #4637?
It looks like doing that will make QDR's UI and API counts match. The bug accounts for a small part of the mismatch for Harvard Dataverse.
So I think we're aware of three issues with the counts, and if possible I think we should:
where clauses)._where dvobject.publicationdate > guestbookresponse.responsetime and dvobject.publicationdate is not null_
(the API query ignores counts from currently unpublished datasets, but the UI is counting those)
The other situation @qqmyers described in #4637 is double counting. If there was a way to find entries in the guestbookresponse table where the same file was downloaded by the same person within the same second, removing those could make the count more accurate (it might correct more than just the counts created by the bug). How do sessionids work? I dug into the sessionid column a little to see if we could use it to represent one user, but it looks like sometimes sessionids are assigned to more than one user. Or maybe we could use the authenticateduserids with each entry. Since the double counting described in #4637 happened on draft datasets, the user would've been logged in, so the entries created by that bug would have authenticateduserids.
But even if nothing can be done about the double counting, it seems that removing counts from unpublished datasets will make QDR's UI and API counts match (and hopefully other installations').
On Harvard Dataverse, most of the difference between the counts from the UI and the API are from guestbookresponse entries with no timestamps (#3324). The UI is counting those, and the API is not. I think the counts without timestamps should be counted by the API.
@pdurbin suggested adding a timestamp to the guestbookresponse table. It could be an obviously fake timestamp.
Also in #3324 @landreev suggested:
when we generate access reports/otherwise display this data, we can think of presenting it in some sensible way: like, instead of listing all these downloads with no recorded times, we should probably just say "plus N downloads were recorded before [earliest download date recorded]; no further information is available about those prehistoric downloads, sorry for the inconvenience."
If we add fake timestamps to entries in the guestbookresponse table that have no timestamps, then when all of the counts are displayed by month, in addition to each month there would a group with that fake timestamp that we could describe as @landreev suggested.
Or if we leave them null, we could remove the clause guestbookresponse.responsetime is not null from the query that the API uses. And when all of the counts are displayed by month, there would be a null group that we could describe as @landreev suggested.
(When the total counts are displayed, I would think we don't have to explain that the dates of some counts are unknown.)
Here's the query I've used to find file downloads (and any other downloadtypes e.g. "explores") that are counted two or more times within the same second:
select email, name, authenticateduser_id, responsetime, sessionid, datafile_id, downloadtype, guestbook_id
from guestbookresponse g1
where authenticateduser_id is not null --assuming we would look for only "downloads" from logged in users
and exists (
select 1
from guestbookresponse g2
where
--find "downloads" that occurred at the same time
g2.responsetime = g1.responsetime
--OR find "downloads" within one second
--g2.responsetime > g1.responsetime - interval '1 sec'
--and g2.responsetime < g1.responsetime + interval '1 sec'
--from the same logged in user
and g2.authenticateduser_id = g1.authenticateduser_id
--for the same file
and g2.datafile_id = g1.datafile_id
--where the guestbookresponse entries aren't the same
and g2.id <> g1.id
)
--to show the results are what we want
order by sessionid, datafile_id, responsetime;
Two other issues that may be at play:
Looking at the initial issues in this story, it seems that QDR's counts match now. The only other issue I was able to identify is that we are not counting undated historic download counts in the per month query. This fix will count those undated results if the month queried is on/after the oldest dated record.
Alongside this, all the metrics table entries for downloads should be cleared so they are requeried with the updated values.
@matthew-a-dunlap assigning to you and pulling back to team dev based on our discussion
Numbers match on copy of prod db. Closing but will merge when custom home page passes.