Dataverse: Count of all file downloads from metrics API doesn't always match what's in UI (on old homepage metrics bar)

Created on 16 Aug 2018 · 10Comments · Source: IQSS/dataverse

For some, but not all, Dataverse repositories running Dataverse 4.9.1-4.9.2, the count of all file downloads we get from the new metrics API doesn't match the download count of all files in an installation that's displayed in the metrics bar on the homepage:

sefsef vs. {"status":"OK","data":{"count":11120}}

Dataverses running 4.9.1-4.9.2 with different counts

Harvard Dataverse
- Homepage count: https://dataverse.harvard.edu
- API count: https://dataverse.harvard.edu/api/info/metrics/downloads/toMonth

UAL Dataverse
- Homepage count: https://dataverse.library.ualberta.ca
- API count: https://dataverse.library.ualberta.ca/api/info/metrics/downloads/toMonth
Qualitative Data Repository
- Homepage count: https://data.qdr.syr.edu
- API count: https://data.qdr.syr.edu/api/info/metrics/downloads/toMonth

Dataverses running 4.9.1-4.9.2 with matching acounts

UNB Libraries Dataverse
- Homepage count: https://dataverse.lib.unb.ca
- API count: https://dataverse.lib.unb.ca/api/info/metrics/downloads/toMonth

Metrics + Reports Bug

Source

jggautier

All 10 comments

I thought the difference might be caused by the query I think the API uses (or some version of it) and the number of entries in the database's guestbookresponse table that don't have timestamps (responsetimes). But is that likely for newer Dataverse installations like QDR's that probably don't have entries in its guestbookresponse table with no timestamps?

Here's the query:

_select to_char(date_trunc('month', guestbookresponse.responsetime), 'Mon YYYY') as months, count(guestbookresponse.id) AS new_datasets,
sum(count(guestbookresponse.id)) over (order by date_trunc('month', guestbookresponse.responsetime)) as cumulative
from guestbookresponse
join dvobject on dvobject.id = guestbookresponse.datafile_id
where dvobject.publicationdate is not null
and guestbookresponse.responsetime is not null
group by date_trunc('month', responsetime)
order by date_trunc('month', responsetime) desc
limit 12;_

For Harvard Dataverse, if you remove and guestbookresponse.responsetime is not null, the cumulative total includes the entries with no (null) responsetimes, which is closer to the count shown in the front page's metrics bar.

jggautier on 16 Aug 2018

@jggautier For QDR, it looks like the GUI counts entries where dvobject.publicationdate is null. Removing that clause from the query above makes the counts match.

qqmyers on 17 Aug 2018

Thanks @qqmyers. Does that mean that for QDR, downloading files that are in unpublished datasets increments the download count in the UI? That would be a bug :(

jggautier on 17 Aug 2018

@jggautier - maybe one that's fixed though - #4637 . I suspect we have some legacy counts from before that was fixed.

qqmyers on 17 Aug 2018

Oh right! Can we remove from the guestbookresponse table the download counts created by the bug @qqmyers described in #4637?

It looks like doing that will make QDR's UI and API counts match. The bug accounts for a small part of the mismatch for Harvard Dataverse.

So I think we're aware of three issues with the counts, and if possible I think we should:

remove counts created by the first two issues
make sure the query used for the UI and for the API count are similar (so that they handle the counts with no timestamps in the same way). I think the UI count is coming from a query that counts all rows in the guestbookresponse table. That query should be similar to the query the API uses (same where clauses).

Counts from unpublished datasets

4637 caused Dataverse to count downloads of files in datasets that weren't published. I think we can find those downloads in the guestbookresponse table entries by querying for (1) entries that were created for currently unpublished datasets and (2) entries created before the currently published dataset was published:

_where dvobject.publicationdate > guestbookresponse.responsetime and dvobject.publicationdate is not null_
(the API query ignores counts from currently unpublished datasets, but the UI is counting those)

Counts from double counting

The other situation @qqmyers described in #4637 is double counting. If there was a way to find entries in the guestbookresponse table where the same file was downloaded by the same person within the same second, removing those could make the count more accurate (it might correct more than just the counts created by the bug). How do sessionids work? I dug into the sessionid column a little to see if we could use it to represent one user, but it looks like sometimes sessionids are assigned to more than one user. Or maybe we could use the authenticateduserids with each entry. Since the double counting described in #4637 happened on draft datasets, the user would've been logged in, so the entries created by that bug would have authenticateduserids.

But even if nothing can be done about the double counting, it seems that removing counts from unpublished datasets will make QDR's UI and API counts match (and hopefully other installations').

Counts with no timestamps

On Harvard Dataverse, most of the difference between the counts from the UI and the API are from guestbookresponse entries with no timestamps (#3324). The UI is counting those, and the API is not. I think the counts without timestamps should be counted by the API.

@pdurbin suggested adding a timestamp to the guestbookresponse table. It could be an obviously fake timestamp.

Also in #3324 @landreev suggested:

when we generate access reports/otherwise display this data, we can think of presenting it in some sensible way: like, instead of listing all these downloads with no recorded times, we should probably just say "plus N downloads were recorded before [earliest download date recorded]; no further information is available about those prehistoric downloads, sorry for the inconvenience."

If we add fake timestamps to entries in the guestbookresponse table that have no timestamps, then when all of the counts are displayed by month, in addition to each month there would a group with that fake timestamp that we could describe as @landreev suggested.

Or if we leave them null, we could remove the clause guestbookresponse.responsetime is not null from the query that the API uses. And when all of the counts are displayed by month, there would be a null group that we could describe as @landreev suggested.

(When the total counts are displayed, I would think we don't have to explain that the dates of some counts are unknown.)

jggautier on 20 Aug 2018

Here's the query I've used to find file downloads (and any other downloadtypes e.g. "explores") that are counted two or more times within the same second:

select email, name, authenticateduser_id, responsetime, sessionid, datafile_id, downloadtype, guestbook_id
from guestbookresponse g1
where authenticateduser_id is not null --assuming we would look for only "downloads" from logged in users
and exists (
    select 1
    from guestbookresponse g2
    where 

    --find "downloads" that occurred at the same time
    g2.responsetime = g1.responsetime

    --OR find "downloads" within one second
    --g2.responsetime > g1.responsetime - interval '1 sec' 
    --and g2.responsetime < g1.responsetime + interval '1 sec'

    --from the same logged in user
    and g2.authenticateduser_id = g1.authenticateduser_id

    --for the same file
    and g2.datafile_id = g1.datafile_id

    --where the guestbookresponse entries aren't the same
    and g2.id <> g1.id
    )

--to show the results are what we want
order by sessionid, datafile_id, responsetime;

jggautier on 24 Aug 2018

Two other issues that may be at play:

The metrics caching. Most of the queries default to 7 days unless configured.
~~There may be weirdness in how the metrics compare dates. Not sure but worth investigating.~~ (works as expected)

matthew-a-dunlap on 9 Jan 2019

👍1

Looking at the initial issues in this story, it seems that QDR's counts match now. The only other issue I was able to identify is that we are not counting undated historic download counts in the per month query. This fix will count those undated results if the month queried is on/after the oldest dated record.

Alongside this, all the metrics table entries for downloads should be cleared so they are requeried with the updated values.

matthew-a-dunlap on 10 Jan 2019

@matthew-a-dunlap assigning to you and pulling back to team dev based on our discussion