Readthedocs.org: expose server side analytics

Created on 21 Jan 2019 · 8 Comments · Source: readthedocs/readthedocs.org

As of #4131, it looks like Read the Docs is tracking some analytics server side. Awesome: This offers advantages to documentation readers (e.g. their IPs are anonymized before being sent to Google Analytics), but also an advantage to documentation publishers that is not yet realized: Currently publishers need to supply their own Google Analytics tracking ID to be able to track views of their docs, which (1) not all publishers bother to do (it's buried under Advanced Settings), and (2) won't count visitors who block requests to GA (e.g. using a browser extension). If Read the Docs exposed to projects the server side analytics that it's already tracking, it would address both these issues. Surfacing even just one or two metrics such as visitors per month and pageviews per month would be really useful. Any interest?

Thanks for your consideration and for all your work on Read the Docs!

Feature design decision

All 8 comments

Right now the analytics sent server side are exclusively advertising related. Longer term I'd like to completely remove client side GA and switch to entirely server side GA. I outlined my thoughts here. I even built a separate module for it.

One reason I'm hesitating slightly is that right now we're sending ~1-2k events/day to GA (ad clicks) and that's fine. If we made every pageview on RTD send to GA server side, we'd be looking at closer to 1-2M events/day. Perhaps using some serverless tech is a better fit.

Regardless, I'm glad somebody else is interested in this! This is on my list of stuff I want to do but it hasn't yet bubbled to the top.
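
For anyone following along, here is a rough sketch of what a server-side pageview hit could look like and what the volume above means per second. It assumes the Universal Analytics Measurement Protocol parameters (`v`, `tid`, `cid`, `t`, `dh`, `dp`, `uip`, `aip`). The function names, the placeholder tracking ID, and the choice to truncate the IP before sending are illustrative assumptions, not how Read the Docs actually does it.

```python
import ipaddress
from urllib.parse import urlencode

SECONDS_PER_DAY = 24 * 60 * 60


def anonymize_ip(ip: str) -> str:
    """Zero the last octet of an IPv4 address or the last 80 bits of an IPv6 address.

    Assumption: this mirrors the truncation GA applies when IP anonymization is on;
    the thread does not say how Read the Docs anonymizes IPs.
    """
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def build_pageview_hit(tracking_id: str, client_id: str, host: str, path: str, user_ip: str) -> str:
    """Build a URL-encoded Measurement Protocol payload for one pageview (hypothetical helper)."""
    params = {
        "v": "1",                      # protocol version
        "tid": tracking_id,            # GA property ID
        "cid": client_id,              # anonymous client ID
        "t": "pageview",               # hit type
        "dh": host,                    # document hostname
        "dp": path,                    # document path
        "uip": anonymize_ip(user_ip),  # IP override, already truncated
        "aip": "1",                    # also ask GA to anonymize the IP
    }
    return urlencode(params)


def average_hits_per_second(hits_per_day: int) -> float:
    """Average request rate implied by a daily hit count."""
    return hits_per_day / SECONDS_PER_DAY


hit = build_pageview_hit("UA-XXXXX-Y", "555", "docs.example.org", "/en/latest/", "203.0.113.77")
```

By this arithmetic, 1-2M pageviews/day works out to roughly 12 to 23 hits per second on average, before accounting for traffic peaks.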

Thanks for the quick reply @davidfischer, and glad to hear this is already on your radar! One quick followup thought: I know Cloudflare is able to do this for its users and makes the data available via its API as well as its browser UI. Here's a screenshot I just took for one of my sites:

[Screenshot taken 2019-01-21 at 17:34:28]

It looks like Read the Docs is using Azure CDN, which I've no experience with, but maybe they provide something similar that could save you some work?

Currently we are only using Azure CDN for static files and not for dynamic content so I don't think it would work in its current form.

Secondly, we attach a lot of data to pageviews and events so we can understand the site better. For example, I look at pageviews by programming language of the docs or pageviews by Sphinx theme pretty frequently. Ideally I'd like to still get that.

Interestingly, the really privacy conscious stuff in GA is the stuff I don't want or need at all. I don't need any demographics info and I don't need to know that a user who visited our site 6 months ago is "returning".

[Screenshot taken 2019-01-21 at 2:55:11 pm]

Just to show off, here's a small dashboard of custom dimension breakdowns. It's a week's worth of data. I removed stuff that would identify single projects or small groups.
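
To make the "pageviews by programming language or Sphinx theme" idea concrete, here is a minimal sketch of tagging hits with custom dimensions and tallying them. The `cd<index>` keys follow the Measurement Protocol naming for custom dimensions. The slot numbers, helper names, and sample values are made up for illustration and say nothing about how the real dashboard is built.

```python
from collections import Counter

# Hypothetical mapping from dimension name to GA custom-dimension slot.
DIMENSION_SLOTS = {"programming_language": 1, "sphinx_theme": 2}


def with_custom_dimensions(hit: dict, **dimensions: str) -> dict:
    """Return a copy of the hit with cd<index> keys added for each named dimension."""
    tagged = dict(hit)
    for name, value in dimensions.items():
        tagged[f"cd{DIMENSION_SLOTS[name]}"] = value
    return tagged


def pageviews_by(hits: list, name: str) -> Counter:
    """Count pageview hits grouped by the value of one custom dimension."""
    key = f"cd{DIMENSION_SLOTS[name]}"
    return Counter(
        hit[key] for hit in hits if hit.get("t") == "pageview" and key in hit
    )


# Sample data only; not real Read the Docs traffic.
hits = [
    with_custom_dimensions({"t": "pageview", "dp": "/a/"}, programming_language="py", sphinx_theme="alabaster"),
    with_custom_dimensions({"t": "pageview", "dp": "/b/"}, programming_language="py", sphinx_theme="sphinx_rtd_theme"),
    with_custom_dimensions({"t": "pageview", "dp": "/c/"}, programming_language="js", sphinx_theme="sphinx_rtd_theme"),
    with_custom_dimensions({"t": "event", "dp": "/c/"}, programming_language="js"),
]
by_language = pageviews_by(hits, "programming_language")
```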

Interesting, thanks!

This would be a more significant change to your current architecture, but could save hosting costs and improve page load times, so just in case it's worth considering:

You could still serve mutable responses (like all <projectid>.rtfd.io pages) with a cache-control: public header and a low (e.g. 5-minute) max-age, such that a CDN can still serve them. You'd then move the server-side metrics from the endpoint that serves top-level pages to some dedicated analytics endpoint that would be accessed from a subrequest of every page (e.g. via XHR), whose response would not have any cache headers (so clients would always re-request it and the CDN would never cache it).

This has worked well for me with Cloudflare, so just thought I'd share in case it's helpful.
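
A minimal sketch of the setup described above, as a bare WSGI app using only the standard library. The `/_analytics` path, the `page` query parameter, and the in-memory counter are all hypothetical. One deliberate difference from the description: the analytics response sets `Cache-Control: no-store` explicitly instead of omitting cache headers, on the assumption that being explicit is the safer way to keep clients and the CDN from caching it.

```python
from urllib.parse import parse_qs

PAGE_MAX_AGE_SECONDS = 300  # the "low (e.g. 5-minute) max-age" from the comment above

# In-memory tally for illustration only; a real service would persist or forward these.
pageview_counts = {}


def app(environ, start_response):
    """Serve docs pages as CDN-cacheable and count views on a separate uncached endpoint."""
    path = environ.get("PATH_INFO", "/")

    if path == "/_analytics":
        # Hypothetical endpoint hit by a subrequest (e.g. XHR) from every page.
        query = parse_qs(environ.get("QUERY_STRING", ""))
        page = query.get("page", ["unknown"])[0]
        pageview_counts[page] = pageview_counts.get(page, 0) + 1
        start_response("204 No Content", [("Cache-Control", "no-store")])
        return [b""]

    # Mutable page content: publicly cacheable, but only briefly.
    body = f"<html><body>Docs page {path}</body></html>".encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
            ("Cache-Control", f"public, max-age={PAGE_MAX_AGE_SECONDS}"),
        ],
    )
    return [body]


# To try it locally:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

With this split, the CDN can absorb most page requests while the small analytics request still reaches the origin on every view.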

Thanks for the tip!

It isn't exactly the same, but it is partially solved. It's probably good enough to call it done.
