Amphtml: I2I: Session Id Support in AMP Analytics

Created on 16 Jul 2020 · 26 Comments · Source: ampproject/amphtml

Summary

Analytics vendors use a client-side session id to track an active session. Today AMP provides the PAGE_VIEW_ID and the high-entropy PAGE_VIEW_ID_64 that stays the same for a given site during what the AMP viewer defines as a session. The PAGE_VIEW_ID is persisted in RAM.

The proposal is to support a session id that stays the same for a given site during a time period (e.g. 30 mins). Such session information will be stored in localStorage.

See #1612 for more context.

Design document

Macro

SESSION_ID(opt_expirationTime)

The above macro will be added to <amp-analytics>.

opt_expirationTime is an optional duration in seconds. 1800 (30 minutes) will be used as default value if it is not provided. The expirationTime needs to be in the range of [30, 86400] (30 seconds to one day) due to performance and privacy concerns.

The returned value will be a random base64 string with 128 bits of entropy. For example: U6XEpUs3yaeQyR2DKATQH1pTZ6kg140fvuLbtl5nynb

When SESSION_ID is retrieved, the stored value will be returned if its timestamp is within the expiration time, and its timestamp will be updated to the current time.
If a stored value doesn't exist or has expired, a new random session id will be generated and stored along with the current timestamp.
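
The lookup rule above amounts to a sliding expiration window. Here is a minimal sketch of that rule, assuming an in-memory `Map` in place of localStorage and an injected clock and id generator so the logic can be exercised in isolation. The names `getSessionId`, `clampExpiration` and `StoredSession` are hypothetical and not part of AMP, and clamping out-of-range values is an assumption: the proposal only states the allowed range.

```typescript
// Hypothetical sketch of the proposed SESSION_ID lookup; not AMP's actual code.
interface StoredSession {
  id: string;
  timestampMs: number; // last time the id was created or read
}

const DEFAULT_EXPIRATION_S = 1800; // 30 minutes, the proposed default
const MIN_EXPIRATION_S = 30;
const MAX_EXPIRATION_S = 86400; // one day

// Assumption: values outside [30, 86400] are clamped rather than rejected.
function clampExpiration(seconds: number = DEFAULT_EXPIRATION_S): number {
  return Math.min(MAX_EXPIRATION_S, Math.max(MIN_EXPIRATION_S, seconds));
}

function getSessionId(
  store: Map<string, StoredSession>,
  nowMs: number,
  generateId: () => string,
  optExpirationS?: number,
): string {
  const key = "amp-analytics:session-id";
  const ttlMs = clampExpiration(optExpirationS) * 1000;
  const stored = store.get(key);
  if (stored !== undefined && nowMs - stored.timestampMs <= ttlMs) {
    // Still valid: return the same id and slide the window forward.
    store.set(key, { id: stored.id, timestampMs: nowMs });
    return stored.id;
  }
  // Missing or expired: start a new session.
  const id = generateId();
  store.set(key, { id, timestampMs: nowMs });
  return id;
}
```

Because every successful read refreshes the timestamp, a session ends only after a full expiration period without any read, which matches the renewing 30-minute cookie that vendors use today.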

Storage

The session id will be stored under localStorage entry amp-analytics:session-id. The reason AMP doesn't allow storageKey customization is to prevent multiple sessions from being created under different names.

The localStorage API between the AMP page and the AMP viewer will remain unchanged. It's up to the <amp-analytics> service to remove the stored value after the expiration time.

opt_expirationTime is only used when the storage value is read. Since getting and setting a storage value is asynchronous in AMP, we need to make sure that multiple calls to SESSION_ID are handled in order.
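
One common way to keep asynchronous read-modify-write operations in order is to chain each one onto the previous one's promise. The sketch below illustrates that general technique only; `SerialQueue` is a hypothetical helper, and nothing here is taken from the AMP codebase.

```typescript
// Hypothetical helper: runs async tasks strictly one after another,
// in the order they were enqueued.
class SerialQueue {
  // Tail of the chain; it never rejects, so later tasks always run.
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after every previously enqueued task has settled.
    const result = this.tail.then(task);
    // Swallow failures on the chain itself; callers still see them via `result`.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

With each SESSION_ID resolution wrapped in `enqueue`, a second call cannot read the stored entry until the first call has finished writing it, so two concurrent calls cannot each generate a different id.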

Multiple Vendors

Since only one session id will be stored, multiple vendors using SESSION_ID will need to share the same id with the same duration.

Launch tracker

/cc @ampproject/wg-approvers on adding the storage entry
@ampproject/wg-analytics on the macro design

INTENT TO IMPLEMENT

All 26 comments

Is the only difference from the client id that it has an expiration time?

True. Other minor differences include: CLIENT_ID supports cid scope, and allows opt-out.

Would this not be subject to amp-consent?

The proposed SESSION_ID is no different from PAGE_VIEW_ID in terms of opt-out.
One can choose to use <amp-consent> to block <amp-analytics> entirely. If that's not the case, the SESSION_ID will always be generated and sent. It's then up to the analytics vendors to drop the requests on the server side based on consent information.

Makes sense.

Cool! I'll take that as one approval : )

No, sorry, I'd like this to be reviewed in design review first. Don't block on me.

lol agreed. we should review this in design review!
@mattwelke @querymetrics @msukmanowsky Let us know your feedback on the design. You are also welcome to join our design review next week. #29260

Actually, we ended up going with a 100% at query time implementation for session tracking where we don't even need to store a session ID at all. We're no longer in need of this feature.

But I appreciate it being added and I can see myself using it in the future. It's a nice alternative to worrying about analytical functions and windowing in SQL if you're okay with a fixed session size.

My main concerns would be the value being propagated to non-AMP pages the same way that CLIENT_ID can. I can't immediately tell based on what's described above whether that would work. I'll review the AMP documentation when I have time to understand better how it would work.

Actually, we ended up going with a 100% at query time implementation for session tracking where we don't even need to store a session ID at all. We're no longer in need of this feature.

Hey @mattwelke - I'm curious if you could explain the approach you used here a bit more? Are you passing around query variables? Is the code open source?

@adamsilverstein I can't share our code, but I can describe the approach.

At the end of the day, we work with the data collected as rows in an SQL database where each row is a session, with a session representing a period of activity by one person where there is no more than 30 minutes of inactivity between events. We accomplished this before using client side cookies that expire in 30 minutes and are continuously renewed. AMP beacons don't support this, so we ultimately changed what we do on the back end to change how we produce these session rows in SQL. Instead of grouping on a "session ID" column in the table containing event rows, we use analytical functions to group by the difference in event time comparing one row to the next. We end up with the same result, except it's all based on event time in the event row only, not the session ID that used to be added to each event row too.

We were originally inspired to look into this approach when I came across this blog post: https://statsbot.co/blog/event-analytics-define-user-sessions-sql/

What I meant by what I said above was that I still like the idea of not forcing people to use advanced SQL techniques to accomplish their use case. They may lack the SQL expertise, or their use case might not even work well with calculating sessions based on event time in batch to begin with. They may prefer the ability for the client to be the source of truth for a session ID. Implementing this would be useful for them. :+1:
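
For readers unfamiliar with the gap-based approach described in this comment, here is a minimal sketch of the same idea in TypeScript rather than SQL window functions. It is not the commenter's code: the `AnalyticsEvent` and `Session` shapes and the `sessionize` function are assumptions made for illustration.

```typescript
// Hypothetical event and session shapes, for illustration only.
interface AnalyticsEvent {
  userId: string;
  timeMs: number;
}

interface Session {
  userId: string;
  startMs: number;
  endMs: number;
  eventCount: number;
}

const GAP_MS = 30 * 60 * 1000; // 30 minutes of allowed inactivity

// Groups events into sessions: a new session starts whenever the gap since
// the same user's previous event exceeds gapMs.
function sessionize(events: AnalyticsEvent[], gapMs: number = GAP_MS): Session[] {
  // Order by user, then by time, like PARTITION BY user ORDER BY time in SQL.
  const sorted = [...events].sort((a, b) =>
    a.userId === b.userId ? a.timeMs - b.timeMs : a.userId < b.userId ? -1 : 1,
  );
  const sessions: Session[] = [];
  for (const e of sorted) {
    const last = sessions[sessions.length - 1];
    if (last !== undefined && last.userId === e.userId && e.timeMs - last.endMs <= gapMs) {
      // Within the inactivity window: extend the current session.
      last.endMs = e.timeMs;
      last.eventCount += 1;
    } else {
      sessions.push({ userId: e.userId, startMs: e.timeMs, endMs: e.timeMs, eventCount: 1 });
    }
  }
  return sessions;
}
```

The result depends only on event timestamps and a user identifier, so no session id needs to be stored on the client.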

I'm taking over Parsely's response to this issue for @msukmanowsky, and I'm really excited to see movement on this feature!

The semantics of the session ID macro match pretty well with how Parsely tracks sessions, so in that sense this implementation would fit our use case.

My main concerns would be the value being propagated to non-AMP pages the same way that CLIENT_ID can.

I'd like to echo this concern from @mattwelke. Parsely's client-side analytics SDK understands both its own native client ID cookie format and the format written by the CLIENT_ID macro when a cid-scope-cookie-fallback-name is passed to it as documented here. With this approach, the Parsely SDK and the Google AMP component both see the same pool of client ID values instead of each having its own pool, which can lead to duplication in unique client counting. I'm not certain, but I suspect that this fallback cookie approach is what's meant by "supports cid scope" above.

Without the ability to pass a fallback cookie name to the SESSION_ID macro, Parsely would have to figure out another way to consolidate session IDs generated by our own client-side SDK and the AMP component. My understanding is that there might be a way to hack together this behavior using the AMP Linker, but I'm unsure of the implementation details. Certainly, a fallback cookie argument to the SESSION_ID macro would make this much simpler for Parsely to adopt.

The other concern we have is around the scope of the amp-analytics:session-id localStorage key. From the design document, it seems that a site with multiple AMP analytics integrations (with, for example, Parsely and Chartbeat) would have session IDs shared between these analytics providers. Naive use of this field seems like it might run afoul of privacy regulations like GDPR, though again I'm unsure of the details.

The other concern we have is around the scope of the amp-analytics:session-id localStorage key. From the design document, it seems that a site with multiple AMP analytics integrations (with, for example, Parsely and Chartbeat) would have session IDs shared between these analytics providers.

This is a good point. Different analytics vendors (and different session tracking use cases within a single analytics vendor) will want different IDs tracked to account for factors like different session lengths. It makes sense that however the session ID is stored, it is namespaced appropriately, like how the fallback cookie name for CLIENT_ID is allowed to be specified so that it won't clash with other cookies.

Thanks for the feedback! Really good points

My main concerns would be the value being propagated to non-AMP pages the same way that CLIENT_ID can

AMP Linker can be used to propagate the value during navigation. This will work on a non-AMP landing page, since it can read the query param and store it however it likes. But if the landing page is also AMP, there's currently no way to instruct AMP to write data to the localStorage. I can see a few ways to fix this.

  1. Introduce an additional opt_defaultValue to SESSION_ID. AMP will use the provided default value as the session_id if it can't find one in storage. Here it would be SESSION_ID(1800, QUERY_PARAM(sid))
  2. Utilize the existing cookies feature to write the session_id to a cookie on the landing page, then provide an additional opt_cookieValueOnOrigin. If provided, AMP will first look for the provided cookie value if the document is served from the origin. The caveat here is that this creates different code paths for cache versus origin.
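
To make the first option concrete, here is a rough sketch of the fallback order it implies: stored value first, then the default value carried in the URL, then a freshly generated id. `queryParam` and `resolveSessionId` are hypothetical names for illustration, and the example URL is invented. This is not how AMP evaluates `QUERY_PARAM` internally.

```typescript
// Hypothetical stand-in for QUERY_PARAM(name): reads one query parameter from a URL.
function queryParam(url: string, name: string): string | null {
  return new URL(url).searchParams.get(name);
}

// Hypothetical resolution order for SESSION_ID(expiration, opt_defaultValue).
function resolveSessionId(
  stored: string | null,
  defaultValue: string | null,
  generateId: () => string,
): string {
  if (stored !== null) {
    return stored; // an unexpired stored id always wins
  }
  if (defaultValue !== null && defaultValue !== "") {
    return defaultValue; // e.g. the id passed along by AMP Linker
  }
  return generateId(); // nothing to continue from: start a new session
}
```

For example, on a landing URL such as `https://example.com/landing?sid=abc123` with nothing in storage, the resolved id would be `abc123`, so the session continues across the navigation.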

The other concern we have is around the scope of the amp-analytics:session-id localStorage key. From the design document, it seems that a site with multiple AMP analytics integrations (with, for example, Parsely and Chartbeat) would have session IDs shared between these analytics providers. Naive use of this field seems like it might run afoul of privacy regulations like GDPR, though again I'm unsure of the details.

I think a good question here is whether we consider SESSION_ID a special version of CLIENT_ID or of PAGEVIEW_ID.
CLIENT_ID is persisted per user, per domain, and per vendor; PAGEVIEW_ID is persisted per user, per domain, per session (with the session defined by the AMP viewer).
CLIENT_ID can be used to identify a user, so we introduced scope to separate the id across different third parties, while PAGEVIEW_ID identifies a unique pageview and is shared across all third parties.
My understanding of SESSION_ID is that it should never be used to identify a user, and its usage is closer to that of PAGEVIEW_ID. To mitigate the privacy concern, we could reduce the maximum session time from 1 day to a few hours.

Different analytics vendors (and different session tracking use cases within a single analytics vendor) will want different IDs tracked to account for factors like different session lengths

I agree different vendors may have different session lengths. Ideally we'd let vendors customize the namespace and store a session id of their own. But the localStorage size limit is an issue here. I'd prefer to fix the length to 30mins if non-deterministic behavior is a concern when another analytics vendor is included.

Feedback from the design review

  1. No personally identifiable information would be allowed to be stored via the storage API. There's a privacy concern with storing the high-entropy data, and we need to get a privacy review before proceeding.
  2. Once we get approval from privacy review, we could enlarge the number of session id entries to more than one to satisfy the requirements of multiple vendors.

The privacy review won't be needed if there's a way to bypass the storage API. Is using sessionStorage an option here? It can't cover cases where the tab is closed, but it would still provide some session coverage.

ping @mattwelke @emmett9001 on thoughts of using sessionStorage as a backup option?
I understand sessionStorage has its limits, but as a user I don't close tabs that often : )

Another question: Does a lower entropy session id work here? (e.g. 0-10000)

sessionStorage provides slightly different lifecycle guarantees than Parsely's current 30-minute session cookies. In particular, sessionStorage of AMP session data would miss the long tail of 30-minute browsing sessions that span multiple tabs on the same website. I don't have hard numbers comparing the prevalence of this type of session to ones that live entirely within a single browser tab, but I suspect that the vast majority of sessions don't span tabs or browser restarts. Thus, I think Parsely could make use of an AMP session implementation based on sessionStorage.

Parsely's session IDs have very low entropy, so I don't see a reason that AMP session IDs also having low entropy would cause us any problems.

I don't have hard numbers comparing the prevalence of this type of session to ones that live entirely within a single browser tab, but I suspect that the vast majority of sessions don't span tabs or browser restarts.

@zhouyx

We've had the same discussions at GroupBy. We have no idea how many people are the "open 50 tabs and bounce between them" type and how many will prefer to just use one tab. We also don't know how often people are closing the tab or browser window completely and then coming back to complete a purchase. That's why, so far, our client side 30 minute expiry cookie works well. It accounts for all use cases. The proposed approach using sessionStorage would not work well for us. Note that this would only be a real concern for us if we were still doing this client side. With our new server side approach, we're fine. It may be reasonable to state "AMP can do x for you, if you need more, do it server side".

The low entropy is fine. If we were to use this, we would combine CLIENT_ID and the new low entropy value together to get the granularity we need to distinguish all sessions and do useful things with that data. I think that the vast majority of visitors would have no more than 10000 sessions, so that entropy would be high enough, unless there's a high chance of conflicts in the generated value each time it's generated.
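
The collision question raised here can be checked with the standard birthday-problem calculation. The sketch below assumes each session value is drawn uniformly and independently from the range, which the thread does not specify. `collisionProbability` and `compositeSessionKey` are hypothetical names.

```typescript
// Probability that at least two of `sessionCount` values collide when each is
// drawn uniformly at random from `idSpace` possibilities (birthday problem).
function collisionProbability(sessionCount: number, idSpace: number): number {
  let pAllDistinct = 1;
  for (let i = 0; i < sessionCount; i++) {
    pAllDistinct *= (idSpace - i) / idSpace;
  }
  return 1 - pAllDistinct;
}

// Hypothetical composite key: the low-entropy value only has to be unique
// among the sessions of a single client.
function compositeSessionKey(clientId: string, sessionValue: number): string {
  return `${clientId}:${sessionValue}`;
}
```

Under that assumption, `collisionProbability(100, 10000)` is roughly 0.39: a client with 100 sessions has about a 39% chance of seeing at least one repeated value, well before reaching 10000 sessions. Whether that matters depends on how far apart in time the colliding sessions fall.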

Thanks for the feedback! Sounds like sessionStorage is not the preferred solution, but a somewhat acceptable one : )

I have another idea. The privacy concern is around AMP's storage API. We could, however, use localStorage without this storage API. When the cached AMP doc is served from within an iframe, a localStorage value set within the iframe can't be persisted, so it acts like sessionStorage. But when the AMP doc is served from the origin (I think that's most likely the case when the user completes a purchase), localStorage will be used.
However, this will still require special handling when the cached AMP docs generate multiple session ids while the origin AMP doc sticks to one session.

I'll start the privacy review process with a low-entropy session id in the meantime. It would still be great if we can bypass the storage API and skip the process entirely : )

I don't see a lot of issues with using localStorage directly, as long as it's a responsible use.

Any updates on the privacy review?

Hi @emmett9001 Thanks for reaching out. The decision was to use localStorage and not the Storage API. Because no value will be stored to the AMP viewer, we didn't request a privacy review.

Thanks for the update @zhouyx. Where can I find information about when this functionality will be available? Maybe https://github.com/ampproject/amphtml/issues/1612?

Any updates @zhouyx?

Thanks for the reminder. @rebeccanthomas will work on this soon.

What's the latest, @rebeccanthomas?
