Prebid.js: Types of user ids

Created on 21 Sep 2020 · 25 Comments · Source: prebid/Prebid.js

Type of issue

Question

Description

Do consumers of user ids set within either a new or existing userid module need to know more about how the UUID was generated, or is the id itself sufficient? Would more DSPs integrate against a particular user id if they knew more about how it was generated?

We should consider a new attribute called "stype" (source type). The type would be passed alongside the UUID to SSPs & DSPs.

Steps to reproduce

Test page

Expected results

pbjs.setConfig({
    userSync: {
        userIds: [{
            name: "publisherProvided",
            params: {
                eids: [{
                    source: "example.com",
                    atype: 1,
                    uids: [{
                        id: "value read from cookie or local storage",
                        ext: {
                            stype: "sha256email"
                        }
                    }]
                }, {
                    source: "id-partner.com",
                    atype: 1,
                    uids: [{
                        id: "value read from cookie or local storage",
                        ext: {
                            stype: "ppuid"
                        }
                    }]
                }]
            }
        }]
    }
});
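
For illustration, here is a minimal sketch of how a consumer of these eids (an SSP or DSP, say) might read the proposed attribute. The helper name `listIdsWithSourceType` and the `"unknown"` fallback are assumptions made for this example; `ext.stype` is only the attribute proposed in this issue, not an existing standard field.

```javascript
// Hypothetical helper: flatten an eids array into one record per uid.
// `ext.stype` is the attribute proposed in this issue, not a standard field.
function listIdsWithSourceType(eids) {
  const result = [];
  for (const eid of eids || []) {
    for (const uid of eid.uids || []) {
      result.push({
        source: eid.source,
        id: uid.id,
        // Fall back to "unknown" when no source type is declared (an assumption).
        stype: (uid.ext && uid.ext.stype) || "unknown",
      });
    }
  }
  return result;
}

const exampleEids = [
  { source: "example.com", atype: 1, uids: [{ id: "abc", ext: { stype: "sha256email" } }] },
  { source: "id-partner.com", atype: 1, uids: [{ id: "def", ext: { stype: "ppuid" } }] },
  { source: "no-stype.example", atype: 1, uids: [{ id: "ghi" }] },
];

console.log(listIdsWithSourceType(exampleEids));
```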

Source

jdwieland8282

Most helpful comment

I would suggest NOT including any form of hashed email in the options for stype. a hashed email is like a fingerprint in that you can't "reset" it - once you have it, you will always have the same link to a set of user data, even if they've asked someone upstream to reset/clear their data. since there's no efficient way (today) to tell EVERY platform in the industry to wipe data for a user, the best way today is simply to generate a new user id. this is just like apple and android allowing you to reset your MAID. identity providers that base an ID off of a hashed email are fine since they can change the ID they generate if the user has asked to reset/opt out at some point.
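
A small sketch of the distinction being drawn here, assuming Node.js (with `crypto.randomUUID`, available since Node 14.17) and an assumed normalization of trim plus lowercase. The function names are hypothetical and this is not how any particular identity provider works.

```javascript
const nodeCrypto = require("crypto");

// Hashing a normalized email always yields the same value, so it cannot be "reset".
function hashEmail(email) {
  const normalized = email.trim().toLowerCase(); // assumed normalization
  return nodeCrypto.createHash("sha256").update(normalized).digest("hex");
}

// A randomly generated ID can be replaced at any time, breaking the link to old data.
function newResettableId() {
  return nodeCrypto.randomUUID();
}

const firstHash = hashEmail("User@Example.com ");
const secondHash = hashEmail("user@example.com");
console.log(firstHash === secondHash); // true: same email, same hash, every time

const idBeforeReset = newResettableId();
const idAfterReset = newResettableId();
console.log(idBeforeReset === idAfterReset); // false: a reset yields an unrelated ID
```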

smenzer on 24 Sep 2020

👍2

All 25 comments

smenzer on 24 Sep 2020

👍2

So far we have:

DMP - added by a 3rd party id provider like ID5, Liveramp, Lotame, etc..
PPUID - added by the publisher, the publisher can be identified in eids.source

jdwieland8282 on 30 Sep 2020

FWIW, in late July there was a discussion on https://openrtb-iabtechlab.slack.com/archives/C3Y6GHUTH/p1595433523254200 (TechLab Programmatic, general Slack). The use case is similar to the one here, signaling a stable, publisher-generated UID:

As a DSP, I want a site-specific/publisher-provided ID, to enable basic per-site frequency-capping at least, in the absence of a cross-site identifier, though there are probably other uses. I want this to be an ID generated by the site, common to all traffic for that site, i.e. perhaps generated by the PubCommon prebid.js module or similar, but how it gets made is outside of OpenRTB's scope I think.
...
FWIW, we accept eids with a "source" attribute of "pubcid.org" for that scenario.
...
re: eids, I think probably this should happen..
add an "agent type" for site-specific IDs. Somehow deal with that there won't be a "source", necessarily, if they self-generate them. "source" is defined currently as "Source or technology provider responsible for the set of included IDs. Expressed as a top-level domain."

The following was sketched but I don't recall seeing this discussion thread picked up again.

// Agent Type
// 0   A stable, publisher/site-provided identifier.
// ... etc from OpenRTB spec
//
"eids":[
{
   "source": "localhost",
   "uids": [
      { "id": "c4a4c843-2368-4b5e-b3b1-6ee4702b9ad6", "atype": 0 }
   ]
},
...
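
The per-site frequency-capping use case quoted above could be sketched roughly as follows. This is a hypothetical in-memory counter keyed by the eid `source` and `id`; the class name `FrequencyCap` and its method are assumptions for illustration, not part of OpenRTB or Prebid.js.

```javascript
// Hypothetical in-memory frequency cap keyed by (source, id).
class FrequencyCap {
  constructor(maxImpressions) {
    this.maxImpressions = maxImpressions;
    this.counts = new Map();
  }

  // Returns true and records the impression if the ID is still under the cap.
  tryServe(source, id) {
    const key = `${source}|${id}`;
    const seen = this.counts.get(key) || 0;
    if (seen >= this.maxImpressions) {
      return false;
    }
    this.counts.set(key, seen + 1);
    return true;
  }
}

const cap = new FrequencyCap(2);
const siteId = "c4a4c843-2368-4b5e-b3b1-6ee4702b9ad6";
const decisions = [1, 2, 3].map(() => cap.tryServe("pubcid.org", siteId));
console.log(decisions); // [ true, true, false ]
```

Because the key includes the source, the same id value arriving from a different source is counted separately.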

I pinged the channel to see if there was more discussion on eids enhancements.

dmdabbs on 1 Oct 2020

to me, DMP is too generic, and also identifiable simply by looking in the source field. I'm not sure exactly what all the right values are, but I think it's important to get some ideas from the consumers of the IDs (i.e. DSPs) to make sure it's useful.

On the ID5 side for example, we provide a field we call linkType that we use to signal how we linked two 1p IDs together - through no link (i.e. it's a publisher-only ID), through our probabilistic algo, or via deterministic signals. This would let consumers of the ID know the strength of the cross-domain reconciliation and allow them to make decisions on it. Perhaps standardizing something along these lines would be useful for the DSPs?
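
As a rough sketch of how a consumer could act on such a signal: the labels `none`, `probabilistic`, and `deterministic` and the weights below are hypothetical, chosen only for this example, and are not ID5's actual linkType encoding.

```javascript
// Hypothetical labels and weights; not ID5's actual linkType values.
const LINK_WEIGHTS = new Map([
  ["none", 0],            // publisher-only ID, no cross-domain link
  ["probabilistic", 0.5], // linked by an algorithm
  ["deterministic", 1],   // linked by deterministic signals
]);

// Unknown link types are treated as unlinked.
function crossDomainConfidence(linkType) {
  return LINK_WEIGHTS.has(linkType) ? LINK_WEIGHTS.get(linkType) : 0;
}

function canUseForCrossSiteTargeting(linkType, minConfidence) {
  return crossDomainConfidence(linkType) >= minConfidence;
}

console.log(canUseForCrossSiteTargeting("deterministic", 0.8)); // true
console.log(canUseForCrossSiteTargeting("probabilistic", 0.8)); // false
```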

smenzer on 1 Oct 2020

👍1

I agree that when describing IDs, it would be useful to distinguish among the various "dimensions" of IDs:

Describes Person or Device/App

Directly-Identifiable (e.g., email)
Pseudonymous (e.g., alphanumeric string)

Set/Link of IDs

Device graph
First-party sets?

Source type

Publisher or Brand (who from the consumer's point of view is also a publisher)
Vendor to publisher or brand (or their agents)

Actual source

Which domain of which organization generated/controls ID

Age of ID

Creation date
Last seen date

I would keep all the "dimensions" distinct from uses of ID

Preference management (e.g., opt-in/-out of personalization)
Engagement (or restrictions like frequency cap)
Measurement (distinct counting)

joshuakoran on 1 Oct 2020

Re @dmdabbs note, I think this is what you might be referring to?

This is the spec we're currently using with OpenRTB, https://github.com/Advertising-ID-Consortium/IdentityLink-in-RTB

jdcauley on 2 Oct 2020

@joshuakoran Just working my way through your list. I think the adcom "atype" field handles

Directly-Identifiable (e.g., email)
Pseudonymous (e.g., alphanumeric string)

- Can you further clarify what you meant by "Set/Link of IDs"? Do you expect this to be an array of other ids?
- Same question for "source type": what values do you expect here?
- "Actual source" would be the user id module name value, or "source" if different from name.

age of ID, seems easy enough.
Re dimensions, are you advocating a second array where uses of ID would be enumerated? or just point out that we shouldn't add these at all yet?

jdwieland8282 on 14 Oct 2020

@jdwieland8282

The "set/link of IDs" concept relates to sharing a common ID that maps one ID to other IDs (e.g., x-device link, link across two 1P domains such as required by first-party sets).

The source (or perhaps better name is "controller") of ID is due to permissions/permitted uses tied to ID.

The dimensions defining the ID (e.g., source/controller, type, time) are orthogonal to information tied to the ID (audience attributes/cohorts, restrictions against use for personalization, event-aggregates such as frequency counter, etc.)

joshuakoran on 14 Oct 2020

Hey @joshuakoran, mind adding a few example values for each that you think may be relevant? I want to be sure I'm clear on what you are suggesting.

jdwieland8282 on 16 Oct 2020

Sorry for the delay, finally coming up for air. Agree that adcom is the right model to improve upon.

As we think about reducing discrepancies and adopting cross-publisher common ID schemes, such as being discussed here and in IAB TL, it seems we can improve how we annotate the interoperable IDs being used to improve engagement, measurement and optimization.

The original question as I understood it was to provide enhanced standard descriptors (metadata) around user IDs + information associated with them, rather than the attribute data (e.g., interest taxonomies, demographic taxonomies, geo taxonomies) or event data (activity_type, optional value of activity such as a purchase transaction).

I think ID metadata can be broadly classed into two buckets: better describing the what-ness of the ID, and the “provenance” of the ID.

WHAT concepts
Some IDs describe people/households (such as home address), others describe web clients (like the alphanumeric strings stored in cookies). Privacy language calls the former “directly-identifiable” (to replace the more ambiguous term “PII”); when web activity is not associated with directly-identifiable IDs, privacy language calls these IDs “pseudonymous IDs.”

Thus example one might be to define whether the ID is pseudonymous or not.

The second type of ID is one that merely links other IDs, such as a “cluster ID.” This ID is generated server-side to associate various IDs together, either probabilistically OR deterministically. Marketers often use this for “x-device” or even same device “x-app” use cases. When publishers operating different domains link their IDs deterministically they may wish to create a shared ID for their use, which is analogous to the proposed “first-party” sets.

Thus example two might be to define whether the ID is deterministically associated with other IDs or not.

FROM WHERE “provenance” concepts
An orthogonal dimension to the ID we are discussing is its provenance. Which organization created it? Privacy regulations tend to call this the “data controller.”

When was it created? When was the last time it was verified as still active?

Syndicating “stale” IDs to be activated in a walled garden or across the Open Web is technically feasible, but not adding value to marketers. Yet most marketers do not have visibility on the age or last seen date of the data syndicated on their behalf to improve media buying.

Ensuring we know where IDs come from likely requires ensuring compact description and perhaps even signing the data.

USE concepts

I also recommend we keep the above annotations about IDs distinct from what processing operations are associated with them:

Preference management (e.g., opt-in/-out of personalization)
Engagement (or restrictions like frequency cap)
Measurement (distinct counting)
Audit (which ID was sent from which org to which other org, when, and what use restrictions were communicated)

Examples (purely for illustration and not in formal spec format or optimized for transport efficiency):
Zeta_Pseudonymous_ID=123, pseudonymous, created 20200915, last_event=20201025
Zeta_Pseudonymous_ID=234, pseudonymous, created 20201001, last_event=20201026
[email protected], directly-identifiable, created 20201001, last_event=20201027
Zeta_Household_ID=abc, pseudonymous, probabilistic_set {ZPID=123, ZPID=234}, created=20201027
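
To make the "stale ID" point concrete, here is a minimal sketch that checks how long ago an ID was last seen, using two of the illustrative records above. The object shape, function names, and the 6-day threshold are assumptions for this example only.

```javascript
// Parse a YYYYMMDD string (the format used in the illustration above) as a UTC date.
function parseYmd(ymd) {
  const year = Number(ymd.slice(0, 4));
  const month = Number(ymd.slice(4, 6));
  const day = Number(ymd.slice(6, 8));
  return new Date(Date.UTC(year, month - 1, day));
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between the record's last event and the given reference date.
function daysSinceLastEvent(record, asOfYmd) {
  return Math.round((parseYmd(asOfYmd) - parseYmd(record.lastEvent)) / MS_PER_DAY);
}

// An ID is treated as stale when it has been idle longer than maxIdleDays.
function isStale(record, asOfYmd, maxIdleDays) {
  return daysSinceLastEvent(record, asOfYmd) > maxIdleDays;
}

const idRecords = [
  { id: "Zeta_Pseudonymous_ID=123", kind: "pseudonymous", created: "20200915", lastEvent: "20201025" },
  { id: "Zeta_Pseudonymous_ID=234", kind: "pseudonymous", created: "20201001", lastEvent: "20201026" },
];

for (const record of idRecords) {
  console.log(record.id, daysSinceLastEvent(record, "20201101"), isStale(record, "20201101", 6));
}
```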

joshuakoran on 28 Oct 2020

Thanks @joshuakoran, what you're describing is going to be tough to express in JSON in a way that makes sense to everyone. Let me take a first stab and we can iterate. Wrt provenance, I feel like the source and stype values do a good job describing that, so I'm going to leave them out for now.

jdwieland8282 on 3 Nov 2020

how about something like this? Anything else to add?

```
"user":{
  "ext":{
    "eids":[
      {
        "source":"sharedid.org",
        "uids":[
          {
            "id":"d88c96-5cb6-410d-827d-b019e476",
            "atype":1,
            "ext":[
              {
                "stype":"ppuid", //ppuid,dmp,sha256email
                "origin":"person", //person, household, browser, device, gaming console
                "pseudonymous":true, //boolean
                "deterministic":false, //boolean
                "created":"1604429992", //UNIX timestamp
                "lastseen":"1604430025", //UNIX timestamp
                "signature":[
                  {
                    "signedby":"cryptoboi",
                    "signature":"cryptostring"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
```

jdwieland8282 on 3 Nov 2020

I am assuming all these params and values have to be well defined for any consumer to make sense of them. Wouldn't it be better if we mapped combinations of origin, pseudonymous and deterministic to custom atype values and published them? That would reduce payload and be easy to extend without adding extra parameters.

abhinavsinha001 on 4 Nov 2020

Hi @abhinavsinha001, I think you've raised a very good point. To be clear, I don't have a strong opinion yet about what this should look like; I'm channeling the Identity PMC. But to your point about well defined values you are exactly right. We need a way to ensure that creators don't declare their ID deterministic when it isn't. Wrt pseudoanonymous, all ids except email address is pseudoanonymous, and even email can be pseudoanonymous. So in my mind pseudoanonymous should go entirely.

The consumer in this scenario is a DSP.

As far as mapping pseudoanonymous and deterministic to a custom atype, atype isn't well understood or used. In theory that sounds like a good idea to me but in practice I'm not sure it would work. Thanks for your comments, what would be really helpful is a modified example. I don't want to be the only one doing the data modeling.

jdwieland8282 on 4 Nov 2020

Hi Jeff -

"Wrt pseudoanonymous, all ids except email address is pseudoanonymous"

I think that while many IDs we rely on may begin as "pseudonymous," I believe the regulations require organizations to have appropriate technical and/or operational measures in place to keep people's activity distinct from their offline identity (directly-identifiable ID, fkna PII) to be classed as "pseudonymous."

joshuakoran on 5 Nov 2020

sure, no disagreement from me on that pt.

jdwieland8282 on 5 Nov 2020

since the primary consumers here are DSPs, can we get some of them to weigh in on what they'd want to see and whether they want the granularity of separate fields or a single field like atype?

smenzer on 5 Nov 2020

I agree we should get feedback from DSPs on this. I feel most of the parameters do not have any significance individually and can be represented broadly using atype values.

Sample request leveraging atype value

{
  "eids": [
    {
      "source": "sharedid.org",
      "uids": [
        {
          "id": "d88c96-5cb6-410d-827d-b019e476",
          "atype": 501,
          "ext": [
            {
              "created": "1604429992",
              "lastseen": "1604430025",
              "signature": [
                {
                  "signedby": "cryptoboi",
                  "signature": "cryptostring"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Here is how we can maintain metadata for atype and parameters that define a particular atype value.

Atype Metadata

| Atype | Description |
|:------|:------------|
| 1 | An ID which is tied to a specific web browser or device (cookie-based, probabilistic, or other). |
| 2 | In-app impressions, which will typically contain a type of device ID (or rather, the privacy-compliant versions of device IDs). |
| 3 | A person-based ID, i.e., that is the same across devices. |
| 500x | All the IDs generated by publishers (stype:ppuid) |
| 501 | stype:ppuid, origin:browser, deterministic:true, method:login, scope:individual, duration:short |
| 501 | stype:ppuid, origin:browser, deterministic:true, method:localstore, scope:individual, duration:medium |
| 600x | All the IDs acquired from some DMP (stype:dmp) |
| 601 | stype:dmp, origin:browser, deterministic:true, method:transaction, scope:individual, duration:long |
| 601 | stype:dmp, origin:browser, deterministic:true, method:transaction, scope:household, duration:long |
| 700x | All identifiers generated using some link like IP/Device (stype:probabilistic) |
| 701 | stype:probabilistic, origin:gaming-console, deterministic:false, method:algo, scope:household, duration:short |

ID metadata Params

| Parameter | Description |
|:----------|:------------|
| stype | Type of source which generated this ID |
| origin | Where this ID was generated / stored |
| deterministic | Whether the ID can be confidently tied to a browser/person |
| method | How the ID was acquired: login, some transaction like a purchase, an algorithm, or a traditional sync |
| scope | Whether this ID represents an individual or a household |
| duration | The time this ID can typically last: short < 7 days, medium < 30 days, long > 30 days |
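
A minimal sketch of how a consumer might resolve an atype value against published metadata like the tables above. The registry below is illustrative only: the table lists two rows each for 501 and 601, so this sketch keeps just the first of each, and it treats exactly 30 days as "long", which the table leaves undefined. The names `ATYPE_METADATA`, `describeAtype`, and `durationBucket` are assumptions.

```javascript
// Illustrative registry built from the table above (first row kept for 501 and 601).
const ATYPE_METADATA = new Map([
  [501, { stype: "ppuid", origin: "browser", deterministic: true, method: "login", scope: "individual", duration: "short" }],
  [601, { stype: "dmp", origin: "browser", deterministic: true, method: "transaction", scope: "individual", duration: "long" }],
  [701, { stype: "probabilistic", origin: "gaming-console", deterministic: false, method: "algo", scope: "household", duration: "short" }],
]);

// Returns the metadata for a known atype, or null when the value is not registered.
function describeAtype(atype) {
  return ATYPE_METADATA.get(atype) || null;
}

// Bucket a typical lifetime in days: short < 7, medium < 30, otherwise long.
// Treating exactly 30 days as "long" is an assumption.
function durationBucket(days) {
  if (days < 7) return "short";
  if (days < 30) return "medium";
  return "long";
}

console.log(describeAtype(501));
console.log(describeAtype(999)); // null
console.log(durationBucket(10)); // medium
```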

abhinavsinha001 on 5 Nov 2020

Update: Just realized while on IAB-TL meeting - most of the fields and data are part of Data Transparency Standard 1.0 and there is an active discussion to map these fields to oRTB User object - we can use the same standards for eids type as well.

abhinavsinha001 on 5 Nov 2020

ok, so sounds like we have something that describes the type of user id in the atype field and it's just a matter of defining how we want to support the atype designation:

created
last seen
signed

I'd like to pause here, now that we have some firmer requirements and wait for DSPs to weigh in. Any disagreement with that approach?

jdwieland8282 on 9 Nov 2020

@abhinavsinha001 I like your example. For anyone who missed the 11/11 Identity PMC meeting, we agreed to move forward with this feature. The group felt we should proactively provide some real time metadata about the id to buyers in preparation for a future state with diminished 3rd party cookie availability.

Each UserId module sub adapter will need to decide whether to support these fields. The PMC will define the standard. Are there any objections to @abhinavsinha001's data model? I'll cross post on our slack channel as well.

```
{
  "eids": [
    {
      "source": "sharedid.org",
      "uids": [
        {
          "id": "d88c96-5cb6-410d-827d-b019e476",
          "atype": 501,
          "ext": [
            {
              "created": "1604429992",
              "lastseen": "1604430025",
              "signature": [
                {
                  "signedby": "cryptoboi",
                  "signature": "cryptostring"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```

jdwieland8282 on 11 Nov 2020

Just FYI PRAM is suggesting three types of Identifiers:
1) system-generated pseudonymous ID (e.g., cookie or MAID),
2) user-provided ID (e.g., hashed email) and
3) directly-identifiable identity (publisher-agnostic offline identity)

We can augment this by creation/last seen as described above + source (e.g., publisher, vendor, marketer), such that vendor=apple provides IDFA, and vendor=sharedid.org provides cookie ID.

joshuakoran on 20 Nov 2020

@joshuakoran I don't really understand the difference between 1. and 2. ... could you please explain a bit?

smenzer on 20 Nov 2020

Even if the output is a pseudonymous ID, the input mechanism has different friction/control for users.

The user has binary control of generating / resetting ID in 1), but limited technical control over how the ID can be shared across domains.

The user has 100% technical control of providing (different/same) ID to be shared across domains for 2). Once the ID in 2) is generated it has the same limits as 1), but the generation using different IDs (work email, home email as one example) is different than using the same laptop with same browser cookies at home and work.

joshuakoran on 20 Nov 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.