On 32-bit platforms, a Go int is an int32.
I worry about the use of int as the type for Github IDs in the go-github package.
Is there documentation to suggest that Github IDs are limited to 2 billion items? Seems unlikely. But I can't find anything.
I propose we keep "Number" fields as int, but change all the ID fields to be int64.
Thoughts?
/cc @gmlewis @willnorris @shurcooL
I totally agree. Absolutely.
I've noticed that too.
By now, there are definitely some IDs that exceed the capacity of int32. I have the following example. Consider the endpoint that returns all public events:
The first event I got from a query just now was:
{
"id": "5510765310",
"type": "WatchEvent",
"actor": {
"id": 12230841,
},
"repo": {
"id": 76620213,
},
"payload": {
"action": "started"
},
"public": true,
"created_at": "2017-03-17T01:19:24Z"
}
The ID of that event exceeds 2^31 - 1.
So far we've been using int IDs for consistency with existing fields. But it definitely makes sense to address all of them at once in a future change.
I propose we keep "Number" fields as int, but change all the
IDfields to beint64.
SGTM.
Actually, that's not a great example because https://godoc.org/github.com/google/go-github/github#Event.ID type is string, not int.
The fact that both events and notifications (types of objects that are very likely to have IDs surpass 2 billion) encode their IDs as JSON strings rather than JSON numbers suggests that other IDs may be much smaller.
Is there documentation to suggest that Github IDs are limited to 2 billion items? Seems unlikely. But I can't find anything.
I've looked, and couldn't find anything either. But this is something that's important for the API docs to specify.
Perhaps it'd be useful to send GitHub support an email to inquire about this. If they can assure us that those IDs are guaranteed to fit into int32, we won't have to make a change now. If they can confirm that int32 may not be enough, we can feel better about making the change.
If they can assure us that those IDs are guaranteed to fit into int32
It's a global ID, right? It's not an ID scoped to just a single org or repo, right? If so, we don't need their confirmation. 2 billion is too small.
because https://godoc.org/github.com/google/go-github/github#Event.ID type is string
That is its own question. Why is it a string and not some integer type? Can it actually be non-numeric?
It's a global ID, right? It's not an ID scoped to just a single org or repo, right?
Which ID are you talking about? As far as I know, different types of objects have different ID namespaces.
There are IDs for users, repositories, events, notifications, gists (these are non-numbers), installations, etc.
I'm sure that at least the user and repo IDs are separate namespaces (so may share equal IDs; here is user with ID 1, and here's repo with ID 1), but not completely sure about the others.
Why is it a string and not some integer type?
The JSON object actually contains a JSON string, so it has to be decoded into a string.
We could consider the "string" option that encoding/json offers, then decoding into int64 is possible:
The "string" option signals that a field is stored as JSON inside a JSON-encoded string. It applies only to fields of string, floating point, integer, or boolean types. This extra level of encoding is sometimes used when communicating with JavaScript programs:
Int64String int64 `json:",string"`
But then we'd need to be very confident it would always work (not contain non-numeric characters, and not overflow int64). It's easier to be confident that string will always work.
Both Event.ID and Notification.ID were added in 2013-2014, and I didn't see any discussion about the decision to use string type for ID there (maybe the "string" option wasn't even in encoding/json back then).
Can it actually be non-numeric?
It doesn't seem that way. I find it somewhat hard to feel confident if GitHub API docs don't actually specify this.
On the other hand, Gist IDs are non-numeric, and it's clear from examples.
I didn't know (or don't know whether) there is an ID namespace per resource type. But that also doesn't change my point. That just gives us ~5x more headroom if so.
If we're concerned about quoted integers versus not, we could make our own numeric ID type:
type ID int64
... that implements https://golang.org/pkg/encoding/json/#Unmarshaler and deals with string-quoted integers or non-integers and not work about ",string" or changes to the encoding over time. The accessor methods could return int64 instead of github.ID.
For Gists or other hashes-as-IDs, I'd just keep it as "ID *string".
Relevant issue (in a theoretical discussion sense, not practical one):
@shurcooL @gmlewis I see that there was agreement on changing to int64. What's the take on this at this point?
There is no agreement at this time. We're waiting on someone to run into a concrete issue before we can decide how to act on this. So far, there have not been any issue reports, so it's better to wait.
Go 2 might change int to be arbitrary precision (golang/go#19623), so it would be better not to change to int64 if that happens.
Also, see what I wrote in https://github.com/google/go-github/pull/733#discussion_r142304824 relatively recently:
IMO it's unclear what the state of #597 is right now. The issue has been opened a while ago, but I haven't seen any reports of issues from production use. I'm not convinced that we should act right away, or how we should act. I think that we need to collect more data in that issue (or, wait until there's a concrete need to act).
Makes sense! Thanks for clearing, @shurcooL
A relevant comment from GitHub Staff at https://platform.github.community/t/a-few-issues-ive-had-with-the-projects-preview-api/531/3:
When we refer to something as an "int", it's a 32-bit int. That being said, I'd like to stress that it may change in the future, and we'd do our best to communicate that to you should that day come.
That, to me, supports the idea that we don't need to act preemptively at this time, and waiting is a sound strategy.
We've had a report of this issue causing real errors in production on a 32-bit architecture in #806. /cc @maruel
It had to do with RepoStatus.ID ID specifically, but the value "4396447431" provided there was within int64 but outside of int32 range. That means go-github, as is, couldn't be used on a 32-bit arch.
In the light of the recent crash report, are we actively interested in migrating all the ID fields corresponding to GitHub IDs to a more suitable type?
In continuation of the same comment from GitHub Staff as linked above,
Based on that information, we'd recommend using an unsigned 64-bit integer.
Also, shouldn't uint64 be the desired final type instead of int64 based on the recommendation?
cc @shurcooL
@kshitij10496 Are you using go-github on a 32-bit environment? The problem is only relevant there.
Also, shouldn't uint64 be the desired final type instead of int64 based on the recommendation?
I had the same thought. That's one of the options we should consider (and why I didn't want to accept PR #807 before discussing all the options in here first).
My vote against uint64 and for int64 instead. IMHO you shouldn't use unsigned unless strictly necessary and I don't think this is not the case. Preferably via type ID int64 as proposed above.
@kshitij10496 Are you using
go-githubon a 32-bit environment? The problem is only relevant there.
@shurcooL I thank you for your concern. Currently, all my systems are 64-bit architecture so this doesn't directly affect me. (I did figure the implications of this issue after reading this thread and the external resources linked to it.) I thought of addressing the unsigned numeric unint64 type as an alternative since it had not been mentioned above and I wondered if we might have overlooked it somehow.
IMHO you shouldn't use unsigned unless strictly necessary and I don't think this is not the case.
Thanks @maruel for this great advice. I will make a note of it for future references.
I just have a couple of questions regarding usage of signed numeric types here. It would be a good learning experience for me if you can answer them:
int64 is capable of storing(namely, 2^63-1)?Go style is to use signed integers by default. Use of unsigned is reserved for bizarre cases when it's absolutely required.
Some examples of Go using signed:
int, even though it can never be negativeint64, even though you can never have negative-sized filesintGo style is to use signed integers by default.
That makes sense in situations where the integer is often manipulated (and therefore, may end up being negative, even if it can never start out negative).
That's the best reason I know why the builtin len returns a signed int, it's so you can safely do this without an infinite loop due to underflow:
for i := len(s) - 1; i >= 0; i-- {
// use s[i]
}
I'm not sure if the same rationale applies to IDs, since as I understand, they're not expected to be manipulated (by adding, subtracting amounts). They're only assigned or copied. So using an unsigned type can be acceptable.
For completeness, some examples of Go using unsigned ints (where it's appropriate):
uint64 type for a sequence number in a binary network protocol.uint32 to represent alpha-premultiplied color channel values.uint64 for file size within a header, to match the zip spec.Sure. I wouldn't object to uint64 in this case if it's required. But I would be very surprised if GitHub uses the high bit of those 64-bit IDs, as that would be painful for their Java and JavaScript users.
Talked this through with @gmlewis a bit, and both of us are having a hard time forming a strong opinion either way on signed versus unsigned int64. I think @bradfitz is right that it is very unlikely that GitHub ever uses the high bit, and my gut tells me that signed int64 would me slightly easier to work with, since I suspect that's what people are using in other backends, and so will result in less casting. So let's migrate all of these IDs to int64.
I'll try and get #772 resolved and submitted, so we've got proper versioning in place before this lands. But someone can go ahead and start the work on it.
But someone can go ahead and start the work on it.
As part of working on it, I'd like to see a comparison and evaluation of the various alternative ways this can be done.
For example, this can be implemented as replacing int with int64 for all ID fields, or by defining a new type in go-github, ala type ID int64. Which should we choose and why?
Yet another option that I didn't see mentioned here is to not make changes to the package itself, but instead solve this issue in a similar way to the Google App Engine usage:
https://github.com/google/go-github#google-app-engine
Basically, to leave go-github as is unsuitable for 32-bit architectures, but with a way to modify the package to be suitable (in an automatable way). Perhaps offer a copy of go-github for 32-bit architectures.
(It might not be the best solution, but I wanted to mention it anyway so it can be considered. This approach might be more compatible with Go 2 if https://github.com/golang/go/issues/19623 is accepted, but that's far away and not guaranteed.)
A new type would only be beneficial for documentation and/or putting methods on it, but has the downside of likely requiring more conversions for callers.
So the question is whether the documentation would be worth it and/or whether we need any methods on it (e.g. MarshalJSON). I imagine no to both. It seems to document itself (the fields are already named as IDs) and the default JSON marshalling should suffice, unless we need quoted strings numbers larger than 52 bits or whatever for JavaScript. But given that we/GitHub doesn't do that already, I can't imagine we'd need it?
Yet another option that I didn't see mentioned here is to not make changes to the package itself, but instead solve this issue in a similar way to the Google App Engine usage:
That seems gross and unnecessary. Also, App Engine supports Go 1.8 now, so even that existing hack should be deleted.
After thinking about the alternatives, I'm convinced. Changing the ID fields from int to int64 (and incorporating that policy for all future PRs) seems like the best overall choice.
Go style is to use signed integers by default. Use of unsigned is reserved for bizarre cases when it's absolutely required.
Thanks, @bradfitz for the guideline on the conventional use of numbers. I'm sure this will come in handy for me in the future.
There are two considerations that I would like to make with regards to the migration of ID fields from int to int64:
int64 rather than ints.Int64 similar to Int which can be used to allocate int64 values. Given the fundamental role of ID fields in most of the types and the widespread use of Int by many other fields, I think it would be reasonable to create a new helper which can assist with the allocation of ID values.I can give this a try if no one has already started working on it.
Here is a list of all the fields which correspond to GitHub IDs. This can act as an exhaustive, lookup reference while updating the methods which deal with IDs of particular structs as parameters. Some of them are referred as strings which we can ignore. However, I thought of mentioning them as well for the sake of completeness.
If I have somehow missed any type, feel free to add them to the table.
| Field | Field Type | Struct | File
--------| ------|------|------|
| ID | *string| Event| activity_events.go
| ID | *string| Notification| activity_notifications.go
| ID | *int | TeamLDAPMapping | admin.go
| ID | *int | UserLDAPMapping | admin.go
| ID | *int | App | apps.go
| ID | *int | Installation | apps_installation.go
| ID | *int | MarketplacePlan | apps_marketplace.go
| ID | *int | MarketplacePlanAccount | apps_marketplace.go
| ID | *int | Authorization | authorizations.go
| ID | *int | Grant | authorizations.go
| ID | *int | PageBuildEvent | event_types.go
| HookID | *int | PingEvent | event_types.go
| AfterID | *int | ProjectCardEvent | event_types.go
| AfterID | *int | ProjectColumnEvent | event_types.go
| PushID | *int | PushEvent | event_types.go
| ID | *int | PushEventRepository | event_types.go
| ID | *int | StatusEvent | event_types.go
| ID | *string | Gist | gists.go
| ID | *string | GistFork | gists.go
| ID | *int | GistComment | gists_comment.go
| ID | *int | Issue | issues.go
| ID | *int | IssueComment | issues_comments.go
| ID | *int | IssueEvent | issues_events.go
| ID | *int | Label | issues_labels.go
| ID | *int | Milestone | issues_milestones.go
| ID | *int | Timeline | issues_timeline.go
| ID | *int | Source | issues_timeline.go
| ID | *int | Migration | migrations.go
| ID | *int | SourceImportAuthor | migrations_source_import.go
| RemoteID | *string | SourceImportAuthor | migrations_source_import.go
| OID | *string | LargeFile | migrations_source_import.go
| ID | *int | Organization | orgs.go
| ID | *int | Team | orgs_teams.go
| ID | *int | Invitation | orgs_teams.go
| ParentTeamID | *int | NewTeam | orgs_teams.go
| ID | *int | Project | projects.go
| ID | *int | ProjectColumn | projects.go
| ID | *int | ProjectCard | projects.go
| ColumnID | *int | ProjectCard | projects.go
| ContentID | *int | ProjectCardOptions | projects.go
| ColumnID | *int | ProjectCardMoveOptions | projects.go
| ID | *int | PullRequest | pulls.go
| ID | *int | PullRequestComment | pulls_comments.go
| InReplyTo | *int | PullRequestComment | pulls_comments.go
| ID | *int | PullRequestReview | pulls_reviews.go
| CommitID | *string | PullRequestReview | pulls_reviews.go
| CommitID | *string | PullRequestReviewRequest | pulls_reviews.go
| ID | *int | Reaction | reactions.go
| ID | *int | Repository | repos.go
| TeamID | *int | Repository | repos.go
| Since | int | RepositoryListAllOptions | repos.go
| ID | *int | Contributor | repos.go
| TeamID | []int | TransferRequest | repos.go
| ID | *int | RepositoryComment | repos_comments.go
| ID | *int | Deployment | repos_deployments.go
| ID | *int | DeploymentStatus | repos_deployments.go
| ID | *int | Hook | repos_hooks.go
| ID | *int | RepositoryInvitation | repos_invitations.go
| ID | *int | RepositoryRelease | repos_releases.go
| ID | *int | ReleaseAsset | repos_releases.go
| ID | *int | RepoStatus | repos_statuses.go
| ID | *int | User | users.go
| Since | int | UserListOptions | users.go
| ID | *int | GPGKey | users_gpg_keys.go
| PrimaryKeyID | *int | GPGKey | users_gpg_keys.go
| ID | *int | Key | users_keys.go
Edited the table above to add these fields that were missing in the table but present in PR #816 (and they're valid ID fields):
MarketplacePlanAccount.ID
ProjectCard.ColumnID
Repository.TeamID
GPGKey.PrimaryKeyID
Resolved via #816, closing.
Most helpful comment
That makes sense in situations where the integer is often manipulated (and therefore, may end up being negative, even if it can never start out negative).
That's the best reason I know why the builtin
lenreturns a signed int, it's so you can safely do this without an infinite loop due to underflow:I'm not sure if the same rationale applies to IDs, since as I understand, they're not expected to be manipulated (by adding, subtracting amounts). They're only assigned or copied. So using an unsigned type can be acceptable.
For completeness, some examples of Go using unsigned ints (where it's appropriate):
uint64type for a sequence number in a binary network protocol.uint32to represent alpha-premultiplied color channel values.uint64for file size within a header, to match the zip spec.