Go-github: add GitHub URL parsing utility function

Created on 21 Jun 2016  路  10Comments  路  Source: google/go-github

It seems that clients of this package frequently need to parse GitHub URLs in order to validate and/or extract useful information from them. It would be nice to have all this code deduplicated into a single, central location, and this package might be a nice home for it.

Maybe something like:

func ParseURL(url string) (owner, repo, ref string, err error) {...}

Common URLs to parse could be ones similar to:

https://github.com/bazelbuild/bazel/tree/master
http://github.com/bazelbuild/bazel/tree/master
https://www.github.com/bazelbuild/bazel/tree/master
https://github.com/bazelbuild/bazel/releases/tag/0.2.1
https://github.com/bazelbuild/bazel/commit/19b5675725caf69008a717082149237400260edc
https://github.com/bazelbuild/bazel/archive/0.2.1.zip

Thoughts?

enhancement

All 10 comments

obviously, 馃憤 from me since we just talked about it :)

Just to repeat what we talked about, we have at least four places (that I know of) inside Google where we are parsing URLs to extract one or more of these values. Most of the time it's just owner and repo, but in one case, we extract that ref as well and support all of the various forms posted above.

I've been poking around to find a definitive reference for what is a valid GitHub URL and/or ref, but the closest thing I've found so far is: https://git-scm.com/docs/git-check-ref-format

I'll keep searching and update this comment if I find a better source. If anyone knows a better one, I'd like to hear about it.

@gmlewis FWIW, I have never seen these URL's change in my 6 years of using GitHub. The only URL's I have seen change are those to issues and pull requests, e.g. https://github.com/google/go-github/pulls/123 used to resolve to pull 123, but now it's https://github.com/google/go-github/pull/123.

I can see this being useful in situations where you need to convert a repo URL to a owner and repo name, so that you can call (*RepositoriesService) Get(owner, repo string). But I don't really understand how ref would would, or how it would be used.

I don't have objections. I've seen many projects have an internal helper that parses a GitHub repo URL and returns owner, repo, and err.

What if we change the ref name to rest or remainder or remaining to signify that this followed the owner/repo part? Of course, owner would never be empty but repo and rest could be empty.

The one place that we use ref is in a tool that we use to lint a GitHub repo for compliance with our open source policies (does it contain a LICENSE file, is it mixing in third-party code outside of a "third_party" or "vendor" directory, etc). It's a command line tool that takes a GitHub URL as its input. most of the time, you would just specify something like https://github.com/google/go-github which would just use the default branch, but we also wanted to support specifying a particular ref to check.

Though now I'm thinking this may be too specific of a use case to add to the main library (at least the ref part). For example, just looking at this issue, being able to parse https://github.com/google/go-github/issues/380 into its component parts would be equally useful.

And I'm not crazy about a rest or remainder style return value. Just feels kinda clunky.

I guess I've shared some more thoughts on this proposal in the PR.

And I'm not crazy about a rest or remainder style return value. Just feels kinda clunky.

To me, the proposed utility sounded _more_ useful when it returned (owner, repo, rest string, err error), than without the rest value. Because that way, it could be used to parse exact https://github.com/owner/repo URLs by checking that err == nil && rest == "". But without the rest argument, the utility becomes less useful in my eyes, because you can no longer be as selective about the types of URLs it's used to accept.

That said, I don't think rest is great either, but that's why I'm not a huge supporter of this enhancement; it doesn't seem possible/easy (to me) to offer a valuable general helper, since the task is so easy (given that we already have url.Parse and strings.SplitN) and each project's needs may be slightly different.

Will and I have decided to not go through with this. Thank you for your thoughtful comments and time invested, @shurcooL.

(Commenting on a closed issue, but wanted to share something I learned today about url.Parse...)
I expected that if I passed two URLs to url.Parse separated by by ", " that I would get an error, but I don't:

https://play.golang.org/p/LmZKo1Rp4k

Sounds good, and no problem, happy to have a technical discussion that leads to less code being created. :)

About url.Parse, I'm not surprised. I too learned some time ago that it's actually very hard to get an error from url.Parse. It's not suitable for detecting whether something looks like it could be an actual URL or detecting the boundaries of a URL within a larger body of text. But it's great for actually parsing something that is known to be a URL.

Also, https://github.com/bazelbuild/bazel/1, https://github.com/bazelbuild/bazel/2 (or something equivalent with more escaped characters) is technically a valid, albeit unusual, URL.

E.g.:

Here's what dump_httpreq prints for that URL:

req.RequestURI = (string)("/https://github.com/bazelbuild/bazel/1,%20https://github.com/bazelbuild/bazel/2")
req.Host = (string)("localhost:8080")
req.URL = (*url.URL)(&url.URL{
    Scheme:   (string)(""),
    Opaque:   (string)(""),
    User:     (*url.Userinfo)(nil),
    Host:     (string)(""),
    Path:     (string)("/https://github.com/bazelbuild/bazel/1, https://github.com/bazelbuild/bazel/2"),
    RawPath:  (string)(""),
    RawQuery: (string)(""),
    Fragment: (string)(""),
})
req.URL.Query() = (url.Values)(url.Values{})
req.Referer() = (string)("")
req.RemoteAddr = (string)("127.0.0.1:56832")
...

So it'd be a bug if url.Parse failed to parse it.

Was this page helpful?
0 / 5 - 0 ratings