Https-everywhere: Signify that a PR affects a site in the Alexa top 1M

Created on 18 Aug 2016 · 32 Comments · Source: EFForg/https-everywhere

It would be nice if we had a badge next to PRs (similar to the Travis checkmark) which signifies if a PR affects any domains in the Alexa top X sites. Perhaps different color badges if it's in the Alexa top 1M, 100k, 10k, etc.

This would make it easy to see at a glance which PRs are most important for inclusion.

I don't know what the easiest way to do this is, though...

All 32 comments

Barring that, maybe we could just use the GitHub API to write a script that goes over all pending ruleset PRs and lists those that are in the Alexa top X.
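A minimal sketch of the first step of such a script: finding which ruleset files a PR touches. The endpoint shape is GitHub's "list pull request files" REST call; the `RULES_DIR` path and the helper names are assumptions made for illustration, not taken from the labeller that was eventually written.

```python
import json
import urllib.request
from typing import Dict, List

API = "https://api.github.com/repos/EFForg/https-everywhere"
# Assumed location of ruleset files inside the repository.
RULES_DIR = "src/chrome/content/rules/"


def pr_files_url(number: int, page: int = 1) -> str:
    """URL listing the files changed by a pull request (paginated)."""
    return f"{API}/pulls/{number}/files?per_page=100&page={page}"


def fetch_json(url: str):
    """Perform one unauthenticated GET request and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def ruleset_paths(files: List[Dict]) -> List[str]:
    """Keep only the changed files that look like rulesets."""
    return [
        f["filename"]
        for f in files
        if f["filename"].startswith(RULES_DIR) and f["filename"].endswith(".xml")
    ]
```

Calling `ruleset_paths(fetch_json(pr_files_url(n)))` for each open PR number `n` would give the files whose hosts need to be looked up in the Alexa list.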

@J0WI, @fuglede, @jeremyn if you take a look at the current list of PRs, all the open ones up to and including #6476 have now been checked against the Alexa top 1M sites.

If any host in any file modified by a PR is in the Alexa top 1M, the PR gets one of the following labels, depending on the host's position in the list:

  1. top-100
  2. top-1k
  3. top-10k
  4. top-100k
  5. top-1m

This corresponds with how popular the site is. I've also color-coded the labels so that (1) is bright red, and as you move down to (5) the color gets progressively less red. This is to make it easy to see which ruleset PRs are more important than others. My hope is that this will make it easier to determine which PRs to prioritize.

Currently I'm running a script over the PRs manually. Next week I'll try to automate it, so that any newly opened PR is automatically run through the script and gets a label applied if needed.
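The bucketing described in this comment can be sketched as follows. This is an illustration rather than the actual labeller: the function names are hypothetical, `ranks` stands for a host-to-rank mapping loaded from the Alexa list, and choosing the best-ranked host when a PR touches several is an assumption.

```python
from typing import Dict, Iterable, Optional

# Inclusive upper rank bound for each label, most popular first.
LABEL_THRESHOLDS = [
    (100, "top-100"),
    (1_000, "top-1k"),
    (10_000, "top-10k"),
    (100_000, "top-100k"),
    (1_000_000, "top-1m"),
]


def label_for_rank(rank: int) -> Optional[str]:
    """Return the tightest label whose bound contains the rank, if any."""
    for limit, label in LABEL_THRESHOLDS:
        if rank <= limit:
            return label
    return None


def label_for_hosts(hosts: Iterable[str], ranks: Dict[str, int]) -> Optional[str]:
    """Label a PR by the best-ranked host it touches (assumed policy)."""
    best = min((ranks[h] for h in hosts if h in ranks), default=None)
    return None if best is None else label_for_rank(best)
```

A PR whose hosts are all outside the top 1M gets no label at all, which matches the "may or may not see" wording above.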

Updated to include up to & including #6485

That's really awesome!

Note that this depends on the Alexa global rank.
Some rules might be much more important in a specific country. E.g. news sites often have a global rank > 100K, but are in the top ten of one country.

Can we also scan rulesets/domains in the issues to prioritize them too?

Can we use this? http://www.alexa.com/topsites/countries

Actually, it only covers the top 500 sites on the web, so it seems a bit too few.

Updated to include up to & including #6504

I think in terms of prioritization, we should go with the global rank. It doesn't make sense to focus on one particular country more than any other, as this extension is global in focus (it is called HTTPS _Everywhere_, after all :]). Also, I'm afraid that if we add labels like japan-top100 for a bunch of countries, the labels will get way too cluttered and the utility of seeing clearly which rulesets to prioritize on the list page will be impeded.

@J0WI I don't think there's any good way to prioritize the issues, because issues cannot be parsed programmatically the way PRs can. If someone mentions that they want their site, https://www.somepersonalblog.com/, to be in the rulesets, then mentions that https://www.google.com/ ranks them highly, there's no good way to figure out programmatically which site, google.com or somepersonalblog.com, the issue is actually about.

I think it's neat we have more information about the sites so thanks @Hainish for that. Some comments though:

I'm uncomfortable triaging based entirely or mostly on top-N rank. A very small site might nevertheless have extreme, potentially life-threatening impact on its users. Also if we use global rank then I'm concerned we'll bias the project (even more) toward users in developed, English-speaking countries. And, especially at the lower levels, it's not clear what I'm supposed to do differently depending on the tag. What extra review should a top-100k site get that a top-1m site doesn't? Without a specific answer, I'm not sure we need to make that distinction.

Here are some ideas:

  1. Definitely keep top-100, top-1k, and maybe top-10k. We can explore what extra review they should get in #6492 .
  2. Add country-specific tags. This can help connect volunteers to issues in areas they know about or areas they are interested in focusing on for whatever reason.
  3. Encourage users to use the 'High Priority' tag if their issue or PR is actually high priority. There have only been 17 High Priority marked issues and two High Priority marked pull requests ever.

What do you all think?

@jeremyn: You're right - you shouldn't decide what to prioritize solely on the Alexa rank. The Alexa rank is meant to be a helpful guide for ruleset maintainers, not the sole criterion.

I don't agree with the assertion that global rank weighs more toward users in developed countries. Just looking at the top 20, the list includes baidu.com, qq.com, google.co.in, taobao.com, google.co.jp, sina.com.cn, weibo.com, and yahoo.co.jp. These sites are all non-English by default, and most of the others in the top 20 are localized based on IP and browser headers. I think global ranking is the best way to have as unbiased a sample as you can get.

I'm worried about the clutter country-specific tags will create. The high-priority tag is something that HTTPSE collaborators can apply to specific issues if they think it's important enough, but GitHub doesn't allow non-collaborators to create issues with a prioritization specified. I think this is a good thing - it prevents anyone from coming along and jumping the stack to get their own site included. Users can always tag collaborators to ask for prioritization.

This raises a philosophical question about HTTPS Everywhere's mission. Imagine two local governments that each represent a million people, each with a website. One website is in a developed country with 90% internet usage, so 900k possible site users, and the other is in a developing country with 40% internet usage, so 400k possible site users. Which site deserves more attention from HTTPS Everywhere? You could make an argument for the site in the developed country based on the number of internet users, or you could argue that both sites deserve equal attention based on population.

I'm speculating here but I expect that the global Alexa ranking depends both on the total population a site serves and also the percentage of that population who use the internet. The second number is higher in developed countries so I expect the global ranking to favor developed countries. Anyway I don't have an alternate to the global Alexa rating to suggest, and I think in this case some information about site importance is better than none, so I support the tags. It's worth keeping this in mind though.

The clutter from country tags would only be one extra tag per PR or issue. It's also hard to say how useful they would be because the people who might really benefit from them aren't well represented in this discussion. They might be super useful to some people. We could add them and see how it goes.

If GitHub doesn't allow non-collaborators to tag an issue then that's a technical limitation and so be it. But my suspicion is that people are on the whole too shy rather than too aggressive, especially if they're not very technical and if they don't speak English well. I would prefer dealing with 10 aggressive stack-jumpers if it meant we get extra visibility on one more small but critical broken site.

By the way I do recognize I'm new here and I appreciate your listening to my ideas. I hope I'm not coming across as rude.

A more technical problem: if you search for both top-100 and top-1k at the same time, you get no results. Is there a way to get the top 1-1000 in one search?

Updated to include up to & including #6550

@jeremyn if you're willing to go through and tag those that apply to a specific country, you're welcome to. I suspect it will be more trouble than it's worth, though. @J0WI, @fuglede what do you think about this idea? If we do implement country labels, the one thing I ask is that it be prefixed. So instead of japan, cc-jp for the Japanese country code. This way it's easier to see all the countries there are labels for when typing cc- into the filter, and if we want to get rid of it later we can see easily which tags to delete.

@jeremyn you're not being rude. Open discussion is important for building a sustainable project, and I appreciate the input.

On the technical problem, seems like it's not possible: https://stackoverflow.com/questions/29136057/can-i-search-github-labels-with-logical-operator-or

Adding country labels is too much trouble to do manually. An automated process could be based on the TLD. I agree with a cc- prefix. I admit I'm not personally interested in working on that right now.

It's too bad GitHub doesn't support an OR search for labels. Given that, you would need to modify your tagging process to add multiple tags, so the top 100 would get all five tags, etc. I think that would be too awkward.
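For illustration, the cumulative tagging described here (and judged too awkward) would look roughly like this. `cumulative_labels` is a hypothetical helper, not part of the labeller.

```python
from typing import List

# Inclusive upper rank bound for each label, most popular first.
LABEL_THRESHOLDS = [
    (100, "top-100"),
    (1_000, "top-1k"),
    (10_000, "top-10k"),
    (100_000, "top-100k"),
    (1_000_000, "top-1m"),
]


def cumulative_labels(rank: int) -> List[str]:
    """Every label whose bound contains the rank, tightest first."""
    return [label for limit, label in LABEL_THRESHOLDS if rank <= limit]
```

With this scheme a single search for `top-1k` would return everything ranked 1-1000, at the cost of up to five labels on one PR.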

@jeremyn TLD would only provide a subset of sites within a country. Some of the largest sites based in countries are not easily identifiable by TLD. qq.com or baidu.com for instance, and most of the largest US-based sites have the .com TLD rather than .us. You could do a geoip-lookup to determine where the servers hosting the sites are located, but this would not be reliable (as sites intended for the population in one country can have servers hosting content in another country, and anycast addressing like CloudFlare uses would complicate matters.)

At any rate, I think it's more work than it's worth.
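A small sketch of the TLD heuristic and its blind spot. It relies on the fact that two-letter ASCII TLDs are country-code TLDs; `cc_label` is a hypothetical name, and the heuristic says nothing about where a site's audience actually is (some ccTLDs are also widely used as generic domains).

```python
from typing import Optional


def cc_label(host: str) -> Optional[str]:
    """Return a cc- prefixed label if the host ends in a two-letter TLD."""
    tld = host.lower().rstrip(".").rsplit(".", 1)[-1]
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return f"cc-{tld}"
    return None


# Hosts mentioned in this thread: two are caught, two are missed.
for host in ["yahoo.co.jp", "google.co.in", "qq.com", "baidu.com"]:
    print(host, cc_label(host))
```

The `.com` hosts come back with no label, which is exactly the subset problem described above.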

Country labels or (even better?) language labels would be nice to have. I prefer reviewing hosts when I understand the content, because it's easier to see if anything is broken.
But we don't have a reviewer for each language, so I fear it won't change much.

I think I was envisioning a random interested user who wants to look through issues or pull requests in areas they know about in case they want to provide feedback. The list wouldn't need to be 100% complete.

There might be a chicken-and-egg problem here. If it were very obvious that we have a lot of open issues or pull requests for country X, maybe a reviewer who knows about country X would step forward.

I agree with both of you though that adding country codes would probably not help much in the short-term.

Updated to include up to & including #6575

Updated to include up to & including #6602

3985 is now labeled twice for some reason.

I now have this running as a cron job on a server, running at minute 14 and 44 of every hour.

It looks like the same PR was gone over with the script twice, and one of the hosts in the PR changed positions within the top million. It shouldn't happen again, the script has stored state that tells it not to go over PRs it has already seen.
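The stored state could be as simple as a file of PR numbers already processed. The sketch below assumes a JSON file and hypothetical helper names; the thread does not say how the real script stores its state.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_seen(path: Path) -> Set[int]:
    """Read the set of already-processed PR numbers (empty if no file yet)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_seen(path: Path, seen: Set[int]) -> None:
    """Persist the processed PR numbers for the next cron run."""
    path.write_text(json.dumps(sorted(seen)))


def unseen_prs(open_prs: Iterable[int], seen: Set[int]) -> List[int]:
    """PR numbers that still need to be checked, in their original order."""
    return [n for n in open_prs if n not in seen]
```

Each run would label only `unseen_prs(...)`, then add those numbers to the set and save it.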

So, this seems like a desired outcome to me. Don't we want to know when the popularity of a site jumps by an order of magnitude on Alexa, for better prioritization? (I assume that in this case the site jumped from 1001 to 999 or something similar, so not really an order of magnitude, but you get my point.)

We currently have more than 700 PRs open. With the cron job running every half hour, it is prudent to scan only the PRs that have been opened since we last checked. Otherwise we run up against API limits and the running time increases dramatically. For registered users, GitHub gives 3000 API calls/hour, which is pretty low when you consider that for each ruleset PR we have to look at all the changed ruleset files via the API.

Perhaps once a day we can do a more thorough scan of every open PR, so that for older PRs we get an idea of whether a host's popularity has changed.
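A back-of-the-envelope check of that budget, as a sketch. The cost model (one call to list a PR's files plus one call per changed file) and the figure of three files per PR are assumptions for illustration; the hourly limit is the number quoted above.

```python
from typing import Iterable


def calls_for_pass(files_per_pr: Iterable[int]) -> int:
    """API calls for one pass: one listing call per PR plus one per file."""
    return sum(1 + n for n in files_per_pr)


def fits_budget(files_per_pr: Iterable[int], hourly_limit: int, passes_per_hour: int) -> bool:
    """True if the given number of passes per hour stays within the limit."""
    return calls_for_pass(files_per_pr) * passes_per_hour <= hourly_limit


# 700 open PRs with a hypothetical 3 changed files each, scanned twice an hour.
full_scan = [3] * 700
print(calls_for_pass(full_scan), fits_budget(full_scan, 3000, 2))
```

Under these assumptions a full scan every half hour does not fit, while scanning a handful of new PRs per run easily does, which is the motivation for the incremental approach plus a less frequent full pass.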

@Hainish possible bug: pull request #6279 did not get marked top-100k even though a duplicate pull request #6750 did.

I'm not certain this is a bug. It may accurately reflect the Alexa ranking at the time the script was run. Currently, this site ranks at ~31k. I'll close this, and we can reassess if we see something similar. If we run the script on the full list every week as I suggest in https://github.com/EFForg/https-everywhere/issues/6424#issuecomment-242828702, this should be tagged regardless, even if we miss it for some reason on the first pass.

Okay, that's fine. I guess I thought that it could matter that the unmarked PR had Nvcc.edu in the description, and the marked one had nvcc.edu. I'm not sure what your script is reading to compare to the Alexa data.
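If capitalization were the cause, normalizing hosts before the lookup would rule it out. This is a sketch only: the thread does not say whether the script reads the PR description at all, `rank_of` is a hypothetical helper, and the rank below is just the approximate figure quoted above.

```python
from typing import Dict, Optional


def rank_of(host: str, ranks: Dict[str, int]) -> Optional[int]:
    """Look a host up case-insensitively, ignoring a trailing dot."""
    return ranks.get(host.strip().lower().rstrip("."))


ranks = {"nvcc.edu": 31000}  # illustrative rank, roughly the ~31k mentioned above
print(rank_of("Nvcc.edu", ranks), rank_of("nvcc.edu", ranks))
```

Hostnames are case-insensitive, so both spellings should resolve to the same entry.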

@Hainish Pull request https://github.com/EFForg/https-everywhere/pull/6771 got tagged with both top-100k and (months later) top-10k. A bug in the tag process?

Pull request https://github.com/EFForg/https-everywhere/pull/7519 got double-tagged with top-100 and top-1k.

@jeremyn more probably double-tagged due to targets for that site moving in the Alexa list. I've had to manually delete the placeholder file when the labeller periodically gets stuck.

The fact that both of these were double-labelled on the same day (14 days ago) indicates to me that it was one of the days I did some manual work on the labeller. Indeed, SSHing in to that server just now gives me Last login: Wed Feb 22 19:52:37 2017 which corresponds to this double-labelling.

@Hainish rakuten.co.jp from #11265 did not get the top-1k label. I suspect this is a bug caused by the lack of ^ and www in the respective ruleset. Can you confirm this? Related index.js#L104.
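One way to make the lookup independent of how a rule pattern is written is to read hosts from the ruleset's `<target host="...">` elements and try a few normalized candidates. The sketch below is an assumption about a possible approach, not a description of what index.js does; the helper names and the sample ruleset are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def target_hosts(ruleset_xml: str) -> List[str]:
    """Collect the host attribute of every <target> element in a ruleset."""
    root = ET.fromstring(ruleset_xml)
    return [t.get("host") for t in root.iter("target") if t.get("host")]


def lookup_candidates(host: str) -> List[str]:
    """Names to try against the rank list: the host itself, then without www."""
    host = host.lower().rstrip(".")
    if host.startswith("*."):
        host = host[2:]  # drop a leading wildcard label
    candidates = [host]
    if host.startswith("www."):
        candidates.append(host[4:])
    return candidates


SAMPLE = (
    '<ruleset name="Example">'
    '<target host="www.Example.com" />'
    '<target host="*.example.org" />'
    '<rule from="^http:" to="https:" />'
    "</ruleset>"
)
print([lookup_candidates(h) for h in target_hosts(SAMPLE)])
```

Matching on targets rather than on the `from` pattern means a ruleset without `^` or `www` in its rule would still be ranked by its bare domain.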
