Rubygems.org: Typosquatters and a Reputation system for rubygems.org (just a rough idea so far, nothing concrete/specific)

Created on 18 Apr 2020 · 10 Comments · Source: rubygems/rubygems.org

It seems we recently had some press attention again:

https://thehackernews.com/2020/04/rubygem-typosquatting-malware.html

In past discussions, some people wanted to make 2FA mandatory,
which would mean that I'd lose the ability to access rubygems.org
as a developer. As a user I could still use it just fine, but I'd
use GitHub to distribute code if 2FA became mandatory, at least
as long as 2FA is not mandatory on GitHub.

To anticipate this, I would like to start a bit of a discussion
and think about the problem in a slightly different way. This is
not about typosquatting per se, mind you. It is more about a
level of "trust".

Now, in general: people should not trust anyone or anything, ever.
And even where trust exists, the supply chain can be tampered
with, perhaps by malicious actors, state or company activities
and so forth. I am aware that we cannot be absolutely safe.

But I am thinking that perhaps we could find a SIMPLE system that
would still be somewhat useful.

So I would like to propose introducing some reputation system for
rubygems.org.

My idea is that it would be a bit similar to StackOverflow (not
the same though).

The general idea would be that people with a long-term
history of useful contributions could slowly climb the
reputation chart, so to speak.

This cannot replace other mechanisms, but it may make things
a bit easier for people in the long run. Perhaps even
"gem" itself could then be modified to allow for some
reputation-based system, like "only install gems if their
reputation is higher than THRESHOLD VALUE here".

This would mean that we could have an install supply chain of
gems where unknown gems could not sneakily be added into the
system without reaching a certain minimum reputation threshold.
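
A minimal sketch of what such a threshold check might look like. Everything here is hypothetical: neither "gem" nor rubygems.org exposes a reputation score, and the author names, scores and the installable? helper are made up for illustration.

```ruby
# Hypothetical data: rubygems.org does not publish reputation scores.
AUTHOR_REPUTATION = {
  "alice" => 42,   # long-standing author
  "bob" => 3,
  "mallory" => 0   # brand-new account
}.freeze

# True only when the author's reputation meets the threshold.
# Authors without a recorded score are treated as having zero reputation.
def installable?(author, threshold, reputations = AUTHOR_REPUTATION)
  reputations.fetch(author, 0) >= threshold
end

# Splits a list of gems into [allowed, rejected] for a given threshold.
def partition_by_reputation(gems, threshold, reputations = AUTHOR_REPUTATION)
  gems.partition { |g| installable?(g[:author], threshold, reputations) }
end

gems = [
  { name: "useful-gem", author: "alice" },
  { name: "usefull-gem", author: "mallory" }
]
allowed, rejected = partition_by_reputation(gems, 10)
puts "allowed:  #{allowed.map { |g| g[:name] }.join(', ')}"
puts "rejected: #{rejected.map { |g| g[:name] }.join(', ')}"
```

Treating unknown authors as zero is a deliberate choice in this sketch: a freshly registered typosquat would then fail any threshold above zero.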

I am not sure if my idea is good, but I wanted to propose it.

Evidently, some details are still missing.

The most important one is this:

How to actually count reputation?

Well, this is difficult. One approach could be based on popularity.
For example, 100,000 downloads could equal 1 reputation point.
(Note that this refers ONLY to the original author, not
to any OTHER author; the reputation would be tracked per author,
NOT per gem ... although this could also be changed, but the
whole idea here is to trust INDIVIDUALS, not projects per se.
And, again, I am aware that this won't counter hostile takeovers,
but that is not the point here anyway. I am mostly looking for
more ideas against e.g. typosquatting and similar low-complexity
attack vectors.)
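
To make the download-based idea concrete, here is a small sketch. It assumes that an author's downloads are summed across all of their gems before converting to points, and that partial points are dropped; both are assumptions, as are the sample data and the reputation_by_author name.

```ruby
# Assumed conversion rate taken from the example in the text.
DOWNLOADS_PER_POINT = 100_000

# gems: array of hashes with :author (the original author) and :downloads.
# Returns a hash mapping each author to a whole number of reputation points.
def reputation_by_author(gems)
  totals = Hash.new(0)
  gems.each { |g| totals[g[:author]] += g[:downloads] }
  # Integer division: only full blocks of 100,000 downloads count.
  totals.transform_values { |downloads| downloads / DOWNLOADS_PER_POINT }
end

sample = [
  { author: "alice", downloads: 250_000 },
  { author: "alice", downloads: 60_000 },
  { author: "bob", downloads: 99_999 }
]
p reputation_by_author(sample) # alice: 310,000 downloads -> 3 points; bob -> 0
```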

Perhaps to allow for some customization, different authors
on rubygems.org could also vote for other projects every now
and then. This could be a bi-yearly voting or yearly voting
or something like that. Voting can be problematic, so I
would not recommend giving it too much weight, but it could
add some variety to different gems.

Anyway - lots of things are missing here, but my idea is to
perhaps slowly begin a discussion, and who knows, perhaps
in a few years we could have first attempts to see how
useful it would be.

discussion

rubyFeedback

All 10 comments

RubyGems.org currently applies its Levenshtein distance protection for Gems with 10,000,000 or more downloads. I'm in the process of analyzing the Reversing Labs data, but one thing is clear: going by the download counts of the victim Gems in their data, none were protected.

[Edit to Remove, See Updated Graph in Comment Below]

I propose that we consider applying a much lower download threshold for a -/_ name funny business criterion, which would have caught many more of these. This can be done within PostgreSQL with something like WHERE regexp_replace(rubygems.name, '[_-]', '', 'g') = 'strippednametocheck', which would be tractable. The stripped name could be added as a column, or the lookup could use a PostgreSQL index on that expression.
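
A minimal Ruby sketch of the same -/_ comparison, for illustration only. The gem names, the PROTECTED_NAMES list and the helper names are hypothetical and are not taken from the rubygems.org codebase.

```ruby
# Hypothetical list of existing gem names to protect.
PROTECTED_NAMES = %w[acme-core acme_utils].freeze

# Mirrors regexp_replace(name, '[_-]', '', 'g'): drop every "-" and "_".
def strip_separators(name)
  name.gsub(/[_-]/, "")
end

# True when the candidate differs from an existing name only by its
# dashes and underscores. An exact match is not counted as a squat here.
def separator_squat?(candidate, protected_names = PROTECTED_NAMES)
  stripped = strip_separators(candidate)
  protected_names.any? do |existing|
    existing != candidate && strip_separators(existing) == stripped
  end
end

%w[acme_core acmecore acme-utils acme-core acme-cores].each do |name|
  puts "#{name}: #{separator_squat?(name)}"
end
```

Because the comparison is an exact match on the stripped name, it does not penalise short names the way an edit-distance cutoff can.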

rietta on 20 Apr 2020

👍1

RubyGems.org currently applies its Levenshtein distance protection for Gems with 10,000,000 or more downloads.

This was disabled in https://github.com/rubygems/rubygems.org/pull/2253. It was enabled when the gems in question were uploaded, though. Levenshtein distance had way too many false positives. Limitations and suggestions on typo protection were previously discussed here. We could consider re-enabling it if someone can come up with a better fit, while ensuring users aren't bothered by an overzealous algorithm. Another dataset you may find useful is this list.
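
To illustrate why an edit-distance check produces false positives on short names, here is a textbook Levenshtein implementation. It is a generic sketch, not the code that rubygems.org used; the max_distance cutoff of 2 and the flagged_by_distance? helper are assumptions made for the example.

```ruby
# Classic dynamic-programming Levenshtein distance, one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Flags a candidate that is within max_distance edits of a different,
# already existing name. The cutoff value is an assumption.
def flagged_by_distance?(candidate, existing_names, max_distance: 2)
  existing_names.any? do |existing|
    existing != candidate && levenshtein(candidate, existing) <= max_distance
  end
end

puts levenshtein("rack", "rake")            # => 2
puts flagged_by_distance?("rake", ["rack"]) # => true
```

With a cutoff of 2, two unrelated four-letter names such as "rack" and "rake" already collide, which is the kind of false positive a fixed cutoff invites.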

Really not sure about where we are at with "reputation". I guess there is no harm in keeping it open for discussion.

sonalkr132 on 20 Apr 2020

@sonalkr132 Yikes! I was not aware that the minimum protection had been turned off. One thing that has been frustrating in bulk analysis is that the publishing user is null in the nightly data dumps. I am able to get candidate squats and then have a script poll the API for ownership data. I think I need a better system before I start running automatic analysis nightly, so that I do not slam RubyGems.org.
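
A rough sketch of a throttled ownership poll, under stated assumptions: it assumes the public owners endpoint lives at /api/v1/gems/NAME/owners.json (please verify against the API documentation), and the one-second delay, the poll_owners name and the injectable fetcher and sleeper arguments are all choices made for this example.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed endpoint path; check the rubygems.org API docs before relying on it.
OWNERS_URL = "https://rubygems.org/api/v1/gems/%s/owners.json"

def owners_uri(gem_name)
  URI(format(OWNERS_URL, URI.encode_www_form_component(gem_name)))
end

# Returns the parsed JSON body, or an empty array on any non-2xx response.
def fetch_owners(gem_name)
  response = Net::HTTP.get_response(owners_uri(gem_name))
  response.is_a?(Net::HTTPSuccess) ? JSON.parse(response.body) : []
end

# Polls one gem at a time and pauses between requests so that a nightly
# run does not hammer the server. fetcher and sleeper are injectable so
# the loop can be exercised without network access.
def poll_owners(gem_names, delay: 1.0, fetcher: method(:fetch_owners), sleeper: method(:sleep))
  results = {}
  gem_names.each_with_index do |name, index|
    sleeper.call(delay) if index.positive?
    results[name] = fetcher.call(name)
  end
  results
end
```

Calling poll_owners(candidate_names) with the defaults performs real HTTP requests, so a longer delay or a local cache of earlier answers may be kinder for large candidate lists.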

rietta on 20 Apr 2020

@sonalkr132 I can do more data analysis, but I think that -/_ criteria could be a better fit for a fast test that does not unduly prevent short Gem names that would not confuse a human from being used. I have tried many variants for automatic ranking of badness from one Gem to another. A match in the authors and description fields is a big tip-off that the Gem is at minimum a clone of the existing one rather than an innocent name collision.
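
One way the authors/description tip-off could be checked, as a hedged sketch: the field names, the normalisation rules and the copied_metadata? helper are assumptions for illustration and do not reproduce the ranking variants mentioned above.

```ruby
# Lowercases and collapses whitespace so trivial edits do not hide a copy.
def normalize_field(text)
  text.to_s.downcase.gsub(/\s+/, " ").strip
end

# True when the candidate reuses the original's authors or description
# verbatim (after normalisation). Empty fields never count as a match.
def copied_metadata?(candidate, original)
  %i[authors description].any? do |field|
    value = normalize_field(candidate[field])
    !value.empty? && value == normalize_field(original[field])
  end
end

original  = { authors: "Jane Doe", description: "Fast  JSON parser" }
suspect   = { authors: "jane doe", description: "Something else" }
unrelated = { authors: "John Roe", description: "A different tool" }

puts copied_metadata?(suspect, original)   # => true
puts copied_metadata?(unrelated, original) # => false
```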

rietta on 20 Apr 2020

The graph I shared a few days ago was overly biased by my own algorithm. This is a better graph that I am using in an open source supply chain presentation at https://hella-secure.com/hellaconf-2020 this weekend.

[Image: RubyGems Typo Squat Report]

rietta on 24 Apr 2020

I think that -/_ criteria could be a better fit for a fast test that does not unduly prevent short Gem names

Makes sense. Feel free to send a PR updating the existing GemTypo class to use this.

sonalkr132 on 28 Apr 2020

👍2

I'm already working on something like that with @janko and a couple of other people. The system should be up and running under the hood of RubyGems (as a separate app to notify the RG team) within two weeks. I will announce more details soon. I wanted to do it during RubyKaigi, but it looks like I need to wrap things up.

mensfeld on 30 Apr 2020

I think that -/_ criteria could be a better fit for a fast test that does not unduly prevent short Gem names

Makes sense. Feel free to send a PR updating the existing GemTypo class to use this.

Good to know that none of the gems with 10 million downloads were targeted, because of the protection we had. Regardless, we need a less labour-intensive process than what we had. I think one of the ideas we are working on is setting up a Slack channel for alerts. This kind of check can really be implemented and maintained by anyone using our API endpoints.

Pull request submitted at https://github.com/rubygems/rubygems.org/pull/2341.

rietta on 30 Apr 2020

This new method of checking -/_ variations was merged in https://github.com/rubygems/rubygems.org/commit/86ab6abad43ac3dcf6ae131eb6e6dc12d943d3d3

rietta on 15 Aug 2020

I thought a bit about reputation. It seems a bit too complicated to work for any package management system, which is probably why no one does this at the moment.
While I agree that we should do more to prevent abuse of naming (hence the feature rietta worked on), I think issues like typosquatting specifically are better dealt with by a smart algorithm that detects (and perhaps blocks) such attempts.

sonalkr132 on 31 Aug 2020
