Pkg.jl: spell checking against typo squatting

Created on 27 Nov 2018 · 13 Comments · Source: JuliaLang/Pkg.jl

We can use spell checking technology to defend against typo squatting.

First, read Peter Norvig’s spell checker essay for the basics of spell checking. (Real-world spell checkers are more involved than this, but I think this works for our purposes.) In our case, the “language” is the set of all valid registered package names. If we collect statistics on how often packages are added, then we have a distribution on the language, which gives us a language model: P(x) for each registered package x. We can use edit distance to model spelling errors: Damerau–Levenshtein is a good choice since transposition is a common typing error, and it can be tweaked to treat changing the case of a letter as a common error as well. Given an intended word x, a typed spelling y, and the edit distance d(x, y) between them, the error model is β^d(x, y), where 0 < β < 1 is the probability of making a single spelling error. Putting these together, we have a correction model. Given that the user typed y, the likelihood of each possible package x is:

L(x | y) = P(x) * β^d(x, y)

The first term on the right is the language model and the second term is the error model. The most likely intended package, x, given that y was typed is the x that maximizes this score.
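
To make the scoring concrete, here is a minimal sketch in Python (the language of Norvig's essay). It is an illustration under stated assumptions, not a proposed implementation: the function names, the install counts, and β = 0.05 are all made up for the example, and the distance is the optimal string alignment variant of Damerau–Levenshtein, which may not be the variant Pkg would end up using.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def likelihood(candidate, typed, counts, beta=0.05):
    """L(x | y) = P(x) * beta ** d(x, y), with P(x) estimated from add counts."""
    p = counts[candidate] / sum(counts.values())
    return p * beta ** osa_distance(candidate, typed)


def best_guess(typed, counts, beta=0.05):
    """The registered package that maximizes L(x | typed)."""
    return max(counts, key=lambda x: likelihood(x, typed, counts, beta))


# Hypothetical add counts, invented for this example.
counts = {"DataFrames": 9000, "DataFarmes": 3, "Plots": 12000}
print(best_guess("DataFarmes", counts))  # DataFrames
```

Even though the typed name matches a registered package exactly, the far more popular name one transposition away wins, because a single factor of β costs less than the gap in popularity.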

How can we use this to defend against typo squatting? First off, when someone does pkg> add x there are three cases:

x is not a valid package name:
- prompt the user to choose from the top n packages y ranked by L(y), the likelihood of y given that x was typed
x is a valid package name:
- there is no y such that L(y) > L(x): just install x
- there are y such that L(y) > L(x): warn and, for each such y, ask the user whether they really meant to install y instead of x

The last case is the one that protects against typo squatting. Suppose someone manages to register DataFarmes, a malicious fork of DataFrames. DataFrames is much more popular, so L(“DataFrames”) will be higher than L(“DataFarmes”), and when an unsuspecting user types pkg> add DataFarmes by accident, they will be asked whether they really meant DataFrames.
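
The three cases can be sketched as a small decision function. This is a rough Python illustration only: `resolve_add`, its return values, and the scores in the example are hypothetical, and the scores are assumed to have been computed already by the correction model for whatever the user typed.

```python
def resolve_add(typed, scores, top_n=3):
    """Decide what `add typed` should do.

    `scores` maps every registered package name x to its (possibly
    unnormalized) likelihood given what the user typed.
    Returns an (action, names) pair.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)

    # Case 1: not a registered name, so offer the most likely candidates.
    if typed not in scores:
        return ("suggest", ranked[:top_n])

    # Case 3: registered, but some other package is more likely.
    rivals = [x for x in ranked if scores[x] > scores[typed]]
    if rivals:
        return ("confirm", rivals)

    # Case 2: registered and nothing else is more likely.
    return ("install", [typed])


# Hypothetical scores for the input "DataFarmes".
scores = {"DataFrames": 450.0, "DataFarmes": 3.0, "Plots": 1e-6}
print(resolve_add("DataFarmes", scores))  # ('confirm', ['DataFrames'])
```

Returning an action instead of prompting directly keeps the policy separate from the REPL interaction, which should make it easier to test.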

Of course, for defense in depth, we should also require review whenever someone registers a package with a name that is within a certain edit distance of an existing package.

enhancement security

StefanKarpinski

👍7 ❤5

Most helpful comment

I don't think anyone is working on this, so definitely feel free to work on it. Might I suggest that you start by making a PkgSpellCheck package, which can take a language model in the form of a map (presumably just a dictionary) from package names to likelihoods (i.e. probabilities, but not necessarily normalized) and can classify a given input that's supposed to be a package name. Aside from API design, the error model is where most of the work needs to go:

- Doing Unicode normalization.
- Coming up with a list of confusable character sequences.
- Playing around with variations on edit distances.
- Figuring out what a reasonable choice of β is: this may need to be tuned, learned, or it may depend on the language model.
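
As a starting point for the first two items, here is a minimal Python sketch that reduces a name to a comparison "skeleton" before any distance is computed. The `CONFUSABLES` table and the `skeleton` function are hypothetical and deliberately tiny. A real table would be far larger, for example one derived from the Unicode confusables data, and folding case is itself an assumption about what should count as the same name.

```python
import unicodedata

# Toy table of confusable characters and sequences (illustrative only).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "0": "o",
    "1": "l",
    "rn": "m",      # a two-character sequence that resembles one letter
}


def skeleton(name):
    """Canonical form used only for comparing names, never for display."""
    # NFKC folds compatibility forms such as fullwidth letters.
    text = unicodedata.normalize("NFKC", name).casefold()
    for sequence, replacement in CONFUSABLES.items():
        text = text.replace(sequence, replacement)
    return text


# A Cyrillic "а" in place of the Latin one gives the same skeleton.
print(skeleton("D\u0430taFrames") == skeleton("DataFrames"))  # True
```

Two names with equal skeletons could then be treated as being at edit distance zero, so the popularity term alone decides which one the user most likely meant.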

StefanKarpinski on 9 Dec 2019

👍3

All 13 comments

In https://github.com/JuliaLang/METADATA.jl/pull/19689 @aviks suggested disallowing registering packages with names that are within a certain edit distance of an existing package. I think that's too strict, especially since having a package with the exact same name is the degenerate case and we want to support that (requiring the user to disambiguate, of course). With spell checking, we handle exact name collisions the same as spelling errors, which is nice. There are also cases where it's legitimate to have a similar name to another package. In this scheme, that requires two things: a review upon registration and people verifying what they meant upon install. After a while, if the new package becomes popular enough, the prompt step will go away, which seems reasonable.

StefanKarpinski on 27 Nov 2018

👍3

I just wanted to add a similar attack vector: visual spoofing with Unicode characters.
I think this could be easily mitigated by banning all non-ASCII names, but that seems extreme and very anti-i18n.

miguelraz on 6 Dec 2018

❤3

Ah, good point. Considering similar-looking names to be close in edit distance (identical?) would help.
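
One way this could look is an edit distance with a weighted substitution cost. The Python below is only a sketch: the `LOOKALIKES` pairs and the costs (0 for look-alike characters, 0.5 for a case change) are assumptions, and transpositions are left out to keep it short.

```python
# Toy set of visually similar character pairs (illustrative only).
LOOKALIKES = {
    frozenset(pair)
    for pair in [("a", "\u0430"), ("o", "0"), ("l", "1"), ("l", "I")]
}


def substitution_cost(x, y):
    if x == y or frozenset((x, y)) in LOOKALIKES:
        return 0.0  # identical, or visually indistinguishable
    if x.lower() == y.lower():
        return 0.5  # case change: a cheap, common error
    return 1.0


def weighted_distance(a, b):
    """Levenshtein distance with the substitution cost above."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1.0,                          # deletion
                cur[j - 1] + 1.0,                       # insertion
                prev[j - 1] + substitution_cost(x, y),  # substitution
            ))
        prev = cur
    return prev[-1]


print(weighted_distance("DataFrames", "D\u0430taFrames"))  # 0.0
print(weighted_distance("DataFrames", "Dataframes"))       # 0.5
```

Fractional distances fit the error model unchanged, since β^d is well defined for non-integer d.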

StefanKarpinski on 6 Dec 2018

👍3

Is someone still working on this? We've been discussing this more recently here: https://discourse.julialang.org/t/pkg-attack-vectors/18340/22

I might dive into it if I find the time. But if anyone's already working on this their expertise might help.

dietercastel on 9 Dec 2019

@00vareladavid shouldn't this get the security label?

dietercastel on 9 Dec 2019

Could this be of use as a reference? https://github.com/cybint/urlinsane

The use case here is not URLs, of course, but a lot of the ideas there are probably worth considering.

dietercastel on 9 Dec 2019

I'll also look a bit into how other packages do it.

dietercastel on 9 Dec 2019

Another consideration is to look at which packages the new package itself references. Let's say a new package wdiget gets a low distance score against widget. If wdiget is trying to be malicious, it will have a reference to widget, probably as simple as using widget.
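
A rough Python sketch of that check follows. It scans Julia source text with a regular expression, which is an assumption made for brevity: it is not a real parser, so it will miss multi-line imports and can be fooled by import-like text inside strings. The function names are hypothetical.

```python
import re

# Matches single-line `using ...` / `import ...` statements.
_IMPORT_LINE = re.compile(r"^[ \t]*(?:using|import)[ \t]+(.+)$", re.MULTILINE)


def referenced_packages(julia_source):
    """Top-level package names mentioned in using/import statements."""
    names = set()
    for match in _IMPORT_LINE.finditer(julia_source):
        clause = match.group(1).split("#", 1)[0]  # drop a trailing comment
        clause = clause.split(":", 1)[0]          # `import A: f` -> `A`
        for part in clause.split(","):
            tokens = part.split()
            if tokens:
                root = tokens[0].split(".", 1)[0]  # `A.B` -> `A`
                if root:                           # skips relative `..A`
                    names.add(root)
    return names


def imitated_packages(julia_source, similar_names):
    """Similarly named packages that the new package also depends on."""
    return referenced_packages(julia_source) & set(similar_names)


source = "module wdiget\nusing widget\nimport JSON: parse\nend\n"
print(imitated_packages(source, ["widget"]))  # {'widget'}
```

A hit here would only be a signal for reviewers, since a legitimate extension package may also depend on the package it is named after.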