We can use spell-checking technology to defend against typo squatting.
First, read Peter Norvig’s spell checker essay for the basics of spell checking. (Real-world spell checkers are more involved than this, but I think this works for our purposes.) In our case, the “language” is all valid registered package names. If we collect statistics on how often packages are added, then we have a distribution on the language, which gives us a language model: P(x) for each registered package x. We can use edit distance to model spelling errors: Damerau–Levenshtein is a good one since transposition is a common typing error; it can be tweaked to consider changing the case of a letter to be a common error as well. Given an intended word x, a typed spelling y, and the edit distance d(x, y) between them, the error model would be β^d(x, y), where 0 < β < 1 is the probability of making a single spelling error. Putting these together, we have a correction model: given that the user typed y, the likelihood of each possible package x is:
L(x | y) = P(x) * β^d(x, y)
The first term on the right is the language model and the second term is the error model. The most likely intended package, x, given that y was typed is the x that maximizes this score.
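To make the scoring concrete, here is a minimal sketch in Python (chosen only for illustration; nothing here is Pkg's API). It assumes the restricted "optimal string alignment" variant of Damerau–Levenshtein, a made-up popularity table, and an arbitrary β = 0.01; the names `osa_distance`, `likelihood`, and `best_guess` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ra" <-> "ar".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def likelihood(candidate: str, typed: str, popularity: dict, beta: float = 0.01) -> float:
    """L(x | y) = P(x) * beta^d(x, y); P need not be normalized."""
    return popularity[candidate] * beta ** osa_distance(candidate, typed)


def best_guess(typed: str, popularity: dict, beta: float = 0.01) -> str:
    """Registered name that maximizes the score for what was typed."""
    return max(popularity, key=lambda x: likelihood(x, typed, popularity, beta))


# Made-up install counts standing in for the language model P(x).
popularity = {"DataFrames": 5000.0, "DataFarmes": 3.0, "Plots": 4000.0}
print(best_guess("DataFarmes", popularity))
```

With these invented numbers, one transposition costs DataFrames a factor of β, but its much larger popularity still outweighs the exact match. How β should actually be chosen is left open here.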
How can we use this to defend against typo squatting? First off, when someone does pkg> add x there are three cases:
- x is not a valid package name:
  - suggest the top n packages wrt L(y)
- x is a valid package name:
  - there is no y such that L(y) ≥ L(x): just install x
  - there are y such that L(y) > L(x): warn and ask the user whether they really meant to install y instead of x, for each such y

The last case is the one that protects against typo squatting. Suppose someone manages to register DataFarmes, a malicious fork of DataFrames. Since DataFrames is much more popular, when an unsuspecting user types pkg> add DataFarmes by accident, L(“DataFrames”) will be higher than L(“DataFarmes”), so they will be asked whether they really meant DataFrames.
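The case analysis for pkg> add x could be sketched roughly like this. It is Python for illustration only and assumes that the scores L(y) for the typed name have already been computed, and that the "not a valid name" case means offering the n highest-scoring registered names. The function `decide` and its return values are hypothetical, not anything Pkg provides.

```python
def decide(typed: str, scores: dict, registered: set, n: int = 3):
    """Pick an action for `pkg> add typed`.

    `scores` maps each registered name y to L(y) given what was typed.
    Returns an (action, names) pair.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if typed not in registered:
        # Not a valid package name: offer the n most likely candidates.
        return ("suggest", ranked[:n])
    likelier = [y for y in ranked if y != typed and scores[y] > scores[typed]]
    if likelier:
        # A more likely package exists: warn and ask before installing.
        return ("confirm", likelier)
    # Nothing scores higher than the typed name: just install it.
    return ("install", [typed])


registered = {"DataFrames", "DataFarmes", "Plots"}

# Invented scores for a user who typed "DataFarmes".
scores = {"DataFrames": 50.0, "DataFarmes": 3.0, "Plots": 4e-7}
print(decide("DataFarmes", scores, registered))
```

In this sketch a tie (L(y) equal to L(x)) installs without prompting; whether ties should prompt instead is a policy choice the thread leaves open.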
Of course, for defense in depth, we should also require review whenever someone registers a package with a name that is within a certain edit distance of an existing package.
In https://github.com/JuliaLang/METADATA.jl/pull/19689 @aviks suggested disallowing registering packages with names that are within a certain edit distance of an existing package. I think that's too strict, especially since having a package with the exact same name is the degenerate case and we want to support that (requiring the user to disambiguate, of course). With spell checking, we handle exact name collisions the same as spelling errors, which is nice. There are also cases where it's legitimate to have a similar name to another package. In this scheme, that requires two things: a review upon registration and people verifying what they meant upon install. After a while, if the new package becomes popular enough, the prompt step will go away, which seems reasonable.
I just wanted to add a similar attack vector: visual spoofing with Unicode characters.
I think this could be easily mitigated by banning all non-ASCII names, but that seems extreme and very anti-i18n.
Ah, good point. Considering similar-looking names to be close in edit distance (identical?) would help.
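One way to treat look-alike names as identical is to compare a "skeleton" of each name in which confusable characters are folded together. The Python sketch below is illustrative only: the `CONFUSABLES` table is a hypothetical, deliberately tiny hand-written sample (a real check would need a much fuller mapping), and `skeleton` and `looks_like` are made-up names.

```python
import unicodedata

# Hypothetical, deliberately tiny sample of look-alike characters.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
}


def skeleton(name: str) -> str:
    """Fold compatibility forms (NFKC), then map known look-alikes."""
    normalized = unicodedata.normalize("NFKC", name)
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def looks_like(candidate: str, existing: str) -> bool:
    """True if two different names collapse to the same skeleton."""
    return candidate != existing and skeleton(candidate) == skeleton(existing)


spoof = "D\u0430taFrames"  # second letter is Cyrillic, not Latin
print(looks_like(spoof, "DataFrames"))
```

Edit distance could then be computed on skeletons rather than raw names, so a visually identical spoof would sit at distance 0 from the package it imitates without banning non-ASCII names outright.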
Is someone still working on this? We've been discussing this more recently here: https://discourse.julialang.org/t/pkg-attack-vectors/18340/22
I might dive into it if I find the time. But if anyone's already working on this their expertise might help.
@00vareladavid shouldn't this get the security label?
Can this be of use maybe as reference https://github.com/cybint/urlinsane?
Use case here is not URLs ofc but a lot of the ideas there are probably useful to consider.
I'll also look a bit into how other packages do it.
I don't think anyone is working on this, so definitely feel free to work on it. Might I suggest that you start by making a PkgSpellCheck package, which can take a language model in the form of a map (presumably just a dictionary) from package names to likelihoods (i.e. probabilities, but not necessarily normalized) and can classify a given input that's supposed to be a package name. Aside from API design, the error model is where most of the work needs to go:
- What β is: this may need to be tuned, learned, or it may depend on the language model.

Another consideration is to also look at what packages are referenced in the package. Let's say you get a low distance score for widget in new package wdiget. If wdiget is trying to be malicious it will have a reference to widget, probably as simple as using widget.
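The reference check could start as something as crude as scanning the new package's source for using or import statements that name its near-namesake. This is a rough Python sketch under that assumption: `references_package` is a hypothetical helper, and the regex is a heuristic that only looks at statements starting a line, not a real Julia parser.

```python
import re

# Matches lines that begin with a `using` or `import` statement.
_IMPORT_LINE = re.compile(r"^\s*(?:using|import)\s+(.*)$", re.MULTILINE)


def references_package(source: str, name: str) -> bool:
    """Heuristically check whether Julia source loads package `name`."""
    for match in _IMPORT_LINE.finditer(source):
        # "using A, B: f" -> keep the module list before the colon.
        modules = match.group(1).split(":")[0]
        # "A.Sub" -> compare only the top-level package name.
        names = [m.strip().split(".")[0] for m in modules.split(",")]
        if name in names:
            return True
    return False


suspicious = "module wdiget\nusing widget\nend\n"
print(references_package(suspicious, "widget"))
```

A hit here would only be a signal for review, since legitimate extension packages also load the package they are named after.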
Okay, to start playing with it I'm gonna need some data. Any easy way to pull all the metadata of the registered packages (without web scraping)?
You can generate a list of names in General with something like this: https://github.com/JuliaLang/Pkg.jl/blob/50581b24cb2749564b0af7786d0cd0129a7fa489/src/REPLMode/completions.jl#L49
Excellent thx!
This is in the registry registrator now.