Warehouse: Detect malicious packages, for later removal

Created on 26 Nov 2018 · 15 Comments · Source: pypa/warehouse

Looking at the simple package index, there are a number of highly questionable packages (at least judging by their names).

Packages without proper names, authors or descriptions should probably be removed, if not to reduce bloat, then for security reasons.

Stuff like this:

Labels: feature request, needs discussion

E3V3A

All 15 comments

There are almost 200K projects on PyPI. We don't have the ability to manually audit each one. How do you propose this should be done?

di on 26 Nov 2018

There are almost 200K projects on PyPI

Exactly! -- And probably 99.9% useless, outdated, fake, deprecated (at best), or possibly containing malware, at worst!

How do you propose this should be done?

:) We are programmers so I'm sure we can figure that out!

How about searching for packages that:

- Have a weird name (random or repeated ASCII)
- Have no author
- Have an author who has not provided an email
- Have:
  - No valid homepage URL
  - No description
  - No releases in the last 3 years
  - No downloads/installs in the last 2 years

That's just a start... and would probably remove a siht load of crud.
It would definitely be interesting to make such a search to see just how many hits we'd get.
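
As a rough illustration, the criteria listed above could be sketched as a set of metadata checks. This is only a sketch under stated assumptions: the `PackageInfo` record and its field names are hypothetical (they are not Warehouse's schema), and the "weird name" test only catches a single repeated character, which is a much narrower rule than "random or repeated ASCII".

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PackageInfo:
    # Hypothetical record; field names are illustrative, not Warehouse's schema.
    name: str
    author: Optional[str] = None
    author_email: Optional[str] = None
    home_page: Optional[str] = None
    description: Optional[str] = None
    last_release: Optional[datetime] = None
    downloads_last_2_years: int = 0


def suspicion_flags(pkg: PackageInfo, now: datetime) -> List[str]:
    """Return the names of the heuristics that this package trips."""
    flags = []
    # Name made of one character repeated four or more times, e.g. "aaaaaa".
    if re.fullmatch(r"(.)\1{3,}", pkg.name):
        flags.append("weird-name")
    if not pkg.author:
        flags.append("no-author")
    if not pkg.author_email:
        flags.append("no-email")
    # Very loose URL check: scheme plus something that looks like a host.
    if not (pkg.home_page and re.match(r"https?://\S+\.\S+", pkg.home_page)):
        flags.append("no-homepage")
    if not (pkg.description and pkg.description.strip()):
        flags.append("no-description")
    # No release at all, or none in roughly the last 3 years.
    if pkg.last_release is None or now - pkg.last_release > timedelta(days=3 * 365):
        flags.append("stale")
    if pkg.downloads_last_2_years == 0:
        flags.append("no-downloads")
    return flags
```

Counting how many projects trip several flags at once would give the "how many hits" number without removing anything.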

E3V3A on 26 Nov 2018

👍1

Another related issue is that there seems to be some kind of cybersquatting of package names going on as well: packages with little or meaningless content that occupy useful names.

How do you plan to deal with that?

E3V3A on 26 Nov 2018

Another related issue is that there seems to be some kind of cybersquatting of package names going on as well: packages with little or meaningless content that occupy useful names.

How do you plan to deal with that?

See PEP 541 and #1506.

di on 27 Nov 2018

👍2 ❤1

Thanks for filing this issue, @E3V3A!

Per discussion today, we'll be addressing this problem during upcoming work on automated detection of malicious uploads. In this issue we'll be nailing down our criteria for "how do we determine what is a bad package?" and plans for removing those packages.

(Note that we're distinguishing between a malicious upload and spam, and between malware and typosquatting, and that there are other issues -- like #194, #4319 and #4004 -- that concentrate on filtering re: packages that have noncompliant metadata or no recent releases.)

brainwane on 21 Jun 2019

👍1

Per a discussion with @ewdurbin last week:

The work we'll do on automated detection of malicious uploads will first concentrate on _finding_ malicious packages, and building the tools around that. Only after that will we be able to provide automated tools to help PyPI admins _remove_ them.

brainwane on 2 Sep 2019

From #7061:

What's the problem this feature will solve?
Malicious and insecure packages are a challenge in the open source community. Malicious packages have been removed several times in the last few years. Improved automated auditing techniques would make it easier for security specialists to quickly remove malicious packages. Smart bad actors would be able to use the same test suite, certainly, but it would at minimum allow for the vetting of existing packages. Likewise, this would set up an automated process which could be enhanced over time.

Describe the solution you'd like
Python's exec() function is not secure, and its use may be a good heuristic for finding malicious packages. There may be additional heuristics that make a package appear more suspicious and therefore a likely target for manual auditing. Add a badge or other indicator for packages that pass/fail these tests.
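
One possible way to apply the exec() heuristic described above is to parse a file with the standard-library `ast` module and report direct calls, without running the code. This is a minimal sketch, not the check that was eventually built: the function name `find_calls` is hypothetical, it only sees direct calls by name (not aliased or obfuscated ones), and `ast.parse` raises `SyntaxError` on source the running interpreter cannot parse.

```python
import ast
from typing import FrozenSet, List, Tuple


def find_calls(
    source: str, names: FrozenSet[str] = frozenset({"exec"})
) -> List[Tuple[int, str]]:
    """Return (line number, function name) for each direct call to a listed name."""
    tree = ast.parse(source)  # parses only; the source is never executed
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in names
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)
```

A hit is only a reason to look closer, since plenty of legitimate packages call exec().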

di on 5 Dec 2019

I'm very interested in this effort and would like to help. Given that there are so many packages, here are a few suggestions:

Most legit packages will have a few things in common:
- a readme/description
- link to source code
- 2 or more contributors
- other common fields filled out such as classifiers
A tally of the 1000 (or more) most-downloaded packages could be collected and compared against other or new packages:
- Compare the name of the package to see if it's _very_ similar (typosquatting)
- Compare the name of the package to see if it's _very_ different. This is a common issue with malicious websites and there's a tool to calculate this (https://github.com/MarkBaggett/freq)
- Code analysis? This would be a very difficult thing to do I think
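
The "very similar name" comparison above could be sketched with the standard-library `difflib`. This is an assumption-laden sketch: the function names and the 0.8 cutoff are illustrative, the list of popular names would have to come from real download statistics, and names are normalized as in PEP 503 before comparison.

```python
import difflib
import re
from typing import Iterable, List


def normalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_" and "." collapse to a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


def likely_typosquats(
    candidate: str, popular: Iterable[str], cutoff: float = 0.8
) -> List[str]:
    """Return popular names that the candidate closely resembles but does not equal."""
    cand = normalize(candidate)
    known = {normalize(p) for p in popular}
    if cand in known:
        return []  # the candidate is itself one of the popular packages
    return difflib.get_close_matches(cand, sorted(known), n=3, cutoff=cutoff)
```

For example, `likely_typosquats("reqeusts", ["requests", "numpy", "django"])` compares the new name against each popular name and returns the close ones; an empty list means nothing looked similar enough at the chosen cutoff.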

mertzjames on 5 Dec 2019

Hello friends! I will be working on the backend implementation of the system for adding malware checks. You can track the progress of this work by checking out the malware-detection label.

xmunoz on 13 Dec 2019

😄1

Hey everyone.
We are currently working on a proof of concept at GitHub to detect malicious code in package managers.
We are currently setting up an environment to run our tests, but our first step is to use a static analysis tool, CodeQL, to model the way certain backdoors work so that we can detect them as they get included in PyPI.

nicowaisman on 23 Dec 2019

@xmunoz I'm excited about this work! Will we be able to discuss it with you at PyCon and/or help improve it during the sprints?

brainwane on 27 Jan 2020

Yes, absolutely! I'm actually giving a charla about this system at PyCon, but for interested non-Spanish speakers, I can give the English version during the sprints. Also, I'd really love to get feedback on this contribution documentation, and this sounds like a great way to do that.

https://github.com/pypa/warehouse/pull/7369

xmunoz on 12 Feb 2020

@xmunoz Are there any slides of that charla?
Do you guys maintain a database of previous backdoors/malware introduced to PyPI? I have slowly started building my own collection, and I would love to expand it.

nicowaisman on 12 Feb 2020

For the first question, I'll follow up over email :)

The second question could potentially be answered by @ewdurbin.

xmunoz on 12 Feb 2020

The malware-detection branch has been merged into master with PR #7377.

xmunoz on 19 Feb 2020
