Warehouse: Implement a more robust malware detector

Created on 6 Apr 2020 · 3 Comments · Source: pypa/warehouse

Hello there. I'm probably going to say a bunch of obvious things, sorry in advance :/

The current YARA-based malware detector can be circumvented easily:

It's regex-based, and the regexes don't account for all the leeway in writing Python (e.g. import builtins written with extra spaces between the two words will happily go undetected, because not every space in the rules has been marked as repeatable)
Even if it were AST-based, I'm afraid it would still be hard to tame this snake. I mean... one would think they've been thorough and then they realize timeit runs arbitrary strings through exec or that platform has a popen method... Did I mention that ().__class__.__bases__[0].__subclasses__()[88] is <class 'zipimport.zipimporter'>? (The exact index varies between interpreter versions.) I think it's endless...
That being said, maybe there IS such a thing as being thorough. I doubt it. Maybe detecting nearly all dunder methods AND unusual standard lib modules and functions AND a few builtins... Maybe a whitelist? I'm afraid this would make more noise than signal, but maybe we should try.
(For reference, https://ctf-wiki.github.io/ctf-wiki/pwn/linux/sandbox/python-sandbox-escape/)
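
To make the first two points concrete, here is a minimal sketch comparing a whitespace-sensitive regex with an AST walk. The rule NAIVE_RULE and the helpers regex_flags and ast_flags are hypothetical and written only for illustration; they are not Warehouse's actual YARA rules or checks.

```python
import ast
import re

# Hypothetical naive rule: exactly one space between "import" and the module.
NAIVE_RULE = re.compile(r"^import builtins\b", re.MULTILINE)


def regex_flags(source: str) -> bool:
    """Return True if the naive regex matches the source text."""
    return NAIVE_RULE.search(source) is not None


def ast_flags(source: str, module: str = "builtins") -> bool:
    """Return True if the parsed source contains a static import of `module`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name == module for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module == module:
            return True
    return False


plain = "import builtins\n"
spaced = "import   builtins\n"                 # extra spaces defeat the regex
dynamic = "m = __import__('built' + 'ins')\n"  # no import statement at all

for label, src in [("plain", plain), ("spaced", spaced), ("dynamic", dynamic)]:
    print(f"{label:8} regex={regex_flags(src)} ast={ast_flags(src)}")
```

The AST walk is indifferent to whitespace, so it catches the spaced variant that the regex misses, but the dynamic variant slips past both, which is the "endless" part of the problem.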

So... There is one remaining way to know what a script does: executing it in a sandboxed environment, but this raises questions too:

How to sandbox Python? My expertise there is close to zero, but I seem to recall PyPy (yes, with a y) could do that (and the idea of including PyPy in PyPI is a nice level of meta ;) )
Is it even possible to sandbox Python in a way that it doesn't know it's sandboxed? Because if it can figure out it's sandboxed, it can still deactivate the malicious parts, and then it's almost useless...
(One advantage of this approach would be to be able to extract metadata from sdists though, which I believe is another problem that exists out there)

So many questions... I hope this hasn't already been answered in another issue, I couldn't find anything when I searched.

Ping @xmunoz and @woodruffw to continue the discussion.

malware-detection

ewjoachim

All 3 comments

There was a fairly public effort, pysandbox, to create a "python sandbox" that was discontinued since it's really really [redacted] difficult to sandbox Python in-process.

More details are in this LWN article: https://lwn.net/Articles/574215/

pradyunsg on 6 Apr 2020

👍1

Thanks a lot! This goes in the direction we were heading I guess, leaving at least a few options that were suggested:

PyPy (but I’m afraid the execution context would be so different that it would make it trivially easy to detect the sandbox)
Solutions around seccomp and namespaces are hinted at, which I believe could point toward Docker. A bit of googling says I may have to read more about SELinux, SMACK, AppArmor, Tomoyo, and this feels like a rabbit hole :)

I have clearly reached my competency level, and continued a bit beyond. I’d love to learn more, but I won’t be able to suggest a lot, and at this point anything I might add will likely be laughable proof of the Dunning-Kruger effect...

ewjoachim on 6 Apr 2020

PEP 578 + the new audit API in Python 3.8 would probably work well for this purpose. We'd still need some amount of sandboxing, though.

woodruffw on 28 May 2020

👍2
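
As a rough illustration of the PEP 578 suggestion, the sketch below installs an audit hook with sys.addaudithook (Python 3.8+) and records a few runtime events while timeit runs a statement string. The WATCHED set and the audit_hook function are assumptions chosen for this example, not part of any Warehouse design.

```python
import sys
import timeit  # imported before the hook so its own import is not recorded

# Hypothetical selection of audit event names worth flagging.
WATCHED = {"compile", "exec", "open", "subprocess.Popen", "os.system"}
seen = []


def audit_hook(event, args):
    """Record the name of every watched audit event raised by the interpreter."""
    if event in WATCHED:
        seen.append(event)


# Note: hooks added this way cannot be removed for the life of the process.
sys.addaudithook(audit_hook)

# timeit compiles and executes the statement string it is given, so this
# innocent-looking call surfaces as "compile" and "exec" events.
timeit.timeit("pass", number=1)

print(sorted(set(seen)))
```

The hook observes what the code actually does rather than how it is spelled, which sidesteps the obfuscation problem. It still runs in the same process as the untrusted code, though, so it complements sandboxing rather than replacing it.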
