Modsecurity: homoglyphs translation to ASCII

Created on 17 Oct 2013 · 10 Comments · Source: SpiderLabs/ModSecurity

MODSEC-194: It would be useful to have a filter that converts all homoglyphs to their ASCII (or Latin?) equivalent.
This would be useful to stop SQL smuggling.
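
To make the smuggling concern concrete, here is a minimal Python sketch (not ModSecurity code). It assumes a back-end that folds fullwidth characters to ASCII; `unicodedata.normalize("NFKC", ...)` is only a stand-in for that best-fit behaviour, and `SQLI_PATTERN` and `fold_to_ascii` are hypothetical names made up for the illustration.

```python
import re
import unicodedata

# Naive ASCII-only signature, standing in for a WAF rule (hypothetical).
SQLI_PATTERN = re.compile(r"'\s*or\s+1\s*=\s*1", re.IGNORECASE)


def fold_to_ascii(text: str) -> str:
    """Fold compatibility characters (e.g. fullwidth forms) to their base form.

    NFKC is an assumption here: it approximates what a back-end's
    best-fit mapping might do, it is not what any given database does.
    """
    return unicodedata.normalize("NFKC", text)


# U+FF07 FULLWIDTH APOSTROPHE is used instead of the ASCII apostrophe.
payload = "admin\uff07 or 1=1 --"

print(bool(SQLI_PATTERN.search(payload)))                 # False: raw input slips past
print(bool(SQLI_PATTERN.search(fold_to_ascii(payload))))  # True: matches once folded
```

If the filter inspects the raw input while the back-end sees the folded form, the two disagree about what the request contains, which is the gap a homoglyph transformation would try to close.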

rcbarnett-zz

All 10 comments

Original reporter: marcstern

rcbarnett-zz on 17 Oct 2013

rbarnett: Agreed. Two comments -

1) We are looking into implementing something similar to Snort's unicode.map file for conversions
http://cvs.snort.org/viewcvs.cgi/checkout/snort/etc/unicode.map?rev=HEAD&content-type=text/plain

2) In the meantime, the latest CRS v2.1.1 includes the BETA advanced_filter_converter.lua script, which normalizes many of the same issues. It is a Lua port of the PHPIDS Converter.PHP logic, which combats many of these evasion attempts. The Lua script is used by the newly named modsecurity_crs_41_advanced_filters.conf file -
http://mod-security.svn.sourceforge.net/viewvc/mod-security/crs/trunk/experimental_rules/modsecurity_crs_41_advanced_filters.conf

rcbarnett-zz on 17 Oct 2013

marcstern: Also, extended characters like %u2329 should be supported. Currently, the higher byte is zeroed and only the lower byte is kept, which prevents these characters from being parsed correctly.
Should I open a new bug?
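
As a rough illustration of the %u2329 problem, the Python sketch below contrasts a decoder that reduces each `%uXXXX` code point to a single byte with one that consults a mapping table first. This is not ModSecurity's implementation: keeping the low byte is an assumption about the behaviour described above, and `BEST_FIT`, `decode_low_byte` and `decode_with_map` are hypothetical names with a deliberately tiny table.

```python
import re

# Tiny illustrative best-fit table (hypothetical; a real map would be far larger).
BEST_FIT = {
    0x2329: "<",  # LEFT-POINTING ANGLE BRACKET
    0x232A: ">",  # RIGHT-POINTING ANGLE BRACKET
    0xFF07: "'",  # FULLWIDTH APOSTROPHE
}

_PCT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def decode_low_byte(text: str) -> str:
    """Decode %uXXXX by keeping only the low byte of the code point."""
    return _PCT_U.sub(lambda m: chr(int(m.group(1), 16) & 0xFF), text)


def decode_with_map(text: str) -> str:
    """Decode %uXXXX through the table, falling back to the real character."""
    def repl(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        return BEST_FIT.get(code_point, chr(code_point))
    return _PCT_U.sub(repl, text)


sample = "%u2329script%u232A"
print(decode_low_byte(sample))  # )script*
print(decode_with_map(sample))  # <script>
```

With byte truncation the inspected value is `)script*`, while a back-end applying a best-fit mapping would see `<script>`, so a rule looking for angle brackets never fires.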

rcbarnett-zz on 17 Oct 2013

rbarnett: We might be able to extend t:urlDecodeUni to better handle this issue. For example, we could do different Unicode mappings using the data found here -

http://www.lookout.net/2010/12/20/list-of-characters-for-testing-unicode-transformations-and-best-fit-mapping-to-dangerous-ascii/
http://www.lookout.net/wp-content/uploads/2010/12/uni2asc.csv
http://www.lookout.net/wp-content/uploads/2010/12/bestfit.csv

rcbarnett-zz on 17 Oct 2013

@zimmerle why was this abandoned? It'd be cool to do homoglyph detection. Perhaps we can do this in a CRS rule. @dune73, thoughts?

csanders-git on 9 Jun 2017

👍1

Sure, I think it would be great to do this, but it sounds very tricky. It's certainly more flexible if done within a rule, but maybe it is too expensive and should be covered by ModSec itself.

Also I lack the know-how about much of this encoding, homoglyph stuff. So a couple of attacking payload examples would help me and probably some others to look at this from a practical viewpoint.

dune73 on 10 Jun 2017

👍1

I think I can help here.
There are several pre-requisites & limitations.

Pre-requisites:

1) Let's assume that only UTF-8 is used and that we block bad UTF-8 encoding (if you have to accept something else, I think it's game over).
2) We map all Unicode characters to US-ASCII:
SecUnicodeMapFile {...}/unicode.mapping 20127
3) We use t:utf8toUnicode (+ t:urlDecodeUni if needed).

Limitations:

1) The current file "unicode.mapping" is highly incomplete.
We have an extended version (more or less exhaustive) that I generated automatically and updated manually.
This file is not public yet because I consider it potentially not 100% correct, and I don't want to hand attackers information that we use in highly sensitive environments.
It needs to be reviewed by several people but, most of all, the mapping principle should be validated: which characters should be mapped? For accented letters it's obvious, but what about Greek characters, for instance? Should they be mapped to a letter? What about the characters 02C5 (MODIFIER LETTER DOWN ARROWHEAD) & 02C7 (CARON)? Should they be mapped to a V?
To answer that, I think we need an exhaustive list of the back-end systems (app servers and DBs) that perform this kind of mapping, and to adapt the list accordingly.
Potentially, we need to create several entries, one for each back-end.
If we can construct complete requirements, I'll complete it and share it with everybody.
2) If the code mappings differ depending on the back-end, we can only support one back-end per WAF, as SecUnicodeMapFile is a global setting.
3) Even if all the above points are solved, htmlEntityDecode does not support extended characters. We should extend it to have a complete solution: either it should be automatic when using utf8toUnicode (like urlDecodeUni) or, potentially, we add a new transformation "htmlEntityDecodeUni".
4) Unless there's an optimisation performed in htmlEntityDecode, we (maybe) need to use utf8toUnicode twice:
t:utf8toUnicode,t:urlDecodeUni,t:htmlEntityDecode,t:utf8toUnicode
because a Unicode character could be encoded as an HTML entity, as well as the opposite. This is to be validated (our parsing is maybe paranoid).
5) The discussion in point 4 should also be validated for sqlHexDecode.
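
The layering concern (an entity hiding a %u sequence, or the reverse) can be sketched in Python as a decode loop that runs until the value stops changing. This is only an illustration under assumptions: `html.unescape` and NFKC stand in for htmlEntityDecode and the Unicode map, and `url_decode_uni`, `normalize_once` and `normalize_fully` are hypothetical helpers, not ModSecurity transformations.

```python
import html
import re
import unicodedata

_PCT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def url_decode_uni(text: str) -> str:
    """Decode %uXXXX sequences to the full code point (no byte truncation)."""
    return _PCT_U.sub(lambda m: chr(int(m.group(1), 16)), text)


def normalize_once(text: str) -> str:
    """One pass: %u decoding, then HTML entities, then compatibility folding."""
    text = url_decode_uni(text)
    text = html.unescape(text)
    return unicodedata.normalize("NFKC", text)


def normalize_fully(text: str, max_rounds: int = 5) -> str:
    """Repeat the pass until nothing changes, capped at max_rounds."""
    for _ in range(max_rounds):
        decoded = normalize_once(text)
        if decoded == text:
            break
        text = decoded
    return text


# The entity &#x25; hides the percent sign of a %u sequence.
layered = "&#x25;uFF07"
print(normalize_once(layered))   # %uFF07
print(normalize_fully(layered))  # '
```

A single pass leaves `%uFF07` behind, and only the second pass reaches the apostrophe. Whether repeated decoding matches what a real back-end does is exactly the open question raised above, so the loop may well be more paranoid than necessary.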

marcstern on 12 Jun 2017

👍2

hmm yeah these are some good points... the transformation system as it exists is kinda not great is it... just not sure of other options. likewise good points need to be made about updating the unicode mapping file, i'm gonna link this issue in an open CRS bug we have on that matter.

csanders-git on 13 Jun 2017

Maybe the update to the unicode.map could be eased with something like CLDR transforms like Cyrillic->Latin
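
As a sketch of how a look-alike table could be turned into map entries, the Python below folds a few Cyrillic homoglyphs to Latin and prints them as `codepoint:byte` hex pairs. The table is hand-picked for illustration and is not CLDR output; the pair format is an assumption about how a mapping file might list entries and should be checked against the real unicode.mapping before use. `CYRILLIC_HOMOGLYPHS`, `mapping_pairs` and `fold` are hypothetical names.

```python
# Hand-picked visual look-alikes (illustrative only, not CLDR output).
CYRILLIC_HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def mapping_pairs(table: dict) -> list:
    """Render the table as 'codepoint:byte' hex pairs, sorted by code point."""
    return [f"{ord(src):04x}:{ord(dst):02x}" for src, dst in sorted(table.items())]


def fold(text: str, table: dict) -> str:
    """Replace every character found in the table with its ASCII look-alike."""
    return text.translate(str.maketrans(table))


print(" ".join(mapping_pairs(CYRILLIC_HOMOGLYPHS)))
# 0430:61 0435:65 043e:6f 0440:70 0441:63 0445:78
print(fold("s\u0435l\u0435\u0441t", CYRILLIC_HOMOGLYPHS))  # select
```

One caveat on the design choice: a Cyrillic-to-Latin transliteration is sound-based, so U+0440 (er) would become "r" even though it looks like "p". Whether the map should follow appearance or transliteration depends on what the back-end actually does.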

The fact that SecUnicodeMapFile is a global setting is a limitation indeed, but I think something like this can work for some scenarios:

<Location "/mysite/english/home/">
SecUnicodeMapFile unicode.mapping 1215
</Location>

<Location "/mysite/russian/home/">
SecUnicodeMapFile unicode.mapping 20127
</Location>