Apktool: Some characters (surrogate pair) are missing in strings.xml

Created on 24 Feb 2020  ·  5 Comments  ·  Source: iBotPeaches/Apktool

Information

  1. Apktool Version (apktool -version) - 2.4.1
  2. Operating System (Mac, Linux, Windows) - Linux
  3. APK From? (Playstore, ROM, Other) - Other
    https://www.dropbox.com/s/pfa5u20jni5l3eb/app-debug-apktool.apk?dl=0

Stacktrace/Logcat


Steps to Reproduce

  1. Build an APK (not strictly necessary, since one is linked above) with these string values:
    <string name="sun_plus">&#x1f506;</string>
    <string name="sun_minus">&#x1f505;</string>
  2. Decode it with apktool:
java -jar apktool.jar d -s /tmp/app-debug.apk -o /tmp/demo-unpacked
I: Using Apktool 2.4.1 on app-debug.apk
I: Loading resource table...
I: Decoding AndroidManifest.xml with resources...
I: Loading resource table from file: /home/jarda/.local/share/apktool/framework/1.apk
I: Regular manifest package...
I: Decoding file-resources...
I: Decoding values */* XMLs...
I: Copying raw classes.dex file...
I: Copying assets and libs...
I: Copying unknown files...
I: Copying original files...
  3. Check the unpacked strings.xml:
    <string name="sun_minus" />
    <string name="sun_plus" />

Expected results

Non-empty string values:

    <string name="sun_plus">&#x1f506;</string>
    <string name="sun_minus">&#x1f505;</string>

Frameworks

null

APK

https://www.dropbox.com/s/pfa5u20jni5l3eb/app-debug-apktool.apk?dl=0
This is just a trivial app from the Android Studio "New Project" wizard. I only added this to strings.xml:

    <string name="sun_plus">&#x1f506;</string>
    <string name="sun_minus">&#x1f505;</string>

Questions to ask before submission

  1. Have you tried apktool d, apktool b without changing anything? just d
  2. If you are trying to install a modified apk, did you resign it? not necessary
  3. Are you using the latest apktool version? yes
Bug StringBlock

All 5 comments

I dug into this as JEB 3.13 suffered the same problem, albeit partially. The problem is fixed in JEB 3.14. I thought I'd add to this thread so that the Apktool authors can implement a fix as well.

Both of those strings were rendered as \uFFFD\uFFFD in XML in JEB. U+FFFD is the Unicode 'REPLACEMENT CHARACTER', used by converters (such as Java's String) as a placeholder for input that cannot be decoded. So something was there, but not what was expected.

Let's take 🔆 (HIGH BRIGHTNESS SYMBOL), Unicode code point U+1F506. In UTF-16, this character, being >= 0x10000 (outside the Basic Multilingual Plane), must be encoded as a surrogate pair, in this case: 0xD83D, 0xDD06.
Note that the standard UTF-8 encoding of 0x1F506 is a 4-byte sequence, the longest this encoding allows (4-byte sequences cover exactly the code points that need a surrogate pair in UTF-16): F0 9F 94 86.
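
The two encodings mentioned above can be checked with the JDK alone. This is a small sketch, not Apktool code; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class BrightnessEncodingDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "F0 9F 94 86".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns the UTF-16 code units of a code point as hex.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Returns the standard UTF-8 bytes of a code point as hex.
    static String utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x1F506)); // D83D DD06
        System.out.println(utf8Bytes(0x1F506));  // F0 9F 94 86
    }
}
```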

Now, let's examine resources.arsc. The encoded string is at offset 0xADA2, and specifically encoded using UTF-8, per the flag in the container StringTable:

02 06 ED A0 BD ED B4 86
^2 UTF-16 code units (surrogate pair)
   ^6 bytes follow: ED A0 BD ED B4 86
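
As a rough illustration of how the two leading bytes are read: the sketch below assumes the length encoding described in AOSP's ResourceTypes.h for UTF-8 string pools, where each length takes one byte, or two bytes when the high bit of the first is set. The class and method names are hypothetical and this is not Apktool's StringBlock code.

```java
class Utf8PoolEntrySketch {
    // Reads one length field. Returns {value, bytesConsumed}.
    // Assumption: one byte, or two bytes when the first byte has its high bit set.
    static int[] readLength(byte[] data, int offset) {
        int first = data[offset] & 0xFF;
        if ((first & 0x80) == 0) {
            return new int[] {first, 1};
        }
        int second = data[offset + 1] & 0xFF;
        return new int[] {((first & 0x7F) << 8) | second, 2};
    }

    // Returns {utf16Length, utf8ByteLength, payloadOffset} for the entry at offset.
    static int[] readHeader(byte[] data, int offset) {
        int[] utf16Len = readLength(data, offset);
        int[] utf8Len = readLength(data, offset + utf16Len[1]);
        int payloadOffset = offset + utf16Len[1] + utf8Len[1];
        return new int[] {utf16Len[0], utf8Len[0], payloadOffset};
    }

    public static void main(String[] args) {
        byte[] entry = {0x02, 0x06, (byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                        (byte) 0xED, (byte) 0xB4, (byte) 0x86};
        int[] header = readHeader(entry, 0);
        System.out.println("UTF-16 code units: " + header[0]); // 2
        System.out.println("UTF-8 byte count:  " + header[1]); // 6
        System.out.println("payload offset:    " + header[2]); // 2
    }
}
```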

Note the 6 bytes, where we expected 4. The sequence ED A0 BD ED B4 86 consists of two 3-byte sequences, each encoding one surrogate: 0xD83D and 0xDD06.

Knowing that the default/legacy encoding for strings in Android resources is UTF-16 (a flag needs to be set to indicate UTF-8), a possibility is that the arsc encoder first performed UTF-16 encoding of the string, and then UTF-8 encoding. The sequence 0xD83D, 0xDD06 was therefore converted to ED A0 BD ED B4 86 (endianness aside), instead of performing a straight encoding to F0 9F 94 86.
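
To make that hypothesis concrete, here is a minimal sketch of an encoder that applies the UTF-8 bit patterns to each UTF-16 code unit separately, without combining surrogates first. It is an illustration only (the class name is invented) and is not taken from the Android build tools.

```java
import java.io.ByteArrayOutputStream;

class PerCodeUnitUtf8Sketch {
    // Encodes every UTF-16 code unit on its own with the 1-, 2- or 3-byte
    // UTF-8 pattern. Surrogates are never merged into one code point, so a
    // supplementary character comes out as two 3-byte sequences.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+1F506 written as its surrogate pair.
        for (byte b : encode("\uD83D\uDD06")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // ED A0 BD ED B4 86
    }
}
```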

I am not familiar enough with the Unicode Transformation Format specifications to know exactly what restriction is in place, or why. It may be that non-canonical encodings, such as the direct encoding of surrogate characters, are not allowed. In any case, writing a custom UTF-8 decoder is the simplest way to work around this issue.
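
A custom decoder along those lines could look roughly like the sketch below. It is not the JEB or Apktool implementation; the class name is hypothetical, and for brevity it does not reject overlong forms. The key point is that 3-byte sequences are accepted even when they decode to a surrogate, so the two halves end up next to each other in the Java String and form the intended character.

```java
class SurrogateTolerantUtf8Decoder {
    // Returns the 6 payload bits of a continuation byte, or throws if the
    // byte is missing or is not of the form 10xxxxxx.
    static int cont(byte[] in, int i) {
        if (i >= in.length || (in[i] & 0xC0) != 0x80) {
            throw new IllegalArgumentException("Bad continuation byte at " + i);
        }
        return in[i] & 0x3F;
    }

    static String decode(byte[] in) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            if (b0 < 0x80) {                   // 1-byte sequence
                sb.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {  // 2-byte sequence
                sb.append((char) (((b0 & 0x1F) << 6) | cont(in, i + 1)));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {  // 3-byte sequence, surrogates allowed
                sb.append((char) (((b0 & 0x0F) << 12)
                        | (cont(in, i + 1) << 6) | cont(in, i + 2)));
                i += 3;
            } else if ((b0 & 0xF8) == 0xF0) {  // standard 4-byte sequence
                int cp = ((b0 & 0x07) << 18) | (cont(in, i + 1) << 12)
                        | (cont(in, i + 2) << 6) | cont(in, i + 3);
                sb.appendCodePoint(cp);
                i += 4;
            } else {
                throw new IllegalArgumentException("Bad lead byte at " + i);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fromArsc = {(byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                           (byte) 0xED, (byte) 0xB4, (byte) 0x86};
        String s = decode(fromArsc);
        System.out.printf("U+%X%n", s.codePointAt(0)); // U+1F506
    }
}
```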

Thank you for the information and highly detailed explanation. I'll need to read up on surrogates and refresh myself on our string decoders and take a stab at fixing this.

As @nfalliere wrote, the problem is that instead of using the 4-byte UTF-8 encoding for code points >= 0x10000 (as the UTF-8 definition specifies), Android encodes each half of the surrogate pair as its own 3-byte sequence.

I found this change in AOSP where support for 4-byte sequence decoding was added 5 years ago. I'm not sure if or how it is relevant to this...

I also found that the DEX format uses Modified UTF-8 which includes:
_Code points in the range U+10000 … U+10ffff are encoded as a surrogate pair, each of which is represented as a three-byte encoded value._
...
_MUTF-8 is actually closer to the (relatively less well-known) encoding CESU-8 than to UTF-8 per se._
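
The same behaviour can be observed on a desktop JVM: DataOutputStream.writeUTF uses the JDK's own "modified UTF-8", which also writes each surrogate as a separate 3-byte value. The sketch below only demonstrates that JDK behaviour; it says nothing about how the arsc string pool itself is produced, and the class name is made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Returns what writeUTF emits: a 2-byte big-endian byte count followed
    // by the string in the JDK's modified UTF-8.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // U+1F506 written as its surrogate pair.
        for (byte b : writeUtf("\uD83D\uDD06")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // 00 06 ED A0 BD ED B4 86
    }
}
```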

I'm not familiar with Unicode or UTF-8 encoding/decoding beyond what I read while working on this, so I started by trying to write a decoder myself and ended up with the above findings. I made a PR with a proposed fix that falls back to a CESU-8 decoder if decoding as UTF-8 fails.
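
The fallback idea can be sketched as follows. This is not the code from the PR; it assumes the running JDK provides a charset under the name "CESU-8" (OpenJDK does, but the Java specification does not guarantee it), and the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8WithCesu8Fallback {
    // Builds a decoder that reports malformed input instead of silently
    // substituting U+FFFD.
    static CharsetDecoder strict(Charset cs) {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // Tries strict UTF-8 first; on failure retries with CESU-8.
    static String decode(byte[] bytes) {
        try {
            return strict(StandardCharsets.UTF_8).decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException utf8Failure) {
            try {
                // Assumption: "CESU-8" is available in this JDK.
                return strict(Charset.forName("CESU-8")).decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException cesu8Failure) {
                throw new IllegalArgumentException("Neither UTF-8 nor CESU-8", cesu8Failure);
            }
        }
    }

    public static void main(String[] args) {
        byte[] fromArsc = {(byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                           (byte) 0xED, (byte) 0xB4, (byte) 0x86};
        System.out.printf("U+%X%n", decode(fromArsc).codePointAt(0));
    }
}
```

Well-formed UTF-8 input never reaches the second decoder, so only strings that strict UTF-8 rejects pay for the retry.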

Old

➜  2299 grep -i -r 'sun_' *
app-debug-apktool/res/values/public.xml:    <public type="string" name="sun_minus" id="0x7f0f0053" />
app-debug-apktool/res/values/public.xml:    <public type="string" name="sun_plus" id="0x7f0f0054" />
app-debug-apktool/res/values/strings.xml:    <string name="sun_minus" />
app-debug-apktool/res/values/strings.xml:    <string name="sun_plus" />
app-debug-apktool/smali/com/example/myapplication/R$string.smali:.field public static final sun_minus:I = 0x7f0f0053
app-debug-apktool/smali/com/example/myapplication/R$string.smali:.field public static final sun_plus:I = 0x7f0f0054

New

➜  2299 grep -i -r 'sun_' *
app-debug-apktool/res/values/public.xml:    <public type="string" name="sun_minus" id="0x7f0f0053" />
app-debug-apktool/res/values/public.xml:    <public type="string" name="sun_plus" id="0x7f0f0054" />
app-debug-apktool/res/values/strings.xml:    <string name="sun_minus">🔅</string>
app-debug-apktool/res/values/strings.xml:    <string name="sun_plus">🔆</string>
app-debug-apktool/smali/com/example/myapplication/R$string.smali:.field public static final sun_minus:I = 0x7f0f0053
app-debug-apktool/smali/com/example/myapplication/R$string.smali:.field public static final sun_plus:I = 0x7f0f0054

Thanks to @Comnir above. Might do some thinking about the emitting of warnings when the UTF-8 decoder fails and falls back to the new CESU-8.

➜  2299 apktool d app-debug-apktool.apk -f
I: Using Apktool 2.5.1-d1c006-SNAPSHOT on app-debug-apktool.apk
I: Loading resource table...
W: Failed to decode a string at offset 35379 of length 6
W: Failed to decode a string at offset 35388 of length 6

This will close on merge shortly.

Might do some thinking about the emitting of warnings when the UTF-8 decoder fails and falls back to the new CESU-8.

Yes... If there are multiple such strings, the user might be flooded with useless warnings.
Maybe it would be better to

  1. Only log the first time the decoding fails.
  2. Improve the message to clarify that there's a retry.

For example:
Failed to decode a string as UTF-8 at offset X of length Y. Will retry with CESU-8. Further similar failures won't be logged.
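
One way this suggestion could be sketched is shown below. The class, field and method names are hypothetical, and the list merely stands in for whatever logger Apktool actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class OneTimeDecodeWarning {
    // Flips to true on the first failure so later failures stay silent.
    private final AtomicBoolean warned = new AtomicBoolean(false);

    // Stand-in for a real logger, kept as a list so the behaviour is easy to inspect.
    final List<String> emitted = new ArrayList<>();

    // Call whenever strict UTF-8 decoding fails, before retrying with CESU-8.
    void onUtf8Failure(int offset, int length) {
        if (warned.compareAndSet(false, true)) {
            emitted.add(String.format(
                    "Failed to decode a string as UTF-8 at offset %d of length %d. "
                    + "Will retry with CESU-8. Further similar failures won't be logged.",
                    offset, length));
        }
    }

    public static void main(String[] args) {
        OneTimeDecodeWarning warning = new OneTimeDecodeWarning();
        warning.onUtf8Failure(35379, 6);
        warning.onUtf8Failure(35388, 6); // suppressed
        warning.emitted.forEach(System.out::println);
    }
}
```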
