Wednesday, July 22, 2009

Python Unicode oddness

If a Python byte string contains a byte that is invalid in the encoding you decode it with, the decoder may also remove neighbouring characters unexpectedly. For instance:


>>> a = 'FOO\xe0BAR'
>>> print '%r' % a
'FOO\xe0BAR'

>>> print '%r' % unicode(a, 'utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data


Fair enough, \xe0 on its own isn't a valid UTF-8 character. But if I tell the decoder to replace or ignore anything it doesn't understand, it also eats the "B" and "A" characters!


>>> print '%r' % unicode(a, 'utf8', 'replace')
u'FOO\ufffdR'
>>> print '%r' % unicode(a, 'utf8', 'ignore')
u'FOOR'


That was unexpected. One day I'm sure someone will explain to me why it happens.
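
For readers reproducing this today, here is a rough Python 3 counterpart of the session above. It is only a sketch: it assumes a recent CPython 3 interpreter, where `unicode(a, 'utf8', ...)` becomes `bytes.decode(...)`. On such interpreters the decoder is expected to flag only the offending byte, so the "B" and "A" are kept and the results differ from the Python 2.5 output shown above.

```python
raw = b'FOO\xe0BAR'  # bytes literal: the Python 3 counterpart of the str above

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('strict decoding fails:', exc)

replaced = raw.decode('utf-8', 'replace')
ignored = raw.decode('utf-8', 'ignore')

# ascii() keeps U+FFFD visible as an escape instead of printing the glyph
print(ascii(replaced))
print(ascii(ignored))
```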

3 comments:

NicDumZ said...

Actually, the explanation is quite simple ;)
As you might know, UTF8 is a variable-length encoding. A character can be encoded in 1, 2, 3 or 4 bytes.

How does a decoder know the length of the character it's decoding? It just looks at the first byte. See the second table here.
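
To make the "look at the first byte" rule concrete, here is a minimal Python 3 sketch. The function name `utf8_sequence_length` is made up for illustration and is not part of any library. It only inspects the bit pattern of the first byte, and it does not reject the few lead bytes (0xC0 and 0xC1, for example) that UTF-8 forbids for other reasons.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte, or None."""
    if lead_byte < 0x80:          # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if 0xC0 <= lead_byte < 0xE0:  # 110xxxxx: start of a 2-byte sequence
        return 2
    if 0xE0 <= lead_byte < 0xF0:  # 1110xxxx: start of a 3-byte sequence
        return 3
    if 0xF0 <= lead_byte < 0xF8:  # 11110xxx: start of a 4-byte sequence
        return 4
    return None                   # 10xxxxxx (continuation byte) or invalid


for byte in (0x42, 0xC2, 0xE0, 0xF0, 0xA9):
    print(hex(byte), utf8_sequence_length(byte))
```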

In your case, "E0" as a first byte implies that the character is encoded in 3 bytes. The decoder then tries to interpret the sequence "\xe0BA" as a single Unicode code point. That is in fact 0xe0,0x42,0x41, which is not a valid 3-byte character, since the second and third bytes are not in the 80-BF range required for the continuation bytes of a multi-byte sequence.
Python then assumes that all 3 bytes are wrong, and discards them.

More "fun": "\xf0BAR"->u'' ; "\xc2BAR"->u'AR'
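
The behaviour described here can be modelled in a few lines of Python 3. This is only a sketch of the "trust the first byte, then throw away the whole announced sequence" strategy, not the actual code of Python's utf8 codec. The function name `greedy_utf8_decode` is hypothetical.

```python
def greedy_utf8_decode(data, errors='ignore'):
    """Decode bytes, dropping the whole announced sequence on any error."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if 0xC0 <= lead < 0xE0:
            length = 2
        elif 0xE0 <= lead < 0xF0:
            length = 3
        elif 0xF0 <= lead < 0xF8:
            length = 4
        else:
            length = 1  # ASCII, stray continuation byte, or invalid lead
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode('utf-8'))  # strict decode of one sequence
        except UnicodeDecodeError:
            if errors == 'replace':
                out.append('\ufffd')
            # with 'ignore' the chunk is simply dropped
        i += length  # skip everything the lead byte announced
    return ''.join(out)


print(ascii(greedy_utf8_decode(b'FOO\xe0BAR')))             # 'FOOR'
print(ascii(greedy_utf8_decode(b'FOO\xe0BAR', 'replace')))  # 'FOO\ufffdR'
print(ascii(greedy_utf8_decode(b'\xf0BAR')))                # ''
print(ascii(greedy_utf8_decode(b'\xc2BAR')))                # 'AR'
```

The four calls reproduce the results quoted in the post and in this comment, which suggests the model is at least consistent with what the codec was doing.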

Cheers!

Matt Doar said...

Merci! Still, discarding those 1, 2 or 3 following bytes is a bit unexpected. I checked that using FOO\xeXYZ does the same thing, so it has nothing to do with B and A being valid hex chars either.


>>> a = 'FOO\xf0XYZT'
>>> print '%r' % unicode(a, 'utf8', 'ignore')
u'FOOT'

NicDumZ said...

Sure!

"B" in UTF8 is encoded on one byte ( hex(ord("B")) ). "B" is represented by the single one-byte sequence "0x42".
Similarly, "A" is "0x41".

So, when you feed Python "\xe0BA" and tell it the data is utf8, all it sees is a 3-byte sequence: 0xe0,0x42,0x41. Because the first byte indicates the start of a 3-byte-long character in utf8, it gets confused when it finds "0x42" and "0x41" where continuation bytes should be.
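
A small Python 3 sketch of the check that fails here: every byte after the first one in a multi-byte sequence must have the bit pattern 10xxxxxx, i.e. lie in the 80-BF range. The helper name `is_continuation_byte` is made up for this example.

```python
def is_continuation_byte(byte):
    """True if byte matches 10xxxxxx, i.e. lies in the range 0x80-0xBF."""
    return (byte & 0xC0) == 0x80


sequence = b'\xe0BA'  # lead byte 0xe0 followed by "B" (0x42) and "A" (0x41)
for byte in sequence[1:]:
    print(hex(byte), is_continuation_byte(byte))  # both print False
```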

Maybe it's clearer if I use 2-byte long characters?

>>> unicode('\xe0éXYZ', 'utf8', 'ignore')
u'XYZ'

See, when you feed Python "é", it's a 2-byte sequence. Not convinced?
>>> len('é')
2
>>> [hex(ord(c)) for c in 'é']
['0xc3', '0xa9']

So, again, '\xe0é' is simply the 3-byte sequence 0xe0,0xc3,0xa9 in utf8. And it does not make sense as a utf8 character.
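
The sessions above are Python 2, where a plain string literal already holds bytes. A minimal Python 3 sketch of the same check has to encode 'é' explicitly. It assumes the precomposed character U+00E9, and the variable names are only illustrative.

```python
e_acute = '\u00e9'.encode('utf-8')   # 'é' as UTF-8 bytes
print(len(e_acute))                  # 2
print([hex(b) for b in e_acute])     # ['0xc3', '0xa9']

candidate = b'\xe0' + e_acute        # the bytes 0xe0, 0xc3, 0xa9
rejected = False
try:
    candidate.decode('utf-8')        # strict decoding of the 3 bytes
except UnicodeDecodeError as exc:
    rejected = True
    print('not valid UTF-8:', exc.reason)
```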

More complex?
>>> unicode('\xc2éé', 'utf8', 'ignore')
u'\xe9'
Strange, isn't it?
'\xc2éé' is 0xc2,0xc3,0xa9,0xc3,0xa9 in utf8. 0xc2 indicates the start of a 2-byte-long sequence. The utf8 codec tries to decode 0xc2,0xc3 but chokes, and discards those two bytes. Then 0xa9 is read. It cannot be the start of a utf8 character, so it is discarded.
Only 0xc3,0xa9 are left, 'é', and it gets decoded correctly (u'\xe9' is a barbarian notation, but it correctly represents 'é' in Unicode)
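
The walk-through above can be replayed step by step with a short Python 3 sketch. It is a model of the discarding strategy described in these comments, not the codec's real implementation, and `trace_greedy_decode` is a hypothetical name. Each step records the bytes that were consumed and the character they produced, or None when they were discarded.

```python
def trace_greedy_decode(data):
    """Return a list of (consumed_bytes, decoded_char_or_None) steps."""
    steps = []
    i = 0
    while i < len(data):
        lead = data[i]
        if 0xC0 <= lead < 0xE0:
            length = 2
        elif 0xE0 <= lead < 0xF0:
            length = 3
        elif 0xF0 <= lead < 0xF8:
            length = 4
        else:
            length = 1  # ASCII, stray continuation byte, or invalid lead
        chunk = data[i:i + length]
        try:
            steps.append((chunk, chunk.decode('utf-8')))
        except UnicodeDecodeError:
            steps.append((chunk, None))  # whole chunk is thrown away
        i += length
    return steps


for consumed, char in trace_greedy_decode(b'\xc2\xc3\xa9\xc3\xa9'):
    print(consumed, '->', 'discarded' if char is None else ascii(char))
```

The trace shows three steps: 0xc2,0xc3 discarded, 0xa9 discarded, and 0xc3,0xa9 decoded to 'é'.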

Cheers!