
Wednesday, July 22, 2009

Python Unicode oddness

If you have a Python byte string containing a byte that is invalid in the encoding you decode it with, then the perfectly valid characters that follow it may get removed unexpectedly. For instance:


>>> a = 'FOO\xe0BAR'
>>> print '%r' % a
'FOO\xe0BAR'

>>> print '%r' % unicode(a, 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data


Fair enough, \xe0 on its own isn't a valid UTF-8 sequence. But if I tell the decoder to replace or ignore whatever it doesn't understand, it also eats the "B" and "A" characters!


>>> print '%r' % unicode(a, 'utf8', 'replace')
u'FOO\ufffdR'
>>> print '%r' % unicode(a, 'utf8', 'ignore')
u'FOOR'


That was unexpected. One day I'm sure someone will explain to me why it happens.
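
A likely explanation, offered as a sketch rather than a definitive account of the decoder's internals: in UTF-8, a byte in the range 0xE0-0xEF is a lead byte that announces a three-byte sequence. The error message above reports "position 3-5", which suggests this decoder treats the lead byte plus the two bytes after it as one invalid unit, so "B" and "A" go with it. The helper below, `announced_length`, is a hypothetical name of my own and not part of the standard library. It is written in Python 3 syntax and only computes how many bytes a lead byte claims.

```python
# Sketch: how many bytes a UTF-8 lead byte announces.
# `announced_length` is a hypothetical helper, not a standard library function.

def announced_length(lead):
    """Return the sequence length a UTF-8 lead byte claims,
    or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: two-byte sequence
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx: three-byte sequence
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: four-byte sequence
    return 0      # continuation byte or invalid lead byte

data = b'FOO\xe0BAR'
bad = data.index(b'\xe0')             # offset of the offending byte
length = announced_length(data[bad])  # indexing bytes gives an int in Python 3

print(bad, length)                    # 3 3
print(data[bad:bad + length])         # b'\xe0BA', i.e. positions 3-5
```

For comparison, the decoder in current Python 3 releases only replaces the offending byte, so `data.decode('utf-8', 'replace')` gives 'FOO\ufffdBAR' there and "B" and "A" survive.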