Consulting Toolsmiths - Matt Doar: unicode

Wednesday, July 22, 2009

Python Unicode oddness

If a Python byte string contains a byte that is invalid in the encoding you decode it with, then neighboring valid characters may get removed unexpectedly. For instance:


>>> a = 'FOO\xe0BAR'
>>> print '%r' % a
'FOO\xe0BAR'

>>> print '%r' % unicode(a, 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data

Fair enough, \xe0 on its own isn't valid UTF-8. But if I tell the decoder to replace or ignore anything it can't decode, it also eats the "B" and "A" characters!


>>> print '%r' % unicode(a, 'utf8', 'replace')
u'FOO\ufffdR'
>>> print '%r' % unicode(a, 'utf8', 'ignore')
u'FOOR'

That was unexpected. I'm sure that one day someone will explain to me why it happens.
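
A plausible explanation, offered as a sketch rather than a confirmed account of the Python 2.5 decoder: the error message reports invalid data at positions 3-5, which is three bytes, and in UTF-8 any byte in the range 0xE0-0xEF is the lead byte of a three-byte sequence. The decoder appears to have treated the whole announced sequence (\xe0, "B", "A") as one bad unit and then replaced or dropped all of it. The helper `claimed_length` below is hypothetical and written only for illustration, and the example assumes Python 3, where the decoder discards only the offending byte.

```python
def claimed_length(lead_byte):
    """Return the sequence length a UTF-8 lead byte announces (0 if not a valid lead)."""
    if lead_byte <= 0x7F:
        return 1          # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return 0              # continuation byte, or a byte that is never a valid lead


data = b'FOO\xe0BAR'

# \xe0 sits at index 3 and announces a three-byte sequence: positions 3-5.
span = claimed_length(data[3])
claimed = data[3:3 + span]

# Dropping the whole announced span reproduces the 'ignore' result from Python 2.5.
old_style_ignore = (data[:3] + data[3 + span:]).decode('ascii')

# Python 3 drops or replaces only the byte that is actually invalid.
new_ignore = data.decode('utf-8', 'ignore')
new_replace = data.decode('utf-8', 'replace')

print(span, claimed, old_style_ignore, new_ignore, ascii(new_replace))
```

Under this reading, "B" and "A" were never invalid themselves. They were discarded because \xe0 promised two continuation bytes that never arrived.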
