If a Python byte string contains a byte that is invalid in the encoding you decode it with, the decoder can drop neighbouring valid characters as well. For instance:
>>> a = 'FOO\xe0BAR'
>>> print '%r' % a
'FOO\xe0BAR'
>>> print '%r' % unicode(a, 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
Fair enough, a lone \xe0 byte isn't valid UTF-8. But if I tell the decoder to just replace or ignore anything it doesn't understand, it also eats the "B" and "A" characters!
>>> print '%r' % unicode(a, 'utf8', 'replace')
u'FOO\ufffdR'
>>> print '%r' % unicode(a, 'utf8', 'ignore')
u'FOOR'
That was unexpected. One day I'm sure someone will explain to me why it happens.
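One plausible explanation, offered as a guess and not as a confirmed reading of the codec source: in UTF-8, a byte in the range 0xE0 to 0xEF announces a three-byte sequence, and the error message above reports positions 3-5, which is exactly three bytes. If the decoder treats the whole announced sequence as one bad unit, then "B" and "A" go with it. The sketch below models that behaviour. It is written for Python 3 (where indexing bytes gives integers), and the names expected_length and greedy_ignore are hypothetical helpers made up for illustration, not part of any library.

```python
def expected_length(lead_byte):
    """Sequence length a UTF-8 lead byte announces (0 if it cannot start one)."""
    if lead_byte < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # \xe0 falls here
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return 0


def greedy_ignore(data):
    """Assumed model of the behaviour seen above: when a sequence is
    malformed, skip every byte the lead byte claimed, valid or not."""
    out = []
    i = 0
    while i < len(data):
        n = max(expected_length(data[i]), 1)
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode('utf-8'))
        except UnicodeDecodeError:
            pass  # drop the whole claimed sequence
        i += n
    return ''.join(out)


print(repr(greedy_ignore(b'FOO\xe0BAR')))
```

Under this model the three bytes \xe0, "B" and "A" are discarded together, which matches the u'FOOR' result in the session above. A decoder that skipped only the offending byte would keep "BAR" intact.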