Python Unicode Encoding Bugs

It’s a tricky business to work with applications that pass around Unicode strings as ASCII byte strings. Small bugs can lead to lost data or data that is processed differently by different programs. Here’s an example.

One program outputs the following Unicode string representation of a Tweet:

C:\\Documents and Settings\\u30e6\u30fc\u30b6

If you process this string in Python, it will not be interpreted correctly. You can follow along in the Python interpreter (the session below is from Python 2; there is only a space between “and” and “Settings”, not a line break):

>>> s = u'C:\\Documents and Settings\\u30e6\u30fc\u30b6'

>>> s.encode('ascii', errors='replace')

'C:\\Documents and Settings\\u30e6??'

>>> len(s.encode('ascii', errors='replace'))

33

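When the Python Unicode string is encoded to ASCII, the bug is easy to see. The \u30e6 is never converted to a Unicode character, because Python parses the string literal from left to right: it reads the \\ in front of u30e6 as a single escaped backslash, which leaves u30e6 as five ordinary characters rather than a \u escape. The other two escapes, \u30fc and \u30b6, are decoded into real characters and then replaced with ? because ASCII cannot represent them. That is why the result is 33 bytes long.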
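The left-to-right parsing can be checked directly. The following is a minimal sketch in Python 3 rather than the Python 2 used in the session, so the literal has no u prefix and encode() returns bytes. The name intended is hypothetical: it assumes the producing program meant a backslash followed by three katakana characters, which the original output does not confirm.

```python
# Python 3 sketch; str literals are already Unicode, so no u'' prefix is needed.
s = 'C:\\Documents and Settings\\u30e6\u30fc\u30b6'

# "\\" is one escaped backslash, so the "u30e6" after it is five plain characters.
assert s[-8:-2] == '\\u30e6'      # backslash followed by the text u30e6
# The last two escapes were decoded into single katakana characters.
assert s[-2:] == '\u30fc\u30b6'

# Only the two real non-ASCII characters are replaced with "?".
encoded = s.encode('ascii', errors='replace')
print(encoded)        # b'C:\\Documents and Settings\\u30e6??'
print(len(encoded))   # 33

# Assumption: the intended value was a backslash plus three katakana characters.
intended = 'C:\\Documents and Settings\\' + '\u30e6\u30fc\u30b6'
print(len(s), len(intended))   # 33 29
```

The four-character difference in length is the text u30e plus 6 standing in for what was probably meant to be one character.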

The moral of the story: don’t pass data this way. Store it in a proper Unicode encoding, and pass it along with the name of that encoding and, where it matters, its byte order (endianness).
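As a rough illustration of that advice, here is a Python 3 sketch that sends the text as real encoded bytes instead of ASCII escape sequences. The variable names are made up for this example, and the path value is the assumed intended string, not something the original program is known to have produced.

```python
import codecs

# Assumed intended text: a backslash followed by three katakana characters.
path = 'C:\\Documents and Settings\\' + '\u30e6\u30fc\u30b6'

# UTF-8 has no byte-order question; the receiver only needs the encoding name.
utf8_bytes = path.encode('utf-8')
assert utf8_bytes.decode('utf-8') == path

# UTF-16 exists in two byte orders, so the receiver must know which one it has.
utf16_le = path.encode('utf-16-le')
utf16_be = path.encode('utf-16-be')
assert utf16_le != utf16_be
assert utf16_le.decode('utf-16-le') == path
assert utf16_le.decode('utf-16-be') != path   # wrong byte order gives wrong text

# The plain 'utf-16' codec writes a byte order mark (BOM) first,
# which lets the decoder work out the byte order by itself.
utf16_bom = path.encode('utf-16')
assert utf16_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert utf16_bom.decode('utf-16') == path
```

Either approach keeps every character intact; the difference is only whether the byte order travels inside the data (the BOM) or alongside it as metadata.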
