Richard Kettlewell (ewx) wrote 2009-06-18 02:22 pm

More Python character encoding braindamage

rjk@rackheath:~$ python -c 'print "møøse"'
møøse
rjk@rackheath:~$ python -c 'print u"møøse"'
mÃ¸Ã¸se
rjk@rackheath:~$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
rjk@rackheath:~$ python -V
Python 2.5.2

(...but it's the same in 2.6.x.)

I think it's interpreting each byte in the (actually UTF-8 [1]) input string as a Unicode code point in its own right, though another (philosophically different but pragmatically essentially identical) possibility is that it thinks all input strings are encoded using ISO-8859-1.
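A minimal sketch of those two readings, written for Python 3 rather than the Python 2.5 in the transcript (an assumption, made so it runs on a current interpreter; the variable names are illustrative only). Taking each UTF-8 byte as a code point and decoding the bytes as ISO-8859-1 should give the same string, because ISO-8859-1 maps bytes 0x00-0xFF straight onto U+0000-U+00FF.

```python
# Python 3 sketch; names here are illustrative, not from the original post.
utf8_bytes = "møøse".encode("utf-8")  # the bytes a UTF-8 terminal hands over

# Reading 1: treat every byte as a Unicode code point in its own right.
per_byte = "".join(chr(b) for b in utf8_bytes)

# Reading 2: assume the input is encoded as ISO-8859-1.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# What LC_CTYPE=en_GB.UTF-8 says the bytes actually are.
as_utf8 = utf8_bytes.decode("utf-8")

# ascii() keeps the output printable whatever the terminal encoding is.
print(per_byte == as_latin1)         # True
print(ascii(as_latin1))              # 'm\xc3\xb8\xc3\xb8se'
print(len(as_latin1), len(as_utf8))  # 7 5
```

Either way each ø turns into two characters, which is what shows up as mojibake once the string is written back out as UTF-8.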

[1] And don't give me the nonsense someone came up with last time about Python having no way to know how the input is encoded. It does have a way: that's what LC_CTYPE is for.
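For what it's worth, here is roughly how a program can ask the locale for that encoding using the standard `locale` module. This is a Python 3 sketch, not a claim about what the interpreter does internally, and `locale_encoding` and `decode_input` are hypothetical helper names made up for the example.

```python
import locale


def locale_encoding():
    """Return the character encoding implied by the environment's locale."""
    try:
        # Adopt LC_ALL / LC_CTYPE / LANG from the environment for this process.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system lacks;
        # carry on with whatever locale is already in effect.
        pass
    # False: do not call setlocale again, just report the current setting.
    return locale.getpreferredencoding(False)


def decode_input(raw, encoding=None):
    """Decode raw input bytes with the locale's encoding unless one is given."""
    return raw.decode(encoding or locale_encoding())


print(locale_encoding())
```

With the `en_GB.UTF-8` settings shown in the transcript this would be expected to report a UTF-8 encoding, though the exact spelling of the name varies by platform.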

geekette8.livejournal.com 2009-06-18 01:26 pm (UTC)
Funnily, I was just reading http://en.wikipedia.org/wiki/Mojibake about 10 minutes ago.

imc.livejournal.com 2009-06-18 10:27 pm (UTC)
I can see some merit in the argument that the source file should specify its own encoding, because if I write my source file in ISO-8859-1 it shouldn't break when I send it to you just because you have LANG=en_GB.UTF-8.

But that doesn't really explain this:
$ LANG=en_GB.ISO8859-1 python -c 'print u"møøse"'
møøse
$ LANG=POSIX python -c 'print u"møøse"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
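One hedged way to read that traceback: the "position 1-2" range depends on how many non-ASCII characters the string holds when it is encoded for output. The Python 3 sketch below (the helper name `ascii_failure_span` is made up for illustration) compares a correctly decoded string with one whose UTF-8 bytes were read as ISO-8859-1. Which of the two a given terminal produces depends on the encoding the terminal itself sends, which the transcript does not show.

```python
def ascii_failure_span(text):
    """Return (first, last) indices of the first run ASCII cannot encode."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        # err.end is exclusive; the traceback message prints end - 1.
        return err.start, err.end - 1
    return None


correct = "møøse"
misdecoded = correct.encode("utf-8").decode("iso-8859-1")

print(ascii_failure_span(correct))     # (1, 2)
print(ascii_failure_span(misdecoded))  # (1, 4)

# Writing the mis-decoded string back out as ISO-8859-1 restores the
# original UTF-8 bytes, so the two mistakes can cancel out on output.
print(misdecoded.encode("iso-8859-1") == correct.encode("utf-8"))  # True
```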