ewx | More Python character encoding braindamage

rjk@rackheath:~$ python -c 'print "møøse"'
møøse
rjk@rackheath:~$ python -c 'print u"møøse"'
mÃ¸Ã¸se
rjk@rackheath:~$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
rjk@rackheath:~$ python -V
Python 2.5.2

(...but it's the same in 2.6.x.)

I think it's interpreting each byte in the (actually UTF-8¹) input string as a Unicode code point in its own right, though another (philosophically different but pragmatically essentially identical) possibility is that it thinks all input strings are encoded using ISO-8859-1.
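
That guess is easy to check by reproducing the mangling by hand. What follows is only a sketch of the hypothesis, not a trace of what the interpreter really does internally. It is written to run on a current Python 3, and the variable names are made up for illustration.

```python
# "møøse", written with escapes so this source file stays pure ASCII.
original = u"m\u00f8\u00f8se"

# What the shell hands to `python -c` in a UTF-8 locale: each "ø" is 0xC3 0xB8.
utf8_bytes = original.encode("utf-8")

# Hypothesis: every byte is taken as a code point in its own right,
# which is exactly what decoding as ISO-8859-1 does.
misread = utf8_bytes.decode("iso-8859-1")

# Print code points rather than the text so the output is plain ASCII.
print([hex(ord(c)) for c in misread])
# ['0x6d', '0xc3', '0xb8', '0xc3', '0xb8', '0x73', '0x65']

# Five characters went in, seven came out: U+00C3 U+00B8 is the "Ã¸" pair
# seen in the transcript above.
assert len(original) == 5 and len(misread) == 7
```

The mangling is reversible (encode as ISO-8859-1, decode as UTF-8), which is consistent with either reading of what Python is doing.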

¹ and don't give me the nonsense someone came up with last time about Python having no way to know how the input is encoded. It does have a way, that's what LC_CTYPE is for.
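
For what it's worth, asking the C library which character set LC_CTYPE implies takes only a few lines. This is a sketch using the standard `locale` module on a current Python 3; `ctype_codeset` is a hypothetical helper name, and nothing here claims this is what Python 2.5 does (or should do) when it reads `-c` input.

```python
import locale


def ctype_codeset():
    """Return the character encoding implied by the user's LC_CTYPE setting."""
    try:
        # "" means: take the setting from the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system doesn't have; keep the default.
        pass
    # False: don't let getpreferredencoding() call setlocale() again itself.
    return locale.getpreferredencoding(False)


codeset = ctype_codeset()
print(codeset)
```

With the locale settings shown above one would expect this to report UTF-8, though the exact spelling of the name varies between platforms and Python versions.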

From:

geekette8.livejournal.com

Funnily, I was just reading http://en.wikipedia.org/wiki/Mojibake about 10 minutes ago.

From:

imc.livejournal.com

I can see some merit in the argument that the source file should specify its own encoding, because if I write my source file in ISO-8859-1 it shouldn't break when I send it to you just because you have LANG=en_GB.UTF-8.

But that doesn't really explain this:

$ LANG=en_GB.ISO8859-1 python -c 'print u"møøse"'
møøse
$ LANG=POSIX python -c 'print u"møøse"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

From:

ewx.livejournal.com

I can't tell what encoding your terminal used, so it's hard to interpret the first result with certainty, although it seems to have done what one would have hoped for.

The second result seems clear though: you explicitly asked for ASCII only and tried to print something non-ASCII.
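
The same error can be provoked without touching the locale at all, by asking for the ASCII codec explicitly. A minimal sketch, assuming (as the traceback suggests) that LANG=POSIX simply makes Python pick ASCII when encoding the output; the variable names are illustrative.

```python
# "møøse" with escapes, so the example does not depend on the source encoding.
text = u"m\u00f8\u00f8se"

error = None
try:
    # Roughly what print has to do when the output encoding is ASCII.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# The exception records which codec failed and where: the two "ø"
# characters sit at indices 1 and 2 (end is exclusive, hence 3).
print(error.encoding, error.start, error.end)
```

Note that the failure is on the way out: the Unicode string itself was built without complaint.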

From:

imc.livejournal.com

My terminal is iso-8859-1, so yes my first example worked.

My second example was misguided. It was meant to demonstrate that Python does sometimes obey the value of LANG. However, the error turns out to be about encoding the output, not about decoding the input. In this case it looks like Python is defaulting to a source encoding of ISO-8859-1.

However, that only applies for programs executed via "-c"; the default for source read from a file seems to be ASCII (unless the file starts with a UTF-8 byte-order mark).

$ cat moose.py
#!/usr/bin/python
print u"møøse"
$ LANG=en_GB.ISO8859-1 python moose.py
  File "moose.py", line 2
SyntaxError: Non-ASCII character '\xf8' in file moose.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
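
The fix that error message points at is a PEP 263 declaration near the top of the file. Here is a sketch of its effect, compiling the bytes of a hypothetical ISO-8859-1 copy of moose.py from memory rather than from disk. It is written for a current Python 3, where the default source encoding without a declaration is UTF-8 rather than ASCII.

```python
# The file's bytes as they would sit on disk in ISO-8859-1: 0xF8 is "ø".
# The second line is the PEP 263 encoding declaration.
source = (
    b"#!/usr/bin/python\n"
    b"# -*- coding: iso-8859-1 -*-\n"
    b"text = u'm\xf8\xf8se'\n"
)

# compile() honours the declaration when it is given bytes.
namespace = {}
exec(compile(source, "moose.py", "exec"), namespace)
text = namespace["text"]

# Code points, so the output is plain ASCII.
print([hex(ord(c)) for c in text])
# ['0x6d', '0xf8', '0xf8', '0x73', '0x65']
```

With the declaration in place the result no longer depends on anybody's LANG, which is the argument for letting the source file specify its own encoding.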