More Python character encoding braindamage
rjk@rackheath:~$ python -c 'print "møøse"'
møøse
rjk@rackheath:~$ python -c 'print u"møøse"'
møøse
rjk@rackheath:~$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
rjk@rackheath:~$ python -V
Python 2.5.2
(...but it's the same in 2.6.x.)
I think it's interpreting each byte in the (actually UTF-8 [1]) input string as a Unicode code point in its own right, though another (philosophically different but in practice essentially identical) possibility is that it thinks all input strings are encoded using ISO-8859-1.
[1] And don't give me the nonsense someone came up with last time about Python having no way to know how the input is encoded. It does have a way: that's what LC_CTYPE is for.
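The two readings above can be compared directly. This is a minimal sketch written for Python 3 (not the 2.5 interpreter in the transcript), assuming the terminal handed over UTF-8 bytes; the variable names are purely illustrative.

```python
import locale

# The bytes a UTF-8 terminal would supply for "møøse".
raw = "m\u00f8\u00f8se".encode("utf-8")

# Reading 1: treat every byte as a Unicode code point in its own right.
per_byte = "".join(chr(b) for b in raw)

# Reading 2: assume the input is encoded as ISO-8859-1.
as_latin1 = raw.decode("iso-8859-1")

# Both readings produce the same (wrong) seven-character string.
assert per_byte == as_latin1
print(ascii(as_latin1))

# Decoding with the encoding actually in use recovers the five-character text.
decoded = raw.decode("utf-8")
print(ascii(decoded))

# The encoding implied by the locale (LC_CTYPE) can be queried like this.
print(locale.getpreferredencoding(False))
```

ISO-8859-1 maps each byte value to the code point with the same number, which is why the two explanations cannot be told apart from the output alone.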
no subject
But that doesn't really explain this:
no subject
I can't tell what encoding your terminal used, so it's hard to interpret the first result with certainty, although it seems to have done what one would have hoped for.
The second result seems clear, though: you explicitly asked for ASCII only and tried to print something non-ASCII.
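The kind of failure described here can be reproduced without touching the locale. A small sketch in Python 3, assuming the output encoding in question was ASCII as the comment says; `try_encode` is a hypothetical helper, not something from the thread.

```python
text = "m\u00f8\u00f8se"  # "møøse"


def try_encode(s, encoding):
    """Return the encoded bytes, or a short failure message if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason


# UTF-8 can represent the text; ASCII cannot represent U+00F8.
print(try_encode(text, "utf-8"))
print(try_encode(text, "ascii"))
```

The error arises when the text is encoded for output, not when the source string is read.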
no subject
My second example was misguided. It was meant to demonstrate that Python does sometimes obey the value of LANG. However, it turns out to be complaining about the output, not the input. In this case it looks like Python is defaulting to a source encoding of ISO-8859-1.
However, that only applies to programs executed via "-c"; the default for source read from a file seems to be ASCII (unless the file starts with a UTF-8 byte-order mark).
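How a source file's encoding gets chosen can be inspected with the standard `tokenize` module. This sketch is for Python 3, where the default for undeclared source is UTF-8 rather than the ASCII default of the Python 2 series discussed here; the sample byte strings and the `source_encoding` helper are made up for illustration.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding the tokenizer would pick for this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Source with an explicit encoding declaration comment on the first line.
declared = b"# -*- coding: iso-8859-1 -*-\nprint('m\xf8\xf8se')\n"
# Source that starts with a UTF-8 byte-order mark and has no declaration.
with_bom = b"\xef\xbb\xbfprint('m\xc3\xb8\xc3\xb8se')\n"
# Source with neither a declaration nor a byte-order mark.
plain = b"print('moose')\n"

print(source_encoding(declared))
print(source_encoding(with_bom))
print(source_encoding(plain))
```

None of the three cases consults the locale: the decision is made from the file's own bytes.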
no subject
Source files do need some means other than locale, yes, and PEP 0263 doesn't look too bad. But -c should obviously honor LC_CTYPE.
You'd see that it stops honoring LC_CTYPE for output if you redirected output anywhere.
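One way to check this is to have the interpreter report what it thinks of its own standard output. A minimal sketch, with `describe_stdout` as a hypothetical helper; the report goes to stderr so it stays visible when stdout is piped or redirected.

```python
import sys


def describe_stdout():
    """Report whether stdout is a terminal and which encoding it will use."""
    is_tty = sys.stdout.isatty()
    encoding = getattr(sys.stdout, "encoding", None)
    return is_tty, encoding


is_tty, encoding = describe_stdout()
# Write to stderr so the report survives redirection of stdout.
sys.stderr.write("stdout is a tty: %s, encoding: %s\n" % (is_tty, encoding))
```

Running it once directly and once with output piped through `cat` shows the difference. On the Python 2 series the encoding is typically reported as None once stdout is no longer a terminal, which is consistent with the behaviour described above; Python 3 picks an encoding in both cases.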