[personal profile] ewx
rjk@rackheath:~$ python -c 'print "møøse"'
møøse
rjk@rackheath:~$ python -c 'print u"møøse"'
møøse
rjk@rackheath:~$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
rjk@rackheath:~$ python -V
Python 2.5.2

(...but it's the same in 2.6.x.)

I think it's interpreting each byte in the (actually UTF-8 [1]) input string as a Unicode code point in its own right, though another (philosophically different but pragmatically essentially identical) possibility is that it thinks all input strings are encoded using ISO-8859-1.

[1] And don't give me the nonsense someone came up with last time about Python having no way to know how the input is encoded. It does have a way: that's what LC_CTYPE is for.
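
The byte-per-code-point hypothesis can be reproduced in isolation. What follows is a minimal sketch, assuming Python 3 (where str is always Unicode) rather than the 2.5 interpreter in the transcript; the variable names are illustrative only.

```python
# Sketch (Python 3): UTF-8 bytes treated as if each byte were a Unicode
# code point, which is the same thing as decoding them as ISO-8859-1.
text = "m\u00f8\u00f8se"              # the five-character string "møøse"
utf8_bytes = text.encode("utf-8")     # what a UTF-8 terminal hands over

# Each byte becomes one code point, so every two-byte ø turns into
# two separate characters (U+00C3 followed by U+00B8).
misread = utf8_bytes.decode("iso-8859-1")

print(ascii(misread))                 # 'm\xc3\xb8\xc3\xb8se'
print(len(text), len(misread))        # 5 7

# The damage is reversible as long as nothing else touched the string.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == text)               # True
```

Both readings of the hypothesis give the same result here, because ISO-8859-1 maps byte values 0x00 to 0xFF directly onto code points U+0000 to U+00FF.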

(no subject)

Date: 2009-06-18 01:26 pm (UTC)
From: [identity profile] geekette8.livejournal.com
Funnily, I was just reading http://en.wikipedia.org/wiki/Mojibake about 10 minutes ago.

(no subject)

Date: 2009-06-18 10:27 pm (UTC)
From: [identity profile] imc.livejournal.com
I can see some merit in the argument that the source file should specify its own encoding, because if I write my source file in ISO-8859-1 it shouldn't break when I send it to you just because you have LANG=en_GB.UTF-8.

But that doesn't really explain this:
$ LANG=en_GB.ISO8859-1 python -c 'print u"møøse"'
møøse
$ LANG=POSIX python -c 'print u"møøse"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2:
ordinal not in range(128)

(no subject)

Date: 2009-06-19 08:23 am (UTC)
From: [identity profile] ewx.livejournal.com

I can't tell what encoding your terminal used, so it's hard to interpret the first result with certainty, although it seems to have done what one would have hoped for.

The second result seems clear though: you explicitly asked for ASCII only and tried to print something non-ASCII.
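
The same failure can be triggered without involving the locale at all. This is a small sketch, assuming Python 3; it exercises only the ASCII codec, not the interpreter's locale handling (which differs between Python versions).

```python
# Sketch (Python 3): encoding non-ASCII text with the ASCII codec fails,
# which is what an ASCII-only output encoding amounts to.
text = "m\u00f8\u00f8se"   # "møøse"

try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The two ø characters sit at indices 1 and 2 (end is exclusive).
print(error.encoding, error.start, error.end)   # ascii 1 3

# An explicit error handler avoids the traceback at the cost of losing data.
print(text.encode("ascii", "replace"))          # b'm??se'
```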

(no subject)

Date: 2009-06-19 12:14 pm (UTC)
From: [identity profile] imc.livejournal.com
My terminal is iso-8859-1, so yes my first example worked.

My second example was misguided. It was meant to demonstrate that Python does sometimes obey the value of LANG. However, it turns out to be complaining about the output, not the input. In this case it looks like Python is defaulting to a source encoding of iso-8859-1.

However, that only applies for programs executed via "-c"; the default for source read from a file seems to be ASCII (unless the file starts with a UTF-8 byte-order mark).
$ cat moose.py
#!/usr/bin/python
print u"møøse"
$ LANG=en_GB.ISO8859-1 python moose.py
  File "moose.py", line 2
SyntaxError: Non-ASCII character '\xf8' in file moose.py on line 2, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for details
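
The declaration that the error message asks for is the PEP 263 coding line. The sketch below assumes Python 3 and compiles the source from a bytes literal instead of a real moose.py, so the ISO-8859-1 bytes are explicit; the namespace dictionary is just scaffolding for the demonstration.

```python
# Sketch (Python 3): a PEP 263 coding line tells the compiler how to decode
# the bytes of a source file, independently of LANG / LC_CTYPE.
source = (
    b"# -*- coding: iso-8859-1 -*-\n"
    b's = u"m\xf8\xf8se"\n'           # 0xF8 is the ISO-8859-1 byte for ø
)

namespace = {}
exec(compile(source, "moose.py", "exec"), namespace)

# The string literal was decoded using the declared encoding.
print(ascii(namespace["s"]))          # 'm\xf8\xf8se'
```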

(no subject)

Date: 2009-06-19 12:38 pm (UTC)
From: [identity profile] ewx.livejournal.com

Source files do need some means other than locale, yes, and PEP 0263 doesn't look too bad. But -c should obviously honor LC_CTYPE.

You'd see that it stops honoring LC_CTYPE for output if you redirected output anywhere.
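
One way to sidestep the redirect problem is to choose the output encoding explicitly instead of relying on whatever the stream reports. This is a sketch only, assuming Python 3; output_encoding and encode_for are hypothetical helper names, not part of Python, and falling back to the locale is just one possible policy.

```python
import locale
import sys


def output_encoding(stream, fallback="ascii"):
    """Pick an encoding for stream: its own, else the locale's, else fallback."""
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        # Ask the current locale (LC_CTYPE) which character set is in use.
        encoding = locale.getpreferredencoding(False)
    return encoding or fallback


def encode_for(stream, text):
    """Encode text explicitly rather than trusting the interpreter's default."""
    return text.encode(output_encoding(stream), "replace")


# What the helper picks for this process's standard output.
print(output_encoding(sys.stdout))
```

In Python 2, sys.stdout.encoding is None when output is not a terminal, which is the case the fallback branch is meant to cover.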
