ewx: (geek)
[personal profile] ewx
chymax$ python -V
Python 2.5.1
chymax$ python -c 'print u"\xA9";'
©
chymax$ python -c 'print u"\xA9";' >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
chymax$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="en_GB.utf-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C/en_GB.utf-8/C/C/C/C"

(no subject)

Date: 2007-12-15 09:47 pm (UTC)
From: [identity profile] covertmusic.livejournal.com
http://mail.python.org/pipermail/python-list/2006-May/384226.html

That corner of Python's changing in 3.0, I think, but it's all a complete mess.

(no subject)

Date: 2007-12-15 10:10 pm (UTC)
From: [identity profile] gareth-rees.livejournal.com
When stdout is a terminal, sys.stdout.encoding gets set using the locale but when stdout is a file, sys.stdout.encoding is None.

I agree that this is annoying, but I disagree that it's stupid.

The right thing would be for Python, when stdout is a file, to use the correct encoding for that file, whatever that is. But unfortunately there's no way for Python to find out: there's no standard operating system API or protocol for discovering the encoding of a file. In particular, there's no good reason to expect the encoding of the file to be the same as the encoding used by your terminal. So I agree with Python that the only sensible thing is to default the file encoding to None.
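A defensive pattern from that era (a sketch, not from the thread) is to take sys.stdout.encoding when it is set and fall back to the locale's preferred encoding otherwise:

```python
import locale
import sys

# In Python 2, sys.stdout.encoding is set from the locale only when
# stdout is a terminal; when output is redirected it is None, and the
# implicit encoding of unicode strings falls back to ASCII.
enc = getattr(sys.stdout, 'encoding', None) or locale.getpreferredencoding()
print('effective output encoding: %s' % enc)
```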

Note that you can write bytes to the file without error; it's only characters that need to be encoded:
$ python >/dev/null
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04) 
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print '\xa9' # string of bytes, writes successfully
>>> print u'\xa9' # string of characters, need to be encoded before writing
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

(no subject)

Date: 2007-12-15 10:31 pm (UTC)
From: [identity profile] sweh.livejournal.com
No, it's wrong and stupid, in a Unix context (other operating systems may differ).

In Unix the ENVIRONMENT defines the locale, not the output stream. Files don't have encodings; they're merely byte streams, and the application gets to decide how to interpret them. One of the beauties (and annoyances) of Unix: files are boring, with no distinction between binary and text, no locale information, no structured records... just bytes.

(no subject)

Date: 2007-12-15 10:58 pm (UTC)
From: [identity profile] gareth-rees.livejournal.com
You're right that in the Unix model files are just streams of bytes. That's why Python lets me write a string of bytes to stdout without error, as you can see in my example above.

But the Unix model only describes how files are stored and the low-level interfaces for accessing them, not how the file contents are used by applications. At the application level, files definitely do have encodings.

Suppose I have a string of characters. If I want to write it to a file, I have to encode it into a byte stream using the appropriate encoding, which depends on which other applications are going to need to read and decode it.

For example, suppose I have a string containing the character COPYRIGHT SIGN and I want to write it to a file. If the file needs to be encoded in ISO Latin-1, I should output the byte A9. If the file needs to be encoded in UTF-8, I should output the bytes C2 A9. If the file needs to be encoded in UTF-16LE, I should output the bytes A9 00.
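The three cases can be checked directly (a sketch in modern Python syntax; the u'' prefix matches the 2.x literals used in the thread):

```python
# COPYRIGHT SIGN is U+00A9; the byte sequence produced depends entirely
# on the target encoding, not on the character itself.
s = u'\xa9'

assert s.encode('latin-1') == b'\xa9'        # ISO Latin-1: one byte, A9
assert s.encode('utf-8') == b'\xc2\xa9'      # UTF-8: two bytes, C2 A9
assert s.encode('utf-16-le') == b'\xa9\x00'  # UTF-16LE: two bytes, A9 00
```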

As we've both said now, operating systems in common use don't have any protocol or metadata for specifying the encoding of a file. So there's no way for Python to do the right thing. So it's not stupid to default to None.

I think you already know all this, and you know that I know all this.

(no subject)

Date: 2007-12-15 11:05 pm (UTC)
From: [identity profile] sweh.livejournal.com
For specific file streams, sure; here the programmer should define what is needed.

For inherited implicit file streams (stdin/out/err at the very least; potentially others... basically any file handle already open when python starts) then the _environment_ is the only place that can define the encoding. Let the programmer override as necessary.

There is no reason why "blah" and "blah | cat" should result in different encodings.

(no subject)

Date: 2007-12-16 02:19 pm (UTC)
From: [identity profile] gareth-rees.livejournal.com
That blah and blah | cat should generate the same byte stream is a fine principle (though I wouldn’t be completely dogmatic about it; I think the behaviour of ls is useful). However, another important principle is that non-interactive execution like blah > file should not depend on which terminal you happen to run it from.

Python provides a simple way to adhere to both of these principles: don’t write character strings to file handles, only write byte strings.
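That principle can be followed by encoding once, explicitly, and writing only bytes (a sketch; in Python 3 the byte-level handle is sys.stdout.buffer, while in Python 2 sys.stdout itself accepts bytes):

```python
import sys

s = u'\xa9 example text'
data = s.encode('utf-8')  # choose the encoding explicitly, once

# Write bytes, never unicode, to the inherited handle; this behaves
# the same whether stdout is a terminal, a pipe, or a file.
out = getattr(sys.stdout, 'buffer', sys.stdout)
out.write(data + b'\n')
```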

(no subject)

Date: 2007-12-16 02:30 pm (UTC)
ext_8103: (Default)
From: [identity profile] ewx.livejournal.com
I don't know where you've got this idea that the encoding comes from the terminal; it does not. It comes from the environment, and not from an environment variable meaning “the local terminal's encoding” (as Python treats it); it comes from an environment variable meaning “the encoding to use everywhere” (i.e. implicitly “in the absence of additional file-specific configuration”).

(no subject)

Date: 2007-12-19 12:46 pm (UTC)
From: [identity profile] david jones (from livejournal.com)
I think sweh and ewx are on the money here. It is incorrect to say that Unix only regards files as streams of bytes. Sure, at the C programming level, yes. At the shell programming level (using the provided utilities), no. Mostly files are regarded as streams of characters encoded according to LC_CTYPE etc. For example, sort(1) sorts according to LC_CTYPE (for encoding) and LC_COLLATE (for collation order). sort(1) cannot be told to "see the bytes"; it only sees the characters.

Unix has a protocol for specifying the encoding of a file and it is generally to set the relevant environment variables.

Python is not a good Unix citizen in this regard.

(no subject)

Date: 2007-12-16 10:53 am (UTC)
ext_8103: (geek)
From: [identity profile] ewx.livejournal.com

I agree with Stephen; LC_CTYPE tells you what encoding to use. Every other UNIX tool (that needs to know) understands this, and it's the only sane approach if you ever want to use I/O redirection. (You really think that sticking |less on the end should cause the program to stop working?)

The current C specification takes the same view (7.19.3): text files use the current multibyte character encoding. Microsoft Windows also adopts this model (http://msdn2.microsoft.com/en-us/library/yeby3zcb(VS.71).aspx). It's Python that's out on a limb here.

If you have text that isn't in the current LC_CTYPE encoding then you have to do something extra, indeed. Whether you achieve that by more complicated programs or by pipelines involving iconv depends on what you're trying to do.

The Python approach breaks the overwhelmingly common case without making most of the obscure cases work. The only case I can see that it does make work is where you want to be sure that some output was ASCII even though you were using a non-ASCII locale and wanted your tool to enforce it. And, well, that's not the kind of thing that's difficult to check for, except that you'd get it right and make it independent of output device, wouldn't you?
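Enforcing that ASCII-only case explicitly takes one line (a sketch; ensure_ascii is a hypothetical helper, not from the thread):

```python
def ensure_ascii(s):
    # Raises UnicodeEncodeError for any non-ASCII character, regardless
    # of locale or of whether stdout is a terminal.
    return s.encode('ascii')

assert ensure_ascii(u'plain text') == b'plain text'
```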

(no subject)

Date: 2007-12-15 10:13 pm (UTC)
From: [identity profile] gareth-rees.livejournal.com
A similar complaint (http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/) from [livejournal.com profile] drj11.

(no subject)

Date: 2007-12-18 08:29 am (UTC)
cjwatson: (Default)
From: [personal profile] cjwatson
Pain. In. The. Arse. If sys.stdout.encoding were at least writable then I wouldn't mind so much.

(no subject)

Date: 2007-12-18 10:20 am (UTC)
cjwatson: (Default)
From: [personal profile] cjwatson

This seems to restore things to sanity, though I'm not convinced I've got the decode nonsense right (however, it works for me):

import sys

# Avoid having to do .encode('UTF-8') everywhere. This is a pain; I wish
# Python supported something like "sys.stdout.encoding = 'UTF-8'".
def fix_stdout():
    import codecs
    sys.stdout = codecs.EncodedFile(sys.stdout, 'UTF-8')
    def null_decode(input, errors='strict'):
        return input, len(input)
    sys.stdout.decode = null_decode

fix_stdout()

(no subject)

Date: 2007-12-18 10:24 am (UTC)
ext_8103: (geek)
From: [identity profile] ewx.livejournal.com
A step in the right direction, but doesn't support non-UTF-8 locales :-(

(no subject)

Date: 2007-12-18 10:42 am (UTC)
cjwatson: (Default)
From: [personal profile] cjwatson
Should be sufficient to replace 'UTF-8' with locale.getpreferredencoding() (not tested).
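Put together, a locale-aware version of the wrapper might use codecs.getwriter (a sketch demonstrated against an in-memory byte stream, with UTF-8 fixed so the resulting bytes are predictable; substitute sys.stdout and locale.getpreferredencoding() in real use):

```python
import codecs
import io
import locale

# The encoding the environment asks for (derived from LC_CTYPE etc.).
enc = locale.getpreferredencoding()

# Wrap a byte stream so that unicode written to it is encoded on the way
# through; here the encoding is pinned to UTF-8 for a predictable result.
buf = io.BytesIO()
writer = codecs.getwriter('UTF-8')(buf)
writer.write(u'\xa9\n')
assert buf.getvalue() == b'\xc2\xa9\n'
```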
