Stupid Python
Dec. 15th, 2007 04:41 pm
```
chymax$ python -V
Python 2.5.1
chymax$ python -c 'print u"\xA9";'
©
chymax$ python -c 'print u"\xA9";' >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
chymax$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="en_GB.utf-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C/en_GB.utf-8/C/C/C/C"
```
(no subject)
Date: 2007-12-15 09:47 pm (UTC)
That corner of Python's changing in 3.0, I think, but it's all a complete mess.
(no subject)
Date: 2007-12-15 10:10 pm (UTC)
`sys.stdout.encoding` gets set using the locale, but when stdout is a file, `sys.stdout.encoding` is `None`.

I agree that this is annoying, but I disagree that it's stupid.

The right thing to do is that when stdout is a file, Python should use the correct encoding for that file, whatever that is. But unfortunately there's no way for Python to find out: there's no standard operating system API or protocol for discovering the encoding. In particular, there's no good reason to expect the encoding of the file to be the same as the encoding used by your terminal. So I agree with Python that the only sensible thing to do is to default to `None` for a file encoding.

Note that you can write bytes to the file without error; it's only characters that need to be encoded:
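(A minimal sketch of that distinction, standing in for the comment's lost example; assumes Python 2.5 with stdout redirected to a file:)

```python
import sys

# Byte strings pass straight through -- no encoding step, no error:
sys.stdout.write('\xc2\xa9\n')                    # raw UTF-8 bytes for U+00A9

# Explicitly encoded character strings work too:
sys.stdout.write(u'\xa9'.encode('utf-8') + '\n')

# But an unencoded character string forces an implicit encode; with
# sys.stdout.encoding == None, Python falls back to ASCII and raises
# UnicodeEncodeError:
sys.stdout.write(u'\xa9\n')
```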
(no subject)
Date: 2007-12-15 10:31 pm (UTC)
In Unix the ENVIRONMENT defines the locale, not the output stream. Files don't have encodings; they're merely byte streams, and the application gets to decide how it wants to interpret them. One of the beauties and hates of Unix: files are boring, with no distinction between binary and text, no locale information, no structured records... just bytes.
(no subject)
Date: 2007-12-15 10:58 pm (UTC)
But the Unix model only describes how files are stored and the low-level interfaces for accessing them, not how the file contents are used by applications. At the application level, files definitely do have encodings.

So suppose I have a string of characters. If I want to write this data to a file, I have to encode it into a byte stream in the appropriate encoding, depending on which other applications are going to need to read and decode it.

For example, suppose I have a string containing the character COPYRIGHT SIGN and I want to write it to a file. If the file needs to be encoded in ISO Latin-1, I should output the byte `A9`. If the file needs to be encoded in UTF-8, I should output the bytes `C2 A9`. If the file needs to be encoded in UTF-16LE, I should output the bytes `A9 00`.

As we've both said now, operating systems in common use don't have any protocol or metadata for specifying the encoding of a file. So there's no way for Python to do the right thing, and it's not stupid to default to `None`.

I think you already know all this, and you know that I know all this.
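(Those three byte sequences are easy to verify; a quick Python 2 check, not part of the original comment:)

```python
# Encoding U+00A9 COPYRIGHT SIGN under the three encodings above.
c = u'\xa9'
print repr(c.encode('latin-1'))    # '\xa9'       -> A9
print repr(c.encode('utf-8'))      # '\xc2\xa9'   -> C2 A9
print repr(c.encode('utf-16-le'))  # '\xa9\x00'   -> A9 00
```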
(no subject)
Date: 2007-12-15 11:05 pm (UTC)
For inherited implicit file streams (stdin/out/err at the very least; potentially others... basically any file handle already open when Python starts), the _environment_ is the only place that can define the encoding. Let the programmer override as necessary.

There is no reason why `blah` and `blah | cat` should result in different encodings.
(no subject)
Date: 2007-12-16 02:19 pm (UTC)
That `blah` and `blah | cat` should generate the same byte stream is a fine principle (though I wouldn't be completely dogmatic about it; I think the behaviour of `ls` is useful). However, another important principle is that non-interactive execution like `blah > file` should not depend on which terminal you happen to run it from.

Python provides a simple way to adhere to both of these principles: don't write character strings to file handles, only write byte strings.
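(In practice that means encoding explicitly before writing; a sketch, assuming the file's consumers expect UTF-8:)

```python
import sys

text = u'\xa9 2007\n'                   # character string
sys.stdout.write(text.encode('utf-8'))  # byte string: identical bytes whether
                                        # stdout is a terminal, a pipe or a file
```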
(no subject)
Date: 2007-12-16 02:30 pm (UTC)

(no subject)
Date: 2007-12-19 12:46 pm (UTC)
Unix has a protocol for specifying the encoding of a file, and it is generally to set the relevant environment variables.
Python is not a good Unix citizen in this regard.
(no subject)
Date: 2007-12-16 10:53 am (UTC)
I agree with Stephen; LC_CTYPE tells you what encoding to use. Every other UNIX tool (that needs to know) understands this, and it's the only sane approach if you ever want to use I/O redirection. (You really think that sticking `| less` on the end should cause the program to stop working?)
The current C specification takes the same view (7.19.3): text files use the current multibyte character encoding. Microsoft Windows also adopts this model (http://msdn2.microsoft.com/en-us/library/yeby3zcb(VS.71).aspx). It's Python that's out on a limb here.
If you have text that isn't in the current LC_CTYPE encoding then you have to do something extra, indeed. Whether you achieve that by more complicated programs or by pipelines involving iconv depends on what you're trying to do.
The Python approach breaks the overwhelmingly common case without making most of the obscure cases work. The only case I can see that it does make work is where you want to be sure that some output was ASCII even though you were using a non-ASCII locale and wanted your tool to enforce it. And, well, that's not the kind of thing that's difficult to check for, except that you'd get it right and make it independent of output device, wouldn't you?
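(For what it's worth, Python can already read the locale's answer; it just doesn't apply it to redirected streams. A Python 2 sketch:)

```python
import locale

# Adopt the environment's locale, then ask what encoding it implies.
locale.setlocale(locale.LC_ALL, '')
print locale.getpreferredencoding()  # e.g. 'UTF-8' under LC_CTYPE=en_GB.utf-8
```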
(no subject)
Date: 2007-12-15 10:13 pm (UTC)

(no subject)
Date: 2007-12-18 08:29 am (UTC)
If `sys.stdout.encoding` were at least writable then I wouldn't mind so much.

(no subject)
Date: 2007-12-18 10:20 am (UTC)
This seems to restore things to sanity, though I'm not convinced I've got the decode nonsense right (however, it works for me):
(no subject)
Date: 2007-12-18 10:24 am (UTC)

(no subject)
Date: 2007-12-18 10:42 am (UTC)
Replace `'UTF-8'` with `locale.getpreferredencoding()` (not tested).