Stupid Python
Dec. 15th, 2007 04:41 pm
chymax$ python -V
Python 2.5.1
chymax$ python -c 'print u"\xA9";'
©
chymax$ python -c 'print u"\xA9";' >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
chymax$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="en_GB.utf-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C/en_GB.utf-8/C/C/C/C"
(no subject)
Date: 2007-12-15 10:31 pm (UTC)
In Unix the ENVIRONMENT defines the locale, not the output stream. Files don't have encodings. They're merely byte streams. The application gets to decide how it wants to interpret them. One of the beauties and hates of Unix; files are boring with no distinction between binary or text, no locale information, no structured records... just bytes.
(no subject)
Date: 2007-12-15 10:58 pm (UTC)
But the Unix model only describes how files are stored and the low-level interfaces to access them, not how the file contents are used by applications. At the application level, files definitely do have encodings.
So supposing I have a string of characters. If I want to write this data to a file, I have to encode it into a byte stream in the appropriate encoding, depending on which other applications are going to need to read and decode it.
For example, suppose I have a string containing the character COPYRIGHT SIGN and I want to write it to a file. If the file needs to be encoded in ISO Latin-1, I should output the byte A9. If the file needs to be encoded in UTF-8, I should output the bytes C2 A9. If the file needs to be encoded in UTF-16LE, I should output the bytes A9 00.
As we've both said now, operating systems in common use don't have any protocol or metadata for specifying the encoding of a file. So there's no way for Python to do the right thing. So it's not stupid to default to None.
I think you already know all this, and you know that I know all this.
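The three encodings of COPYRIGHT SIGN mentioned above can be checked directly. This is a minimal sketch; the helper name encoded_hex is made up for illustration and is not part of any library.

```python
import binascii

# COPYRIGHT SIGN, the character from the original post.
COPYRIGHT_SIGN = u"\u00a9"


def encoded_hex(text, encoding):
    """Return the bytes of `text` in `encoding` as upper-case hex pairs.

    Hypothetical helper, written only for this illustration.
    """
    raw = text.encode(encoding)
    hex_digits = binascii.hexlify(raw).decode("ascii").upper()
    return " ".join(hex_digits[i:i + 2] for i in range(0, len(hex_digits), 2))


if __name__ == "__main__":
    for name in ("latin-1", "utf-8", "utf-16-le"):
        print("%s: %s" % (name, encoded_hex(COPYRIGHT_SIGN, name)))
```

The same character string yields three different byte streams, which is why the writer has to choose an encoding before anything reaches the file.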
(no subject)
Date: 2007-12-15 11:05 pm (UTC)
For inherited implicit file streams (stdin/out/err at the very least; potentially others... basically any file handle already open when python starts) then the _environment_ is the only place that can define the encoding. Let the programmer override as necessary.
There is no reason why "blah" and "blah | cat" should result in different encodings.
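A rough sketch of what "let the environment decide, let the programmer override" could look like, using the standard locale module. The function name environment_encoding and its parameters are assumptions made for illustration, and this is not how Python 2.5 actually picks the encoding for stdout.

```python
import codecs
import locale


def environment_encoding(override=None, default="ascii"):
    """Pick an output encoding from LC_ALL / LC_CTYPE / LANG.

    Hypothetical helper: `override` wins if given, otherwise the
    locale environment is consulted, otherwise `default` is used.
    """
    if override is not None:
        return override
    try:
        # Adopt the locale described by the environment variables.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not know.
        return default
    name = locale.getpreferredencoding(False) or default
    try:
        codecs.lookup(name)
    except LookupError:
        # Python has no codec under that name.
        return default
    return name


if __name__ == "__main__":
    print(environment_encoding())
```

Because the result depends only on the environment, it is the same whether stdout is a terminal, a pipe, or a file.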
(no subject)
Date: 2007-12-16 02:19 pm (UTC)
That blah and blah | cat should generate the same byte stream is a fine principle (though I wouldn’t be completely dogmatic about it; I think the behaviour of ls is useful). However, another important principle is that non-interactive execution like blah > file should not depend on which terminal you happen to run it from.
Python provides a simple way to adhere to both of these principles: don’t write character strings to file handles, only write byte strings.
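A minimal sketch of the "only write byte strings" approach. The helper write_encoded is hypothetical, UTF-8 is just an example choice of encoding, and the getattr fallback is there so the same code can run on Python 2 (where sys.stdout accepts byte strings) and on later versions that expose a separate sys.stdout.buffer.

```python
import sys


def write_encoded(text, stream=None, encoding="utf-8"):
    """Encode `text` explicitly and write the resulting bytes.

    Hypothetical helper: the caller, not the terminal, chooses the
    encoding, so the output is the same for a tty, a pipe, or a file.
    """
    if stream is None:
        # Use the underlying byte stream where one exists.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    data = text.encode(encoding)
    stream.write(data)
    return data


if __name__ == "__main__":
    write_encoded(u"\u00a9\n")
```

Written this way, redirecting to /dev/null or piping through cat produces the same bytes as writing to the terminal, since no implicit encoding step is left for Python to guess at.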
(no subject)
Date: 2007-12-16 02:30 pm (UTC)
(no subject)
Date: 2007-12-19 12:46 pm (UTC)
Unix has a protocol for specifying the encoding of a file and it is generally to set the relevant environment variables.
Python is not a good Unix citizen in this regard.