Stupid Python
chymax$ python -V
Python 2.5.1
chymax$ python -c 'print u"\xA9";'
©
chymax$ python -c 'print u"\xA9";' >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
chymax$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="en_GB.utf-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C/en_GB.utf-8/C/C/C/C"
no subject
But the Unix model only describes how files are stored and the low-level interfaces for accessing them, not how file contents are used by applications. At the application level, files definitely do have encodings.
So supposing I have a string of characters. If I want to write this data to a file, I have to encode it into a byte stream in the appropriate encoding, depending on which other applications are going to need to read and decode it.
For example, suppose I have a string containing the character COPYRIGHT SIGN and I want to write it to a file. If the file needs to be encoded in ISO Latin-1, I should output the byte A9. If the file needs to be encoded in UTF-8, I should output the bytes C2 A9. If the file needs to be encoded in UTF-16LE, I should output the bytes A9 00.
As we've both said now, operating systems in common use don't have any protocol or metadata for specifying the encoding of a file. So there's no way for Python to do the right thing. So it's not stupid to default to None.
I think you already know all this, and you know that I know all this.
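To make the byte sequences above concrete, here is a small sketch that encodes COPYRIGHT SIGN in the three encodings mentioned. The helper name hex_bytes is made up for illustration, and the dictionary comprehension assumes a Python newer than the 2.5.1 in the transcript (2.7 or 3.x).

```python
# COPYRIGHT SIGN as a character string (U+00A9).
text = u"\u00a9"

def hex_bytes(data):
    """Format a byte string as space-separated uppercase hex pairs."""
    return " ".join("%02X" % b for b in bytearray(data))

# Encode the same character in each encoding discussed above.
encoded = {name: hex_bytes(text.encode(name))
           for name in ("latin-1", "utf-8", "utf-16-le")}

for name in ("latin-1", "utf-8", "utf-16-le"):
    print("%-10s %s" % (name, encoded[name]))
```

One character, three different byte streams: which one is correct depends entirely on what the reader of the file expects.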
no subject
For inherited implicit file streams (stdin/out/err at the very least; potentially others... basically any file handle already open when Python starts), the _environment_ is the only place that can define the encoding. Let the programmer override as necessary.
There is no reason why "blah" and "blah | cat" should result in different encodings.
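A quick way to see the difference for yourself is to have the program report what it thinks its stdout is. This is only a sketch: describe_stdout is a hypothetical helper name, and the report goes to stderr so it stays visible when stdout is piped or redirected. On the Python 2.5.1 from the transcript the encoding is the part that changes between "blah" and "blah | cat"; other versions may well behave differently.

```python
import sys

def describe_stdout():
    """Return (is_tty, encoding) for the inherited standard output stream."""
    is_tty = sys.stdout.isatty()
    # Some replacement stream objects have no 'encoding' attribute at all.
    encoding = getattr(sys.stdout, "encoding", None)
    return is_tty, encoding

if __name__ == "__main__":
    is_tty, encoding = describe_stdout()
    # Report on stderr so the message survives "| cat" and "> file".
    sys.stderr.write("stdout is a tty: %s, encoding: %r\n" % (is_tty, encoding))
```

Run it once directly and once piped through cat, and compare the two lines on stderr.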
no subject
That "blah" and "blah | cat" should generate the same byte stream is a fine principle (though I wouldn’t be completely dogmatic about it; I think the behaviour of ls is useful). However, another important principle is that non-interactive execution like "blah > file" should not depend on which terminal you happen to run it from.
Python provides a simple way to adhere to both of these principles: don’t write character strings to file handles, only write byte strings.
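As a rough illustration of "only write byte strings", the sketch below encodes explicitly before anything reaches the file handle, so the output cannot depend on the terminal or the redirection. write_encoded is a hypothetical helper, not anything from the standard library, and an in-memory io.BytesIO stands in for a real file.

```python
import io

def write_encoded(stream, text, encoding):
    """Encode text explicitly, then write the resulting bytes to a binary stream."""
    stream.write(text.encode(encoding))

# An in-memory binary stream stands in for a file opened in binary mode.
buffer = io.BytesIO()
write_encoded(buffer, u"\u00a9", "utf-8")
data = buffer.getvalue()  # the two UTF-8 bytes C2 A9
```

For standard output, the same helper could be handed sys.stdout.buffer on Python 3, or sys.stdout itself on Python 2, where it already accepts byte strings. The cost is that the program has to choose the encoding itself.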
no subject
Unix does have a protocol for specifying the encoding of a file: by convention, you set the relevant environment variables (LANG, LC_CTYPE, LC_ALL).
Python is not a good Unix citizen in this regard.
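For what it's worth, a program can ask the environment which encoding it advertises. A minimal sketch, assuming the standard locale module is an acceptable way to read those variables; environment_encoding is a made-up name, and the result depends on the platform and on how LANG, LC_CTYPE and LC_ALL are set.

```python
import locale

def environment_encoding():
    """Return the text encoding implied by the user's locale settings."""
    # getpreferredencoding() consults the locale configured via the
    # environment (LC_ALL, LC_CTYPE, LANG) on Unix-like systems.
    return locale.getpreferredencoding()

if __name__ == "__main__":
    print(environment_encoding())
```

A program that wanted to follow the environment could use this value to encode its output explicitly, whether or not stdout is a terminal.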