ewx | Stupid Python


chymax$ python -V
Python 2.5.1
chymax$ python -c 'print u"\xA9";'
©
chymax$ python -c 'print u"\xA9";' >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
chymax$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="en_GB.utf-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C/en_GB.utf-8/C/C/C/C"


From:

sweh.livejournal.com

No, it's wrong and stupid, in a Unix context (other operating systems may differ).

In Unix the ENVIRONMENT defines the locale, not the output stream. Files don't have encodings. They're merely byte streams. The application gets to decide how it wants to interpret them. One of the beauties and hates of Unix; files are boring with no distinction between binary and text, no locale information, no structured records... just bytes.

From:

gareth-rees.livejournal.com

You're right that in the Unix model files are just streams of bytes. That's why Python lets me write a string of bytes to stdout without error, as you can see in my example above.

But the Unix model only describes how files are stored and the low-level interfaces for accessing them, not how the file contents are used by applications. At the application level, files definitely do have encodings.

So supposing I have a string of characters. If I want to write this data to a file, I have to encode it into a byte stream in the appropriate encoding, depending on which other applications are going to need to read and decode it.

For example, suppose I have a string containing the character COPYRIGHT SIGN and I want to write it to a file. If the file needs to be encoded in ISO Latin-1, I should output the byte A9. If the file needs to be encoded in UTF-8, I should output the bytes C2 A9. If the file needs to be encoded in UTF-16LE, I should output the bytes A9 00.
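As a quick check of those byte sequences, here is a minimal sketch in Python 3 syntax (the post itself uses Python 2.5; the variable names are only for illustration):

```python
# COPYRIGHT SIGN is U+00A9; the same character becomes different bytes
# depending on the encoding chosen for the file.
copyright_sign = "\u00a9"

latin1_bytes = copyright_sign.encode("latin-1")
utf8_bytes = copyright_sign.encode("utf-8")
utf16le_bytes = copyright_sign.encode("utf-16-le")

for name, data in [
    ("ISO Latin-1", latin1_bytes),
    ("UTF-8", utf8_bytes),
    ("UTF-16LE", utf16le_bytes),
]:
    # bytes.hex() shows the raw byte values as hexadecimal digits
    print(name, data.hex())
```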

As we've both said now, operating systems in common use don't have any protocol or metadata for specifying the encoding of a file. So there's no way for Python to do the right thing. So it's not stupid to default to None.
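For what it's worth, the traceback in the post is what that default looks like in practice: with no stream encoding set, Python 2 falls back to the 'ascii' codec. A rough sketch of the same failure in Python 3 syntax, where `encode_with_ascii_fallback` is a made-up helper name and not anything Python provides:

```python
def encode_with_ascii_fallback(text):
    """Hypothetical helper: encode text the way the traceback in the post
    suggests Python 2 does when the stream has no encoding, i.e. as ASCII."""
    return text.encode("ascii")


# Plain ASCII text passes through unchanged.
plain = encode_with_ascii_fallback("plain text")

# COPYRIGHT SIGN (U+00A9) is outside the ASCII range, so encoding fails.
try:
    encode_with_ascii_fallback("\u00a9")
    failed = False
except UnicodeEncodeError:
    failed = True

print(plain, failed)
```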

I think you already know all this, and you know that I know all this.

From:

sweh.livejournal.com

For specific file streams, sure; here the programmer should define what is needed.

For inherited implicit file streams (stdin/out/err at the very least; potentially others... basically any file handle already open when Python starts) then the _environment_ is the only place that can define the encoding. Let the programmer override as necessary.

There is no reason why "blah" and "blah | cat" should result in different encodings.
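To make the "environment defines the encoding" idea concrete, here is a simplified sketch. The function `encoding_from_environment` is hypothetical, and it assumes the usual LC_ALL, then LC_CTYPE, then LANG precedence, with locale names of the form language[_territory][.codeset][@modifier]. Treating a locale with no codeset as ASCII is a simplification made for this example, not what every C library does.

```python
def encoding_from_environment(env):
    """Hypothetical helper: pick an encoding name from locale variables only,
    without looking at whether any stream is a terminal."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:  # empty values are treated as unset
            break
    else:
        return "ascii"  # nothing set: assume the "C" locale
    # Drop any @modifier, then take the codeset after the dot if present.
    value = value.split("@", 1)[0]
    if "." in value:
        return value.split(".", 1)[1].lower()
    return "ascii"  # simplification: no codeset given


# The settings shown in the post: LANG empty, LC_CTYPE set.
print(encoding_from_environment({"LANG": "", "LC_CTYPE": "en_GB.utf-8"}))
```

The result depends only on the environment passed in, so "blah" and "blah | cat" would get the same answer. In real code, the standard library's `locale.getpreferredencoding()` is probably a better starting point than parsing the variables by hand.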

From:

gareth-rees.livejournal.com

That blah and blah | cat should generate the same byte stream is a fine principle (though I wouldn’t be completely dogmatic about it; I think the behaviour of ls is useful). However, another important principle is that non-interactive execution like blah > file should not depend on which terminal you happen to run it from.

Python provides a simple way to adhere to both of these principles: don’t write character strings to file handles, only write byte strings.
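A minimal sketch of that approach, in Python 3 syntax. The helper name `write_text_as_bytes` and the UTF-8 default are assumptions for the example; the point is only that the program picks the encoding explicitly and the stream never sees a character string.

```python
import io


def write_text_as_bytes(stream, text, encoding="utf-8"):
    """Hypothetical helper: encode explicitly, then write only bytes."""
    stream.write(text.encode(encoding))


# A BytesIO stands in for a byte-oriented file handle here.
buffer = io.BytesIO()
write_text_as_bytes(buffer, "\u00a9\n")
print(buffer.getvalue())
```

For standard output, the byte-oriented handle in Python 3 is `sys.stdout.buffer`; in Python 2, `sys.stdout` itself accepts byte strings.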

From:

ewx.livejournal.com

I don't know where you've got this idea from that the encoding comes from the terminal; it does not. It comes from the environment, and it does not come from an environment variable meaning “the local terminal's encoding” (as Python treats it); it comes from an environment variable meaning “the encoding to use everywhere” (i.e. implicitly “in the absence of additional file-specific configuration”).

From:

david jones (from livejournal.com)

I think sweh and ewx are on the money here. It is incorrect to say that Unix only regards files as a stream of bytes. At the C programming level, yes; at the shell programming level (using the provided utilities), no. Mostly files are regarded as streams of characters encoded according to LC_CTYPE etc. For example sort(1) sorts according to LC_CTYPE (for encoding) and LC_COLLATE (for collation order). sort(1) cannot be told to "see the bytes", it only sees the characters.
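A small illustration of why a character-oriented tool needs to know the encoding: the same two bytes are one character under UTF-8 and two under ISO Latin-1. This is only a sketch in Python 3 syntax, not a model of how sort(1) is implemented.

```python
data = b"\xc2\xa9"

# Decoded as UTF-8, the two bytes form a single COPYRIGHT SIGN.
as_utf8 = data.decode("utf-8")

# Decoded as ISO Latin-1, each byte is its own character.
as_latin1 = data.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```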

Unix has a protocol for specifying the encoding of a file and it is generally to set the relevant environment variables.

Python is not a good Unix citizen in this regard.