WTF?

Sep. 24th, 2006 09:29 am
[personal profile] ewx
mime-version: 1.0
content-type: text/plain; charset=gb2312
content-transfer-encoding: quoted-printable
Message-Id: <20060924082345.DB8133FFCB@zxtm3.redstardevelopment.com>
From: "Order Confirmation" <info@play.com>
To: 
subject: Confirmation of order received by play.com
date: 24 Sep 2006 09:23:56 +0100

GB2312? What's that then?

GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters. GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese.

(The text of the mail is in English, although their interpretation of quoted-printable is a bit individual...)

I suppose to be fair I should mention that Amazon's equivalent emails claim a charset of ascii but then include £.
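As a rough illustration of that mismatch, here is a minimal Python sketch that checks whether a body's bytes actually decode under the charset its header declares. The function name and the sample bytes are made up for the example; it assumes the pound sign was sent as the single ISO 8859-1 byte 0xA3, which is only a guess about what such a mail would contain.

```python
def body_matches_charset(body: bytes, charset: str) -> bool:
    """Return True if `body` decodes cleanly under the declared charset."""
    try:
        body.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical body: a pound sign sent as the ISO 8859-1 byte 0xA3.
sample = b"Order total: \xa3" + b"9.99"

print(body_matches_charset(sample, "ascii"))       # declared charset
print(body_matches_charset(sample, "iso-8859-1"))  # a charset that has the byte
```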

(no subject)

Date: 2006-09-24 09:07 am (UTC)
From: [personal profile] simont
Curiously, "GB2312" is actually a misnomer in this context. East Asian character set standards tend to divide rather sharply into character sets, which simply define a list of characters each indexed by an ordered pair of integers, and character encodings which specify exactly how to represent a sequence of those numeric indices within a stream of byte values. GB2312 is actually the name of a character set, but MIME headers really ought to be specifying an encoding in order to make it unambiguous how to decode the message body. As best I can determine, the notation "GB2312" in MIME headers is a historical accident, and what it really means to say is "EUC-CN", which is one of a family of East Asian EUC encodings. EUC-CN uses single low-half bytes to encode normal ASCII, and pairs of high-half bytes to encode Chinese characters from the GB2312 set.
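To make the set/encoding split concrete, here is a small Python sketch. It relies on Python's "gb2312" codec, which as far as I can tell implements EUC-CN (it also answers to the alias "euc-cn"). The helper name is invented for the example. For each character it shows the encoded bytes and, for two-byte characters, the (row, cell) index pair in the GB2312 set, obtained by subtracting 0xA0 from each byte.

```python
def euc_cn_breakdown(text):
    """Return (char, hex bytes, (row, cell) or None) for each character."""
    result = []
    for ch in text:
        raw = ch.encode("gb2312")  # Python's gb2312 codec emits EUC-CN bytes
        if len(raw) == 1:
            # Single low-half byte: plain ASCII, no GB2312 index.
            result.append((ch, raw.hex(), None))
        else:
            # Pair of high-half bytes: each is 0xA0 plus a GB2312 index.
            row, cell = raw[0] - 0xA0, raw[1] - 0xA0
            result.append((ch, raw.hex(), (row, cell)))
    return result


for entry in euc_cn_breakdown("A中"):
    print(entry)
```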

It's not totally implausible that someone might have a reason for using this character set in an English message. East Asian national character sets do tend to include a smattering of other parts of Unicode apart from what's needed for their own language; GB2312, for example, contains some ISO8859-1 accented characters, some Greek, some Cyrillic, maths symbols, box drawing characters, Japanese kana, and miscellaneous oddities. (I once had mail in KS X 1001 from somebody reporting a PuTTY bug; he got as far as giving his system specifications without departing from the ASCII subset, and then he said he had a Pentium <splodge>. When I pasted the splodge into my handy character-set decoder it translated it as U+2163 ROMAN NUMERAL FOUR :-) But I agree that seeing it from play.com seems odd, and seeing it specified gratuitously in a message which really does use nothing outside ASCII is a bit odd as well.
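One quick way to poke at those repertoires is to ask a codec whether it can encode a given character. The sketch below uses Python's "gb2312" and "euc_kr" codecs as stand-ins for GB2312 and KS X 1001; the helper name and the choice of sample characters are mine, and Python's mapping tables may not agree with every other implementation in every corner.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be represented in `codec`."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Sample characters from outside the CJK ideograph ranges.
samples = {
    "Greek": "α",
    "Cyrillic": "я",
    "kana": "あ",
    "box drawing": "─",
    "Roman numeral four": "\u2163",
}

for label, ch in samples.items():
    print(label, encodable(ch, "gb2312"), encodable(ch, "euc_kr"))
```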

(no subject)

Date: 2006-09-24 09:31 am (UTC)
From: [identity profile] ewx.livejournal.com
The misnomer is not limited to this case. In MIME-speak, "character set" means "character set plus character<->octet encoding", and "encoding" means "octet<->wire data encoding".
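The two layers can be peeled apart in turn with Python's standard email package. This is only a sketch: the headers are copied from the mail quoted in the post, but the body line is invented for the example (the real one was plain English), with two GB2312 characters written out in quoted-printable so that both steps do some visible work.

```python
import email

# Headers as in the post; the body line is a made-up example.
raw = (
    "mime-version: 1.0\n"
    "content-type: text/plain; charset=gb2312\n"
    "content-transfer-encoding: quoted-printable\n"
    "\n"
    "Order =D6=D0=B9=FA received\n"
)

msg = email.message_from_string(raw)

# Layer 1: undo the content-transfer-encoding (wire data -> octets).
octets = msg.get_payload(decode=True)

# Layer 2: apply the MIME "charset" (octets -> characters).
text = octets.decode(msg.get_content_charset())

print(octets)
print(text)
```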

(no subject)

Date: 2006-09-24 06:55 pm (UTC)
From: [personal profile] fanf
Don't forget that there are content-encodings, transfer-encodings, and content-transfer-encodings, depending on whether you are in the HTTP or SMTP worlds. Other profiles of MIME have similarly weird variations.
