日 Japanese

Japanese character sets

JIS C 6220-1969 (= JIS X 0201)

『7ビット及び8ビットの情報交換用符号化文字集合』 (7-bit and 8-bit coded character sets for information interchange)

Ref.: ISO-IR 013 The Japanese KATAKANA graphic set of characters / Jeu de caractères graphiques japonais KATAKANA.

This small character set contains only katakana (apart from a few Japanese punctuation marks), mapped to halfwidth characters in Unicode (PDF).
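The correspondence is regular enough to be written as a constant offset: the 63 eight-bit codes 0xA1–0xDF follow the same order as U+FF61–U+FF9F. A minimal Python sketch, in which the function name is made up for this illustration and only the 8-bit katakana half of the set is covered:

```python
# Sketch: map the 8-bit katakana half of JIS X 0201 to halfwidth Unicode.
# The function name is illustrative, not taken from any standard or library.
KATAKANA_FIRST = 0xA1
KATAKANA_LAST = 0xDF
OFFSET = 0xFF61 - KATAKANA_FIRST  # distance between the two code ranges


def jisx0201_katakana_to_unicode(byte: int) -> str:
    """Return the halfwidth Unicode character for one JIS X 0201 katakana byte."""
    if not KATAKANA_FIRST <= byte <= KATAKANA_LAST:
        raise ValueError(f"0x{byte:02X} is outside the katakana range")
    return chr(byte + OFFSET)


if __name__ == "__main__":
    for b in (0xA1, 0xB1, 0xDF):
        ch = jisx0201_katakana_to_unicode(b)
        print(f"0x{b:02X} -> U+{ord(ch):04X} {ch}")
```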

JIS X 0208-1990

『7ビット及び8ビットの2バイト情報交換用符号化漢字集合』 (7-bit and 8-bit double byte coded kanji sets for information interchange)

Ref.: ISO-IR 168 Japanese Graphic Character Set for Information Interchange.

Unihan J0 = Jis0 enumerates the 2,965 JIS Level 1 kanji (Chinese characters or 漢字) in Rows 16–47, the 3,390 JIS Level 2 kanji in Rows 48–84 and the single kanji 仝 in Row 1 (considered a symbol rather than a kanji by this Japanese standard), 6,356 kanji in total (PDF — kanji).

Apart from 仝 (mentioned above), Rows 1–8 consist of 523 non-kanji: alphabetic characters (Hiragana, Katakana, Latin, Greek and Cyrillic), miscellaneous symbols and line-drawing elements (PDF — non-kanji).
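Row and cell numbers can be recovered from any EUC-JP encoder, since EUC stores each of them as a byte offset by 0xA0. The sketch below leans on Python's euc_jp codec, which is an assumption of this example and not one of the implementations discussed on this page; kuten and classify are illustrative names.

```python
# Sketch: find the row/cell of a JIS X 0208 character and classify it
# by the row ranges given in the text. Relies on Python's euc_jp codec.

def kuten(ch: str) -> tuple[int, int]:
    """Return (row, cell) for a single JIS X 0208 character."""
    raw = ch.encode("euc_jp")
    # Two bytes, both in 0xA1-0xFE, means code set 1 (JIS X 0208).
    if len(raw) != 2 or raw[0] < 0xA1:
        raise ValueError(f"{ch!r} is not encoded as JIS X 0208")
    return raw[0] - 0xA0, raw[1] - 0xA0


def classify(ch: str) -> str:
    """Name the block a character belongs to, using the row ranges above."""
    row, _ = kuten(ch)
    if 16 <= row <= 47:
        return "Level 1 kanji"
    if 48 <= row <= 84:
        return "Level 2 kanji"
    if 1 <= row <= 8:
        return "non-kanji"
    return "other"


if __name__ == "__main__":
    for ch in "仝亜":
        print(ch, kuten(ch), classify(ch))
```

Note that 仝 is reported as non-kanji here simply because it sits in Row 1, which matches the view taken by the standard.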

Firefox has different Unicode mappings for seven characters:

Others       Firefox
― 2015       — 2014
～ FF5E      〜 301C
∥ 2225       ‖ 2016
－ FF0D      − 2212
￠ FFE0      ¢ A2
￡ FFE1      £ A3
￢ FFE2      ¬ AC
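A similar split can be observed outside browsers. As a hedged illustration, Python ships two Shift-JIS-style codecs, shift_jis and cp932, which are assumed here to follow the JIS and the Microsoft mapping tables respectively. The helper below (an illustrative name) reports the code point each codec assigns to a given byte pair; Python stands in for the browsers only because it is easy to run.

```python
# Sketch: compare the Unicode mappings of two Python codecs for a few
# JIS X 0208 positions, given as Shift-JIS byte pairs.
SAMPLES = {
    "wave dash / tilde": b"\x81\x60",
    "double vertical line": b"\x81\x61",
    "minus sign": b"\x81\x7c",
    "cent sign": b"\x81\x91",
    "pound sign": b"\x81\x92",
    "not sign": b"\x81\xca",
}


def mappings(raw: bytes) -> tuple[str, str]:
    """Return the hex code points produced by shift_jis and cp932."""
    return tuple(f"{ord(raw.decode(codec)):04X}" for codec in ("shift_jis", "cp932"))


if __name__ == "__main__":
    for name, raw in SAMPLES.items():
        jis, ms = mappings(raw)
        print(f"{name:22} shift_jis={jis} cp932={ms}")
```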

Common extensions

Unihan IBMJapan enumerates 360 kanji known as ‘IBM Selected Kanji’ encoded in Rows 115–119 (PDF — IBM kanji). There is also a set of 28 ‘IBM Selected Non-kanji’ encoded in Row 115 (PDF — IBM non-kanji). This extension is only available in the Shift-JIS encoding, since ISO-2022 and EUC cannot address rows beyond 94.
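The row limit follows from the byte arithmetic. EUC stores a row as the byte 0xA0 + row, which cannot exceed 0xFE, whereas the Shift-JIS lead bytes 0xF0–0xFC continue past Row 94. A sketch of the usual Shift-JIS to row/cell conversion, with function names chosen for this example:

```python
# Sketch: convert a Shift-JIS double-byte code to (row, cell) and check
# whether that row could be expressed in EUC or ISO-2022 (rows 1-94 only).

def sjis_to_kuten(lead: int, trail: int) -> tuple[int, int]:
    """Return (row, cell) for a Shift-JIS lead/trail byte pair."""
    if 0x81 <= lead <= 0x9F:
        pair = lead - 0x81          # each lead byte covers two rows
    elif 0xE0 <= lead <= 0xFC:
        pair = lead - 0xC1          # skips the 0xA0-0xDF single-byte range
    else:
        raise ValueError(f"0x{lead:02X} is not a lead byte")
    if not 0x40 <= trail <= 0xFC or trail == 0x7F:
        raise ValueError(f"0x{trail:02X} is not a trail byte")
    if trail >= 0x9F:               # second (even) row of the pair
        return 2 * pair + 2, trail - 0x9E
    # first (odd) row of the pair; 0x7F is skipped in the trail range
    cell = trail - 0x3F if trail < 0x7F else trail - 0x40
    return 2 * pair + 1, cell


def fits_in_euc(row: int) -> bool:
    """EUC encodes the row as 0xA0 + row, so only rows 1-94 fit."""
    return 1 <= row <= 94


if __name__ == "__main__":
    for lead, trail in ((0x88, 0x9F), (0xED, 0x40), (0xFA, 0x40)):
        row, cell = sjis_to_kuten(lead, trail)
        print(f"{lead:02X}{trail:02X} -> row {row}, cell {cell}, "
              f"EUC-representable: {fits_in_euc(row)}")
```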

Nippon Electric Company (NEC) has defined an alternative encoding of the IBM extension, with the kanji in Rows 89–92 (PDF — IBM–NEC kanji) and the non-kanji split between Row 92 and Row 13, to which a number of additional non-kanji and ligatures have been added (PDF — IBM–NEC non-kanji).

JIS X 0212-1990

『情報交換用漢字符号-補助漢字』 (Code of the supplementary Japanese graphic character set for information interchange)

Ref.: ISO-IR 159 Supplementary Japanese Graphic Character Set for Information Interchange.

Unihan J1 = Jis1 enumerates the 5,801 Supplemental JIS kanji (Chinese characters or 漢字) in Rows 16–77 (PDF — kanji).

Rows 2–11 contain 266 non-kanji (accented Latin, Greek and Cyrillic letters, diacritics, punctuation and a few other symbols) (PDF — non-kanji).

Firefox and Opera have different Unicode mappings for one character:

Firefox      Opera
～ FF5E      ~ 7E

ISO-2022 encoding

PDF. Successive versions of Japanese ISO-2022 encodings, incorporating an increasing number of escape sequences and character sets, are defined in RFC 1468 (ISO-2022-JP), RFC 2237 (ISO-2022-JP-1) and RFC 1554 (ISO-2022-JP-2). ISO-2022-JP-3 and ISO-2022-JP-2004 do not seem to have been published as RFCs. Furthermore, the predecessor of the ISO-2022-JP series, the JIS encoding, includes additional escape sequences.

MIME charset label: iso-2022-jp.

ISO646-US (ASCII) / ISO646-JP (JIS-Roman)

In theory, the escape sequence ESC ( B designates ISO646-US and ESC ( J designates ISO646-JP.

For historical reasons unknown to the writer, ESC ( H (technically reserved for ISO646-SE2, a Swedish character set of little use in a Japanese context) may also be used to designate ISO646-JP.

Only the positions 0x5C and 0x7E differ between ISO646-US and ISO646-JP. The following table summarises standards and implementations:

        ISO646    ISO646    IE       Safari   Safari    Firefox   Opera   Opera
        US        JP        B/J/H    B        J/H       J         B       J
0x5C    \ 5C      ¥ A5      ¥ 5C     ¥ 5C     ¥ A5      \ 5C      \ 5C    ¥ A5
0x7E    ~ 7E      ¯ AF      ~ 7E     ~ 7E     ‾ 203E    ~ 7E      ~ 7E    ‾ 203E

Whether 0x7E in ISO646-JP should map to U+00AF (macron) or U+203E (overline) may be debatable. More interesting is the character ‘¥ 5C’ found in Safari and IE, which appears as a yen sign but has the Unicode scalar value of a backslash.
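Since the two variants differ in only two positions, a decoder needs little more than a two-entry override table. The sketch below is one possible reading, not a description of any browser: it picks U+203E for 0x7E, which is exactly the debatable choice mentioned above, and its names are invented for the example.

```python
# Sketch: decode 7-bit bytes as ISO646-US or ISO646-JP.
# The choice of U+203E (rather than U+00AF) for 0x7E is an assumption.
ISO646_JP_OVERRIDES = {
    0x5C: "\u00a5",  # yen sign instead of backslash
    0x7E: "\u203e",  # overline instead of tilde
}


def decode_iso646(data: bytes, variant: str = "US") -> str:
    """Decode 7-bit data; variant is 'US' (ESC ( B) or 'JP' (ESC ( J)."""
    if variant not in ("US", "JP"):
        raise ValueError(f"unknown variant {variant!r}")
    overrides = ISO646_JP_OVERRIDES if variant == "JP" else {}
    out = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"0x{byte:02X} is not a 7-bit code")
        out.append(overrides.get(byte, chr(byte)))
    return "".join(out)


if __name__ == "__main__":
    sample = b"C:\\tmp ~"
    print(decode_iso646(sample, "US"))
    print(decode_iso646(sample, "JP"))
```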

JIS X 0208

The escape sequences ESC $ @, officially designating the 1978 version, and ESC $ B, officially designating the 1983 version, both select the newer 1990 vintage, whose official two-part escape sequence ESC & @ ESC $ B remains unrecognised.
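For comparison, Python's iso2022_jp codec (an assumption of this example, not one of the browsers tested) appears to treat the two one-part sequences alike: both designators decode the same byte pair to the same kanji. The helper name is illustrative.

```python
# Sketch: feed the same two-byte code to Python's iso2022_jp codec under
# the ESC $ @ and ESC $ B designations and compare the results.
ESC = b"\x1b"


def decode_with(designator: bytes, payload: bytes) -> str:
    """Decode payload after ESC + designator, then return to ASCII."""
    return (ESC + designator + payload + ESC + b"(B").decode("iso2022_jp")


if __name__ == "__main__":
    payload = b"0!"  # row 16, cell 1 as 7-bit bytes 0x30 0x21
    for designator in (b"$@", b"$B"):
        print(designator, decode_with(designator, payload))
    # The encoder side only ever emits one of the two sequences.
    print("亜".encode("iso2022_jp"))
```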

Implementation error: In IE, the escape sequence ESC $ ( D (see below) also designates this character set, thus making JIS X 0212 inaccessible.

All browsers include IBM and NEC extensions. IE additionally includes 63 half-width katakana (the ones from JIS X 0201) in Row 10, presumably a subset of NEC’s Row 10.

JIS X 0212

The escape sequence ESC $ ( D works as expected in Firefox and Opera. IE misinterprets it (see above). Safari does not recognise it at all.

JIS X 0201 (half-width katakana)

All browsers recognise the escape sequence ESC ( I.

Only Internet Explorer recognises SO (shift out, 0x0E) as a method of switching to half-width katakana.

Opera and Internet Explorer interpret 8-bit characters as half-width katakana. Opera must be in ISO646-JP mode.

Non-Japanese

Firefox recognises the escape sequences ESC . A for ISO 8859-1 (Latin-1), ESC . F for ISO 8859-7 (Greek), etc. [add Chinese and Korean here].

Opera interprets 8-bit characters in ISO646-US according to ISO 8859-1.

Misc.

Internet Explorer often interprets 8-bit bytes and other undefined bytes according to (or inspired by) Shift-JIS. A hybrid of ISO-2022-JP and Shift-JIS might not be a bad idea to deal with mislabelled material, but the actual implementation is much more complex than seems to be necessary. Details may be added later.

EUC encoding

PDF. MIME charset label: euc-jp

ISO646-US (ASCII) / ISO646-JP (JIS-Roman)

Code set 0 (7-bit characters) is assigned to ISO646-US. Safari and IE display the backslash as a yen sign (cf. ISO-2022-JP above).

JIS X 0208

Code set 1 (unprefixed 8-bit characters) encodes JIS X 0208-1990 with NEC extensions.

JIS X 0201

Code set 2 (8-bit characters prefixed by SS2, 0x8E) encodes half-width katakana.

Safari includes the following characters in the range 0xE0–0xE4: ¢, £, ¬, ¥, ~.

JIS X 0212

Code set 3 (8-bit characters prefixed by SS3, 0x8F) encodes JIS X 0212-1990. Firefox and Opera provide complete implementations of this code set, whereas Safari’s implementation only covers around five per cent of the characters and Internet Explorer has no implementation at all.
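The four code sets can be told apart from the first byte of each character alone. A minimal sketch of that dispatch follows; the function name is made up, and trail bytes are not validated, so it should be read as an illustration of the layout and not as a complete decoder.

```python
# Sketch: split an EUC-JP byte string into (code set, bytes) pieces.
SS2 = 0x8E  # single shift 2: half-width katakana follows
SS3 = 0x8F  # single shift 3: JIS X 0212 follows


def split_euc_jp(data: bytes) -> list[tuple[int, bytes]]:
    """Return a list of (code_set, raw_bytes), one entry per character."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            code_set, width = 0, 1      # ISO646
        elif first == SS2:
            code_set, width = 2, 2      # SS2 + one katakana byte
        elif first == SS3:
            code_set, width = 3, 3      # SS3 + two JIS X 0212 bytes
        elif 0xA1 <= first <= 0xFE:
            code_set, width = 1, 2      # two JIS X 0208 bytes
        else:
            raise ValueError(f"unexpected byte 0x{first:02X} at offset {i}")
        chunk = data[i:i + width]
        if len(chunk) < width:
            raise ValueError(f"truncated character at offset {i}")
        out.append((code_set, chunk))
        i += width
    return out


if __name__ == "__main__":
    for code_set, chunk in split_euc_jp(b"A\x8e\xb1\xb0\xa1\x8f\xb0\xa1"):
        print(code_set, chunk.hex())
```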

Shift-JIS encoding

PDF. MIME charset label: shift_jis (registered with IANA as Shift_JIS; the spelling shift-jis is also commonly accepted)

ISO646-US (ASCII) / ISO646-JP (JIS-Roman)

7-bit bytes are assigned to ISO646-US characters. Safari and IE display the backslash as a yen sign (exactly as for EUC-JP).

JIS X 0201

8-bit bytes defined in JIS X 0201 encode half-width katakana.

JIS X 0208

8-bit bytes not defined in JIS X 0201 (followed by a second byte) encode kanji, viz. JIS X 0208-1990 with both NEC and IBM extensions.
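The three cases above amount to a test on the first byte. The following sketch shows that test under the usual byte ranges (0xA1–0xDF for katakana, 0x81–0x9F and 0xE0–0xFC for lead bytes); the function name and the labels it returns are invented for the example, and second bytes are not validated.

```python
# Sketch: split a Shift-JIS byte string into single-byte and double-byte
# characters, labelling each piece by the character set it belongs to.

def split_shift_jis(data: bytes) -> list[tuple[str, bytes]]:
    """Return a list of (label, raw_bytes), one entry per character."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            label, width = "iso646", 1
        elif 0xA1 <= first <= 0xDF:
            label, width = "katakana", 1      # defined in JIS X 0201
        elif 0x81 <= first <= 0x9F or 0xE0 <= first <= 0xFC:
            label, width = "double", 2        # lead byte of a two-byte code
        else:
            raise ValueError(f"unexpected byte 0x{first:02X} at offset {i}")
        chunk = data[i:i + width]
        if len(chunk) < width:
            raise ValueError(f"truncated character at offset {i}")
        out.append((label, chunk))
        i += width
    return out


if __name__ == "__main__":
    for label, chunk in split_shift_jis("Aｱ亜".encode("shift_jis")):
        print(label, chunk.hex())
```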

Shift-JIS 2004 encoding

PDF. Supported in Safari, but no MIME label found.

Contact

temp-ooyd@coq.no