中 Simplified Chinese

Mainland Chinese character sets

GB 2312-80

《信息交换用汉字编码字符集·基本集》 (Code of Chinese Graphic Character Set for Information Interchange, Primary Set)

Ref.: ISO-IR 58 Coded Chinese Graphic Character Set for Information Interchange.

Unihan G0 = GB0 enumerates all the 6,763 hanzi (Chinese characters or 漢字) in Rows 16–87 (PDF — hanzi). All four browsers follow Unihan.

There are 682 other characters in Rows 1–9: alphabetic characters (Hiragana, Katakana, Latin, Greek and Cyrillic), numerals, miscellaneous symbols and line-drawing elements (PDF — non-hanzi).
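The two counts can be cross-checked against any GB 2312 decoder. The following is a minimal sketch using Python's built-in `gb2312` codec, on the assumption that its table matches the standard. The helper name `count_row_range` is made up for this illustration, and row r, cell c is addressed through the EUC byte pair (0xA0 + r, 0xA0 + c).

```python
def count_row_range(first_row, last_row, codec="gb2312"):
    """Count the row/cell positions in the given rows that the codec can decode."""
    count = 0
    for row in range(first_row, last_row + 1):
        for cell in range(1, 95):  # 94 cells per row
            pair = bytes([0xA0 + row, 0xA0 + cell])
            try:
                pair.decode(codec)
            except UnicodeDecodeError:
                continue  # unassigned position
            count += 1
    return count


hanzi = count_row_range(16, 87)   # hanzi rows
others = count_row_range(1, 9)    # non-hanzi rows
print(hanzi, others)
```

Passing `codec="gbk"` instead would also count the GBK/1 additions in Rows 1–9, so the second figure is only comparable for a plain GB 2312 table.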

Safari’s ISO-2022 implementation has a few divergent Unicode mappings:

Expected | ESC $ ) A | ESC $ ) E
· b7 | ・ 30fb | ・ 30fb
— 2014 | ― 2015 | ― 2015
~ ff5e | ~ 7e
$ ff04 | $ 24
¢ ffe0 | ¢ a2
£ ffe1 | £ a3
' ff07 | ´ b4
g ff47 | ɡ 261
73
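The split seen in the first two rows is not peculiar to Safari. As a hedged illustration, Python's built-in codecs show the same pair of conventions: `gb2312` yields the Safari values and `gbk` the expected ones. The byte pairs 0xA1A4 and 0xA1AA are assumed here to be the EUC forms of the middle dot and the dash.

```python
# Decode the same two byte pairs with a plain GB 2312 table and with GBK.
results = {}
for pair in (b"\xa1\xa4", b"\xa1\xaa"):
    strict = pair.decode("gb2312")
    extended = pair.decode("gbk")
    results[pair] = (ord(strict), ord(extended))
    print(pair.hex(), f"gb2312={ord(strict):04x}", f"gbk={ord(extended):04x}")
```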

ISO-IR 165:1992

Ref.: ISO-IR 165 Codes of the Chinese graphic character set for communication.

This is a GB 2312-80 superset. It incorporates the 705 hanzi added in GB 8565.2-88 (another GB 2312-80 superset), of which 69 are quasi-hanzi (㋀–㋋, ㏠–㏾, ㍘–㍰ and 〷), and the 132 alphabetic characters added in GB 6345.1-86 (yet another GB 2312-80 superset), as well as a further selection of 139 hanzi and 22 shading characters. This gives 7,607 hanzi and 836 others, 8,443 in total (matching Lunde 1999), of which 998 are added characters.
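The totals follow directly from the figures quoted for GB 2312-80 and the three sets of additions. A short arithmetic check, with variable names chosen only for this illustration:

```python
gb2312_hanzi, gb2312_others = 6763, 682

added_hanzi = 705 + 139    # GB 8565.2-88 additions + further selection
added_others = 132 + 22    # GB 6345.1-86 letters + shading characters

hanzi = gb2312_hanzi + added_hanzi
others = gb2312_others + added_others
total = hanzi + others
added = added_hanzi + added_others
print(hanzi, others, total, added)
```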

Safari’s ISO-2022 implementation includes most of the added characters apart from the 22 shading characters and the 32 half-width pinyin, most of which are missing from Unicode (PDF). No other browser seems to support this character set at all.

The reference above furthermore mentions final sigma (ς) and two unspecified additional ‘Chinese phonetic symbols’ (Latin lowercase with diacritics, mostly pinyin), 8,446 characters in total, but these three supplementary characters do not seem to appear in the table, only in the overview.

GBK

GBK is an extension to GB 2312 divided into five parts:

GBK/1

GBK/1 includes all the non-hanzi of GB 2312 as well as 45 additional characters, namely 35 vertical variants from GB/T 12345-90 (Mainland China’s basic repertoire of traditional Chinese characters) and 10 Roman numerals (PDF).

10 of the vertical variants were apparently missing from the original definition of GBK (cf. Lunde). All browsers currently map 10 vertical variants (presumably the same ones) to private-use-area characters or U+FFFD:

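Expected | IE | Safari* | Firefox HZ/GBK | Firefox 2022/18030 | Opera
︐ fe10 | e78d | e78d | � fffd | e78d | � fffd
︒ fe12 | e78e | e78e | � fffd | e78e | � fffd
︑ fe11 | e78f | e78f | � fffd | e78f | � fffd
︓ fe13 | e790 | e790 | � fffd | e790 | � fffd
︔ fe14 | e791 | e791 | � fffd | e791 | � fffd
︕ fe15 | e792 | e792 | � fffd | e792 | � fffd
︖ fe16 | e793 | e793 | � fffd | e793 | � fffd
︗ fe17 | e794 | e794 | � fffd | e794 | � fffd
︘ fe18 | e795 | e795 | � fffd | e795 | � fffd
︙ fe19 | e796 | e796 | � fffd | e796 | � fffd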
*) Safari’s ISO-2022 implementation does not include GBK/1 extensions at all.
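The tables on this page distinguish three outcomes for a decoded character: the expected code point, a private-use-area code point, or U+FFFD. A minimal sketch of that classification follows. The function name `classify` and its labels are made up for this illustration, and only the BMP private-use area (U+E000–U+F8FF) is considered.

```python
def classify(decoded: str) -> str:
    """Label a single decoded character the way the comparison tables do."""
    cp = ord(decoded)
    if cp == 0xFFFD:
        return "replacement"   # the decoder gave up on the byte sequence
    if 0xE000 <= cp <= 0xF8FF:
        return "private use"   # BMP private-use area
    return "regular"           # anything else, e.g. the expected character


for cp in (0xFE10, 0xE78D, 0xFFFD):
    print(f"{cp:04x}", classify(chr(cp)))
```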

2 accented letters (not used in any European language or found in any other character set and probably added to Unicode relatively recently) are also mapped to PUA characters or U+FFFD in many implementations:

Expected | IE HZ/GBK | IE 18030 | Safari HZ/GBK | Safari 18030 | Firefox HZ/GBK | Firefox 2022/18030 | Opera
ḿ 1e3f | e7c7 | e7c7 | e7c7 | e7c7 | � fffd | e7c7 | � fffd
ǹ 1f9 | e7c8 | ǹ 1f9 | e7c8 | ǹ 1f9 | � fffd | ǹ 1f9 | ǹ 1f9

The euro symbol at Column 2 Cell 67 should probably be considered a GBK/1 extension as well.

IESafariFirefoxOpera
HZ/GBK18030HZ/GBK2022/18030
2–67€ 20ac€ 20ac€ 20ac€ 20ac

GBK/2

GBK/2 is identical to the hanzi part of GB 2312.

GBK/3

GBK/3 enumerates, in Unicode order, all the 6,080 hanzi in the range U+4E00–U+72DB missing from GB 2312 (PDF).

GBK/4

GBK/4 enumerates, in Unicode order, all the 8,059 hanzi in the range U+72DC–U+9FA5 missing from GB 2312 (PDF). GBK thus covers all the 20,902 characters in the range U+4E00–U+9FA5, apparently the full set of hanzi included in Unicode 1.0.
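The GBK/3 and GBK/4 counts can be reproduced from any pair of GB 2312 and GBK tables. This sketch uses Python's built-in `gb2312` and `gbk` codecs, on the assumption that they match the tables described here; the helper `encodable` is made up for the example.

```python
def encodable(ch: str, codec: str) -> bool:
    """True if the codec has a byte sequence for this character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


gbk3 = gbk4 = missing_from_gbk = 0
for cp in range(0x4E00, 0x9FA5 + 1):
    ch = chr(cp)
    if not encodable(ch, "gbk"):
        missing_from_gbk += 1          # should never happen if GBK covers the range
    elif not encodable(ch, "gb2312"):
        if cp <= 0x72DB:
            gbk3 += 1                  # hanzi that would fall in GBK/3
        else:
            gbk4 += 1                  # hanzi that would fall in GBK/4

print(gbk3, gbk4, 6763 + gbk3 + gbk4, missing_from_gbk)
```

The third printed figure adds the GB 2312 hanzi back in and should equal the size of the range U+4E00–U+9FA5.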

GBK/4 furthermore contains 101 additional hanzi.

14 are defined in Unihan G9 (PDF). None of these are mapped to the correct Unicode scalar value in any implementation:

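Expected | IE | Safari | Firefox GBK | Firefox GB18030 | Opera
𠂇 20087 | e816 | e816 | � fffd | e816 | � fffd
𠂉 20089 | e817 | e817 | � fffd | e817 | � fffd
𠃌 200cc | e818 | e818 | � fffd | e818 | � fffd
龴 9fb4 | e81e | e81e | � fffd | e81e | � fffd
龵 9fb5 | e826 | e826 | � fffd | e826 | � fffd
龶 9fb6 | e82b | e82b | � fffd | e82b | � fffd
龷 9fb7 | e82c | e82c | � fffd | e82c | � fffd
𡗗 215d7 | e831 | e831 | � fffd | e831 | � fffd
龸 9fb8 | e832 | e832 | � fffd | e832 | � fffd
𢦏 2298f | e83b | e83b | � fffd | e83b | � fffd
龹 9fb9 | e843 | e843 | � fffd | e843 | � fffd
龺 9fba | e854 | e854 | � fffd | e854 | � fffd
𤇾 241fe | e855 | e855 | � fffd | e855 | � fffd
龻 9fbb | e864 | e864 | � fffd | e864 | � fffd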

The mapping of the remaining 87 cannot be derived from Unihan (PDF). All GB18030 implementations handle these characters correctly, whereas most GBK implementations use PUA characters or U+FFFD:

Expected | IE GBK | IE 18030 | Safari GBK | Safari 18030 | Firefox GBK | Firefox 18030 | Opera
⺁ 2e81 | e815 | ⺁ 2e81 | e815 | ⺁ 2e81 | � fffd | ⺁ 2e81 | ⺁ 2e81
⺄ 2e84 | e819 | ⺄ 2e84 | e819 | ⺄ 2e84 | � fffd | ⺄ 2e84 | ⺄ 2e84
㑳 3473 | e81a | 㑳 3473 | e81a | 㑳 3473 | � fffd | 㑳 3473 | 㑳 3473
㑇 3447 | e81b | 㑇 3447 | e81b | 㑇 3447 | � fffd | 㑇 3447 | 㑇 3447
⺈ 2e88 | e81c | ⺈ 2e88 | e81c | ⺈ 2e88 | � fffd | ⺈ 2e88 | ⺈ 2e88
⺋ 2e8b | e81d | ⺋ 2e8b | e81d | ⺋ 2e8b | � fffd | ⺋ 2e8b | ⺋ 2e8b
㖞 359e | e81f | 㖞 359e | e81f | 㖞 359e | � fffd | 㖞 359e | 㖞 359e
㘚 361a | e820 | 㘚 361a | e820 | 㘚 361a | � fffd | 㘚 361a | 㘚 361a
㘎 360e | e821 | 㘎 360e | e821 | 㘎 360e | � fffd | 㘎 360e | 㘎 360e
⺌ 2e8c | e822 | ⺌ 2e8c | e822 | ⺌ 2e8c | � fffd | ⺌ 2e8c | ⺌ 2e8c
⺗ 2e97 | e823 | ⺗ 2e97 | e823 | ⺗ 2e97 | � fffd | ⺗ 2e97 | ⺗ 2e97
㥮 396e | e824 | 㥮 396e | e824 | 㥮 396e | � fffd | 㥮 396e | 㥮 396e
㤘 3918 | e825 | 㤘 3918 | e825 | 㤘 3918 | � fffd | 㤘 3918 | 㤘 3918
㧏 39cf | e827 | 㧏 39cf | e827 | 㧏 39cf | � fffd | 㧏 39cf | 㧏 39cf
㧟 39df | e828 | 㧟 39df | e828 | 㧟 39df | � fffd | 㧟 39df | 㧟 39df
㩳 3a73 | e829 | 㩳 3a73 | e829 | 㩳 3a73 | � fffd | 㩳 3a73 | 㩳 3a73
㧐 39d0 | e82a | 㧐 39d0 | e82a | 㧐 39d0 | � fffd | 㧐 39d0 | 㧐 39d0
㭎 3b4e | e82d | 㭎 3b4e | e82d | 㭎 3b4e | � fffd | 㭎 3b4e | 㭎 3b4e
㱮 3c6e | e82e | 㱮 3c6e | e82e | 㱮 3c6e | � fffd | 㱮 3c6e | 㱮 3c6e
㳠 3ce0 | e82f | 㳠 3ce0 | e82f | 㳠 3ce0 | � fffd | 㳠 3ce0 | 㳠 3ce0
⺧ 2ea7 | e830 | ⺧ 2ea7 | e830 | ⺧ 2ea7 | � fffd | ⺧ 2ea7 | ⺧ 2ea7
⺪ 2eaa | e833 | ⺪ 2eaa | e833 | ⺪ 2eaa | � fffd | ⺪ 2eaa | ⺪ 2eaa
䁖 4056 | e834 | 䁖 4056 | e834 | 䁖 4056 | � fffd | 䁖 4056 | 䁖 4056
䅟 415f | e835 | 䅟 415f | e835 | 䅟 415f | � fffd | 䅟 415f | 䅟 415f
⺮ 2eae | e836 | ⺮ 2eae | e836 | ⺮ 2eae | � fffd | ⺮ 2eae | ⺮ 2eae
䌷 4337 | e837 | 䌷 4337 | e837 | 䌷 4337 | � fffd | 䌷 4337 | 䌷 4337
⺳ 2eb3 | e838 | ⺳ 2eb3 | e838 | ⺳ 2eb3 | � fffd | ⺳ 2eb3 | ⺳ 2eb3
⺶ 2eb6 | e839 | ⺶ 2eb6 | e839 | ⺶ 2eb6 | � fffd | ⺶ 2eb6 | ⺶ 2eb6
⺷ 2eb7 | e83a | ⺷ 2eb7 | e83a | ⺷ 2eb7 | � fffd | ⺷ 2eb7 | ⺷ 2eb7
䎱 43b1 | e83c | 䎱 43b1 | e83c | 䎱 43b1 | � fffd | 䎱 43b1 | 䎱 43b1
䎬 43ac | e83d | 䎬 43ac | e83d | 䎬 43ac | � fffd | 䎬 43ac | 䎬 43ac
⺻ 2ebb | e83e | ⺻ 2ebb | e83e | ⺻ 2ebb | � fffd | ⺻ 2ebb | ⺻ 2ebb
䏝 43dd | e83f | 䏝 43dd | e83f | 䏝 43dd | � fffd | 䏝 43dd | 䏝 43dd
䓖 44d6 | e840 | 䓖 44d6 | e840 | 䓖 44d6 | � fffd | 䓖 44d6 | 䓖 44d6
䙡 4661 | e841 | 䙡 4661 | e841 | 䙡 4661 | � fffd | 䙡 4661 | 䙡 4661
䙌 464c | e842 | 䙌 464c | e842 | 䙌 464c | � fffd | 䙌 464c | 䙌 464c
䜣 4723 | e844 | 䜣 4723 | e844 | 䜣 4723 | � fffd | 䜣 4723 | 䜣 4723
䜩 4729 | e845 | 䜩 4729 | e845 | 䜩 4729 | � fffd | 䜩 4729 | 䜩 4729
䝼 477c | e846 | 䝼 477c | e846 | 䝼 477c | � fffd | 䝼 477c | 䝼 477c
䞍 478d | e847 | 䞍 478d | e847 | 䞍 478d | � fffd | 䞍 478d | 䞍 478d
⻊ 2eca | e848 | ⻊ 2eca | e848 | ⻊ 2eca | � fffd | ⻊ 2eca | ⻊ 2eca
䥇 4947 | e849 | 䥇 4947 | e849 | 䥇 4947 | � fffd | 䥇 4947 | 䥇 4947
䥺 497a | e84a | 䥺 497a | e84a | 䥺 497a | � fffd | 䥺 497a | 䥺 497a
䥽 497d | e84b | 䥽 497d | e84b | 䥽 497d | � fffd | 䥽 497d | 䥽 497d
䦂 4982 | e84c | 䦂 4982 | e84c | 䦂 4982 | � fffd | 䦂 4982 | 䦂 4982
䦃 4983 | e84d | 䦃 4983 | e84d | 䦃 4983 | � fffd | 䦃 4983 | 䦃 4983
䦅 4985 | e84e | 䦅 4985 | e84e | 䦅 4985 | � fffd | 䦅 4985 | 䦅 4985
䦆 4986 | e84f | 䦆 4986 | e84f | 䦆 4986 | � fffd | 䦆 4986 | 䦆 4986
䦟 499f | e850 | 䦟 499f | e850 | 䦟 499f | � fffd | 䦟 499f | 䦟 499f
䦛 499b | e851 | 䦛 499b | e851 | 䦛 499b | � fffd | 䦛 499b | 䦛 499b
䦷 49b7 | e852 | 䦷 49b7 | e852 | 䦷 49b7 | � fffd | 䦷 49b7 | 䦷 49b7
䦶 49b6 | e853 | 䦶 49b6 | e853 | 䦶 49b6 | � fffd | 䦶 49b6 | 䦶 49b6
䲣 4ca3 | e856 | 䲣 4ca3 | e856 | 䲣 4ca3 | � fffd | 䲣 4ca3 | 䲣 4ca3
䲟 4c9f | e857 | 䲟 4c9f | e857 | 䲟 4c9f | � fffd | 䲟 4c9f | 䲟 4c9f
䲠 4ca0 | e858 | 䲠 4ca0 | e858 | 䲠 4ca0 | � fffd | 䲠 4ca0 | 䲠 4ca0
䲡 4ca1 | e859 | 䲡 4ca1 | e859 | 䲡 4ca1 | � fffd | 䲡 4ca1 | 䲡 4ca1
䱷 4c77 | e85a | 䱷 4c77 | e85a | 䱷 4c77 | � fffd | 䱷 4c77 | 䱷 4c77
䲢 4ca2 | e85b | 䲢 4ca2 | e85b | 䲢 4ca2 | � fffd | 䲢 4ca2 | 䲢 4ca2
䴓 4d13 | e85c | 䴓 4d13 | e85c | 䴓 4d13 | � fffd | 䴓 4d13 | 䴓 4d13
䴔 4d14 | e85d | 䴔 4d14 | e85d | 䴔 4d14 | � fffd | 䴔 4d14 | 䴔 4d14
䴕 4d15 | e85e | 䴕 4d15 | e85e | 䴕 4d15 | � fffd | 䴕 4d15 | 䴕 4d15
䴖 4d16 | e85f | 䴖 4d16 | e85f | 䴖 4d16 | � fffd | 䴖 4d16 | 䴖 4d16
䴗 4d17 | e860 | 䴗 4d17 | e860 | 䴗 4d17 | � fffd | 䴗 4d17 | 䴗 4d17
䴘 4d18 | e861 | 䴘 4d18 | e861 | 䴘 4d18 | � fffd | 䴘 4d18 | 䴘 4d18
䴙 4d19 | e862 | 䴙 4d19 | e862 | 䴙 4d19 | � fffd | 䴙 4d19 | 䴙 4d19
䶮 4dae | e863 | 䶮 4dae | e863 | 䶮 4dae | � fffd | 䶮 4dae | 䶮 4dae

GBK/5

GBK/5 is a set of 166 symbols (PDF).

The ideographic variation indicator and 12 ideographic description characters are mapped to PUA characters or U+FFFD in most GBK implementations:

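Expected | IE GBK | IE 18030 | Safari GBK | Safari 18030 | Firefox GBK | Firefox 18030 | Opera
〾 303e | e7e7 | 〾 303e | e7e7 | 〾 303e | � fffd | 〾 303e | 〾 303e
⿰ 2ff0 | e7e8 | ⿰ 2ff0 | e7e8 | ⿰ 2ff0 | � fffd | ⿰ 2ff0 | ⿰ 2ff0
⿱ 2ff1 | e7e9 | ⿱ 2ff1 | e7e9 | ⿱ 2ff1 | � fffd | ⿱ 2ff1 | ⿱ 2ff1
⿲ 2ff2 | e7ea | ⿲ 2ff2 | e7ea | ⿲ 2ff2 | � fffd | ⿲ 2ff2 | ⿲ 2ff2
⿳ 2ff3 | e7eb | ⿳ 2ff3 | e7eb | ⿳ 2ff3 | � fffd | ⿳ 2ff3 | ⿳ 2ff3
⿴ 2ff4 | e7ec | ⿴ 2ff4 | e7ec | ⿴ 2ff4 | � fffd | ⿴ 2ff4 | ⿴ 2ff4
⿵ 2ff5 | e7ed | ⿵ 2ff5 | e7ed | ⿵ 2ff5 | � fffd | ⿵ 2ff5 | ⿵ 2ff5
⿶ 2ff6 | e7ee | ⿶ 2ff6 | e7ee | ⿶ 2ff6 | � fffd | ⿶ 2ff6 | ⿶ 2ff6
⿷ 2ff7 | e7ef | ⿷ 2ff7 | e7ef | ⿷ 2ff7 | � fffd | ⿷ 2ff7 | ⿷ 2ff7
⿸ 2ff8 | e7f0 | ⿸ 2ff8 | e7f0 | ⿸ 2ff8 | � fffd | ⿸ 2ff8 | ⿸ 2ff8
⿹ 2ff9 | e7f1 | ⿹ 2ff9 | e7f1 | ⿹ 2ff9 | � fffd | ⿹ 2ff9 | ⿹ 2ff9
⿺ 2ffa | e7f2 | ⿺ 2ffa | e7f2 | ⿺ 2ffa | � fffd | ⿺ 2ffa | ⿺ 2ffa
⿻ 2ffb | e7f3 | ⿻ 2ffb | e7f3 | ⿻ 2ffb | � fffd | ⿻ 2ffb | ⿻ 2ffb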

GB 18030

GB 18030 extends GBK by encoding every Unicode character not already in GBK as a four-byte sequence. This large extension is not currently tested here; only the GBK subset is.
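For code points outside the BMP, the four-byte sequences are, to the best of my understanding of GB 18030, a plain positional numbering that starts at 0x90 0x30 0x81 0x30 for U+10000. The sketch below implements only that part and compares it with Python's built-in `gb18030` codec. The function name is made up, and the byte ranges (0x81–0xFE and 0x30–0x39) are stated from memory of the standard rather than taken from this page.

```python
def gb18030_supplementary(cp: int) -> bytes:
    """Four-byte GB 18030 sequence for a code point in U+10000..U+10FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled here")
    index = cp - 0x10000
    index, b4 = divmod(index, 10)    # fourth byte: 0x30..0x39
    index, b3 = divmod(index, 126)   # third byte:  0x81..0xFE
    b1, b2 = divmod(index, 10)       # second byte: 0x30..0x39
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


cp = 0x20087  # one of the 14 Unihan G9 hanzi listed under GBK/4
print(gb18030_supplementary(cp).hex(), chr(cp).encode("gb18030").hex())
```

BMP characters missing from GBK use four-byte sequences as well, but those depend on a lookup table of ranges and are not covered by this sketch.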

HZ encoding

MIME charset label: hz-gb-2312.

ISO646-US (ASCII)

One-byte mode encodes ISO646-US. This is the default mode and can be selected by the sequence ~ }: hz-gb-2312. A literal tilde in one-byte mode has to be doubled.

GB 2312

Two-byte mode encodes GB 2312 (with GBK/1 extensions and often the euro sign). The sequence ~ { selects this mode: hz-gb-2312.
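The two modes and the tilde escapes can be sketched as a small encoder. This is an illustration only: `hz_encode` is a made-up name, every non-ASCII character is assumed to be in GB 2312, and line-break handling is ignored. The two-byte codes are taken from Python's `gb2312` codec with the high bit of each byte cleared.

```python
def hz_encode(text: str) -> bytes:
    """Encode text as HZ, assuming every non-ASCII character is in GB 2312."""
    out = bytearray()
    two_byte = False
    for ch in text:
        if ord(ch) < 0x80:
            if two_byte:
                out += b"~}"          # back to one-byte mode
                two_byte = False
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            if not two_byte:
                out += b"~{"          # enter two-byte mode
                two_byte = True
            # EUC bytes with the high bit cleared give the 7-bit form
            out += bytes(b & 0x7F for b in ch.encode("gb2312"))
    if two_byte:
        out += b"~}"                  # always finish in one-byte mode
    return bytes(out)


sample = hz_encode("a~b 汉字")
print(sample, sample.decode("hz"))
```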

ISO-2022 encoding

PDF. The Chinese ISO 2022 encoding is defined in RFC 1922.

MIME charset labels: iso-2022-cn-ext (Safari and Firefox), iso-2022-cn (Safari, Firefox and Opera; limited number of character sets in Firefox and Opera).

Note: This encoding includes Traditional Chinese character sets as well, as mentioned elsewhere.

ISO646-US (ASCII)

G0 encodes ISO646-US, which is the default character set in the absence of any escapes/shifts and can be selected explicitly by shift in (SI, 0x0F): iso-2022-cn-ext, iso-2022-cn.

GB 2312

The designator sequence ESC $ ) A selects GB 2312 (usually with GBK/1 extensions and euro sign) as G1, which can be invoked by shift out (SO, 0x0E): iso-2022-cn-ext, iso-2022-cn.
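The interplay of designation and shifting can be sketched as follows. This is a simplified illustration, not a conforming RFC 1922 encoder: `iso2022cn_encode` is a made-up name, only GB 2312 is designated, and any additional rules of the RFC (for instance around line ends) are not modelled.

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"


def iso2022cn_encode(text: str) -> bytes:
    """Encode ASCII plus GB 2312 text with ESC $ ) A, SO and SI."""
    out = bytearray()
    designated = False   # has GB 2312 been designated as G1 yet?
    shifted = False      # is G1 currently invoked (after SO)?
    for ch in text:
        if ord(ch) < 0x80:
            if shifted:
                out += SI                # back to G0 (ASCII)
                shifted = False
            out += ch.encode("ascii")
        else:
            if not designated:
                out += ESC + b"$)A"      # designate GB 2312 as G1
                designated = True
            if not shifted:
                out += SO                # invoke G1
                shifted = True
            out += bytes(b & 0x7F for b in ch.encode("gb2312"))
    if shifted:
        out += SI
    return bytes(out)


print(iso2022cn_encode("汉字 ok"))
```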

ISO-IR 165

The designator sequence ESC $ ) E selects ISO-IR 165 as G1: iso-2022-cn-ext, iso-2022-cn.

EUC encoding

PDF. None of the browsers has a pure EUC-CN implementation (at least not associated with one of the expected MIME charset labels); instead, they implement the GBK superset. They all implement GB 18030 as well, but keep the two separate.

MIME charset labels: gbk, gb18030.

ISO646-US (ASCII)

Code set 0 (7-bit characters) encodes ISO646-US: gbk, gb18030.

GB 2312

Code set 1 (unprefixed 8-bit characters) encodes GB 2312: gbk, gb18030.
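The split between the two code sets depends only on the high bit of the first byte. A minimal tokenizer sketch, with no validation of the byte ranges and a made-up function name:

```python
def split_euc_cn(data: bytes):
    """Yield (code_set, raw_bytes) tokens from an EUC-CN byte string."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            yield 0, data[i:i + 1]   # code set 0: one 7-bit byte
            i += 1
        else:
            yield 1, data[i:i + 2]   # code set 1: two 8-bit bytes
            i += 2


tokens = list(split_euc_cn("A汉".encode("gb2312")))
print(tokens)
```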

All four browsers include the euro sign at 0x80 in their GBK implementations. Firefox does the same for GB 18030.
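Python's built-in `gbk` codec rejects a lone 0x80 byte, so the browser behaviour has to be emulated. One possible approach is a custom decode error handler, as sketched here; the handler name `euro_at_0x80` is made up, and the sketch says nothing about how the browsers implement this internally.

```python
import codecs


def euro_at_0x80(exc):
    """Error handler: treat an undecodable 0x80 byte as the euro sign."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == 0x80:
        return "\u20ac", exc.start + 1   # emit the euro sign, resume after the byte
    raise exc


codecs.register_error("euro_at_0x80", euro_at_0x80)

text = b"\xba\xba \x80 5".decode("gbk", errors="euro_at_0x80")
print(text)
```

A plain byte replacement before decoding would not work, because 0x80 is also a valid second byte of a two-byte GBK code.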

Opera adds ISO-8859/1 characters to its HZ and ISO-2022 implementations.
