Charset 5

This page links Anne van Kesteren’s standardisation effort to the information found on this site.

Single-octet encodings

The following table links to our tables for the single-octet encodings included in Van Kesteren’s draft standard as of 8 April 2012. The ‘∆’ column indicates differences between the definitions in the draft standard and our tables (there are none at the moment).

dos-864 (ibm864) PDF JS
dos-866 (ibm866) PDF JS
iso-8859-2 PDF JS
iso-8859-3 PDF JS
iso-8859-4 PDF JS
iso-8859-5 PDF JS
iso-8859-6 PDF JS
iso-8859-7 PDF JS
iso-8859-8 PDF JS
iso-8859-10 PDF JS
iso-8859-13 PDF JS
iso-8859-14 PDF JS
iso-8859-15 PDF JS
iso-8859-16 PDF JS
koi8-r PDF JS
koi8-u PDF JS
macintosh PDF JS
windows-874 PDF JS
windows-1250 PDF JS
windows-1251 PDF JS
windows-1252 PDF JS
windows-1253 PDF JS
windows-1254 PDF JS
windows-1255 PDF JS
windows-1256 PDF JS
windows-1257 PDF JS
windows-1258 PDF JS
mac-cyrillic (x-mac-cyrillic) PDF JS
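
The kind of comparison behind a ‘∆’ column can be sketched in a few lines of Python. This is only an illustration: it uses the codec tables bundled with Python (under Python's own codec names such as `koi8_r` and `iso8859_15`), which are neither the draft standard's tables nor the ones on this site, and the helper names `decode_table` and `table_diff` are made up for the example.

```python
def decode_table(codec):
    """Return a 256-entry list: the character for each byte, or None if undefined."""
    table = []
    for b in range(256):
        try:
            table.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            table.append(None)
    return table


def table_diff(codec_a, codec_b):
    """List (byte, char_in_a, char_in_b) for every position where two tables differ."""
    a = decode_table(codec_a)
    b = decode_table(codec_b)
    return [(i, a[i], b[i]) for i in range(256) if a[i] != b[i]]


# Example: where do Python's KOI8-R and KOI8-U tables disagree?
koi8_delta = table_diff("koi8_r", "koi8_u")
for byte, in_r, in_u in koi8_delta:
    print(f"0x{byte:02X}: koi8-r {in_r!r}  koi8-u {in_u!r}")
```

An empty result from `table_diff` corresponds to an empty ‘∆’ cell: the two definitions agree on all 256 positions.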

The inclusion of Code page 864 (Arabic) is somewhat surprising. No proper reference has been found, a few positions differ between implementations, and it is unclear whether the mapping to presentation forms is essential or incidental.

It may be that KOI8-RU should be included instead of KOI8-U.

(Previous versions of Van Kesteren’s draft included spurious mappings for a few undefined positions in Windows-874 and Windows-1253. This has now been corrected.)
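
Undefined positions of this kind are easy to list mechanically. The sketch below does so against Python's bundled `cp874` and `cp1253` codecs, on the assumption that they are close enough to Windows-874 and Windows-1253 to be illustrative; they are not the draft standard's definitions, and `undefined_positions` is a name invented for this example.

```python
def undefined_positions(codec):
    """Return the byte values that the given single-byte codec refuses to decode."""
    holes = []
    for b in range(256):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            holes.append(b)
    return holes


for name in ("cp874", "cp1253"):
    holes = undefined_positions(name)
    print(name, " ".join(f"{b:02X}" for b in holes))
```

A table that maps any of the reported positions to a character contains a spurious mapping relative to the codec used for the check.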

Multi-octet encodings

Each of the larger and more complex encodings below is defined in terms of an encoding table and a set of character set tables. (The encoding tables only cover the mapping of two-byte character sets; there may be additional one-byte character sets with trivial mappings.)

The character set tables whose names start with ‘u-’ are derived from Unihan, whereas ‘o-’ indicates a complementary table listing characters missing from Unihan.

Chinese (simplified) encodings

Single-byte charset: iso646-us.



Character sets: u-g0 o-g0 o-gbk1 €.

IE decodes 8-bit bytes according to (a version of) EUC/GBK.



A (circle) character sets: u-g0 o-g0 o-gbk1 €.
B (diamond) character set: u-gbk3.
C (square) character sets: u-gbk4 u-g9 o-g9 o-gbk5.

GB18030 includes additional characters not mentioned above.
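
As a rough illustration of those additional characters, the following Python sketch compares how many bytes the bundled `gbk` and `gb18030` codecs need for a few sample characters. The codecs are Python's, not the tables described here, and `encoded_length` is a hypothetical helper.

```python
def encoded_length(ch, codec):
    """Return how many bytes `codec` uses for `ch`, or None if it cannot encode it."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


# ASCII letter, a common hanzi, a Latin letter outside GBK, a supplementary-plane code point.
samples = ["A", "\u4e2d", "\u00e4", "\U00010000"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        "gbk:", encoded_length(ch, "gbk"),
        "gb18030:", encoded_length(ch, "gb18030"),
    )
```

Characters that GBK cannot represent come out as four-byte sequences in GB18030, which is why the two-byte tables above cannot cover the whole encoding.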

Chinese (traditional) encoding

Single-byte charset: iso646-us.

Big5 and Big5-HKSCS


A (circle) character sets: u-big5-1 u-big5-2 o-big5-1 eten1 eten2 eten1-hk.
B (diamond) character sets: u-h o-h o-h-comp.

Japanese encodings

Single-byte charsets: iso646-jp iso646-us hybrids (the details are rather complex).



ESC $ @ and ESC $ B character sets: u-j0 o-j0 u-nec o-nec.
ESC $ ( D character sets: u-j1 o-j1.
ESC ( I and SI and 8-bit (single-octet) character set: jis-x-0201.

Non-Japanese character sets are not mentioned here.

IE decodes 8-bit bytes according to (a version of) Shift-JIS.
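
To make the role of the escape sequences concrete, here is a small Python sketch that splits a 7-bit byte stream into runs, labelling each run with the character set tables named above. The entries for ESC ( B and ESC ( J, the assumed initial state, and the function name `split_by_designation` are additions for the example and are not taken from the list above; SI is not handled.

```python
ESC = 0x1B

# Escape sequence -> tables that apply to the bytes that follow it.
DESIGNATIONS = {
    b"\x1b$@": "u-j0 o-j0 u-nec o-nec",
    b"\x1b$B": "u-j0 o-j0 u-nec o-nec",
    b"\x1b$(D": "u-j1 o-j1",
    b"\x1b(I": "jis-x-0201",
    b"\x1b(B": "iso646-us",  # assumption: not listed in the text above
    b"\x1b(J": "iso646-jp",  # assumption: not listed in the text above
}


def split_by_designation(data):
    """Split `data` into (tables, bytes) runs at each recognised escape sequence."""
    runs = []
    current = "iso646-us"  # assumed initial state
    buf = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for seq, label in DESIGNATIONS.items():
                if data.startswith(seq, i):
                    break
            else:
                raise ValueError(f"unrecognised escape sequence at offset {i}")
            if buf:
                runs.append((current, bytes(buf)))
                buf = bytearray()
            current = label
            i += len(seq)
        else:
            buf.append(data[i])
            i += 1
    if buf:
        runs.append((current, bytes(buf)))
    return runs


sample = b'A\x1b$B$"\x1b(BZ'
for tables, chunk in split_by_designation(sample):
    print(tables, chunk)
```

The middle run of `sample` is the two-byte code 0x24 0x22, which Python's own `iso2022_jp` codec decodes to U+3042.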



Character sets 1 (unprefixed): u-j0 o-j0 u-nec o-nec.
Character set 2 (SS2 prefix, single-octet): jis-x-0201.
Character sets 3 (SS3 prefix): u-j1 o-j1.
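
SS2 and SS3 are the single-shift bytes 0x8E and 0x8F, so the three groups above can be told apart from the first byte of each sequence. The Python sketch below does just that; the function name `split_euc` and the group number 0 for plain 7-bit bytes are choices made for the example.

```python
SS2 = 0x8E
SS3 = 0x8F


def split_euc(data):
    """Split bytes into (group, sequence) pairs using the lead byte only."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SS2:
            group, size = 2, 2  # SS2 + one byte (single-octet set)
        elif b == SS3:
            group, size = 3, 3  # SS3 + two bytes
        elif 0xA1 <= b <= 0xFE:
            group, size = 1, 2  # unprefixed two-byte character
        elif b < 0x80:
            group, size = 0, 1  # 7-bit single byte
        else:
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
        chunk = data[i:i + size]
        if len(chunk) < size:
            raise ValueError(f"truncated sequence at offset {i}")
        out.append((group, chunk))
        i += size
    return out


# Hiragana A followed by halfwidth katakana A, encoded with Python's euc_jp codec.
print(split_euc("\u3042\uff71".encode("euc_jp")))
```

The sketch only classifies sequences; it does not check that the trailing bytes are in range or that the resulting code is assigned in any table.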



Character set (single-octet): jis-x-0201.
Character sets: u-j0 o-j0 u-ibmjapan o-ibmjapan u-nec o-nec.

Korean encodings

Single-byte charsets: iso646-kr iso646-us hybrids.



Character sets: u-ksc0 o-k0 wansung.

8-byte hangul is usually not supported.

IE decodes 8-bit bytes according to (a version of) EUC-KR.



A (circle) character sets: u-ksc0 o-k0 wansung.
B (diamond) character set: uhc.

8-byte hangul?
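
One way to probe how an implementation treats syllables outside the basic two-byte set is to encode one and look at the result. The Python sketch below compares the bundled `euc_kr` and `cp949` codecs for a syllable that is in the basic set (U+AC00) and one that is not (U+AC02). What the local `euc_kr` codec does with the second syllable is reported rather than assumed, and `encode_or_none` is a hypothetical helper.

```python
def encode_or_none(ch, codec):
    """Return the encoded bytes, or None if the codec cannot encode the character."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


for ch in ("\uac00", "\uac02"):
    for codec in ("euc_kr", "cp949"):
        result = encode_or_none(ch, codec)
        shown = result.hex() if result is not None else "not encodable"
        print(f"U+{ord(ch):04X} {codec}: {shown}")
```

A two-byte result from `cp949` with a lead byte below 0xA1 falls in the extension area (the ‘uhc’ table above), whereas an 8-byte result from `euc_kr` would be a composed sequence of the kind the question above refers to.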