Miscellaneous 8-bit encodings

KOI-8 or КОИ-8
(Код Обмена Информацией, 8 битов)

The KOI-8 family of encodings encodes basic Cyrillic letters in such a way that a case-swapped Latin transliteration is obtained if the eighth bit is stripped and the resulting bytes are interpreted as ISO 646 or ASCII.
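The bit-stripping property can be tried out with a few lines of Python. This is only a minimal sketch: the helper name strip_eighth_bit is made up for illustration, and it assumes Python's standard koi8_r codec.

```python
def strip_eighth_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every byte, read the result as ASCII."""
    raw = text.encode("koi8_r")
    stripped = bytes(b & 0x7F for b in raw)
    return stripped.decode("ascii")


# Lowercase Cyrillic comes out as uppercase Latin and vice versa.
print(strip_eighth_bit("Привет"))  # pRIWET
print(strip_eighth_bit("Москва"))  # mOSKWA
```

The swapped case is a side effect of the layout: lowercase Cyrillic letters sit in columns 12 and 13, which land on the uppercase Latin columns 4 and 5 once the high bit is gone.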

KOI8-R (Russian)

koi8-r254
1
252
3
253
1 + 1
255
PDF. Ref.: RFC 1489.

Roman Czyborra surmises that position 0x95 should actually be U+2022 (general bullet) rather than U+2219 (mathematical bullet operator) as defined in RFC 1489. Whatever the case may be, it is almost certainly too late now to correct this possible error.
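To see which of the two candidates a given implementation has settled on, the byte can simply be decoded and named. The sketch below assumes Python's standard koi8_r codec; the helper name describe_byte is hypothetical.

```python
import unicodedata


def describe_byte(byte: int, codec: str = "koi8_r"):
    """Return (code point label, Unicode character name) for one byte in the given codec."""
    ch = bytes([byte]).decode(codec)
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


# Position 0x95 is the one questioned above.
print(describe_byte(0x95))
```

An implementation that follows RFC 1489 literally should report the bullet operator here rather than the general bullet.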

KOI8-U (Ukrainian and Russian)

koi8-u254
1
250
3 + 2
253
1 + 1
255
PDF. Ref.: RFC 2319 (typo in the last table: U0403 for U0404).

KOI8-U replaces 8 line-drawing characters from columns 10 and 11 in KOI8-R with the Ukrainian letters Є/є, І/і, Ї/ї and Ґ/ґ.
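The difference between the two tables can be listed mechanically. The following sketch assumes Python's standard koi8_r and koi8_u codecs (which may or may not match any particular browser); the function name differing_positions is made up for illustration.

```python
def differing_positions(codec_a: str = "koi8_r", codec_b: str = "koi8_u"):
    """Map each high byte that decodes differently to its (codec_a, codec_b) characters."""
    diffs = {}
    for byte in range(0x80, 0x100):
        a = bytes([byte]).decode(codec_a)
        b = bytes([byte]).decode(codec_b)
        if a != b:
            diffs[byte] = (a, b)
    return diffs


for byte, (old, new) in sorted(differing_positions().items()):
    print(f"0x{byte:02X}: {old} -> {new}")
```

With these codecs the output should consist of the Ukrainian letters named above, all located in columns 10 and 11.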

Internet Explorer interprets KOI8-U as KOI8-RU (see below).

KOI8-RU (Byelorussian, Ukrainian and Russian)

koi8-ru252
3
koi8-u252
3
PDF. Ref. missing.

KOI8-RU can be considered a further modification of KOI8-U obtained by replacing 2 line-drawing characters with the Byelorussian letter Ў/ў. Some sources (e.g., Валентин Нечаев’s exposé) show versions with typographical symbols replacing a number of line-drawing characters in column 9.

Internet Explorer interprets KOI8-U as KOI8-RU.

KOI8-E (ECMA)

iso-ir-111254
1
PDF. Refs: ISO-IR 111, ECMA-113 (1st Ed., 1986).

NB: The character set referred to as ‘iso-ir-111’ in RFC 1345 is completely different.

Vietnamese

The number of precomposed characters required for Vietnamese does not fit into an ISO-8859 encoding. Instead, a number of encodings have been developed which place printable characters not only in columns 8 and 9 (in the same way as Windows and Macintosh encodings do), but also in columns 0 and 1 which are normally strictly reserved for control characters.
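For readers unfamiliar with the column notation used throughout this page, the sketch below shows how a byte value maps to a column and row of the 16 × 16 code table, and which columns are conventionally reserved for control characters. The helper names are hypothetical.

```python
def column_row(byte: int):
    """Split a byte into (column, row): the high and low four bits."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a single byte")
    return byte >> 4, byte & 0x0F


def in_control_column(byte: int) -> bool:
    """True for columns 0 and 1 (C0 area) and columns 8 and 9 (C1 area)."""
    column, _ = column_row(byte)
    return column in (0, 1, 8, 9)


print(column_row(0xA0))          # (10, 0)
print(in_control_column(0x02))   # True
print(in_control_column(0xA0))   # False
```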

VISCII

viscii254
1
255
PDF. Refs: RFC 1456, Vietnamese Character Encoding Standardization Report: VISCII and VIQR 1.1 (local PDF).

VISCII fills columns 8–15 with Vietnamese letters and puts the remaining six relatively infrequent accented uppercase letters in columns 0 and 1. The potentially problematic columns 8 and 9 are reserved for uppercase letters to ensure that a full set of lowercase letters will always be available.

This encoding was designed to be compatible with ISO 8859/1 in the sense that all accented letters present in both are encoded in the same positions, which was indeed the case for version 1.0 of the standard. Unfortunately, the lowercase letter ạ was put at 0xA0, which may be problematic under Windows since this position is normally used for a non-breaking space. To solve the problem, it was swapped with Õ (originally at 0xD5) in version 1.1. Full ISO 8859/1 compatibility could easily have been preserved by choosing another uppercase letter from columns 10–15 that is not present in ISO 8859/1, so it is not clear why Õ was selected.
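The effect of the swap on ISO 8859/1 compatibility can be illustrated with the two positions involved. This is a sketch only: the tables below contain just the two entries mentioned in the text, not the full VISCII repertoire, and the names are made up for illustration.

```python
# Only the two positions discussed above; everything else is omitted.
VISCII_1_0 = {0xA0: "\u1ea1", 0xD5: "\u00d5"}  # a with dot below, O with tilde


def swap_positions(table: dict, a: int, b: int) -> dict:
    """Return a copy of the table with the characters at positions a and b exchanged."""
    swapped = dict(table)
    swapped[a], swapped[b] = table[b], table[a]
    return swapped


def matches_latin1(table: dict) -> dict:
    """For each position, report whether the character equals the ISO 8859/1 one."""
    return {pos: bytes([pos]).decode("latin-1") == ch for pos, ch in table.items()}


VISCII_1_1 = swap_positions(VISCII_1_0, 0xA0, 0xD5)
print(matches_latin1(VISCII_1_0))  # 0xD5 agrees with ISO 8859/1
print(matches_latin1(VISCII_1_1))  # 0xD5 no longer agrees
```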

TCVN-5712 or VSCII

tcvn255
x-viet-tcvn5712254
1
PDF. Ref.: TCVN 5712:1993 (English plain-text summary).
Subset (columns 10–15): ISO-IR 180.

In addition to a full set of precomposed Vietnamese letters, TCVN-5712 includes non-breaking space at 0xA0 and the five tone marks as combining diacritics at 0xB0–0xB4, which means that twelve letters are relegated to columns 0 and 1. Uppercase and lowercase Vietnamese letters without tone marks are placed at 0xA1–0xAE in column 10, and all lowercase letters with tone marks can be found in columns 11–15.
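The relationship between the combining tone marks and the precomposed letters can be shown in Unicode terms. The sketch below deliberately says nothing about which TCVN-5712 byte carries which mark; it only uses the five Unicode combining characters for the Vietnamese tones and standard NFC normalisation. The names TONE_MARKS and compose are made up for illustration.

```python
import unicodedata

# The five Vietnamese tone marks as Unicode combining characters.
TONE_MARKS = {
    "grave": "\u0300",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "acute": "\u0301",
    "dot below": "\u0323",
}


def compose(base: str, mark: str) -> str:
    """Combine a base letter with a tone mark and normalise to the precomposed form."""
    return unicodedata.normalize("NFC", base + mark)


for name, mark in TONE_MARKS.items():
    letter = compose("a", mark)
    print(f"{name}: {letter} (U+{ord(letter):04X})")
```

A text using the combining marks and one using the precomposed letters therefore describe the same characters, which is why an encoding that offers both has fewer positions left for letters.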

There is a third variant of this encoding which contains no uppercase letters with tone marks. This is intended to be used with a dedicated uppercase font.

VPS or VNCII

vps255
x-viet-vps254
1
PDF. Ref. missing.

The Windows version of the Vietnamese Professionals Society’s encoding, as implemented in their fonts, contains a full set of precomposed Vietnamese letters plus inverted commas (‘ and ’), non-breaking space and an eclectic collection of five lowercase European letters (ß, ö, ü, î and ç). The latter are possibly an attempt to satisfy the basic needs for French and German, although the absence of û and ä makes this explanation less plausible. These eight positions are all undefined in the Society’s Unix fonts, however, and it is not clear why they were not used to encode Vietnamese letters, fourteen of which were instead put in columns 0 and 1.

Sami

A Swedish standard defines three Sami character sets: an ISO-2022-compatible version (Sami 1, registered as ISO-IR 209) as well as Windows (Sami 2) and Macintosh (Sami 3) variants. Befitting its Nordic roots, Opera implements one of these.

Sami 2

windows-sami-2248
7
PDF. Ref.: Teknisk norm nr. 35:1.

Pan-European Latin

T.51: Latin-based Coded Character Sets for Telematic Services

ISO 6937 or ITU-T T.51, registered as ISO-IR 156, defines a (fairly) comprehensive set of extended and accented Latin letters needed for European languages, which is made possible by using two bytes for characters with diacritics (PDF).
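The two-byte scheme can be sketched as follows. Note the assumptions: the three diacritic byte values (0xC1 grave, 0xC2 acute, 0xC8 diaeresis) are quoted from memory of ISO 6937 and should be checked against the standard, the table is far from complete, and the function name decode_t51_sketch is hypothetical. The point is the ordering: the non-spacing diacritic byte comes before the base letter, whereas Unicode combining marks follow it.

```python
import unicodedata

# Partial, assumed table: non-spacing diacritic byte -> Unicode combining mark.
DIACRITIC_BYTES = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC8: "\u0308",  # diaeresis
}


def decode_t51_sketch(data: bytes) -> str:
    """Decode ASCII-range bytes plus the few diacritic sequences listed above."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte in DIACRITIC_BYTES and i + 1 < len(data) and data[i + 1] < 0x80:
            # Reorder: base letter first, then the combining mark.
            out.append(chr(data[i + 1]) + DIACRITIC_BYTES[byte])
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))
            i += 1
        else:
            out.append("\ufffd")  # not covered by this sketch
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


print(decode_t51_sketch(b"caf\xc2e"))    # café
print(decode_t51_sketch(b"na\xc8ive"))   # naïve
```

A decoder that passes the diacritic through without reordering it would attach the mark to the wrong letter.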

In Firefox, the MIME charset string t.61 (amongst others) selects this encoding. Errors: Ż replaces Ź; Ņ and ¤ are missing. Alternative mappings: ^ 5E and ~ 7E replace ˆ 2C6 and ˜ 2DC, the visual mapping to ǵ (g with acute) is used instead of the logical mapping to ģ (g with cedilla), and Đ 110 (d with stroke) has been chosen instead of the visually identical Ð D0 (eth). $ is encoded twice, but # is not.

IE has an implementation associated with the MIME charset string x-cp20269. Some characters are missing, and accented letters are handled incorrectly (diacritics will appear either in front of the letter they are meant to modify or above/below the preceding letter).

T.61: Character Repertoire and Coded Character Sets for the International Teletex Service

ITU-T T.61, registered as ISO-IR 102 (left-hand side) and ISO-IR 103 (right-hand side), defines a subset of an older version of T.51 (T.51, Nov 1988) which is based on the old ISO 646-IRV with currency sign (¤) instead of dollar ($). It may have been the intention to update T.61 to reflect changes in T.51, but the 1993 version is unavailable, marked as ‘withdrawn’ and ‘never published’.

Compared to T.51, the following characters are missing from the left-hand side: \, ^, `, {, }, ~ and delete. The right-hand side excludes the following: non-breaking space, ‘, “, ←, ↑, →, ↓, ’, ”, —, ¹, ®, ©, ™, ♪, ¬, ¦, ⅛, ⅜, ⅝, ⅞, soft hyphen. The non-spacing underscore used for underlining, since deprecated, is included.

IE implements this encoding under the MIME label x-cp20261. As in Firefox’s T.51 implementation, ǵ, ^ and ~ replace ģ, ˆ and ˜. A number of additional accented letters are included.

T.101

T.101 includes another variation on T.51.

Contact

temp-ozn3@coq.no