Generating Unicode-style character charts

The code described on this page can be used to make character charts in the same style as those provided by the Unicode Consortium. It is a by-product of our Windows-1252 table for the Unicode book.

The Unibook utility may fulfil the same purpose, but is slightly less flexible and, more importantly, strictly limited to Windows, whereas our script works with any standard PostScript interpreter, including Ghostscript and Apple’s pstopdf.

PostScript code

The chart-generating code can be downloaded as chart.ps. Please do not redistribute modified versions. The only dependency is Adobe’s list of glyph names for new fonts, which is used to find PostScript names corresponding to Unicode values.

If the file chart.ps is sent to a printer or otherwise interpreted on its own (well, together with the glyph-name list), the result will be a blank page. To obtain useful results, chart.ps must instead be referenced from a separate file as follows:

        (chart.ps) run

(PostScript strings are enclosed in round brackets.) The chart’s title and code range (not to exceed 256 characters) are then indicated like this:

        (Windows CP 1252, alternative C1 range)
        16#80 16#9F range

(The notation b#n is PostScript for the number n written in base b, so 16#80 is hexadecimal 80, i.e., decimal 128.) These three lines, which must always be included, are sufficient to get a title and an empty grid with the correct number of columns (limited to 16).
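To make the radix notation concrete, here is a minimal Python sketch that parses b#n literals and counts the columns a range would span. The function names parse_radix and column_count are hypothetical, and the column rule (one column per block of 16 code points, as in the Unicode charts) is an assumption about how chart.ps lays out the grid, not something taken from its source.

```python
def parse_radix(token: str) -> int:
    """Convert a PostScript radix literal such as '16#9F' to an integer."""
    base, _, digits = token.partition("#")
    if not digits:
        # No '#' present: treat the token as a plain decimal integer.
        return int(token)
    return int(digits, int(base))


def column_count(first: int, last: int) -> int:
    """Number of 16-code-point columns covering first..last (assumed layout)."""
    return (last >> 4) - (first >> 4) + 1


first = parse_radix("16#80")
last = parse_radix("16#9F")
print(first, last, column_count(first, last))  # prints: 128 159 2
```

Under that assumption, the range 16#80 to 16#9F used above needs two columns, and a full 256-character range needs the maximum of 16.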

Three macros in chart.ps are used to fill in the boxes, viz., x for a diagonally hatched box, u for a Unicode character and ps for a named PostScript glyph, e.g.:

        16#81 x
        16#82 16#201A u
        16#83 16#0192 /florin ps

The first argument always indicates the position in the code chart. u and ps take a second argument that indicates the wanted Unicode code point. In the case of ps, the Unicode value is only used for the label (the number printed below the character), whereas an additional third argument supplies the name of the wanted glyph. (Note that PostScript names start with a solidus.) The command ps is needed only occasionally, e.g., when the font uses a non-standard or old glyph name.
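When the entries are produced by a program rather than typed by hand, a small helper can emit the three forms. The following Python sketch is one way to do so; the function name entry and its signature are hypothetical, and the output merely mirrors the format of the examples above.

```python
from typing import Optional


def entry(pos: int, code: Optional[int] = None, glyph: Optional[str] = None) -> str:
    """Return one chart line for position pos.

    No code point     -> hatched box (x)
    Code point only   -> Unicode character (u)
    Code point + name -> named PostScript glyph (ps)
    """
    if code is None:
        if glyph is not None:
            raise ValueError("a glyph name needs a code point for the label")
        return f"16#{pos:02X} x"
    if glyph is None:
        return f"16#{pos:02X} 16#{code:04X} u"
    # PostScript names are written with a leading solidus.
    return f"16#{pos:02X} 16#{code:04X} /{glyph} ps"


print(entry(0x81))
print(entry(0x82, 0x201A))
print(entry(0x83, 0x0192, "florin"))
```

Run as is, this reproduces the three example lines shown above.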

Examples

Unicode provides mappings for many legacy encodings. For instance, the file CP1252.TXT provides data for Windows-1252 with lines like the following:

        0x9C    0x0153    #LATIN SMALL LIGATURE OE
        0x9D              #UNDEFINED

Considering the discussion above and knowing that the comment character in PostScript is %, the lines need to be converted into the following format:

        16#9C    16#0153    u %LATIN SMALL LIGATURE OE
        16#9D               x %UNDEFINED

The following sed command does the trick:

        sed 's/#UNDEFINED/x %UNDEFINED/; s/#/u %/; s/0x/16#/g'

Extracting the interesting entries and adding the three boilerplate lines described earlier gives the file CP1252.ps.
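For readers who prefer a single script to a sed pipeline, the extraction and the boilerplate can be combined roughly as follows. This is a Python sketch, not the procedure actually used for CP1252.ps: the function names convert_line and build_chart are hypothetical, the sample lines are only illustrative of the mapping-file format quoted above, and the title is assumed to contain no unbalanced parentheses or backslashes.

```python
def convert_line(line: str) -> str:
    """Apply the same three substitutions as the sed command, in the same order."""
    line = line.replace("#UNDEFINED", "x %UNDEFINED", 1)  # s/#UNDEFINED/x %UNDEFINED/
    line = line.replace("#", "u %", 1)                     # s/#/u %/
    return line.replace("0x", "16#")                       # s/0x/16#/g


def build_chart(mapping_lines, title: str, first: int, last: int) -> str:
    """Return the text of a chart file for positions first..last."""
    out = [
        "(chart.ps) run",
        f"({title})",
        f"16#{first:02X} 16#{last:02X} range",
    ]
    for line in mapping_lines:
        line = line.rstrip("\n")
        if not line.startswith("0x"):
            continue  # skip header comments and blank lines
        pos = int(line.split()[0], 16)
        if first <= pos <= last:
            out.append(convert_line(line))
    return "\n".join(out) + "\n"


sample = [
    "# header comment\n",
    "0x7F\t0x007F\t#DELETE\n",
    "0x9C\t0x0153\t#LATIN SMALL LIGATURE OE\n",
    "0x9D\t\t#UNDEFINED\n",
]
print(build_chart(sample, "Windows CP 1252, alternative C1 range", 0x80, 0x9F), end="")
```

The order of the substitutions matters: 0x is rewritten last, so that the # it introduces is not mistaken for the start of a comment.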

* * *

Likewise, the file ROMAN.TXT can be used to create MacRoman.ps, a code chart for the upper half of the Mac OS Roman character set. This file uses ps for a few non-standard glyph names and also illustrates how to use multiple fonts.

* * *

It is worth stressing that arbitrary PostScript code can be used, which gives great flexibility. ASCII.ps takes advantage of the for loop to fill in a table of the printable ASCII characters as follows:

        first 1 last {
            dup u
        } for

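In this example, first and last hold the code values given as parameters to range. On each iteration, the for loop pushes the current value onto the operand stack, and dup, a standard PostScript operator, duplicates it, so that the same number serves as both the chart position and the Unicode code point expected by u.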
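The stack behaviour of this loop can be mimicked in a few lines of Python. The sketch below is only a model: run_loop is a hypothetical name, and the stand-in for u merely records its two arguments instead of drawing anything.

```python
def run_loop(first: int, last: int):
    """Model 'first 1 last { dup u } for' with an explicit operand stack."""
    stack = []
    calls = []

    def u():
        # u consumes two operands: the chart position and the code point.
        code = stack.pop()
        pos = stack.pop()
        calls.append((pos, code))

    for value in range(first, last + 1):  # for: step 1, limit inclusive
        stack.append(value)      # for pushes the control variable
        stack.append(stack[-1])  # dup duplicates the top of the stack
        u()

    assert not stack  # every pushed value has been consumed
    return calls


print(run_loop(0x20, 0x22))  # prints: [(32, 32), (33, 33), (34, 34)]
```

Each position is thus paired with the identical code point, which is exactly what a chart of ASCII needs.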

PDF charts

The PostScript code described in the previous section gives CP1252.pdf, MacRoman.pdf and ASCII.pdf when passed through ps2pdf.

Contact

temp-otpn@coq.no