CSA Newsletter, Winter '02: Unicode technical note

Vol. XIV, No. 2

Winter, 2002

Technical Notes for "The Way Your Computer Handles Text is Changing"

If you do not see a delta within the parentheses (δ), click for help.

256 entries.

The limit of 256 (2⁸) numbers for the ASCII code system is arbitrary. Digital computers represent numbers as electric pulses and consequently express them as binary digits, as strings of zeros and ones - 01010101. The necessity to know when to end one number and start another requires that you group the sequence into regularized "bursts." For example, is "01010101" to be interpreted as "01010101," or as "010101" followed by "01"? Relatively early, bursts were standardized at eight digits. A burst of 8 digits is called a byte, while a single digit is a bit. Using bursts of eight digits, "010101" becomes a single byte of "00010101" and "01" becomes "00000001."
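The padding described above can be sketched in a few lines of Python. This is only an illustration; the helper name to_byte is hypothetical and is not taken from any standard.

```python
def to_byte(bits: str) -> str:
    """Left-pad a string of binary digits with zeros to a full 8-digit byte."""
    if len(bits) > 8 or any(ch not in "01" for ch in bits):
        raise ValueError("expected at most 8 binary digits")
    return bits.zfill(8)

print(to_byte("010101"))       # 00010101
print(to_byte("01"))           # 00000001
print(int("00010101", 2))      # 21, the number this byte represents
print(2 ** 8)                  # 256 distinct values fit in one byte
```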

The architectures of PCs and Macs are now based on multiples of 8 digits, so that standards are based on 8-bit, 16-bit, and 32-bit groupings. Because space required to store a message and time to transmit it were higher priorities in the early days of computers, the idea of using a single 8-digit burst to represent a character was a reasonable restriction when early computer standards were developed.

Multiple international versions of character sets.

The crucial standard for data transmission was codified in 1968 by the American National Standards Institute (ANSI) as a 7-bit sequence plus an eighth bit used to verify the integrity of the transmitted byte. Seven bits allow 128 (2⁷) discrete characters. Characters 0-31 and 127 were assigned to internal computer commands -- unprintable characters like tabs, line feeds and carriage returns; numbers 32-126 represented characters and symbols found on the "standard American typewriter" keyboard -- Latin lower case, upper case, punctuation marks, and common symbols.
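As a rough sketch of how an eighth bit can verify a transmitted byte, the Python below adds a parity bit to a 7-bit code. It assumes even parity (the total count of 1 bits is made even); the text above does not say which convention was used, and the function names are hypothetical.

```python
def add_parity_bit(code: int) -> int:
    """Return an 8-bit value: the 7-bit code plus a parity bit in the top position.

    Assumption: even parity, so the total number of 1 bits comes out even.
    """
    if not 0 <= code < 2 ** 7:
        raise ValueError("expected a 7-bit code (0-127)")
    parity = bin(code).count("1") % 2   # 1 only if the 7 data bits hold an odd number of ones
    return (parity << 7) | code

def parity_ok(byte: int) -> bool:
    """Check that a received byte still has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0

sent = add_parity_bit(97)             # 97 is lower case a, binary 1100001 (three ones)
print(format(sent, "08b"))            # 11100001
print(parity_ok(sent))                # True
print(parity_ok(sent ^ 0b00000100))   # False: one bit was flipped in transit
```

A single parity bit detects any one flipped bit but cannot say which bit changed, and it misses a pair of flipped bits.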

The ASCII character set was defined in 1972 as the equivalent to the ANSI system described above. With advances in computer technology, the eighth bit was no longer needed for transmission verification; so the number of characters designated by a byte expanded to 256 (2⁸) from 128 (2⁷). ASCII, however, has never defined what characters these added numbers represented. The character set officially designated by ASCII as the numbers 0-127 is often called lower ASCII, and the portion of the character set represented by the numbers 128-255 upper ASCII.

To reiterate, upper ASCII is not defined by ASCII standards. As a result, each individual software producer could, and did, use the numbers 128-255 for whatever selection of extra characters it considered important. Thus, Macintosh, Microsoft, Adobe, etc., developed proprietary character sets based on ASCII, but differing in their use of the numbers 128-255.

In 1972 the International Standards Organization's (ISO) standard 646 was published; based on ASCII, it standardized the upper ASCII around the extended Latin characters required by "major" European languages. As computer use expanded, the need to standardize the representation of more than 256 characters grew more apparent.

In 1998 ISO 646 was replaced by ISO 8859. The numbers 0-127 remain the ASCII set; 128-255 were assigned to extended Latin and non-Latin letters. Rather than limit the characters that could be included, however, multiple variations of ISO 8859 were adopted. By 1999, there were fourteen variations officially adopted or proposed:

ISO Number	Name	Use of numbers 128-255
ISO 8859-1	Latin1	Western European languages' extended Latin characters
ISO 8859-2	Latin2	Eastern European languages' extended Latin characters
ISO 8859-3	Latin3	South European languages' extended Latin characters
ISO 8859-4	Latin4	North European languages' extended Latin characters
ISO 8859-5	Cyrillic	Cyrillic characters
ISO 8859-6	Arabic	Arabic characters
ISO 8859-7	Greek	Greek characters
ISO 8859-8	Hebrew	Hebrew characters
ISO 8859-9	Latin5	Turkish extended Latin characters
ISO 8859-10	Latin6	Nordic extended Latin characters
ISO 8859-11	Thai	Thai characters
ISO 8859-12	Unused
ISO 8859-13	Latin7	The Baltic rim languages' extended Latin characters
ISO 8859-14	Latin8	Latin1 & Celtic extended Latin characters
ISO 8859-15	Latin9/Latin0	A revised Latin1 with a currency sign for the Euro & more French and Finnish characters

Once the idea of a universal character representation system was widely accepted, further variations of ISO 8859 were not proposed.

Note that IBM mainframes use a completely different code book, EBCDIC, which does not have any character representation overlap with ASCII. Discussion of EBCDIC is outside the scope of this article.

Three sites with more detailed information on various standards for international data exchange are:

The Diffuse Project of the Information Society DG of the European Commission, "Character Set Standards," for the ISO 646 character set and other standard sets (including EBCDIC) used for data interchange. Visited 17 Oct 2001, last updated Aug 2001.
Roman Czyborra, "The ISO 8859 Alphabet Soup," http://czyborra.com/charsets/iso8859.html, for a discussion of the different ISO 8859 standards. He lays out the actual numeric codes in hexadecimal. Visited 24 Sep 2001, last updated 12 Jan 1998.
Dmitry Kirsanov, Sams.net Publishing. "Character Encoding Standards", http://www.webreference.com/dlab/books/html/39-1.html, accessed 24 Sep 2001, last updated 16 June 1997.

The version of "ASCII" loaded with the operating system.

Which ASCII variation comes loaded on your computer (Mac or PC)? The ASCII variation used by your computer is loaded as part of the operating system by either you or your computer vendor. Vendors in this country generally ignore the issue and rarely give you an option. The creators of operating systems and other software use the upper ASCII numbers for characters that they require for needs related to their own software or their perceived markets. Microsoft, Adobe, Apple and IBM, among others, have individualized character sets. Computers marketed abroad, of course, use the ISO character set appropriate for each market.

For a list of correspondences between Microsoft, Macintosh and ISO character sets, see Michael S. Kaplan, "Language Codes: ISO 639, Microsoft and Macintosh," visited 23 October 2001; last updated 16 July, 2001. Note that Mr. Kaplan cautions that this is a draft and the correspondences are not one-to-one.

For a list of the conversions from these various specialized code schemes to Unicode, see the Unicode Consortium directory: www.unicode.org/Public/MAPPINGS/VENDORS/ accessed 23 Oct 2001; last updated 01 August 2001.

Mixing foreign and "American" software.

In discussing problems with non-Latin alphabets, I use my experiences with modern Greek. Similar problems exist with other alphabets and with syllabic and ideographic scripts.

For any non-American script to be accommodated in an ISO 8859 variation, its character set must be mapped into numbers in the upper ASCII range, 128-255; the lower ASCII must remain constant. With only 128 available numbers, inclusion of one language's character set may preclude another's. Thus, the modern Greek variation (ISO 8859-7) includes the lower ASCII character set and uses the upper ASCII for its characters, precluding the use of upper ASCII for anything else. For example, 228 is assigned to lower case delta (δ), not to lower case umlauted a (ä) as in ISO 8859-1. However, if the software is expecting ISO 8859-1 when it encounters a file, it will display 228 not as lower case delta (δ) but as lower case umlauted a (ä), despite the original intent.
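The 228 example can be reproduced with the codecs that ship with Python. This is a small sketch; the codec names "iso-8859-7" and "iso-8859-1" are Python's labels for the two ISO 8859 variations.

```python
raw = bytes([228])                  # the single stored number 228

print(raw.decode("iso-8859-7"))     # δ  (read as modern Greek)
print(raw.decode("iso-8859-1"))     # ä  (read as Western European)

# The lower range, 0-127, reads the same under both variations.
same = bytes([97]).decode("iso-8859-7") == bytes([97]).decode("iso-8859-1")
print(same)                         # True
```

The stored number never changes; only the table used to interpret it does.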

My Greek-English dictionary on a CD was purchased in Greece for a Windows PC. My Windows NT is set up to switch between Greek and English keyboards with the "Local Input" option; so I mounted the dictionary and assumed that there would be no problems. Even in this standard Microsoft/PC universe, clicking on the icon to start the dictionary produces a request for the location of the software in an unintelligible sequence of extended Latin letters. Obviously, the sequence is in the upper tier of ISO 8859 and, equally obviously, the variation of ISO 8859 that my Windows operating system expects is not the one used by this software. (Fortunately for me, the question is posed secondarily in English, in characters within the 0-127 range in all ISO 8859 variations.) The program is intended to work on both native English and native Greek computers and a variety of platforms. When the software application is loaded, ISO 8859-7 is established as the basis for representing characters on my bilingual computer. Therefore, once I have opened the dictionary, the extended Latin character strings disappear. The software has loaded the proper code scheme and I can use the dictionary. Notice that I have never touched the underlying ISO 8859 variation that the operating system uses. Hence the same gibberish appears every time I open the dictionary. This goes a long way towards explaining why American computers can not be easily used in Greece.

To further complicate the issues surrounding the Greek alphabet, ISO 8859-7 does not include the symbol set needed to produce Classical Greek. For example, it contains neither aspiration marks nor the digamma.

Problems can occur with multilingual texts even within a fully Microsoft universe. In Office 97 Microsoft Word will work beautifully to produce bilingual documents, but Access and Excel with this same "bilingual" setup can not handle a mixture of English and Greek fields within a single record or row.

Fonts.

Character sets, as defined in this article, are relationships between the numbers 0-255 and groups of symbols, such as letters of an alphabet. A character set does not specify the visual representation of its symbols. For example, ASCII and all ISO 8859 versions define 97 as Latin lower case a, not as a or a or a (each set here in a different font). It is a font that defines the appearance of the character, by specifying a shape and a size for each character.

Theoretically, every software package requires a font specification as well as a character set specification to display text. Every software package, however, has a default font, so that it can display its stored numbers as readable text. [If you do not see different fonts displayed in this paragraph, your browser either does not have the fonts that I chose for this example, or you have requested that no fonts but your default will be displayed. In these cases, your browser has replaced one or more of these font(s) with the default font.]

Since fonts define the visual appearance of a character by their unique design for its shape, they can substitute the shape of a Greek lower case alpha - α - for the more typical shape of a Latin lower case a - a - without affecting the underlying number - 97. Equally easily, a font could set the shape of a Greek lower case alpha - α - to represent any character in the set. Since fonts define the visual appearance of a character (based on the underlying number from 0-255), they can be used to present any mixture of symbols. A simple signal to change fonts for a given character string can produce a multilingual document.
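One way to picture a font substituting shapes without touching the stored numbers is a lookup table from number to shape. The two "fonts" below are hypothetical, two-entry stand-ins, not real font files.

```python
# Hypothetical "fonts": each maps a stored number to the shape it draws.
latin_font = {97: "a", 98: "b"}
greek_font = {97: "α", 98: "β"}    # same numbers, Greek shapes

stored = [97, 98, 97]              # the numbers actually saved in the file

def render(codes, font):
    """Draw each stored number with the shape the given font assigns to it."""
    return "".join(font[c] for c in codes)

print(render(stored, latin_font))  # aba
print(render(stored, greek_font))  # αβα
```

The file holds 97, 98, 97 in both cases; whether a reader sees Latin or Greek depends entirely on which table is applied.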

Fonts, since they are conceptualized as visual representations of character sets and not character sets themselves, do not have established international standards. It should be obvious, however, that the effect is the same. An underlying number can not be trusted to produce the desired shape/symbol on any computer, at any time. Specialized computer software is required to represent understandable text, and the knowledge of which software was used to produce the document is critical.

Substituting fonts for character sets is a solution to a specific, localized problem. Such fonts were developed outside the standards organizations and each is unique. Thus "GreekKeys" and "WinGreek" are not mutually intelligible. Both produce text using the ancient Greek alphabet. Each has its own unique encoding scheme that has nothing in common with the other or with the official ISO variant for modern Greek. For example, a final sigma (ς) is character 119 in GreekKeys, 106 in WinGreek, and 242 in ISO 8859-7.
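A conversion between such schemes amounts to one lookup table per scheme. The sketch below uses only the three final sigma codes quoted above; the table layout and the function name to_unicode are hypothetical, and a real converter would need the complete table for each scheme.

```python
FINAL_SIGMA = "\u03c2"   # Unicode's code point for ς

# One (deliberately incomplete) table per scheme: stored number -> Unicode character.
TO_UNICODE = {
    "GreekKeys":  {119: FINAL_SIGMA},
    "WinGreek":   {106: FINAL_SIGMA},
    "ISO 8859-7": {242: FINAL_SIGMA},
}

def to_unicode(codes, scheme):
    """Translate stored numbers using the table for the scheme that produced them."""
    table = TO_UNICODE[scheme]
    return "".join(table.get(c, "\ufffd") for c in codes)   # U+FFFD marks unmapped numbers

print(to_unicode([119], "GreekKeys"))      # ς
print(to_unicode([119], "WinGreek"))       # U+FFFD: this sketch's WinGreek table has no entry for 119
print(chr(119), chr(106))                  # w j  (how plain ASCII reads those two numbers)
print(bytes([242]).decode("iso-8859-7"))   # ς
```

Choosing the wrong table silently yields the wrong character, which is why knowing the software that produced a document matters.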

There is currently at least one web site that provides conversions between the various standardized Greek language processors: Sean Redmond, Greek Font to Unicode Converter, visited 26 Oct 2001.

See also the APA page for Greek font information, "GreekKeys, Greek Font Information from the American Philological Association," visited 26 Oct 2001.

Comparing visual appearances.

Many word processors can include a specified font within the search criteria. In theory that should make it possible to search for a non-Latin character. This ability, however, is of limited value since you must know the font used in the document to produce the non-Latin characters. For example, there are several ways to produce Greek characters with Word 97 fonts: Symbol, Greek C, and Greek S. Therefore, to locate all occurrences of δ, you need three separate searches, one for each font, because selecting for a font is an additional restriction to the simple number match. (This is a truly visual match; even if you use lower ASCII characters in a font-specific search, only character sequences in the specified font will be found.) Of course, if you search for the underlying number sequence and do not specify a font, all occurrences of that string of numbers will be found, regardless of their visual appearance as a Latin or Greek font.
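The difference between a plain number match and a font-restricted match can be sketched by modeling a document as runs of text, each tagged with a font. The model below is hypothetical: it assumes, for illustration only, that all three Greek fonts draw δ for the stored number 100, and it is not how Word stores documents.

```python
# A document modeled as runs of (font name, stored numbers).
document = [
    ("Times New Roman", [100, 111, 103]),   # displays as "dog"
    ("Symbol",          [100]),             # assumed to display as δ
    ("Greek C",         [100]),             # assumed to display as δ
    ("Greek S",         [100]),             # assumed to display as δ
]

def find(runs, code, font=None):
    """Return the indexes of runs containing `code`, optionally limited to one font."""
    return [i for i, (name, codes) in enumerate(runs)
            if code in codes and (font is None or name == font)]

print(find(document, 100))                  # [0, 1, 2, 3]  number match only
print(find(document, 100, font="Symbol"))   # [1]           one font at a time

greek_fonts = ["Symbol", "Greek C", "Greek S"]
hits = [i for name in greek_fonts for i in find(document, 100, font=name)]
print(hits)                                 # [1, 2, 3]     three separate searches
```

The unrestricted search also returns the Latin "d" in run 0, since it stores the same number.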

A standard.

Actually two separate standards have evolved: Unicode and the Universal Multiple-Octet Coded Character Set (UCS). The UCS is the standard defined by the International Standards Organization (ISO) publication 10646.

Unicode was developed in the commercial world by a consortium that included such hardware and software developers as IBM, Macintosh, Microsoft, Adobe, and Sun Microsystems; UCS was developed by a consortium of governmental and academic groups. These two standards are pledged to contain exactly the same character set. If characters are added to either set, then they will be added to the other. This means that if Mayan glyphs are added to Unicode at some future date, then ISO 10646 will also include them. This does not mean that the internal binary pattern that represents any given character is identical in the two standards, but that each standard contains a unique, fixed-length binary pattern for each defined character and anyone will be able to transform one binary pattern set to the other unambiguously.

Further information on ISO/IEC 10646 is available: http://anubis.dkuug.dk/JTC1/SC2/WG2/, updated 26 Jan 01, visited 17 Sept 01.

An overview of the relationship between UCS and Unicode is available: Olle Järnefors, A short overview of ISO/IEC 10646 and Unicode, updated 26 Feb 1996, accessed 19 Nov 2001.

90,000 characters.

Unicode expands the number of characters encoded by using multiple bytes per character. There are both 16-bit (2 bytes) and 32-bit (4 bytes) versions of Unicode. These versions have limits of 65,536 (2¹⁶) and 4,294,967,296 (2³²) characters respectively. The current 94,000+ characters in Unicode 3.1 include such diverse signs as Latin, Greek and Cyrillic alphabets; Hebrew and Arabic scripts; Japanese Hiragana; Chinese ideograms; Kangxi radicals; Unified Canadian Aboriginal Syllabics; mathematical operators; and currency symbols.
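The arithmetic above, and the effect of using two or four bytes per character, can be checked in Python. This sketch uses Python's "utf-16-be" and "utf-32-be" codecs as stand-ins for the 16-bit and 32-bit forms; treating them as equivalent to the versions named above is an assumption made for illustration.

```python
print(2 ** 16)    # 65536
print(2 ** 32)    # 4294967296

delta = "δ"
print(ord(delta))                        # 948, the Unicode number for δ
print(delta.encode("utf-16-be").hex())   # 03b4      (2 bytes)
print(delta.encode("utf-32-be").hex())   # 000003b4  (4 bytes)
```

The character keeps one number, 948, in both forms; only the count of bytes used to store it differs.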

For the complete list of the currently assigned characters in Unicode see: http://www.Unicode.org/charts/, updated 09 Oct 2001, accessed 19 Nov 2001.

Implementation.

Unicode is required in all new internet protocols and is supposedly implemented in all "modern" operating systems and computer languages. The Unicode Consortium was founded in 1991, but not all operating systems and computer languages developed since 1991 can process it. For instance, Microsoft says that Unicode has "built in" support in Windows NT and Windows 2000, but "only limited support on Windows 95 and Windows 98."

http://www.microsoft.com/globaldev/articles/Unicode.asp, Microsoft Global Software Development "Unicode, character sets and codepages," updated 24 August 2001; accessed 30 Aug 2001.

For a list of "Unicode enabled products" see: http://www.unicode.org/unicode/onlinedat/products.html, updated 07 Nov 2001, accessed 19 Nov 2001.
