Vol. XIV, No. 3
Winter, 2002

The Way Your Computer Handles Text is Changing

Susan C. Jones


The links in this discussion allow those who wish to follow it in greater technical depth to do so; the linked sections are not necessary to follow the main arguments.

If you cannot see the non-Latin characters within the parentheses (α, β) with your current browser settings, click for help.

Introduction:

For anyone using modern computers, it is difficult to remember that most computers are digital devices that deal with numbers, not text or images. Computers process electric pulses representing these numbers and present the results in a more usable, "user-friendly," form -- text or images. The problems of translating these numbers into readable text on a screen or printer are discussed below. (I discussed image processing in "Raster And Vector Images - An Important Distinction," in the Spring, 1997 Newsletter, vol. X no. 1, http://csanet.org/newsletter/spring97/nl059707.html.)

Representing text numerically should not be a difficult problem. Anyone can devise a simple numeric code that sets a number sequence against the alphabet: 1 = a, 2 = b, etc. Then he or she can exchange messages in this code with anyone else who knows the sequence. Such a simple one-to-one number-to-letter code is the basis for the American Standard Code for Information Interchange (ASCII). The characters on the standard American typewriter keyboard and non-printing computer "control characters" (end-of-file markers, line feeds, etc.) take up the 128 numbers that ASCII defines; since an eight-bit byte can hold 256 values, another 128 numbers are left undefined. American typewriters are, however, quite parochial. They ignore both the extensions to the Latin alphabet used in most modern European languages (á, ü) and other non-alphabetic symbols (currency designators like the euro sign, mathematical operators like the integral sign). Various software vendors, recognizing ASCII's lacunae, used the 128 "open" numbers to extend the character set that they offered their customers.
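
To make the idea concrete, here is a minimal sketch in Python (mine, not part of the original article) that exposes the numeric codes behind a few characters; the toy 1 = a, 2 = b code and the ASCII table work on exactly the same principle.

    # The toy code from the text: 1 = a, 2 = b, and so on through the alphabet.
    toy_code = {chr(ord('a') + i): i + 1 for i in range(26)}
    print(toy_code['a'], toy_code['b'], toy_code['z'])   # 1 2 26

    # ASCII is the same idea with a standardized table; ord() and chr()
    # reveal the numbers a computer actually stores for each character.
    print(ord('A'), ord('a'), ord('!'))                  # 65 97 33
    print(chr(65), chr(97))                              # A a

    # ASCII proper defines only the values 0-127; an accented letter such as
    # 'á' has no place in it and must come from an extended character set.
    print(ord('á'))                                      # 225 -- outside the 7-bit range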

It is obvious that the 128 "open" entries are not enough to cover all the extensions to the Latin alphabet and the non-alphabetic symbols. The size of the problem only swells when non-Latin alphabets, syllabaries, or ideograms need to be fitted into the character set. As a result, there are currently multiple international character sets to deal with the various extensions of the American Latin keyboard. These character sets are defined by the International Organization for Standardization (ISO). In the fourteen different character sets defined by ISO standard 8859, the various extended Latin alphabets are grouped together somewhat arbitrarily by geographic area, such as western Europe and eastern Europe, while non-Latin alphabets are assigned their own versions. In all these variations the first 128 entries remain the standard American keyboard and non-printing characters; only the second tier of 128 differs. How any individual computer translates this upper tier depends upon which character set was loaded with the operating system when it was booted. These variations in the upper tier cause many of the problems that are encountered when mixing foreign and "American" software on the same computer. In the discussion below, I will use Greek as an example since that is the non-Latin character set with which I am most familiar.
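
The following sketch (again mine, not the article's) shows how the same upper-tier byte value is read as a different letter depending on which ISO 8859 character set the system assumes, while the lower 128 values come out the same everywhere.

    raw = bytes([0xE1])                     # a single byte from the upper tier

    print(raw.decode('iso-8859-1'))         # 'á' under the western European set
    print(raw.decode('iso-8859-7'))         # 'α' under the Greek set
    print(raw.decode('iso-8859-5'))         # 'с' under the Cyrillic set

    # The first 128 values are identical in every ISO 8859 variant:
    print(bytes([0x41]).decode('iso-8859-7'))   # 'A'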

Expanding the character set with fonts:

Since multilingual documents may require a mixture of alphabets that no single ISO 8859 variant covers, computing vendors and users turned to another method of presenting such texts: using fonts to substitute one alphabet for another. When a different font is used, the visual appearance of a letter changes even though its underlying code does not; the same code that produces a b in one font can produce a β in a Greek font. To the reader the result can look the same as a change of character set.

This is the approach taken by "GreekKeys" and "WinGreek" for representing ancient Greek. Both have a specialized font of ancient Greek characters that replaces the standard one. Each package was developed independently as the need was perceived, so each is tailored to a specific computer platform and software environment and works within its own realm. There is, however, no consistency among these and other packages, even for encoding a given alphabet. Therefore, documents generated using GreekKeys produce garbled text on systems that use WinGreek, and both appear to be garbage to a computer equipped only with standard fonts.

Searching and ordering:

Even on a single computer, varying fonts to produce Greek letters will create problems. These problems are primarily in the areas of searching for text strings and ordering them.

Most computer text searches are based on matching the underlying numeric codes that generate a character string, not its visual appearance, its font. The software, in effect, converts the sequence of symbols being sought into its underlying numeric sequence and finds occurrences of that numeric sequence in the numeric representation of the text, according to the options it has been given. Searches cannot, therefore, reliably recognize letter sequences if unknown character sets or fonts are used.
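
A short Python sketch illustrates the point; the "font trick" spelling below is hypothetical, but any scheme that stores Greek-looking text under Latin character codes behaves the same way.

    # Text typed under a Greek-substitution font: the reader sees something
    # like beta-eta-tau-alpha, but the stored codes are those of b, h, t, a.
    font_trick_text = "bhta"

    # Text stored with genuine Greek character codes:
    real_greek_text = "βητα"

    # A search for the Greek string matches only where the numbers match.
    print("βητα" in font_trick_text)             # False
    print("βητα" in real_greek_text)             # True
    print([ord(c) for c in font_trick_text])     # [98, 104, 116, 97]
    print([ord(c) for c in real_greek_text])     # [946, 951, 964, 945]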

The generation and ordering of multilingual indices are equally unpredictable for the same reasons. Computers use the numbers, not the visual appearance, to sort text. The numbers underlying the text are the controlling factor, not an "alphabet."
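
The same can be shown for sorting; in this sketch even genuine Greek text sorts oddly when the computer simply compares code values, because the accented ἄ carries a much larger number than β or γ.

    words = ["γάμμα", "ἄλφα", "βῆτα"]
    print(sorted(words))
    # ['βῆτα', 'γάμμα', 'ἄλφα'] -- "alpha" lands last, not first, because the
    # sort compares numbers, not positions in the Greek alphabet.

    print([hex(ord(w[0])) for w in words])       # ['0x3b3', '0x1f04', '0x3b2']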

Unicode solution:

The ambiguities inherent in multiple character sets, in fonts constructed for specific uses, and in the multiple ASCII/ISO standards have long been recognized by the international community. As computing and the internet became more universal, the use of individually modified character sets and special fonts, although sufficient for some uses, became increasingly inadequate. A single international standard containing a complete set of all characters used in any language in the world was needed to replace the overlapping variations of ISO 8859. Work on such a standard began in the early 1990s and continues today. Unlike ASCII and ISO 8859, this standard was designed from the beginning to be unambiguous, universal, efficient, and uniform. In other words, it was intended to contain all letters, syllabic symbols, and ideographs used in any written language that a user might wish to depict. All told, there is a staggering number of unique characters in the world's written languages (both ancient and modern), so the task was not easy. The universal standard adopted by all the major software companies is Unicode. It now has more than 90,000 characters defined. Creators of web browsers and word processors are committed in principle to using Unicode, but full implementation is another story.
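
In practical terms, the single universal table means that every character, Latin or Greek, carries its own unambiguous number (its code point), and the same text can be stored and recovered without any font trickery. A small Python sketch:

    for ch in "bβ":
        print(ch, f"U+{ord(ch):04X}")            # b U+0062, β U+03B2 -- distinct code points

    text = "alpha (α) and beta (β)"
    encoded = text.encode("utf-8")               # one standard byte representation
    print(encoded.decode("utf-8") == text)       # True -- round-trips intact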

The full implementation of Unicode will not be quick or easy. I presume that the computer industry will not keep generating new systems based on "outmoded" standards. But until software and operating system developers decide that standardization outweighs the effort and money it requires, processing multilingual documents will remain a quagmire.

Changeover problems:

Unicode has been in existence for over ten years now, and it is still not the most common character set in use. For example, older software, such as Office 97, does not have Unicode options. Most web browsers, however, do have a Unicode option along with ISO 8859 options.

Even after the universal adoption of Unicode by the computer industry, there will be a long period of transition when multiple character sets (and fonts used as character sets) will be in use. There will be a time, however, when every computer user will have to complete the conversion from older systems, with their specialized fonts and their ambiguous character sets, to Unicode.

In the meantime, those who must deal with more than one character set (multiple ISO 8859 versions, or Unicode alongside an ISO 8859 version) face some critical problems. For instance, file names produced under one character set may appear as gibberish, or be rejected as illegal, under another. Similarly, applications may or may not be able to open files that use an unexpected character set. Such a problem occurred recently at CSA. We received graphics files (in TIFF format) from Propylaea Project personnel in Athens. The names of the files appeared correctly in a directory listing on a PC (running either Windows NT or Windows 2000), but Adobe Photoshop would not open them, producing an error message that the file could not be found. Conversely, the names did not appear correctly in a Mac directory listing, but Photoshop on the Mac would open them.
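
The following sketch is only a hypothetical parallel to the CSA incident (the actual file names and systems involved are not reproduced here), but it shows how bytes written under one character set turn into gibberish, or are rejected outright, under another.

    filename = "πρόπυλον.tif"                    # an illustrative Greek file name
    as_bytes = filename.encode("iso-8859-7")     # stored under the Greek character set

    # A system assuming the western European set reads the same bytes as gibberish:
    print(as_bytes.decode("iso-8859-1"))         # 'ðñüðõëïí.tif'

    # A stricter system may refuse the name altogether:
    try:
        as_bytes.decode("ascii")
    except UnicodeDecodeError as err:
        print("cannot interpret file name:", err)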

These and similar problems will continue to plague those computer users who must use extended Latin character sets, non-Latin character sets, or specialized fonts until Unicode has become the universal standard and all files conform to it.

-- Susan C. Jones

To send comments or questions to the author, please see our email contacts page.

References:

Many of the web pages cited in the notes accompanying this article are sponsored by the Unicode Consortium and will provide the most up-to-the-moment information available. A primary source on Unicode remains The Unicode Standard, Version 3.0, published by Addison Wesley Longman, 2000. Version 3.1.1 is too new to be available anywhere except on the web.


