Vol. XV, No. 2
CSA Newsletter
Fall, 2002

Naming Files in a Multi-Script World

Harrison Eiteljorg, II

If a digital file is transferred over the Internet from a scholar in, for instance, Moscow, to another in Spain, how does the Spanish scholar know what character set to expect? Quite simply, he or she cannot know unless the Russian scholar explicitly identifies the character set used in the file. That is, there is no file extension or other open and visible indicator of the character set used in any file. Some varieties of software may recognize the character set used because of information contained in the header, but the user has no way to determine the character set from the external attributes of the file.
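The ambiguity can be made concrete with a minimal sketch. The sample word, the helper name decode_under, and the choice of three character sets are assumptions made for illustration, not anything drawn from a real exchange of files. The same three bytes are read under three character sets, and nothing in the bytes themselves says which reading is intended.

```python
# A short Cyrillic word, written with escapes so this source file stays ASCII.
# The word itself is a hypothetical example.
original = "\u043c\u0438\u0440"

# The bytes as the sender might store them, using ISO 8859-5 (Cyrillic).
raw = original.encode("iso8859_5")


def decode_under(raw_bytes, charsets):
    """Decode the same bytes under each named character set."""
    return {name: raw_bytes.decode(name, errors="replace") for name in charsets}


readings = decode_under(raw, ["iso8859_5", "iso8859_7", "latin_1"])

# Print code points rather than glyphs so the output does not depend on
# the terminal's own character set.
for name, text in readings.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

Only the first reading recovers the original word. The other two are equally valid decodings of the same bytes, which is why the recipient needs to be told the character set.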

One might expect that anyone using an ISO standard would name a file in the same character set used for its contents, so that the name would imply the internal character set. That is not the case. All the ISO standards permit Roman characters and those of another script to be used at the same time, so a name written in Roman characters says nothing about the contents; and, of course, many file names in any specific language or script do not point to a single character set. (Consider the file name ATBEIX.DOC; the characters in the file name proper, ATBEIX, exist in Roman, Greek, Cyrillic, and so on. The extension, DOC, is supplied by Microsoft Word, not the user, so it is not instructive, though the use of those characters might otherwise imply a different set of possible scripts.)
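The ATBEIX example can be sketched in code. This sketch uses Unicode code points and Python's unicodedata module simply because they make the distinction easy to display; the dictionary LOOKALIKES and the helper scripts_used are hypothetical names introduced here. The three names would look essentially identical on screen, yet they are built from different characters.

```python
import unicodedata

# Three spellings of the same visible name, one per script.
# Escapes are used so the difference is visible in the source.
LOOKALIKES = {
    "Roman": "ATBEIX",
    "Greek": "\u0391\u03a4\u0392\u0395\u0399\u03a7",
    "Cyrillic": "\u0410\u0422\u0412\u0415\u0406\u0425",
}


def scripts_used(text):
    """Return the first word of each character's Unicode name,
    which for these letters is the script (LATIN, GREEK, CYRILLIC)."""
    return {unicodedata.name(ch).split()[0] for ch in text}


for label, name in LOOKALIKES.items():
    code_points = [f"U+{ord(ch):04X}" for ch in name]
    print(label, sorted(scripts_used(name)), code_points)
```

A reader looking only at the rendered name cannot tell which of the three was typed, so the name alone cannot reveal the character set of the contents.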

Unicode might seem to eliminate this problem, but there are two reasons to suspect that it will not. First, Unicode itself is evolving, and fonts that include all the necessary characters are not widely available. Therefore, a user of a Unicode file may not actually know what font is necessary to display the document, and, without the right font, the document could be useless.

Second and more important, the assumption that Unicode will be the final step in the development of character sets is one based more on optimism than on a dispassionate examination of the past. Other ways of encoding characters will surely arise and replace Unicode. How then will a user distinguish between them?

A solution that comes easily to mind is an extra extension, perhaps a file name convention such as File-name.character-set.type. Thus a file named "newsletter" in the HTML format and using the ISO 8859-7 character set might be named newsletter.i07.html, and the similar file with a Unicode character set might be named newsletter.uni.html. Since this may not be an acceptable naming format in Windows, perhaps a better suggestion would be File-name_character-set.type, yielding names such as newsletter_i07.html and newsletter_uni.html. In such a naming system the underscore character would need to be reserved for separating the file name from the character set.
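A minimal sketch of the underscore form of this convention follows. The function names build_name and parse_name are hypothetical, and the table mapping the tags i07 and uni to Python codec names is an assumption made for illustration; in particular, reading "uni" as UTF-8 is a guess, since the convention as described names only "a Unicode character set."

```python
# Hypothetical mapping from the article's example tags to Python codec names.
# Treating "uni" as UTF-8 is an assumption, not part of the proposal.
CHARSET_TAGS = {
    "i07": "iso8859_7",
    "uni": "utf_8",
}


def build_name(stem, charset_tag, file_type):
    """Build a name of the form File-name_character-set.type."""
    if "_" in stem:
        # The underscore is reserved as the separator, so it may not
        # appear in the file name proper.
        raise ValueError("underscore is reserved for the character-set tag")
    return f"{stem}_{charset_tag}.{file_type}"


def parse_name(filename):
    """Split a name into (file name, character-set tag, type)."""
    base, dot, file_type = filename.rpartition(".")
    stem, underscore, charset_tag = base.rpartition("_")
    if not dot or not underscore:
        raise ValueError(f"no character-set tag found in {filename!r}")
    return stem, charset_tag, file_type


print(build_name("newsletter", "i07", "html"))
stem, tag, file_type = parse_name("newsletter_uni.html")
print(stem, tag, file_type, CHARSET_TAGS.get(tag, "unknown"))
```

Because the tag is taken from the last underscore before the extension, software or a human reader can recover the character set from the name alone, provided the underscore is used for nothing else.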

Computer users tend to be short-sighted, concentrating on the very pressing needs of today, not the demands of a decade or two from now when specific digital files may be quite strange to their new users -- and when those new users may need all the information they can find in order to use those files. Those who are concerned about the longevity of digital files -- among whom should be counted all scholars -- need to take that longer-term view and consider whether character sets should be specified in file names.

-- Harrison Eiteljorg, II
