Vol. XV, No. 1
Spring, 2002

The Importance of Standard File Formats

There have been many comments in the CSA Newsletter about the importance of using non-proprietary file formats. It is crucial to the long-term survival of important digital documents that those documents be saved in file formats that can be migrated into new formats when necessary. Proprietary formats can often be migrated, but those migrations depend upon the cooperation of the software producer, and files that remain in proprietary formats may be inaccessible once the software that created them has disappeared from the market. Migrations from standard formats, on the other hand, can be carried out without the cooperation of any particular vendor, since the formats are publicly defined.

The need to choose non-proprietary file formats can easily be mistaken for a recommendation to use, or to avoid, specific software, but that is not necessarily so. For instance, the widespread use of Microsoft Word® is not inconsistent with the need to store text in standard file formats. Users of Word -- CSA personnel among them -- need only save their files in the RTF format instead of Word's proprietary DOC format to satisfy the need for standards-based file formats, and they give up none of the benefits of using Word. In fact, one may even save in the DOC format until a file is complete and then switch to the RTF format for the final version. The point is simply to produce the final document, the one to be preserved for posterity, in a standard format.

There are also times when more careful control of how a document is stored can be important, and it is not difficult to exert that control, sometimes even within a proprietary format. A simple example may be helpful. Microsoft Word has a feature that "corrects" certain things Microsoft thinks the user wants to correct -- replacing "-" or "--" with "—" (the em-dash) for instance. Since the em-dash is not a member of the standard ASCII character set, however, some may prefer to turn off the so-called "auto-correct" feature so that the document created retains "-" or "--" and consequently does not contain a non-ASCII character. Similarly, some users will want to avoid "curly quotes." Neither of these special characters is particularly worrisome, but avoiding their use does reduce the number of problems that may occur as text is moved from a word-processing document into an HTML document or an email message, since many browsers for HTML documents and many email programs will not recognize the em-dash or "curly quotes."
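
For text that already contains such characters, the substitution can also be made after the fact. The following is only a minimal sketch in Python, not a feature of Word or of any other program: the function names and the replacement table are assumptions made for illustration, and the table covers only the dashes and quotation marks discussed here.

```python
# Hypothetical replacement table: each "auto-corrected" character is mapped
# back to a plain ASCII equivalent. Extend it as needed; it is not exhaustive.
ASCII_REPLACEMENTS = {
    "\u2014": "--",  # em-dash
    "\u2013": "-",   # en-dash
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote (also used as an apostrophe)
}


def to_plain_ascii(text):
    """Replace the special characters listed above with ASCII equivalents."""
    for special, plain in ASCII_REPLACEMENTS.items():
        text = text.replace(special, plain)
    return text


def non_ascii_characters(text):
    """Return the set of characters that still fall outside 7-bit ASCII."""
    return {ch for ch in text if ord(ch) > 127}


sample = "\u201cStandard formats\u201d last \u2014 proprietary ones don\u2019t."
cleaned = to_plain_ascii(sample)
print(cleaned)
print(non_ascii_characters(cleaned))  # anything listed here needs attention
```

Checking for leftover non-ASCII characters after the substitution is the useful part: it reports anything the table does not cover rather than silently passing it along to an HTML page or an email message.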

Carefully managing the content of files can help with future migration when the use of proprietary file formats is unavoidable. Even when the format cannot be controlled, users who control what is stored can make the migration process easier and more accurate. For instance, the last issue of the Newsletter discussed a system to link CAD files to database files in ways that would survive migration from any proprietary CAD format (Harrison Eiteljorg, II, "Linking Text and Data to CAD Models," CSA Newsletter, Vol. XIV, no. 3, Winter 2002, http://csanet.org/newsletter/winter02/nlw0201.html). The process described there makes the linkage between CAD and database files reside in explicit CAD data, not in hidden, inaccessible parts of the file. As a result, migration from one CAD format to another preserves the linkage to database files.

Similar precautions may be used when designing databases. For instance, some database management systems permit the use of something called a repeating field, a single field that holds several values for one record. That particular feature can be very handy, but it does not permit the information to be migrated properly into formats used by other database management systems. That does not mean that one should avoid the programs that allow repeating fields, only that, when using such a program, one should not use the repeating fields themselves. Exporting to standard formats then remains a possibility, since most common database management systems will permit export of tables from their proprietary format to one of the standard formats.
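
The usual alternative to a repeating field is a second table with one row per repeated value. The sketch below, in Python, illustrates the idea under stated assumptions: the sample records, the field names (object_id, name, materials), and both helper functions are hypothetical, and CSV stands in for whichever standard export format is actually chosen.

```python
import csv
import io

# Hypothetical records as they might look with a repeating "materials" field.
records = [
    {"object_id": 1, "name": "lamp", "materials": ["clay", "slip"]},
    {"object_id": 2, "name": "fibula", "materials": ["bronze"]},
]


def split_repeating_field(records, key, field):
    """Return (parent_rows, child_rows), with one child row per repeated value."""
    parents, children = [], []
    for record in records:
        # The parent row keeps every field except the repeating one.
        parents.append({k: v for k, v in record.items() if k != field})
        # Each repeated value becomes its own row, tied back by the key.
        for value in record[field]:
            children.append({key: record[key], field: value})
    return parents, children


def to_csv(rows):
    """Serialize a list of uniform dictionaries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


objects, materials = split_repeating_field(records, "object_id", "materials")
print(to_csv(objects))
print(to_csv(materials))
```

Both resulting tables are flat, so each can be exported and re-imported by other database software without special handling; the shared object_id column carries the relationship that the repeating field had kept inside a single record.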

Similarly, there are database systems that permit the storing of images in the database files. Moving such database files to any other database format is problematic, however: in most cases the migration process will not work, and the images will have to be omitted and treated separately. It is quite possible, though, to use the software, omit the images from the data files, include pointers to them in the data tables, and end up with data that can be exported to other formats as required. Equally important, the data files and the image files can then each be migrated when appropriate for that kind of file alone, not when required by the other.
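
A pointer in this sense can be as simple as a text column holding the path to the image file. The following Python sketch uses the sqlite3 module from the standard library purely as a stand-in for whatever database software is in use; the table, the column names, the image paths, and the export_table helper are all hypothetical.

```python
import csv
import io
import sqlite3

# An in-memory database stands in for the real data file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    "object_id INTEGER PRIMARY KEY, name TEXT, image_path TEXT)"
)
# The images stay on disk as ordinary files; the table stores only
# their (hypothetical) relative paths.
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [
        (1, "lamp", "images/obj0001.tif"),
        (2, "fibula", "images/obj0002.tif"),
    ],
)


def export_table(conn, table):
    """Export one table as CSV text, header row first.

    The table name is inserted directly into the SQL statement, so this
    helper should be given only trusted, known table names.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


exported = export_table(conn, "objects")
print(exported)
```

Because the table holds nothing but text and numbers, the export succeeds without special treatment, and the image files can be converted to a newer image format later, provided the stored paths are updated to match.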

There is also a practical, short-term reason for avoiding special features that make file format migration difficult. As one ties oneself more and more closely to a specific, proprietary file format, it becomes progressively more difficult to change from the software package that uses that format, even when there are compelling reasons to do so. That, after all, is the intention of the vendor -- to tie users into the product so tightly that they cannot escape. Therefore, all should be mindful of the hidden costs of using software features that offer little real value. The result may be more difficult long-term storage of data, and the user may be locked into the software long after a more appropriate program has been found.

The long-term utility of data files remains the major issue for scholars. As important data files are created, their creators must take pains to be sure the information will be accessible for as long as necessary. Using standard file formats is the best guarantee of that longevity. When proprietary formats are unavoidable, avoiding proprietary features that complicate data migration is the next-best guarantee.
