Who cares about file formats? All scholars should.
Harrison Eiteljorg, II
This discussion must begin with some comments about software marketing, which is an odd topic for this place but a necessary one. Since new programs regularly come to market, manufacturers of existing software need to try to prevent users from switching to new competitors. There are many ways to keep that from happening; the most obvious way, of course, is to keep making the old software better. However, there are other effective ways to keep people from switching. One is to use proprietary data formats so that a user cannot change to a new program without risking the loss of old data or access to it. It is actually very difficult to do that effectively because most file formats can be parsed, even if they are proprietary. Users, however, do not want to take the chance of losing data and may have to go to considerable trouble to change the data format; so they are often intimidated by proprietary formats. In addition, there are some formats that are not used commonly enough to have available translators, and some programs are so important to a specific task that possible complications and data loss cannot be risked. As a result, software manufacturers continue to use proprietary data formats as a way to keep users from switching to competitive programs.
I have often discussed the problems caused by proprietary formats of digital files because of the impact on archival storage and/or transferring data from one user/computer/program to another. I have not, however, discussed the simpler, more practical problems that arise from using proprietary file formats for common, every-day work. I had not tried to make a case that file formats for every-day computing really matter until, almost by accident, I slipped a comment into a paper delivered at the last SAA meetings in Denver. In that paper I commented that nobody should use the DOC format that is Microsoft's proprietary format for Word. In the context of that talk, I did not take time to explain or justify the comment, but a longer and more careful discussion of the issue, not only file formats for word processing documents but also file formats for other common scholarly computer files -- not database or CAD files, but simple email, images, and other routine files -- seems not only appropriate but overdue.
Scholars are more likely than most computer users to need long-term access to the materials they produce on a computer and, at the same time, less likely to have rules for how they use their computers, what software they choose, how they store their files, or even what computers they select. Lawyers provide a good counter-example. They are at least as likely as scholars to need their old documents, but most law firms have definite policies for how documents are created, labeled, and stored to ensure long-term access. They also have staff members paid to worry about long-term access. Similarly, the computers and software used in a law firm are controlled by the firm, not the user; so file formats used and the long-term retention of files can be guaranteed by staff members for whom those issues are considered paramount.
Scholars operate far more independently, although those working cooperatively on a project may be somewhat constrained. Most scholars use computers and software of their own choosing, starting with the operating system and including not only word processors but email programs, GIS software, CAD programs, and image editors. As a result, the work of any given scholar is very unpredictable in terms of computers used, software applied, and file formats written. I am not so foolish or naïve as to suggest that scholars should be more constrained. I am, however, just idealistic enough to suggest that individual scholars should consider their computer use and make some careful, conscious choices about ultimate file formats to be written for any task. Whether I write with a ball-point pen or an ink pen does not matter; the results can be read by anyone (maybe not, if I am writing quickly, but the choice of pen will not be the issue). The file format in which this article is stored, on the other hand, will limit the readers to a more or less restricted group, depending on my choice. Furthermore, a poor choice may doom my work to be of value for only a very short time span. As a scholar, therefore, I have a very serious need to make good choices of file formats if I care about how long the fruits of my labor will last and how widely they may be used.
The simplest and most ubiquitous example of stored files is email files. Scholars do so much communicating by email today -- and therefore need to be able to access old email messages -- that email is an especially important category of data. In my own case, I now have few letters on paper filed away for future access but thousands in my email system. Despite the importance of email, most scholars pay little or no attention to the email programs they use and the resulting nature of the files retained.
Widely used email clients1 include the following: Microsoft Outlook® and Outlook Express® (both for Windows®), Microsoft Entourage® (for MACs), Eudora® (for Windows and MACs), Mozilla/Netscape® (for Windows, MACs, and Linux), and Mail® (for MAC OS X only). There are many other email clients, but these are probably the most widely used.
Outlook and Entourage include personal information management along with email functions (address books, calendars, and notes), and the programs store all the information -- email messages, addresses, appointments, and notes -- in a single, complex data file. The file formats of those two programs from Microsoft are proprietary and, despite coming from the same manufacturer, different from one another. Outlook Express does not include other information but uses yet another file format, which is also proprietary. The programs strip out attachments and store them separately.
Eudora stores email messages in a simple text stream, with specific characters that indicate the end of one message and the beginning of another. The format is an old and established one that is not proprietary. Eudora strips out attachments and saves them separately. The MAC OS X Mail program seems to do essentially the same thing, storing email in text streams and stripping out the attachments.
Mozilla and Netscape (which are virtually identical under the skin) are similar to Eudora and Mail in that they use the same text format, but they do not strip out attachments. Instead, the attachments are left in the text stream. The user may save the attachments to a file, but there seems to be no way to remove the attachment and save the email message without it. (Eudora, OS X Mail, Netscape, and Mozilla include address books in their email systems, but the information for the address books is stored in separate files.)
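To make the point about attachments concrete, here is a minimal sketch in Python (my choice of language, not something any of these programs uses). It assumes only the standard library's email package; the addresses, the file name plan.csv, and the function names are hypothetical. It builds a message with a small attachment, shows that the attachment travels inside the message text as encoded characters, and then recovers it from that text.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_sample_message() -> str:
    """Return the full text of a message carrying one small attachment."""
    msg = EmailMessage()
    msg["From"] = "author@example.org"      # hypothetical address
    msg["To"] = "reader@example.org"        # hypothetical address
    msg["Subject"] = "Site plan"
    msg.set_content("The plan is attached.")
    # The attachment is encoded (base64) and embedded in the message text.
    msg.add_attachment(
        b"x,y\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="plan.csv",                # hypothetical file name
    )
    return msg.as_string()


def extract_attachments(raw_text: str) -> dict:
    """Map each attachment's file name to its decoded bytes."""
    msg = message_from_string(raw_text, policy=policy.default)
    return {part.get_filename(): part.get_content()
            for part in msg.iter_attachments()}


if __name__ == "__main__":
    raw = build_sample_message()
    print(raw)                              # the attachment appears as text
    print(extract_attachments(raw))
```

Because the attachment is simply part of the text, a program that leaves it in the stream loses nothing; the cost is only a larger mail file.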
The email text file format used by these programs is so simple that it is possible to open an email file with a text editor and edit it. One may even correct errors such as missing dates, senders, or subjects and re-save the file -- to be used further by either a text editor or an email program.
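The openness of such a file can be illustrated with a short sketch. It assumes the file follows the common mbox convention, in which each message begins with a line starting with "From " followed by ordinary headers and a blank line; individual programs may vary the details. The sample messages, the file name Inbox, and the function names are hypothetical. Python's standard mailbox module does the reading.

```python
import mailbox
import os
import tempfile

# Two messages in one plain text stream; each starts with a "From " line.
SAMPLE = (
    "From alice@example.org Mon Jan  6 10:00:00 2003\n"
    "From: alice@example.org\n"
    "Subject: Trench report\n"
    "\n"
    "First message body.\n"
    "\n"
    "From bob@example.org Tue Jan  7 11:30:00 2003\n"
    "From: bob@example.org\n"
    "Subject: Re: Trench report\n"
    "\n"
    "Second message body.\n"
)


def list_subjects(path):
    """Return the Subject header of every message in an mbox-style file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()


def demo():
    """Write the sample stream to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "Inbox")    # hypothetical file name
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE)
        return list_subjects(path)


if __name__ == "__main__":
    print(demo())
```

Nothing here depends on the email program that wrote the file, which is the practical meaning of a non-proprietary format.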
Although Eudora, MAC OS X Mail, and Mozilla/Netscape all use the same basic file format, they name the files differently, and they construct the directory hierarchies differently. Some programs use a file extension (usually MBX or MBOX); others do not. All add some form of index file, but those are remade whenever necessary; so they are not crucial to any use of the mail files.
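Since the names and extensions vary while the contents do not, mail files can be recognized by what is in them. The following is only a rough sketch under one assumption: that a mail file in this format begins with the characters "From ", as in the common mbox convention. The function names and the sample file names (including the index file Inbox.msf) are hypothetical, and a real mail folder may contain files this simple test misjudges.

```python
import os
import tempfile


def looks_like_mbox(path):
    """Heuristic: an mbox-style file starts with the five bytes 'From '."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"From "


def find_mail_files(root):
    """List files under root that look like mail files, whatever their names."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if looks_like_mbox(full):
                found.append(os.path.relpath(full, root))
    return sorted(found)


def demo():
    """Create a small mail folder with differently named files and scan it."""
    message = "From alice@example.org Mon Jan  6 10:00:00 2003\nSubject: Hi\n\nBody\n"
    with tempfile.TemporaryDirectory() as folder:
        samples = {
            "Inbox": message,               # no extension
            "Sent.mbx": message,            # MBX extension
            "Inbox.msf": "index data\n",    # index file, not a mail file
        }
        for name, text in samples.items():
            with open(os.path.join(folder, name), "w", newline="\n") as handle:
                handle.write(text)
        return find_mail_files(folder)


if __name__ == "__main__":
    print(demo())
```

The index file is skipped, which does no harm because, as noted, the programs rebuild their indexes whenever necessary.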
All these efforts to make one file format or one directory structure different from another stem less from any true data requirement than from a need to try to make it seem to users that they must continue to use the program they have been using. Otherwise they may lose their email files.
Now that you know more than you want or need to know about email file formats, the natural question is, "What difference does it make?" The answer is none -- until you want to change your email client or your computer operating system (which may require a change in email client). Then it matters enormously. A new email client must be able to import the email from your current file(s) if you want to be able to access old email along with new email. In fact, many email programs provide import routines to bring email into the new file from the old one (but usually not export routines); they do so, of course, to make it easier for you to try switching to their programs. Were they to provide export programs as well, it would be too easy to switch away.
I have changed email systems several times, using Eudora, Mozilla, OS X Mail, Evolution (under Linux), PowerMail® (on the MAC), Entourage, and Outlook. The file transfer processes have not been painless and have sometimes been very time-consuming, but I seem to have lost only things like individual message dates, not the messages themselves. Of the programs I have tried, I would only reject Outlook out of hand -- because of its susceptibility to viruses, not the functioning of the program, and I would consider each of the others to have specific strengths and weaknesses. I am now using Mozilla -- not for the program's capabilities but for the simplicity and open nature of its file structure and because it runs on all the machines I use -- MACs, PCs, and Linux machines. If I should choose to change computers, I can simply take the entire email database directly from the MAC on which it rests to the CSA Linux machine and start working there. I assume I could do the same with a Windows machine, though I have not tried to do so. That file structure would remain the same for Netscape, but differences in file extensions and/or file names and/or directory structure would require other changes to move the files into the proper form for Eudora or OS X Mail (or for the Linux email program called Evolution that I think is a very good program). The files themselves would not need modification. An import routine and a complete change in file structure would be required to move to Entourage or Outlook. In addition, Entourage and Outlook files cannot be read at all without the programs, whereas the files based on the simple text stream can be read with any simple text editor.
The foregoing is not meant to be a recommendation. Rather, I hope you will agree that the file format -- unrelated as it may be to actual daily use -- really may be an important matter for scholarly computer users. It may even be more important than the features that are normally used to rate and compare programs, because file structure has a major impact on one's ability to use email files over time and with new, different programs -- even with text editors. I certainly have important email messages that are now nearly a decade old, and I think it is an absolute certainty that I will change email clients or operating systems or both in the next decade. When I do, I will want all those email files; so the format in which they live now is very important, not for today but for tomorrow.
If email is important as a record of correspondence, how important are word-processing documents? The answer, of course, depends on the individual document. Some notes or preliminary paper drafts that were discarded may not be worth saving at all. Other files, texts of oral presentations or project reports to supporting institutions for instance, may be crucial documents that need to be preserved but do not exist on paper in your office. For files such as those, file format choices may again be important; indeed, my comment from the SAA meeting may be repeated here: ". . . nobody, to repeat, nobody should be using Microsoft Word®'s native file format for word-processing documents (DOC)." Please note that this recommendation does not include a recommendation not to use Word. Quite the contrary, Word is the standard for word processing today, and most scholars use it. However, it is possible to store documents in the non-proprietary format called RTF when using Word or virtually any other word-processing program.2 If you save documents in the RTF format, you do not then need to worry that a recipient of the document will be unable to read it -- regardless of the program he or she is using -- or that it will be useless any time in the foreseeable future. That could not be said of a DOC file.
A similar circumstance exists in the graphics world. Users of Adobe Photoshop® generally store their images in the proprietary format for Photoshop (PSD) because of its important features. However, any file of importance should ultimately be stored in a non-proprietary format such as TIFF. There is another wrinkle for graphics files -- compression formats such as JPEG.3 While such a compressed format may be an appropriate format to use for files to be kept at hand, it is not a good format to use for long-term storage. Compression, at best, removes the user one full step further from the original file and imposes an unnecessary extra step in the process leading from digital file to screen representation thereof. In the case of graphics files, lossy compression formats such as JPEG also cause some loss of image quality, a cost that can only be borne when best quality is not an issue.
It is possible to make some general observations about digital file formats generated by scholars. Anything important should be stored, uncompressed, in the most generic and non-proprietary format available. That is not to say that functionality should be sacrificed, but only very important benefits in functionality should be permitted to override the need for non-proprietary file formats. If in doubt, try saving a file in both an open and a proprietary format and open each to see whether there is any loss of function and, if so, how important that loss is. For instance, one might compare the DOC and the RTF formats for a text file. Similarly, one might try saving a spreadsheet as a SYLK (non-proprietary) file rather than an XLS (Excel) file and check to see if there are losses. For graphics, one might compare PSD and TIFF. Similar dual-file approaches can be used in many cases, and it may often be appropriate to work in a proprietary format but then to save the final version in a non-proprietary one. The point is simply to make certain that files you want to keep will be useful for as long as that is necessary.
When it comes to email, the choice is simpler because the differences in software features and capabilities are so small. It may not be necessary or desirable to switch programs today, but anyone who uses email for serious matters should be thinking carefully about the importance of the file formats used for email messages and choosing software accordingly in the future.
In all categories the issue is not what computer, operating system, or program is in use. (Relevant issues about operating systems are discussed in another article in this issue of the Newsletter; see http://csanet.org/newsletter/wunter03/nlw0305.html). The issue is not the programs in use but the file formats relied upon. In the right formats, the files will retain their utility well into the future. In the wrong format, they may become useless or, more likely, be useful only after considerable time and trouble have been squandered to make them so.
-- Harrison Eiteljorg, II
1. Email programs on personal computers are properly called email clients; an email client is used on a personal computer when one accesses a server to get email and bring it onto the personal computer; hence the terms client and server. For email there are servers, generally in the college or university computing center in an academic environment, that receive and hold the email until a client accesses the server to download messages and remove them from the server.
2. RTF is actually a Microsoft format, but it is a public one with fully specified format instructions available to anyone. As a result, it is safe to use now and should remain so until/unless Microsoft decides both to change it and to make it proprietary.
3. Compression can be used for other formats as well, though most users do not think about using compression on non-graphics files. The same comments apply.
Table of Contents for the Winter, 2003 issue of the CSA Newsletter (Vol. XV, no. 3)