CSA Newsletter, January 2009: What Data Will Be in Use Tomorrow?

What Data Will Be in Use Tomorrow?

Harrison Eiteljorg, II
(See email contacts page for the author's email address.)

Given my experience with data archiving, articles about archival data storage naturally attract my interest; so a link to an article about photographs stored online by professional photographers immediately caught my attention: "Will you lose pictures stored online if your photo site goes bust?" (posted on ZDNet by Janice Chen on November 11, 2008 and last accessed on January 6, 2009). Although the particular issue -- the unexpected and abrupt closure of a commercial archive for professional photographers' digital photos -- is not directly relevant to archaeological archiving, the article led, by a very circuitous route, to some thoughts about archiving that are important to scholars.

For a time after seeing the online note I reconsidered some of the major issues, and I was pleased to find an excellent publication about archival preservation that has recently appeared on the web: "Sustaining the Digital Investment" (interim report of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, December 2008 at brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf). This publication goes into considerable depth to discuss the kinds of materials deserving archival preservation and the economic and business models that may apply. In effect, it asks all those questions that need to be asked, taking little for granted.

However, even this excellent publication, not to mention most typical discussions of archival preservation including much of what I have written about archiving, does not focus on one of the most important matters involved in archival preservation: version control. The term version control refers to the duty of anyone responsible for a data set to track the development of that data set (no matter the kind of data) so that a specific version is always properly designated as the current and authoritative one, with past versions identified so as not to be confused with the current one. The term is most commonly applied to data sets during their creation, since the point, of course, is to be sure project personnel are always using the authentic version of the data set when entering or accessing data. Confusion leading to the use of the wrong data set during a data-entry session could yield multiple versions of the data, none of which is current in every respect.

While the term version control chiefly concerns making certain that the "master" data set is the one in use on a project at all times, it is also applicable to data that have been finalized and passed on to an archaeological archive. In fact, version control can be just as important when data have become public and widely used. That is the case because, once a given data set has become widely used, there may be many versions in circulation, each the result of a different scholar's re-purposing of the data for his/her own research needs and with his/her own additions, deletions, and alterations. Unless there is a reliable source from which to obtain the authentic, original data set, there can be no confidence that the one you have obtained is the one produced by the original scholars. Thus, an archaeological archive provides the ultimate version control after data have become public and widely used; the archive provides the authentic, original data set for any potential user. This aspect of archival storage has, in my view, been overlooked too often, and the issue is far too important to remain in the background.
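The role of the archive as the reference point can be illustrated with a small sketch in Python. It assumes, purely for illustration, that an archive publishes a SHA-256 digest of the deposited file; nothing above says that any particular archive does so, and the function names and sample contents here are hypothetical. A copy obtained elsewhere is accepted as authentic only if its digest matches the archive's.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


def is_authentic(copy: bytes, archive_digest: str) -> bool:
    """True only if the copy is byte-for-byte identical to the archived original."""
    return sha256_digest(copy) == archive_digest


# Hypothetical stand-ins for file contents; a real check would read the files from disk.
archived_original = b"older propylon model, definitive version"
altered_copy = b"older propylon model, definitive version, with additions"

# Assumption: the archive makes this digest available alongside the deposited file.
published_digest = sha256_digest(archived_original)

print(is_authentic(archived_original, published_digest))
print(is_authentic(altered_copy, published_digest))
```

A check of this kind shows only that a copy is identical to the deposited file; it says nothing about who altered a non-matching copy or why, which is one more reason the authoritative source matters.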

As someone with archived material, I am particularly concerned about this problem of version control. I have willingly, even eagerly, made the material about the older propylon available to other scholars who may be interested in the older entrance to the Acropolis. I am eager to have that information completely in the public domain. At the same time, I am very definitely not eager to have any changes -- additions, deletions, or alterations -- introduced to the files and passed off as reflecting my work. (To be quite clear, I have absolutely no problem with another scholar taking my data and modifying it so long as he/she makes it clear that there have been modifications, what they are, and who is responsible.) Even though this particular entrance structure may not be such a hot topic that I should worry about the mis-use of the files, I am determined that anyone interested in the entrance structure be able to obtain the fruits of my best efforts to study, document, and explain it. I believe that depositing the files in an active, self-conscious archaeological archive is the only way to guarantee that. The archive chosen -- the Archaeological Research Institute at Arizona State University -- provides that kind of self-conscious care: "At ARI, the data and artifacts are curated in perpetuity, and are accessible for research, publication, exhibition, education, and other related purposes." (See "ARI as an Archaeological Repository," at archaeology.asu.edu/about_ari/repository.htm; last accessed 21 January 2009.) In addition, it is an archaeological archive; so, when data migration becomes necessary, the personnel will know how to deal with the files better than would a business-related archive.

Some have suggested that having data available on the web will ultimately create many copies and that, as a result, there need be no self-appointed archival repository. The existence of many copies will provide insurance against loss. However, I am not only concerned about loss. I am also concerned about the equivalence of what remains to what I originally created. An "automatic" process does not guarantee that; indeed, it seems to guarantee the opposite -- an unknown number of competing versions, none of which can be certainly related to the original.
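A brief Python sketch may make the difficulty concrete. The site names and file contents are invented for illustration. Grouping the circulating copies by a SHA-256 digest of their contents reveals how many distinct versions exist, but nothing in the result marks any one of them as the original.

```python
import hashlib
from collections import defaultdict


def group_by_content(copies: dict[str, bytes]) -> dict[str, list[str]]:
    """Group sources by the SHA-256 digest of the content each one holds."""
    groups = defaultdict(list)
    for source, data in copies.items():
        groups[hashlib.sha256(data).hexdigest()].append(source)
    return dict(groups)


# Hypothetical copies found on the web; all names and contents are made up.
circulating = {
    "site_a": b"model, 1999 version",
    "site_b": b"model, 1999 version",
    "site_c": b"model, 1999 version, with another scholar's additions",
    "site_d": b"model, final deposited version",
}

versions = group_by_content(circulating)
for digest, sources in versions.items():
    # Each group is one distinct version; no group is labeled as the original.
    print(digest[:12], sources)
```

Counting copies does not help either: in this invented example the most widely copied version is the outdated one.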

Others have suggested that the Internet Archive at www.archive.org provides a painless version of archival preservation by automatically backing up materials that are placed on the web. However, not only does the Internet Archive fail to perform necessary data migration (so that, for instance, a user may obtain my AutoCAD file in the version required for his/her use 20 years from now), it also preserves multiple undifferentiated copies of the files. In fact, as an experiment, I searched the Internet Archive for the model of the older propylon that was available on the CSA web site before being permanently archived at the ARI. I found a copy of the file that was on our web site (a zip file that contained both a text file with information about the AutoCAD file and the AutoCAD file itself). The copy was a 1999 version of the AutoCAD file; the text file was of the same vintage. I do not want a user to access the older propylon data with those files. They are not the most up-to-date, accurate, and complete versions of either the AutoCAD file or the accompanying information. The ARI archives have both the definitive AutoCAD file and the most thorough information about it.

So a thought process that began with a sorry tale of woe for professional photographers moved me back to one of the most important but least understood reasons for data archiving. For any scholarly data, there should be one definitive source for anyone who wants access. Fewer than one, of course, means that data are lost; more than one, however, may have an equivalent result in the sense that the original data set may not be discernible, and the original scholarship may be thoroughly obscured.

Writing a piece such as this concerns me since it can make me seem unyielding and somehow stuck in a time and place that have been passed by. Those hard at work on web resources today do not seem to attend to these kinds of concerns. As a consequence, these warnings, which I consider important, may be ignored as outmoded. Therefore, I wish to close with a plea to any reader who wishes to dispute the foregoing. If you see ways to provide certain preservation of archival data -- preservation that provides certain access to authentic, original data sets -- without self-conscious archival repositories, please help me and our readers by sharing your views.

-- Harrison Eiteljorg, II

An index by subject for all CSA Newsletter issues may be found at csanet.org/newsletter/nlxref.html; included there are listings for articles concerning the use of electronic media in the humanities and the Archaeological Data Archive Project and issues surrounding digital archiving.

Table of Contents for the January, 2009, issue of the CSA Newsletter (Vol. XXI, no. 3)
