The Archaeological Data Archive Project

Harrison Eiteljorg, II

Important notice: The CSA web site was re-designed in August of 2010. Some documents then available were out of date; so they were not included in the re-design and were not updated. This is one of those documents. Information about dates of posting and revision remains here, but there will be no revision of any kind after August, 2010.

Pages concerning CSA archival activities may seem to imply that CSA continues to do archival work. However, all archival projects under CSA's oversight have been terminated. Scholars interested in archiving digital files may wish to contact the Archaeology Data Service in York, England; the Archaeological Research Institute at Arizona State University; or the Digital Archaeological Record.

The quantity of archaeological information now in computer form is large and growing at an accelerating pace. Furthermore, much of that information can only be fully expressed and understood in its computer form. CAD models, for instance, simply cannot be put on paper and made equal to their electronic originals. There is no paper analog to the CAD file; there are only drawings that represent individual snapshots of the infinitely more complex whole.

Much of the archaeological data on computers, however, is on old punch cards, on tapes or disks attached to university main frames, on floppy disks in drawers, and on hard or floppy disks of machines which are obsolete. Those files may have been lost or forgotten; they may be on media no longer accessible. The magnetic tracks themselves may even have weakened so much that the data can no longer be read.

Of course, there are also many records on disks that are still actively used. Indeed, many archaeologists are gathering data into computer form every day, both excavation data and files resulting from scholarship in the laboratory or office. But it is only a matter of time - perhaps a decade but probably less - until those files join the others in limbo unless care is taken to maintain them. Data files simply cannot be left unattended if they are to continue to be useful. Yet those data will be missed; either they will be gone forever, in the case of original excavation information, or they will be gathered again when another scholar needs the same information.

There is also a problem with data quantity, particularly for excavation data. It is one thing to preserve catalog files and object registries. It is quite another to preserve all the files regarding conservation and preservation, lists of terms used, and the like for a major excavation.

Whether the data in question are original excavation records or collections/compilations resulting from scholarship in the laboratory or office, they have three things in common. One, they have been created at some cost - in money, in time, and in effort; two, they all have potential value to the archaeological community; and three, their value is as dependent upon their availability as their quality. Records that have been created and preserved are of value only to those who have access to their contents. These valuable records must be preserved, and they must be made accessible. The first task, preservation, is necessary in order that the second one, accessibility, can be attempted, and accessibility is required if we are each to be able to benefit from the labors of our intellectual forebears. [Note: Of course, not all scholars agree that we have a duty to make our information available to others. A colleague with whom I debate this point regularly maintains that her data files should not be made available to anyone who happens to have a computer but only to scholars whose reputations are known to her. In other words, she believes that she owns her data. I think most of us here would be uncomfortable arguing that we own our data or that we should be able to deny access because we fear what use may be made of them. But there are many who do believe that they do own - and therefore may assume responsibility for granting or denying access to - the information they have gathered. For those of our colleagues who believe they own their data, this project may seem a threat, but we must persuade them that the benefits to the discipline of open access outweigh the perceived risks.]

There is now an urgent need for safe, secure, long-term storage of the machine-readable data from excavations and from other scholarship of our colleagues. Once stored, the data should be made accessible, but the urgent task is to act now to preserve files before more are lost. To preserve such data and to provide the mechanisms for making them accessible, the Archaeological Data Archive Project (ADAP) was established. Files containing data that are relevant to archaeology will be preserved in the archive, without regard to the cultural, chronological, or geographic area involved and without regard to the type of data (text files, database files, CAD files, GIS files, drawings, illustrations, etc.). Those files will also be made accessible to the public for examination or downloading. The ADAP has been started with a certain sense of urgency at this time in order to prevent the loss of machine-readable archaeological information which is at risk, in particular irreplaceable excavation records. Therefore, there has been an emphasis on archiving such records. However, newly created data files from excavations and from secondary scholarship are equally important, if not in imminent danger, and will be eagerly sought as well.

This is an enormous undertaking, and it will not have an end. There will always be more data to add - whether new finds, new collections of known material, or old information belatedly put into computer format. Its enormity, however, means that we should begin now; otherwise the problems and difficulties will only grow while we ponder. Because we are at the very beginning of the project, this is the time to ask hard questions about the processes and procedures that will be used, to make certain that we are not heading down dead ends, to be sure that the archive will be as useful as the visionaries may hope. Therefore, my purpose here is principally to elicit responses to our current plans so as to refine and improve them - and to make clear where we need more guidance or support at this early stage in the project. So I will begin with a list of areas of concern we have identified; then I will discuss our intentions in each of those areas, with a few digressions, so that our plans may be evaluated and improvements suggested.

* Governance
* Types of data to be accepted
* Media for data input, storage, and distribution
* File formats for data input and distribution
* Other requirements for data to be accepted: data about data
* Checking of the files at the archive
* Physical and virtual archives
* Data integrity
* Data safety and longevity in storage
* Protection of restricted information
* Additions or alterations to existing data
* Spreading the labor
* Archiving of non-computer data
* Data access systems
* Data standards - authority lists, field names, controlled vocabulary, etc.
* Access costs and other funding issues

* Governance - The ADAP will operate as a unit of the Center for the Study of Architecture (CSA). This is a convenience, since CSA has already obtained status as a public charity in the U.S., and the Director of CSA is the Director of the ADAP. An international steering committee will guide the work of the Director and of the ADAP. Members of the committee will be drawn from archaeological organizations, from academic institutions, and from the working groups that will do most of the scholarly work of the archive (see below for a discussion of the working groups). The independence of CSA and the ADAP from academic and national organizations should help to assure scholars that access to the archive will be open and free of parochial interests.

* Types of data to be accepted - The potential data types are almost unlimited. Archaeological data may be in the form of text files, traditional database files, CAD models, GIS tables, drawings, photographs, or spreadsheets. Eventually, we can expect full-motion video, renderings, and a host of data types we have not yet imagined. Although there may be some difficulty with providing ready access to all file types at the present, it is important that the archive be able to store all data of importance for the field. Therefore, all the data types enumerated will be accepted, without reservation, and we will accept all those data types for which there is a reasonable demand.

There are also differences between data collected in the process of excavation and data collected or compiled in the lab or office. Both types of scholarship produce valuable information; both types of files should be accepted. There is, however, an important distinction between raw data gathered by an excavator and collections or compilations of information done away from the field. Once lost, the data files of the excavator cannot be recreated. The files which result from later scholarship can, catastrophes aside, be recreated if necessary. Therefore, when dealing with excavation records, our archival responsibility may outweigh some of those concerns which might weigh more heavily in the evaluation of other submissions. The same sense of urgency may apply to some forms of laboratory work, especially those involving destructive analysis. As noted previously, excavation records are also voluminous, so voluminous that it may eventually seem necessary to discard some parts of the data and retain only the files that include information about objects and excavation units. Ideally, however, all computer records from an excavation should be kept (as all paper records are); a major purpose of the archive is to eliminate any debate about what files to save by providing storage for all the files from an excavation.

* Media for data input, storage, and distribution - Files will be accepted on any media that can be dealt with by the ADAP. Files sent over the Internet, of course, eliminate the media problem altogether. PC disks, Mac disks, and Sun (UNIX) disks or tapes are nearly as easy to deal with. We have had offers of transfer assistance for data on KayPro disks; other CP/M disks can be dealt with similarly. Data on Apple II disks can also be transferred. Data from other disk types, other tape formats, and the like can be accepted only over the Internet at this time. Should there be sufficient demand, however, other media could be supported, and we have had offers of assistance from people familiar with these problems.

The Internet is also the most desirable mechanism for the distribution of files from the archive. However, files may also be distributed on disks. At present, disks would be Mac, PC, or Sun UNIX disks. There are no plans to distribute large numbers of files on CD-ROM, since we will not be attempting to package specific groups of data for sale, and CD-ROMs are not very economical in low volume production. Of course, they are also slow. More important, CD-ROMs are static presentations of data, frozen at the time the data are encoded on the disk. The archive, on the other hand, will be dynamic and open to change.

The accessible files - those actually up on the Internet - will be on tape or disks at the ADAP office. The physical medium chosen for that ephemeral storage is not an important matter except as it affects speed of access.

* File formats for data input and distribution - What file format(s) should be accepted and/or encouraged? What file format(s) should be distributed? There is an inherent conflict between the file format most easily shared (ASCII in many cases) and the file formats most desirable for use (Oracle, Informix, or .dbf, in the case of database systems, for example; .dwg for CAD use, etc.). If data files are archived in their native format, they are more useful - but only for those who have the appropriate software. On the other hand, anyone can use an ASCII file, but the usefulness of the file is severely limited. Since the data are likely to be submitted in many different file types, depending on the preferences of the scholar gathering the data, it seems pointless to try to impose a standard for each data type, but there must be some sense of order. (Note that Unicode or the ISO standard will almost surely replace ASCII; as that happens, there will be required adjustments. But simple character files will remain the easiest to transport and to use.)

Our current requirement for traditional database files is that they be submitted in their native format, comma-delimited ASCII, and .dbf, assuming that the native format is useful on a UNIX machine, a Mac, or a PC. This requirement is based on the assumption that ASCII and .dbf are the most easily created from other formats and that ASCII provides a kind of lowest common denominator while .dbf provides the most common personal computer DBMS format. Of course, if data files are otherwise in jeopardy and translation is impossible, they will be accepted in the native format only. (That might be the case if the files had been created in the past and were no longer being used or maintained.)
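The comma-delimited ASCII requirement can be illustrated with a minimal Python sketch. It is only an illustration, not an ADAP procedure: the function name `to_comma_delimited`, the field names, and the sample record are all hypothetical. The sketch writes a header row plus one row per record and refuses any text that is not plain ASCII.

```python
import csv
import io


def to_comma_delimited(records, field_names):
    """Return the records as comma-delimited ASCII text with a header row.

    `records` is a list of dicts keyed by the (hypothetical) field names.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    text = buffer.getvalue()
    # Raises UnicodeEncodeError if any character falls outside plain ASCII.
    text.encode("ascii")
    return text


# Hypothetical catalog record, for illustration only.
sample = [{"catalog_no": "P-001", "material": "ceramic", "notes": "rim, burnished"}]
print(to_comma_delimited(sample, ["catalog_no", "material", "notes"]))
```

Fields that themselves contain commas are quoted by the `csv` module, so the delimiter stays unambiguous when the file is read back into another database program.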

For CAD files, the native format and .dwg or .dxf will be requested. The .dwg format is AutoCAD's own; .dxf is a standard interchange format used by nearly all CAD programs, as well as by rendering programs. Since the transfer between .dxf and .dwg is very easy, there is no need for both. Drawing files will be accepted in their native formats and, if the native format is unusual, in any of the widely used illustration formats. GIS files (those files which are not simply database files) will be accepted in their native format and ARC/Info format.

Text files will also be accepted in ASCII. Indeed, ASCII will be sufficient for most purposes. If individual authors believe the simple-character version to be inadequate, other formats will be accepted in addition to ASCII (not in place of ASCII). We do not expect to place stringent requirements on secondary text file formats and currently plan to accept WordPerfect, MSWord, and Nota Bene files.

The archival copies will be made in the native format and the "standard" commercial format for that file type (e.g., .dbf, .dwg, and ARC/Info). However, since file formats will certainly pass out of use and be replaced, the archival files will be converted to contemporary formats as necessary (that is, when the original format goes out of use). Archival copies in the original format will only be retained if they can be used by software still in use or there are unique qualities which can only be preserved in the native format. When the software responsible for the original format has passed out of use, the archival copies will be converted to the formats then deemed most useful.

In addition to the file formats actually maintained at the archive, every effort will be made to provide translation facilities for other common file formats.

* Other requirements for data to be accepted: data about data - I will discuss later the important issue of data standards. It is clear that standards will become more and more important if we are to be able to access the data effectively, but it is not possible to apply standards retroactively when they have not even been established. For the time being, therefore, we will not expect incoming data to conform to any arbitrary standards. What we will require is a thorough description of the files in question. For traditional databases, that will mean complete explanations of file structure, relationships between and among files, authority lists, vocabulary, etc. For CAD files, we will require complete lists of layers, with all explanations and distinctions, as well as information about the methods by which the survey information was acquired. Similar requirements will be applied to other file types. In short, we will require that the files be useful, and without such supporting information they could not be. (A scholar wishing to contribute data may write to the archive for a set of description requirements for his/her data files.)

* Checking of the files at the archive - In addition to making archival copies of the files, ADAP personnel will check the files for errors. It is too early in the process to determine the extent of the checking, and that will certainly depend upon the file type. It will also depend upon the importance of the data (measured in the demand for access, I assume), since resources will be limited. The checking process will be noted so that users will know what checks have been performed on any file(s).

* Physical and virtual archives - There is no particular virtue to a single physical repository for data files versus a multitude of different repositories connected by the Internet. Modern systems like gopher and WWW make it easy to locate files from any number of physical sites. However, there are significant practical difficulties with data storage that may make fewer repositories more desirable than many. Not only do magnetic files decay over time, file formats may become obsolete, as may disk formats. Dealing with these problems in a few places, each devoted to the archival tasks, is much easier than dealing with them in various institutions, each of which has many other duties competing for attention and resources.

One large repository should also have the capacity to approach the problems regarding access to disparate file types that would be impossible for many smaller ones. Those practical considerations suggest that we should try to archive at the ADAP as many of the files as possible. Nonetheless, some institutions may prefer to discharge their archival responsibilities within the institution. If so, the data may be effectively included in the archive through the use of pointers of one sort or another. Many governments are also working on archives of archaeological information. The information in those archives need not be duplicated, but those archives generally exist for narrower purposes than the ADAP. Most are not prepared to house the entire data set from an excavation; nor are they interested in all data sets collected by scholars in the course of their work. The ADAP will maintain pointers to such national (or regional or multi-national) archives so that the search and access processes may be simplified. (It seems possible to me that, in the case of excavation data, the national archives may often contain those parts of the excavation records they deem necessary while the ADAP stores the remainder.)

It may be desirable to have multiple versions of the repositories, both to spread the risk, however minimal, of catastrophic loss and to make the data more readily available. Until more data files have been gathered, it is difficult to know whether multiple repositories will be needed, but it seems likely that, eventually, there should be three physical repositories located on different continents at the least.

* Data integrity - The archival duty requires certain procedures and precautions. All files will be archived in two identical copies. File comparison programs will be used to make certain that the two archival copies are identical and that they are identical to the originally contributed file(s). One of the archival copies will be stored on site, and one will be stored elsewhere. A third copy will be made; that will be the only copy available on-line or for copying to a disk. No other copy will be made at the ADAP until the on-line copy needs to be refreshed.
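The comparison step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program actually used at the ADAP: the function names are hypothetical, `filecmp` stands in for whatever file comparison program is chosen, and the SHA-256 checksum is an added assumption that would let the off-site copy be re-checked later without bringing both copies together.

```python
import filecmp
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, onsite_copy, offsite_copy):
    """True only if both archival copies are byte-identical to the original.

    shallow=False forces a comparison of file contents rather than
    trusting size and modification time alone.
    """
    return (filecmp.cmp(original, onsite_copy, shallow=False)
            and filecmp.cmp(original, offsite_copy, shallow=False))
```

If both copies match the contributed file they necessarily match each other, so two comparisons cover all three files.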

In the event that a scholar becomes concerned about the treatment of data taken from the archive, it will be possible for a copy of the archival file to be made for the original contributor. That file could then be used by its contributor as the starting point from which to replicate procedures used by another scholar. This precaution may be especially desirable for those who contribute data and are concerned that they may be misinterpreted, though the best guaranty of proper understanding is a full and accurate explanation of the data in the required descriptive statement.

* Data safety and longevity in storage - Data will be refreshed as necessary while in the archive, but the most stable medium available will be used to minimize the problem of data life. Optical media seem the most stable at the moment, but expert advice will be sought before committing to a specific archival medium.

* Protection of restricted information - Some data items may be inappropriate for access by the public. Geographic coordinates for sites which might be looted, for example, should not be available to all. It may be reasonable to limit access to other data items. Appropriate limits may therefore be placed on the access to such data items. In the case of site locations, for instance, permission of proper authorities or the excavator might be required for access. The access system must permit access to public information without compromising the restricted-access data.
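One way to serve public information without exposing restricted items is to filter each record before it leaves the archive. The Python sketch below is a minimal illustration only; the field names `easting` and `northing`, the function `public_view`, and the simple `authorized` flag are all hypothetical, and a real access system would need a proper permission mechanism.

```python
# Hypothetical names of fields that hold site coordinates.
RESTRICTED_FIELDS = frozenset({"easting", "northing"})


def public_view(record, restricted=RESTRICTED_FIELDS, authorized=False):
    """Return a copy of the record, omitting restricted fields
    unless the requester has been granted permission."""
    if authorized:
        return dict(record)
    return {key: value for key, value in record.items()
            if key not in restricted}


# Hypothetical site record, for illustration only.
site = {"site_name": "Example Site", "period": "Bronze Age",
        "easting": 512345, "northing": 4198765}
print(public_view(site))
```

The stored record is never altered; only the copy handed to the requester differs, which keeps the restricted data intact in the archive.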

* Additions or alterations to existing data - The archive must not be seen as a static repository of dead information. Not only must new information be added as generated, but new interpretations of data in the archive should also be added. For instance, new interpretations of the stratigraphy of a site may be generated long after the excavation has been finished, and those new interpretations should be made a part of the archive. Similarly, alternate reconstructions of structures can be added to CAD models without changing the original files. These kinds of alterations and additions must be possible both without risking the integrity of the original files and in ways that make all relevant information accessible as a unit, as if it were part of a single data set.

* Spreading the labor - This work will require the labor of many. Not only will we need the assistance of computer specialists, but many different archaeological specialties will be involved.

To make the task more manageable, working groups will be established to deal with contributions of information from specific areas. The groups will define their own areas of expertise, and they will be formed over time in a way which prevents overlap but encourages the groups to divide themselves into smaller groups as necessary. The relations between and among the groups may be hierarchical in some cases. (A working group devoted to the prehistoric Aegean is being formed now.)

The working groups will apply their specialized expertise to their areas. Working together and with the ADAP personnel, they will apply their combined expertise to larger areas so that standards will be built from the bottom up, not from the top down.

* Archiving of non-computer data - The ADAP cannot archive non-computer data. It can and will, however, assist in the conversion of information into formats appropriate for the archive. This is, of course, of particular relevance to information from excavations.

* Data access systems - No data will be accepted for the archive unless the contributor has agreed that the data will be made public by a specified date. However, that date may be some reasonable distance in the future so as to permit, for example, publication by an excavator before the data are made available to others through the archive.

Access to the data will, at the outset, be simply access to the files, by anonymous ftp, for instance. The files will be made available over the Internet or on disk. In the long term, however, an access system must be created to permit cross-file searches, extensive use of look-up tables, dictionaries, authority lists, and the like. One of the advantages to starting the archive now, before we have been inundated with data files, is that we will have a better chance to develop access systems slowly and gradually, adjusting them to meet the demands of new contributions and new users as the archive develops. An attempt to define such a system before seeing the contributions would, I think, be doomed.

The individual working groups will play a central role in defining the access systems and standards for access. They will need to consider the access systems and procedures as they evaluate data submissions. In order to access the data, there must be some index to indicate what data are available, where, in what form, etc. Therefore, there will be a database file to serve as such an index. The file will be available via anonymous ftp at the minimum and will include all necessary instructions to permit individuals to access files listed in it. This database will include files in the archive as well as others known to archive personnel.
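The index described above might be sketched as follows in Python. The structure is purely hypothetical: the class `IndexEntry`, its field names, and the sample entries are invented for illustration, and the actual layout of the index file has not been specified. Each entry records what data are available, where, and in what form, and whether the file is held by the archive or merely known to archive personnel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    title: str             # what data are available
    location: str          # where the file is held
    file_format: str       # in what form
    held_by_archive: bool  # False for files held elsewhere


def find_entries(index, keyword):
    """Return entries whose title contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [entry for entry in index if needle in entry.title.lower()]


# Hypothetical entries, for illustration only.
index = [
    IndexEntry("Example site pottery catalog", "ADAP", ".dbf", True),
    IndexEntry("Example site CAD model", "ADAP", ".dwg", True),
    IndexEntry("Example regional survey", "national archive", "ASCII", False),
]
for entry in find_entries(index, "example site"):
    print(entry.title, "-", entry.location, "-", entry.file_format)
```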

* Data standards - authority lists, field names, controlled vocabulary, etc. - I have saved this, the most difficult question of all, to the last. What kinds of standards will be imposed on data to be included in the archive, and how shall such standards be imposed?

The working groups will be charged with most of the scholarly questions, and the ADAP with most of the technical ones. But the technical ones - file formats and media, for instance - are the easy ones. The scholarly standards are much more problematic.

The gradual process of gathering information will help with the development of standards. It will not be necessary to try to develop an a priori set of standards, in my judgment; instead, each working group will be able to approach its own area of expertise with relatively simple standards and to refine those standards over time. As new submissions bring up new issues, the standards will be extended, modified, and adjusted. Similarly, as technical possibilities change, there will be revisions. As a result, it may be necessary to convert data already in the archive. That may seem inefficient, but I believe it is important to make the standards reflect the requirements of the data. If we try to create standards which cannot be adjusted or modified - even totally reconsidered if necessary - we are all too likely to find ourselves making the data fit inappropriate standards. If that happens, the integrity of the data will have been compromised.

The working groups will work with other organizations like the Getty Art History Information Project and the Council for the Preservation of Anthropological Records in developing standards. The ADAP will coordinate those efforts when they affect more than one working group.

* Access costs and other funding issues - Specific costs for accessing data have not been determined, though it is clear that the cost of supplying data on disks will be higher than the cost for data over the Internet. We have not determined whether we will create a membership system for unlimited data access, but that is a promising model.

In conclusion, I believe that the ADAP archive will be valuable to scholars in many ways easy to imagine today and in ways we cannot yet dream of. But its value will depend on a spirit of cooperation and of common purpose on the part of many - those who contribute their data as well as those who contribute their labor. To that end, we are doing our best to make this an open, consultative process, and we are eager to have the advice and suggestions of our colleagues. I look forward to your help with this effort.

Harrison Eiteljorg, II
CSA/ADAP

About this document:

Title: "The Archaeological Data Archive Project"
File name: adaplond.html
Author: Harrison Eiteljorg, II, c/o Center for the Study of Architecture (CSA), Box 60, Bryn Mawr, PA 19010, (e-mail to user-name nicke at (@) domain-name csanet.org; tel.: 484.612.5862)
Revision history: This paper had minor modifications in 1994 following the London Conference. Posted April, 1996. Modifications in the appearance only were made in July, 2000. In July 2001, further modifications in appearance were made and citation permission granted. Very small changes were made in August 2010, none of consequence.
Review process: As posted, this document has undergone no peer review.
Paper publication history: Text of paper prepared for British Academy Seminar on the Problems and Potentials of Electronic Information for Archaeology, London, June, 1994. Publication was said to be pending at the time of posting, but publication of the conference papers has not occurred.
Internet access: This document is maintained at csanet.org by the Center for the Study of Architecture and Harrison Eiteljorg, II. Note that there may be changes in computer addresses that are beyond the control of CSA.
Long-term availability: This document and/or its successors will be maintained for electronic access for the foreseeable future. When it is no longer directly accessible, it will be maintained in the CSA archive.
Citation permissions and copyright information: This document is copyrighted by CSA. When this document was originally posted, citations were not permitted since its publication was pending. However, the proposed volume never appeared. Citations with proper bibliographic information are, therefore, permitted.