Aggregation for Access vs. Archiving for Preservation

Harrison Eiteljorg, II

(See email contacts page for the author's email address.)

The first article in this issue of the CSA Newsletter concerns the Open Context initiative, an ambitious project to gather and make available archaeological data from a wide variety of projects in a single, searchable resource. The second article is about a different but equally ambitious initiative, Digital Antiquity, and its repository, the Digital Archaeological Record (tDAR), for digital files produced in the course of archaeological investigations or research. These two projects are very important initiatives, but their success will be clear only after some years. If scholars contribute project data in sufficient quantities to either or both, one or both - and one certainly hopes both - will have made an important contribution to the archaeological community.

Readers will note that these two projects illustrate two threads in the archaeological community concerning the longevity of data. One thread takes the primary issue to be aggregating data for access; the other takes the primary issue to be archiving data for preservation. While these are obviously not mutually exclusive notions, they motivate different people and suggest different processes.

The Open Context initiative takes its primary duty to be providing archaeological data via the web to anyone with access to a computer and the internet. The data will be translated into and made available in the form(s) chosen by Open Context, not in the scholars' original forms. That permits Open Context to supply individual data items instead of whole files produced in the course of a project; it permits the data to be aggregated for searching. Preservation of the data, to be sure, is included in the plan, but the preservation will be in the schema used by Open Context, not the archaeologists' original schemata, and preservation will, most likely, not include all the data recorded by a field project.

The Digital Antiquity/tDAR project, on the other hand, takes its primary duty to be the preservation of digital files produced in the course of archaeological fieldwork so that access to those files may be guaranteed. Those files - or files in other formats intended to mimic the originals - will be preserved via archival processes. Access will be to an entire file, not to individual data items from a file. As I read the information on the Digital Antiquity/tDAR web pages and the article here, the project does not take data aggregation to be the primary goal, though methods for aiding in the aggregation of data will be part of the long-term work.

Both projects are critical for the long-term health and advancement of the discipline. Archaeologists need access to large quantities of old data, and access, at a minimum, means that old data must be preserved. For the near term, it seems to me that data preservation is the primary need because there are too many data files at risk; their loss would be tragic - and would render too much good field work useless. In the long run, however, aggregated data will provide more opportunities for large-scale work on objects - artifacts or ecofacts - from many different projects.

The limits of the preservation-by-archiving approach are obvious and important. Absent tools for aggregating data, users must have at their disposal all the necessary skills to use any file downloaded from the archive. Even when only a given project's files are needed for the research aims of the user, thus eliminating the problems of data aggregation, the skills required may put important impediments in the path of that user. Such a user may need to be able to use database, GIS, and CAD software, for instance. Until archaeologists are routinely taught to use all relevant software - a time I do not expect to see - this means that only a team can reasonably expect to use the files provided via an archival process.

There are also significant problems with data aggregation. Using a schema such as ArchaeoML for data storage seems to provide excellent access to objects. It does not seem to me, though, that there is as much utility to finding common excavation contexts from multiple projects. Further, by requiring that the context data be taken from its original schema and put into a neutral one, it seems likely to me that the information about a particular excavation - stratigraphic analyses, sealed deposits, and the like - is less likely to be well understood by a user of the data. After all, the whole of an excavation is not well described by the isolated data items of individual trenches, loci, lots, or features; the whole is understood best by trying to understand all the relationships between and among all the parts. It seems to me that the schema selected by a scholar to record an excavation is the best way to understand that whole and that data aggregation here simply complicates the task. At the least, the original files should be retained along with the data extracted from them for aggregation.

There are also three types of data that, in my view, present significant problems for aggregation: photographs, GIS data sets, and CAD models. The problem with photographs for me is not a problem with the images but with the data about them. Information about many project photographs can surely be put into a neutral data system, but, at least for Open Context, the system is based upon artifacts, excavation units, and people. The more objects or other primary items of interest in a given photograph, the more problematic the data storage about the image. For example, each object in a photo must be labeled in a way that requires more specificity than a photographer necessarily records (e.g., a group of items from a burial may have only that information - objects from burial x - in a photo log), but data about any of the objects in Open Context should lead to the group photo, making it necessary to label each artifact. Other kinds of photos also seem problematic. What does one do with a photo of the area to be excavated taken prior to the start of the first season or the similar one I once took of the field prior to excavation, showing piles of wheat from the field being winnowed by the local farmers? Or one of people working in a trench to lay out the boundaries of a sondage, the point of the photo being the use of geometry to lay out the sondage, not the excavation unit that resulted? How does one reasonably store information about a photo of the 2009 crew, absent those who were sick on the appointed day? The latter may seem only cumbersome, since all the crew will be in the people file, but what about local workmen, the project cook, or visiting scholars who happen to get into the photo, none of whom need appear in the data elsewhere? 
From my perspective, a project database that describes each photograph is to be preferred to a neutral schema that may seem to satisfy all needs but may really not - and would most likely omit all kinds of technical data about a photograph that a photographer might record, e.g., camera, original file format, lens used, exposure settings, manipulations of the image in secondary processing, etc. Again, I see a clear need to retain the original photo data file, whether or not the information is also moved into a second, generic system.
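
The labeling problem with group photographs can be made concrete with a small sketch. Everything in it is hypothetical: the record layout, the field names, and the identifiers such as "OBJ-17" are invented for illustration and are not the schema of Open Context or of any actual project database. The sketch only shows that a photo logged as "objects from burial x" cannot be reached from an individual object until someone labels each artifact in it, while the technical fields a photographer might record simply travel with the project's own record.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PhotoRecord:
    """Hypothetical project-level photo log entry (all field names are assumptions)."""
    photo_id: str
    caption: str
    camera: Optional[str] = None
    lens: Optional[str] = None
    original_format: Optional[str] = None
    exposure: Optional[str] = None
    processing_steps: List[str] = field(default_factory=list)
    # Identifiers of individual objects shown; often empty for group shots.
    object_ids: List[str] = field(default_factory=list)


def photos_showing(records: List[PhotoRecord], object_id: str) -> List[str]:
    """Return the ids of photos explicitly labeled as showing the given object."""
    return [r.photo_id for r in records if object_id in r.object_ids]


# A group photo logged only at the level of the burial, and a single-object photo.
log = [
    PhotoRecord("P-001", "objects from burial x", original_format="TIFF"),
    PhotoRecord("P-002", "bronze pin", object_ids=["OBJ-17"]),
]

# Before per-artifact labeling, a search by object misses the group photo.
found_before = photos_showing(log, "OBJ-17")

# Only after someone identifies the pin in the group photo does it appear.
log[0].object_ids.append("OBJ-17")
found_after = photos_showing(log, "OBJ-17")
```

In this sketch the extra work falls on whoever must add the per-artifact labels, which is more specificity than the original photo log contained.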

GIS and CAD present one common problem. Both are essentially visual systems. As a result, extracting the data may well preserve those data, but to what purpose? How does one use either a GIS data set or a CAD model without the visual component? There are certainly some things one may do with the data and no visual component, but there are many more for which the visual is critical. Thus, at best one must move data from the GIS data set or CAD model into the new schema - and then move it back into a GIS format or CAD model to use it, risking errors in translation on both ends. This is a greater problem with CAD because there are proprietary differences from CAD program to CAD program that impact the nature of the stored data. GIS, on the other hand, seems to have better and more robust neutral expressions of the data.

As just pointed out, CAD programs differ significantly in the ways they record information. For instance, some programs will permit a user to define a surface with four bounding points, even if those points do not lie in the same plane. In reality, then, the surface is not a single surface, but, at best, two surfaces meeting on one of the diagonals. Nevertheless, some programs will treat the surface as a single entity and define it as if it were a surface. Others will not, requiring the user to make two triangular surfaces meeting on a diagonal. Furthermore, a given simple block, depending upon the way it has been surveyed and put into the model, may consist of only one side of the block (if, for instance, it is still immured in a wall), more than one, or all six if it has been removed from a wall and fully studied. Each side recorded, however, may have been recorded as a surface (two in some systems, given the problem of all points lying in the same plane), as four lines defining the edges (with each end of each line meeting properly at a corner, not just coming close, but meeting), or as a single connected series of line segments defining the edges (there may, in turn, be four segments, with the last ending where the first began, or three segments and a specification that the figure is closed on itself, making the last line the result of that specification, not the explicit creation of a line segment). If all sides of the block have been fully studied, it may also be treated as a solid object in a CAD program. The chosen possibility may be forced upon a scholar by the program or selected by the scholar from available alternatives. 
It is surely possible to move the data, despite all those kinds of combinations and permutations, into a neutral format, because each CAD entity can match a data type that includes all necessary specifications, but it is impossible for me to imagine usefully accessing the resulting data items without putting them back into a CAD format - and having a great deal of documentation as well.
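
Two of the variations described above can be illustrated with a short sketch. It is not the behavior of any particular CAD program or neutral format; the function names, the tolerance value, and the sample coordinates are all assumptions made for the example. The first part tests whether four bounding points lie in one plane and, if not, splits the face into two triangles meeting on a diagonal. The second part shows that an outline drawn as four explicit segments and one drawn as three segments plus a "closed" flag describe the same edges, even though the stored data differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def _sub(a: Point, b: Point) -> Point:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(u: Point, v: Point) -> Point:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u: Point, v: Point) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def is_planar(quad: Sequence[Point], tol: float = 1e-9) -> bool:
    """True if the fourth point lies in the plane of the first three.

    Uses the scalar triple product, which is zero for coplanar points.
    The tolerance is an arbitrary assumption for this sketch.
    """
    p0, p1, p2, p3 = quad
    normal = _cross(_sub(p1, p0), _sub(p2, p0))
    return abs(_dot(normal, _sub(p3, p0))) <= tol


def as_triangles(quad: Sequence[Point]) -> List[Tuple[Point, Point, Point]]:
    """Split a four-point face into two triangles meeting on the p0-p2 diagonal."""
    p0, p1, p2, p3 = quad
    return [(p0, p1, p2), (p0, p2, p3)]


def polyline_edges(vertices: Sequence[Point], closed: bool = False):
    """List the line segments of a polyline; a closed flag adds the final edge."""
    edges = list(zip(vertices, vertices[1:]))
    if closed and len(vertices) > 1:
        edges.append((vertices[-1], vertices[0]))
    return edges


a, b, c, d = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)
flat_face = [a, b, c, d]
warped_face = [a, b, c, (0.0, 1.0, 0.5)]  # fourth corner lifted out of the plane

# Four explicit segments, the last ending where the first began ...
explicit = polyline_edges([a, b, c, d, a])
# ... versus three segments plus a "closed" specification.
flagged = polyline_edges([a, b, c, d], closed=True)
```

Here `is_planar(flat_face)` holds and `is_planar(warped_face)` does not, so only the warped face needs `as_triangles`; `explicit` and `flagged` contain the same four edges. A neutral format can store any of these forms, but a user still needs to know which one was chosen and why.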

The foregoing suggests to me that these two approaches highlight the difficulties of both preserving and aggregating data and that - as a result - we who will depend upon the success of these and similar projects must not - to borrow a trite, old phrase - put all our eggs in one basket. Aggregating data by extracting information from field work data files has some clear benefits and some obvious problems. Preserving original data files also has both clear benefits and obvious problems. Therefore, I hope that the two threads of archaeological data preservation - aggregation for access and archiving for preservation - will merge into a single, cooperative series of enterprises, each working to solve the problems for which it is best suited. This may require that funding agencies see the problems as I do - problems requiring more than one approach and risking serious duplication of effort if those approaches are followed as if each were the only path. Not optimistic on that score, I remain hopeful.

-- Harrison Eiteljorg, II