Vol. XVIII, No. 1
CSA Newsletter
Spring, 2005

XML, Databases, and Standardized Terminology for Archaeological Data

Harrison Eiteljorg, II

In the last issue of the CSA Newsletter Dr. William Kilbride wrote about the need for better mechanisms to help archaeologists create data sets that are already in the correct form for long-term digital preservation. He discussed, without advocating it, XML's actual and potential use for that purpose. ("Past, present and future: XML, archaeology and digital preservation," XVII, 3 [Winter, 2005].) Readers will recall that Dr. Kilbride showed how XML uses simple ASCII text to record data and, at the same time, to label the data in an unambiguous way. The result is data in a form that can easily be parsed, even without special software. As Dr. Kilbride pointed out, for preservation of and access to archaeological data to become a reality, "it ought to be easier for archaeologists to create data in formats that are already fit for preservation. XML provides an example of such a format and how, in practical terms, the archaeological community can fit a common format to its purposes."

XML is based on the idea of tagging data with labels to clarify meaning and remove ambiguity. The tags themselves must first be defined so that the XML documents can be meaningful and unambiguous. For example, an XML document structured to include information about pottery sherds found in a given context would define the tags that specify all the recorded characteristics so that the tag <context> certainly indicates the context label used by the excavator and the tag <count> certainly indicates the number of sherds found, not the extrapolated number of whole pots indicated by the number of sherds. Each tag must be defined well and fully enough that there is no danger of ambiguity or confusion.
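As a minimal sketch of what such tagging might look like in practice, the Python fragment below reads a small, invented XML record of sherds grouped by context. Only the tags <context> and <count> come from the discussion above; the tags <sherd_lots>, <sherd_lot>, and <ware>, the function name, and all of the values are hypothetical and do not represent any agreed community definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: every tag other than <context> and <count>,
# and every value, is invented for illustration only.
SAMPLE = """
<sherd_lots>
  <sherd_lot>
    <context>A-104</context>
    <ware>coarse</ware>
    <count>37</count>
  </sherd_lot>
  <sherd_lot>
    <context>A-105</context>
    <ware>fine</ware>
    <count>12</count>
  </sherd_lot>
</sherd_lots>
"""


def read_sherd_lots(xml_text):
    """Return a list of (context, ware, count) tuples from the XML text."""
    root = ET.fromstring(xml_text.strip())
    lots = []
    for lot in root.findall("sherd_lot"):
        context = lot.findtext("context")
        ware = lot.findtext("ware")
        # By definition here, <count> is the number of sherds found,
        # not an extrapolated number of whole pots.
        count = int(lot.findtext("count"))
        lots.append((context, ware, count))
    return lots


lots = read_sherd_lots(SAMPLE)
total_sherds = sum(count for _, _, count in lots)
print(lots)
print(total_sherds)
```

The program can treat <count> as a number of sherds only because that meaning has been agreed in advance; nothing in the text of the file itself enforces it.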

XML tags may be defined either within a document or in an external reference document. As a result, a document can be truly self-defined or defined by the community in which it is used. Of course, if the definitions are internal only, human intervention is necessary to understand the tags that give the document its full meaning. Accessing XML data automatically, without human examination of the document itself, requires the application of a standard, community-wide set of definitions for the tags used.

As community-wide standards for XML tags are required for automated data access, so are such standards for document types. That is, the specifications must include definitions for the documents in which the defined tags will be used. A document type might be defined for ceramics, as suggested above, with the sherds and vases carefully connected to contexts by the context tag, implying another document type for contexts. If this begins to sound like a database design, it is. In fact, the home page for XSTAR, an XML system currently being designed for archaeology, says this about its XML definitions, called ArchaeoML: "An ArchaeoML document type, with its particular configuration of elements and attributes, performs the same role in XSTAR's native-XML database structure that a relational table, with its particular group of data fields, would perform in a relational database structure. . . . To avoid confusion, it is worth emphasizing that ArchaeoML . . . defines a general-purpose database structure. That is what is needed to create an integrative and efficiently searchable digital resource." (http://oi.uchicago.edu/OI/PROJ/XSTAR/DocumentTypes.html; accessed 15 March 2005)

As I read this, ArchaeoML is an attempt to define a standard data organization scheme to be used by archaeologists to provide access to data; the standards consist of a group of XML terms and document types that, taken together, should be used to record all data from excavations. The benefits of such an approach are obvious: the common terms and documents enable automated searches across archives and data sets. The disadvantages are also obvious: the common terms and document types have not been developed, much less accepted, by the community that is expected to use them; the data structure implied by the terms and document types has not been vetted by the wider community;1 the tools to put such an approach to work in XML are not widely available, whereas tools to put this (or any other data structure) into practice are available if the starting point is database management software rather than XML; and implementing the common data structure and terminology is no easier in XML than with database management systems.

XSTAR uses Java-based software and an XML-based data organization plan (neither complete at the time of this writing) for recording, storing, and retrieving archaeological data. Its developers aim to introduce a common archaeological recording system for excavators in order to foster easier access to data. To do so, however, XSTAR has chosen XML, not just common terms and organizational rubrics, as a necessary component. Leaving aside for the moment the possibility of a common recording system, what requires XML if a common recording system is to be used in archaeology? In fact, nothing. If a common recording system can be used, it can be implemented with XML, database management systems, spreadsheets, word processors, or virtually any other underlying computer technology. Indeed, common terms could be used on paper files and card catalogs. As it happens, only one of the computing technologies is both well-suited to the task and widely available in forms that have been thoroughly tested, are robust, and can be obtained from various vendors at reasonable prices -- database management systems.

Given the obvious applicability of database technology to archaeological needs and its relatively long history of use in the field, what is it about XML that makes it seem so desirable?2 XML's strongest asset is that the data in an XML document are stored in a non-proprietary format, using nothing but text, and the data can be read without the aid of special software. That seems a very compelling argument indeed. However, one must ask how valuable this is to real users. That is, can one really use simple text editors to access reams of data about an excavation or a material collection? Imagine reading through documents from an excavation in order to locate information for all contexts considered to belong to the same phase, or searching a large document with pottery data for all examples of a given style. Repeated searches would be required with a word processor or simple text editor. While such searches are indeed possible, they would take so much time and be so error-prone as to be nearly useless. Therefore, real use of XML data requires specialized software for accessing the data or for translating the files into a more useful format, specialized software that is still rare and expensive.

As an archival aid, XML's non-proprietary ASCII-only format would seem to be an advantage as well. However, simply storing data tables in tab-delimited ASCII files for archival purposes accomplishes the same goals -- preserving data in non-proprietary, human-readable forms that can be translated into more useful formats. In addition, tab-delimited ASCII files can be read and usefully accessed by various kinds of software.3
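A brief sketch may make the point concrete. The Python fragment below writes a small table to tab-delimited text and reads it back using only the standard csv module. The column names, the sample values, and the function names are hypothetical; this is an illustration under those assumptions, not the procedure of any particular archive.

```python
import csv
import io

# Hypothetical table of sherd lots; column names and values are invented.
FIELDS = ["context", "ware", "count"]
ROWS = [
    {"context": "A-104", "ware": "coarse", "count": "37"},
    {"context": "A-105", "ware": "fine", "count": "12"},
]


def to_tab_delimited(rows, fields):
    """Serialize the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_tab_delimited(text):
    """Read tab-delimited text back into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


archived = to_tab_delimited(ROWS, FIELDS)
restored = from_tab_delimited(archived)
print(archived)
```

The archived text is plain ASCII that can be opened in any text editor or spreadsheet, and the header line carries the column names. What those names mean still has to be documented separately, exactly as XML tags must be defined.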

XML proponents see other advantages. They believe the data structures offered in an XML environment have an important advantage -- the use of hierarchical rather than relational models. However, this advantage has been claimed despite the fact that hierarchical systems were tried and discarded years ago in database systems.4 Proponents also see ease of access from outside the system as a critical advantage of XML. However, ease of access to XML-based data is actually a function of the common terms and data organization, not the use of XML. Access to data in a good, modern database is equally easy if the structure and terms are known, and a well-defined protocol (SQL) has been in use to request data from such databases for many years.5 Finally, some proponents may see XML as the technology of the future and believe archaeologists should adopt it sooner rather than later. That represents a reading of the tea leaves that is not universal. If XML becomes the dominant form of data storage, it will be because businesses have adopted it. (They represent the market for software producers.) Indeed, some businesses have, but more use XML only as a data transmission protocol, leaving the data in traditional databases on corporate computers.
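The kind of request mentioned above (for instance, finding the material from all contexts assigned to one phase) can be sketched with SQL. The fragment below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample values are all hypothetical, and SQLite simply stands in for whatever database management system a project actually uses.

```python
import sqlite3

# Hypothetical two-table structure; names and values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contexts (
        context TEXT PRIMARY KEY,
        phase   TEXT
    );
    CREATE TABLE sherd_lots (
        context     TEXT REFERENCES contexts(context),
        ware        TEXT,
        sherd_count INTEGER
    );
    INSERT INTO contexts VALUES
        ('A-104', 'Phase II'), ('A-105', 'Phase III'), ('B-201', 'Phase II');
    INSERT INTO sherd_lots VALUES
        ('A-104', 'coarse', 37), ('A-105', 'fine', 12), ('B-201', 'fine', 5);
    """
)


def sherds_by_phase(connection, phase):
    """Return (context, ware, sherd_count) rows for one phase."""
    query = """
        SELECT c.context, s.ware, s.sherd_count
        FROM contexts AS c
        JOIN sherd_lots AS s ON s.context = c.context
        WHERE c.phase = ?
        ORDER BY c.context
    """
    return connection.execute(query, (phase,)).fetchall()


phase_two = sherds_by_phase(conn, "Phase II")
print(phase_two)
```

The request works only because the table and column names are known to the person asking, which is the same condition that applies to XML tags.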

Some proponents of XML may not appreciate fully the nature of archaeological data and therefore underestimate the problems of organizing archaeological data at the outset, making XML seem more useful than it is. Archaeology is unique in that, even after years of experience in the field, an excavator cannot prepare an a priori system for data recording and be sure that it will escape adjustments in the field. Archaeologists are uniquely at the mercy of what has been left to be found, and archaeologists are uniquely unable to repeat their experiments; so they must make the best of the data they obtain, even when the data are clearly problematic. Doubtless the practitioners of every discipline conceive their own discipline to be unique, but the problems of data organization in archaeology have long been recognized. In a 1994 article in the CSA Newsletter a relevant example of a database system was cited: it had been prepared in advance for use by 26 different scholars who were to work simultaneously in Britain. The scholars were given the same database structure, the same software, and training in the use of the software and data structure. Nevertheless, the resulting databases were different, even to the point that fundamental concepts were differently treated. ("CSA Database Project," VII, 3 [November, 1994] -- accessed 16 March 2005.)

Taking an optimistic stance for the moment and assuming that a common data structure for archaeology can be developed, what are the requirements for such a common structure? First and most obvious, the structure must be developed by the archaeological community in a way designed to make certain that the result is applicable in diverse settings, has broad support, and can be used by a very wide range of scholars. Second, it has to be used and tested in ways that make it possible for anyone considering its use to see exactly how it works in a given context, presumably a similar one. Third, implementing the data structure should be possible with available, robust, reasonably priced tools and minimal training.

As I see it, XML is irrelevant to the first two requirements above. Developing a common data structure does not depend upon any particular computer technology; it depends on agreement in the scholarly community of use. Developing good examples does not require XML; database management systems can execute virtually any data organization. On the other hand, XML is an impediment to the third requirement. Database management systems provide all the crucial capabilities with many available, robust, inexpensive programs; XML systems, in contrast, are relatively scarce and expensive.6

Having argued that XML is not a necessary part of a common data organization, I must return to the more basic question. Is a common data organization possible? For an excavation, I believe not. However, it seems to me that it should be possible to develop common data schemes for artifacts of various types, making sure that all data tables for those artifact types use the same terms in the same ways. Similarly, it may be possible to construct a common set of terms for contexts. Other possibilities may also await, but all of these common schemes require the development procedures mentioned above: development by the community, use and testing in open and accessible examples, and the production of good, useful, illustrative examples. Indeed, various initiatives exist at the international level, for instance ICOMOS and FISH (Forum for Information Standards in Heritage). The Guides to Good Practice of the Arts and Humanities Data Service in England (to which this author has contributed) have also been aimed at establishing standards, albeit at a more general level. There are also standards for artifact descriptions that have specific artifact types as their subjects.

Were the archaeological community to attempt to create standards for artifact descriptions or to encourage/enforce the use of existing standards, various archaeological organizations from many countries would need to be involved, as would archival organizations. In addition, the process would require a kind of public-relations campaign to persuade scholars, especially those who are excavating or operating surveys in the field, of the value of common terms and data structures. Finally, I think timing is critical; a long, slow, contemplative process of standards development, despite its intellectual satisfaction, will rob the work of its most important potential advocates, the scholars who must put standards into practice in their projects. Those are the people who have the ability to make the standards useful -- or irrelevant, and they will not be patient with a long, drawn-out process. If they are to participate in the development of such common terms, I believe they will expect useful results without delay. The alternatives are simply not reasonable: stopping projects to wait for the development of terms or constantly re-working data to reflect new terminological standards.

Despite the foregoing and the existing attempts at standards-making already in process, I am not an optimist about the likelihood of developing standard terms. The example cited above, where careful preparation of British scholars came to naught, suggests one problem: getting people to use a system designed by others is possible, but getting them to use it correctly is much more difficult. In addition, I think scholars who perform excavation or survey work in the field -- particularly those who are academics and teach when not in the field -- will not willingly spend their very limited time on the tasks of data standardization. Few of them see a need for such data standards since they are focused on their own data, not the need to use the data from other projects. If those scholars do not participate in the standards development process, other archaeologists who are willing to devote their time to the task, with the aid of the archival community, will be left to do the job. Then I would expect the archaeologists who work in the field to honor the standards by ignoring them. I hope I am wrong. I see no carrots that will bring the use of standards, but there are some large sticks that may work. Archival organizations, funding agencies, and governments hold those sticks; if they use them wisely (and, at least in the case of governments, how likely is that?), useful standards may eventually be developed and put to use.

In the meantime, scholars have the database management tools they have long used or new XML tools for data storage. As the state of the art now stands, I see no advantage whatsoever for XML but many advantages for good database management systems. I can also find numerous vendors of database management systems and very few vendors of XML database systems. XML's time may come for archaeology, but we have a while to wait. Not only would using XML now require spending relatively large sums for software, training, and development -- sums that almost certainly must reduce project budgets for other work -- but this also seems a good example of the leading edge in technology being, as it is often called, the bleeding edge.

-- Harrison Eiteljorg, II

To send comments or questions to the author, please see our email contacts page.

1. ArchaeoML has been defined, according to the Web site (but, alas, on a page no longer available), by a group of 23 scholars. Most are from the University of Chicago; fewer than half a dozen other institutions are represented. In addition, fewer than half of the members of the group seem to be archaeologists. I am aware of no broadly advertised meetings held to involve the archaeological community as a group in the development of this proposal, nor am I aware of any formal attempt to involve members of the broader archaeological community in some other fashion.

2. It is possible that the proponents of XML believe a new technology is required as a tactic, thinking that scholars would be unable to introduce a common data structure when using the "old" database technology. It is certainly true that most excavators already have implemented databases in the field, and the individual implementations are, to say the least, idiosyncratic. However, the tactic fails if the technology proposed is not a true advance.

3. The Archaeological Data Archive Project used tab-delimited ASCII files, with documentation concerning data structure in text files, as the archival documents of last resort. All archived database files were transferred to and retained in tab-delimited ASCII format.

4. XML documents are similar to data tables; their content and form, including attributes, are defined. However, the recording system in an individual document is hierarchical in the sense that attributes are grouped in hierarchies that are critical to the effective use of the data.

An XML data structure would remove the need for some data tables required in a database environment. So-called repeating fields can be handled in a hierarchical XML scheme without difficulty but not in a good relational database. This is a (very minor) advantage for XML.
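The difference can be sketched briefly. In the hypothetical Python fragment below, a repeating field (several motifs recorded for one vase) sits directly inside each XML record, while the relational equivalent needs a second table linked by an identifier. The tag names, the identifiers, and the values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: <motif> repeats inside a single <vase> element.
SAMPLE = """
<vases>
  <vase id="V1">
    <shape>krater</shape>
    <motif>meander</motif>
    <motif>palmette</motif>
  </vase>
  <vase id="V2">
    <shape>kylix</shape>
    <motif>lotus</motif>
  </vase>
</vases>
"""


def split_into_tables(xml_text):
    """Turn the hierarchical records into rows for two relational tables."""
    root = ET.fromstring(xml_text.strip())
    vase_rows = []   # one row per vase: (vase_id, shape)
    motif_rows = []  # one row per repeated value: (vase_id, motif)
    for vase in root.findall("vase"):
        vase_id = vase.get("id")
        vase_rows.append((vase_id, vase.findtext("shape")))
        for motif in vase.findall("motif"):
            motif_rows.append((vase_id, motif.text))
    return vase_rows, motif_rows


vase_rows, motif_rows = split_into_tables(SAMPLE)
print(vase_rows)
print(motif_rows)
```

The second list corresponds to the extra data table that a relational design requires and that the hierarchical XML record does without.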

5. An SQL request to a database returns a table rather than the text string that would be returned by an XML query. The table can be generated with column labels and formatted as simple text with tabs to separate the columns. Such a table can be read by virtually any software -- a text editor, a spreadsheet, or a database management system -- and, by virtue of the column headings, is as self-documenting as any XML document. Such a table is also much smaller than a comparable XML document, unless data are returned for a single object, because labels need not be repeated in a table, as they must be for every data item in an XML document.
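The size difference can be checked with a short sketch. The Python fragment below renders the same three invented rows once as a tab-delimited table with a header line and once as XML, then compares the lengths of the two strings. The field names, the values, and the XML layout are hypothetical; the exact figures depend entirely on those choices.

```python
import xml.etree.ElementTree as ET

# Hypothetical result set; field names and values are invented.
FIELDS = ["context", "ware", "count"]
ROWS = [
    ("A-104", "coarse", "37"),
    ("A-105", "fine", "12"),
    ("B-201", "fine", "5"),
]


def as_table(fields, rows):
    """Render the rows as tab-delimited text; labels appear once, in the header."""
    lines = ["\t".join(fields)]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"


def as_xml(fields, rows):
    """Render the rows as XML; every data item carries its own label."""
    root = ET.Element("sherd_lots")
    for row in rows:
        lot = ET.SubElement(root, "sherd_lot")
        for name, value in zip(fields, row):
            ET.SubElement(lot, name).text = value
    return ET.tostring(root, encoding="unicode")


table_text = as_table(FIELDS, ROWS)
xml_text = as_xml(FIELDS, ROWS)
print(len(table_text), len(xml_text))
```

Because the labels are written once in the table but once per data item in the XML, the gap widens as more rows are returned.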

6. The Arts and Humanities Data Service has recently reviewed XML editors, but the state of the art is so low that these are simply editors to permit people to type text and include XML code. Entering data from an excavation is a much more complex task that will require programs with very sophisticated procedures. (See Thijs van den Broek, "Choosing an XML editor" -- http://ahds.ac.uk/litlangling/creating/information-papers/xml-editors/index.htm -- accessed 2 April 2005.)

For other Newsletter articles concerning issues surrounding digital archiving, issues surrounding the use and design of databases, or the use of electronic media in the humanities, consult the Subject index.