CSA Newsletter, Fall '05:

Vol. XVIII, No. 2

Fall, 2005

XML: The Way Forward

Tyler Bell

It was with some concern that I read the rather bleak picture Mr. Harrison Eiteljorg, II, painted of what he saw as the potential of XML in archaeology ("XML, Databases, and Standardized Terminology for Archaeological Data", XVIII, 1 [Spring, 2005]). Viewing XML as a near-ideal solution to so many of the fundamental problems encountered by archaeologists in the collection, management, exchange and preservation of their data, I thought XML required an apologist at CSA, and Mr. Eiteljorg's point needed a counterpoint. In writing one, my aim is not to provide an 'XML Primer', but instead to offer a non-technical overview of XML that may better highlight some of its various benefits as an applied technology.

It is important to understand that XML is an elemental technology used for a broad range of purposes: in much the same way that the same pen can be used to write a comedy, haiku, or shopping list, XML can be used to exchange, format, search and transform data. For archaeologists, XML is a simple and convenient solution to the intractable problems of working with cultural heritage data, and is in fact the only technology that gives us the means to share information between otherwise incompatible technical platforms (raise your hand if you are using a Mac to read this), convert information between different data formats (Access 97 or Access 2000?), document information in a structured and established information standard, and preserve information in a long-term, vendor-neutral format.

Although I tend to call XML a 'technology', it is more accurately described as a standardized form of encoding upon which so-called 'XML technologies' can be built; in other words, there is nothing overly technical about an XML document, but it is so standardized and well-conceived that diverse technologies have been created that search, present and transform XML data across a range of platforms and computer environments.

This standardized form of encoding is (at its most basic) a series of 'tags' that provide clear and explicit meaning in textual 'mark-up'. Take, for example, these tagged values:
<environment>wood</environment>
compared to:
<material>wood</material>

Here the term 'wood' has been given different meaning by the context of its surrounding tags. Although there is nothing semantic in XML technology itself, it does provide a well-defined mechanism for specifying meaning. One of the advantages of XML is that there are no pre-defined tags - you create them to suit your own purpose. The set of tag names, and their order, is effectively a set of rules, defined by a document called a 'schema.'¹
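
For readers who want to see this in practice, here is a minimal sketch in Python (my choice for illustration only; any XML-aware tool would serve) using the standard library's xml.etree.ElementTree. The wrapping 'record' element is an assumption, added only so that both tagged values sit in one well-formed document.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element 'record' so both tagged values form one document.
doc = "<record><environment>wood</environment><material>wood</material></record>"

root = ET.fromstring(doc)

# The text is identical in both cases; only the surrounding tag says what it means.
meanings = {child.tag: child.text for child in root}
for tag, value in meanings.items():
    print(f"{tag}: {value}")
```

The program never needs to guess what 'wood' refers to: the tag name carries that information alongside the value.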

Being able to create your own tag sets and rules makes XML a very flexible tool that can be used to mark up every conceivable type of text-based data, and do so in nearly any language. This is one of the primary differences between XML and database technologies: the XML tag sets must themselves conform to common rules, which ensures that the information contained within an XML document can be understood and processed without recourse to a specific computer program or hardware platform.

Common and Conflicting Standards

XML is of course more than a simple annotation tool - it allows us to create data standards that embrace the diverse manner in which archaeology is recorded and practiced, without dictating the use of specific computer hardware or software. With XML we can now turn our attention away from the eternal debate on the relative merits of the PC and the Mac, and address instead, as a professional community, the quality of the information we are recording.

The success of an XML standard is best assured when it is the result of a well-considered process, created by a body representing the community it serves. In this light I am particularly optimistic about the future of MIDAS XML, the data standard for Cultural Heritage Management created by the Forum on Information Standards in Heritage (FISH) in the UK. MIDAS XML is a rule set that defines how information for 'monument inventories' is to be recorded and -- most importantly -- exchanged. MIDAS itself is in fact a mature data standard that pre-dates XML, first created by English Heritage but now employed throughout the UK and further abroad. The original MIDAS was the result of communal discussion and consensus, and MIDAS XML is designed explicitly to be extremely accommodating to the various ways in which the MIDAS standard has been implemented. Only XML can accommodate diverse working practices while at the same time standardizing how the data are recorded.

I do not believe that there will ever be one unifying data schema for archaeology: the profession is too disparate, the data too diverse, and the discipline too (frankly) undisciplined in the way it collects and stores information.

You will have noted above that I said that anyone can create an XML schema; this is true, and this freedom certainly creates the opportunity for conflicting standards. Is such an accessible technology simply a recipe for archaeo-anarchy? Not at all, because there is an entire subset of technologies designed to facilitate the transformation of data from one XML format to another. While the process still requires that a 'mapping' between schemas be created manually by someone familiar with both, the mapping itself can be read by most XML software applications for large-scale data transformation. ArchaeoML, which Mr Eiteljorg references, can be converted into any other XML format, and the other way around. The problem of conflicting data standards, while it still exists, is no longer the insurmountable problem it used to be.
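
To give a flavour of what such a mapping involves, here is a minimal sketch in Python. In practice this job is usually done with XSLT; a plain Python dictionary stands in for the mapping here because it is easier to read. The tag names on both sides are hypothetical and are not drawn from ArchaeoML or MIDAS XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, manually written mapping from one schema's tag names to another's.
TAG_MAPPING = {"site": "monument", "label": "name", "substance": "material"}


def transform(element, mapping):
    """Return a copy of element with its tags renamed according to mapping."""
    # Tags missing from the mapping are carried over unchanged.
    new = ET.Element(mapping.get(element.tag, element.tag), dict(element.attrib))
    new.text = element.text
    new.tail = element.tail
    for child in element:
        new.append(transform(child, mapping))
    return new


source = ET.fromstring(
    "<site><label type='current'>Westbury</label><substance>wood</substance></site>"
)
target = transform(source, TAG_MAPPING)
print(ET.tostring(target, encoding="unicode"))
```

The mapping is written once by someone who knows both schemas; after that, the same function can be run over any number of records without further human effort.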

DIY Data Schema

XML makes all of these wonderful things possible because schemas document the kind of information being stored (e.g. "text"), its relationship to other data (e.g. "a property of the artifact"), and a description (e.g. "the type of artifact, usually drawn from the MDA Object Type Thesaurus"), all in machine-readable and human-readable form.

The machine readability is important, and this is where the comparison of database dumps documented by ASCII (text) files to XML simply falls over: text documentation of the kind to which Mr Eiteljorg refers is not standardized, is frequently undocumented, and therefore almost entirely unreadable by machines. Machine processing is utterly essential to the long-term preservation of archaeological data because only machines (not graduate students) can realistically retro-process the thousands of archaeological layers on a site which give a single artifact its context seventy-five years after the excavation finished.

It is also worth noting in this context that the hierarchical and 'tagged' format of XML makes for better long-term data storage, encapsulated self-documentation, and data exchange. Take this example of a 'description' element from MIDAS XML for a hillfort in the hypothetical County of Westshire, UK:

<description source="Westshire HER" audience="general" preferred="true">
       <full>
             The hillfort at Westbury, overlooking the Easder valley is one of the
             best preserved in Westshire. Excavation of the North rampart in 1965
             suggested a construction date for the outer defences in the late 2nd
             century BC, though there is evidence for Bronze Age occupation near the
             summit of the hill. Known as Weasburgh in the Anglo Saxon Chronicle, though
             no traces of Saxon occupation have been found. </full>
       <summary>
             Iron Age hillfort with traces of earlier occupation
       </summary>
</description>

Here the 'description' element contains two other elements, 'full' and 'summary,' which in turn contain the complete (full) and abbreviated (summary) descriptions of the site. In addition to these sub-elements, XML elements can also carry properties (formally called 'attributes'): here the 'source,' 'audience,' and 'preferred' labels in the 'description' tag, which comment on the contents of the entire element. The properties tell us that the description provided is derived from the Westshire Historic Environment Record (HER), is intended for a general audience, and is the preferred description of all that may be recorded for this site. An equivalent database table, in contrast, might be able to capture the same data, but would not be able to indicate natively the semantic relationship of the properties to the elements they are modifying.
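
As a rough illustration of how software sees this structure, the sketch below reads the 'description' element with Python's standard xml.etree.ElementTree module. The text of the 'full' element is abbreviated here for space; everything else follows the example above.

```python
import xml.etree.ElementTree as ET

# The 'full' text is shortened for this sketch; the structure is unchanged.
description_xml = """
<description source="Westshire HER" audience="general" preferred="true">
    <full>The hillfort at Westbury, overlooking the Easder valley (abbreviated)</full>
    <summary>Iron Age hillfort with traces of earlier occupation</summary>
</description>
"""

description = ET.fromstring(description_xml)

# The properties (XML attributes) travel with the element they qualify.
for name, value in description.attrib.items():
    print(f"{name} = {value}")

# Sub-elements are reached by name, without knowing their position.
summary = description.findtext("summary").strip()
print(summary)
```

Both the content and the commentary on that content arrive in one self-describing unit, which is the point of the comparison with a database table.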

Still not convinced? Take a look at the following example where we record how the site is identified and named:

<appellation>
       <name type="current" preferred="true">
             WESTBOROUGH HILLFORT
       </name>
       <name type="historic" preferred="false">
             WEASBURGH
       </name>
       <identifier namespace="WESTSHIRE HER">
             WE76598
       </identifier>
       <identifier namespace="EH HOBUID">
             67854
       </identifier>
</appellation>

Here we see that the monument is known by two names and has two different identifiers; the use of properties allows us to comment on the currency, preference and source of these identifiers (the XML schema allows its authors to indicate whether one or many names or identifiers are allowed in this context). The database equivalent, restricted to rows and columns in a table, does not permit the same depth of information to be stored in a single file with the same ease.
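
A short sketch, again in Python with the standard library only, shows how these repeated elements and their properties might be queried. The variable names are my own, and the selection of the preferred name assumes that the value "true" marks it, as in the example above.

```python
import xml.etree.ElementTree as ET

appellation_xml = """
<appellation>
    <name type="current" preferred="true">WESTBOROUGH HILLFORT</name>
    <name type="historic" preferred="false">WEASBURGH</name>
    <identifier namespace="WESTSHIRE HER">WE76598</identifier>
    <identifier namespace="EH HOBUID">67854</identifier>
</appellation>
"""

appellation = ET.fromstring(appellation_xml)

# Select the preferred name with an attribute predicate.
preferred_name = appellation.find("name[@preferred='true']").text

# Index every identifier by the body that issued it.
identifiers = {
    ident.get("namespace"): ident.text
    for ident in appellation.findall("identifier")
}

print(preferred_name)
print(identifiers)
```

Adding a third name or identifier to the record requires no change to this code, since it simply walks whatever elements are present.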

A final example of how flexible and self-contained XML data structures can be: here we record the names of the places where the monument can be found ('namedplace' is just one element of the MIDAS spatial schema):

<namedplace>
       <location type="county" namespace="EH_CDP98">
             Westshire
       </location>
       <location type="district" namespace="EH_CDP98">
             South Hams
       </location>
       <location type="civilparish">
             Chivelstone
       </location>
       <location type="locality">
             Easder Valley
       </location>
</namedplace>

The increasingly XML-savvy amongst you may have noticed that the above 'namedplace' element permits a very loose definition of a 'location': instead of dictating elements called 'county', 'parish', and 'district' we have defined a loose 'location' element and provided a property ('type') allowing us to define what kind of location is being described. Why? Geographical units are too diverse to be standardized -- cadastral units are commonly used in Europe, while a US and a UK 'county' are completely different entities. Instead of making semantic compromises such as "a UK County = a US State", or attempting to include all known contemporary and historic geographical units across the globe, we can create one element and modify it by its property value. When a new geographic type needs to be added, we can add that value to the list of allowed types, without the need to change the data structure itself. Can we do this with databases? Yes. Can we do it with the same ease and elegance? Not even close.
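
The idea of a list of allowed types can be sketched as follows. In a real project the list would live in the schema itself; here a Python set stands in for it, and the 'canton' value and the function name are hypothetical, chosen only to show a new type being introduced.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of allowed 'type' values; a real schema would hold this list.
ALLOWED_TYPES = {"county", "district", "civilparish", "locality"}


def invalid_locations(namedplace, allowed):
    """Return the 'type' values of location elements that are not allowed."""
    return [
        loc.get("type")
        for loc in namedplace.findall("location")
        if loc.get("type") not in allowed
    ]


namedplace = ET.fromstring(
    "<namedplace>"
    "<location type='county'>Westshire</location>"
    "<location type='canton'>Example</location>"
    "</namedplace>"
)

# 'canton' is rejected until it is added to the list of allowed types.
print(invalid_locations(namedplace, ALLOWED_TYPES))
# Extending the list accepts the new type; the data structure is untouched.
print(invalid_locations(namedplace, ALLOWED_TYPES | {"canton"}))
```

Only the list of values changes when a new kind of geographic unit appears; the 'location' element and every record already written remain exactly as they were.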

More Than Data

I will conclude by highlighting the fact that XML is more than a data storage, retrieval and preservation mechanism; it is increasingly becoming the language of the next-generation Internet. Why should archaeologists care? Because the entire software and information management industry upon which we rely is moving away from the idea of fragmented, inefficient traditional information sets (my database, your website) to uniform and open architectures and repositories: instead of an image database for your project, think of an entire image database for your institution that all projects employ, upon which you build your own, project-specific interface. The advent of XML as the new grammar in information management is best exemplified in the state of Massachusetts' recent decision to endorse open standards to move away "from siloed, application-centric and agency-centric information technology investments to an enterprise approach where applications are designed to be flexible, to take advantage of shared and reusable components, to facilitate the sharing and reuse of data where appropriate and to make the best use of the technology infrastructure that is available."² It is a sound, open and long-term policy that is demanded, yet unparalleled, in the archaeological world.

I am all too aware that I have only glossed some of the issues involved with XML, standards, and the adaptation of technologies in the archaeological community, and have clearly covered none of these issues to the extent that they deserve. However, my primary purpose was to stick my foot in the door before it was closed prematurely on XML, and to provide an indication of its capabilities, hinting how XML will, without any exaggeration, change the way that material culture is recorded, presented, studied and preserved.³

-- Tyler Bell

To send comments or questions to the author, please see our email contacts page.

1. I use the term 'schema' here as a convenient shorthand for the three primary XML schema languages: Document Type Definitions ('DTD'), XML Schema ('schema') and RELAX NG.

2. The State has made recent headlines with its rejection of the Microsoft Office document formats, owing to its concerns about long-term preservation and Microsoft's patent strategy, opting instead for an open, XML-based format. Its Enterprise Technical Reference Model (ETRM) is worth investigating in this context.

3. The real value of Archaeological Standards is being realized by all archaeologists, not just information specialists. Readers may be interested in the impact of XML-based technologies in archaeology, to be addressed at this year's CAA meeting in Fargo, ND.

For other Newsletter articles concerning the issues surrounding digital archiving, the use and design of databases, or the use of electronic media in the humanities, consult the Subject index.

Next Article: Web Site Review: Choma, Data Publication System for the Hacımusalar Archaeological Dig

Table of Contents for the Fall, 2005 issue of the CSA Newsletter (Vol. XVIII, no. 2)
