Digital Data in Archaeology: The Database

Harrison Eiteljorg, II

(See email contacts page for the author's email address.)

The use of digital technology is now at the core of archaeological fieldwork; so I claimed in the last issue of the CSA Newsletter. (See Eiteljorg, "Digital Data in Archaeology," CSA Newsletter, April, 2012; XXV, 1.) Having asked for comments from readers and received none, I think I can make that assertion now without reservation. However, usage in practice has sometimes lagged behind the best practices found in other fields, especially in business where authority is more easily asserted. Because digital technology is so important to archaeology, I have often been concerned about the problems associated with digital technologies in archaeological settings; our usage in archaeology has too often seemed not to reflect the best of standard practices. As a result of my on-going concerns and the article restating my view about the centrality of digital technologies in the field, I was beginning again to muse aloud about the use and misuse of digital technologies in archaeology. One of our CSA board members, Richard Hamilton, persuaded me to take a relatively slow and methodical approach. As a result, I began to look at databases in use in archaeology in order to try to assess the ways scholars in our world are putting digital tools to work, setting aside the use of GIS, CAD, and other technologies for the moment. The use of databases seems the right place to begin since databases are so central for data storage, so widely used, and so well-established as a technology.

With the above as an introduction, I would suggest that each reader think a bit about what makes a good database before reading further. You might look at any number of publications on the issue, though I would obviously recommend Archaeological Computing. I started there myself, with the database chapter, to be sure that my views of the databases to be examined were based upon real criteria and not just ad hoc reactions to specific examples.

I started my examination with a database that had been archived. It was a rather simple database of objects from museum collections, mostly from the nineteenth century. The project dated back to just before the turn of the millennium, with the data entry process beginning in the year 2000. There was a primary table, and there were four related tables. All could be downloaded in a neutral format for examination, though the original work was done in MS Access. (Three of the four subsidiary tables were used simply to provide lists of materials, collections, and periods. The fourth subsidiary table provided information related to the archival number of the item.) The metadata accompanying the files were excellent, providing all that was needed to use the tables; this, of course, should be true for preserved data tables in any archives. I examined the data with considerable interest — and ultimately considerable distress. Although there were some questions in my mind about the way data were organized (using two fields for a description, for instance, when the entry in the first one became too long ¹ and using a question mark within a field as an indicator of uncertainty), those were relatively minor. The real surprise was the quality of the data.

The weight field often was empty. There was no explanation for this, and it is not clear why weight was not recorded regularly. Also, though weight was always in grams when recorded, it was always stated in the text form "27.7g" (and necessarily treated as a text field in a database), making it unnecessarily difficult to treat the number as a number in any database operation (e.g. arranging items of a given type and class by increasing or decreasing weight). The same system of including units explicitly was used to record dimensions. This seems to be a design issue; weight (and measurements), with the measuring unit made explicit in accompanying metadata, should be simple numbers in every entry.²
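The sorting problem can be made concrete with a minimal sketch. It assumes legacy entries of the form "27.7g"; the helper name parse_measure and the three sample weights are hypothetical, introduced only for illustration, and are not taken from the archived database.

```python
import re

# Hypothetical helper: split a legacy text entry such as "27.7g" into a
# number and a unit so that the number can be stored in a numeric field.
_MEASURE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")

def parse_measure(text):
    """Return (value, unit) for entries like '27.7g', or None if unparseable."""
    if text is None:
        return None
    match = _MEASURE.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()

# Made-up sample weights, stored as text the way the archived table stored them.
weights = ["27.7g", "3.2g", "115g"]

as_text = sorted(weights)                                  # character-by-character order
as_numbers = sorted(parse_measure(w)[0] for w in weights)  # true numeric order
```

Sorted as text, "115g" comes before "3.2g" because the comparison runs character by character. Once the unit is moved to the metadata and the value is stored as a number, the ordering is the expected one.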

Only a bit later did I discover a serious problem with one of the secondary tables. This was the materials table, specified in the metadata as containing the list of all materials available for use in describing objects in the primary table. As it turned out, though, many of the materials used to describe items in the primary table were not included in the materials table. That is, the database was constructed with a table to contain all possible materials for use in the primary table (according to the metadata included), but, in fact, data entry personnel apparently felt free to enter materials that were not on the list. Thus, "Bone" and "Bone ash" were acceptable entries; they are found on the list of possible materials and in the primary table as the materials of which certain objects were made. However, "Bone/coral" and "Bone/horn" were applied to objects despite not appearing on the list of acceptable entries. This should not happen, and there are ways to prevent such entries; however, it should not be necessary to prohibit data entry personnel from entering disallowed terms.
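One of the ways to prevent such entries is a foreign-key constraint between the primary table and the materials list. The following is a minimal sketch using SQLite through Python, not the MS Access setup of the original project; the table and column names (materials, objects, material) and the helper add_object are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# The controlled list of materials, and a primary table that must refer to it.
conn.execute("CREATE TABLE materials (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE objects ("
    " id INTEGER PRIMARY KEY,"
    " material TEXT NOT NULL REFERENCES materials(name))"
)
conn.executemany("INSERT INTO materials VALUES (?)", [("Bone",), ("Bone ash",)])

def add_object(material):
    """Try to record an object; return True only if the material is on the list."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO objects (material) VALUES (?)", (material,))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = add_object("Bone")        # on the list
rejected = add_object("Bone/coral")  # not on the list, so the insert is refused
```

With a constraint of this kind, a new material such as "Bone/coral" has to be added to the list deliberately before it can be used, which keeps the list and the primary table in agreement.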

A second database from the same archival repository seemed to have similar problems with empty fields and ill-defined ones, though it was more complex, using more tables in a more complicated organizational schema. The second database appeared to be more recent, but it was not clear when it was constructed. Given what I had seen of the two archival databases, it seemed that databases currently in use should be examined as a better sampling of current practices.

An email to various colleagues yielded several databases for me to examine, databases actually in use in the field today. In my email I promised anonymity to the contributors; so the language in the following may seem a bit tortured as I try to write about what I found without identifying the people, projects, or institutions. I must also make it clear that I have assumed, with no process to make it certain, that these databases are representative of those in use in the field today.

It has been impressive to see how carefully the databases in question were organized. The number of different tables was often surprising, and the careful organization was extremely impressive. Indeed, I found that the complexity of the data organization was such that it often took some serious time and effort for me to fully comprehend the overall plan. These databases showed skill, experience, and a sophisticated sense of design and organization.

The actual data, which I was able to examine in most cases, were less impressive. It is difficult to generalize, but there was a sense that the data-entry process had somehow not measured up to the quality of the designs of the databases. A few examples must suffice. (Note that I will use the old-fashioned term, field, where some might use the newer term, column. Similarly, I may use record where others might use row.)

The first database

Empty fields are inevitable, but when more fields in a given table are regularly empty than not, it seems that something has gone awry. Has the database design become divorced from the reality of the fieldwork, or is the designer neglecting to include automatic entries that should be used as defaults — or both? (A more benign explanation may apply in some cases. Entries may have been incomplete, waiting for a later time for completion.) In one example, for instance, there is a field to indicate whether or not a process has been checked. This is a yes/no field with only a checkbox available, but not one of the more than 11,000 entries has an entry for this field. Surely "no" is the default answer — and could thus be entered automatically and over-ridden only when the process has been checked. The same table has a field for the person who entered the data, and there is no entry in nearly 5% of the records. The same table has no data recorder too many times (too many being more than once since the data recorder should not even need to be entered — the database system should be able to do so automatically). Another table in the same database has a field for the date of recording of the data (another piece of information known by the computer and not needing entry by the data-entry person), and that field is empty for about 10% of the records. The same table has a field for category and one for materials, but the entry for category is often a material (e.g., "metal," "stone," "lithic," "bone," "ceramic"), and a material may or may not be listed in the materials field in those cases. Another table has no excavation date for a good number of the excavation trenches — and no trench supervisor or excavators for almost a third of them.
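The automatic entries described above can be sketched briefly. This assumes SQLite through Python rather than whatever system the project used; the table finds, its columns, the helper add_find, and the user name "jsmith" are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE finds ("
    " id INTEGER PRIMARY KEY,"
    " description TEXT NOT NULL,"
    " checked INTEGER NOT NULL DEFAULT 0,"              # 0 = no, 1 = yes
    " recorded_on TEXT NOT NULL DEFAULT CURRENT_DATE,"  # filled in by the database
    " recorded_by TEXT NOT NULL)"                       # supplied by the application
)

def add_find(description, user):
    """Insert a find; only the description is typed by the data-entry person."""
    with conn:
        conn.execute(
            "INSERT INTO finds (description, recorded_by) VALUES (?, ?)",
            (description, user),
        )

# In practice the user name would come from the login session, not be typed in.
add_find("bronze pin", user="jsmith")
row = conn.execute(
    "SELECT checked, recorded_on, recorded_by FROM finds"
).fetchone()
```

Here "no" is stored for the checked field unless someone overrides it, and the recording date is supplied by the database itself (SQLite's CURRENT_DATE is in UTC), so neither field can be left empty by oversight.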

A second database

This is another case where the excavation date is empty more than 10% of the time (and some of those records have trench supervisors listed, so the missing date seems more likely to be a data-entry error than anything else). The date of data entry is missing in this table about 25% of the time; oddly, there is a date of modification for each entry, often the same date as the date of entry. (It is possible that the date of modification was entered automatically; if so, the date of entry should have been handled automatically as well.) Workmen are identified by one name only, often a first name. Assistants (when named) are identified by an initial and last name.

A third database

Date of closing of a locus is missing for between 5 and 10% of the entries. The listing of assistants can only be described as random: sometimes first and last names, sometimes first initial and last name, sometimes just two initials (no periods), as if a nickname.

A fourth database

In one ceramics table in this database the excavation date is not there for more than 10% of the entries; trenchmaster is missing for more than 20%.

General principles

I should, at this point, restate that the complaints made about these tables relate to issues of general significance. They are not trivial complaints raised only in the course of doing this analysis. In fact, it seems appropriate to discuss these databases in terms of some of those relatively abstract principles.

I offer here a digression to illustrate one of those abstract principles. You can tell when you're talking on the telephone with someone who is filling out a database as you talk. Certain questions must be answered, even if the answer is, "I don't know." or "No opinion." The people to whom you are talking cannot continue without an answer to the question they have asked, and it is often difficult to keep your temper under control while the person on the phone insists on an answer to a question you don't want to answer. It can be an unresponsive answer, but it can't be a missing answer. Nor can it be the data gatherer's answer. It must be yours. Why? Because the statistics to be generated require that everyone answer the relevant questions — all of them. And because the person on the phone cannot go on to the next question until he/she has an answer to the present one. The system has been designed to require data entry for all the required questions. It is not difficult to do that, and database preparation for those telephone surveys is thorough. Indeed, that is just the kind of control that we might like to have in archaeology, but we rarely do.

A null field is nothing more than an empty field in a table. Thus, the null field is just the result of no data being entered in a given field, the not-permitted missing answer of the telephone survey just mentioned. While this seems a trivial matter, it is not. In many databases having entries in every field is important for statistical purposes, as mentioned. But it is also important in a more general way. After all, what does the absence of data mean? Does it mean that the data-entry person does not know the information required? (If so, was that because a paper form was not filled in properly or because the data-entry person was supposed to know on his/her own but did not?) Or does it mean that the data-entry person could not read the handwriting of the excavator? Or does it mean that the information was not recorded, though it should have been? Or does it mean that the information is somehow irrelevant in this context? Or does it mean that the information was not clear enough to be recorded? Or does it mean. . . ? Of course, a user has no way to know. That is why a null field is generally considered improper. Please don't misunderstand; I am not suggesting that there must always be a useful entry in every data field. I am suggesting that if there is not a useful entry, there should be a way to understand why there is no such entry.
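One way to make the reason for a missing entry recoverable is to pair the data field with a second field that must state why the first is empty. The sketch below is only an illustration under that assumption; the table loci, the column names, the list of permitted reasons, and the sample date are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loci ("
    " id INTEGER PRIMARY KEY,"
    " closing_date TEXT,"
    " closing_date_missing TEXT"
    "   CHECK (closing_date_missing IN"
    "          ('not recorded', 'illegible', 'not applicable', 'pending')),"
    # Exactly one of the two fields must be filled: a date, or a reason.
    " CHECK ((closing_date IS NULL) <> (closing_date_missing IS NULL)))"
)

def add_locus(closing_date=None, missing=None):
    """Insert a locus; return False if the row has neither a date nor a reason."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO loci (closing_date, closing_date_missing)"
                " VALUES (?, ?)",
                (closing_date, missing),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok_dated = add_locus(closing_date="2011-07-14")  # made-up sample date
ok_explained = add_locus(missing="illegible")    # no date, but the reason is stated
bare_null = add_locus()                          # unexplained empty field: refused
```

The design choice is that an empty date is permitted, but an unexplained empty date is not; a later user can tell "illegible" from "not applicable" without guessing.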

As noted, multiple databases examined in this process had data fields for the person responsible for a given procedure — sometimes data entry, sometimes excavation, sometimes other matters. In the actual data files, however, the responsible person's name is often missing. This is true across multiple examples. As noted above, there are also instances of names being only initials and therefore useless for a long-running operation. That may not be terribly surprising, but more basic information is often missing as well. In one database, stratum was more often empty than not. In another, a specific table was most often empty for the items' form, type, and observation. In another, the dates of conservation of items were recorded approximately 10% of the time; the conservator perhaps three quarters of the time. In that same database, but another table, findspot was often empty. In another database, fields for the clay content, silt content, sand content, matrix, compaction, color, and surface of stratigraphic units were generally empty. In virtually all of these instances the question that naturally arises is that gnawing one: "What does the absence of data mean?"

Many more examples could be cited. The important point here is that some of these problems arise from database design, others from sloppy data entry. It is simple, nearly to the point of triviality, to have a database program automatically enter today's date in a given field, for instance. It is also simple to provide a default entry (e.g., "not recorded") so that the data entry process provides that entry when no data are available. Similarly, the data entry person's name should be "known" to the computer so that it can automatically be added when required. Finally, it should be noted that a check for null fields at the conclusion of a data-entry process is not particularly difficult. (I should point out that one table had a "0" for each item's drawing number, indicating the absence of any drawing of that item, I assume. The same table had many other fields where no values had been entered. In other words, the designer prepared an automatic entry for drawing number but not elsewhere.) It must be said here that some designers would argue that a default data entry (e.g., "not recorded") may just encourage data-entry personnel to pass by the item without bothering to enter anything. Others might argue that automatically entering anything in a given field is useless if the item may be entered in some secondary data-entry process.
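A check for null fields at the end of a data-entry session might look something like the following sketch. It assumes SQLite through Python; the function name count_empty_fields, the table trenches, and the three sample rows are hypothetical.

```python
import sqlite3

def count_empty_fields(conn, table):
    """Return {column name: number of NULL or blank entries} for one table."""
    # The table name is interpolated into the SQL, so it must come from a
    # trusted source (the schema), never from user input.
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    report = {}
    for col in columns:
        query = (
            f'SELECT COUNT(*) FROM "{table}"'
            f' WHERE "{col}" IS NULL OR TRIM("{col}") = ?'
        )
        report[col] = conn.execute(query, ("",)).fetchone()[0]
    return report

# Made-up sample data: one complete record and two with gaps.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trenches ("
    " id INTEGER PRIMARY KEY, supervisor TEXT, excavated_on TEXT)"
)
conn.executemany(
    "INSERT INTO trenches (supervisor, excavated_on) VALUES (?, ?)",
    [("A. Jones", "2010-06-01"), (None, "2010-06-02"), ("", None)],
)

report = count_empty_fields(conn, "trenches")
```

A report of this kind, run at the close of each session, shows which fields are being skipped while the people who could fill them in are still at hand.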

Another of the criteria listed in the database chapter of Archaeological Computing was the absence of abbreviations. Abbreviations naturally lead to confusion. Nevertheless, they were found regularly in these examples. Abbreviations for people are bad enough, but one table of ceramic types had only abbreviations for ware, form, period, and type (with exceptions: "none" was spelled out when used, and one form for a pottery import was spelled out, "conical cup"). Some abbreviations used in the tables were relatively obvious, and a separate table held the complete terms. However, that means that an extra table has been created simply to permit abbreviations, abbreviations that are not useful, in the primary table. Abbreviations may be necessary in projects with personnel using multiple languages, but users should not see or need to use the abbreviations; they should see the full terms in the language of choice.
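For the multilingual case, the codes can stay internal while users see only full terms. The following is a minimal sketch, assuming SQLite through Python; the tables material_terms and finds, the codes 'FE' and 'CU', and the function finds_for_display are hypothetical and not drawn from any of the databases examined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE material_terms (
    code  TEXT NOT NULL,   -- compact internal code, never shown to users
    lang  TEXT NOT NULL,   -- language of the label, e.g. 'en' or 'it'
    label TEXT NOT NULL,   -- full term as displayed
    PRIMARY KEY (code, lang)
);
CREATE TABLE finds (id INTEGER PRIMARY KEY, material_code TEXT NOT NULL);
INSERT INTO material_terms VALUES
    ('FE', 'en', 'iron'),   ('FE', 'it', 'ferro'),
    ('CU', 'en', 'copper'), ('CU', 'it', 'rame');
INSERT INTO finds (id, material_code) VALUES (1, 'FE'), (2, 'CU');
""")

def finds_for_display(lang):
    """Return (id, full material term) pairs in the requested language."""
    return conn.execute(
        "SELECT f.id, t.label FROM finds f"
        " JOIN material_terms t ON t.code = f.material_code"
        " WHERE t.lang = ? ORDER BY f.id",
        (lang,),
    ).fetchall()

english = finds_for_display("en")
italian = finds_for_display("it")
```

The same records are shown as "iron" and "copper" to one user and as "ferro" and "rame" to another; nobody has to read or type the codes.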

There is simply no way to understand, though, the use of full names in one field of a table, first names only in another (apparently the same people), some odd mix of first and last names in a third, and finally another field with names that may be either first initial and last name or first and last name. Of course, none of the fields always had content. In fact, I suspect that the examples of missing or confusing names reflect the fact that the database designers asked for names to be entered where having those names served no real purpose. This may be sophisticated design run amuck.

Among the criteria for judging a database is the obvious requirement that information must be unambiguous. I admit that this requirement covers a multitude of sins. However, ambiguity in these examples seems too obvious and too frequent. In one of the databases, for instance, a table contains a field for the frequency with which certain categories of artifacts and ecofacts have been found in specific stratigraphic units. The frequency entries seem limited to "rare," "medium," and "frequent." Of course, absent documentation of the database (not provided here), there is no definition for these terms. The same database, in another table, has a field called class. Entries in that field include the following: "impasto," "bucchero," "metal," "metal slag," "spool," "votive," and "amphora." How, I must ask, can one have bucchero and amphora in the same field? What exactly is a class here? (Different objects may fall in different type designations, making the class within the type seem ill-defined. But here we see two ceramic wares and one ceramic shape in the same field. This is unacceptable.)

The question mark occurred in many instances as a kind of modifier of a term, as in "casket ?" from the archived database. But what does that mean? Documentation might explain such a usage but in the one example with documentation, it does not, and in the other instance there is no documentation. While many might think they understand ("The question mark means uncertain, of course." or "The question mark means likely but not certain." or "The question mark means the handwriting from which the data entry person was working could not be read with certainty." or "The question mark means the entry was supplied by a non-expert and not a team member whose interpretation is to be trusted."), this is the very definition of ambiguity to me, absent metadata to explain. At the very least, a user should know whether the question mark refers to the interpretation of the artifact or some data-entry difficulty.
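The distinction between kinds of doubt can be recorded in a field of its own instead of inside the term. The sketch below is one possible arrangement, not a description of any of the databases examined; the class Doubt, its categories, and the helper split_uncertain are hypothetical.

```python
from enum import Enum

class Doubt(Enum):
    """Hypothetical categories of uncertainty, kept apart from the term itself."""
    NONE = "none"
    INTERPRETATION = "identification of the object is uncertain"
    TRANSCRIPTION = "source record could not be read with certainty"
    UNSPECIFIED = "legacy '?' with no recorded reason"

def split_uncertain(entry):
    """Split a legacy entry such as 'casket ?' into (term, kind of doubt)."""
    text = entry.strip()
    if text.endswith("?"):
        # The legacy mark does not say why the entry was doubted, so the
        # honest conversion is UNSPECIFIED, not a guess at the reason.
        return text[:-1].strip(), Doubt.UNSPECIFIED
    return text, Doubt.NONE

term, doubt = split_uncertain("casket ?")
```

New entries would choose INTERPRETATION or TRANSCRIPTION explicitly; only entries converted from the old convention carry UNSPECIFIED, which at least tells the user that the reason was never recorded.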

Some of the problems found here are database design issues. Those design problems often arise from starting with paper forms and building a database as if the paper forms were the proper starting point. Such an assumption saves time, but a good database must be designed according to the needs of the data, not extant paper forms. Basic questions will likely be missed by the database designer who starts with paper forms.

A good many of the problems found in the actively-used databases reflect data entry failures. In some cases, they are problems with the data-entry process (e.g., requiring a data-entry person to enter his/her name or the date), and in others they indicate problems in persuading the data-entry personnel to provide good and useful entries. In both cases, though, I think they must be seen as failures of the larger system in which archaeologists work.

It must be said that the foregoing is one person's view. Not all would agree with those criticisms. Some of the things about which I complain would be ignored by others; some about which they might complain have been missed by me. But nobody thinks ambiguity is acceptable, and nobody wants to see null fields in abundance. Similarly, nobody wants to see data recorded for no purpose or a design that does not permit searching for items by mathematical criteria (e.g., all items with a weight or length greater or smaller than a specified number) when appropriate.

"Aaarrrggghhh!" Charlie Brown's cartoon exclamation was not really used, but, when I corresponded with my colleagues about the problems I had found, "Aaarrrggghhh!" was the sentiment. The problems with data entry fairly tumbled from my colleagues' mouths, expressed with the exasperation of those who have tried hard to cure the kinds of difficulties I have recounted here. However, not all the problems are data-entry problems. Some are design issues, reflecting poor choices by the database designer.

In general, we seem to be seeing data items required despite the fact that they may serve no purpose (all those names not supplied unambiguously), too many fields that either have limited functions or otherwise need some automatic entry to indicate that the information is not relevant in a particular instance, a certain sloppiness in data-entry habits, and a kind of generalized failure to examine the results of field work to see what is going well and what is not.

The first drafts of this article concluded with an attempt to analyze the finds more carefully and then to draw some lessons from that analysis. I have rejected that approach in keeping with the sentiments expressed in the previous article — the hope of reader response and involvement. It is my hope that readers will take the opportunity to respond with their own views of these matters. What should be done to prevent the kinds of problems illustrated in these examples? Who is responsible? Who can make the important differences? What are those important differences? Which of these problems are really important, and which are the kinds of inevitable difficulties that must be accepted in any sophisticated database? Readers' answers to those questions and others will be more varied and more useful than mine alone — if readers respond. (If you don't respond, of course, I will be obliged to offer my own, more limited comments.) So don't just sit there. Send a response to me at (user-name) nicke at (domain name — @) csanet.org. Don't wait; do it now.

-- Harrison Eiteljorg, II

Notes:

1. Microsoft Access, the database program used, permitted long note fields as of at least the version introduced in 1997, and this database was produced beginning in the year 2000.

2. I should note that FileMaker (the only database I use regularly and the one with which I checked) permitted me to declare the weight entries to be numbers and then ignored the text and ordered the items properly according to weight. The same procedure, when tried on the length measurements, did not work because some measurements were in meters and others in millimeters. I cannot say what might happen with a different database management system, but I can say that there should be no need to make a user adjust the database in order to obtain proper functioning. All weights and measures should be in the same unit of measurement and entered as numbers.
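The mixed-unit problem with the length measurements can be illustrated with a short sketch that converts every entry to millimeters before storing or comparing it. The conversion table TO_MM, the function length_in_mm, and the three sample lengths are hypothetical; the assumption is that legacy entries look like "350mm" or "2m".

```python
import re

# Conversion factors to a single unit of record (millimeters).
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}
_LENGTH = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(mm|cm|m)\s*$", re.IGNORECASE)

def length_in_mm(text):
    """Convert a legacy entry such as '350mm' or '2m' to millimeters."""
    match = _LENGTH.match(text)
    if match is None:
        return None  # leave unparseable entries for a person to resolve
    value, unit = float(match.group(1)), match.group(2).lower()
    return value * TO_MM[unit]

# Made-up sample lengths in mixed units.
lengths = ["350mm", "2m", "0.5m"]
in_order = sorted(length_in_mm(entry) for entry in lengths)
```

Stripping the unit text alone would rank "2m" below "350mm"; converting first puts the three items in their true order of size.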