Vol. XIII, No. 1
CSA Newsletter
Spring, 2000

Archiving Data Serves Multiple Purposes

Harrison Eiteljorg, II


Archiving is not only a way to protect data and to fulfill professional obligations. It is also a way to check data for completeness and accuracy. That has been brought home to ADAP personnel recently by work on two archival projects.

One of the projects includes large numbers of images that are to be stored in two formats. A search of the files showed that some of the images do not exist in both formats, something the project director did not realize. There are also some missing captions. In neither case is the loss particularly important, but the missing files would not have been noted at all without the archival work. It is now possible to make clear in the documentation that certain files are not available so that future users of the data will not waste time looking for them. This kind of benefit from archiving data is not obvious, but it is nonetheless significant, especially since project directors are unlikely to give serious consideration to the data files after they are deemed ready for the archives.
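A check of this kind is easy to automate. The following is a minimal sketch in Python, not the procedure ADAP personnel actually used. The function name, the two extensions (.tif and .jpg), and the file names are all assumptions made for illustration, since the formats involved are not named here.

```python
from collections import defaultdict
from pathlib import PurePath


def missing_formats(filenames, required=(".tif", ".jpg")):
    """Map each image name to the required extensions it lacks.

    The extensions are hypothetical; substitute the project's two formats.
    """
    found = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        # Compare extensions case-insensitively (".JPG" counts as ".jpg").
        found[path.stem].add(path.suffix.lower())
    return {
        stem: sorted(set(required) - suffixes)
        for stem, suffixes in found.items()
        if set(required) - suffixes
    }


# Illustrative file names only, not taken from either project.
files = ["trench1.tif", "trench1.jpg", "trench2.tif", "plan3.JPG"]
print(missing_formats(files))
```

The result lists each image that lacks a companion file, which is the information that belongs in the archival documentation.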

Another project has bigger problems with data tables, and they are of two general kinds, one having to do with the way the data tables were prepared and copied at the very end of the project and the other having to do with the ways the data were originally organized and recorded in data tables.

At the conclusion of the project, the data tables were prepared in two forms. One, DBF, is a standard database format; the other, called delimited ASCII, is a common way of putting tabular data into simple text form with markers to separate the individual data items. Unfortunately, the last step of such a process - comparing the resulting tables - was apparently not taken. Two pairs of supposedly identical files have quite different numbers of artifacts recorded. In one instance, 681 records of artifacts are shown in one file and 803 in the companion, supposedly identical, file; in the other instance, there were 505 records in one file and 592 in the other.
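The comparison step can be very simple. The sketch below, in Python, assumes that the record identifiers have already been read from each export into a list (reading DBF files requires a third-party library and is left out). The function name and the identifiers are invented for illustration.

```python
def compare_exports(ids_a, ids_b):
    """Summarize how two supposedly identical exports differ."""
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "count_a": len(ids_a),
        "count_b": len(ids_b),
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }


# Invented record IDs standing in for the two exported tables.
dbf_ids = [1, 2, 3, 4, 5]
ascii_ids = [1, 2, 3, 5, 6, 7]

report = compare_exports(dbf_ids, ascii_ids)
print(report)
if report["count_a"] != report["count_b"]:
    print("Record counts differ; the files are not identical.")
```

Comparing the counts alone would have caught both problem pairs; listing the identifiers found in only one file shows where to start looking.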

Here again, the archival process showed up problems that can be corrected now but that, had they remained undiscovered, might have been very difficult to correct. Archival personnel might have made the corrections in this case, but they could not have been certain that their choices were correct. Since the problems were detected at this stage, when the project director is still familiar with the records, it should be a simple matter to make the necessary changes.

The problems created by the way the data were originally organized and entered into the tables are another matter. Some simple choices were made early in the data-gathering process that will create problems for anyone trying to use the data. For instance, a quotation mark is used to indicate inches, and there are places where individual words or phrases in data entries are enclosed in quotation marks. Depending on the kind of data transfer one is doing, the use of quotation marks within data fields can corrupt the data transfer; so quotation marks (or inch marks) should not have been used. Those problems can be fixed fairly easily; they are the kinds of simple errors that come from inexperience with moving computer databases from format to format.
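The effect of an embedded quotation mark can be seen in a short Python sketch. The record is hypothetical, and the "naive" export is only one of several ways a transfer can go wrong; the csv module from the Python standard library is used for the careful version.

```python
import csv
import io

# Hypothetical record: the second field uses a quotation mark for inches.
fields = ["12", '6" rim', "lot 2"]

# Naive export: wrap every field in quotation marks without escaping.
naive_line = ",".join('"' + f + '"' for f in fields)

# Careful export: the csv module doubles any embedded quotation mark.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(fields)
careful_line = buffer.getvalue().strip()

# Read both lines back as a receiving program might.
naive_back = next(csv.reader([naive_line]))
careful_back = next(csv.reader([careful_line]))

print(naive_line)
print(careful_line)
print("naive export survives:", naive_back == fields)
print("careful export survives:", careful_back == fields)
```

The naive line does not come back as it went in, while the escaped line does. Avoiding quotation marks in the data in the first place removes the dependence on how any particular program escapes them.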

Another problem was created in many of the tables by the use of alphabetic characters in the same column as numeric dimensions; the alphabetic characters were used when the dimensions were uncertain or missing. Consider the following example: a hypothetical data table for pottery that includes a measurement (in cm.) of the diameter of each pot and that has been constructed to illustrate this issue. There are four versions of that table shown here; the first three use a question mark to indicate uncertainty about dimensions. In addition, the first table leaves a blank for missing dimensions; the second uses a question mark; the third uses a zero. The final table uses a zero for missing dimensions but a separate column to indicate uncertainty.

As individual tables expressing the information, each is useful, and the differences are not significant until one attempts to manipulate the tables in some way. For instance, one might want to average the diameters; that requires that the numbers be separated from other characters before the calculations can begin. While a computer can do that, most users of the database would not be able to write the necessary code. Only the last of the samples presents all numeric information separate from alphabetic characters so that the dimensions can easily be used, for instance, to calculate averages or standard deviations directly from the table. (In addition, that form of the table would permit the calculation to exclude all the zero entries and/or all the uncertain entries.)
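The difference in effort can be sketched in Python using the diameter values from the example table. The helper parse_mixed and the variable names are assumptions made for illustration; they show one way the cleaning might be done, not a standard routine.

```python
from statistics import mean

# Diameter column as entered in Version I (text, with "?" and a blank).
version_1 = ["88", "95", "105", "92?", "", "55"]

# Version IV: numeric diameter plus a separate certainty flag.
version_4 = [(88, "y"), (95, "y"), (105, "y"), (92, "n"), (0, "n"), (55, "y")]


def parse_mixed(entry):
    """Split a Version I entry into (number or None, certain?)."""
    text = entry.strip()
    digits = text.rstrip("?")
    certain = bool(digits) and not text.endswith("?")
    return (float(digits) if digits else None, certain)


# Version I needs cleaning before any arithmetic is possible.
cleaned = [parse_mixed(e) for e in version_1]
mean_v1 = mean(value for value, _ in cleaned if value is not None)

# Version IV can be averaged directly, with optional filters.
mean_measured = mean(d for d, _ in version_4 if d != 0)
mean_certain = mean(d for d, flag in version_4 if d != 0 and flag == "y")

print(mean_v1, mean_measured, mean_certain)
```

With the Version IV layout, excluding the zero entries or the uncertain entries is a one-line filter; with the mixed column, every user must first write and trust something like parse_mixed.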

There is also a problem with the confusion of numeric and alphabetic data entry if one wants simply to order the entries. Each of the tables is shown (on the right) with the entries ordered by diameter as they would be in a database. As is obvious, the entries are not in numeric order except in the last example. In all the other cases, the system must order the entries alphabetically, not numerically, because it must assume that mixed alphabetic and numeric columns should be treated as alphabetic. (The order is not, strictly speaking, alphabetic. Items are ordered according to their position in the ASCII character code table used for digital data; for that reason, the blank entry and the question mark yield different results. In addition, some systems will make a distinction between a space and no entry; others will not.)

 

Example Table - Version I (no entry when data are missing)

Original Entry Order                              | Re-Ordered by Diameter
ID  fabric         dia. (mm.)  ht. (cm.)  context | ID  fabric         dia. (mm.)  ht. (cm.)  context
1   gritty yellow  88          220        lot 2   | 5   gritty gray                318?       lot 4
2   gritty yellow  95          280        lot 2   | 3   gritty buff    105         310        lot 3
3   gritty buff    105         310        lot 3   | 6   gritty gray    55          98         lot 4
4   gritty buff    92?         270        lot 4   | 1   gritty yellow  88          220        lot 2
5   gritty gray                318?       lot 4   | 4   gritty buff    92?         270        lot 4
6   gritty gray    55          98         lot 4   | 2   gritty yellow  95          280        lot 2

 

Example Table - Version II (question mark when data are missing)

Original Entry Order                              | Re-Ordered by Diameter
ID  fabric         dia. (mm.)  ht. (cm.)  context | ID  fabric         dia. (mm.)  ht. (cm.)  context
1   gritty yellow  88          220        lot 2   | 3   gritty buff    105         310        lot 3
2   gritty yellow  95          280        lot 2   | 6   gritty gray    55          98         lot 4
3   gritty buff    105         310        lot 3   | 1   gritty yellow  88          220        lot 2
4   gritty buff    92?         270        lot 4   | 4   gritty buff    92?         270        lot 4
5   gritty gray    ?           318?       lot 4   | 2   gritty yellow  95          280        lot 2
6   gritty gray    55          98         lot 4   | 5   gritty gray    ?           318?       lot 4

 

Example Table - Version III (zero when data are missing)

Original Entry Order                              | Re-Ordered by Diameter
ID  fabric         dia. (mm.)  ht. (cm.)  context | ID  fabric         dia. (mm.)  ht. (cm.)  context
1   gritty yellow  88          220        lot 2   | 5   gritty gray    0           318?       lot 4
2   gritty yellow  95          280        lot 2   | 3   gritty buff    105         310        lot 3
3   gritty buff    105         310        lot 3   | 6   gritty gray    55          98         lot 4
4   gritty buff    92?         270        lot 4   | 1   gritty yellow  88          220        lot 2
5   gritty gray    0           318?       lot 4   | 4   gritty buff    92?         270        lot 4
6   gritty gray    55          98         lot 4   | 2   gritty yellow  95          280        lot 2

 

Example Table - Version IV (zero when data are missing, plus certainty fields)

Original Entry Order                                                | Re-Ordered by Diameter
ID  fabric         dia. (cm.)  certain  ht. (cm.)  certain  context | ID  fabric         dia. (cm.)  certain  ht. (cm.)  certain  context
                               dia.                height           |                                dia.                height
1   gritty yellow  88          y        220        y        lot 2   | 5   gritty gray    0           n        318        n        lot 4
2   gritty yellow  95          y        280        y        lot 2   | 6   gritty gray    55          y        98         y        lot 4
3   gritty buff    105         y        310        y        lot 3   | 1   gritty yellow  88          y        220        y        lot 2
4   gritty buff    92          n        270        y        lot 4   | 4   gritty buff    92          n        270        y        lot 4
5   gritty gray    0           n        318        n        lot 4   | 2   gritty yellow  95          y        280        y        lot 2
6   gritty gray    55          y        98         y        lot 4   | 3   gritty buff    105         y        310        y        lot 3
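The orderings shown in the four tables can be reproduced with a few lines of Python, offered here as an illustrative sketch. The dictionary names are assumptions; the values are the diameter entries from the tables, keyed by ID. Python compares text character by character according to character codes, which matches the ASCII ordering described above for these entries, though a given database system may treat blanks differently.

```python
# Diameter entries keyed by ID: text in Versions I-III, numbers in IV.
version_1 = {1: "88", 2: "95", 3: "105", 4: "92?", 5: "", 6: "55"}
version_2 = {**version_1, 5: "?"}   # question mark for the missing value
version_3 = {**version_1, 5: "0"}   # zero (as text) for the missing value
version_4 = {1: 88, 2: 95, 3: 105, 4: 92, 5: 0, 6: 55}


def order_by_diameter(table):
    """Return the IDs sorted by their diameter entry."""
    return sorted(table, key=table.get)


for name, table in [("I", version_1), ("II", version_2),
                    ("III", version_3), ("IV", version_4)]:
    print(name, order_by_diameter(table))
```

Only the numeric column of Version IV places the 105 entry last; in the text columns it sorts ahead of 55 because the character "1" precedes "5".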

 

The problems with the first three of these data tables are typical problems with database design. They do not corrupt or damage the data, nor do they prevent use of the data. They do, however, make it much harder to use the data effectively.

The problems encountered in these two archival projects were of two very different sorts. Inconsistencies in data creep into any project, and they can generally be corrected without too much difficulty - if they are found and found early enough in the history of the project to be repaired. The fact that both of these projects exhibited such inconsistencies, though at quite different levels of significance, shows how common such problems are. Preparing the data for archiving should always help with such difficulties.

Poorly constructed data tables, on the other hand, are difficult to repair and illustrate the need to employ archaeologists skilled with database design on any project so that such problems are not created. However, it is equally important that project directors be in a position to critique the work of any excavation personnel; they must not permit themselves to place uncritical reliance on computer personnel any more than they would on the pottery expert or the site architect.

-- Harrison Eiteljorg, II

