Vol. X, No. 2

   
Fall, 1997

Database Design: It's Never As Easy As It Looks

H. Eiteljorg, II


I recently used Michael Adler's data tables from the Appendix of The Prehistoric Pueblo World, A.D. 1150-1350 to illustrate some points on database preparation. The tables had been prepared as text tables, not database tables, and they had been converted to tables for the Web from straight text files in 1996. Since that time, I had made simple data tables from them, and those tables were available on the Web in DBF or ASCII form; at least one of those file formats can be used by most database systems.

A small sample of entries from one of the Web tables is shown in Table 1.

Table 1

ID District Site Government ID No. No. of Rooms No. of Kivas Beg.Date End Date

12E17

Rioabajo

Tio Bartolo

LA31736

65

 

1000

1100

12E18

Rioabajo

Pueblo Presilla

LA31720

115

 

1300+

 

12E19

Rioabajo

Presilla Del Sur

LA31741

50+

 

1100

1300

12B5

Galisteo

Colina Verde

LA309

   

1275+

1300+

12B4

Galisteo

Las Madres

LA25

47+

 

1275+

1370+

12B9

Galisteo

Pueblo Colorado

LA62

   

1275+

1600+

12B3

Galisteo

Galisteo

LA26

   

1275+

1700+

This treatment of the information is easy to understand and very straight-forward. The site name is the local name, the government ID number is the number applied by the government, and the rest of the information has been supplied by the individual scholar who compiled the information for the geographical area in question. Thus, Tio Bartolo, in the Rioabajo District, has 65 rooms, an unknown number of kivas, and was inhabited from about 1000 to 1100. Las Madres has more than 47 rooms, was first inhabited sometime after 1275 A.D., and was no longer inhabited sometime after 1300 A.D. We have no difficulty understanding that the plus sign used with the number of rooms, the beginning date, and the ending date indicates that the numbers are not absolute but indicate a threshold.

When I put this same information into a true database, rather than a Web table, nothing much changed. Since the Web tables had been divided by geographical region and were combined into a single table, though, I did add another field to indicate the region (these sites, for instance, are all from the Southern Rio Grande region). The original information was not altered. Having the data in a true database instead of a simple Web table, I could use the information as I think most of us would - to order or group the entries according to criteria such as the date of first habitation, number of rooms, or number of kivas. There were some difficulties, however.

Any category with a plus sign in an entry-to indicate that the number shown was not absolute but a threshold-had to be treated by the computer as containing text, not numbers. An entry such as 49+ is not a number and must be treated by the computer as the characters 4, 9, and +. So all entries for number of rooms, number of kivas, beginning date, and ending date were treated by the computer as groups of letters and numbers. Their meanings are not lost to us, but the computer does not see them as quantitative indicators. So, when asked to put the entries in order by site beginning date, the computer does not understand that there are numbers involved. It orders the entries alphabetically, not numerically. That is not a problem when the entries contain numbers that are all of the same magnitude. In this case, for instance, all the beginning dates are in the second millenium A.D.; so they will be put in the proper order, whether by alphabetic or by numeric sequencing. However, the computer would put the entries for numbers of rooms in the following order, lowest to highest, if going alphabetically: 115, 47+, 50+, 65. The numeral 1 is placed before 4 when alphabetizing; so 115 comes first. The computer would thus put 115 before 4 and 40 if it were alphabetizing a range of numbers, not to mention 2, 20, 3, 30, 5, 50, . . . .

The same problem causes even more acute difficulties if we want to find entries in a date or numeric range. Say, for instance, that we want to select the sites with more than 40 rooms. The computer, treating 40 as coming after any entry beginning with 1, 2, or 3 in an alphabetic arrangement, would omit all sites with entries for numbers of rooms from 100 to 399, from 1,000 to 3,999, from 10,000 to 39,999, and so on. Furthermore, since 4 comes before 5, 6, 7, 8, and 9 alphabetically, the computer would include any sites with 5 to 9 rooms.

How do we correct this problem? As I suspect most readers have realized, it is not particularly difficult. First, we have to split the information into two parts, the number and the plus sign (the indication that the number provides a threshold). Second, we have to tell the computer that the numbers themselves should be treated as just that, numbers, and that they should be searched according to numeric, not alphabetic, criteria. That is not a problem when we separate the numbers from the non-numeric character. (For a similar problem and solution, see "How to Transfer Your DOS Database . . . in 659 Easy Steps" in this issue)

The new data table is shown in Table 2, with four new data categories, a plus sign relating to number of rooms, another for number of kivas, another for beginning date, and another for ending date.

Table 2

ID District Site Government ID No. No. of Rooms + No. of Kivas + Beg.Date + End Date +

12E17

Rioabajo

Tio Bartolo

LA31736

65

     

1000

 

1100

 

12E18

Rioabajo

Pueblo Presilla

LA31720

115

     

1300

+

   

12E19

Rioabajo

Presilla Del Sur

LA31741

50

+

   

1100

 

1300

 

12B5

Galisteo

Colina Verde

LA309

       

1275

+

1300

+

12B4

Galisteo

Las Madres

LA25

47

+

   

1275

+

1370

+

12B9

Galisteo

Pueblo Colorado

LA62

       

1275

+

1600

+

12B3

Galisteo

Galisteo

LA26

       

1275

+

1700

+

This database could be printed to look just like the original, with the number of rooms for Presilla Del Sur, for instance, shown as 50+, but the actual data entries would place 50 (rooms) and the plus sign in separate categories to indicate that that number is a threshold, not the actual number.

No information has been lost or removed in this process, but the indicator that the entry is the threshold number (the plus sign) has been separated from the number itself. Now when we order the entries according to the number of rooms or the date of first habitation, the order will be correct,(1) and when we search for all sites with more than 40 rooms or a beginning date after 1150 A.D., we can be confident that the search will not miss any of the entries that should be included.

This rather simple example of a data table illustrates how important the design of such a table can be. The original table we posted on the Web for downloading was much more difficult to use than the new table that we have created. Now both are available at the ADAP Web site (http://www.csanet.org/archive/adap), but the revised version is much easier to use.

A myriad of similar design issues confronts the designer of a typical database, and the problems for recording field data are often so severe that there are more-and more complex-design questions than one might expect. That is why it is so important to involve people with database experience in the construction of databases for recording archaeological data.

-- H. Eiteljorg, II

To send comments or questions to the author, please see our email contacts page.


For other Newsletter articles concerning the Archaeological Data Archive Project, the use of electronic media in the humanities or issues involving the use of databases, consult the Subject index.

Next Article: Where is the Information?

Table of Contents for the Fall, 1997 issue of the CSA Newsletter (Vol. X, no. 2)

Master Index Table of Contents for all CSA Newsletter issues on the Web

Go to CSA Home Page


(1) It is actually possible to perform the searches required and to order the entries as we want without separating the numbers and the plus signs, but the programming required involves much more computer skill than should be needed by users of archaeological data sets. A good data design should not require a significant level of computer skill on the part of users. Return to body of article.