From the sorting table to the web: The NPAP research data portal for ceramics

Vladimir Stissi, Jitte Waagen, and Nienke Pieters


Introduction

This paper comes as a response to the publication of "Digital Data in Archaeology: The Database," by Harrison Eiteljorg, II, in the CSA Newsletter, XXV, 2; September, 2012. In this paper we present the database developed by the research group "New Perspectives on Ancient Pottery" (NPAP), of the University of Amsterdam, in 2007-2012. In presenting our project database, including its aims and user experiences, we want to tackle some of the questions posed by Mr. Eiteljorg.

Background (by Vladimir Stissi)

The project New Perspectives on Ancient Pottery (NPAP), a large research project on ceramics from three different archaeological fieldwork projects, commenced in 2007. The project has been funded by the Utopa foundation and the University of Amsterdam. The NPAP involves six PhD dissertations and postdoctoral research on pottery-related regional research. This research has been centered on three ongoing projects: excavation at ancient Satricum, Italy; survey and excavation on the island of Zakynthos; and survey and excavation around ancient Halos, both in Greece. The project’s research questions involved the study of large assemblages of ceramic finds in many different ways, as well as comparisons between materials of different periods from several sites. We knew of no existing database that could cope with such wide-ranging research; so we decided to develop a new one. The new database was planned to be flexible and universally applicable, both when entering data and querying them; in addition it was designed in such a way that its contents may be easily incorporated and presented in an open and user-friendly web environment so that anyone interested can view and perhaps use the data.

Designing this database proved to be a long and complex process in which several of the pitfalls discussed in the CSA Newsletter article previously mentioned came up. A pitfall appeared, for example, when basic terminology and definitions were discussed. Similar or even identical objects and phenomena are labeled in many ways by the archaeologists who work with them, and there is also the reverse problem: the same word can indicate various things. For example, terms like slip, glaze and gloss can have very different meanings depending on who is using them. In the database, specific choices had to be made, and this entailed, amongst other things, the process of formalization and justification of labels, that is, making strict definitions of the content of fields and making sure that like goes with like and that descriptive categories are split up into their basic components. Thus the black and the gloss of a black gloss item end up in different fields, together with some fields defining the gloss more precisely. As one can imagine, specialists in different categories of material are not happy to part with the categories they are used to working with, or even their labels. To help coordination between these specialists and communication with users, the project team has compiled encyclopedic lists describing all fields and terminology as used. Together these lists form a users’ guide for the database, which can also be used as a handbook for fieldwork pottery processing.

But some more babies had to be killed. Not every ceramic feature can be included, even in a quite universal database like this. Where to stop? This is basically the problem of making categories and deciding their level of detail. Experts accustomed to meticulous documentation of objects and making extensive catalogues find it difficult to accept that a database is not a catalogue in the descriptive sense, but a tool that can help answer research questions through quantification of data, which means categories have to be clear-cut and countable (and easy to define and enter). It was only with great pain that some experts accepted our conviction that not every decorative motif or detail (e.g., varieties of a base shape) could be included in the database; although this is theoretically possible, it is not very useful to create fields for categories of data that could have 15 different values but occur only 30 times. On the other hand, very frequent phenomena, which can be covered by a few values, are not always easy to capture accurately and simply, especially when visual information like colours and decorative patterns is involved. It was quite a struggle to bring together as many wishes as possible, and one of the results is that the resulting database still covers an enormous number of fields; in fact, so many that research aimed at deciding which database fields can be disposed of without effect has become part of the NPAP aims. A related, difficult issue is that any rich and complex database is very labor-intensive, both when entering and when querying data. Moreover, such a database may only become usable in the relatively long term, possibly by people who have not done this work themselves. Of course, in the end all this is a matter of balance. We hope to have reached that.

Database (by Jitte Waagen)

As follows from the above, the main challenge of the NPAP database has been to allow integration and analysis of detailed ceramic data of large assemblages from different archaeological research projects. To match this goal and related research questions, requirements have been set for the capabilities and functionality of the database. In this section, we present a succinct overview of this general functionality as a framework for development and then turn to more specific design issues. After that, we address some issues raised by Mr. Eiteljorg in the article previously referenced. We consciously leave out a detailed description of the software and scripts used, since this is not really relevant to the discussion.

The database was designed to respond to a series of essentials for the project research. One of them was facilitating the entering of a large amount of information of either batches of objects or individual artefacts (sherds and pots). This information comes from archaeological fieldwork projects, both excavations and field surveys. It was deemed important that these data not be disassociated from those projects; therefore, original project identifiers should be preserved for the material in the NPAP database. Also, the database was designed to provide for the recording of all types of ceramic objects, not exclusively potsherds, since this proved essential for addressing specific research questions, especially those concerning the manufacturing process. Accordingly, we created a general descriptive framework for the properties of ceramics. This enabled comparisons between different kinds of objects (e.g., sherds and figurines) as well as between similar objects of different date or geographic origin. For our purposes, the same terminology should be applied to the surface treatment of, for example, a figurine from Hellenistic Halos in Thessaly, Greece, and the surface treatment of Archaic pottery from Satricum, Lazio, Italy. The descriptive framework was also designed to include a solution to deal with biases, uncertainty and the ambiguity of empty fields (is information unknown or simply not recorded?). Moreover, quickly processing substantial amounts of material demanded that the recording interface and its functions be optimized for fast entry. Finally, the multi-user setup of a database fit to store data of many projects requires a profiling strategy, permitting part of the user interface to be customized depending on the specialist working with it.

It may be clear that the requirement to record a range of different specialist information (including typological information for a broad range of ceramic artefacts, as well as subsets of data on varied topics such as decoration and petrography) involves a certain complexity of the database schema. In fact, the NPAP database consists of 44 separate tables used for entry, normalized almost without exception up to fourth normal form. Of these, one-third are associated by one-to-one relationships in order to reduce the size of the main parent table storing the object data, as well as to avoid unwanted empty fields (you can omit, for example, descriptions of decoration for undecorated objects). All tables are related using independent primary keys to ensure efficacy and simplicity. Original project identifiers have been incorporated as attributes in various tables, permitting links to the projects, but they have not been used as primary keys.
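
The key structure described above can be illustrated with a minimal sketch. It is not the NPAP schema: the table names, column names and identifiers are hypothetical, and SQLite (through Python) is used only because it is easy to run. The sketch shows an independent primary key on the main object table, the original project identifier stored as an ordinary attribute, and a one-to-one table that simply has no row for an undecorated object.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: names and columns are illustrative, not the NPAP schema.
conn.executescript("""
CREATE TABLE ceramic_object (
    object_id          INTEGER PRIMARY KEY,  -- independent key, no archaeological meaning
    project            TEXT NOT NULL,
    project_identifier TEXT NOT NULL         -- original fieldwork ID, kept as an attribute
);
CREATE TABLE decoration (
    -- shares the parent's key, so each object has at most one decoration row
    object_id INTEGER PRIMARY KEY REFERENCES ceramic_object(object_id),
    technique TEXT NOT NULL
);
""")

# One decorated and one undecorated object (invented identifiers).
decorated_id = conn.execute(
    "INSERT INTO ceramic_object (project, project_identifier) VALUES (?, ?)",
    ("Satricum", "S-0001"),
).lastrowid
conn.execute(
    "INSERT INTO ceramic_object (project, project_identifier) VALUES (?, ?)",
    ("Halos", "H-0001"),
)
conn.execute(
    "INSERT INTO decoration (object_id, technique) VALUES (?, ?)",
    (decorated_id, "painted"),
)

# The undecorated object needs no empty decoration fields; the join yields NULL.
rows = conn.execute("""
    SELECT o.project_identifier, d.technique
    FROM ceramic_object AS o
    LEFT JOIN decoration AS d ON d.object_id = o.object_id
    ORDER BY o.object_id
""").fetchall()
print(rows)
```

In this arrangement the main table stays narrow, and the absence of a decoration row, rather than a row of empty fields, records that decoration does not apply.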

The first step towards creating a general descriptive framework that could be used for all ceramic objects was to agree on a vocabulary shared by different specialists over different material classes. In addition to the methodological approach of formalization and justification described in the first section, three techniques of database design have been instrumental: atomicity, lookup lists and required fields. Atomicity refers to splitting up descriptive terms into their smallest possible components; for example, the pot shape bell-krater should be split into a specific shape, krater, with a specific shape type, bell. Rigorously applying atomicity, formalization and justification of descriptions allows for a comparison of artefacts that transcends the borders of conventional schemes. However, it is not only the grammar of descriptions that must be similar, but also the actual words. To facilitate the use of the same descriptive terms anywhere in the database, we created a total of 80 tables that appear to a user as lookup lists. To reduce redundancy and avoid data inconsistency, the descriptive terms for all database fields that involve similar terminology, such as marking location and decoration location, come from one central lookup list. To force database users to record the values that are considered key attributes for comparison, we have introduced required fields. Finally, to enable a comparison between the atomized descriptions of the NPAP descriptive framework and conventional labels such as black gloss, the latter have been incorporated as attributes as well. In this way, for every entry in the database, the atomized descriptions may still be compared with conventional classification systems, allowing an evaluation of the strengths of both systems. This conversion of atomized descriptions into more conventional ones is accomplished automatically, without user action.
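
As a rough illustration of these three techniques, the following sketch stores the shape and the shape type of the bell-krater example as separate fields, checks both against lookup lists, treats both as required, and derives the conventional label automatically. The class, the lookup values other than krater and bell, and the label rule are assumptions made for the example; the actual NPAP lookup tables and conversion logic are not reproduced here.

```python
from dataclasses import dataclass

# Illustrative lookup lists; only "krater" and "bell" come from the text.
SHAPES = {"krater", "cup"}
SHAPE_TYPES = {"bell", "column"}


@dataclass(frozen=True)
class ShapeDescription:
    shape: str       # required field, atomic component
    shape_type: str  # required field, atomic component

    def __post_init__(self) -> None:
        # Lookup-list check: only agreed terms may be recorded.
        if self.shape not in SHAPES:
            raise ValueError(f"unknown shape: {self.shape!r}")
        if self.shape_type not in SHAPE_TYPES:
            raise ValueError(f"unknown shape type: {self.shape_type!r}")

    def conventional_label(self) -> str:
        # Derived automatically from the atomic fields (assumed naming rule).
        return f"{self.shape_type}-{self.shape}"


description = ShapeDescription(shape="krater", shape_type="bell")
print(description.conventional_label())
```

Because the conventional label is computed rather than typed, it cannot drift away from the atomic fields it summarizes.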

Whereas information extracted from detailed pottery analysis is influenced by personal, possibly subjective, judgments and involves a variable degree of certainty, data in a database can appear objective and certain despite its similar origins. Database design must therefore incorporate some strategy for dealing with uncertainties and possible biases. In the NPAP database, this issue is handled in a rather straightforward way by applying three complementary techniques: automatically registering the database users, incorporating the notion of doubt and explicitly differentiating between unknown values and unclear values. The notion of doubt is not expressed in any fine-grained system of degrees of certainty, but is recorded as a Boolean (true/false).¹ In case the specialist simply does not know a specific value, he or she can select unknown from the lookup table; in case the specialist is not able to provide a probable value, the option unclear is available. Following this logic, all fields with no value at all (null) can be interpreted as irrelevant to the recorded item. Whether this interpretation holds admittedly depends on the rigor of the users, and this remains a potential weak spot.
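
The distinctions drawn above can be summarized in a small sketch. The names used here (FieldValue, interpret and the returned labels) are hypothetical and serve only to make the logic explicit: a Boolean doubt flag accompanies a recorded value, unknown and unclear are ordinary lookup values, and a null is read as "not applicable".

```python
from dataclasses import dataclass
from typing import Optional

# Special lookup values named in the text.
UNKNOWN = "unknown"
UNCLEAR = "unclear"


@dataclass(frozen=True)
class FieldValue:
    value: Optional[str]  # None stands for a database null
    doubt: bool = False   # doubt is a plain true/false flag, not a percentage


def interpret(field: FieldValue) -> str:
    """Classify a recorded field according to the conventions described above."""
    if field.value is None:
        return "not applicable"
    if field.value == UNKNOWN:
        return "unknown"
    if field.value == UNCLEAR:
        return "unclear"
    return "recorded with doubt" if field.doubt else "recorded"


examples = [
    FieldValue("krater"),
    FieldValue("krater", doubt=True),
    FieldValue(UNKNOWN),
    FieldValue(UNCLEAR),
    FieldValue(None),
]
labels = [interpret(f) for f in examples]
print(labels)
```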

Finally, to deal with the requirement to facilitate fast entry, forms were optimized for specific users and uses. Where the actual database schema may be an intricate composition of interrelated tables reflecting the inherent complexity of the dataset, forms are used to provide a comprehensible tool to view and record data without the interference of that complexity. For regular entry, we designed a form with a tab for each subset of specialist information to keep the interface as clean as possible. In addition, we designed fast entry forms which present a summary of the most-used and required fields for every single record. This saves the time required to go through separate tabs when quickly processing a considerable number of objects. We also created two critical functions to avoid redundant clicking: a copy function and a batch-processing function. The copy function appears on the main form, as well as on all tabs, to allow duplication of all values of the last entered record. This eliminates tedious clicking in cases where, for example, a user is processing a large group of sherds only the measurements of which are different. The batch-processing function is another solution to the same problem. It simply allows the user to enter a group of objects, complete all common fields for those objects, and then ‘explode’ that group into single records to allow further individual editing. As its functional opposite, it is also possible to merge a set of records based on a common identifier. These functions combine to make entering groups of objects with many similar properties more efficient.
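
The 'explode' and merge functions can be pictured with the following sketch. The record layout and field names are hypothetical, and the merge shown here makes a simplifying assumption that the text does not state: the merged record keeps the field values of the first record and only sums the counts.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ObjectRecord:
    group_id: str       # common identifier shared by a batch (hypothetical field)
    count: int          # number of objects this record represents
    object_type: str
    fabric_colour: str


def explode(batch: ObjectRecord) -> List[ObjectRecord]:
    """Turn one batch record into single-object records with the same common fields."""
    return [replace(batch, count=1) for _ in range(batch.count)]


def merge(records: List[ObjectRecord]) -> ObjectRecord:
    """Merge records sharing one identifier into a single batch record.

    Assumption for this sketch: the first record's field values are kept.
    """
    if not records:
        raise ValueError("nothing to merge")
    first = records[0]
    if any(r.group_id != first.group_id for r in records):
        raise ValueError("records do not share a common identifier")
    return replace(first, count=sum(r.count for r in records))


batch = ObjectRecord(group_id="G1", count=3, object_type="sherd", fabric_colour="red")
singles = explode(batch)
print(len(singles), merge(singles) == batch)
```

A real merge would also have to decide what to do with fields that were edited individually after the explosion; that decision is left open here.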

Additionally, a series of quite common form functions has been created to enhance clarity, completeness and consistency of our data. For administrative purposes the specialist name, entry location and date/time information are automatically stored whenever a user makes a change to any table. Input masks have been created to check consistency of quantitative data. Cascading selection boxes have been created to limit the choices in lookup lists, for example, where a specific surface treatment technique would imply a specific subset of possible values in the material lookup table.

As for documentation of our database and related scripts, all tables, fields and values have been amply described and will be published in a users’ guide to accompany the database. The actual code has been commented upon in detail in the scripts themselves. The application and its software architecture, processes and user interface will be documented following the principles of Unified Modeling Language (UML). Using this modeling language, we created graphs and diagrams showing how all parts of the system actually interact.

In the last part of this section we turn to issues raised by Mr. Eiteljorg in the CSA Newsletter article. It is our contention that for decent database design, the designer should have studied the basic theoretical principles of database normalization and optimization. These principles are comprehensible and straightforward: the entity-relationship model is fairly easy to learn, and the 5 (or 6, if you will) rules of database normalization demand no more than a bit of abstract thinking and some effort to understand their implications. There are many well-written texts addressed to the non-specialist archaeologist, of which the CSA’s guide Archaeological Computing is an excellent point of departure. The responsibility to produce a decent database therefore lies with the designer, in the same way that the pottery specialist or the petrologist is expected to master his or her field. In our view, the more difficult issues concerning database design are of a more methodological nature: does the database actually reflect the complexity of the archaeological problem for which it is designed? How should one deal with ambiguity, uncertainty and the formalization of descriptions?

One of the most obvious problems is presented by classification systems, as has been dealt with throughout this article. In summary, the question is how to apply a generally valid taxonomy to conventional object classifications that are regularly a mixture of categories related to appearance, function, material or a combination of those. In approaching this issue, we consider our database schema an experimental design that does not claim to provide a definite solution. We do, however, aim to make a valid contribution to the relevant discussion.

Another problem concerns table relationships and the right location of descriptions in archaeological databases. Although we archaeologists prefer to deal with whole entities of material remains (pots, tombs, structures), what we mostly find and study are fragments of those remains (sherds, traces, building materials). This may present tough problems in a normalized database because that database requires strict, often hierarchical, dependencies: a sherd belongs to a pot. The characteristics of the pot are the ones you want to describe, but 99% of the time you only have a collection of sherds (possibly) related to one pot. This results in problems concerning the place of descriptive terms (do I describe the color of the pot or the color of the sherd?) and, depending on the solution of choice, also in problems with redundancy (do I describe every sherd of a pot as being colored white?) and quantification (do I group all white sherds in one record and what does that record represent?). Similar problems may arise in cases where multiple associations are possible. Take for example a database recording the stone fragments that may belong to several groups of statues spread over three terraces: you may be able to determine which statue a fragment belongs to; you may only be able to identify the series of statues a fragment comes from; or you may only be able positively to identify which terrace the fragment must be associated with. Given that all terraces, series and statues are entities with attributes, you must first have a system that allows relating the fragments to different possible parent entities, and, second, have a rigorous approach towards describing and quantifying all those remains.
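
The statue example can be sketched as follows. This is only one possible arrangement, offered as an assumption rather than a recommendation: each fragment carries an optional reference to a terrace, a series and a statue, and the most specific reference that could be established determines the level at which the fragment is counted. All names and identifiers are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fragment:
    fragment_id: str
    terrace: Optional[str] = None  # filled in when only the terrace is certain
    series: Optional[str] = None   # filled in when the series of statues is known
    statue: Optional[str] = None   # filled in when the individual statue is known


def association_level(fragment: Fragment) -> str:
    """Return the most specific parent entity the fragment is linked to."""
    if fragment.statue is not None:
        return "statue"
    if fragment.series is not None:
        return "series"
    if fragment.terrace is not None:
        return "terrace"
    return "unassociated"


fragments = [
    Fragment("F1", terrace="T1", series="S1", statue="ST1"),
    Fragment("F2", terrace="T1", series="S1"),
    Fragment("F3", terrace="T2"),
]
levels = Counter(association_level(f) for f in fragments)
print(levels)
```

Counting by level makes explicit that a total per statue and a total per terrace are answers to different questions, which is the quantification problem raised above.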

In the case of the NPAP database, we resolved this issue by using one record for either a batch of material or an individual item, and we then used the batch/merge functions to be able to transform our data between the two. This way we keep all our related tables and attributes connected to the same main table. Pots can be entered as a parent entity in a one-to-many relationship with the main object table. We do immediately recognize here that this is not a definitive solution; its main drawback is the care needed in querying for quantitative data. Moreover, we are aware of other potential solutions to such problems. However, we have not seen many examples of archaeological databases which deal with this issue adequately and therefore share our own solution in the hope others will also do so.

A user's perspective (by Nienke Pieters)

As a PhD student, I use the NPAP project database to document and analyze ceramic data from the Zakynthos Archaeology Project, one of the sub-projects within NPAP. The database facilitates fast entry of the stylistic, morphological and fabric characteristics for each ceramic object. The ceramic assemblage I study comes from 28 survey sites that range in date from the Prehistoric to the Roman period. At the moment about 10,800 ceramic fragments have been entered in the database. As I am normally dealing with very badly preserved survey material, I use only a few of the entry fields available in the database. When I am dealing with diagnostic features, however, typology and surface treatment fields are entered as well. Most of the time diagnostic features are not preserved, and I only enter the object type, object part, preservation, fabric color, fabric code, metrics and chronology. Survey ceramics often cannot be dated specifically, and fortunately the NPAP database allows for a broad dating system such as Archaic-Hellenistic or Neolithic-Greek Iron Age. When required, especially for the diagnostic fragments, the chronology can be made more specific with a prefix, such as Early, Middle or Late, to indicate phases of the main period, and a further refinement within Early, Middle and Late is possible. In addition, absolute dates can be added alongside relative ones. For me, all this is very useful because the fragments which are dated specifically fit into the same chronological framework as broadly-dated specimens, and the overall system remains coherent.
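
One way to picture a chronological framework in which broad and specific dates coexist is to treat every date as a range over an ordered list of periods. The sketch below is an assumption for illustration only: the period list is a simplified, invented sequence, the Early/Middle/Late phases and absolute dates are left out, and nothing here reflects how the NPAP database actually stores chronology.

```python
from dataclasses import dataclass
from typing import Tuple

# Simplified, illustrative period sequence (not the NPAP period list).
PERIODS = ["Neolithic", "Bronze Age", "Iron Age", "Archaic",
           "Classical", "Hellenistic", "Roman"]


@dataclass(frozen=True)
class DateRange:
    start: str
    end: str

    def __post_init__(self) -> None:
        # list.index raises ValueError for a period that is not in the list.
        if PERIODS.index(self.start) > PERIODS.index(self.end):
            raise ValueError("start period must not be later than end period")

    def span(self) -> Tuple[int, int]:
        return PERIODS.index(self.start), PERIODS.index(self.end)

    def overlaps(self, other: "DateRange") -> bool:
        a_start, a_end = self.span()
        b_start, b_end = other.span()
        return a_start <= b_end and b_start <= a_end


broad = DateRange("Archaic", "Hellenistic")   # broadly dated survey sherd
specific = DateRange("Classical", "Classical")  # diagnostic fragment
roman = DateRange("Roman", "Roman")
print(broad.overlaps(specific), broad.overlaps(roman))
```

Because a specific date is just a range of length one, broadly and specifically dated fragments can be queried with the same overlap test.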

Not being a database specialist, I started to use the database with only basic knowledge. However, I soon realized the need to be aware of how the database worked "behind the screen," to understand how tables are related to each other or to reference tables and what those relationships mean. This decreases the possibility of lost data, wasted time and frustration. Apart from stylistic and morphological features, I pay special attention to the fabric of each ceramic object to better understand ceramic production on the island through time. The fabric tab in the database allows for a detailed description of the fabric for each individual fragment. Fields like colour, individual inclusion characteristics, and sorting are only a few of the possibilities. To arrive at a simple and practical labeling system, however, I use a coding system that brings most of these properties together in a code. A code might, for example, look like this: III-P-S-A-Ve-31. One problem I encountered when incorporating the fabric code field in queries was that using the hyphen, "-," in text fields could frustrate querying. Since the hyphen is also the SQL subtraction operator, it may be interpreted in an unforeseen way in certain query constructions, yielding unexpected results. In ArcGIS, I stumbled upon similar problems, which temporarily impeded me from visualizing fabric distributions in a spatial environment. Being aware of such problems beforehand makes it much easier to deal with complex mass entry.
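
The hyphen problem can be reproduced in a small, hypothetical example. SQLite is used here purely for illustration; the software in which the problem was actually encountered is not described in this paper, and its behaviour may differ in detail. The sketch shows that a fabric code pasted into a query without quoting is read as a subtraction of column names, whereas passing the code as a bound parameter treats it as plain text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names.
conn.execute("CREATE TABLE sherd (sherd_id INTEGER PRIMARY KEY, fabric_code TEXT)")
conn.executemany(
    "INSERT INTO sherd (fabric_code) VALUES (?)",
    [("III-P-S-A-Ve-31",), ("II-P-S-B-Ve-12",)],  # second code is invented
)

# Unquoted, the code is parsed as III minus P minus S ... minus 31,
# so the database looks for columns named III, P, S, A and Ve.
try:
    conn.execute("SELECT sherd_id FROM sherd WHERE fabric_code = III-P-S-A-Ve-31")
    unquoted_failed = False
except sqlite3.OperationalError:
    unquoted_failed = True

# Passed as a bound parameter, the code is always treated as a text value.
matches = conn.execute(
    "SELECT sherd_id FROM sherd WHERE fabric_code = ?",
    ("III-P-S-A-Ve-31",),
).fetchall()
print(unquoted_failed, matches)
```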

Planning well beforehand is an essential quality in the design of a database, but creating a database flexible enough to cope with issues that may arise as research evolves is also very useful, as I came to know from personal experience. To better understand erosion processes in relation to the archaeological record, I decided to document the measurements of each ceramic object. After the entry of metrics for 8,000 objects, I realized this was not feasible for all the material, simply because it is too time-consuming. In response, we used the entered data to deduce – using a Jenks natural breaks algorithm – a new entry system with 5 size categories. This facilitates fast entry much better. This would not have been possible if the NPAP database were static, and it should be noted as one of the merits of its functionality. (The labour spent on the 8,000 objects already measured was not wasted. Those measurements provided the base data for generating the 5 size categories we now use.)
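
For readers unfamiliar with natural breaks classification, the following sketch shows the idea on invented numbers. It uses an exact dynamic-programming formulation (often called Fisher-Jenks) that chooses class boundaries minimizing the squared deviation within each class; it is not the implementation used in the project, which is not described here, and the measurements and the choice of three classes are purely illustrative.

```python
from bisect import bisect_left
from typing import List, Sequence


def jenks_breaks(values: Sequence[float], n_classes: int) -> List[float]:
    """Return the upper bound of each class, minimizing within-class squared deviation."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and the number of values")

    # Prefix sums give the squared deviation of any slice in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for v in data:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def ssd(i: int, j: int) -> float:
        """Sum of squared deviations from the mean for data[i:j], with j > i."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best total deviation when data[:j] is split into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk back through the stored split points to collect class upper bounds.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


def size_category(value: float, breaks: Sequence[float]) -> int:
    """Map a measurement to a 1-based size category; larger values fall in the last class."""
    return min(bisect_left(breaks, value), len(breaks) - 1) + 1


measurements = [1, 2, 3, 10, 11, 12, 50, 51, 52]  # invented values
breaks = jenks_breaks(measurements, 3)
print(breaks, size_category(11, breaks))
```

Once the breaks are fixed, new material only needs to be assigned to a category, which is the time saving described above.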

Another very useful facet of the database is the direct link I am able to make with ArcGIS. For my research it is very important to show spatial distribution patterns of shape, style and fabric features. The project identification field allows for a direct link with the tract database used in the fieldwork project. With a make-table query I can create a new table that shows, for example, how many individual sherds per survey tract belong to fabric A, B, C etc. and in which chronological periods they were used. This new table can immediately be joined to the density tract map available as a shapefile in ArcGIS in order to show a diachronic pattern of fabric groups.
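
A make-table query of this kind might look like the following sketch. The table and column names and the sample rows are hypothetical, and SQLite's CREATE TABLE ... AS SELECT is used as a stand-in for the make-table query of whatever database software is in use. The resulting table has one row per tract, fabric and period, which is the shape needed for a join on the tract identifier in a GIS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sherd table with invented rows.
conn.execute(
    "CREATE TABLE sherd ("
    "sherd_id INTEGER PRIMARY KEY, tract_id TEXT, fabric TEXT, period TEXT)"
)
conn.executemany(
    "INSERT INTO sherd (tract_id, fabric, period) VALUES (?, ?, ?)",
    [
        ("T1", "A", "Hellenistic"),
        ("T1", "A", "Hellenistic"),
        ("T1", "B", "Roman"),
        ("T2", "A", "Roman"),
    ],
)

# "Make-table" step: store the aggregated counts as a new table.
conn.execute("""
    CREATE TABLE tract_fabric_counts AS
    SELECT tract_id, fabric, period, COUNT(*) AS sherd_count
    FROM sherd
    GROUP BY tract_id, fabric, period
""")

counts = conn.execute("""
    SELECT tract_id, fabric, period, sherd_count
    FROM tract_fabric_counts
    ORDER BY tract_id, fabric, period
""").fetchall()
print(counts)
```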

Brief conclusions

In conclusion, it should be clear that creating a database with the scope and goals set out by the NPAP project poses serious challenges. To meet these, we invested a considerable amount of time in discussion and technical development, which provided the essential background for the design of a complex database. The detailed descriptive framework of the NPAP database requires a fine-tuned database model, properly designed to keep our data as unambiguous, complete and consistent as possible. In due time, we hope to reap the fruits of using this descriptive framework and to see how much we have achieved in dealing with the complexities of the archaeological ceramic record. We hope that this is a welcome contribution, even if not the final word, and we look forward to the response of the research community.

-- Vladimir Stissi, Jitte Waagen, and Nienke Pieters

Notes:

1. This is because indicating a degree, such as a "30% certainty that this sherd belonged to a krater," is fairly subjective. Furthermore, it risks creating patterns in the database that say something about the specific specialist rather than the actual data.