CSA Newsletter, January, 2009: Open Context: Developing Common Solutions for Data Sharing

Open Context: Developing Common Solutions for Data Sharing

Eric Kansa and Sarah Whitcher Kansa

Overview

Many collections created in ongoing archaeological surveys or excavations are managed by small research teams or small institutions with little capacity to develop their own web-accessible database solutions or other public access to the data files created by the project. Without greater access and dissemination, many of these collections remain obscure, ignored, and under-used. Thus, there is a significant need to find ways to facilitate sharing such research content.

Fortunately, the declining costs of Web-related technologies, together with the growing power and maturity of open-source code libraries, are making it more feasible for even small and under-funded institutions and individual projects to publish their collections online. Omeka (www.omeka.org), an open-source collections dissemination project, is widely used to help small museums establish a Web presence and share their collections.

While Omeka represents an excellent solution for creating online exhibitions, it lacks some key features needed to support research-oriented applications.¹ Researchers often create much more complicated datasets, usually with recording strategies and terminologies customized for very specific agendas and questions. Because researchers create more complex and individualized kinds of data, and because they often need greater precision in querying and analyzing these data, generalized solutions for Web dissemination of this type of content are very difficult to find. The Open Context project (also open source) attempts to address this issue by providing researchers with a very generalized tool that supports demanding types of interaction with data. Funded by the William and Flora Hewlett Foundation, Open Context was launched in 2006 to demonstrate the potential of providing direct access to cultural heritage datasets. Its development is ongoing and has recently seen significant revision and progress, thanks to funding from the National Endowment for the Humanities and the Institute of Museum and Library Services. This article describes some of the conceptual, practical, and technical rationales driving the development of Open Context in the two years since its initial public release. (See www.opencontext.org/ for the current home page and testing.opencontext.org/sets/ for the beta version of the forthcoming, improved home page.)

Traditional Dissemination Practices and Digital Data Publication: Old Dogs, New Tricks?

Field research generates a dizzying amount of information, painstakingly gathered by teams of people over many years of a project's life. Technological advances in the past two decades have escalated both the quantity and the range of information gathered in the course of field studies. Now, in theory (and often in practice), every moment of fieldwork can be documented, using a combination of media including cameras, video recording, laser scanners, GIS, global positioning systems (GPS), etc. Back in the lab, specialists add further documentation with scanned drawings, digitized maps, spreadsheets of analytical notations, and measurements. The continuing decline in storage costs and the growing sophistication of database systems help fuel this drive for more complete and thorough recording. In almost all cases, these new approaches produce richer and more comprehensive documentation than was previously feasible with traditional paper and photographic recording techniques. This digital documentation, coupled with digital communication via the Internet, permits far more rapid and comprehensive dissemination of research data, data that are extremely rich and complex.

Despite the potential of the Internet and the increasing popularity of less formalized (and much faster) means of disseminating information, such as blogs, synthesized print publication often persists as the primary means of disseminating scholarship. Thus, the ever-increasing stores of primary data on which the syntheses are based remain hidden on computer hard drives, facing greater risk of loss than their paper catalog predecessors. One of the primary reasons for this is that the scholarly infrastructure for sharing content has not kept up with developments in technology. Scholars lack places to house their digital content and means to share it and compare it with that from other projects. The potential is there, but (except for the very few well-funded projects with the resources to put their own materials into the public realm) the tools and other resources are not. In the end, having spent a great deal of time collecting potentially valuable digital research content, scholars are still left with their printed syntheses in hand and the foundational data inaccessible. Sadly, this situation has not changed much in twenty years.

So Much Archaeology, So Little Time. . . So Little Funding. . . So Little Expertise

This failure to adopt new ways of disseminating primary data stems from a number of concerns, including lack of time, funding and know-how, and, perhaps most importantly, lack of proof that the effort will be worthwhile to scholarship. Thus, while digital documentation and storage costs are in sharp decline, many scholars lack the means (and often, the incentives) to share their field research easily.

Even for those scholars who have already embraced digital dissemination, many questions remain about how disparate resources can and should be brought together. Among the primary technical and conceptual issues in sharing cultural heritage content is the question of how (and to what extent) to standardize or codify our documentation. Researchers in the humanities and social sciences typically work in very decentralized institutional settings and within very different research traditions. Time and budgetary constraints further inhibit the development of widely adopted recording and data management standards. As a result, scholars generally lack consensus on standards of recording and tend to make their own customized databases to suit the needs of their individual research agendas and theoretical perspectives (see also Denning 2004; Hodder 1999; Boast et al. 2007).

Even if we find solutions to documenting the diversity of archaeological content, the size and complexity of archaeological databases create challenges that even expert metadata (information about information) documentation cannot solve. Large archaeological databases often include hundreds of thousands of individual records created by multidisciplinary teams, all in complex inter-relationships. If a dataset needs to be downloaded and deployed on appropriate software, it will still be very difficult to use even with adequate documentation. Once deployed, the data are so complex that users will have to familiarize themselves with a project's database organization and interface to make use of the information. The steps involved in downloading and deploying such databases require too much effort for casual browsing and searching. Thus, making datasets available for download (even with adequate metadata) is not an ideal solution for archaeological communication if the data are not easily digestible by others.

A better solution is to serve archaeological databases through dynamic websites, thus making content easy to browse and explore. Unfortunately, this typically requires complex and expensive custom web development. Thus, only a handful of very well-funded projects offer access to databases of primary results via the Internet. The enormous and incredibly rich Çatalhöyük database (http://www.catalhoyuk.com/database/catal/) represents just this kind of project-specific data sharing. Its extensive catalogue of excavated contexts and finds facilitates analysis and collaboration among the project's large team of specialists. While this is a fundamental contribution to scholarship, Çatalhöyük's system is not readily scalable; if other projects sought to adopt Çatalhöyük's online database to share their own content, they would have to conform to its recording system.

Most archaeological projects take place in smaller research programs with less funding and technical support than Çatalhöyük. These smaller projects have little capacity to develop their own customized, web-accessible database solutions. They may develop rich bodies of documentation, but without Internet dissemination much of this material will never see publication simply because this vast amount of content cannot be accommodated by print publication. The paper format is simply not up to the task. Therefore, the thousands of bones, seeds, potsherds, lithics, and other artifacts and ecofacts that are analyzed and recorded, as well as the maps, photos, and log entries associated with a typical project, almost never see publication beyond summarized forms.

In the end, scholars who have completed a project, no matter how they have organized their data, will find it challenging to make the raw data available in an integrated and intelligible fashion for use by others unless they are prepared to spend considerable amounts of time and money building a customized, web-based system.²

Open Context: Lowering Barriers to Access with Common Tools

Sufficiently generalized tools are needed to meet broad, though not necessarily universal, needs. Researchers also demand more precision in finding, aggregating, and summarizing content. Most commercial services, including Google Docs and Flickr, are insufficient for many research purposes. However, the need for greater precision must be balanced against available capabilities and practical constraints. Some Semantic Web approaches are too complex and difficult to implement in the more decentralized and less standardized realm of field archaeology. Approaches that permit precision in information access without demanding too much formalism can help ease adoption. In this vein, the Archaeological Markup Language (ArchaeoML), developed by David Schloen (2001) for the University of Chicago OCHRE (http://ochre.lib.uchicago.edu/) data management project and implemented in the Open Context (www.opencontext.org) data publication system, represents one solution.

Open Context was developed as an external (and externally funded) system for providing access to primary data from multiple and diverse projects. It gives researchers a place on the Web to publish structured data along with textual narratives and media (such as images, maps, and drawings). Content currently in Open Context ranges from archaeological field project data to geological science and zooarchaeological datasets. Open Context also provides access to museum reference collections to help facilitate the identification and comparison of specimens recovered in field projects. From one common portal, primary content from multiple projects can be searched, summarized, refined, exported, and commented on. Open Context employs a generalized structure, carefully constructed so that contributors are not obliged to conform to overly rigid standards or change their research design or language. Additionally, this generalized structure removes the need for projects to develop specialized software in order to share their research results. Thus, Open Context's approach significantly reduces the costs of data dissemination and, in doing so, improves the likelihood that a dataset will be shared.

Figure 1 - A view of Open Context, showing how the faceted-browse tool
at the left can be used to home in on items of interest. Here, a user has
narrowed a search to all items in a certain project tagged
by a specific user.

Open Context offers a "faceted-browse" tool, which allows users to click through the system in an informed way, refining their searches as they home in on the information they seek. This point-and-click interface lets users incrementally filter their views, building up very sophisticated queries without having to fill out complex forms or, worse, type in a SQL query. Each item in Open Context contains contextual and descriptive information and can be linked to other items by the contributor or through user-generated tags. Users can group items of interest into subsets for their own use and, by sharing the tagging information, permit colleagues to access and build upon those subsets. To help make the content easier to use with other Web-based applications, Open Context is now being upgraded to improve performance and scalability and to make all data available in a variety of common formats, including Atom/GeoRSS and KML (an XML-based format for geographic data). This will make Open Context data easy to combine with Google Maps, Google Earth, Flickr, and other popular online resources.
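The incremental filtering behind a faceted-browse interface can be illustrated with a minimal Python sketch. It is not Open Context's actual implementation: the records, the field names (project, category, tag), and the helper functions apply_filters and facet_counts are all hypothetical, chosen only to show how each click adds one filter and how counts of the remaining values guide the next click.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only
# and do not reflect Open Context's real data model.
ITEMS = [
    {"project": "Project A", "category": "Animal Bone", "tag": "butchery"},
    {"project": "Project A", "category": "Pottery", "tag": "imported"},
    {"project": "Project B", "category": "Animal Bone", "tag": "butchery"},
    {"project": "Project A", "category": "Animal Bone", "tag": "burned"},
]


def apply_filters(items, filters):
    """Keep only the items that match every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in filters.items())
    ]


def facet_counts(items, facets):
    """Count how many of the given items carry each value of each facet."""
    return {
        facet: Counter(item[facet] for item in items if facet in item)
        for facet in facets
    }


# Each "click" adds one filter to the current selection.
selected = {"project": "Project A"}
remaining = apply_filters(ITEMS, selected)

# The counts tell the user which values are still available to click on.
counts = facet_counts(remaining, ["category", "tag"])

# A second click narrows the view further.
selected["category"] = "Animal Bone"
narrowed = apply_filters(ITEMS, selected)
```

Because the counts are recomputed from whatever items remain, the interface only ever offers choices that lead to at least one result, which is what spares users from composing a query by hand.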

Figure 2 - A detailed view of one item in Open Context, with images and
all related descriptive information, including people and user-generated
tags associated with the item.

Each item in Open Context has human- and computer-readable copyright and attribution information attached to it. To make citation easier, Open Context provides a Cite Item button that produces an easy-to-grab citation for the individual item being viewed. Each item in the database has its own unique web address, so an item can be cited in a publication and readers can return to it directly. Bibliographic information stored in Open Context is also expressed in a standard that is readable by the open-source Zotero (www.zotero.org) citation management tool. Zotero enables researchers to capture bibliographic information automatically when they use Internet-based library and digital repositories, including Open Context. These citation features help encourage data longevity, access, and reuse. Of course, as with every other scholarly media service, the longevity of Open Context's URLs cannot be guaranteed indefinitely. However, even if Open Context becomes unavailable, its data are archived in open, non-proprietary formats by the Internet Archive. Since citation information is included in this archival copy, it will still be possible to retrieve a bibliographic reference to a specific record from Open Context, even if the active website has gone defunct.
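As a rough illustration of how an item-level citation can be assembled from attribution metadata and a stable web address, consider the following Python sketch. The citation layout, the function name format_citation, and the example.org URL are assumptions made for illustration; they are not the output of Open Context's Cite Item button.

```python
from datetime import date


def format_citation(contributor, item_label, project, item_url, accessed):
    """Build a plain-text citation for one item.

    The layout is a hypothetical example, not Open Context's actual
    citation format.
    """
    return (
        f'{contributor}. "{item_label}." In {project}. Open Context. '
        f"{item_url} (accessed {accessed.isoformat()})."
    )


# All values below are placeholders, including the URL.
citation = format_citation(
    contributor="A. Researcher",
    item_label="Bone 1234",
    project="Example Excavation",
    item_url="https://example.org/items/1234",
    accessed=date(2009, 1, 15),
)
print(citation)
```

Keeping the item's own web address inside the citation string is what allows a reference in a printed text to lead back to the exact record, whether on the live site or in an archived copy.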

Finally, all of Open Context's collections are continually indexed by commercial search engines such as Google so they can be found "by accident." A large fraction of Open Context's users discover content through Google, while many others find content via Open Archives Initiative web-services that link Open Context to many networked digital library systems.

Future Scenarios on the Success of Data Sharing

One of the ways that Open Context is currently proving most useful is in linking digital data with print publications. This fits with the more traditional approach to publication to which the archaeological community is accustomed and offers scholars a compelling reason to share their primary data: the data are repeatedly referenced from the synthesized printed publication. With unique URLs for each item, specific citations to Open Context can be made from the printed text, making Open Context a sort of extensive digital appendix with unlimited page numbers. Although this is not the most exciting or interactive use of the system, it may be the use that best serves researchers. Thus we see Open Context working in conjunction with print publication, acting as an enhancement to the printed syntheses and a place where all the primary field data can be housed, accessed, and commented on. In effect, this would give a new level of transparency to research and would help raise the bar for how scholarship is judged and valued by the community. Such transparency, instead of being a threat to scholars, can in fact serve as a more honest testament to the quality of their research. Sharing demonstrates that a scholar's contributions are supported by a solid foundation of high-quality evidence and documentation. Over time, publication without full disclosure of underlying evidence will no longer be considered sufficient.

Until we reach that point, however, perhaps the greatest barrier to digital dissemination of primary data is the current lack of reference implementations; that is, projects with research outcomes that demonstrate the usefulness of sharing primary content. Scholars will only invest time and resources in organizing and sharing their primary content if the reward for doing so is clear and significant. That reward will be different for every scholar: some seek the increased visibility that digital dissemination offers their research; others seek access to data from other projects; still others see making their primary data available as a basic responsibility of the discipline. The benefits of acting on these motivations have yet to be clearly demonstrated.

Furthermore, it remains to be proven whether any solution can be found to deal with the staggering diversity of needs within the archaeological community relating to the use of digital technologies. Some of these needs are technological, but they also include workflow patterns, incentives and reward structures, collaborative processes, and competitive concerns. Pooled resources offered by systems like Open Context may be one solution to this problem of diverse needs. However, a greater understanding of the diverse needs in archaeology is essential to the successful development and deployment of computing systems to enhance research and preservation of the growing bodies of content that we produce. In January 2009, the Alexandria Archive Institute and the School of Information at the University of California at Berkeley will begin a two-year NEH-funded collaboration exploring how open technologies can best meet the needs of the diverse communities of scholars working with cultural heritage content (see www.ux.opencontext.org). This study will draw on the experiences of representatives from multi-organizational and interdisciplinary stakeholder groups, including academic researchers, heritage managers, and specialist communities. Results of this study will guide further developments to Open Context to create simple and easily customizable tools for users with diverse needs to share primary content and to integrate it with other online collections and resources. The study will also produce examples of different collaborative uses of pooled primary content. Success in these two areas (simple, customizable tools and tangible examples of the outcomes of data sharing) will help narrow the practical and conceptual divide between traditional dissemination of research and the potential offered by recent technological advances.

References and Related Publications

Boast, R., Bravo, M. T., and Srinivasan, R., (2007) "Return to Babel: Emergent diversity, digital resources, and local knowledge," The Information Society Journal 23(5).
Denning, K., (2004) "'The Storm of Progress' and Archaeology for an Online Public," Internet Archaeology 15. intarch.ac.uk/journal/issue15/denning_index.html
Hodder, I., (1999) "Archaeology and Global Information Systems," Internet Archaeology 6. intarch.ac.uk/journal/issue6/hodder_toc.html
Kansa, S. Whitcher and E. Kansa, (2007) "An Open Context for Near Eastern Archaeology," Near Eastern Archaeology 70(4): 187-193. www.alexandriaarchive.org/publications/KansaKansaSchultz_NEADec07.pdf
Kansa, E., and S. Whitcher Kansa, (2007) "Open Context: Collaborative Data Publication to Bridge Field Research and Museum Collections," International Cultural Heritage Informatics Meeting (ICHIM07): Proceedings, J. Trant and D. Bearman (eds). Toronto: Archives & Museum Informatics. www.archimuse.com/ichim07/papers/kansa/kansa.html
Kansa, E., (2007) "An Open Context for Small-scale Field Science Data," International Association for Technical University Libraries Meeting (IATUL07). www.lib.kth.se/iatul2007/assets/files/fulltext/Kansa_E_full.pdf
Kansa, E., (2007) "Publishing Primary Data on the World Wide Web: Opencontext.org and an Open Future for the Past," Society for Historical Archaeology, Technical Briefs 1(2). http://www.sha.org/publications/technical_briefs/volume02/article01.htm
Kansa, E., (2005) "A community approach to data integration," Geosphere 1(2). www.gsajournals.org/gsaonline/?request=get-document&doi=10.1130%2FGES00013.1

1. We are currently developing a plugin for Omeka that will make Omeka and Open Context interoperable and easy to integrate.

2. See the review of the International Dunhuang Project in the September, 2008, CSA Newsletter for a discussion of the costs associated with providing a customized access system: csanet.org/newsletter/fall08/nlf0802.html.

-- Eric Kansa and Sarah Whitcher Kansa

Table of Contents for the January, 2009, issue of the CSA Newsletter (Vol. XXI, no. 3)