Project Publication on the Web — III

Andrea Vianello, Intute, and Harrison Eiteljorg, II

(See email contacts page for the author's email address.)

In the last two issues of the CSA Newsletter the authors have discussed important questions that scholars must ask before beginning to use a website as a publication medium for a project: impediments to doing so and motivations for doing so. For those scholars who have decided that the impediments are not daunting and have determined their particular motives, it is now time to dive more deeply into the topic. How should the committed scholar proceed?

The first step was mentioned in the last article. That first step is the selection of a Digital Information Officer, also known as the chief geek. It is our firm conviction that the selection of this member of the team is of critical importance because that person will be the one who drives the decision-making process for matters digital. It will be up to that member of the team to ask the right questions about the computing side of the project, propose useful solutions (generally more than one per question), and implement the decisions. As we said then, this person should be brought onto the team at an early stage so that questions that may not appear to require the chief geek's input are not answered too abruptly.

The following are the major questions we see as being the first to be answered after the chief geek has joined the team. All of them are important, but only those marked with asterisks are relevant to the use of the web as a publication medium, and only those will be dealt with here.

What hardware and software will be used?
What procedures will be in place for data backup and protection, on site and during the off season?
What portion of the data recording will be done on paper and what portion directly into digital media, with what kinds of equipment, processes, and procedures?
What will be used to supply standard vocabularies (for finds, for excavation areas, for various digital data sets, and so on)?
What base digital or paper data (e.g., maps) will be required? What requirements for them will exist, who will decide, and where will they be obtained?
What procedures will be in use to assure proper version control of all data?
*How will data be archived? In an independent archives or locally?
*Will access to data be provided to website users in a fine-grained fashion (access to individual data items) or will access only be to complete files?
*Will preliminary data be posted on the web for access by people beyond the project team?
*If access is to complete files (whether only complete files or complete files and individual items), in what formats will those files be made available?
*What kinds of files are expected? Databases, CAD models, GIS data sets, images, scans and/or PDF files of various documents, audio files, video files, HTML files, VRML files, some other contemporary file type, or something not yet even conceived?
*Given the assumption of digital photography by many, how will photographs be treated? That is, will the project expect copies of any and all photographs taken within the project boundaries, however defined? If not, how (and when) will the choices be made? What file formats will be used for images at what stages in the process, and what information will be expected about any photograph?
*What forms of indexing and/or searching will be developed?
*Will there be a private section of the website for team members?
*Assuming use of the website for various functions (e.g., volunteer information) during the life of the project, will all organizational web pages be archived, some portion, none — with all revisions, last revision only, major revisions only, . . . ?
*Will preliminary discussions and analyses, comparable to field-season reports, be posted on the web for access by people beyond the project team?
*Will web pages aimed at the general public be a part of the site from the beginning or only when analysis has been completed? (Note the assumption — in keeping with Peter Young's comments in "Bridging the Communication Gap: Should academics go public with what they know?" from the January, 2011; XXIII, 3; CSA Newsletter — that there will be information intended for the public.) What archival procedures will be in place for the pages intended for the general public?

In this article, we will focus on the last eleven of the above questions. Not all of these issues will be germane to all scholars embarking on the construction of a web site as the project publication medium. Some will not wish to use the site for administrative matters at all. Others may choose to omit all preliminary publications from the web site. Nevertheless, we will endeavor to provide some thoughts about all these issues. Note that the issue that may be most contentious, final reports and analyses, has been omitted here. It will be the subject of the last article in this series, which appears separately in this issue.

How will data be archived?
We begin with the question of data archiving. Will the archiving be done locally or at an archival repository? We strongly recommend the latter. No matter how strongly the project directors may feel, they cannot ensure long-term care of the files after moving on, whether to another project, another institution, retirement, or a more permanent new resting place. A repository, however, should be able to guarantee, at the least, that the files presented to them will be cared for and/or given over to some other institution for proper care. We believe strongly that the archival repository should be willing to preserve the original files as well as any derivative files or exported data. The files actually produced by the project represent more than the data; they are also the best evidence of methodology, and their organization can shed important light on the finds.

Sadly, we cannot today make the claim that archival preservation is either universally available or easily obtained. In fact, it may be impossible to find a suitable repository for the files. As a result, it may be necessary to prepare archival copies of files and store them locally until a suitable repository can be found. In such an eventuality, funds should be retained to pay for archival preservation when it becomes available.

Will access to data be provided to website users in a fine-grained fashion (access to individual data items) or will access only be to complete files?
Another basic matter is the question of fine-grained data access: will website visitors have access to full files only or to more fine-grained data such as information about individual artifacts? This is a more difficult question to answer than one might expect. The extra work required to provide access to individual data items is enormous, especially when one considers all the work required to enable access via a variety of criteria and to virtually any individual item in any file. That is a huge time and money commitment. On the other hand, it is foolhardy to assume that the typical archaeologist can access data files simply by downloading them for use on the desktop. The CSA survey of computer skills in the field (from 2001) is now woefully out-of-date (see "Computing in the Archaeological Curriculum;" Fall, 2001; XIV, 2), but the basic thrust remains: relatively few scholars have developed the skills required to use complex relational databases, much less GIS data sets, CAD models, or even complex spreadsheets with unusual statistical procedures. For those who lack those skills, access to the files alone is of no value. Thus, there are strong reasons to provide only access to files and equally strong reasons to provide access to finer-grained data items. This suggests that the prudent course is to find an archival repository that will both archive the full data files and excerpt the data for finer-grained access. Were such a repository available, that would be our recommendation. However, we are not aware of a repository that has offered both to preserve all data files and to provide access to the data in a fine-grained system. While it may be reasonable to assume the existence of such repositories eventually, a project would be ill-advised to build its plans based on the assumed existence of such a repository in the future.

It is tempting to leave this discussion and simply move on, offering no opinion for today's project directors. However, we believe there is a prudent course: offer only complete files. Only if the project has an unusually large budget for this work and a serious interest, both intellectual and digital, in the processes and procedures of creating fine-grained access systems may the choice to provide fine-grained access as well as the original files be justified. Note that we do not offer a choice of fine-grained access only. We believe firmly that the original files must be preserved for access by any scholar wishing to understand fully the methods as well as the results of a project. If they must be preserved, we see no reason not to make them available. Fine-grained access can be added at any future date, but the full file systems of a project cannot be reconstructed from the data extracted for fine-grained access. (Note, however, that we have not spoken to the timing question here.)

The issue of file formats for those files to be shared is important here. In general, the formats supplied should be the most widely used non-proprietary formats available. That does not mean that the original files — in whatever formats — cannot be made available as well. In fact, we would recommend that the original files be made available whenever possible along with equivalent files in non-proprietary formats. When non-proprietary formats are not available (e.g., for CAD files in the DWG format), there is little choice but to make them available in the format the project has. This is one of the reasons an archival repository is a good idea. The people who operate such repositories will be better able to deal with these questions than the best of the chief geeks for individual projects.

Will preliminary data be posted on the web for access by people beyond the project team?
Let us now turn to the timing question. It is easy to say that neither files nor fine-grained access should be made available to the public until the project has been completed so that the data can be properly checked and verified. However, there will be preliminary reports and other documents that rely upon the data files available at the time of writing. To the extent that such documents utilize the data available at the time of their writing, how does one justify keeping those data under wraps at that time? This is not an easy issue to resolve. One solution is to prepare a system that permits data files to be accessed but states the limits of proper use, verifies the state of the files accessed, and makes certain that the latest files are always available. The idea, in the long run, is that scholars will use the latest and best files, knowing that they will be available and that they need not obtain data from any other (older or less reliable) source. The problem remains, however, that preliminary information will be publicly available and may generate arguments and disagreements. We do not see this as a problem with a clear solution, but the larger (and therefore longer-running) the project, the more necessary it will be to provide access to data files before they are complete. Keeping those files under wraps until completion for such projects may effectively amount to never releasing the data at all.
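One way to approach the part of such a system that "verifies the state of the files accessed" is to publish a manifest listing each data file with a checksum, a release date, and a status label. What follows is only a minimal sketch in Python under that assumption, not a description of any existing system; the manifest layout, the field names (sha256, released, status), and the function names are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, released: str, status: str = "preliminary") -> dict:
    """Describe every file under data_dir (hypothetical manifest layout)."""
    entries = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            entries[path.relative_to(data_dir).as_posix()] = {
                "sha256": sha256_of(path),
                "released": released,   # date the file was made public
                "status": status,       # e.g. "preliminary" or "final"
            }
    return {"files": entries}


def verify_download(path: Path, name: str, manifest: dict) -> bool:
    """True if the local copy matches the checksum published for name."""
    entry = manifest["files"].get(name)
    return entry is not None and entry["sha256"] == sha256_of(path)
```

A scholar who downloaded a file some time ago could run verify_download against the current manifest; a mismatch would suggest that the project has since released a newer version of that file.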

Those inclined to risk the adverse impact of making preliminary files available should make every effort to verify the accuracy of the files made public before releasing them. In many instances, that will be sufficient. But it will not be easy. (It is prudent to check data files in each off-season span anyway. So this is simply another reason for checking data files between work seasons.)

The temptation for those who do not wish to take the risk of making preliminary files available is to delay the day of reckoning, the day when the files are finally deemed ready for sharing. It is difficult, after all, to let the information become public and to risk the inevitable: having your errors pointed out. Thus, the project that decides to delay release of data files should impose a deadline of some sort so as to prevent unending delays.

What kinds of files are expected?
The file types to be expected should be unrelated to the website in the sense that the determining factor should be the project needs rather than the needs of the website. However, if more file types are generated, more forms of access will be required for those files on the website. Therefore, the chief geek should know at the outset if, for example, video files are to be created and shared. (Note that some file types may be added to the mix after the project has begun. That should not happen without an understanding of the impact on the website and the involvement of the chief geek.)

Given the assumption of digital photography by many, how will photographs be treated?
The next question to be considered concerns photographs. It is an unusual area because digital photographs can be taken by so many different people, using different cameras and ancillary equipment, relying upon different standards, with different aims, and under vastly different conditions. Given the variables and the need to provide good information, not just information, we believe it is reasonable to require photographs to be taken in the TIFF or RAW format and migrated to the TIFF or DNG format for archival preservation. Those formats can store images either uncompressed or with lossless compression, meaning that no information is lost to a file-compression routine (as it is with JPEG, for example). This requirement does not mean that such formats will be used on the web, where they introduce considerable difficulties. Rather, we believe that the TIFF or DNG images should be the basic images, perhaps available only on request, from which all those used on the web or elsewhere are derived. Those images, having the best record of the scene, objects, or people photographed, should be the basic archival photographic sources.

Photographs should have accompanying information. For every photograph, the photographer, the date, and something about the subject(s) should be known as the bare minimum. For photographs of work in progress, information about the viewpoint should also be included, and time of day would often be helpful. More information would be desirable, e.g., camera, lens, and various technical bits, but these items may be safely omitted. Imagine, however, having a photograph from a casual visitor to the project in the project storehouse and not knowing whether the project has permission to use it. Similarly, one might imagine a photograph of staff members from a given season without all names or without a date. At some point, presumably rather far in the future, such a photograph will become all but useless. (It should go without saying that photographs should include scales and color charts when appropriate.)
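The minimum information described above could be kept as a small record alongside each photograph and checked when the photograph is accepted into the storehouse. The following Python sketch is one possible shape for such a record, offered only as an illustration; the class, the field names, and the rule that permission must be recorded are assumptions, and each project would set its own requirements.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PhotoRecord:
    """Descriptive record kept alongside one photograph (hypothetical fields)."""
    filename: str
    photographer: str
    taken_on: Optional[date]
    subject: str
    work_in_progress: bool = False
    viewpoint: str = ""        # e.g. "from the north edge, looking south"
    time_of_day: str = ""      # helpful but optional
    people: List[str] = field(default_factory=list)
    permission_granted: Optional[bool] = None  # None means "not recorded"
    camera: str = ""           # optional technical details
    lens: str = ""


def missing_information(record: PhotoRecord) -> List[str]:
    """List the gaps that would make the photograph hard to use later."""
    problems = []
    if not record.photographer.strip():
        problems.append("photographer")
    if record.taken_on is None:
        problems.append("date")
    if not record.subject.strip():
        problems.append("subject")
    # Viewpoint is only required for photographs of work in progress.
    if record.work_in_progress and not record.viewpoint.strip():
        problems.append("viewpoint")
    if record.permission_granted is None:
        problems.append("permission")
    return problems
```

A photograph from a casual visitor with no date and no recorded permission would be flagged at intake, while the gaps can still be filled, rather than discovered years later.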

Photographs may present problems because of the number of people on a project who may be taking photographs for their own purposes. Should those photographs be included in the project corpus? We do not think this is a question susceptible to a one-size-fits-all answer. A policy should be in place, and there should be some simple procedure for downloading photographs to the project's master storehouse (with requisite information about photographer, camera, lens, etc., as determined by the project staff), and care should be exercised to be sure the procedures are followed. In addition, nobody should pretend that all photographers will know and use the project's preferred standards. Therefore, there should be a process for accepting photographs that are considered important but that do not meet project standards. (It may be argued that there should also be a blanket permission form providing the project with the right to use any photograph accepted into the storehouse.)

What forms of indexing and/or searching will be developed?
The next issue may seem odd. We believe that there must be well-planned and well-executed forms of indexing of the entire web site for users. In the age of Google® as a verb it is unusual to think about indexing a single site. Surely one may use Google to obtain the necessary information about web pages with specific terms; indeed, all standard web pages should be searchable for individual words so that Google searches will work. However, a project website will consist of far more than text-based web pages. There may be data tables (in a variety of potential formats), GIS data sets, CAD models, audio files, video files, images, or . . . . In addition, there will be relationships between and among files and file types that users should understand. Therefore, a user arriving at the site's home page will require considerable help in navigating the site and determining where any given piece of information may be found. A full-text search facility should be made available to readers of the site. Full and careful explanations of the site's organization should be available, and all files that are not directly searchable (e.g., images or audio files) should be tagged with keywords that can be searched. Furthermore, human-readable formats should be preferred wherever possible (e.g., for databases), to facilitate indexing, as well as to maintain accessibility and to simplify long-term preservation. Planning this in advance is necessary so that all the pieces can be fitted together well.
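Keyword tagging of files that cannot be searched directly can be illustrated with a very small index. The Python sketch below assumes the project keeps a simple table of file names and keywords; the function names and the sample file names are hypothetical, and a real site would more likely rely on a database or a search engine than on code of this kind.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_keyword_index(tags: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each lower-cased keyword to the set of files tagged with it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for filename, keywords in tags.items():
        for keyword in keywords:
            index[keyword.strip().lower()].add(filename)
    return dict(index)


def search(index: Dict[str, Set[str]], *keywords: str) -> List[str]:
    """Return the files tagged with every keyword given, sorted by name."""
    if not keywords:
        return []
    matches = [index.get(k.strip().lower(), set()) for k in keywords]
    return sorted(set.intersection(*matches))


# Hypothetical tags for files that a full-text search cannot read.
tags = {
    "img/0042.tif": ["Trench 3", "pottery"],
    "audio/interview01.wav": ["oral history", "Trench 3"],
    "cad/site.dwg": ["plan"],
}
index = build_keyword_index(tags)
print(search(index, "trench 3", "pottery"))
```

Normalizing keywords to lower case is a design choice made here for simplicity; a project using standard vocabularies would presumably draw its keywords from those lists.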

Will there be a private section of the website for team members?
This seems a simple question. Given the maintenance requirements of the site as a whole, however, should time and effort be spent to aid communication beyond the simple use of email (including a private discussion list)? If so, this portion of the web site should have firm parameters so that it does not become too large and time-consuming. That is a persistent problem if the private area includes, for instance, documents being prepared by multiple authors and involving frequent changes. It would probably be better to use something like Google Docs or a more capable (and complex) cloud-based system for such cooperative editing. We would also recommend that, whatever is on the private portion of the site, nothing be archived absent some specific reason, in which case an ad hoc choice about which version(s) to archive will be required.

Assuming use of the website for various functions (e.g., volunteer information) during the life of the project, will all organizational web pages be archived, some portion, none — with all revisions, last revision only, major revisions only, . . . ?
Portions of the website serving mundane functions such as travel planning, information about visas, information about accommodations, and so on will also present archiving questions. What documents of this type should be archived? We would suggest that none of these documents must be archived but that it could be useful to provide a general description of these kinds of documents that served organizational needs. Such a description might include screenshots of documents, perhaps one per type (e.g., one travel page, one visa page). The added burden of dealing with such materials may be such that the project would prefer to have a completely separate website for organizational materials so that they can be clearly separated from the beginning and ignored for archiving (e.g., project-name.org for the basic web site and project-name-admin.org for the administrative materials).

Will preliminary discussions and analyses, comparable to field-season reports, be posted on the web for access by people beyond the project team?
There will be preliminary reports if the project is on-going. This is a given. The real question is how to handle them on the web site. We believe they belong there, even if they are prepared for another venue (a conference paper or a lecture open to the public). The difficulty lies in the very nature of a preliminary report. Information will change, and analyses will change. How does a project permit, even encourage, change without seeming to be unreliable and inconsistent? The best choice is to explain the changes in clear and consistent ways, but that is not so simple as it sounds.

One approach is to add notes to the original documents (perhaps color coded?) so that new readers of old documents will not be unnecessarily confused and so that people re-reading the older documents can see clearly where and how changes have been brought into the discussion. This is time-consuming and organizationally complex. Therefore, an alternative might be to include, in the introduction to any new document, notations of changes from previously published works, leaving the original works untouched. That risks allowing the originals to seem still to be authoritative, however. Ideally, there should be some notation in the originals to point out changes, even if that notation is only a link to the updated document and some standard note (e.g., "New information has obliged us to revise statements made here. See . . . ."). While the authors believe that some form of notation in old and outdated documents is required, the bare minimum is a more complex indexing form than the norm, one that includes information about documents that have been updated or are themselves updates of other documents.
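An index that records which documents have been updated, and by what, need not be elaborate. The Python sketch below assumes a simple table in which each document points to the document that revises it; the table layout, the function names, and the sample file names are hypothetical, and the wording of the notice is borrowed from the standard note suggested above.

```python
from typing import Dict, List, Optional

# Hypothetical index: each document maps to the document that revises it.
# A value of None means the document has not been updated.
SupersededBy = Dict[str, Optional[str]]


def revision_chain(doc: str, index: SupersededBy) -> List[str]:
    """Return doc followed by each later revision, oldest first."""
    chain = [doc]
    while index.get(chain[-1]) is not None:
        following = index[chain[-1]]
        if following in chain:  # guard against an accidental loop in the index
            raise ValueError(f"revision loop involving {following!r}")
        chain.append(following)
    return chain


def update_notice(doc: str, index: SupersededBy) -> str:
    """Return a standard note for an outdated document, or '' if it is current."""
    latest = revision_chain(doc, index)[-1]
    if latest == doc:
        return ""
    return ("New information has obliged us to revise statements made here. "
            f"See {latest}.")


index = {
    "report-2009.html": "report-2010.html",
    "report-2010.html": "report-2011.html",
    "report-2011.html": None,
}
print(update_notice("report-2009.html", index))
```

With such a table in place, the note in each older document could be generated rather than written by hand, so the originals stay untouched apart from a pointer to the most recent statement.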

It may be argued that this kind of revision is so familiar to scholars that it is unnecessary to change practices with a movement to the web as the publication medium. Notations of change are certainly not required in the print world. On the other hand, most authors, when proposing any important change of interpretation, will sketch out the background in the course of discussion, complete with bibliography. We are simply arguing here that some such awareness of the potential for confusion be built into the system.

Will web pages aimed at the general public be a part of the site from the beginning or only when analysis has been completed?
What about the public portions of the website? How extensive should they be? Should they await the completion of the project? Should they be archived?

These are all interesting questions that cannot be given a simple answer. The portion of the site devoted to the general public should not, in our view, be so much a section of the website as individual pages that are aimed at the public — and marked as such in menus or indices — and that lead both to other pages intended for general audiences and to the scholarly portions of the site. The general public should, when so inclined, be able to dig as deeply into the material as any scholar, but the starting points will be different.

The more complex the project, the more complex these pages will need to be. At the least, we believe that Peter Young (see reference above) had it right; the general public should be able to determine the aims, methods, and rationales for the project. We must be able to explain what we do, how we do it, and why we do it to an audience broader than our colleagues. At the end of the day, the support of the wider public is required.

As to the question of waiting until the end of the project to put up web pages for the general public, we think that is a mistake. The more important and far-reaching the project, the more desirable it will be to speak to the public from the beginning. That, however, may raise some difficult archiving questions. If the web pages intended for the public are prepared early in the history of the project, they will surely change, and probably change often. Should those various versions be preserved? It seems to us that they should. They may well represent a very interesting and useful version of the history of the project. Therefore, we favor preserving every version of these documents that differs non-trivially from its predecessor; where a series of versions differs only trivially, the last of them should be the one preserved. (This is easy to say and harder to define, but basically non-trivial changes are those that impact meaning.) Once the project and the site are complete, there should be some narrative to explain the evolution of these documents. That would aid both scholarly and general readers who wish to understand the project as a whole.

It would be foolish to pretend that the foregoing discussion has been more than an outline of the issues to be confronted. The individual problems discussed, however, should provide a handy outline for the chief geek and others as they approach the work of publishing a project on the web.

-- Andrea Vianello and Harrison Eiteljorg, II

For additional articles in this series, please see "Project Publication on the Web — IV," in this issue at csanet.org/newsletter/fall11/nlf1103.html; "Project Publication on the Web — Addendum," by Eiteljorg at csanet.org/newsletter/winter12/nlw1205.html; and "Project Publication on the Web — Addendum II," by Vianello at csanet.org/newsletter/spring12/nls1204.html.