CSA Newsletter, Winter, 2007: Image Repositories: Works in Progress

You need an image of the Pantheon, of a Maya site, of Çatalhöyük, or of the Blue Mosque. Where do you go, and how do you search? Flickr and Google are both possibilities, but so are more academically-oriented web sites that provide a broad range of slides, including ARTstor® with its images gathered from university slide libraries. The choice of web site and the nature of the search will depend to a great extent on the nature of your needs. The more scholarly your needs, the more likely that you will need a carefully selected web site. On the other hand, a general view of a monument or a site may be relatively easy to find almost anywhere.

I wanted to see what kinds of access were available for those of us who might want very specialized searches, partly because I have been convinced for some years that old and ill-studied images may provide a fertile field for archaeologists and partly because my experience with image collections has been so poor. My examination of a variety of web collections was sorely disappointing, both because of the limited kinds of searches I could perform and because of the paucity of data transmitted about the images I did find.

I looked at several academic web sites. I also tried both Flickr™ and Google™ Image Search.

Problems arise with Google and Flickr, of course, in the area of quality control. They do not attempt to monitor what they index (Google) or what they hold for others (Flickr). In Google's case, that means that the images found may not only be of limited interest, but they may also be tiny, too small to be useful. Google does not -- surprisingly to me -- use image quality, measured simply by the number of pixels, to rank the search results, though they do index and report image size. Google does make it possible to select only the large, medium, or small images, but the medium category included images as small as 75 pixels by 75 pixels, and, again, the finds were not arranged by size. I could find no way to arrange them by size.

Google also presented problems with search quality; one does not know precisely how their searches are conducted, but I was both shocked and concerned that our own resources at the propylaea.org site were not found when I searched for images of the Propylaea. There are many photographs on the site, all high-quality ones (at least as to pixel count though my own bias would lead me to argue that they are qualitatively superior). All the pages on which the photographs are shown have the word Propylaea on them not once but twice, in the paragraph before and the paragraph following the image. While I cannot know how the search algorithms function, the simple fact that these images were omitted casts serious doubt on the quality of Google's search results for images. (I checked to be sure the Google robots/crawlers had visited the site, and it seems clear from text searches that they have visited and indexed the site.) How can one rely upon a system that misses these images?

In Flickr's case, of course, the universe is much smaller. The images at issue are those that photographers have chosen to share at Flickr's site; therefore, the expectations must be adjusted. Image quality is generally not a problem, though; Flickr's users do not tend to put small images, measured in pixels again, on the site. The bigger problems are (1) the variety of image types available, which range from the sublime photo of a great building from a good angle in interesting light to the ridiculous photo of a tourist standing in front of a building and effectively hiding it, and (2) the use of keywords for searching and the limitations imposed by that search process. As a user, one has no way of knowing what keywords have been chosen by the person posting the images, not to mention how they may have been spelled. (The fact that many photographs are the products of tourists also creates problems with accuracy, of course. Images of the Parthenon and the Erechtheum were both shown as images of the Propylaea.) A search for Propylaea shows many more results than a search for Propylaea + Athens + Acropolis; the limited search removed from my selection set images of various entrance structures from a wide variety of places, not to mention the Brandenburg Gate in Berlin, but it also eliminated many images of the Propylaea on the Acropolis because the photographer did not bother with the name of the city or with the term Acropolis. Searches for Propylaia found images that did not turn up with a search for Propylaea, but using Propylaeum seemed to produce the same search results as using Propylaea. The larger the number of images to be searched, the more problems are created by the use of keywords. What might be called false positives -- results that meet the search criteria but not the user's needs -- can overwhelm the user; missed images may be the very ones needed.
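The keyword behavior described above can be illustrated with a small Python sketch. It is not a description of how Flickr actually searches; the photo records, their tags, and the `keyword_search` function are all invented for illustration, and the matching is a plain "every term must be present" test.

```python
# Hypothetical photo records; the tags are invented for illustration only.
PHOTOS = [
    {"id": 1, "tags": {"propylaea", "athens", "acropolis"}},
    {"id": 2, "tags": {"propylaea"}},            # taken in Athens, but never tagged so
    {"id": 3, "tags": {"propylaia", "athens"}},  # variant spelling chosen by the poster
    {"id": 4, "tags": {"propylaea", "berlin"}},  # a different gateway entirely
]


def keyword_search(photos, *terms):
    """Return ids of photos whose tags contain every search term (plain AND match)."""
    wanted = {term.lower() for term in terms}
    return [photo["id"] for photo in photos if wanted <= photo["tags"]]


broad = keyword_search(PHOTOS, "Propylaea")
narrow = keyword_search(PHOTOS, "Propylaea", "Athens", "Acropolis")
variant = keyword_search(PHOTOS, "Propylaia")

print(broad)    # [1, 2, 4]
print(narrow)   # [1]
print(variant)  # [3]
```

In this toy data the broad search returns a false positive (photo 4), the narrowed search drops that result but also loses photo 2, and the variant spelling finds only photo 3, which neither of the other searches returned.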

More academic sites generally do not rely on keywords, at least not exclusively.

Of the more academic sites, one operated by the Australian National University provided a geographic access system similar to the one used by CSA for the Bryn Mawr College Lantern Slides. Access began at the country level and worked its way down a hierarchy to the individual photograph. This is a good, simple way to provide access; it scales reasonably well up to some undefinable point when the number of images is too large, but neither the lantern slides site nor the Australian one included any direct search mechanisms. Being smaller, the Bryn Mawr College Lantern Slides system can use a simple text search because all the images are indexed on one web page. It was a bit sobering, looking more dispassionately at the lantern slide site, to find that the information about the images is not as full and explicit as it should be. For instance, when the photographer is unknown, that fact is not shown; there is simply no name. The Australian site is even less helpful, providing no information about the images at all other than the location of the subject. Whereas information about the photograph may be unexpected with Google or Flickr, an academic site should provide complete information about the images as well as their subjects. An academic user should know as much as possible about the images: sources, film, camera, lens, photographer, date, and so on.

The access system for the web site at the Lamar Dodd School of Art and Art History at the University of Georgia uses a variant of the geographic approach. The largest categories are cultural groups. They are subdivided into progressively smaller groups, often with geographic groupings along the way. Once a grouping is small enough, thumbnails of all that make up the group are provided along with captions. Very little information about the imagery is provided. While there is a nod to database-style categories in the presence of VRA categories as a label (see below for more about the VRA), I was unable to find an image with any information to accompany the label. Searching is effectively via keywords because there is only a simple box for typing a word or phrase, and only one word or phrase may be typed. Boolean searches are not available.

The Insight-software-driven web site available at http://cidc.library.cornell.edu/adw/adw.asp showcases images from the Andrew Dickson White Collection of Architectural Photographs at Cornell. Here is an excellent example of the data that should be attached to images. The structure of the data shows through, to me at least, with a clarity that shows the intelligence underlying the design. (The quantity of data also makes it apparent that gathering and recording the information is daunting.) For each photograph there is information about both the subject itself and its photograph so that the title creator may indicate the architect of the structure or the photographer -- or even the name of the artist who created a detail in the scene. Similarly, other categories of information are applicable to both the subject of the photograph and the photograph itself; thus, there can be both a date for the creation of the subject and a date for the creation of the photograph. Each can be found separately. The categories are numerous and well-designed, and it should be possible to create very sophisticated searches. Alas, it is not, because the searches can be based upon only one term. Furthermore, alternate spellings are not supported, making some searches problematic. (Is it Mnesikles or Mnesicles? For me, it's Mnesicles; for the site it's Mnesikles.)

Even the careful data structure of the Cornell site is not perfect. The location of the object photographed is not well specified. There is only place as a category, not country, city, or any other more explicit place name. On the plus side, it should be noted that there is good information about the images themselves: one example showed that the image was from a ca. 1860-1895 albumen print that had been scanned at 4000 x 3119 resolution in the SID (or MrSid) format.
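To make the distinction between data about the subject and data about the photograph concrete, here is a minimal Python sketch. It does not reproduce the Cornell site's actual schema or the Insight software; the class names, field names, sample records, and the `search` function are assumptions made for illustration. The sketch also shows the kind of multi-term search that such a structure could support.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subject:
    """The thing photographed (hypothetical field names)."""
    title: str
    creator: Optional[str] = None  # e.g. the architect
    date: Optional[str] = None
    place: Optional[str] = None


@dataclass
class Photograph:
    """The image itself (hypothetical field names)."""
    creator: Optional[str] = None  # the photographer; None means not recorded
    date: Optional[str] = None
    medium: Optional[str] = None


@dataclass
class Record:
    subject: Subject
    photograph: Photograph


def search(records, **criteria):
    """Return records matching ALL criteria.

    Keys look like 'subject_creator' or 'photograph_date'; a value matches
    when it appears, case-insensitively, inside the stored field.
    """
    matches = []
    for record in records:
        for key, wanted in criteria.items():
            part, field = key.split("_", 1)
            stored = getattr(getattr(record, part), field)
            if stored is None or wanted.lower() not in stored.lower():
                break  # one failed criterion rules the record out
        else:
            matches.append(record)
    return matches


# Invented sample data, for illustration only.
RECORDS = [
    Record(Subject("Propylaea", creator="Mnesikles", place="Athens"),
           Photograph(date="ca. 1860-1895", medium="albumen print")),
    Record(Subject("Parthenon", place="Athens"),
           Photograph(creator="unknown", medium="albumen print")),
]

hits = search(RECORDS, subject_creator="mnesikles", photograph_medium="albumen")
print([r.subject.title for r in hits])  # ['Propylaea']
```

Because the creator and date of the subject are stored apart from the creator and date of the photograph, a query can combine them freely. Splitting `place` into separate country and city fields would follow the same pattern.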

I have saved ARTstor until now despite the fact that it is clearly the elephant in the room. With a half million images, a great many academic supporters, and The Andrew W. Mellon Foundation as its founder, ARTstor has not only many assets but a presumption of success.

I was able to use ARTstor both in the CSA offices and on the Bryn Mawr College campus. On the campus computers it was very slow; in our offices it was glacial. While the lack of speed may not be a determining factor, it is hard to imagine the repository being useful once a large number of academic institutions are using it at the same time. This is not a huge surprise since the software system has been developed specifically for this project rather than in a commercial environment. (I do not mean to suggest by the foregoing that commercial software is always -- or even regularly -- superior, but it does generally reflect a well-considered appreciation of users' priorities.)

Assuming that the speed problem can be overcome, how is the searching? It is better than the other examples I looked at; it is possible to search for a title, a creator, and a keyword at one time. Not only that, using Mnesikles as a search term for creator, it managed to find both Mnesikles and Mnesicles. Despite the database-style searches, the information provided about images did not reflect a good underlying database design. There were a few specific categories, but virtually everything was either repeated in or lumped into a general category (naturally, since the keyword search needs to check all information categories). Continuing the trend, location was a catch-all category that cannot, without significant change, be searched effectively by a user. More worrisome was the almost total absence of information about the images as opposed to the subjects. Drawings were simply unattributed, as was a photograph that had obviously been copied from a book. (The photo spanned two pages in the original, and the seam was clearly visible.) The only image I found to have good information about the image itself was one from the Cornell project discussed above.

The great advantage of ARTstor is the support of The Andrew W. Mellon Foundation. That support makes it possible to hope that ARTstor will ultimately develop into the kind of resource that can be effectively used by scholars and students. It does not seem to be that resource yet.

Anyone who has attempted to put a wide variety of materials into a database will recognize the problems confronted by these image repositories. Those who catalog a collection are unlikely to find all the data a user might want in the existing catalog, if there is one -- and at least as unlikely to know enough to supply all that data without spending a great deal of time on research. Indeed, some of the information may not be available at all. If the data categories required are fewer, on the other hand, entering all the appropriate data is a less onerous task, but the user may not be able to find an image, not because it is not there but because it cannot be properly identified.

Many who try to catalog difficult and disparate collections opt to use keywords because the use of keywords is a great deal simpler and because it permits what is known to be entered relatively easily, regardless of whether the categories of information known about image A (e.g., name, creator, location, materials, photographer, film, date) match the categories known about image B. But keyword searches have huge problems, as discussed above with Flickr and Google. If a user does not know what keywords are likely to be used -- not to mention required -- using them is problematic to say the least.

So there is an insoluble difficulty here. Good, extensive information categories require data entry personnel who know more than can reasonably be expected, not to mention the time to gather and enter all the data; our own experience with the Bryn Mawr College lantern slides made that clear to Susan Jones and me. But keywords are simply not satisfactory for users. As collections presented on the web grow in size, these problems become truly insurmountable. Small collections on the web can be organized in a variety of ways without losing their utility; that is not true of large ones.

If simply getting the images into a public space for use is the overriding goal, then using keywords may be a defensible tactic. If, on the other hand, good access is part of the aim, keywords will not do. Conversely, if a collection can be put up on the web over the course of a decade or more, it may be acceptable to demand good, careful data entry. If the images need to be seen more quickly, though, such an approach to the accompanying data is unrealistic.

Perhaps there is not a middle ground of compromises but an approach that assumes a need for both speed and quality without the accompanying assumption that both must be obtained simultaneously. Suppose a keyword system were employed as a first stop with a collection being digitized for the web, but suppose further that the keywords were actually entered as information in categories in a good, comprehensive database. Suppose yet further that the data entry process were not seen as complete until that database had been populated fully, even if that took some years. In the end, we could have a superior database with information about images in appropriate categories so that excellent searches could be constructed; in the meantime, the images would be available, if only via keywords extracted from the database. (This seems to be what is being planned at the University of Georgia, given the inclusion of VRA categories.) The final system would be even better if it included both alternate spellings and alternate labels via a look-up chart of some sort AND an opportunity for users to augment that data by adding information based on their own knowledge and experience.
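The staged approach suggested above might be sketched as follows in Python. This is only one possible reading of the idea, not an existing system: the category names, the sample records, the entries in the spelling look-up chart, and all function names are hypothetical. Records are stored by category from the start, the keyword search is derived from whatever categories have been filled in so far, and a look-up chart folds alternate spellings into one preferred form.

```python
# Look-up chart mapping alternate spellings to one preferred form
# (hypothetical entries; the preferred forms are an arbitrary choice).
ALTERNATES = {
    "mnesicles": "mnesikles",
    "propylaia": "propylaea",
    "propylaeum": "propylaea",
}


def normalize(term):
    """Lower-case a term and replace a known alternate spelling."""
    cleaned = term.strip().lower()
    return ALTERNATES.get(cleaned, cleaned)


# Records start sparse and are filled in over time;
# None marks a category that has not yet been researched.
RECORDS = [
    {"id": 1, "title": "Propylaea", "creator": "Mnesicles",
     "city": "Athens", "photographer": None},
    {"id": 2, "title": "Propylaia", "creator": None,
     "city": None, "photographer": None},
]


def keywords(record):
    """Derive the keyword set from whichever categories are filled so far."""
    return {normalize(value) for key, value in record.items()
            if key != "id" and value is not None}


def keyword_search(records, term):
    """Early-stage access: match a single term against the derived keywords."""
    wanted = normalize(term)
    return [r["id"] for r in records if wanted in keywords(r)]


def category_search(records, **criteria):
    """Later-stage access: every named category must match."""
    return [r["id"] for r in records
            if all(r.get(key) is not None and normalize(r[key]) == normalize(value)
                   for key, value in criteria.items())]


print(keyword_search(RECORDS, "Propylaeum"))                          # [1, 2]
print(category_search(RECORDS, creator="Mnesikles", city="Athens"))   # [1]
```

The keyword search works on both records immediately, while the category search finds only the record whose categories have been completed. As more categories are filled in, the same data supports progressively better searches without any change to the keyword access already offered.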

Constructing the database side of such an image repository is not a task for the faint of heart, but experts have already done a great deal of work on this. For instance, the Visual Resources Association (VRA) web site has an excellent set of data categories based on the Dublin Core at http://www.vraweb.org/vracore3.htm (probably the base for the Cornell web site discussed above), and this is a relatively old web page. The Getty has also produced a suggested set of categories, Categories for the Description of Works of Art, which is more extensive and available at http://www.getty.edu/research/conducting_research/standards/cdwa/index.html. (Note, however, that the sets of data categories suggested by the VRA and the Getty do not include detailed information about location but only the general category place; there is no specific entry for country or city, not to mention more precise locations, in either set of categories. For works of architecture I do not believe this is appropriate.) The task of determining the categories required will thus be arduous, and there will be critics standing on the sidelines proclaiming the faults in any set of information categories chosen. Nevertheless, in the long term the results could be superior AND of real value in the short term. The future holds many image repositories for us; that is not in doubt. The question is not, "Will there be places to find good images?" but "Will we be able to find the images we want?"

-- Harrison Eiteljorg, II