Susan C. Jones and Harrison Eiteljorg, II
(See email contacts page for the author's email address.)
Do you use Google? Do your students? Dumb questions, aren't they? Of course you and your students use Google. Who doesn't?
The more important question is this: “Do you understand the limits of searching with Google?” While it is a more important question, we do not think it can be honestly answered because Google will not disclose the ways in which it determines what does and what does not appear on a given list of search results. [Note: The Google searching about which this article has been written is searching for web resources, not searching for publications via Google Scholar, which is still in beta form.] Many have begun to question the use of Google – not because the searching algorithms are not public but because the results are so often frustrating. Resources that should be found are not; resources that should not be included in a given search are nevertheless there. As we have both read and experienced, for instance, a search for a simple term may or may not include the obvious: the term itself as the domain name, with .org, .com, or . . . .
We have searched for Propylaea and for Propylaea + Athens + Acropolis and been shocked to find that, in the first search, the CSA Propylaea Project web site appears near the top of the list but, in the second, it does not appear in the first 20 results. Other recent attempts to perform searches were also frustrating, and it seemed that some form of experiment might offer some clarification of the issue.
For the experiment we determined to see which of the recently reviewed web sites could be found by searching with relatively obvious search terms. The terms (any multi-word phrase enclosed in quotation marks is considered a single term) were tried in a variety of orders, and other searches were tried, all in an effort to see if a pattern could be teased out of the experimental results.
We tried to construct searches for these sites; all save the last, the Athenian Agora site, had been reviewed in a CSA Newsletter issue.
We instructed Google, via its preference settings, to show us twenty sites on a page and to suppress nothing because of questionable language. We chose the twenty-hits-per-page preference so that we could examine only a single page for the specific site we wanted to find -- on the assumption that the normal process for most searchers stops after a page or two of results at the default ten-hits-per-page setting. People will often look deeper into a given set of search results, but that happens mostly when the earlier results seem unpromising, and we did not want to make judgements about the sites found. Our aim was simply to see if a known site of quality would turn up as one of the top finds if we were trying to locate information about a specific area of interest, in each case choosing the search terms so that we could expect to find a specific site from our list.
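The practice of trying the same terms in every order, with a quoted phrase counting as one term, can be sketched in a few lines of Python. This is only an illustration, not a record of how the searches were actually run; the function name query_orderings is hypothetical.

```python
from itertools import permutations


def query_orderings(terms):
    """Return every ordering of the given search terms as a query string.

    A multi-word phrase is wrapped in quotation marks so that it is
    treated as a single term.
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return [" ".join(order) for order in permutations(quoted)]


# Print the six orderings of three terms, one of them a quoted phrase.
for query in query_orderings(["Aegean", "Bronze Age", "archaeology"]):
    print(query)
```

Three terms give six orderings and four give twenty-four, which is one reason the number of searches grows quickly as terms are added.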
It should be noted that the absence of any given web site from a set of search results does not mean that the results will not lead to the site in question via links or a series of links. We did not attempt to perform more sophisticated analyses or to look closely at the results that did turn up. We only checked to see if the web site we thought was a good and useful one was included in the first twenty results.
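The test applied to each search (is the desired site among the first twenty results?) amounts to a simple check, sketched below in Python. The sketch assumes the result URLs have already been collected into a list by some means; the function name rank_of_site and the example.* addresses are hypothetical placeholders, not real search results.

```python
from urllib.parse import urlparse


def rank_of_site(result_urls, target_host, limit=20):
    """Return the 1-based position of the first result hosted at
    target_host within the first `limit` results, or None if absent."""
    target = target_host.lower()
    for position, url in enumerate(result_urls[:limit], start=1):
        host = (urlparse(url).hostname or "").lower()
        # Accept the host itself or any subdomain of it (e.g. www.).
        if host == target or host.endswith("." + target):
            return position
    return None


# Placeholder result list for illustration only.
results = [
    "https://www.example.com/a",
    "https://news.example.org/review",
    "https://www.example.net/home",
]
print(rank_of_site(results, "example.net"))
```

Note that this check counts any page at the host as a find; distinguishing a home page from a secondary page or article would require comparing the path as well.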
In order to spare the reader great pain, information about the actual searches (terms used, ordering of terms, and results) may be obtained from this downloadable file. (The file is in tab-delimited ASCII format and may be imported into virtually any spreadsheet program, a word-processor, or a database. Your browser will probably display the file in a new window or tab, requiring you to save it in order to open it or import it into another program. If you must indicate a character encoding scheme to open or import, choose the UTF-8 option.) We will summarize here the results without including all the excruciating detail.
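For readers who would rather load the file with a script than a spreadsheet, the following Python sketch reads a tab-delimited UTF-8 text file into rows. It makes no assumption about the column layout of the file, and the function name read_search_log is hypothetical.

```python
import csv


def read_search_log(path):
    """Read a tab-delimited UTF-8 text file into a list of rows,
    each row being a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

Calling read_search_log("searches.txt"), where searches.txt stands for whatever name the file was saved under, returns one list per line of the file; names such as Çatalhöyük survive intact because the encoding is stated explicitly.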
The Prehistoric Archaeology of the Aegean was a particularly important example in the sense that it is a unique resource about a reasonably confined subject, and it is an excellent one. It was therefore with some surprise that we realized how hard it was to find. The finds were more interesting than many, however, in that the "right" terms worked very well. That is, Aegean was key. Using it with archaeology and prehistoric -- in any order -- brought the resource to the top of the list. Similarly, using Bronze Age in place of prehistoric kept the results good (though, oddly enough, there was a trivial difference between the results of Aegean "Bronze Age" archaeology and the other versions of the same terms in differing orders). On the other hand, using Greece instead of Aegean reduced the likelihood of a find, making the results rather mixed. It is, in the end, the absence of this resource from the listed sites when the term Greece is used instead of Aegean that makes clear the problematic nature of Google searches. If someone using Google already knows that the term should be Aegean, is that person the one for whom these searches are most useful?
The TAY Project Portal -- The Archaeological Settlements of Turkey -- regularly came up on our first page of hits when we searched for it. When using Turkey settlements archaeology as the search terms, no matter the order, the TAY site came up as the first find. Leaving out settlement changed things, but only to the extent that the site slipped in the rankings, still showing up on the basic page. Among our choices, only the search for prehistoric Anatolia archaeology left this site off the list of included ones -- and other orders of the same terms placed the site on the list. Happily, various searches without the term settlement were successful in turning up the TAY site.
Çatalhöyük's web site was another success story. Without using the appropriate diacritical markings for the Turkish name, the web site turned up as at least third on the list and usually first when using Catalhoyuk or Chatal hoyuk, with or without Turkey, neolithic, excavation. Best of all, the search terms neolithic Turkey excavation, in any order, put the Çatalhöyük web site first on the list.
Similarly, the Ostia site was found with Ostia whether or not harbor or Rome was used with it and regardless of the order. Using Rome and harbor placed it first or second, depending on the order. The searches did very poorly, though, when ancient Italy harbor was used. No matter the order, the web site did not turn up (though the CSA Newsletter review did turn up in one of the searches, and many web sites about Ostia were found). Changing the search terms slightly -- to "ancient Italy" harbor -- also failed to turn up the Ostia site, again regardless of the order.
We expected the Athenian Agora web site to be easy to find as well. However, the results were surprising. Agora turned up a page at the Stoa web site, but nothing closer. Agora Athens turned up the home page of the excavations at the seventh place on the page, and reversing the order moved it to the fifth position. Making the search for ancient Athens Agora placed the web site we were looking for on that basic search page in the fourteenth position. However, the same terms in any other order did not place the excavation home page on our search page at all. Nor did the use of Athens with ancient and market, in any order. Making it "ancient Athens" with market, again in either order, also failed to get the Agora excavations page.
Finding the Hadrian's Villa web site was also problematic, though the results were better than those for the Agora. A search for "Villa Adriana" put the web site at number 1. "Hadrian's Villa" kept it on the first page of 20 relevant sites but moved it down the ranking to number 12. Adding Tivoli moved it up to number 8; reversing the search terms, oddly enough, removed it altogether from the list of sites, and making the search for Tivoli Hadrian Villa also yielded 20 results without the Villa Adriana site. Searching for Hadrian Villa Tivoli did yield results that included our site, at number 13.
The Traditions of the Sun web site provided interesting results, and it was difficult to decide on search terms. The results were interesting because they regularly included relevant NASA pages, and NASA was a contributor to this project. Searching for Archaeoastronomy "Chaco Canyon" Yucatan and for sun archaeology "Chaco Canyon" Yucatan yielded lists of sites that included the Traditions of the Sun site (as number 17 in one case and 7 in the other). However, searches for archaeology astronomy "Chaco Canyon" Yucatan in four different orders turned up the CSA Newsletter review and NASA links but not the actual site. "Chaco Canyon" archaeology astronomy and all the possible orders of those terms produced lists without the site in question.
Now we turn to the International Dunhuang Project. The results here were perplexing. Searching for just Dunhuang yielded a list with the site home page positioned at number 7. Adding cave or caves, with or without quotation marks, made things worse, not better, with an article from the site but not the home page appearing in most instances and no listing at all in one instance. Dunhuang China and China Dunhuang produced no pages with our site listed; using China, Dunhuang, and cave or caves, with or without quotation marks to combine words into a term, produced poor results, never including the home page and, about a third of the time, including an article from the site. Silk Road, with or without quotation marks, produced better results. Searches with that term and Dunhuang (with or without cave or caves) reliably produced results including the home page of the International Dunhuang Project. On the other hand, using China with Silk Road or "Silk Road" produced search pages without the home page or any article from the site. Surprisingly, archaeology Silk Road and Silk Road archaeology produced search results including the home page (at position 12 or 13), and Buddhism Silk Road and Silk Road Buddhism put the IDP home page at position 3.
The last of our searches stem from the combined review of two Egyptian sites, The Oxford Expedition to Egypt, Scenes-details Database and Digital Egypt for Universities. This was the most difficult challenge and the least informative -- because the sites are more particular as to material and/or audience and because there are so many sites about ancient Egypt. It is by no means a given that these two sites should have been among the first found, certainly not in the way the International Dunhuang Project or the others already discussed should have been high on their respective lists. Searches using various combinations and orders of "Old Kingdom," ancient, Egypt, tombs, pyramids, paintings, and scenes as the key terms produced only one set of results with either site; "Old Kingdom" scenes Egypt produced a set of search results with the scenes details database as the first item on the list. None of the other searches produced a set of results including either site.
It is impossible to summarize the results in a rigorous way, and the specific results change from day to day anyway. Some things do seem to be suggested. For instance, the Çatalhöyük web site was found rather easily; that probably reflects the known Google use of links pointing to a given site as key determinants of importance. Since the archaeological site is so well known and since there are many different groups interested in it and the finds from it (including, for instance, those interested in issues surrounding mother goddesses), the web site is surely very often referenced at other sites.
The Athenian Agora and Villa Adriana sites seem to indicate that such specific sites can be found easily -- if and only if one knows the proper terms, and the proper terms may simply be the terms used in the home country of a site. That is, market did not perform well to find information about the agora; nor did Hadrian's Villa work as well for the Villa Adriana as Villa Adriana. In the case of the Ostia site, ancient Italy was not a good substitute for Rome, but that is a terminology question, not a language issue. Similarly, the superb resource for the Greek Bronze Age, Prehistoric Archaeology of the Aegean, was found rather easily with searches including Aegean, but with few of the searches formed without that term, one hardly familiar to the naïve user.
It seems that the searches for the other sites produced more erratic results. If there is a general rule, it would seem to be that there is no rule. The results for the International Dunhuang Project show that most clearly. The matter of most concern to a user is the apparent irrationality of some search results, perhaps the most ridiculous of which is the difference between searching for Propylaea and searching for Propylaea Athens Acropolis. The first of those searches finds the CSA Propylaea Project and puts it third on the list. The second one places it twenty-eighth. Presumably this is the case because, while Propylaea and Acropolis appear in the first paragraph, Athens does not make its appearance until the third paragraph. (Athenian appears in the first paragraph.) To prove the point, a search for Propylaea Athenian Acropolis was initiated. Unexpectedly, it moved up the project home page only to number 21; a secondary page made it into the top 20. Changing the order of the terms to Propylaea Acropolis Athens, on the theory that the least specific term, Athens, should be last, moved the CSA Propylaea Project home page further down the list, not up. That was truly perplexing. Perhaps it suggests that the cross-linking is more important than anything else, though that is a bit of a leap.
So it seems that being able to predict what will be found and what will not is as difficult as Google wishes it to be -- a legitimate concern for them since they do not want site managers to be able to game the system. Unfortunately, that means it is also difficult to have any sense of security that a Google search will find the best resources. That means, in turn, that the user must beware. A Google search will provide a list of sites that meet the criteria; it will not necessarily rank those sites accurately, either as to relevance or as to quality. The fact that quality is not a factor should surprise no user. Algorithms that depend entirely on machine-directed searching and indexing cannot be expected to determine quality. Relevance, on the other hand, should be better. For instance, it seems that if, as often happened, the CSA Newsletter review of a web site turned up in response to a search, the site itself should come first in the list. That did not occur regularly.
Neither you nor your students are going to stop using Google. Neither are we. However, we all need to use it with more circumspection. This experiment has not been either deep or carefully designed and monitored. Nevertheless, it has been good enough to show how skeptical we who use this all-too-common search tool must be about the quality of the results.
-- Susan C. Jones and Harrison Eiteljorg, II
An index by subject for all CSA Newsletter issues may be found at csanet.org/newsletter/nlxref.html; included there are listings for articles concerning the use of electronic media in the humanities.
Table of Contents for the January, 2009, issue of the CSA Newsletter (Vol. XXI, no. 3)