Evolving Web Standards: a Blessing and a Curse

Harrison Eiteljorg, II

(See email contacts page for the author's email address.)

Alexander Graham Bell invented the telephone in 1876. In 1878 he set up a telephone system in New Haven, CT.¹ In the century-plus since then we have moved from a system dependent on people to accomplish the intermediate routing to completely automated routing systems that may send your telephone call around the world to get it to a spot only some small number of miles away. From pulses on copper wires the technology has moved to fiber optic cables that carry light pulses. We have also separated the system from wires altogether in many circumstances, even using satellites for signals from landline to landline. The developments would surely seem unexpected to Bell, and the equipment he developed would not work in today's electronic world. (Even the old rotary-dial phones of my youth would not work in many places today.) Indeed, the changes in the telephone system are almost as unexpected and revolutionary as the invention itself.

Why am I writing about the telephone system in this place? I chose that introduction as a poor analogy to the changes in the internet that we are living through. It's as if we were living around 1900 and witnessing the development of the telephone. We see ourselves as in the internet age fully, but we are really in the very earliest stages of it. Tim Berners-Lee, after all, only started working on what became the Web a bit more than twenty years ago.

The importance of our being so much in the early periods of the internet was brought home to me recently as I tried to complete some minor changes to the CSA Propylaea Project website. Being a website, propylaea.org relies on the standards that apply — the Hypertext Markup Language (HTML) specifies the coding requirements for web documents (and the browsers that interpret those documents) and, to a great extent, the use of Cascading Style Sheets has become virtually required to assist with formatting web pages. Those standards, though, are not fixed for more than a few years at a time. For instance, we are now seeing the arrival of a new standard for HTML (HTML 5), which was discussed in the last issue of the CSA Newsletter (see the section entitled The Leading Edge or the Bleeding Edge? in "Miscellaneous News Items," csanet.org/newsletter/fall11/nlf1106.html; XXIV,2; September, 2011). HTML 5 will obviously not be the last version of HTML; continuing changes can be assumed.²

Although the changes in coding standards are frequent, most of us are users rather than providers of the web pages. Therefore, we often miss the simple fact of being so early in the history and development of the internet. So we may fail to realize that developments at this point are not only frequent; they can bring quite major changes. As new HTML standards arrive and go into general use, there have been two unexpected aspects to the developmental process. First, it seems that the developers have been influenced by the train of thought that lies at the root of XML. According to this view, all tags should be about meaning, not typography. That is, using <cite> text </cite> should mean that the enclosed text is a citation, not, for instance, a foreign word that needs to be italicized. (That may seem to imply that there is a <foreign> </foreign> tag, but I am not aware of such a tag.) Part of that argument asserts that any strictly typographic tag should not be used and therefore should not exist. By this argument, tags that were once used to mark off headings for emphasis (e.g., larger type in boldface) have been defined to mean that the tagged heading is something akin to a section title, not just some words that are set off by typeface or weight or color.

The point is that HTML markup began as a kind of ad hoc combination of tags that denoted meaning (e.g., <p> for a paragraph) and tags that denoted typographic requirements (e.g., <b> for boldface), but the people who are leading the development of the coding system have been persuaded to make the tags for meaning separate from those for typography. Actually, they want to eliminate all tags that are only about typography, but their arguments become difficult when they must define a tag to generate Italics (<em>) or boldface (<strong>). By the "logic" used, <em> signifies emphasis, and <strong> signifies importance, assuming that is different from emphasis. While both are used to generate typographic effects, the argument is that the tags are actually used to indicate words of emphasis or importance, not Italics or boldface. (This can be said with a straight face only by a special few.)

Second, the developers of HTML have decided that, over time, the specifications must occasionally permit the removal of old tags that were, in some way, improperly defined in an earlier specification. Such tags are called "deprecated." This means that the tag in question should no longer be used and will eventually not be supported by browsers as the specifications advance. When a tag has been deprecated, therefore, web developers are on notice that it may cease to function in the future. For instance, <i> and <b> have been deprecated and are apparently to be replaced by <em> and <strong> on the argument that the new tags are qualitatively different, expressing meaning rather than typography.³
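For anyone responsible for older pages, the warning implied by deprecation can be turned into a simple check. What follows is only a minimal sketch in Python, using the standard library's html.parser module, that counts how often tags from a watch list occur in a page. The watch list (WATCHED_TAGS) and the function name (count_watched_tags) are my own assumptions for illustration; which tags are actually deprecated or obsolete must be taken from the specification in force at the time.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical watch list of presentational tags, chosen for illustration only.
# Check the current HTML specification for the tags that are really obsolete.
WATCHED_TAGS = {"font", "center", "big", "strike", "tt", "b", "i", "u"}


class TagCounter(HTMLParser):
    """Count every start tag whose name appears on the watch list."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case, so <FONT> and <font> match.
        if tag in WATCHED_TAGS:
            self.counts[tag] += 1


def count_watched_tags(html_text):
    """Return a dict mapping each watched tag found to its number of uses."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)


sample = "<p>The <i>Iliad</i> in <FONT size='2'>small</FONT> <b>bold</b> <i>type</i>.</p>"
print(count_watched_tags(sample))
```

Run against every page of a site, a report of this kind would show where the appearance of old documents is most at risk if browsers stop supporting the tags in question.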

The combination of the insistence that tags indicate meaning, not typography, and the deprecation of tags yields a problem for the long-term use of the web. It's a simple problem. In short, the tagging system is in a semi-constant state of flux. New tags are regularly created, which should cause no real problems for an old document, and old tags that were originally used for appearance may not be supported in the future, altering the appearance of those old documents. (The fact that old tags are left out of a specification does not mean the browser makers will not support them. Any browser maker may well continue to support them if it seems to them to be a good idea, i.e., to increase sales or market share.)

This may seem to be a minor problem, one that affects only the people who develop websites. However, the truth is that these kinds of inconsistencies provide a simple but powerful example of the problems with digital documents as repositories for important knowledge. If a future HTML specification causes browsers to render a web page improperly or, worse yet, makes it impossible to render a given web page at all, knowledge is threatened and possibly lost. A user who wants to know what that document holds, assuming the file itself exists, must either find a browser that is of the age of the document (and will work on the operating system at hand) or migrate the document to some HTML format that both preserves the original and functions with a current browser. Neither solution is easy or guaranteed.
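To give a sense of what the simplest kind of migration might involve, here is a rough Python sketch that renames typographic tags to their proposed replacements while leaving everything else in the document untouched. The mapping (RENAMES) and the function name (migrate_tags) are assumptions made for this illustration. The pattern matching is deliberately naive: it would also alter matching text inside comments or scripts, so a real migration would need a proper HTML parser and careful checking of the results.

```python
import re

# Hypothetical mapping from typographic tags to their proposed replacements.
RENAMES = {"i": "em", "b": "strong"}

# Match an opening or closing tag whose name is one of the keys above,
# with optional attributes. Names such as <img> or <br> do not match,
# because the tag name must be followed by whitespace or ">".
TAG_PATTERN = re.compile(
    r"<(/?)(" + "|".join(map(re.escape, RENAMES)) + r")(\s[^<>]*)?>",
    re.IGNORECASE,
)


def migrate_tags(html_text):
    """Return a copy of html_text with the mapped tags renamed."""

    def rename(match):
        slash = match.group(1)            # "/" for a closing tag, else ""
        name = match.group(2).lower()     # normalize <B> to b for the lookup
        attributes = match.group(3) or "" # keep any attributes as written
        return f"<{slash}{RENAMES[name]}{attributes}>"

    return TAG_PATTERN.sub(rename, html_text)


old_page = "<p>The <i>Iliad</i> and <B class='x'>bold</B>.</p>"
print(migrate_tags(old_page))
```

The function returns a new string rather than changing its input, so the original file can be kept alongside the migrated copy, which matters if the migration later proves to have lost something.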

We archaeologists are among those in the academic/scholarly world most in need of our field's old data. We cannot recreate past excavations. We cannot re-contextualize the objects we have. We cannot recreate a corpus of any given object type without the information accompanying the objects. In short, unless we wish to accept all present ideas and eschew the possibility of new analyses, we need to preserve all the data we have. All of it. Even web pages that consist only of text. To the extent that web pages may hold some of that data, this is an archival issue little different from those that have often been discussed in this newsletter. It differs only in the sense that, being so close to the beginning of the internet age, we tend to miss the extent to which change is to be expected in web documents consisting mainly of text, perhaps with a few illustrations. We need to be aware of the issue, and we need to be prepared to overcome the obstacles so that our data do survive. (I admit that I am glossing over the problems of the included images.)

Since the web is no more going to stand still than the telephone system did, I believe that we have few choices for preserving our precious web documents (not including the kinds of complex data files that absolutely require an archival repository). One would be a version of HTML that would be designed so that, while it could be expanded, it could never be truncated in any way. Such a standard would permit HTML documents to be read at any date in the future (assuming HTML continues in use at all, a dangerous assumption). Another would be the use of archival facilities that are prepared to deal with the problems and will migrate data, including HTML files, from one format to another so as to make sure such files retain all their salient qualities. The last, I think, is to use a standard such as PDF rather than HTML for documents, in the hope that it will at least outlast most other formats and that it is so widely used that, at worst, migrating to a new version will always be possible. These are not good choices, but I see no other realistic ones, and the first of those is not truly realistic either. I hold out little hope that a better one will appear. In that case, those of us who use the web to hold important information must take our archival duties seriously. We must find repositories that will maintain our data, "not for a year but ever and a day," as the old song goes.⁴

-- Harrison Eiteljorg, II

Notes:

1. The Franklin Institute, "Bell's Telephone." fi.edu/franklin/inventor/bell.html, last accessed January, 2012.

2. A note for those who have never tried to write a web document using HTML. The text must be marked with "tags" in the form of <tag>some text</tag>. The <> brackets alert a web browser such as Internet Explorer® that the enclosed characters are a tag, not part of the text, and the slash in the second iteration makes it clear that that second iteration marks the end of the tagged text. Thus, a paragraph is surrounded by <p> and </p> so that a browser will treat the enclosed text as a separate paragraph. Similarly, <cite>Iliad</cite> would alert the browser that the enclosed text, "Iliad," is a citation, perhaps to be rendered in Italics, though the choice of Italics depends upon the browser's settings. There are even tags to separate header information that never shows on the web page (including the version of HTML used in the document) from the body of the page itself. For more information, a simple search for HTML on the web should turn up more references than you really want.

2. It must be noted that the version of HTML used in a document is one of the things that should be stated explicitly in the header information. That makes it easier for a browser to render the text properly, but it certainly does not guarantee that a browser will be capable of rendering the text correctly if the HTML version is quite old.
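For those curious about how that version statement can be read back out of a file, here is a small Python sketch. It assumes that the statement in question is the document type declaration (the line beginning <!DOCTYPE) that conventionally opens an HTML file; the class and function names (DoctypeFinder, find_doctype) are invented for this illustration.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remember the first document type declaration seen, if any."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl holds the text between "<!" and ">", e.g. "DOCTYPE html".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html_text):
    """Return the declaration text, or None when the page does not state one."""
    finder = DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.doctype


old_style = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd"><html><body><p>Text</p></body></html>'
)
print(find_doctype(old_style))
print(find_doctype("<p>No declaration here</p>"))
```

A page for which this returns None gives a future browser, or a future archivist, no statement at all about the version of HTML its author had in mind.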

3. A second note for those who do not compose web documents. Cascading stylesheets are now meant to take care of typography needs by letting the website designer assign appearance to the HTML tags in associated documents called stylesheets. In such a document the designer can say that a heading tagged as <h1>Heading</h1> should be twice the size of normal text and boldface. So long as the relationship between the stylesheet and the web page is retained, the headings should appear as specified, even if the browser's normal version of such a heading would only magnify the text by fifty percent. However, using those stylesheets does not permit a developer to create his/her own tags or to use any HTML tag that is not supported by the browser. That is, a tag can be a newly created one or an old one that is no longer supported, but such a tag may be ignored by the browser, in which case the text appears as if the tag were not there. Thus, I could not usefully create a <foreign> </foreign> tag that would denote words from another language and then define the typeface as Italic. I could define such a tag in a stylesheet, but it would, generally speaking, be ignored by the browser. It is worth noting that one of the most popular content management systems for websites, Joomla (version 1.6), allows users to enter text with word-processor-like formatting; that is, users may designate italics, boldface, underlining, and so on. Thus, although Joomla is a very standard tool for creating web pages, its current iteration ignores the requirement that HTML coding reflect only meaning, not typography.

4. "Our Love Is Here to Stay," lyrics by Ira Gershwin, 1938.