Embedded Markup Considered Harmful

October 2, 1997

Theodor Holm Nelson

A new religion, a first-generation religion, starts with a fundamental idea and expands it to fill the universe with visions of Beginnings and Ends--right and wrong, righteousness and sin, good and evil, Hell and Heaven.[1] It may have strict standards to make sure individuals demonstrate reverent compliance.

A second-generation religion shifts emphasis, because people's concerns have changed--perhaps with new lands to conquer, less worry about sin. The priests adjust the previous fundamental idea to grapple with the new situation. But in this second generation, priests must also show fidelity to the terminology of the earlier generation, framing their new concerns amongst the old ideas wherever possible. Everyone is stuck with the concepts already elucidated.

SGML is a first-generation religion. Its founding idea was to represent nameless fonts and abstracted text blocks at one remove from complete specification, so that the fonts and text blocks could be reformatted by changing a short list of definitions.

This idea was then expanded to fill the universe, becoming a technique for the sequential, hierarchical representation of any data, with embedded tags representing Beginnings and Ends. Great emphasis was put on formal correctness, defining a strict standard for compliance. Originally intended to create order in type-font selection, SGML has been extended and extended to fill the universe, becoming a reference language of sequential attributes and now hypertext links and graphics (HTML). Its believers think SGML can represent anything at all--at least, anything they approve of.

But now we see a change. The second generation of the SGML faith is the HTML religion, whose intention and outlook are entirely different, but which preaches in the robes of the old. A new land has been conquered--the Web. There is great prosperity, as in the time of Solomon, so that sin--formal correctness--is not a worry.

Embedded Markup

I want to discuss what I consider one of the worst mistakes of the current software world: embedded markup, which is, regrettably, the heart of such current standards as SGML and HTML. (There are many other embedded markup systems; an interesting one is RTF. But I will concentrate on the SGML-HTML theology because of its claims and fervor.)

There is no one reason this approach is wrong; I believe it is wrong in almost every respect. But I must be honest and acknowledge my objection as a serious paradigm conflict, or (if you will) religious conflict. In paradigm conflict and religious conflict, there can be no hope of doctrinal victory; the best we can seek is for both sides to understand each other fully and cordially.

SGML's advocates expect, or wish to enforce, a universal linear representation of hierarchical structure.

I believe that if this is a factual claim of appropriateness, it is a delusion; if it is an enforcement, it is an intolerable imposition which drastically curtails the representation of non-hierarchical media structure.

I will turn to general problems of the embedded method. I have three extremely different objections to embedded markup. The first is simple; the second is complicated to explain; and the third challenges the claim of generality.

Objection 1: Editing

The SGML approach is a delivery format, not a working format. Editing is outside the paradigm; it happens "elsewhere."

If material is to be edited, its characters generally must be counted, and counted often, to perform the edit operations. Tags throw off the counts. This means that while text is being reworked, either some other representation must be maintained or complex tricks must be invoked to keep the counts straight.[2] This seems quite wrong.
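A small illustration of the counting problem, as a sketch in Python. The sample strings and the helper `text_offset_to_marked_offset` are invented for this example, and the helper assumes a toy markup in which every `<` opens a tag and there are no entities or comments.

```python
plain = "The quick brown fox"
marked = "The <b>quick</b> brown <i>fox</i>"

# Characters 4 through 8 are a word in the clean text,
# but tag debris in the embedded-markup version.
print(plain[4:9])    # quick
print(marked[4:9])   # <b>qu


def text_offset_to_marked_offset(marked: str, text_offset: int) -> int:
    """Map an offset in the tag-free text to an offset in the tagged string."""
    seen = 0  # content characters passed so far
    i = 0
    while i < len(marked):
        if marked[i] == "<":
            i = marked.index(">", i) + 1  # skip over the whole tag
            continue
        if seen == text_offset:
            return i
        seen += 1
        i += 1
    return i


start = text_offset_to_marked_offset(marked, 4)
print(start, marked[start])
```

Even in this toy case, a position in the text can only be located in the tagged form by walking past every tag that precedes it.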

Objection 2: Transpublishing, a Potential Conflict

This topic will take some explaining.

Network electronic publishing offers a unique special-case solution to the copyright problem that has not been generally recognized. I call it transpublishing. Let me explain.

In paper publishing, there are two copyright realms: a fortified zone of copyrighted material, defended by its owners and requiring prior negotiation by publishers for quotation and re-use; and an unfortified zone, the open sea of public domain, where anything may be quoted freely--but whose materials tend to be outdated and less desirable for re-use.

Transpublishing makes possible a new realm between these two, where everything may be treated as boilerplate (as with public-domain material), but where publishers relinquish none of their rights and receive revenue exactly proportional to use.

Two different parties have legitimate concerns. Original rightsholders are concerned for their territory of copyrighted material, as defined by law, so that they may maintain and benefit from their hard-won assets. But the public (everybody else, as well as rightsholders in their time off) would like to re-use and republish these materials in different ways.

What if a system could exist which would satisfy all parties--copyright holders and those who would like to quote and republish? What if materials could be quoted without restriction, or size limit, by anyone, without red tape or negotiation--but all publishers would continue to furnish the downloaded copies, and would be exactly rewarded, being paid for each copy?

Transpublication is a unique arrangement--only possible online--which can achieve this win-win solution.[3]

Transpublishing Defined

Transpublishing means including materials virtually in online documents: the new document pulls material from the old, so the original publisher's system furnishes the quoted material to each user on each download. (So far this only works for pictures, through the <IMG SRC> tag in HTML, but we are working on a tag for extracting text quotes.) [2]

Naturally the original rightsholder must give permission for this in advance ("transcopyright"). [3]

Transpublishing turns all participating materials into virtual clip art, freely to be recomposited into new online contexts. Its advantages are special. It provides a bridge to the original (a great benefit to understanding the written intent of the author, and possibly the author's reputation).

Furthermore, with a suitable micropayment system,[4] transpublishing should also provide a means by which the publisher is paid for each manifestation[5] thus quoted.
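The arrangement can be pictured with a toy model. In this sketch the `sources` dictionary stands in for the original publishers' servers, a reference is a hypothetical (source, start, length) triple, and the `deliveries` counter stands in for whatever a real micropayment system would record. None of these names come from an actual transpublishing implementation.

```python
from collections import Counter

# Stand-in for material held on the original publishers' systems.
sources = {
    "publisher-a/essay": "Tags throw off the counts.",
}

# A quoting document: literal strings of its own, plus references
# of the (assumed) form (source_id, start, length).
document = [
    'As the original puts it: "',
    ("publisher-a/essay", 0, 26),
    '" The point stands.',
]

# How many times each source furnished material to a reader.
deliveries = Counter()


def assemble(document):
    """Build the page a reader sees, pulling each quote from its source."""
    parts = []
    for piece in document:
        if isinstance(piece, str):
            parts.append(piece)
        else:
            source_id, start, length = piece
            parts.append(sources[source_id][start:start + length])
            deliveries[source_id] += 1  # one furnished copy, one payable event
    return "".join(parts)


page = assemble(document)
print(page)
print(dict(deliveries))
```

The quoting document never holds a copy of the quoted words; it holds only the address, so the original remains the single point of supply.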

Transpublishing versus Embedded Markup

Embedded markup drastically interferes with transclusive re-use. For one thing, any arbitrary section of an HTML document may not have correct tags (since the tags overlap and extend over potentially long attribute fields). This means HTML-based transclusion cannot be handled by a simple tag, but probably requires some sort of proxy server.
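To see this first difficulty concretely, the following sketch cuts an arbitrary excerpt out of a small HTML string and reports which tags are left dangling. It uses Python's standard `html.parser` module; the `BalanceChecker` class and the sample document are assumptions made for illustration, and void elements such as `<br>` are not treated specially.

```python
from html.parser import HTMLParser


class BalanceChecker(HTMLParser):
    """Record tags opened but never closed, and closes with no matching open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.stray_closes = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.stray_closes.append(tag)


def dangling_tags(fragment):
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.open_tags, checker.stray_closes


document = "<p>Some <em>heavily emphasized</em> words and <b>bold ones</b>.</p>"

# A quotation chosen by its words, with no regard for tag boundaries.
excerpt = document[document.index("emphasized"):document.index(" ones")]

print(excerpt)                  # emphasized</em> words and <b>bold
print(dangling_tags(excerpt))   # (['b'], ['em'])
print(dangling_tags(document))  # ([], [])
```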

Second--and it has taken a long time to get to this point[6]--the quoting author may legitimately want to change fonts and markup.

This is done all the time in scholarly writing and serious journalism, with phrases like "emphasis mine." It needs to be possible in transpublishing to change emphasis and other attributes by nullifying the original markup. Of course, re-emphasizing through markup is an editorial modification, subject to judgment calls and issues of academic etiquette. But the inquiring reader can always follow the bridge of transclusion to see the original as formatted by the author.

There are two markup solutions to make transpublishing work with SGML and HTML.

Alternative method 1: parallel markup

The best alternative is parallel markup. I believe that sequential formatted objects are best represented by a format in which the text and the markup are treated as separate parallel members, presumably (but not necessarily) in different files.[7]

The tags can be like those of SGML, but they are not embedded in the text itself. They are in parallel streams which reference positions in the text data stream. Thus each tag is preceded by a count showing how far the tag is after the previous tag. (This incremental counting, rather than stating each tag's distance from the beginning, is to facilitate editing.)
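A minimal sketch of such a parallel stream, assuming a simple representation in which each tag carries the number of text characters since the previous tag. The `Tag` class, the SGML-like names ("b", "/b"), and the sample text are illustrative choices, not a format defined in this article.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    gap: int   # characters of text between the previous tag and this one
    name: str  # "b" opens, "/b" closes (an assumed convention)


text = "The quick brown fox"  # clean data, no tags inside
markup = [Tag(4, "b"), Tag(5, "/b"), Tag(7, "i"), Tag(3, "/i")]


def absolute_positions(stream):
    """Turn incremental gaps into offsets from the start of the text."""
    pos = 0
    out = []
    for tag in stream:
        pos += tag.gap
        out.append((pos, tag.name))
    return out


def render_embedded(text, stream):
    """Merge text and parallel markup into an embedded form for delivery."""
    pieces = []
    cursor = 0
    for pos, name in absolute_positions(stream):
        pieces.append(text[cursor:pos])
        pieces.append(f"<{name}>")
        cursor = pos
    pieces.append(text[cursor:])
    return "".join(pieces)


print(absolute_positions(markup))
print(render_embedded(text, markup))  # The <b>quick</b> brown <i>fox</i>
```

The embedded form appears only at the moment of delivery; the working text stays free of tags.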

This method has several advantages:

Clean data. The raw text may be counted, scanned, copied, etc. with ease.

Pluralism. Each markup stream is independent, allowing simultaneously different formatting of the same material. (Note that schemes are also possible for markup streams to be combined, but that is outside this discussion.)

Editability. The streams may be edited, though they must be edited in parallel. Operations of insertion, deletion, rearrangement, and transclusion are all easily definable. (However, some attention must be paid in the design of appropriate editing programs to such features as paired tags defining attribute fields, and when attribute fields are separated and joined, the editing program must behave accordingly.)

Transclusion with variation. The text may be transcluded (re-used by reference as a virtual quotation) in any online document. Transcluding authors may apply their own parallel markup streams.
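The pluralism and editability points can be sketched together: two independent markup streams over one text, both adjusted by the same insertion. Streams here are lists of (gap, name) pairs with incremental counts. The function `shift_stream` is hypothetical, and its policy that a tag sitting exactly at the insertion point stays in front of the new text is an assumption; a real editor would have to decide this case deliberately.

```python
def shift_stream(stream, at, length):
    """Adjust a gap-counted tag stream for `length` characters inserted at `at`.

    Only the first tag lying after the insertion point changes its gap;
    every later tag keeps its count, which is the benefit of incremental counting.
    """
    pos = 0
    shifted = False
    out = []
    for gap, name in stream:
        pos += gap
        if not shifted and pos > at:
            gap += length
            shifted = True
        out.append((gap, name))
    return out


text = "The quick brown fox"
house_style = [(4, "b"), (5, "/b"), (7, "i"), (3, "/i")]  # one formatting of the text
quoter_style = [(10, "em"), (5, "/em")]                   # a different, independent one

at, new = 4, "very "
edited_text = text[:at] + new + text[at:]
edited_house = shift_stream(house_style, at, len(new))
edited_quoter = shift_stream(quoter_style, at, len(new))

print(edited_text)    # The very quick brown fox
print(edited_house)   # [(4, 'b'), (10, '/b'), (7, 'i'), (3, '/i')]
print(edited_quoter)  # [(15, 'em'), (5, '/em')]
```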

How can parallel markup be fitted into the SGML model? Easily, as a variant form to be used for various legitimate purposes. Taking an SGML file to a parallel format is in most cases a reversible, non-destructive, non-lossy transformation.

Thus I believe we should call it "the Parallel Representation of SGML," and make it an optional part of the SGML standard.
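The reversibility claimed above can be checked on simple input. The sketch below pulls the tags out of an embedded string into a gap-counted parallel stream and then puts them back. The functions `split_markup` and `join_markup` are assumptions for illustration, and the regular expression treats anything between `<` and `>` as a tag, so comments, entities, and marked sections (some of the cases outside "most cases") are not handled.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def split_markup(embedded):
    """Separate an embedded string into clean text and a (gap, tag) stream."""
    text_parts = []
    stream = []
    cursor = 0      # position in the embedded string
    since_last = 0  # text characters since the previous tag
    for m in TAG.finditer(embedded):
        chunk = embedded[cursor:m.start()]
        text_parts.append(chunk)
        since_last += len(chunk)
        stream.append((since_last, m.group(1)))
        since_last = 0
        cursor = m.end()
    text_parts.append(embedded[cursor:])
    return "".join(text_parts), stream


def join_markup(text, stream):
    """Inverse of split_markup: re-embed the tags at their counted positions."""
    out = []
    cursor = 0
    for gap, name in stream:
        out.append(text[cursor:cursor + gap])
        out.append(f"<{name}>")
        cursor += gap
    out.append(text[cursor:])
    return "".join(out)


embedded = "<p>The <b>quick</b> fox</p>"
text, stream = split_markup(embedded)
print(text)    # The quick fox
print(stream)  # [(0, 'p'), (4, 'b'), (5, '/b'), (4, '/p')]
print(join_markup(text, stream) == embedded)  # True
```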

Alternative method 2: tag override

Where it is inconvenient to break out the tags into a parallel stream--i.e., where they're already stuck or published in the original--we may fall back on the method of tag override. By this I mean simply treating the original tags as if they are not there; ignoring them while counting through the contents and furnishing instead a parallel tag stream, as in parallel markup. We do not dislodge the original markup, but simply ignore it.

This is smarmier at the implementation level, losing the benefit of clean counting and requiring a more complex editing apparatus. Otherwise it has the advantages of parallel markup: pluralism, editability, and transclusion with variation.

Note that this is tag override, not overload, since no symbol is being redefined.
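A sketch of tag override under simple assumptions: the published string keeps its embedded tags, the quoting author supplies a gap-counted (gap, name) stream, and the renderer counts only content characters while passing over the original tags. The function `render_with_override` is hypothetical, and anything between `<` and `>` is assumed to be a tag.

```python
import re

TAG = re.compile(r"<[^<>]+>")


def render_with_override(published, own_stream):
    """Render `published` with the quoter's markup, ignoring its original tags."""
    # Convert the quoter's incremental gaps to absolute content positions.
    pending = []
    pos = 0
    for gap, name in own_stream:
        pos += gap
        pending.append((pos, name))

    out = []
    count = 0  # content characters seen so far; original tags are not counted
    i = 0      # position in the published string
    k = 0      # next tag of the quoter's stream to emit
    while i < len(published):
        m = TAG.match(published, i)
        if m:
            i = m.end()  # original tag: stepped over, neither counted nor emitted
            continue
        while k < len(pending) and pending[k][0] == count:
            out.append(f"<{pending[k][1]}>")
            k += 1
        out.append(published[i])
        count += 1
        i += 1
    # Any tags not yet emitted, normally those positioned at the very end.
    out.extend(f"<{name}>" for _, name in pending[k:])
    return "".join(out)


published = "The <b>quick</b> brown fox"  # the original author's emphasis
emphasis_mine = [(10, "i"), (5, "/i")]    # the quoter's emphasis, on "brown"

print(render_with_override(published, emphasis_mine))  # The quick <i>brown</i> fox
```

The extra work compared with clean parallel markup is visible in the loop: every count has to step around tags that are present but unwanted.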

Objection 3: Structures That Don't Fit

When SGML fanciers say "structure," they mean structure where everything is contained and sequential, with no overlap, no sharing of material in two places, no relations uncontained. SGML advocates I have talked to appear to have the belief that everything is either sequential and hierarchical, or can be represented that way. What is not expressible sequentially and hierarchically is deemed to be nonexistent, inconceivable, evil, or mistaken.

I believe that embedded structure, enforcing sequence and hierarchy, limits the kinds of structure that can be expressed. The question we must ask is: What is the real structure of a thing or a document? (And does it reasonably fit the allowed spectrum of variation within the notational system?)

You can always force structures into other structures and claim that they're undamaged; another way to say this is that if you smash things up it is easier to make them fit. Enforcing sequence and hierarchy simply restricts the possibilities.
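One small case of structure that does not nest: a quotation that begins inside one sentence and ends inside the next. Described as spans over clean text, the overlap is unremarkable; as embedded tags, one of the elements would have to be split. The sample text, the span names, and the `crosses` helper below are invented for illustration.

```python
from itertools import combinations

text = "The cat sat down. Then it went to sleep."

# Each span is (start, end) over the clean text, end exclusive.
spans = {
    "sentence-1": (0, 17),
    "sentence-2": (18, 40),
    "quotation": (8, 25),  # "sat down. Then it"
}


def crosses(a, b):
    """True if two spans partially overlap, so neither can contain the other."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return a0 < b0 < a1 < b1


overlapping = [
    (m, n)
    for (m, a), (n, b) in combinations(spans.items(), 2)
    if crosses(a, b)
]

print(text[8:25])
print(overlapping)
```

Every pair reported in `overlapping` is a pair that a strictly hierarchical notation can represent only by breaking one of the two elements apart.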

Like a TV dinner, embedded markup nominally contains everything you could want. "What else could you possibly want?" means "It's not on the menu."

Exactly Representing Thought and Change

My principal long-term concern is the exact representation of human thought, especially that thought put into words and writing. But the sequentiality of words and of old-fashioned writing has until now compromised that representation, requiring authors to force sequence on their material and curtail its interconnections. Designing editorial systems for exact and deep representation is therefore my objective.

This issue creates a very different focus from that of the markup community: the task I see is not merely to represent frozen objects tossed over the transom by an author or management, or format static structures for printout or screen, but to maintain a continuing evolutionary base of material and to track the changes in it.

To find the support functions really needed for creative organization by authors and editors, we must understand the exact representation and presentation of human thought, and be able to track the continuities of structure and change.

This means we must find a stable means of representing structure very different from the sequential and hierarchical--a representation of structure which recognizes the most anarchic and overlapping relations, and the location of identical and corresponding materials in different versions; which recognizes and maintains constancies of structure and data across successive versions, even as addresses of these materials become unpredictably fragmented by editing.

Thus deep version management--knowing locations of shared materials to the byte level--is a vital problem to solve in the design of editing systems. And the same location management is necessary on a much broader scale to support transpublishing.

Embedded markup cannot represent this at all, and merely adds obstacles (impeded data structure) to solving these rich addressing problems.

Three Layers

I believe we should find a very general representational system, a reference model which breaks apart in parallel what is represented by SGML and HTML. This would make the creation of deep editing and version management methods much easier. By handling contents, structure, and special effects separately in such a reference model, the parts can be better understood and worked on, and far more general structures can be represented.

I would propose a three-layer model:[8]

A content layer to facilitate editing, content linking, and transclusion management.

A structure layer, declarable separately. Users should be able to specify entities, connections and co-presence logic, defined independently of appearance or size or contents; as well as overlay correspondence, links, transclusions, and "hoses" for movable content.

Finally, a special-effects-and-primping layer should allow the declaration of ever-so-many fonts, format blocks, fanfares, and whizbangs, and their assignment to what's in the content and structure layers.
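One way to picture the three layers as separate data, offered only as a sketch. The class names `Span`, `Structure`, and `Presentation`, the sample content, and the style keys are all assumptions; the proposal here does not fix any concrete format, and "hoses," co-presence logic, and overlay correspondence are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    start: int
    length: int


@dataclass
class Structure:
    """Structure layer: named entities over the content, and links between them."""
    entities: dict[str, Span] = field(default_factory=dict)
    links: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class Presentation:
    """Special-effects layer: appearance assigned to entities by name."""
    styles: dict[str, dict[str, str]] = field(default_factory=dict)


# Content layer: plain, countable text.
content = "Embedded markup considered harmful. Parallel markup proposed."

structure = Structure(
    entities={"claim": Span(0, 35), "remedy": Span(36, 25)},
    links=[("remedy", "claim")],
)

# Two independent presentations of the same content and structure.
print_look = Presentation(styles={"claim": {"font": "bold"}})
screen_look = Presentation(
    styles={"claim": {"color": "red"}, "remedy": {"font": "italic"}}
)


def resolve(name, content, structure):
    """Look up the text an entity refers to."""
    span = structure.entities[name]
    return content[span.start:span.start + span.length]


print(resolve("claim", content, structure))
print(resolve("remedy", content, structure))
```

Because each layer refers to the one beneath it only by position or by name, any one of them can be replaced or edited without rewriting the others.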

I believe that a parallel system of this kind will soon become necessary because of the degree of entanglement and unmanageability of HTML. But we must learn from the recent past and provide sufficient abstractness and generality.

Conclusion

For editing and transpublishing, there are serious shortcomings to embedded markup. I believe that embedded markup, daily more tangled, will implode and leave HTML as an output format, supplanted by deeper editors and deeper hypermedia forms. In the meantime it is necessary to find other solutions to its shortcomings for transpublishing, especially parallelized tag models.

Few understand the true nature of hypertext and its relation to thought, let alone the vast interconnection of ideas, and the way that most expressions of ideas sever and misrepresent them. Today's popular but trivially-structured Web hypertext has excused people from seeing the real hypertext issues, or being able to create and publish deep complexes of thought.

We greatly need a general structure to represent all forms of interconnection and structure, and changes in both content and structure; and to visualize and re-use variants and alternatives, comparing them in context in order to understand and choose.

Mapping these serious concerns to an SGML-HTML template is not a minor inconvenience but an impossible violation of the problem.

Of course, people always try to fit information into a familiar mold, even when that structure has shown itself inhospitable, unshaped to that information. C. Northcote Parkinson has pointed out [4] that the fullest flowering of a paradigm, at least as seen by its participants--all gaps closed and issues unseen, the people no longer aware that there are any unsatisfied problems--may indicate that the paradigm is near its end.

[1] Theodor Holm Nelson, Literary Machines. Mindful Press; latest edition available from Eastgate Systems, Cambridge, Massachusetts.

[2] Andrew Pam, "Fine-Grain Transclusion in the Hypertext Markup Language." Available at www.xanadu.net/xanadu/draft-pam-html-fine-trans-00.txt

[3] Theodor Holm Nelson, "Transcopyright: Dealing with the Dilemma of Digital Copyright." Educom Review, vol. 30, Jan/Feb 1997, p. 32.

[4] C. Northcote Parkinson, Parkinson's Law.

About the Author

Theodor Holm Nelson ted@xanadu.net

Theodor Holm Nelson, designer and generalist, has been a software designer and theorist since 1960 and a software consultant since 1967. His principal design work includes Project Xanadu and xanalogical systems, the transcopyright system, and the theory of virtuality design. His industry positions include Harcourt Brace & World publishers, Creative Computing Magazine, Datapoint Corporation, and Autodesk, Inc.; his university positions include Vassar College, University of Illinois, Swarthmore College, Strathclyde University, and Keio University.

Mr. Nelson has written several books, the most recent being The Future of Information (1997), as well as numerous articles, lectures, and presentations. He is best known for discovering the hypertext concept and for coining various words which have become popular, such as "hypertext," "hypermedia," "cybercrud," "softcopy," "electronic visualization," "dildonics," "technoid," "docuverse," and "transclusion."

He received a B.A. in Philosophy from Swarthmore College in 1959 and an M.A. in Social Relations from Harvard in 1963.