Please refer to the errata for this document, which may include some normative corrections.

The World Wide Web uses relatively simple technologies with sufficient scalability, efficiency and utility that they have resulted in a remarkable information space of interrelated resources, growing across languages, cultures, and media. In an effort to preserve these properties of the information space as the technologies evolve, this architecture document discusses the core design components of the Web. They are identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space. We relate core design components, constraints, and good practices to the principles and properties they support.

This is the 15 December 2004 Recommendation of “Architecture of the World Wide Web, Volume One.” This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The following principles, constraints, and good practice notes are discussed in this document and listed here for convenience. There is also a free-standing summary.

Many of the examples in this document that involve human activity suppose the familiar Web interaction model (illustrated at the beginning of the Introduction) where a person follows a link via a user agent, the user agent retrieves and presents data, the user follows another link, etc. This document does not discuss in any detail other interaction models such as voice browsing (see, for example, [VOICEXML2]). The choice of interaction model may have an impact on expected agent behavior. For instance, when a graphical user agent running on a laptop computer or hand-held device encounters an error, the user agent can report errors directly to the user through visual and audio cues, and present the user with options for resolving the errors. On the other hand, when someone is browsing the Web through voice input and audio-only output, stopping the dialog to wait for user input may reduce usability since it is so easy to "lose one's place" when browsing with only audio output. This document does not discuss how the principles, constraints, and good practices identified here apply in all interaction contexts.

This document strives for a balance between brevity and precision while including illustrative examples. TAG findings are informational documents that complement the current document by providing more detail about selected topics. This document includes some excerpts from the findings. Since the findings evolve independently, this document includes references to approved TAG findings. For other TAG issues covered by this document but without an approved finding, references are to entries in the TAG issues list.

This document presents the general architecture of the Web. Other groups inside and outside W3C also address specialized aspects of Web architecture, including accessibility, quality assurance, internationalization, device independence, and Web Services. The section on Architectural Specifications (§7.1) includes references to these related specifications.

Note: This document does not distinguish in any formal way the terms "language" and "format." Context determines which term is used. The phrase "specification designer" encompasses language, format, and protocol designers.

This document is intended to inform discussions about issues of Web architecture. The intended audience for this document includes:

The terms MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used in the principles, constraints, and good practice notes in accordance with RFC 2119 [RFC2119].

This document describes the properties we desire of the Web and the design choices that have been made to achieve them. It promotes the reuse of existing standards when suitable, and gives guidance on how to innovate in a manner consistent with Web architecture.

In the remainder of this document, we highlight important architectural points regarding Web identifiers, protocols, and formats. We also discuss some important general architectural principles (§5) and how they apply to the Web.

Nadia's browser is configured and programmed to interpret the receipt of an "application/xhtml+xml" typed representation as an instruction to render the content of that representation according to the XHTML rendering model, including any subsidiary interactions (such as requests for external style sheets or in-line images) called for by the representation. In the scenario, the XHTML representation data received from the initial request instructs Nadia's browser to also retrieve and render in-line the weather maps, each identified by a URI and thus causing an additional retrieval action, resulting in additional representations that are processed by the browser according to their own data formats (e.g., "application/svg+xml" indicates the SVG data format), and this process continues until all of the data formats have been rendered. The result of all of this processing, once the browser has reached an application steady-state that completes Nadia's initial requested action, is commonly referred to as a "Web page".
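The dispatch on Internet media type described above can be sketched in Python. This is only an illustration under stated assumptions, not a description of how any particular browser is built: the `HANDLERS` table, its descriptive strings, and the function names are hypothetical. The sketch parses a "Content-Type" field value with the standard library and selects a processing model from the media type alone.

```python
from email.message import Message

# Hypothetical table mapping a media type to a processing model.
HANDLERS = {
    "application/xhtml+xml": "render as XHTML",
    "application/svg+xml": "render as SVG",
}


def media_type(content_type_value: str) -> str:
    """Return the lowercase media type, without parameters such as charset."""
    message = Message()
    message["Content-Type"] = content_type_value
    return message.get_content_type()


def choose_handler(content_type_value: str) -> str:
    """Pick a processing model from the declared media type."""
    # Fallback for formats this hypothetical agent does not know how to process.
    return HANDLERS.get(media_type(content_type_value), "offer to save the data")
```

Under these assumptions, `choose_handler("application/xhtml+xml; charset=utf-8")` selects the XHTML entry, and a media type that is not in the table falls through to the fallback string.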

Formats (§4). Most protocols used for representation retrieval and/or submission make use of a sequence of one or more messages, which taken together contain a payload of representation data and metadata, to transfer the representation between agents. The choice of interaction protocol places limits on the formats of representation data and metadata that can be transmitted. HTTP, for example, typically transmits a single octet stream plus metadata, and uses the "Content-Type" and "Content-Encoding" header fields to further identify the format of the representation. In this scenario, the representation transferred is in XHTML, as identified by the "Content-Type" HTTP header field containing the registered Internet media type name, "application/xhtml+xml". That Internet media type name indicates that the representation data can be processed according to the XHTML specification.

Interaction (§3). Web agents communicate using standardized protocols that enable interaction through the exchange of messages which adhere to a defined syntax and semantics. By entering a URI into a retrieval dialog or selecting a hypertext link, Nadia tells her browser to perform a retrieval action for the resource identified by the URI. In this example, the browser sends an HTTP GET request (part of the HTTP protocol) to the server at "weather.example.com", via TCP/IP port 80, and the server sends back a message containing what it determines to be a representation of the resource as of the time that representation was generated. Note that this example is specific to hypertext browsing of information; other kinds of interaction are possible, both within browsers and through the use of other types of Web agent. Our example is intended to illustrate one common interaction, not to define the range of possible interactions or limit the ways in which agents might use the Web.

Identification (§2). URIs are used to identify resources. In this travel scenario, the resource is a periodically updated report on the weather in Oaxaca, and the URI is “http://weather.example.com/oaxaca”.
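As a rough sketch of how the URI in the scenario leads to a retrieval request, the following Python derives the server name, the TCP port, and an HTTP GET message from an "http" URI. The function name, the choice of HTTP/1.1, and the "Accept" value are assumptions made for illustration. The sketch builds the message without sending it.

```python
from typing import Tuple
from urllib.parse import urlsplit


def build_get_request(uri: str) -> Tuple[str, int, str]:
    """Return (host, port, request message) for retrieving an 'http' URI."""
    parts = urlsplit(uri)
    if parts.scheme != "http" or parts.hostname is None:
        raise ValueError("this sketch only handles 'http' URIs with a host")
    host = parts.hostname
    port = parts.port or 80  # port 80 is the default for the "http" scheme
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    host_field = host if port == 80 else f"{host}:{port}"
    message = (
        f"GET {target} HTTP/1.1\r\n"
        f"Host: {host_field}\r\n"
        "Accept: application/xhtml+xml\r\n"  # assumed preference, for illustration
        "\r\n"
    )
    return host, port, message
```

For "http://weather.example.com/oaxaca" this yields the host "weather.example.com", port 80, and a request line of `GET /oaxaca HTTP/1.1`, matching the interaction described in the scenario.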

This scenario illustrates the three architectural bases of the Web that are discussed in this document:

While planning a trip to Mexico, Nadia reads “Oaxaca weather information: 'http://weather.example.com/oaxaca'” in a glossy travel magazine. Nadia has enough experience with the Web to recognize that "http://weather.example.com/oaxaca" is a URI and that she is likely to be able to retrieve associated information with her Web browser. When Nadia enters the URI into her browser:

Examples such as the following travel scenario are used throughout this document to illustrate typical behavior of Web agents: people or software acting on this information space. A user agent acts on behalf of a user. Software agents include servers, proxies, spiders, browsers, and multimedia players.

The World Wide Web (WWW, or simply Web) is an information space in which the items of interest, referred to as resources, are identified by global identifiers called Uniform Resource Identifiers (URI).

In order to communicate internally, a community agrees (to a reasonable extent) on a set of terms and their meanings. One goal of the Web, since its inception, has been to build a global community in which any party can share information with any other party. To achieve this goal, the Web makes use of a single global identification system: the URI. URIs are a cornerstone of Web architecture, providing identification that is common across the Web. The global scope of URIs promotes large-scale "network effects": the value of an identifier increases the more it is used consistently (for example, the more it is used in hypertext links (§4.4)).

Principle: Global Identifiers
Global naming leads to global network effects.

This principle dates back at least as far as Douglas Engelbart's seminal work on open hypertext systems; see section Every Object Addressable in [Eng90].

2.1. Benefits of URIs

The choice of syntax for global identifiers is somewhat arbitrary; it is their global scope that is important. The Uniform Resource Identifier, [URI], has been successfully deployed since the creation of the Web. There are substantial benefits to participating in the existing network of URIs, including linking, bookmarking, caching, and indexing by search engines, and there are substantial costs to creating a new identification system that has the same properties as URIs.

Good practice: Identify with URIs
To benefit from and increase the value of the World Wide Web, agents should provide URIs as identifiers for resources.

A resource should have an associated URI if another party might reasonably want to create a hypertext link to it, make or refute assertions about it, retrieve or cache a representation of it, include all or part of it by reference into another representation, annotate it, or perform other operations on it. Software developers should expect that sharing URIs across applications will be useful, even if that utility is not initially evident. The TAG finding "URIs, Addressability, and the use of HTTP GET and POST" discusses additional benefits and considerations of URI addressability.

Note: Some URI schemes (such as the "ftp" URI scheme specification) use the term "designate" where this document uses "identify."

2.3. URI Comparisons

URIs that are identical, character-by-character, refer to the same resource. Since Web Architecture allows the association of multiple URIs with a given resource, two URIs that are not character-by-character identical may still refer to the same resource. Different URIs do not necessarily refer to different resources, but there is generally a higher computational cost to determine that different URIs refer to the same resource.

To reduce the risk of a false negative (i.e., an incorrect conclusion that two URIs do not refer to the same resource) or a false positive (i.e., an incorrect conclusion that two URIs do refer to the same resource), some specifications describe equivalence tests in addition to character-by-character comparison.

Agents that reach conclusions based on comparisons that are not licensed by the relevant specifications take responsibility for any problems that result; see the section on error handling (§5.3) for more information about responsible behavior when reaching unlicensed conclusions. Section 6 of [URI] provides more information about comparing URIs and reducing the risk of false negatives and positives.

See also the assertion that two URIs identify the same resource (§2.7.2).

2.3.1. URI aliases

Although there are benefits (such as naming flexibility) to URI aliases, there are also costs. URI aliases are harmful when they divide the Web of related resources. A corollary of Metcalfe's Principle (the "network effect") is that the value of a given resource can be measured by the number and value of other resources in its network neighborhood, that is, the resources that link to it.

The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource.

Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

URI consumers also have a role in ensuring URI consistency. For instance, when transcribing a URI, agents should not gratuitously percent-encode characters. The term "character" refers to URI characters as defined in section 2 of [URI]; percent-encoding is discussed in section 2.1 of that specification.

Good practice: Consistent URI usage
An agent that receives a URI SHOULD refer to the associated resource using the same URI, character-by-character.

When a URI alias does become common currency, the URI owner should use protocol techniques such as server-side redirects to relate the two resources. The community benefits when the URI owner supports redirection of an aliased URI to the corresponding "official" URI. For more information on redirection, see section 10.3, Redirection, in [RFC2616]. See also [CHIPS] for a discussion of some best practices for server administrators.

2.3.2. Representation reuse

URI aliasing only occurs when more than one URI is used to identify the same resource. The fact that different resources sometimes have the same representation does not make the URIs for those resources aliases.

Story

Dirk would like to add a link from his Web site to the Oaxaca weather site. He uses the URI http://weather.example.com/oaxaca and labels his link “report on weather in Oaxaca on 1 August 2004”. Nadia points out to Dirk that he is setting misleading expectations for the URI he has used. The Oaxaca weather site policy is that the URI in question identifies a report on the current weather in Oaxaca (on any given day) and not the weather on 1 August. Of course, on the first of August in 2004, Dirk's link will be correct, but the rest of the time he will be misleading readers. Nadia points out to Dirk that the managers of the Oaxaca weather site do make available a different URI permanently assigned to a resource reporting on the weather on 1 August 2004.

In this story, there are two resources: “a report on the current weather in Oaxaca” and “a report on the weather in Oaxaca on 1 August 2004”. The managers of the Oaxaca weather site assign two URIs to these two different resources. On 1 August 2004, the representations for these resources are identical. The fact that dereferencing two different URIs produces identical representations does not imply that the two URIs are aliases.
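The equivalence tests mentioned in section 2.3 can be illustrated with a small Python sketch of syntax-based normalization in the spirit of section 6 of [URI]. It lowercases the scheme and host, uppercases the hexadecimal digits of percent-encodings, decodes percent-encoded unreserved characters, drops the default port for "http", and replaces an empty path with "/". The function names and the `DEFAULT_PORTS` table are assumptions of this sketch, and it covers only a subset of the normalizations that specification describes.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": "80"}  # illustrative; only "http" is covered here


def _normalize_escapes(text: str) -> str:
    """Decode escaped unreserved characters; uppercase the remaining escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def normalize(uri: str) -> str:
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Only the host and port are case-insensitive; leave any userinfo untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    hostport = hostport.lower()
    host, colon, port = hostport.rpartition(":")
    if colon and port == DEFAULT_PORTS.get(scheme):
        hostport = host
    path = _normalize_escapes(parts.path)
    if hostport and not path:
        path = "/"
    query = _normalize_escapes(parts.query)
    return urlunsplit((scheme, userinfo + at + hostport, path, query, parts.fragment))


def equivalent(uri_a: str, uri_b: str) -> bool:
    """True when the two URIs are identical after normalization."""
    return normalize(uri_a) == normalize(uri_b)
```

A `True` result reflects only the normalizations coded here. A `False` result does not show that the URIs identify different resources, and, as the story above illustrates, identical representations do not make two URIs aliases.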

2.4. URI Schemes

In the URI "http://weather.example.com/", the "http" that appears before the colon (":") names a URI scheme. Each URI scheme has a specification that explains the scheme-specific details of how scheme identifiers are allocated and become associated with a resource. The URI syntax is thus a federated and extensible naming system wherein each scheme's specification may further restrict the syntax and semantics of identifiers within that scheme.

Examples of URIs from various schemes include:

mailto:joe@example.org

ftp://example.org/aDirectory/aFile

news:comp.infosystems.www

tel:+1-816-555-1212

ldap://ldap.example.org/c=GB?objectClass?one

urn:oasis:names:tc:entity:xmlns:xml:catalog

While Web architecture allows the definition of new schemes, introducing a new scheme is costly. Many aspects of URI processing are scheme-dependent, and a large amount of deployed software already processes URIs of well-known schemes. Introducing a new URI scheme requires the development and deployment not only of client software to handle the scheme, but also of ancillary agents such as gateways, proxies, and caches. See [RFC2718] for other considerations and costs related to URI scheme design.

Because of these costs, if a URI scheme exists that meets the needs of an application, designers should use it rather than invent one.

Good practice: Reuse URI schemes
A specification SHOULD reuse an existing URI scheme (rather than create a new one) when it provides the desired properties of identifiers and their relation to resources.

Consider our travel scenario: should the agent providing information about the weather in Oaxaca register a new URI scheme "weather" for the identification of resources related to the weather? They might then publish URIs such as "weather://travel.example.com/oaxaca". When a software agent dereferences such a URI, if what really happens is that HTTP GET is invoked to retrieve a representation of the resource, then an "http" URI would have sufficed.

2.4.1. URI Scheme Registration

The Internet Assigned Numbers Authority (IANA) maintains a registry [IANASchemes] of mappings between URI scheme names and scheme specifications. For instance, the IANA registry indicates that the "http" scheme is defined in [RFC2616]. The process for registering a new URI scheme is defined in [RFC2717].

Unregistered URI schemes SHOULD NOT be used for a number of reasons:

There is no generally accepted way to locate the scheme specification.

Someone else may be using the scheme for other purposes.

One should not expect that general-purpose software will do anything useful with URIs of this scheme beyond URI comparison.

One misguided motivation for registering a new URI scheme is to allow a software agent to launch a particular application when retrieving a representation. The same thing can be accomplished at lower expense by dispatching instead on the type of the representation, thereby allowing use of existing transfer protocols and implementations.

Even if an agent cannot process representation data in an unknown format, it can at least retrieve it. The data may contain enough information to allow a user or user agent to make some use of it. When an agent does not handle a new URI scheme, it cannot retrieve a representation.

When designing a new data format, the preferred mechanism to promote its deployment on the Web is the Internet media type (see Representation Types and Internet Media Types (§3.2)). Media types also provide a means for building new information applications, as described in future directions for data formats (§4.6).
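To make the cost of a new scheme concrete, here is a small Python sketch of scheme-based dispatch. It extracts the scheme name with a regular expression that follows the generic scheme syntax of [URI] (a letter followed by letters, digits, "+", "-", or "."), then checks it against the schemes an agent implements. The `HANDLED_SCHEMES` set and the function names are hypothetical, and the set is not the IANA registry.

```python
import re

# Generic scheme syntax: a letter, then letters, digits, "+", "-", or ".".
SCHEME_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# Hypothetical set of schemes that some general-purpose agent implements.
HANDLED_SCHEMES = {"http", "ftp", "mailto"}


def scheme_of(uri: str) -> str:
    """Return the lowercase scheme name of an absolute URI."""
    match = SCHEME_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"no scheme found in {uri!r}")
    return match.group(1).lower()


def can_dereference(uri: str) -> bool:
    """An agent can only attempt retrieval for schemes it implements."""
    return scheme_of(uri) in HANDLED_SCHEMES
```

With this table, `can_dereference("weather://travel.example.com/oaxaca")` is false, while the "http" URI from the travel scenario can be handed to existing retrieval code. This mirrors the point above that an agent that does not handle a scheme cannot retrieve a representation at all.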

2.5. URI Opacity

It is tempting to guess the nature of a resource by inspection of a URI that identifies it. However, the Web is designed so that agents communicate resource information state through representations, not identifiers. In general, one cannot determine the type of a resource representation by inspecting a URI for that resource. For example, the ".html" at the end of "http://example.com/page.html" provides no guarantee that representations of the identified resource will be served with the Internet media type "text/html". The publisher is free to allocate identifiers and define how they are served. The HTTP protocol does not constrain the Internet media type based on the path component of the URI; the URI owner is free to configure the server to return a representation using PNG or any other data format.

Resource state may evolve over time. Requiring a URI owner to publish a new URI for each change in resource state would lead to a significant number of broken references. For robustness, Web architecture promotes independence between an identifier and the state of the identified resource.

Good practice: URI opacity
Agents making use of URIs SHOULD NOT attempt to infer properties of the referenced resource.

In practice, a small number of inferences can be made because they are explicitly licensed by the relevant specifications. Some of these inferences are discussed in the details of retrieving a representation (§3.1.1).

The example URI used in the travel scenario ("http://weather.example.com/oaxaca") suggests to a human reader that the identified resource has something to do with the weather in Oaxaca. A site reporting the weather in Oaxaca could just as easily be identified by the URI "http://vjc.example.com/315". And the URI "http://weather.example.com/vancouver" might identify the resource "my photo album."

On the other hand, the URI "mailto:joe@example.com" indicates that the URI refers to a mailbox. The "mailto" URI scheme specification authorizes agents to infer that URIs of this form identify Internet mailboxes.

Some URI assignment authorities document and publish their URI assignment policies. For more information about URI opacity, see TAG issues metaDataInURI-31 and siteData-36.
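The ".html" example in this section can be restated as a short Python sketch that contrasts the unlicensed inference with the metadata an agent should rely on. The suffix table and both function names are assumptions for illustration only. The first function performs the guess that the good practice note discourages, and the second ignores the URI and uses the media type declared with the representation.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical suffix table of the kind a guessing agent might keep.
SUFFIX_GUESSES = {".html": "text/html", ".png": "image/png"}


def guess_from_uri(uri: str):
    """The unlicensed inference: guess a media type from the path suffix."""
    suffix = posixpath.splitext(urlsplit(uri).path)[1].lower()
    return SUFFIX_GUESSES.get(suffix)


def media_type_to_use(uri: str, declared_content_type: str) -> str:
    """Use the metadata sent with the representation; the URI plays no part."""
    return declared_content_type.split(";", 1)[0].strip().lower()
```

If the owner of "http://example.com/page.html" configures the server to return PNG data, the guess still says "text/html" while the declared type says "image/png". Only the declared type describes the representation actually received.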