HTML Living Standard — Last Updated

12.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance checkers.

The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".

User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules. Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis. Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML tools and the XML serialization of HTML.

For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.

As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.

12.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.

Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.

In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.

In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token: ...<script>document.write('<p>');</script>...

To handle these cases, parsers have a script nesting level , which must be initially set to zero, and a parser pause flag , which must be initially set to false.

12.2.2 Parse errors

This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.

Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.

Error descriptions in the table below are non-normative.

12.2.3 The input byte stream

The stream of code points that comprises the input to the tokenization stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a particular character encoding, which the user agent uses to decode the bytes into characters.

For XML documents, the algorithm user agents are required to use to determine the character encoding is given by XML. This section does not apply to XML documents. [XML]

Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.

Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream, by passing the input byte stream and character encoding to decode.

A leading Byte Order Mark (BOM) causes the character encoding argument to be ignored and will itself be skipped.
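As an illustration, a byte-level decode step that honors a leading BOM over the declared encoding might look like the following (a minimal Python sketch; the codec names are Python's, and error handling is simplified relative to Encoding's decode algorithm):

```python
def decode_input_byte_stream(data: bytes, declared: str) -> str:
    """Decode, letting a leading BOM override the declared encoding."""
    if data.startswith(b"\xef\xbb\xbf"):          # UTF-8 BOM: skip 3 bytes
        return data[3:].decode("utf-8", "replace")
    if data.startswith(b"\xfe\xff"):              # UTF-16BE BOM: skip 2 bytes
        return data[2:].decode("utf-16-be", "replace")
    if data.startswith(b"\xff\xfe"):              # UTF-16LE BOM: skip 2 bytes
        return data[2:].decode("utf-16-le", "replace")
    return data.decode(declared, "replace")       # no BOM: use declared encoding
```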

Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]

The decoder algorithms describe how to handle invalid input; for security reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte sequences are handled can result in, amongst other problems, script injection vulnerabilities ("XSS").

When the HTML parser is decoding an input byte stream, it uses a character encoding and a confidence. The confidence is either tentative, certain, or irrelevant. The encoding used, and whether the confidence in that encoding is tentative or certain, is used during the parsing to determine whether to change the encoding. If no encoding is necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a character encoding at all, then the confidence is irrelevant.

Some algorithms feed the parser by directly adding characters to the input stream rather than adding bytes to the input byte stream.

12.2.3.1 Parsing with a known character encoding

When the HTML parser is to operate on an input byte stream that has a known definite encoding, then the character encoding is that encoding and the confidence is certain.

12.2.3.2 Determining the character encoding

In some cases, it might be impractical to unambiguously determine the encoding before parsing the document. Because of this, this specification provides for a two-pass mechanism with an optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing algorithm to whatever bytes they have available before beginning to parse the document. Then, the real parser is started, using a tentative encoding derived from this pre-parse and other out-of-band metadata. If, while the document is being loaded, the user agent discovers a character encoding declaration that conflicts with this information, then the parser can get reinvoked to perform a parse of the document with the real encoding.

User agents must use the following algorithm, called the encoding sniffing algorithm, to determine the character encoding to use when decoding a document in the first pass. This algorithm takes as input any out-of-band metadata available to the user agent (e.g. the Content-Type metadata of the document) and all the bytes available so far, and returns a character encoding and a confidence that is either tentative or certain.

The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.

When an algorithm requires a user agent to prescan a byte stream to determine its encoding, given some defined end condition, then it must run the following steps. These steps either abort unsuccessfully or return a character encoding. If at any point during these steps (including during instances of the get an attribute algorithm invoked by this one) the user agent either runs out of bytes (meaning the position pointer created in the first step below goes beyond the end of the byte stream obtained so far) or reaches its end condition, then abort the prescan a byte stream to determine its encoding algorithm unsuccessfully.

1. Let position be a pointer to a byte in the input byte stream, initially pointing at the first byte.

2. Loop: If position points to:

   A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (`<!--`)

      Advance the position pointer so that it points at the first 0x3E byte which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the '<!--' sequence.)

   A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)

      1. Advance the position pointer so that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in the sequence of bytes matched above).
      2. Let attribute list be an empty list of strings.
      3. Let got pragma be false.
      4. Let need pragma be null.
      5. Let charset be the null value (which, for the purposes of this algorithm, is distinct from an unrecognized encoding or the empty string).
      6. Attributes: Get an attribute and its value.
      7. If no attribute was sniffed, then jump to the processing step below.
      8. If the attribute's name is already in attribute list, then return to the step labeled attributes.
      9. Add the attribute's name to attribute list.
      10. Run the appropriate step from the following list, if one applies:

          If the attribute's name is "http-equiv"
             If the attribute's value is "content-type", then set got pragma to true.
          If the attribute's name is "content"
             Apply the algorithm for extracting a character encoding from a meta element, giving the attribute's value as the string to parse. If a character encoding is returned, and if charset is still set to null, let charset be the encoding returned, and set need pragma to true.
          If the attribute's name is "charset"
             Let charset be the result of getting an encoding from the attribute's value, and set need pragma to false.

      11. Return to the step labeled attributes.
      12. Processing: If need pragma is null, then jump to the step below labeled next byte.
      13. If need pragma is true but got pragma is false, then jump to the step below labeled next byte.
      14. If charset is failure, then jump to the step below labeled next byte.
      15. If charset is UTF-16BE/LE, then set charset to UTF-8.
      16. If charset is x-user-defined, then set charset to windows-1252.
      17. Abort the prescan a byte stream to determine its encoding algorithm, returning the encoding given by charset.

   A sequence of bytes starting with a 0x3C byte (<), optionally a 0x2F byte (/), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (A-Z or a-z)

      1. Advance the position pointer so that it points at the next 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>) byte.
      2. Repeatedly get an attribute until no further attributes can be found, then jump to the step below labeled next byte.

   A sequence of bytes starting with: 0x3C 0x21 (`<!`)
   A sequence of bytes starting with: 0x3C 0x2F (`</`)
   A sequence of bytes starting with: 0x3C 0x3F (`<?`)

      Advance the position pointer so that it points at the first 0x3E byte (>) that comes after the 0x3C byte that was found.

   Any other byte

      Do nothing with that byte.

3. Next byte: Move position so it points at the next byte in the input byte stream, and return to the step above labeled loop.

When the prescan a byte stream to determine its encoding algorithm says to get an attribute , it means doing this:

1. If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x2F (/), then advance position to the next byte and redo this step.

2. If the byte at position is 0x3E (>), then abort the get an attribute algorithm. There isn't one.

3. Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.

4. Process the byte at position as follows:

   If it is 0x3D (=), and the attribute name is longer than the empty string
      Advance position to the next byte and jump to the step below labeled value.
   If it is 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP)
      Jump to the step below labeled spaces.
   If it is 0x2F (/) or 0x3E (>)
      Abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
   If it is in the range 0x41 (A) to 0x5A (Z)
      Append the code point b+0x20 to attribute name (where b is the value of the byte at position). (This converts the input to lowercase.)
   Anything else
      Append the code point with the same value as the byte at position to attribute name. (It doesn't actually matter how bytes outside the ASCII range are handled here, since only ASCII bytes can contribute to the detection of a character encoding.)

5. Advance position to the next byte and return to the previous step.

6. Spaces: If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP), then advance position to the next byte, then repeat this step.

7. If the byte at position is not 0x3D (=), abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.

8. Advance position past the 0x3D (=) byte.

9. Value: If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP), then advance position to the next byte, then repeat this step.

10. Process the byte at position as follows:

    If it is 0x22 (") or 0x27 (')
       1. Let b be the value of the byte at position.
       2. Quote loop: Advance position to the next byte.
       3. If the value of the byte at position is the value of b, then advance position to the next byte and abort the get an attribute algorithm. The attribute's name is the value of attribute name, and its value is the value of attribute value.
       4. Otherwise, if the value of the byte at position is in the range 0x41 (A) to 0x5A (Z), then append a code point to attribute value whose value is 0x20 more than the value of the byte at position.
       5. Otherwise, append a code point to attribute value whose value is the same as the value of the byte at position.
       6. Return to the step above labeled quote loop.
    If it is 0x3E (>)
       Abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
    If it is in the range 0x41 (A) to 0x5A (Z)
       Append a code point b+0x20 to attribute value (where b is the value of the byte at position). Advance position to the next byte.
    Anything else
       Append a code point with the same value as the byte at position to attribute value. Advance position to the next byte.

11. Process the byte at position as follows:

    If it is 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>)
       Abort the get an attribute algorithm. The attribute's name is the value of attribute name and its value is the value of attribute value.
    If it is in the range 0x41 (A) to 0x5A (Z)
       Append a code point b+0x20 to attribute value (where b is the value of the byte at position).
    Anything else
       Append a code point with the same value as the byte at position to attribute value.

12. Advance position to the next byte and return to the previous step.
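In code, the attribute sniff might be sketched like this (a simplified Python rendering that returns the attribute as a (name, value, new position) tuple, with None for the name when there is no attribute; unlike the spec, end-of-stream is folded into bounds checks rather than aborting the whole prescan):

```python
WS = frozenset(b"\t\n\x0c\r ")   # 0x09, 0x0A, 0x0C, 0x0D, 0x20

def get_an_attribute(data: bytes, pos: int):
    """Sniff one name/value pair starting at pos; ASCII A-Z is lowercased."""
    n = len(data)
    while pos < n and (data[pos] in WS or data[pos] == 0x2F):   # skip space, '/'
        pos += 1
    if pos >= n or data[pos] == 0x3E:                           # '>': no attribute
        return None, None, pos
    name, value = bytearray(), bytearray()
    while pos < n:                                              # attribute name
        b = data[pos]
        if b == 0x3D and name:                                  # '=' ends the name
            pos += 1
            break
        if b in WS:                                             # spaces, then require '='
            while pos < n and data[pos] in WS:
                pos += 1
            if pos >= n or data[pos] != 0x3D:
                return name.decode("ascii", "replace"), "", pos
            pos += 1
            break
        if b in (0x2F, 0x3E):                                   # '/' or '>': empty value
            return name.decode("ascii", "replace"), "", pos
        name.append(b + 0x20 if 0x41 <= b <= 0x5A else b)       # lowercase A-Z
        pos += 1
    while pos < n and data[pos] in WS:                          # skip space before value
        pos += 1
    if pos < n and data[pos] in (0x22, 0x27):                   # quoted value
        quote = data[pos]
        pos += 1
        while pos < n and data[pos] != quote:
            b = data[pos]
            value.append(b + 0x20 if 0x41 <= b <= 0x5A else b)
            pos += 1
        return name.decode("ascii", "replace"), value.decode("ascii", "replace"), pos + 1
    while pos < n and data[pos] not in WS and data[pos] != 0x3E:  # unquoted value
        b = data[pos]
        value.append(b + 0x20 if 0x41 <= b <= 0x5A else b)
        pos += 1
    return name.decode("ascii", "replace"), value.decode("ascii", "replace"), pos
```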

For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)

12.2.3.3 Character encodings

User agents must support the encodings defined in Encoding , including, but not limited to, UTF-8 , ISO-8859-2 , ISO-8859-7 , ISO-8859-8 , windows-874 , windows-1250 , windows-1251 , windows-1252 , windows-1254 , windows-1255 , windows-1256 , windows-1257 , windows-1258 , gb18030 , Big5 , ISO-2022-JP , Shift_JIS , EUC-KR , UTF-16BE/LE , and x-user-defined . User agents must not support other encodings.

The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32. This specification does not make any attempt to support prohibited encodings in its algorithms; support and use of prohibited encodings would thus lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]

12.2.3.4 Changing the encoding while parsing

When the parser requires the user agent to change the encoding , it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find a character encoding, or if it found a character encoding that was not the actual encoding of the file.

1. If the encoding that is already being used to interpret the input stream is UTF-16BE/LE, then set the confidence to certain and return. The new encoding is ignored; if it was anything but the same encoding, then it would be clearly incorrect.

2. If the new encoding is UTF-16BE/LE, then change it to UTF-8.

3. If the new encoding is x-user-defined, then change it to windows-1252.

4. If the new encoding is identical or equivalent to the encoding that is already being used to interpret the input stream, then set the confidence to certain and return. This happens when the encoding information found in the file matches what the encoding sniffing algorithm determined to be the encoding, and in the second pass through the parser if the first pass found that the encoding sniffing algorithm described in the earlier section failed to find the right encoding.

5. If all the bytes up to the last byte converted by the current decoder have the same Unicode interpretations in both the current encoding and the new encoding, and if the user agent supports changing the converter on the fly, then the user agent may change to the new converter for the encoding on the fly. Set the document's character encoding and the encoding used to convert the input stream to the new encoding, set the confidence to certain, and return.

6. Otherwise, navigate to the document again, with historyHandling set to "replace", and using the same source browsing context, but this time skip the encoding sniffing algorithm and instead just set the encoding to the new encoding and the confidence to certain. Whenever possible, this should be done without actually contacting the network layer (the bytes should be re-parsed from memory), even if, e.g., the document is marked as not being cacheable. If this is not possible and contacting the network layer would involve repeating a request that uses a method other than `GET`, then instead set the confidence to certain and ignore the new encoding. The resource will be misinterpreted. User agents may notify the user of the situation, to aid in application development.

This algorithm is only invoked when a new encoding is found declared on a meta element.
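The non-navigating portion of these steps can be sketched as a small decision function (Python, illustrative only: it reports whether the parser can keep going or must reparse, and elides both the on-the-fly converter switch and the navigate-again machinery):

```python
def change_the_encoding(current: str, new: str) -> tuple[str, str]:
    """Return ('keep', encoding) or ('reparse', encoding)."""
    if current in ("utf-16be", "utf-16le"):
        return ("keep", current)            # new encoding is ignored
    if new in ("utf-16be", "utf-16le"):
        new = "utf-8"                       # UTF-16BE/LE maps to UTF-8
    if new == "x-user-defined":
        new = "windows-1252"                # x-user-defined maps to windows-1252
    if new == current:
        return ("keep", current)            # confidence becomes certain
    return ("reparse", new)                 # re-parse (from memory if possible)
```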

12.2.3.5 Preprocessing the input stream

The input stream consists of the characters pushed into it as the input byte stream is decoded or from the various APIs that directly manipulate the input stream.

Any occurrences of surrogates are surrogate-in-input-stream parse errors. Any occurrences of noncharacters are noncharacter-in-input-stream parse errors and any occurrences of controls other than ASCII whitespace and U+0000 NULL characters are control-character-in-input-stream parse errors.

The handling of U+0000 NULL characters varies based on where the characters are found and happens at the later stages of the parsing. They are either ignored or, for security reasons, replaced with a U+FFFD REPLACEMENT CHARACTER. This handling is, by necessity, spread across both the tokenization stage and the tree construction stage.

Before the tokenization stage, the input stream must be preprocessed by normalizing newlines. Thus, newlines in HTML DOMs are represented by U+000A LF characters, and there are never any U+000D CR characters in the input to the tokenization stage.
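Newline normalization is the usual two-step replacement, sketched here in Python: CRLF pairs collapse to a single LF first, then any remaining lone CR becomes LF.

```python
def normalize_newlines(stream: str) -> str:
    # U+000D U+000A pairs become a single U+000A, then lone U+000D
    # characters become U+000A, so no CR ever reaches the tokenizer.
    return stream.replace("\r\n", "\n").replace("\r", "\n")
```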

The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.

The insertion point is the position (just before a character or just before the end of the input stream) where content inserted using document.write() is actually inserted. The insertion point is relative to the position of the character immediately after it; it is not an absolute offset into the input stream. Initially, the insertion point is undefined.

The "EOF" character in the tables below is a conceptual character representing the end of the input stream. If the parser is a script-created parser, then the end of the input stream is reached when an explicit "EOF" character (inserted by the document.close() method) is consumed. Otherwise, the "EOF" character is not a real character in the stream, but rather the lack of any further characters.

12.2.4 Parse state

12.2.4.1 The insertion mode

The insertion mode is a state variable that controls the primary operation of the tree construction stage.

Initially, the insertion mode is "initial". It can change to "before html", "before head", "in head", "in head noscript", "after head", "in body", "text", "in table", "in table text", "in caption", "in column group", "in table body", "in row", "in cell", "in select", "in select in table", "in template", "after body", "in frameset", "after frameset", "after after body", and "after after frameset" during the course of the parsing, as described in the tree construction stage. The insertion mode affects how tokens are processed and whether CDATA sections are supported.

Several of these modes, namely "in head", "in body", "in table", and "in select", are special, in that the other modes defer to them at various times. When the algorithm below says that the user agent is to do something " using the rules for the m insertion mode", where m is one of these modes, the user agent must use the rules described under the m insertion mode's section, but must leave the insertion mode unchanged unless the rules in m themselves switch the insertion mode to a new value.

When the insertion mode is switched to "text" or "in table text", the original insertion mode is also set. This is the insertion mode to which the tree construction stage will return.

Similarly, to parse nested template elements, a stack of template insertion modes is used. It is initially empty. The current template insertion mode is the insertion mode that was most recently added to the stack of template insertion modes. The algorithms in the sections below will push insertion modes onto this stack, meaning that the specified insertion mode is to be added to the stack, and pop insertion modes from the stack, which means that the most recently added insertion mode must be removed from the stack.
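A minimal model of this state bookkeeping might look as follows (illustrative Python; the mode names are the ones listed above, and push/pop act on the template stack):

```python
class TreeBuilderState:
    """Sketch of the insertion-mode bookkeeping used by tree construction."""

    def __init__(self) -> None:
        self.insertion_mode = "initial"
        self.original_insertion_mode = None      # set on entering "text"/"in table text"
        self.template_insertion_modes = []       # stack; most recently added last

    def switch_to_text(self) -> None:
        # Remember where to return to, then switch.
        self.original_insertion_mode = self.insertion_mode
        self.insertion_mode = "text"

    def push_template_mode(self, mode: str) -> None:
        self.template_insertion_modes.append(mode)

    def pop_template_mode(self) -> str:
        # Remove and return the most recently added insertion mode.
        return self.template_insertion_modes.pop()

    @property
    def current_template_insertion_mode(self) -> str:
        return self.template_insertion_modes[-1]
```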

When the steps below require the UA to reset the insertion mode appropriately , it means the UA must follow these steps:

12.2.4.2 The stack of open elements

Initially, the stack of open elements is empty. The stack grows downwards; the topmost node on the stack is the first one added to the stack, and the bottommost node of the stack is the most recently added node in the stack (notwithstanding when the stack is manipulated in a random access fashion as part of the handling for misnested tags).

The "before html" insertion mode creates the html document element, which is then added to the stack.

In the fragment case, the stack of open elements is initialized to contain an html element that is created as part of that algorithm. (The fragment case skips the "before html" insertion mode.)

The html node, however it is created, is the topmost node of the stack. It only gets popped off the stack when the parser finishes.

The current node is the bottommost node in this stack of open elements.

The adjusted current node is the context element if the parser was created as part of the HTML fragment parsing algorithm and the stack of open elements has only one element in it (fragment case); otherwise, the adjusted current node is the current node.

Elements in the stack of open elements fall into the following categories:

Typically, the special elements have the start and end tag tokens handled specifically, while ordinary elements' tokens fall into "any other start tag" and "any other end tag" clauses, and some parts of the tree builder check if a particular element in the stack of open elements is in the special category. However, some elements (e.g., the option element) have their start or end tag tokens handled specifically, but are still not in the special category, so that they get the ordinary handling elsewhere.

The stack of open elements is said to have an element target node in a specific scope consisting of a list of element types list when the following algorithm terminates in a match state:

1. Initialize node to be the current node (the bottommost node of the stack).

2. If node is the target node, terminate in a match state.

3. Otherwise, if node is one of the element types in list, terminate in a failure state.

4. Otherwise, set node to the previous entry in the stack of open elements and return to step 2. (This will never fail, since the loop will always terminate in the previous step if the top of the stack — an html element — is reached.)
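The loop amounts to a bottom-up scan of the stack, which can be sketched like so (Python; elements are represented here by bare tag-name strings rather than nodes, which glosses over namespaces):

```python
def has_in_specific_scope(open_elements, target, scope_list):
    """open_elements[0] is the topmost (html) node; scan from the bottom."""
    for node in reversed(open_elements):
        if node == target:
            return True          # match state
        if node in scope_list:
            return False         # failure state
    raise AssertionError("unreachable: html is in every scope list")
```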

The stack of open elements is said to have a particular element in scope when it has that element in the specific scope consisting of the following element types:

The stack of open elements is said to have a particular element in list item scope when it has that element in the specific scope consisting of the following element types:

The stack of open elements is said to have a particular element in button scope when it has that element in the specific scope consisting of the following element types:

The stack of open elements is said to have a particular element in table scope when it has that element in the specific scope consisting of the following element types:

The stack of open elements is said to have a particular element in select scope when it has that element in the specific scope consisting of all element types except the following:

Nothing happens if at any time any of the elements in the stack of open elements are moved to a new location in, or removed from, the Document tree. In particular, the stack is not changed in this situation. This can cause, amongst other strange effects, content to be appended to nodes that are no longer in the DOM.

In some cases (namely, when closing misnested formatting elements), the stack is manipulated in a random-access fashion.

12.2.4.3 The list of active formatting elements

Initially, the list of active formatting elements is empty. It is used to handle mis-nested formatting element tags.

The list contains elements in the formatting category, and markers. The markers are inserted when entering applet , object , marquee , template , td , th , and caption elements, and are used to prevent formatting from "leaking" into applet , object , marquee , template , td , th , and caption elements.

In addition, each element in the list of active formatting elements is associated with the token for which it was created, so that further elements can be created for that token if necessary.

When the steps below require the UA to push onto the list of active formatting elements an element element , the UA must perform the following steps:

1. If there are already three elements in the list of active formatting elements after the last marker, if any, or anywhere in the list if there are no markers, that have the same tag name, namespace, and attributes as element, then remove the earliest such element from the list of active formatting elements. For these purposes, the attributes must be compared as they were when the elements were created by the parser; two elements have the same attributes if all their parsed attributes can be paired such that the two attributes in each pair have identical names, namespaces, and values (the order of the attributes does not matter). This is the Noah's Ark clause. But with three per family instead of two.

2. Add element to the list of active formatting elements.
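The Noah's Ark clause can be sketched as follows (illustrative Python; elements are modeled as dicts with name/ns/attrs keys and markers as a sentinel string, neither of which is prescribed by the text):

```python
MARKER = "marker"   # illustrative sentinel for the markers described above

def same_family(a, b):
    """Same tag name, namespace, and (unordered) attribute set."""
    return (a["name"] == b["name"] and a["ns"] == b["ns"]
            and set(a["attrs"].items()) == set(b["attrs"].items()))

def push_onto_active_formatting(afe, element):
    start = 0
    for i, entry in enumerate(afe):          # find the last marker, if any
        if entry == MARKER:
            start = i + 1
    matches = [i for i in range(start, len(afe)) if same_family(afe[i], element)]
    if len(matches) >= 3:                    # Noah's Ark: three per family
        del afe[matches[0]]                  # drop the earliest duplicate
    afe.append(element)
```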

When the steps below require the UA to reconstruct the active formatting elements , the UA must perform the following steps:

This has the effect of reopening all the formatting elements that were opened in the current body, cell, or caption (whichever is youngest) that haven't been explicitly closed.

The way this specification is written, the list of active formatting elements always consists of elements in chronological order with the least recently added element first and the most recently added element last (except for while steps 7 to 10 of the above algorithm are being executed, of course).

When the steps below require the UA to clear the list of active formatting elements up to the last marker , the UA must perform the following steps:

1. Let entry be the last (most recently added) entry in the list of active formatting elements.

2. Remove entry from the list of active formatting elements.

3. If entry was a marker, then stop the algorithm at this point. The list has been cleared up to the last marker.

4. Go to step 1.
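Sketched over the same kind of illustrative list representation (markers as a sentinel value, which is an assumption of the sketch rather than something the text prescribes):

```python
MARKER = "marker"   # illustrative sentinel, as elsewhere in these sketches

def clear_up_to_last_marker(afe):
    """Pop entries until a marker has been removed (or the list is empty)."""
    while afe:
        if afe.pop() == MARKER:
            break
```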

12.2.4.4 The element pointers

Initially, the head element pointer and the form element pointer are both null.

Once a head element has been parsed (whether implicitly or explicitly) the head element pointer gets set to point to this node.

The form element pointer points to the last form element that was opened and whose end tag has not yet been seen. It is used to make form controls associate with forms in the face of dramatically bad markup, for historical reasons. It is ignored inside template elements.

12.2.4.5 Other parsing state flags

The scripting flag is set to "enabled" if scripting was enabled for the Document with which the parser is associated when the parser was created, and "disabled" otherwise.

The scripting flag can be enabled even when the parser was created as part of the HTML fragment parsing algorithm, even though script elements don't execute in that case.

The frameset-ok flag is set to "ok" when the parser is created. It is set to "not ok" after certain tokens are seen.

12.2.5 Tokenization

Implementations must act as if they used the following state machine to tokenize HTML. The state machine must start in the data state. Most states consume a single character, which may have various side-effects, and either switches the state machine to a new state to reconsume the current input character, or switches it to a new state to consume the next character, or stays in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.

When a state says to reconsume a matched character in a specified state, that means to switch to that state, but when it attempts to consume the next input character, provide it with the current input character instead.
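Reconsumption is typically implemented by stepping a cursor back so the next consume returns the current input character again (an illustrative Python sketch, not a prescribed implementation):

```python
class InputStream:
    """Minimal character stream with consume/reconsume semantics."""

    def __init__(self, chars: str) -> None:
        self.chars = chars
        self.pos = -1                  # index of the current input character

    def consume(self):
        """Return the next input character, or None at end of stream."""
        self.pos += 1
        return self.chars[self.pos] if self.pos < len(self.chars) else None

    def reconsume(self) -> None:
        # The next consume() will yield the current input character again.
        self.pos -= 1
```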

The exact behavior of certain states depends on the insertion mode and the stack of open elements. Certain states also use a temporary buffer to track progress, and the character reference state uses a return state to return to the state it was invoked from.

The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag . When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the force-quirks flag must be set to off (its other state is on). Start and end tag tokens have a tag name, a self-closing flag , and a list of attributes, each of which has a name and a value. When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes list must be empty. Comment and character tokens have data.
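The token vocabulary above can be sketched with dataclasses. The field names here are illustrative, not mandated by the spec; the point is that "missing" is a state distinct from the empty string (modeled with None), and that the flags have the default states the spec requires:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DoctypeToken:
    name: Optional[str] = None          # None models "missing", not ""
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False          # "off" by default

@dataclass
class TagToken:
    name: str = ""
    self_closing: bool = False          # unset by default
    attributes: List[Tuple[str, str]] = field(default_factory=list)
    is_end_tag: bool = False

@dataclass
class CommentToken:
    data: str = ""

@dataclass
class CharacterToken:
    data: str = ""

class EOFToken:
    pass

d = DoctypeToken()
assert d.name is None and d.force_quirks is False   # missing, quirks off
assert TagToken(name="br").attributes == []         # empty attribute list
```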

When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the tokenization stage, and can insert additional characters into the stream. (For example, the script element can result in scripts executing and using the dynamic markup insertion APIs to insert characters into the stream being tokenized.)

Creating a token and emitting it are distinct actions. It is possible for a token to be created but implicitly abandoned (never emitted), e.g. if the file ends unexpectedly while processing the characters that are being parsed into a start tag token.

When a start tag token is emitted with its self-closing flag set, if the flag is not acknowledged when it is processed by the tree construction stage, that is a non-void-html-element-start-tag-with-trailing-solidus parse error.

When an end tag token is emitted with attributes, that is an end-tag-with-attributes parse error.

When an end tag token is emitted with its self-closing flag set, that is an end-tag-with-trailing-solidus parse error.

An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been emitted from this tokenizer, then no end tag token is appropriate.

A character reference is said to be consumed as part of an attribute if the return state is either attribute value (double-quoted) state, attribute value (single-quoted) state or attribute value (unquoted) state.

When a state says to flush code points consumed as a character reference, it means that for each code point in the temporary buffer (in the order they were added to the buffer) the user agent must append the code point from the buffer to the current attribute's value if the character reference was consumed as part of an attribute, or emit the code point as a character token otherwise.
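The flush operation can be sketched as follows; the state names are the ones listed two paragraphs above, and the function signature is invented for the example:

```python
ATTRIBUTE_RETURN_STATES = {
    "attribute value (double-quoted)",
    "attribute value (single-quoted)",
    "attribute value (unquoted)",
}

def flush_code_points(temp_buffer, return_state, current_attr_value, emit):
    """Drain the temporary buffer in order, into the current attribute's
    value if the reference was consumed as part of an attribute, otherwise
    emitting each code point as a character token."""
    consumed_in_attribute = return_state in ATTRIBUTE_RETURN_STATES
    for cp in temp_buffer:
        if consumed_in_attribute:
            current_attr_value.append(cp)
        else:
            emit(cp)                     # emitted as a character token
    temp_buffer.clear()

tokens, attr = [], []
flush_code_points(list("&am"), "data", attr, tokens.append)
assert tokens == ["&", "a", "m"] and attr == []
```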

Before each step of the tokenizer, the user agent must first check the parser pause flag. If it is true, then the tokenizer must abort the processing of any nested invocations of the tokenizer, yielding control back to the caller.

The tokenizer state machine consists of the states defined in the following subsections.

12.2.5.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&) Set the return state to the data state. Switch to the character reference state. U+003C LESS-THAN SIGN (<) Switch to the tag open state. U+0000 NULL This is an unexpected-null-character parse error. Emit the current input character as a character token. EOF Emit an end-of-file token. Anything else Emit the current input character as a character token.

12.2.5.2 RCDATA state

Consume the next input character:

U+0026 AMPERSAND (&) Set the return state to the RCDATA state. Switch to the character reference state. U+003C LESS-THAN SIGN (<) Switch to the RCDATA less-than sign state. U+0000 NULL This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token. EOF Emit an end-of-file token. Anything else Emit the current input character as a character token.

12.2.5.3 RAWTEXT state

Consume the next input character:

U+003C LESS-THAN SIGN (<) Switch to the RAWTEXT less-than sign state. U+0000 NULL This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token. EOF Emit an end-of-file token. Anything else Emit the current input character as a character token.

12.2.5.4 Script data state

Consume the next input character:

U+003C LESS-THAN SIGN (<) Switch to the script data less-than sign state. U+0000 NULL This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token. EOF Emit an end-of-file token. Anything else Emit the current input character as a character token.

12.2.5.5 PLAINTEXT state

Consume the next input character:

U+0000 NULL This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token. EOF Emit an end-of-file token. Anything else Emit the current input character as a character token.

12.2.5.6 Tag open state

Consume the next input character:

12.2.5.7 End tag open state

Consume the next input character:

12.2.5.8 Tag name state

Consume the next input character:

12.2.5.9 RCDATA less-than sign state

Consume the next input character:

U+002F SOLIDUS (/) Set the temporary buffer to the empty string. Switch to the RCDATA end tag open state. Anything else Emit a U+003C LESS-THAN SIGN character token. Reconsume in the RCDATA state.

12.2.5.10 RCDATA end tag open state

Consume the next input character:

ASCII alpha Create a new end tag token, set its tag name to the empty string. Reconsume in the RCDATA end tag name state. Anything else Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the RCDATA state.

12.2.5.11 RCDATA end tag name state

Consume the next input character:

12.2.5.12 RAWTEXT less-than sign state

Consume the next input character:

U+002F SOLIDUS (/) Set the temporary buffer to the empty string. Switch to the RAWTEXT end tag open state. Anything else Emit a U+003C LESS-THAN SIGN character token. Reconsume in the RAWTEXT state.

12.2.5.13 RAWTEXT end tag open state

Consume the next input character:

ASCII alpha Create a new end tag token, set its tag name to the empty string. Reconsume in the RAWTEXT end tag name state. Anything else Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the RAWTEXT state.

12.2.5.14 RAWTEXT end tag name state

Consume the next input character:

12.2.5.15 Script data less-than sign state

Consume the next input character:

U+002F SOLIDUS (/) Set the temporary buffer to the empty string. Switch to the script data end tag open state. U+0021 EXCLAMATION MARK (!) Switch to the script data escape start state. Emit a U+003C LESS-THAN SIGN character token and a U+0021 EXCLAMATION MARK character token. Anything else Emit a U+003C LESS-THAN SIGN character token. Reconsume in the script data state.

12.2.5.16 Script data end tag open state

Consume the next input character:

ASCII alpha Create a new end tag token, set its tag name to the empty string. Reconsume in the script data end tag name state. Anything else Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the script data state.

12.2.5.17 Script data end tag name state

Consume the next input character:

12.2.5.18 Script data escape start state

Consume the next input character:

12.2.5.19 Script data escape start dash state

Consume the next input character:

12.2.5.20 Script data escaped state

Consume the next input character:

12.2.5.21 Script data escaped dash state

Consume the next input character:

12.2.5.22 Script data escaped dash dash state

Consume the next input character:

12.2.5.23 Script data escaped less-than sign state

Consume the next input character:

12.2.5.24 Script data escaped end tag open state

Consume the next input character:

12.2.5.25 Script data escaped end tag name state

Consume the next input character:

12.2.5.26 Script data double escape start state

Consume the next input character:

12.2.5.27 Script data double escaped state

Consume the next input character:

12.2.5.28 Script data double escaped dash state

Consume the next input character:

12.2.5.29 Script data double escaped dash dash state

Consume the next input character:

12.2.5.30 Script data double escaped less-than sign state

Consume the next input character:

12.2.5.31 Script data double escape end state

Consume the next input character:

12.2.5.32 Before attribute name state

Consume the next input character:

12.2.5.33 Attribute name state

Consume the next input character:

When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a duplicate-attribute parse error and the new attribute must be removed from the token.

If an attribute is so removed from a token, it, and the value that gets associated with it, if any, are never subsequently used by the parser, and are therefore effectively discarded. Removing the attribute in this way does not change its status as the "current attribute" for the purposes of the tokenizer, however.
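The duplicate-attribute rule can be sketched as follows (function name invented for the example; only the name comparison matters, the value is filled in by later states):

```python
def finish_attribute_name(tag_attributes, new_name, errors):
    """Called when the complete attribute name is known.
    Returns True if the attribute is kept, False if discarded."""
    if any(name == new_name for name, _value in tag_attributes):
        # Exact same name already on the token: parse error, drop it.
        errors.append("duplicate-attribute")
        return False
    tag_attributes.append((new_name, ""))   # value filled in later
    return True

attrs, errors = [], []
assert finish_attribute_name(attrs, "href", errors)
assert not finish_attribute_name(attrs, "href", errors)   # duplicate dropped
assert attrs == [("href", "")]
assert errors == ["duplicate-attribute"]
```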

12.2.5.34 After attribute name state

Consume the next input character:

12.2.5.35 Before attribute value state

Consume the next input character:

12.2.5.36 Attribute value (double-quoted) state

Consume the next input character:

12.2.5.37 Attribute value (single-quoted) state

Consume the next input character:

12.2.5.38 Attribute value (unquoted) state

Consume the next input character:

12.2.5.39 After attribute value (quoted) state

Consume the next input character:

12.2.5.40 Self-closing start tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>) Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token. EOF This is an eof-in-tag parse error. Emit an end-of-file token. Anything else This is an unexpected-solidus-in-tag parse error. Reconsume in the before attribute name state.

12.2.5.41 Bogus comment state

Consume the next input character:

U+003E GREATER-THAN SIGN (>) Switch to the data state. Emit the comment token. EOF Emit the comment. Emit an end-of-file token. U+0000 NULL This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. Anything else Append the current input character to the comment token's data.

12.2.5.42 Markup declaration open state

If the next few characters are:

Two U+002D HYPHEN-MINUS characters (-) Consume those two characters, create a comment token whose data is the empty string, and switch to the comment start state. ASCII case-insensitive match for the word "DOCTYPE" Consume those characters and switch to the DOCTYPE state. The string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after) Consume those characters. If there is an adjusted current node and it is not an element in the HTML namespace, then switch to the CDATA section state. Otherwise, this is a cdata-in-html-content parse error. Create a comment token whose data is the "[CDATA[" string. Switch to the bogus comment state. Anything else This is an incorrectly-opened-comment parse error. Create a comment token whose data is the empty string. Switch to the bogus comment state (don't consume anything in the current state).

12.2.5.43 Comment start state

Consume the next input character:

U+002D HYPHEN-MINUS (-) Switch to the comment start dash state. U+003E GREATER-THAN SIGN (>) This is an abrupt-closing-of-empty-comment parse error. Switch to the data state. Emit the comment token. Anything else Reconsume in the comment state.

12.2.5.44 Comment start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-) Switch to the comment end state. U+003E GREATER-THAN SIGN (>) This is an abrupt-closing-of-empty-comment parse error. Switch to the data state. Emit the comment token. EOF This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token. Anything else Append a U+002D HYPHEN-MINUS character (-) to the comment token's data. Reconsume in the comment state.

12.2.5.45 Comment state

Consume the next input character:

U+003C LESS-THAN SIGN (<) Append the current input character to the comment token's data. Switch to the comment less-than sign state. U+002D HYPHEN-MINUS (-) Switch to the comment end dash state. U+0000 NULL This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. EOF This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token. Anything else Append the current input character to the comment token's data.

12.2.5.46 Comment less-than sign state

Consume the next input character:

U+0021 EXCLAMATION MARK (!) Append the current input character to the comment token's data. Switch to the comment less-than sign bang state. U+003C LESS-THAN SIGN (<) Append the current input character to the comment token's data. Anything else Reconsume in the comment state.

12.2.5.47 Comment less-than sign bang state

Consume the next input character:

U+002D HYPHEN-MINUS (-) Switch to the comment less-than sign bang dash state. Anything else Reconsume in the comment state.

12.2.5.48 Comment less-than sign bang dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-) Switch to the comment less-than sign bang dash dash state. Anything else Reconsume in the comment end dash state.

12.2.5.49 Comment less-than sign bang dash dash state

Consume the next input character:

U+003E GREATER-THAN SIGN (>) EOF Reconsume in the comment end state. Anything else This is a nested-comment parse error. Reconsume in the comment end state.

12.2.5.50 Comment end dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-) Switch to the comment end state. EOF This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token. Anything else Append a U+002D HYPHEN-MINUS character (-) to the comment token's data. Reconsume in the comment state.

12.2.5.51 Comment end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>) Switch to the data state. Emit the comment token. U+0021 EXCLAMATION MARK (!) Switch to the comment end bang state. U+002D HYPHEN-MINUS (-) Append a U+002D HYPHEN-MINUS character (-) to the comment token's data. EOF This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token. Anything else Append two U+002D HYPHEN-MINUS characters (--) to the comment token's data. Reconsume in the comment state.

12.2.5.52 Comment end bang state

Consume the next input character:

U+002D HYPHEN-MINUS (-) Append two U+002D HYPHEN-MINUS characters (--) and a U+0021 EXCLAMATION MARK character (!) to the comment token's data. Switch to the comment end dash state. U+003E GREATER-THAN SIGN (>) This is an incorrectly-closed-comment parse error. Switch to the data state. Emit the comment token. EOF This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token. Anything else Append two U+002D HYPHEN-MINUS characters (--) and a U+0021 EXCLAMATION MARK character (!) to the comment token's data. Reconsume in the comment state.

12.2.5.53 DOCTYPE state

Consume the next input character:

12.2.5.54 Before DOCTYPE name state

Consume the next input character:

12.2.5.55 DOCTYPE name state

Consume the next input character:

12.2.5.56 After DOCTYPE name state

Consume the next input character:

12.2.5.57 After DOCTYPE public keyword state

Consume the next input character:

12.2.5.58 Before DOCTYPE public identifier state

Consume the next input character:

12.2.5.59 DOCTYPE public identifier (double-quoted) state

Consume the next input character:

12.2.5.60 DOCTYPE public identifier (single-quoted) state

Consume the next input character:

12.2.5.61 After DOCTYPE public identifier state

Consume the next input character:

12.2.5.62 Between DOCTYPE public and system identifiers state

Consume the next input character:

12.2.5.63 After DOCTYPE system keyword state

Consume the next input character:

12.2.5.64 Before DOCTYPE system identifier state

Consume the next input character:

12.2.5.65 DOCTYPE system identifier (double-quoted) state

Consume the next input character:

12.2.5.66 DOCTYPE system identifier (single-quoted) state

Consume the next input character:

12.2.5.67 After DOCTYPE system identifier state

Consume the next input character:

12.2.5.68 Bogus DOCTYPE state

Consume the next input character:

U+003E GREATER-THAN SIGN (>) Switch to the data state. Emit the DOCTYPE token. U+0000 NULL This is an unexpected-null-character parse error. Ignore the character. EOF Emit the DOCTYPE token. Emit an end-of-file token. Anything else Ignore the character.

12.2.5.69 CDATA section state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (]) Switch to the CDATA section bracket state. EOF This is an eof-in-cdata parse error. Emit an end-of-file token. Anything else Emit the current input character as a character token.

U+0000 NULL characters are handled in the tree construction stage, as part of the "in foreign content" insertion mode, which is the only place where CDATA sections can appear.

12.2.5.70 CDATA section bracket state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (]) Switch to the CDATA section end state. Anything else Emit a U+005D RIGHT SQUARE BRACKET character token. Reconsume in the CDATA section state.

12.2.5.71 CDATA section end state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (]) Emit a U+005D RIGHT SQUARE BRACKET character token. U+003E GREATER-THAN SIGN character Switch to the data state. Anything else Emit two U+005D RIGHT SQUARE BRACKET character tokens. Reconsume in the CDATA section state.

12.2.5.72 Character reference state

Set the temporary buffer to the empty string. Append a U+0026 AMPERSAND (&) character to the temporary buffer. Consume the next input character:

ASCII alphanumeric Reconsume in the named character reference state. U+0023 NUMBER SIGN (#) Append the current input character to the temporary buffer. Switch to the numeric character reference state. Anything else Flush code points consumed as a character reference. Reconsume in the return state.

12.2.5.73 Named character reference state

Consume the maximum number of characters possible, where the consumed characters are identical to one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed.

If the markup contains (not in an attribute) the string "I'm &notit; I tell you", the character reference is parsed as "not", as in, "I'm ¬it; I tell you" (and this is a parse error). But if the markup was "I'm &notin; I tell you", the character reference would be parsed as "notin;", resulting in "I'm ∉ I tell you" (and no parse error). However, if the markup contains the string "I'm &notit; I tell you" in an attribute, no character reference is parsed and the string remains intact (and there is no parse error).
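The longest-prefix-match behavior for references lacking a semicolon can be observed with Python's standard html module, whose unescape() function implements the HTML5 named character reference rules:

```python
import html

# "&notit;" is not a named reference, but its prefix "&not" (a legacy
# reference valid without a semicolon) is, so "not" is consumed and
# "it;" is left as literal text.
assert html.unescape("I'm &notit; I tell you") == "I'm \u00acit; I tell you"

# "&notin;" is itself a named reference, so the longer match wins.
assert html.unescape("I'm &notin; I tell you") == "I'm \u2209 I tell you"
```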

12.2.5.74 Ambiguous ampersand state

Consume the next input character:

12.2.5.75 Numeric character reference state

Set the character reference code to zero (0).

Consume the next input character:

U+0078 LATIN SMALL LETTER X U+0058 LATIN CAPITAL LETTER X Append the current input character to the temporary buffer. Switch to the hexadecimal character reference start state. Anything else Reconsume in the decimal character reference start state.

12.2.5.76 Hexadecimal character reference start state

Consume the next input character:

ASCII hex digit Reconsume in the hexadecimal character reference state. Anything else This is an absence-of-digits-in-numeric-character-reference parse error. Flush code points consumed as a character reference. Reconsume in the return state.

12.2.5.77 Decimal character reference start state

Consume the next input character:

ASCII digit Reconsume in the decimal character reference state. Anything else This is an absence-of-digits-in-numeric-character-reference parse error. Flush code points consumed as a character reference. Reconsume in the return state.

12.2.5.78 Hexadecimal character reference state

Consume the next input character:

ASCII digit Multiply the character reference code by 16. Add a numeric version of the current input character (subtract 0x0030 from the character's code point) to the character reference code. ASCII upper hex digit Multiply the character reference code by 16. Add a numeric version of the current input character as a hexadecimal digit (subtract 0x0037 from the character's code point) to the character reference code. ASCII lower hex digit Multiply the character reference code by 16. Add a numeric version of the current input character as a hexadecimal digit (subtract 0x0057 from the character's code point) to the character reference code. U+003B SEMICOLON Switch to the numeric character reference end state. Anything else This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.

12.2.5.79 Decimal character reference state

Consume the next input character:

ASCII digit Multiply the character reference code by 10. Add a numeric version of the current input character (subtract 0x0030 from the character's code point) to the character reference code. U+003B SEMICOLON Switch to the numeric character reference end state. Anything else This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.

12.2.5.80 Numeric character reference end state

Check the character reference code :

If the number is 0x00, then this is a null-character-reference parse error. Set the character reference code to 0xFFFD.

If the number is greater than 0x10FFFF, then this is a character-reference-outside-unicode-range parse error. Set the character reference code to 0xFFFD.

If the number is a surrogate, then this is a surrogate-character-reference parse error. Set the character reference code to 0xFFFD.

If the number is a noncharacter, then this is a noncharacter-character-reference parse error.

If the number is 0x0D, or a control that's not ASCII whitespace, then this is a control-character-reference parse error. If the number is one of the numbers in the first column of the following table, then find the row with that number in the first column, and set the character reference code to the number in the second column of that row. Number Code point 0x80 0x20AC EURO SIGN (€) 0x82 0x201A SINGLE LOW-9 QUOTATION MARK (‚) 0x83 0x0192 LATIN SMALL LETTER F WITH HOOK (ƒ) 0x84 0x201E DOUBLE LOW-9 QUOTATION MARK („) 0x85 0x2026 HORIZONTAL ELLIPSIS (…) 0x86 0x2020 DAGGER (†) 0x87 0x2021 DOUBLE DAGGER (‡) 0x88 0x02C6 MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ) 0x89 0x2030 PER MILLE SIGN (‰) 0x8A 0x0160 LATIN CAPITAL LETTER S WITH CARON (Š) 0x8B 0x2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹) 0x8C 0x0152 LATIN CAPITAL LIGATURE OE (Œ) 0x8E 0x017D LATIN CAPITAL LETTER Z WITH CARON (Ž) 0x91 0x2018 LEFT SINGLE QUOTATION MARK (‘) 0x92 0x2019 RIGHT SINGLE QUOTATION MARK (’) 0x93 0x201C LEFT DOUBLE QUOTATION MARK (“) 0x94 0x201D RIGHT DOUBLE QUOTATION MARK (”) 0x95 0x2022 BULLET (•) 0x96 0x2013 EN DASH (–) 0x97 0x2014 EM DASH (—) 0x98 0x02DC SMALL TILDE (˜) 0x99 0x2122 TRADE MARK SIGN (™) 0x9A 0x0161 LATIN SMALL LETTER S WITH CARON (š) 0x9B 0x203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›) 0x9C 0x0153 LATIN SMALL LIGATURE OE (œ) 0x9E 0x017E LATIN SMALL LETTER Z WITH CARON (ž) 0x9F 0x0178 LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ)

Set the temporary buffer to the empty string. Append a code point equal to the character reference code to the temporary buffer . Flush code points consumed as a character reference. Switch to the return state .
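The main branches of the numeric character reference end steps can be sketched as follows. This is a condensed illustration (the function name is invented, the remap table is abbreviated to a few rows, and the noncharacter and general control-character checks are omitted):

```python
# A few rows of the 0x80-0x9F remap table above; the full table has 27 rows.
C1_REMAP = {
    0x80: 0x20AC,  # EURO SIGN
    0x82: 0x201A,  # SINGLE LOW-9 QUOTATION MARK
    0x91: 0x2018,  # LEFT SINGLE QUOTATION MARK
    0x92: 0x2019,  # RIGHT SINGLE QUOTATION MARK
    0x99: 0x2122,  # TRADE MARK SIGN
}

def numeric_reference_end(code, errors):
    """Check the character reference code and return the resulting character."""
    if code == 0x00:
        errors.append("null-character-reference")
        code = 0xFFFD
    elif code > 0x10FFFF:
        errors.append("character-reference-outside-unicode-range")
        code = 0xFFFD
    elif 0xD800 <= code <= 0xDFFF:              # a surrogate
        errors.append("surrogate-character-reference")
        code = 0xFFFD
    elif code in C1_REMAP:                      # control; remap per the table
        errors.append("control-character-reference")
        code = C1_REMAP[code]
    return chr(code)

errors = []
assert numeric_reference_end(0x80, errors) == "\u20ac"    # e.g. &#128;
assert numeric_reference_end(0x110000, errors) == "\ufffd"
assert errors == ["control-character-reference",
                  "character-reference-outside-unicode-range"]
```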

12.2.6 Tree construction

The input to the tree construction stage is a sequence of tokens from the tokenization stage. The tree construction stage is associated with a DOM Document object when a parser is created. The "output" of this stage consists of dynamically modifying or extending that document's DOM tree.

This specification does not define when an interactive user agent has to render the Document so that it is available to the user, or when it has to begin accepting user input.

As each token is emitted from the tokenizer, the user agent must follow the appropriate steps from the following list, known as the tree construction dispatcher :

The next token is the token that is about to be processed by the tree construction dispatcher (even if the token is subsequently just ignored).

A node is a MathML text integration point if it is one of the following elements:

A node is an HTML integration point if it is one of the following elements:

If the node in question is the context element passed to the HTML fragment parsing algorithm, then the start tag token for that element is the "fake" token created by that HTML fragment parsing algorithm.

Not all of the tag names mentioned below are conformant tag names in this specification; many are included to handle legacy content. They still form part of the algorithm that implementations are required to implement to claim conformance.

The algorithm described below places no limit on the depth of the DOM tree generated, or on the length of tag names, attribute names, attribute values, Text nodes, etc. While implementers are encouraged to avoid arbitrary limits, it is recognized that practical concerns will likely force user agents to impose nesting depth constraints.

12.2.6.1 Creating and inserting nodes

While the parser is processing a token, it can enable or disable foster parenting . This affects the following algorithm.

The appropriate place for inserting a node , optionally using a particular override target, is the position in an element returned by running the following steps:

When the steps below require the UA to create an element for a token in a particular given namespace and with a particular intended parent , the UA must run the following steps:

When the steps below require the user agent to insert a foreign element for a token in a given namespace, the user agent must run these steps:

When the steps below require the user agent to insert an HTML element for a token, the user agent must insert a foreign element for the token, in the HTML namespace.

When the steps below require the user agent to adjust MathML attributes for a token, then, if the token has an attribute named definitionurl , change its name to definitionURL (note the case difference).

When the steps below require the user agent to adjust SVG attributes for a token, then, for each attribute on the token whose attribute name is one of the ones in the first column of the following table, change the attribute's name to the name given in the corresponding cell in the second column. (This fixes the case of SVG attributes that are not all lowercase.)

Attribute name on token Attribute name on element attributename attributeName attributetype attributeType basefrequency baseFrequency baseprofile baseProfile calcmode calcMode clippathunits clipPathUnits diffuseconstant diffuseConstant edgemode edgeMode filterunits filterUnits glyphref glyphRef gradienttransform gradientTransform gradientunits gradientUnits kernelmatrix kernelMatrix kernelunitlength kernelUnitLength keypoints keyPoints keysplines keySplines keytimes keyTimes lengthadjust lengthAdjust limitingconeangle limitingConeAngle markerheight markerHeight markerunits markerUnits markerwidth markerWidth maskcontentunits maskContentUnits maskunits maskUnits numoctaves numOctaves pathlength pathLength patterncontentunits patternContentUnits patterntransform patternTransform patternunits patternUnits pointsatx pointsAtX pointsaty pointsAtY pointsatz pointsAtZ preservealpha preserveAlpha preserveaspectratio preserveAspectRatio primitiveunits primitiveUnits refx refX refy refY repeatcount repeatCount repeatdur repeatDur requiredextensions requiredExtensions requiredfeatures requiredFeatures specularconstant specularConstant specularexponent specularExponent spreadmethod spreadMethod startoffset startOffset stddeviation stdDeviation stitchtiles stitchTiles surfacescale surfaceScale systemlanguage systemLanguage tablevalues tableValues targetx targetX targety targetY textlength textLength viewbox viewBox viewtarget viewTarget xchannelselector xChannelSelector ychannelselector yChannelSelector zoomandpan zoomAndPan

When the steps below require the user agent to adjust foreign attributes for a token, then, if any of the attributes on the token match the strings given in the first column of the following table, let the attribute be a namespaced attribute, with the prefix being the string given in the corresponding cell in the second column, the local name being the string given in the corresponding cell in the third column, and the namespace being the namespace given in the corresponding cell in the fourth column. (This fixes the use of namespaced attributes, in particular lang attributes in the XML namespace.)

When the steps below require the user agent to insert a character while processing a token, the user agent must run the following steps:

Let data be the characters passed to the algorithm, or, if no characters were explicitly specified, the character of the character token being processed. Let the adjusted insertion location be the appropriate place for inserting a node. If the adjusted insertion location is in a Document node, then return. The DOM will not let Document nodes have Text node children, so they are dropped on the floor. If there is a Text node immediately before the adjusted insertion location , then append data to that Text node's data. Otherwise, create a new Text node whose data is data and whose node document is the same as that of the element in which the adjusted insertion location finds itself, and insert the newly created node at the adjusted insertion location .
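The key effect of the steps above is that consecutive characters arriving at the same insertion point accumulate in a single Text node rather than producing one node per character. A sketch (the node classes are hypothetical stand-ins for real DOM objects, and the adjusted-insertion-location machinery is reduced to appending at the end of a parent):

```python
class Text:
    def __init__(self, data=""):
        self.data = data

class Element:
    def __init__(self, name):
        self.name = name
        self.children = []

def insert_character(parent, data):
    """Append data to a preceding Text node if there is one,
    otherwise create a new Text node at the insertion point."""
    if parent.children and isinstance(parent.children[-1], Text):
        parent.children[-1].data += data
    else:
        parent.children.append(Text(data))

body = Element("body")
for ch in "AB":
    insert_character(body, ch)
assert len(body.children) == 1          # one Text node, not two
assert body.children[0].data == "AB"
```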

Here are some sample inputs to the parser and the corresponding number of Text nodes that they result in, assuming a user agent that executes scripts.

Input: A<script>var script = document.getElementsByTagName('script')[0]; document.body.removeChild(script);</script>B
Result: One Text node in the document, containing "AB".

Input: A<script>var text = document.createTextNode('B'); document.body.appendChild(text);</script>C
Result: Three Text nodes; "A" before the script, the script's contents, and "BC" after the script (the parser appends to the Text node created by the script).

Input: A<script>var text = document.getElementsByTagName('script')[0].firstChild; text.data = 'B'; document.body.appendChild(text);</script>C
Result: Two adjacent Text nodes in the document, containing "A" and "BC".

Input: A<table>B<tr>C</tr>D</table>
Result: One Text node before the table, containing "ABCD". (This is caused by foster parenting.)

Input: A<table><tr> B</tr> C</table>
Result: One Text node before the table, containing "A B C" (A-space-B-space-C). (This is caused by foster parenting.)

Input: A<table><tr> B</tr> </em>C</table>
Result: One Text node before the table, containing "A BC" (A-space-B-C), and one Text node inside the table (as a child of a tbody) with a single space character. (Space characters separated from non-space characters by non-character tokens are not affected by foster parenting, even if those other tokens then get ignored.)

When the steps below require the user agent to insert a comment while processing a comment token, optionally with an explicitly specified insertion position position, the user agent must run the following steps:

Let data be the data given in the comment token being processed. If position was specified, then let the adjusted insertion location be position. Otherwise, let the adjusted insertion location be the appropriate place for inserting a node. Create a Comment node whose data attribute is set to data and whose node document is the same as that of the node in which the adjusted insertion location finds itself. Insert the newly created node at the adjusted insertion location.

DOM mutation events must not fire for changes caused by the UA parsing the document. This includes the parsing of any content inserted using document.write() and document.writeln() calls. [UIEVENTS]

However, mutation observers do fire, as required by DOM .

12.2.6.2 Parsing elements that contain only text

The generic raw text element parsing algorithm and the generic RCDATA element parsing algorithm consist of the following steps. These algorithms are always invoked in response to a start tag token.

12.2.6.3 Closing elements that have implied end tags

When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, or an rtc element, the UA must pop the current node off the stack of open elements.

If a step requires the UA to generate implied end tags but lists an element to exclude from the process, then the UA must perform the above steps as if that element was not in the above list.

When the steps below require the UA to generate all implied end tags thoroughly, then, while the current node is a caption element, a colgroup element, a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, or a tr element, the UA must pop the current node off the stack of open elements.

12.2.6.4 The rules for parsing tokens in HTML content

12.2.6.4.1 The " initial " insertion mode

When the user agent is to apply the rules for the "initial" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE Ignore the token.

A comment token Insert a comment as the last child of the Document object.

A DOCTYPE token If the DOCTYPE token's name is not identical to "html", or the token's public identifier is not missing, or the token's system identifier is neither missing nor identical to "about:legacy-compat", then there is a parse error.

Append a DocumentType node to the Document node, with the name attribute set to the name given in the DOCTYPE token, or the empty string if the name was missing; the publicId attribute set to the public identifier given in the DOCTYPE token, or the empty string if the public identifier was missing; the systemId attribute set to the system identifier given in the DOCTYPE token, or the empty string if the system identifier was missing; and the other attributes specific to DocumentType objects set to null and empty lists as appropriate. Associate the DocumentType node with the Document object so that it is returned as the value of the doctype attribute of the Document object.

Then, if the document is not an iframe srcdoc document, and the DOCTYPE token matches one of the conditions in the following list, then set the Document to quirks mode:

The force-quirks flag is set to on.
The name is set to anything other than "html" (compared identically).
The public identifier is set to: "-//W3O//DTD W3 HTML Strict 3.0//EN//"
The public identifier is set to: "-/W3C/DTD HTML 4.0 Transitional/EN"
The public identifier is set to: "HTML"
The system identifier is set to: "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd"
The public identifier starts with: "+//Silmaril//dtd html Pro v0r11 19970101//"
The public identifier starts with: "-//AS//DTD HTML 3.0 asWedit + extensions//"
The public identifier starts with: "-//AdvaSoft Ltd//DTD HTML 3.0 asWedit + extensions//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Level 1//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Level 2//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Strict Level 1//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Strict Level 2//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Strict//"
The public identifier starts with: "-//IETF//DTD HTML 2.0//"
The public identifier starts with: "-//IETF//DTD HTML 2.1E//"
The public identifier starts with: "-//IETF//DTD HTML 3.0//"
The public identifier starts with: "-//IETF//DTD HTML 3.2 Final//"
The public identifier starts with: "-//IETF//DTD HTML 3.2//"
The public identifier starts with: "-//IETF//DTD HTML 3//"
The public identifier starts with: "-//IETF//DTD HTML Level 0//"
The public identifier starts with: "-//IETF//DTD HTML Level 1//"
The public identifier starts with: "-//IETF//DTD HTML Level 2//"
The public identifier starts with: "-//IETF//DTD HTML Level 3//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 0//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 1//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 2//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 3//"
The public identifier starts with: "-//IETF//DTD HTML Strict//"
The public identifier starts with: "-//IETF//DTD HTML//"
The public identifier starts with: "-//Metrius//DTD Metrius Presentational//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 2.0 HTML Strict//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 2.0 HTML//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 2.0 Tables//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 3.0 HTML Strict//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 3.0 HTML//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 3.0 Tables//"
The public identifier starts with: "-//Netscape Comm. Corp.//DTD HTML//"
The public identifier starts with: "-//Netscape Comm. Corp.//DTD Strict HTML//"
The public identifier starts with: "-//O'Reilly and Associates//DTD HTML 2.0//"
The public identifier starts with: "-//O'Reilly and Associates//DTD HTML Extended 1.0//"
The public identifier starts with: "-//O'Reilly and Associates//DTD HTML Extended Relaxed 1.0//"
The public identifier starts with: "-//SQ//DTD HTML 2.0 HoTMetaL + extensions//"
The public identifier starts with: "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//"
The public identifier starts with: "-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//"
The public identifier starts with: "-//Spyglass//DTD HTML 2.0 Extended//"
The public identifier starts with: "-//Sun Microsystems Corp.//DTD HotJava HTML//"
The public identifier starts with: "-//Sun Microsystems Corp.//DTD HotJava Strict HTML//"
The public identifier starts with: "-//W3C//DTD HTML 3 1995-03-24//"
The public identifier starts with: "-//W3C//DTD HTML 3.2 Draft//"
The public identifier starts with: "-//W3C//DTD HTML 3.2 Final//"
The public identifier starts with: "-//W3C//DTD HTML 3.2//"
The public identifier starts with: "-//W3C//DTD HTML 3.2S Draft//"
The public identifier starts with: "-//W3C//DTD HTML 4.0 Frameset//"
The public identifier starts with: "-//W3C//DTD HTML 4.0 Transitional//"
The public identifier starts with: "-//W3C//DTD HTML Experimental 19960712//"
The public identifier starts with: "-//W3C//DTD HTML Experimental 970421//"
The public identifier starts with: "-//W3C//DTD W3 HTML//"
The public identifier starts with: "-//W3O//DTD W3 HTML 3.0//"
The public identifier starts with: "-//WebTechs//DTD Mozilla HTML 2.0//"
The public identifier starts with: "-//WebTechs//DTD Mozilla HTML//"
The system identifier is missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Frameset//"
The system identifier is missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Transitional//"

Otherwise, if the document is not an iframe srcdoc document, and the DOCTYPE token matches one of the conditions in the following list, then set the Document to limited-quirks mode:

The public identifier starts with: "-//W3C//DTD XHTML 1.0 Frameset//"
The public identifier starts with: "-//W3C//DTD XHTML 1.0 Transitional//"
The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Frameset//"

" The system identifier is not missing and the public identifier starts with: " -//W3C//DTD HTML 4.01 Transitional// " The system identifier and public identifier strings must be compared to the values given in the lists above in an ASCII case-insensitive manner. A system identifier whose value is the empty string is not considered missing for the purposes of the conditions above. Then, switch the insertion mode to "before html". Anything else If the document is not an iframe srcdoc document, then this is a parse error; set the Document to quirks mode. In any case, switch the insertion mode to "before html", then reprocess the token.

12.2.6.4.2 The " before html " insertion mode

When the user agent is to apply the rules for the "before html" insertion mode, the user agent must handle the token as follows:

The document element can end up being removed from the Document object, e.g. by scripts; nothing in particular happens in such cases, content continues being appended to the nodes as described in the next section.

12.2.6.4.3 The " before head " insertion mode

When the user agent is to apply the rules for the "before head" insertion mode, the user agent must handle the token as follows:

12.2.6.4.4 The " in head " insertion mode

When the user agent is to apply the rules for the "in head" insertion mode, the user agent must handle the token as follows:

12.2.6.4.5 The " in head noscript " insertion mode

When the user agent is to apply the rules for the "in head noscript" insertion mode, the user agent must handle the token as follows:

12.2.6.4.6 The " after head " insertion mode

When the user agent is to apply the rules for the "after head" insertion mode, the user agent must handle the token as follows:

12.2.6.4.7 The " in body " insertion mode

When the user agent is to apply the rules for the "in body" insertion mode, the user agent must handle the token as follows:

When the steps above say the user agent is to close a p element , it means that the user agent must run the following steps:

The adoption agency algorithm , which takes as its only argument a token token for which the algorithm is being run, consists of the following steps:

This algorithm's name, the "adoption agency algorithm", comes from the way it causes elements to change parents, and is in contrast with other possible algorithms for dealing with misnested content.

12.2.6.4.8 The " text " insertion mode

When the user agent is to apply the rules for the "text" insertion mode, the user agent must handle the token as follows:

12.2.6.4.9 The " in table " insertion mode

When the user agent is to apply the rules for the "in table" insertion mode, the user agent must handle the token as follows:

When the steps above require the UA to clear the stack back to a table context , it means that the UA must, while the current node is not a table , template , or html element, pop elements from the stack of open elements.

This is the same list of elements as used in the has an element in table scope steps.

The current node being an html element after this process is a fragment case.

12.2.6.4.10 The " in table text " insertion mode

When the user agent is to apply the rules for the "in table text" insertion mode, the user agent must handle the token as follows:

12.2.6.4.11 The " in caption " insertion mode

When the user agent is to apply the rules for the "in caption" insertion mode, the user agent must handle the token as follows:

12.2.6.4.12 The " in column group " insertion mode

When the user agent is to apply the rules for the "in column group" insertion mode, the user agent must handle the token as follows:

12.2.6.4.13 The " in table body " insertion mode

When the user agent is to apply the rules for the "in table body" insertion mode, the user agent must handle the token as follows:

When the steps above require the UA to clear the stack back to a table body context , it means that the UA must, while the current node is not a tbody , tfoot , thead , template , or html element, pop elements from the stack of open elements.

The current node being an html element after this process is a fragment case.

12.2.6.4.14 The " in row " insertion mode

When the user agent is to apply the rules for the "in row" insertion mode, the user agent must handle the token as follows:

When the steps above require the UA to clear the stack back to a table row context , it means that the UA must, while the current node is not a tr , template , or html element, pop elements from the stack of open elements.

The current node being an html element after this process is a fragment case.

12.2.6.4.15 The " in cell " insertion mode

When the user agent is to apply the rules for the "in cell" insertion mode, the user agent must handle the token as follows:

Where the steps above say to close the cell , they mean to run the following algorithm:

The stack of open elements cannot have both a td and a th element in table scope at the same time, nor can it have neither when the close the cell algorithm is invoked.

12.2.6.4.16 The " in select " insertion mode

When the user agent is to apply the rules for the "in select" insertion mode, the user agent must handle the token as follows:

12.2.6.4.17 The " in select in table " insertion mode

When the user agent is to apply the rules for the "in select in table" insertion mode, the user agent must handle the token as follows:

12.2.6.4.18 The " in template " insertion mode

When the user agent is to apply the rules for the "in template" insertion mode, the user agent must handle the token as follows:

12.2.6.4.19 The " after body " insertion mode

When the user agent is to apply the rules for the "after body" insertion mode, the user agent must handle the token as follows:

12.2.6.4.20 The " in frameset " insertion mode

When the user agent is to apply the rules for the "in frameset" insertion mode, the user agent must handle the token as follows:

12.2.6.4.21 The " after frameset " insertion mode

When the user agent is to apply the rules for the "after frameset" insertion mode, the user agent must handle the token as follows:

12.2.6.4.22 The " after after body " insertion mode

When the user agent is to apply the rules for the "after after body" insertion mode, the user agent must handle the token as follows:

12.2.6.4.23 The " after after frameset " insertion mode

When the user agent is to apply the rules for the "after after frameset" insertion mode, the user agent must handle the token as follows:

12.2.6.5 The rules for parsing tokens in foreign content

When the user agent is to apply the rules for parsing tokens in foreign content, the user agent must handle the token as follows:

12.2.7 The end


Once the user agent stops parsing the document, the user agent must run the following steps:


When the user agent is to abort a parser , it must run the following steps:

1. Throw away any pending content in the input stream, and discard any future content that would have been added to it.

2. Set the current document readiness to " interactive ".

3. Pop all the nodes off the stack of open elements.

4. Set the current document readiness to " complete ".

12.2.8 Coercing an HTML DOM into an infoset

When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML tool chain in certain subtle ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns , since they conflict with the Namespaces in XML syntax. There is also some data that the HTML parser generates that isn't included in the DOM itself. This section specifies some rules for handling these issues.

If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.

If the XML API doesn't support attributes in no namespace that are named " xmlns ", attributes whose names start with " xmlns: ", or attributes in the XMLNS namespace, then the tool may drop such attributes.

The tool may annotate the output with any namespace declarations required for proper operation.

If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.

For example, the element name foo<bar , which can be output by the HTML parser, though it is neither a legal HTML element name nor a well-formed XML element name, would be converted into fooU00003Cbar , which is a well-formed XML element name (though it's still not legal in HTML by any means).

As another example, consider the attribute xlink:href . Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix " xlink " and a local name " href ". However, used on an HTML element, it becomes an attribute with no prefix and the local name " xlink:href ", which is not a valid NCName, and thus might not be accepted by an XML API. It could thus get converted, becoming " xlinkU00003Ahref ".

The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser, since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
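The name mapping described above can be sketched as follows. The isNameChar test here is a stand-in assumption (ASCII letters, digits, "-", "_", and "."); a real XML API would apply its full NCName grammar instead.

```javascript
// A sketch of the local-name coercion described above: each character
// the XML API can't accept is replaced by "U" followed by the six-digit
// uppercase hexadecimal code point.
function isNameChar(ch) {
  // Simplified stand-in for a real XML API's allowed-character check.
  return /[A-Za-z0-9\-_.]/.test(ch);
}

function coerceName(name) {
  let out = "";
  for (const ch of name) {
    out += isNameChar(ch)
      ? ch
      : "U" + ch.codePointAt(0).toString(16).toUpperCase().padStart(6, "0");
  }
  return out;
}

console.log(coerceName("foo<bar"));    // "fooU00003Cbar"
console.log(coerceName("xlink:href")); // "xlinkU00003Ahref"
```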

If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE character between any such offending characters.

If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS character (-), the tool may insert a single U+0020 SPACE character at the end of such comments.

If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
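The two comment fixups above can be sketched together (a minimal illustration; a lookahead is used so that runs like "---" get a space between every adjacent pair of hyphens):

```javascript
// A sketch of the comment-data coercions described above: insert a
// space between consecutive hyphens, and pad a trailing hyphen.
function coerceCommentData(data) {
  // Insert a space after any "-" that is followed by another "-".
  data = data.replace(/-(?=-)/g, "- ");
  // A comment must not end in "-": append a space if it does.
  if (data.endsWith("-")) data += " ";
  return data;
}

console.log(coerceCommentData("a--b")); // "a- -b"
console.log(coerceCommentData("end-")); // "end- "
```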

If the tool has no way to convey out-of-band information, then the tool may drop the following information:

The mutations allowed by this section apply after the HTML parser's rules have been applied. For example, a <a::> start tag will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to then generate an actual element in the DOM with the name aU00003AU00003A for that start tag.

12.2.9 An introduction to error handling and strange cases in the parser

This section is non-normative.

This section examines some erroneous markup and discusses how the HTML parser handles these cases.

12.2.9.1 Misnested tags: <b><i></b></i>

This section is non-normative.

The most-often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:

html
  head
  body
    p
      #text: 1
      b
        #text: 2
        i
          #text: 3



Here, the stack of open elements has five elements on it: html , body , p , b , and i . The list of active formatting elements just has two: b and i . The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block . Thus, the stack of open elements ends up with just three elements: html , body , and p , while the list of active formatting elements has just one: i . The DOM tree is unmodified at this point.

The next token is a character ("4"); this triggers the reconstruction of the active formatting elements , in this case just the i element. A new i element is thus created for the "4" Text node. After the end tag token for the "i" is also received, and the "5" Text node is inserted, the DOM looks as follows:

html
  head
  body
    p
      #text: 1
      b
        #text: 2
        i
          #text: 3
      i
        #text: 4
      #text: 5



12.2.9.2 Misnested tags: <b><p></b></p>

This section is non-normative.

A case similar to the previous one is the following:

<b>1<p>2</b>3</p>

Up to the "2" the parsing here is straightforward:

html
  head
  body
    b
      #text: 1
      p
        #text: 2



The interesting part is when the end tag token with the tag name "b" is parsed.

Before that token is seen, the stack of open elements has four elements on it: html , body , b , and p . The list of active formatting elements just has the one: b . The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block , namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.

The common ancestor is the body element. A conceptual "bookmark" marks the position of the b element in the list of active formatting elements , but since that list has only one element in it, the bookmark won't have much effect.

As the algorithm progresses, node ends up set to the formatting element ( b ), and last node ends up set to the furthest block ( p ).

The last node gets appended (moved) to the common ancestor , so that the DOM looks like:

html
  head
  body
    b
      #text: 1
    p
      #text: 2



A new b element is created, and the children of the p element are moved to it:

html
  head
  body
    b
      #text: 1
    p

b
  #text: 2



Finally, the new b element is appended to the p element, so that the DOM looks like:

html
  head
  body
    b
      #text: 1
    p
      b
        #text: 2



The b element is removed from the list of active formatting elements and the stack of open elements , so that when the "3" is parsed, it is appended to the p element:

html
  head
  body
    b
      #text: 1
    p
      b
        #text: 2
      #text: 3



12.2.9.3 Unexpected markup in tables

This section is non-normative.

Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:

<table><b><tr><td>aaa</td></tr>bbb</table>ccc

The highlighted b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:

html
  head
  body
    table

...and then immediately after the b element start tag has been seen:

html
  head
  body
    b
    table

At this point, the stack of open elements has on it the elements html , body , table , and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".

The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straight-forward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:

html
  head
  body
    b
    table
      tbody
        tr

Here, the stack of open elements has on it the elements html , body , table , tbody , and tr ; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".

The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).

The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting Text node:

html
  head
  body
    b
    table
      tbody
        tr
          td
            #text: aaa

The end tags are handled in a straight-forward manner; after handling them, the stack of open elements has on it the elements html , body , table , and tbody ; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".

Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.

When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" Text node is appended to it:

html
  head
  body
    b
    b
      #text: bbb
    table
      tbody
        tr
          td
            #text: aaa

The stack of open elements has on it the elements html , body , table , tbody , and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".

Had the character tokens been only ASCII whitespace instead of "bbb", then that ASCII whitespace would just be appended to the tbody element.

Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:

html
  head
  body
    b
    b
      #text: bbb
    table
      tbody
        tr
          td
            #text: aaa
    b
      #text: ccc

12.2.9.4 Scripts that modify the page as it is being parsed

This section is non-normative.

Consider the following markup, which for this example we will assume is the document with URL https://example.com/inner , being rendered as the content of an iframe in another document with the URL https://example.com/outer :

<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>

Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward:

html
  head
  body
    div id="a"
      #text:
      script
        #text: var div = document.getElementById('a'); ⏎ parent.document.body.appendChild(div);



After the script is parsed, though, the div element and its child script element are gone:

They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.

Thus, when the second script element is parsed, it is inserted into the outer Document object.

Scripts parsed into a different Document than the one the parser was created for do not execute, so the first alert does not show.

Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document :

This script does execute, resulting in an alert that says "https://example.com/inner".

12.2.9.5 The execution of scripts that are moving across multiple documents

This section is non-normative.

Elaborating on the example in the previous section, consider the case where the second script element is an external script (i.e. one with a src attribute). Since the element was not in the parser's Document when it was created, that external script is not even downloaded.

In a case where a script element with a src attribute is parsed normally into its parser's Document , but while the external script is being downloaded, the element is moved to another document, the script continues to download, but does not execute.

In general, moving script elements between Document s is considered a bad practice.

12.2.9.6 Unclosed formatting elements

This section is non-normative.

The following markup shows how nested formatting elements (such as b ) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.

<!DOCTYPE html>
<p><b class=x><b class=x><b><b class=x><b class=x><b>X
<p>X
<p><b><b class=x><b>X
<p></b></b></b></b></b></b>X

The resulting DOM tree is as follows:

DOCTYPE: html

html
  head
  body
    p
      b class="x"
        b class="x"
          b
            b class="x"
              b class="x"
                b
                  #text: X⏎
    p
      b class="x"
        b
          b class="x"
            b class="x"
              b
                #text: X⏎
    p
      b class="x"
        b
          b class="x"
            b class="x"
              b
                b
                  b class="x"
                    b
                      #text: X⏎
    p
      #text: X⏎



Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".

Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of active formatting elements, even though nine b start tags have been seen up to this point.
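The "up to three of each kind" behavior in this example comes from the limit on the list of active formatting elements: before a new formatting element is pushed, if three entries already match its tag name and attributes, the earliest such entry is dropped. A minimal sketch, with entries simplified to plain { name, attrs } records (a real implementation compares namespace and attribute values per the spec's push steps):

```javascript
// A sketch of the three-copies limit ("Noah's Ark clause") on the
// list of active formatting elements.
function pushFormattingElement(list, entry) {
  // Find existing entries with the same name and attributes.
  const matches = list.filter(
    e => e.name === entry.name &&
         JSON.stringify(e.attrs) === JSON.stringify(entry.attrs)
  );
  // If there are already three, remove the earliest one.
  if (matches.length >= 3) {
    list.splice(list.indexOf(matches[0]), 1);
  }
  list.push(entry);
}

const list = [];
for (let i = 0; i < 4; i++) {
  pushFormattingElement(list, { name: "b", attrs: { class: "x" } });
}
console.log(list.length); // 3  (the earliest duplicate was dropped)
```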

12.3 Serializing HTML fragments

For the purposes of the following algorithm, an element serializes as void if its element type is one of the void elements, or is basefont , bgsound , frame , or keygen .
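The check can be sketched as a simple set lookup. The VOID_ELEMENTS list below is an assumption reflecting the void elements as defined at the time of this section, plus the four historical extras named above:

```javascript
// A sketch of the "serializes as void" check: the void elements,
// plus basefont, bgsound, frame, and keygen.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
  // Historical extras that also serialize as void:
  "basefont", "bgsound", "frame", "keygen",
]);

function serializesAsVoid(localName) {
  return VOID_ELEMENTS.has(localName);
}

console.log(serializesAsVoid("br"));  // true
console.log(serializesAsVoid("div")); // false
```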

The following steps form the HTML fragment serialization algorithm . The algorithm takes as input a DOM Element , Document , or DocumentFragment referred to as the node , and returns a string.

This algorithm serializes the children of the node being serialized, not the node itself.

It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.

For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text control. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string " --> ", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a Text node with the text string " </script> ", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p ).

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters " </style><script>attack</script> " as a font family name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.

For example, consider the following markup:

<form id="outer"><div></form><form id="inner"><input>

This will be parsed into:

html
  head
  body
    form id="outer"
      div
        form id="inner"
          input

The input element will be associated with the inner form element. Now, if this tree structure is serialized and reparsed, the <form id="inner"> start tag will be ignored, and so the input element will be associated with the outer form element instead.

<html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>

html
  head
  body
    form id="outer"
      div
        input



As another example, consider the following markup:

<a><table><a>

This will be parsed into:

html
  head
  body
    a
      a
      table

That is, the a elements are nested, because the second a element is foster parented. After a serialize-reparse roundtrip, the a elements and the table element would all be siblings, because the second <a> start tag implicitly closes the first a element.

<html><head></head><body><a><a></a><table></table></a></body></html>

html
  head
  body
    a
    a
    table



For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF) character in pre , textarea , or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A LINE FEED (LF) character.

For example, consider the following markup (note the two newlines after the start tag):

<pre>

Hello.</pre>

When this document is first parsed, the pre element's child text content starts with a single newline character. After a serialize-reparse roundtrip, the pre element's child text content is simply " Hello. ".

Because of the special role of the is attribute in signaling the creation of customized built-in elements, in that it provides a mechanism for parsed HTML to set the element's is value, we special-case its handling during serialization. This ensures that an element's is value is preserved through serialize-parse roundtrips.

When creating a customized built-in element via the parser, a developer uses the is attribute directly; in such cases serialize-parse roundtrips work fine.

<script>
window.SuperP = class extends HTMLParagraphElement {};
customElements.define("super-p", SuperP, { extends: "p" });
</script>
<div id="container"><p is="super-p">Superb!</p></div>
<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>

But when creating a customized built-in element via its constructor or via createElement() , the is attribute is not added. Instead, the is value (which is what the custom elements machinery uses) is set without intermediating through an attribute.

<script>
container.innerHTML = "";
const p = document.createElement("p", { is: "super-p" });
container.appendChild(p);
// The is attribute is not present in the DOM:
console.assert(!p.hasAttribute("is"));
// But the element is still a super-p:
console.assert(p instanceof SuperP);
</script>

To ensure that serialize-parse roundtrips still work, the serialization process explicitly writes out the element's is value as an is attribute:

<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

1. Replace any occurrence of the " & " character by the string " &amp; ".

2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string " &nbsp; ".

3. If the algorithm was invoked in the attribute mode, replace any occurrences of the " " " character by the string " &quot; ".

4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the " < " character by the string " &lt; ", and any occurrences of the " > " character by the string " &gt; ".
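These steps can be sketched directly (a minimal illustration; the ordering matters, since "&" must be escaped before any replacement that introduces new "&" characters):

```javascript
// A sketch of the escaping algorithm above. attributeMode selects the
// attribute-value variant.
function escapeString(s, attributeMode = false) {
  // "&" first, so later replacements aren't double-escaped.
  s = s.replace(/&/g, "&amp;");
  s = s.replace(/\u00a0/g, "&nbsp;");
  if (attributeMode) {
    s = s.replace(/"/g, "&quot;");
  } else {
    s = s.replace(/</g, "&lt;").replace(/>/g, "&gt;");
  }
  return s;
}

console.log(escapeString('1 < 2 & "q"'));       // 1 &lt; 2 &amp; "q"
console.log(escapeString('1 < 2 & "q"', true)); // 1 < 2 &amp; &quot;q&quot;
```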

12.4 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm . The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, as well as input , a string to parse, and returns a list of zero or more nodes.