Let Σ be a non-empty, finite set of symbols, called an alphabet. Then Σ* is the countable infinite set of finite words that can be formed by concatenating zero or more symbols from Σ. Any well-defined subset L ⊆ Σ* is a language.

Let's apply this to XML. Its alphabet is the Unicode character set U, which is non-empty and finite. Not every concatenation of zero or more Unicode characters is a well-formed XML document, for example, the string

<tag> soup &; not <//good>

is clearly not. The subset XML ⊂ U* that forms well-formed XML documents is decidable (or “recursive”). There exists a machine (algorithm or computer program) that takes as input any word w ∈ U* and after a finite amount of time, outputs either 1 if w ∈ XML and 0 otherwise. Such an algorithm is a sub-routine of any XML processing software. Not all languages are decidable. For example, the set of valid C programs that terminate in a finite amount of time, is not (this is known as the halting problem). When one designs a new language, an important decision to make is whether it should be as powerful as possible or whether the expressiveness would better be restricted in favor of decidability.

Some languages can be defined by means of a grammar that is said to produce the language. A grammar consists of

a finite set of literals (also called terminal symbols),

a disjoint finite set of variables of the grammar (also called non-terminal symbols),

a distinguished starting symbol, taken from the set of variables and

a finite set of rules (so-called productions) that allow certain kinds of replacements.

Any word that consists exclusively of literals and can be derived by starting with the starting symbol and then applying the given rules belongs to the language produced by the grammar.

For example, the following grammar (in rather informal notation) lets you derive exactly the integers in decimal notation. The literals of the grammar are the digits 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , and 0 . The variables are the symbols S and D. S is the starting symbol. Any occurrence of the variable S may be replaced with the literal 0 or

or by any of the literals other than 0 followed by the variable D. Any occurrence of the variable D may be replaced by any of the literals followed by another instance of the variable D or

by the empty string. Here is how we derive 42 : S —(apply rule 4, 2nd variant)→ 4 D —(apply rule 5, 1st variant)→ 42 D —(apply rule 5, 2nd variant)→ 42 .

Depending on how elaborate rules you allow in your grammar, differently sophisticated machines are required to prove that a given word can actually be produced by the grammar. The example given above is a regular grammar, which is the most simple and least powerful. The next powerful class of grammars are called context-free. These grammars are also very simple to verify. XML (unless I'm overlooking some obscure feature I'm not aware of) can be described by a context-free grammar. The classification of grammars forms the Chomsky Hierarchy of grammars (and therefore languages). Every language that can be described by a grammar is at least semi-decidable (or “recursively enumerable”). That is, there exists a machine that, given a word that actually belongs to the language, derives a proof that it can be produced by the grammar within finite time, and will never output a wrong proof. Such a machine is called a verifier. Note that the machine may never halt when given a word that doesn't actually belong to the language. Clearly, we want our programming languages be described by less powerful grammars for the benefit of being able to reject invalid programs within finite time.

Schemata are an addition to XML that allow refining the set of well-formed documents. A well-formed document that follows a certain schema is called valid according to that schema. For example, the string

<?xml version="1.0" encoding="utf-8" ?> <root>all evil</root>

is a well-formed XML document but not a valid XHTML document. There exists schemata for XHTML, SVG, XSLT and what not else. Schema validation can also be done by an algorithm that is guaranteed to halt after finite amount of steps for every input. Such a program is called a validator or a validating parser. Schemata are defined by so-called scema definition languages, which are a way to formally define grammars. XSD is the official schema-definition language for XML and is, itself, XML-based. RELAX NG is a more elegant, much simpler and slightly less powerful alternative to XSD.

Because you can define your own schemata, XML is called an extensible language, which is the origin of the “X” in “XML”.

You can define a set of rules that gives XML documents an interpretation as descriptions of computer programs. XSLT, mentioned earlier, is an example of such a programming language built with XML. More generally, you can serialize the abstract syntax tree of almost any programming language quite naturally into XML, if this is what you want.