Academia Sinica Computing Centre
This paper explains the architecture behind XXX (the eXperimental XML leXer): the idea of Notation Processors.
The ISO SGML standard, [IS 8879:1986], has been widely criticized for not providing a well-layered analysis, couched in conventional computer-science terms, that would allow straightforward implementations. In practice, this means that SGML's productions are non-deterministic: information provided outside the productions is required to implement the grammar.
Many people, especially those habituated to UNIX's little-language approach, complain that it is not possible to implement conforming SGML parsers using the common compiler-compilers: [lex], [yacc], [flex], [bison], etc.
Others point out an inconsistency of purpose in SGML: it steadfastly keeps itself as a text-based notation (i.e., not a binary format) to allow editing in simple text editors; yet its complexity can prevent it from being useful for the more sophisticated but non-structure-based text processing utilities, notably [Perl].
XML was to a large extent designed to address these criticisms:
XML has two levels: well-formed and valid. However, the XML processing model has still been criticized [St. Laurent, XML-DEV] for being under-layered, especially in light of the W3C's current fertility rate, as various core specifications are developed and added.
To these, I would add another criticism: the layering of XML is geared towards the lexical/semantic divide learned from compiler-compilers such as Yacc, and there is a more natural way to model XML. In turn, this model may provide a simpler design approach for XML systems. Actually, this criticism is not a negative criticism at all: as with SGML, it is only by proposing and working with a technology that one becomes familiar enough with it to propose better approaches.
The trouble with the lex/yacc approach, in the specific SGML/XML arena, is that:
Underlying [IS 8879:1986] but never made very explicit is the notion of a token layer; it is evidenced in the various name tokens, in the recognition modes, in the short-reference delimiter maps, and in the datatag feature. Let us rescue this idea, and explore it more fully.
The architecture I propose has three classes of objects:
In the note "Validate This! Content Models for Different Targets", I show how an XML document can be represented as a tree/graph of node-lists: these node-lists can be validated against regular expressions. That note shows that regular expressions on such a graph can not only validate many useful things in addition to the allowed content of an element, but can also be used to explain XML well-formedness.
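As a minimal sketch of that idea (the flattening convention and the function name here are my own invention, not taken from the note), a node-list of child element names can be checked against a content model expressed as an ordinary regular expression:

```python
import re

# Hypothetical sketch: validate an element's child node-list against a
# content model by flattening the names into a string and applying a
# regular expression. The trailing commas keep names from running together.

def validate_content(child_names, content_model_regex):
    """Return True if the node-list matches the content model."""
    flat = ",".join(child_names) + ","
    return re.fullmatch(content_model_regex, flat) is not None

# Content model (title, author+, para*) expressed as a regex over names.
MODEL = r"title,(author,)+(para,)*"

print(validate_content(["title", "author", "para"], MODEL))  # True
print(validate_content(["author", "title"], MODEL))          # False
```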
We can therefore describe the tokenizer as an object which takes characters from entities and produces a token tree.
Next we can define the parser as an object which takes a regular grammar and parses the token tree against it, both to validate and to identify objects of interest.
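As a rough illustration of these two objects (the class names, parameter shapes, and token types below are invented for this sketch and are not part of XXX), a parameterized tokenizer can produce a token list which a regular-grammar parser then validates:

```python
import re

# Illustrative sketch only: a general-purpose tokenizer driven by a
# parameter list of (type, regex) pairs, and a parser that checks the
# resulting token sequence against a regular grammar over token types.

class Tokenizer:
    def __init__(self, token_specs):
        # token_specs: ordered list of (token_type, regex) parameters.
        self.pattern = re.compile(
            "|".join("(?P<%s>%s)" % (name, rx) for name, rx in token_specs))

    def tokenize(self, text):
        return [(m.lastgroup, m.group()) for m in self.pattern.finditer(text)]

class Parser:
    def __init__(self, grammar_regex):
        # The grammar is a regular expression over token *types*.
        self.grammar = re.compile(grammar_regex)

    def parse(self, tokens):
        flat = ",".join(token_type for token_type, _ in tokens) + ","
        return self.grammar.fullmatch(flat) is not None

# Parameters for a toy start-tag-like notation.
specs = [("STAGO", r"<"), ("NAME", r"[A-Za-z][A-Za-z0-9]*"),
         ("WS", r"\s+"), ("TAGC", r">")]
tok = Tokenizer(specs)
par = Parser(r"STAGO,NAME,(WS,NAME,)*TAGC,")

print(par.parse(tok.tokenize("<doc lang>")))  # True
```

The point of the sketch is that both objects are entirely parameter-driven: neither hard-codes any knowledge of a particular notation.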
I leave a definition of the tree transformer to some later time.
We can then describe a validating XML processor as having five layers:
When the attribute or element contains embedded notations, more layers would be needed.
Let us explore some implications of this architecture. The first is that it becomes possible to derive a highly parameterized general-purpose tokenizer and a general-purpose regular-language parser, and to create a notation processor from these.
Because the tokenizer is coupled to entities and not to the parser, we can use the same general-purpose tokenizer (with different parameters) to also generate a node-list tree from data in element content and in attributes. And we can use the same general-purpose parser (with different parameters) to parse the node-list of the embedded data. This could give a worthwhile decrease in code size for an XML application.
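The reuse described above might look like the following sketch, where one generic tokenizing function serves two different contexts purely through its parameters (the function and parameter names are my own, for illustration):

```python
import re

# Sketch of the reuse the text describes: one generic tokenizer, driven
# by different parameter sets for different contexts -- attribute values
# and embedded data notations in element content.

def tokenize(text, separators):
    """Split text into a node-list using a context-specific separator set."""
    return [t for t in re.split("[%s]+" % re.escape(separators), text) if t]

# Parameters for a NMTOKENS-style attribute value (split on white space).
attr_tokens = tokenize("red green  blue", " \t\n")

# Parameters for a comma-separated embedded notation in element content.
data_tokens = tokenize("1999,12,31", ",")

print(attr_tokens)  # ['red', 'green', 'blue']
print(data_tokens)  # ['1999', '12', '31']
```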
This kind of approach should be congenial to readers familiar with LISP, C++ templates, object patterns, or other abstraction mechanisms. (The idea is explored from a different angle in XML Notation Schemas.)
The XXX architecture does not have any great novelty, except that it proposes that there is a convenient layer of tokenizing in SGML/XML which does not require a full regular lexical-scanning language (with the attendant complexity of, e.g., lex). XML is very rich in the different kinds of tokens it has (delimiters, identifiers, white space, tokens within token groups), but instead of a full regular language, they conform to the following model:
lexer ::= startDelimiter, whiteSpace?,
          ( subtag1 | subtag2 | subtag3 | subtag4
          | subtag5 | subtag6 | subtag7 | subtag8
          | (token, whiteSpace?) | CDATA )*,
          ( altEndDelimiter | endDelimiter )?
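To make the model concrete, here is one hypothetical instance of it for a single notation, a simplified start-tag: the delimiters and token patterns are illustrative only, and this is not a conforming XML tokenizer.

```python
import re

# One simplified instance of the model above, for a start-tag notation:
# startDelimiter "<", then tokens (names, "=", quoted literals) separated
# by optional white space, then endDelimiter ">" or altEndDelimiter "/>".

START_TAG = re.compile(r"""
    (?P<startDelimiter> < )
    \s?
    (?P<body> (?: [^\s>/"'=]+ | = | "[^"]*" | '[^']*' | \s )* )
    (?P<endDelimiter> /> | > )
""", re.VERBOSE)

m = START_TAG.match('<chapter id="c1">')
print(m.group("startDelimiter"))  # <
print(m.group("endDelimiter"))    # >
```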
Notation processors could be made dynamically by adding code that constructs the tokenizer's parameter-list from parameters held in a node-list. For example, an XML Schema "picture" could be tokenized, parsed, and transformed into a node-list; this node-list is then used to drive a new tokenizer to read the data being checked.
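A hypothetical sketch of that dynamic construction follows; the shape of the "picture" node-list and all names here are invented for illustration, not taken from XML Schema or XXX:

```python
import re

# Invented illustration: a node-list of parameters (represented here as a
# dict of (type, regex) pairs) drives the construction of a new tokenizer,
# which is then used to read the data being checked -- a notation
# processor built on the fly.

def build_tokenizer(params):
    """Compile a tokenizer from a parameter node-list."""
    pattern = re.compile(
        "|".join("(?P<%s>%s)" % (name, rx) for name, rx in params["tokens"]))
    def tokenize(text):
        return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]
    return tokenize

# A "picture" for a date-like notation, already parsed into a node-list
# of token parameters.
date_picture = {"tokens": [("YEAR", r"\d{4}"), ("SEP", r"-"),
                           ("NUM", r"\d{1,2}")]}

tokenize_date = build_tokenizer(date_picture)
print(tokenize_date("1999-12-31"))
```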
(As a side comment, an XXX notation processor system is something like a LISP which has only cells and interned-symbols.)
In XML, the element tree, the entity tree, and the notation tree are all synchronous. Indeed, for simple documents with no DTD, one entity, and no embedded notations, there is no difference. Class-based XML implementations are often keyed off the element tree; it would be possible also to key a class-based implementation off the entity tree, especially for SGML which has a richer selection of notations available for entities. In this paper, I suggest that the notation tree may be the most suitable structure for tokenizer objects to follow.
Copyright (C) 1999 Rick Jelliffe. Please feel free to publish this in any way you like, but try to update it to the most recent version, and keep my name on it.