Validate This! Content Models on Different Targets

Rick Jelliffe
Academia Sinica
Taipei, Taiwan

1999-05-21

Abstract

This paper discusses various structures in an XML or SGML document which may be validated using a regular expression over a string of name tokens.

Background

XML and SGML provide declarations which allow the contents of an element to be validated against a model. The models are simple regular expressions which are nominally matched against a string of name tokens provided by the lexical scanner, and are keyed by the element type name (known in SGML as the generic identifier).
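As an illustration, this matching can be sketched directly with ordinary regular expressions: the content model is compiled into a regex, and the element's children are flattened into a space-separated string of names. This is a minimal sketch of my own, not a full implementation; it handles only names, grouping, ",", "|" and the occurrence indicators, not "&" or mixed content.

```python
import re

def model_to_regex(model):
    """Turn a DTD-style content model like "( box )*" into a compiled
    regex over a space-separated string of child element names."""
    out = []
    for tok in re.findall(r"[A-Za-z][\w.:-]*|[(),|?*+]", model):
        if re.match(r"[A-Za-z]", tok):
            out.append("(?:%s )" % tok)  # each name consumes its trailing space
        elif tok == ",":
            out.append("")               # sequence is plain concatenation
        else:
            out.append(tok.replace("(", "(?:"))
    return re.compile("".join(out) + "$")

def validate(model, children):
    """Match the children's names, as one token string, against the model."""
    return model_to_regex(model).match("".join(c + " " for c in children)) is not None
```

For example, `validate("( box )*", ["box", "box"])` succeeds, while a "lid" child would be rejected.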

In the paper Using XSL as a Validation Language, I discuss how validation can be performed using tree locations rather than regular expressions: first identifying the candidate objects by tree location, then checking some assertion about them, again by tree location. In this paper I take the orthogonal topic of identifying other structures within an XML or SGML document which may be validated using regular expressions on a string of name tokens. I concentrate on XML elements.

A Node Model for Simple XML

Let us make a simple node model for XML. Here is an XML document.

<?xml version="1.0" ?>
<!DOCTYPE box [
<!ELEMENT box ( box )* >
<!ATTLIST box
  id ID #REQUIRED
  length-breadth-width NMTOKENS #REQUIRED
  units NMTOKEN #REQUIRED >
]>
<box id="b1" length-breadth-width="3 5 8" units="cm">
<box id="b2" />
</box>

Here is a diagram showing how we model it:

In this modeling system, we only have nodes (boxes), node-lists (red lines), arcs (black lines), names (text) and strings (quoted text). Terminal nodes I will represent using the name {terminal-node}, which is equivalent, for terminal nodes in element content, to #PCDATA.
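In code, the model is tiny. The class names below are mine, not part of the model itself:

```python
# A minimal rendering of the node model: a node has a name and zero or
# more node-lists (the red lines); terminal nodes carry strings.
class Node:
    def __init__(self, name, node_lists=None):
        self.name = name                    # e.g. "box", "Attributes"
        self.node_lists = node_lists or []  # each entry is a list of child Nodes

class TerminalNode(Node):
    def __init__(self, data):
        Node.__init__(self, "{terminal-node}")
        self.data = data                    # plays the role of #PCDATA

# the two boxes from the example document, modelling contents only
b2 = Node("box")
b1 = Node("box", [[b2]])
```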

In this diagram, I have placed NMTOKENs in quotes (they might be considered terminals); however, for the rest of the paper they are no longer treated as strings but as name tokens (i.e., there is no difference between "Attributes", "units" and "cm"). (This also means I do not need to treat CDATA differently from the tokenized attribute types, which simplifies exposition: it would be better to use the SGML convention of a # prefix to avoid name clashes, but that would be distracting.)

Readers should note that this paper does not propose a syntax for validation. The syntax used is vanilla, and familiar to users of regular expressions, SGML or XML markup declarations, and BNF.

Using Regular Expressions

The essence of this paper is that each node-list (the red lines) can be modelled against a content-model of tokens:

If each node-list can be modelled by a regular expression, each node-list can be validated against one.
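For instance, treating the attribute values of the box example as node-lists of tokens, hypothetical models for them can be checked with ordinary regexes. The specific constraints here, three numeric dimensions and a small set of units, are my invention for illustration, not something the box DTD declares:

```python
import re

# Hypothetical token models for the box example's attributes:
#   units                -> (cm | mm | in)
#   length-breadth-width -> exactly three number tokens
UNITS = re.compile(r"(?:cm|mm|in)$")
DIMENSIONS = re.compile(r"\d+ \d+ \d+$")

def validate_box_attributes(attrs):
    """Validate the token strings of a box element's attributes."""
    return (UNITS.match(attrs["units"]) is not None and
            DIMENSIONS.match(attrs["length-breadth-width"]) is not None)
```

So the attributes of "b1" above, `units="cm"` and `length-breadth-width="3 5 8"`, would pass.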

Here is a list of some useful validations:

Applying the Model to Well-Formed XML

We can use the model above to describe full "well-formed" XML better.

For a start, we can say that in XML the model for the node-list under "Element" is in fact (XML Attributes & Contents & Attributes), and that "XML Attributes" is a node-list with the model (XML name & xml:lang? & xml:space?) for XML 1.0. (This represents a good conceptual advance on SGML; of course, these attributes use the same syntax as any other attributes, so in some other profile of SGML they would be regarded as normal attributes.)

Let us now create three more kinds of non-terminal nodes: "PI", "Comment" and, following Paul Prescod, "WhiteSpace". We can now explain the XML content types when xml:space="preserve":
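Reading the preserved-content rule as a model like (Element | PI | Comment | WhiteSpace | {terminal-node})* — my rendering, not a quotation from the diagram — the check again reduces to a regex over kind names:

```python
import re

# Each child node is reduced to the name of its kind; under
# xml:space="preserve", any mixture of these kinds is allowed.
PRESERVE = re.compile(r"(?:(?:Element|PI|Comment|WhiteSpace|\{terminal-node\}) )*$")

def validate_preserve(kinds):
    """kinds: the node-kind names of an element's children, in order."""
    return PRESERVE.match("".join(k + " " for k in kinds)) is not None
```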

We can further say that an element with xml:space="default" may provide different behaviour than this: see the example below explaining the SGML model.

Applying the Model to Valid XML

The addition of document type definitions allows several further validations, because it provides more information.

First, the element reference attribute types (ID, IDREF) can be modeled by adding a new kind of node, "Reference", which replaces IDREF attribute values and has a content model of ({terminal-node} & Element). In adding a "Reference" node, we move from a simple tree structure to a directed graph. We can thus define a content model for an IDREF attribute on an element, constraining it to point to a single element type: an attribute named "boxRef" could be constrained to reference only elements of type "box".

We can model IDREFS attributes as having a content model of ({terminal-node} & Element*). The content model of an IDREFS attribute allows more interesting forms of validation, especially for complex linking: we could specify that a "household" IDREFS attribute has a content model constraining it to reference only elements matching (mother? & father? & child* & grandparents* & grandchild* & unmarried-sibling* & refugee* & pet* & ghost*).
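One way to sketch this check: resolve each IDREFS token to the element type of its target, then match the resulting types against the model. Since ordinary regex engines have no "&" (any-order) connector, the sketch below emulates it by sorting the types into a canonical order before matching the sequential form of the model; the id-to-type table is an assumed input.

```python
import re

# Canonical order for the "household" model's member types.
ORDER = ["mother", "father", "child", "grandparents", "grandchild",
         "unmarried-sibling", "refugee", "pet", "ghost"]

# The sequential form of the model, applied after sorting.
HOUSEHOLD = re.compile(
    r"(?:mother )?(?:father )?(?:child )*(?:grandparents )*(?:grandchild )*"
    r"(?:unmarried-sibling )*(?:refugee )*(?:pet )*(?:ghost )*$")

def validate_household(idrefs, types_by_id):
    """idrefs: tokens of the IDREFS attribute; types_by_id: id -> element type."""
    types = [types_by_id.get(ref) for ref in idrefs]
    if any(t not in ORDER for t in types):
        return False                      # dangling reference or wrong element type
    types.sort(key=ORDER.index)
    return HOUSEHOLD.match("".join(t + " " for t in types)) is not None
```

Note how the occurrence indicators carry over: a second "mother" reference fails because the model says mother?, while any number of "child" references pass.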

Note, too, that this kind of validation is also available using URLs to other documents whose ID attributes we can identify. However, the advantage of using NMTOKENS wherever possible is that NMTOKENS have a better chance of conforming to the name rules of programming and scripting languages: they can be used directly as keys or classnames without requiring an extra level of lookup.

This opens up a new kind of extended validation. Similar forms of validation can be applied to ENTITY and NOTATION references, assuming that the DTD has been made available as nodes according to this model.

Various models can be applied to the markup declarations: an XML entity declaration has a model of (Entity Name & ({terminal-node} | ((System Identifier | (Public Identifier & System Identifier)) & Notation?)))

Applying the Model to SGML

The SGML content types and white-space rules can be explained as follows:

Entity references and marked sections could be included in this model as non-terminal nodes. Numeric character references are resolved at the lexical level and do not figure here. I have not investigated inclusions.

In SGML, an entity is more like an element: it has a model of (Entity Name & ({terminal-node} | ((System Identifier | (Public Identifier & System Identifier?)) & (Notation & Attributes)))), and "Reference" has the content model ({terminal-node} & (Element | Entity))

Architectural forms can be seen, in part, as a system of declaring validation constraints based on attribute values other than the element type name.

Comparison with XSL Validation

This brings us to a point very close to that raised in Using XSL as a Validation Language: we can view content-model validation as a process of

What, then, is the difference between a tree transformer and a validator? I have suggested in the paper above that the difference is one of implementation rather than nature; however, it is a significant difference and I would not want to reduce them too far. The user of a validator does not think in terms of nodes and arcs; they see only a higher-level view of the document objects. Tree-location languages do not provide very good positional hooks for relative checking of element contents: in particular, if they do not provide the equivalents of "|", ",", "+", some mechanism to cope with groups "( )" (and, if possible, the short forms "*" and "&", and a not indicator to allow "content models by exclusion"), they cannot act as implementations for regular-expression-based validators. They would be expected to operate only on content models containing "?" and "&".


Copyright (C) 1999 Rick Jelliffe. Please feel free to publish this in any way you like, but try to update it to the most recent version, and keep my name on it.