Academia Sinica Computing Centre
This note looks at two intriguing possibilities for structural validation raised by the XPath draft:
As an example, I express RDF using path models. RDF has structural constraints that cannot be expressed using DTD content models. However, these constraints can be expressed using path models.
This note suggests that XPath axes can be used to provide targets for regular expression validation and capture constraints that cannot be as readily expressed by content models.
XPath axes are introduced in the XPath draft. The idea of validating documents using different targets is explored in the note Validate This! Content Models on Different Targets. The idea of using XSL for document validation is explored in the note Using XSL as a Validation Language.
XPath includes the following axes:
These axes can be used as the target of regular expressions. These allow different kinds of validation constraints to be specified.
Current content models can be characterized as regular expressions with targets on the axis child.
For example, given the axis ancestor, the model (a, b*, c?, (d | e)) is satisfied at the element f in the following document
<e> <b> <b> <a> <f/> </a> </b> </b> </e>
Given the axis descendant, the model (g, h*, i?, (j | k)) is satisfied at element l in the following document:
<l> <g/> <j/> </l>
and also by
<l> <g> <h/><i/> </g> <k/> </l>
The constraints imposed by other axes can readily be imagined in similar fashion.
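The axis examples above can be sketched in code. The following is a minimal Python sketch, not a definitive implementation: it assumes that a model is compiled to an ordinary regular expression over element names, that the ancestor axis is read nearest-first (as the example at f implies), and that the descendant axis is read in document order. The function names model_to_regex, axis_names and satisfies are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Compile a DTD-style model such as (a, b*, c?, (d | e)) into a regex
    that matches a string of element names, each followed by one space."""
    compact = re.sub(r"\s+", "", model)
    # Every name becomes a group matching "name "; the sequence commas vanish.
    body = re.sub(r"[\w:.-]+",
                  lambda m: "(?:" + re.escape(m.group()) + " )", compact)
    return re.compile(body.replace(",", ""))


def axis_names(root, target, axis):
    """Return the element names found along one axis, starting from target."""
    parents = {child: parent for parent in root.iter() for child in parent}
    if axis == "child":
        return [el.tag for el in target]
    if axis == "descendant":
        # iter() yields target itself first, so skip it.
        return [el.tag for el in target.iter()][1:]
    if axis == "ancestor":
        names, node = [], parents.get(target)
        while node is not None:  # assumption: nearest ancestor comes first
            names.append(node.tag)
            node = parents.get(node)
        return names
    raise ValueError("axis not covered by this sketch: " + axis)


def satisfies(root, target, axis, model):
    """True if the names along the axis match the model as a whole."""
    sequence = "".join(name + " " for name in axis_names(root, target, axis))
    return model_to_regex(model).fullmatch(sequence) is not None


doc = ET.fromstring("<e><b><b><a><f/></a></b></b></e>")
print(satisfies(doc, doc.find(".//f"), "ancestor", "(a, b*, c?, (d | e))"))
```

Under these assumptions an ordinary content model is simply the case where axis is "child".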
The question needs to be asked: "What is wrong with content models on children?" My answer would be "nothing", but to allow the maximum freedom in constraint modeling, these other axes can be made available at little cost if an XPath implementation is available.
Many of the kinds of validation I suggested in Validate This! can be implemented by the different axes of XPath. In particular, the attribute axis allows dependencies between attributes to be expressed. For example, it can say that if a "size" attribute is specified, a "units" attribute should also be specified.
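The "size" and "units" dependency can be illustrated with a small Python sketch. It is only one possible reading: the table REQUIRES and the function names are hypothetical, and the check treats attributes as an unordered set rather than as a sequence on the attribute axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical dependency table: attribute -> attributes that must accompany it.
REQUIRES = {"size": {"units"}}


def missing_companions(element):
    """Return the attributes that the element's own attributes require
    but that are absent."""
    present = set(element.attrib)
    missing = set()
    for name, companions in REQUIRES.items():
        if name in present:
            missing |= companions - present
    return missing


def violations(root):
    """Map each offending element name to its missing attributes."""
    return {el.tag: missing_companions(el)
            for el in root.iter() if missing_companions(el)}


doc = ET.fromstring('<r><box size="3" units="cm"/><gap size="2"/><line/></r>')
print(violations(doc))
```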
XPath seems an appropriate technology on which to base validators on different targets.
Instead of specifying an axis for a model, XPaths could be used inside DTD content models (or their equivalent in other schema languages): in this view, the content model is just a degenerate case of XPaths which only specify children element types.
The term content model is no longer applicable: it is a path model. A validator sequentially examines each path to see if it is satisfied: the sequence operator increments a location cursor at each level of interest. For argument's sake, all candidate path models must be valid; this is different to XSLT, where a priority attribute decides which of several templates whose patterns match the current point should be used.
Path models are interesting in that they allow a different approach to document validation than the top-down model. This may be useful for extensible documents. It also allows a different approach to document construction and document minimization.
<!ELEMENT a ( ../x, b, c/d, c/e)>
would be satisfied by
<x> <a> <b/> <c> <d/> <e/> </c> </a> </x>
<!ELEMENT person ( name/@first, name/@last )> <!ELEMENT name EMPTY> <!ATTLIST name first CDATA #IMPLIED last CDATA #IMPLIED >
would be satisfied by
<person><name first="Rick"/><name last="Jelliffe"/></person>
but not by
<person><name first="Rick"/><name first="Ricko" /></person>
We can see that the presence, names, optionality and enumerated values of attributes can be expressed using this:
<!ELEMENT a ( (@color='red' | @color='green' | @color='blue' )?, @name='A', @id, a* )>
is the same as
<!ELEMENT a ( a* )> <!ATTLIST a color ( red | green | blue ) #IMPLIED name CDATA #FIXED "A" id CDATA #REQUIRED >
It does not allow any datatyping, and I do not see how value defaulting fits in yet.
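What a validator would have to check for the path model of element a can be sketched in Python. This is a hand-written check under stated assumptions, not a general path model engine: it reads @name='A' as "present and equal to A" and @id as "present", and the names ALLOWED_COLORS and check_a are hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED_COLORS = {"red", "green", "blue"}


def check_a(element):
    """Return a list of problems found on one <a> element and its children."""
    problems = []
    color = element.get("color")
    # (@color='red' | @color='green' | @color='blue')? : optional, enumerated
    if color is not None and color not in ALLOWED_COLORS:
        problems.append("color must be red, green or blue")
    # @name='A' : assumed to mean the attribute is present with this value
    if element.get("name") != "A":
        problems.append("name must be 'A'")
    # @id : required
    if "id" not in element.attrib:
        problems.append("id is required")
    # a* : any number of child elements, all of type a, checked recursively
    for child in element:
        if child.tag != "a":
            problems.append("unexpected child " + child.tag)
        else:
            problems.extend(check_a(child))
    return problems


good = ET.fromstring('<a name="A" id="x1" color="red"><a name="A" id="x2"/></a>')
print(check_a(good))
```

As the note says, nothing here covers datatyping or value defaulting.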
Here is a quick and dirty representation of RDF. I use the extension ./. to mean the next child element, using the second dot in the sense of "one single token" as used in text processing languages. So (./.)* means ANY elements.
Furthermore, in order to emphasize that this approach is not only suitable for DTD syntax (and because the markup declaration "ELEMENT" is no longer meaningful) I here use a processing instruction called "validate". (These PIs use pseudo-attribute syntax, so it should be easy to see that element syntax is trivially possible.) I have used PIs here to emphasize that this form is very terse, and so suitable for end-user validation; a more verbose form using element syntax might be more appropriate at the data generation side.
<!-- This model specifies children and attributes -->
<?validate when="rdf:RDF" assert="(@id?, @type?, @about?, @aboutEach?, @aboutEachPrefix?, @bagID, ( rdf:Seq | rdf:Alt | rdf:Bag | rdf:Description | ./. )* )" ?>
<?validate when="rdf:Description" assert="(@id?, (@parseType='Resource' | @parseType='Literal' )?, @resource?, @bagID?, (./.)* )" ?>
<!-- These models specify parents and parent attributes -->
<?validate when="rdf:li" assert="( ../@id?, ( ../rdf:Alt | ../rdf:Seq | ../rdf:Bag ), (./.)* )" ?>
<?validate when="rdf:subject" assert="( ../../rdf:RDF, (@parseType='Resource' | @parseType='Literal' )?, @resource?, (./.)* )" ?>
<?validate when="rdf:object" assert="( ../../rdf:RDF, (@parseType='Resource' | @parseType='Literal' )?, @resource?, (./.)* )" ?>
<?validate when="rdf:predicate" assert="( ../../rdf:RDF, (@parseType='Resource' | @parseType='Literal' )?, @resource?, (./.)* )" ?>
<?validate when="rdf:value" assert="( ../../rdf:RDF, (@parseType='Resource' | @parseType='Literal' )?, @resource?, (./.)* )" ?>
<!-- We even have XPaths on the left side -->
<?validate when="../../rdf:RDF" assert="(@id?, (@parseType='Resource' | @parseType='Literal' )?, @resource?, @bagID?, (./.)* )" ?>
<?validate when="./[@parseType='Resource']" assert="( @rdf:value | rdf:value )" ?>
<?validate when="[@parseType='Literal']" assert="( #PCDATA )" ?>
The use of XPaths on the left side may seem strange; however, it allows some kinds of constraints to be expressed that cannot otherwise be expressed. For example, in RDF the element type rdf:li is only allowed to appear in the elements rdf:Bag, rdf:Alt and rdf:Seq. Content model schema languages have no way to express this restriction. Yet for some elements meant to be used as a group, it is arguable that this "bottom up" validation is more useful than "top-down" validation.
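The rdf:li restriction makes a convenient example of bottom-up checking. The Python sketch below is only an illustration: it hard-codes the one constraint instead of interpreting the validate PIs, the function name misplaced_items is hypothetical, and it assumes the containers are rdf:Bag, rdf:Alt and rdf:Seq in the standard RDF syntax namespace.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Element names in the {namespace}local form that ElementTree uses.
CONTAINERS = {f"{{{RDF}}}{name}" for name in ("Bag", "Alt", "Seq")}


def misplaced_items(root):
    """Return every rdf:li whose parent is not an RDF container.

    The check starts at each rdf:li and looks upwards, rather than
    starting at a container and looking down at its content.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    return [li for li in root.iter(f"{{{RDF}}}li")
            if parents.get(li) is None or parents[li].tag not in CONTAINERS]


doc = ET.fromstring(
    f'<rdf:RDF xmlns:rdf="{RDF}">'
    "<rdf:Bag><rdf:li>ok</rdf:li></rdf:Bag>"
    "<rdf:Description><rdf:li>stray</rdf:li></rdf:Description>"
    "</rdf:RDF>"
)
print([li.text for li in misplaced_items(doc)])
```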
The advantage of the path model and axis model approaches is that they make attributes into first-class objects, with the same kind of modeling capability as child elements. This is desirable because there is no theoretical justification for a split between elements and attributes: it is a matter of taste, convenience, and (as Lou Burnard has pointed out concerning the TEI material) embodies a theory about the document. Data can be restructured to fit many DTDs: in doing so we lose the ability to validate certain structures. This not only applies to children and attributes, but also to descendants and parents: what different data is being expressed in <a b.c="x" b.d="y"/> compared to <a><b><c>x</c><d>y</d></b></a> ?
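To make the closing question concrete, here is a small Python sketch that folds the nested form into the dotted-attribute form. The dotted naming convention is taken from the example above; the function name flatten is hypothetical, and the sketch assumes leaf elements hold only text and that no path repeats.

```python
import xml.etree.ElementTree as ET


def flatten(element):
    """Return a new element of the same type whose nested leaf elements
    are folded into dotted attribute names, e.g. b/c becomes b.c."""
    flat = ET.Element(element.tag, dict(element.attrib))

    def walk(node, prefix):
        for child in node:
            path = prefix + [child.tag]
            if len(child):           # has element children: descend
                walk(child, path)
            else:                    # leaf: its text becomes the value
                flat.set(".".join(path), child.text or "")

    walk(element, [])
    return flat


nested = ET.fromstring("<a><b><c>x</c><d>y</d></b></a>")
print(flatten(nested).attrib)
```

The same values survive the restructuring, but a content model on the children of b no longer has anything to validate, while a model on the attribute axis of a still does.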
However, the disadvantage is that we may disguise structures by using these kinds of models. For example, in the RDF example, it would be much clearer to treat rdf:Description as an architectural form, and also to create the architectural form that underlies rdf:Alt, rdf:Bag and rdf:Seq, and similarly the form that underlies rdf:subject, etc. However, the use of path models and axis models does not preclude the introduction of class-based systems.
Axis models and path models show that there are perhaps three kinds of schematic information present in documents:
The second category is contentious. It may be that for clean layering, we need schemas independently for each. The first kind of schema would, especially for tree structures, gain advantage from using path models or axis models. I am not clear yet where default values and graph-validation fit in: perhaps to allow graph-validation (along IDs and links) we would need to allow traversing. This seems to be a mix of data-typing and structure: once we know the data type we can traverse and then validate the structure.
Copyright (C) 1999 Rick Jelliffe. Please feel free to publish this in any way you like, but try to update it to the most recent version, and keep my name on it.