Resource Directory (RDDL) for Hook 0.2

A One-Element Language for Validation of XML Documents based on Partial Order

This version: February 5, 2002
Previous version: February 15, 2001

Editor:

This document is a RDDL Resource Directory Description for the Hook 0.2 validation language. It is an XHTML document with special XLinks that locate various resources useful for Hook.

The Hook validation language is a thought experiment in minimalism in XML schema languages. The purpose of such a minimal language would be to provide useful but ultra-terse success/fail validation for basic incoming QA, especially of datagrams. It is like a checksum for a schema.

The validation it performs can be characterized as "Does this element have a feasible name, ancestry, previous-siblings and contents?", with some tradeoff in how fully the later criteria are tested.

Let us start with the following technical criteria:

The Language

A Hook schema is an element containing a list of element names, some of which may be grouped by square brackets. This list represents a certain ordering of the names and validation consists of checking conformity to this ordering.

The DTD for the language is

  <!ELEMENT hook:order (#PCDATA)>
  <!ATTLIST hook:order
        xmlns:hook       CDATA            #FIXED "http://www.ascc.net/xml/hook"
        targetNamespace  CDATA            #IMPLIED
        friendly         ( true | false ) "true"
        short            ( true | false ) "false"
        top              ( true | false ) "true"
  >

The content of the order element has the following grammar, where s is one or more whitespace characters (or string-start or string-end) and NCName is an XML name with no colons.

  s ( ( NCName "."? s ) |
      ( "[" s ( NCName ( "." | ";" )? s )+ "]" s )
    )+
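
The grammar above can be read by a few lines of code. The following Python is only a sketch under stated assumptions, not part of Hook: the name parse_order, the list-of-pairs result, and the ASCII-only approximation of NCName are all inventions for illustration. It is also more lenient than the grammar about whitespace next to brackets, since some of the schemas in this document write "[comment;" and "key.]".

```python
import re

# ASCII-only approximation of an XML NCName (an assumption of this sketch;
# real NCNames also allow many non-ASCII characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def parse_order(text):
    """Parse the text of a hook:order element into a list of items.

    Each item is a list of (name, flag) pairs, where flag is "", "." or ";".
    An ungrouped name becomes an item holding a single pair.
    """
    # Pad brackets with spaces so "[comment;" and "key.]" still tokenize.
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    items, group = [], None
    for token in tokens:
        if token == "[":
            if group is not None:
                raise ValueError("groups cannot be nested")
            group = []
        elif token == "]":
            if not group:
                raise ValueError("']' without a non-empty open group")
            items.append(group)
            group = None
        else:
            # A trailing "." or ";" is always taken as a flag, as in the grammar.
            flag = token[-1] if token[-1] in ".;" else ""
            name = token[:-1] if flag else token
            if not NCNAME.fullmatch(name):
                raise ValueError(f"not an NCName: {name!r}")
            if flag == ";" and group is None:
                raise ValueError("';' is only allowed inside a group")
            if group is None:
                items.append([(name, flag)])
            else:
                group.append((name, flag))
    if group is not None:
        raise ValueError("unclosed '['")
    if not items:
        raise ValueError("the order must name at least one element")
    return items
```

For example, parse_order("A B. C") gives three single-name items, the second carrying the "." flag.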

The order element specifies an ordering of elements; elements grouped by square brackets are at the same level in the order.

For each element in the document, proceeding in document (streaming) order, validation checks that every previous-sibling element at the same level, and then each ancestor element, is ordered according to the list. Intermediate list items are ignored, but validation fails if any element has no corresponding item in the schema. A name may appear more than once. (Actually, an implementation only needs to look at the first child and/or next sibling to perform validation, but explaining it this way around may make the syntax easier to understand.)
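
The ordering check described above can be sketched directly in Python. This is one possible reading, not a definitive implementation: the function names hook_paths, follows_order and is_valid are hypothetical, the order is given as a flat list of names, and the sketch deliberately ignores groups, the "." and ";" markers, the schema attributes and namespaces. Each name is matched greedily against the earliest list item that still fits, which is a simplification when a name appears more than once.

```python
import xml.etree.ElementTree as ET

def hook_paths(root):
    """Yield, for each element in document order, the names of its
    ancestors (outermost first), then its previous siblings, then itself."""
    def walk(elem, ancestors):
        previous = []
        for child in elem:
            yield ancestors + [elem.tag] + previous + [child.tag]
            yield from walk(child, ancestors + [elem.tag])
            previous.append(child.tag)
    yield [root.tag]
    yield from walk(root, [])

def follows_order(path, order):
    """True if the names in path occur in list order, skipping intermediate
    items. A name may match the same list item as the name before it."""
    index = 0
    for name in path:
        try:
            index = order.index(name, index)  # earliest item at or after index
        except ValueError:
            return False  # unknown name, or name only occurs earlier in the list
    return True

def is_valid(xml_text, order):
    root = ET.fromstring(xml_text)
    return all(follows_order(path, order) for path in hook_paths(root))
```

With the order ["A", "B", "C"], the document <A><B/><C/></A> passes, while <A><C/><B/></A> fails because B follows C.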

A fullstop (period) on an element indicates that the element may have no contents (no subelements and only zero-or-more whitespace characters): this is almost the same as EMPTY or data content. A semi-colon indicates that the element is not recursive: the named element cannot immediately contain any elements from the same group. (It can still be followed by elements of the same group.) A semi-colon in a group at the end of a schema thus indicates that only simple content is possible.

Normally [ x y ] allows

   <x><y/><x/><y/></x>
   <y><x/><y/><x/></y>

but [ x y; ] allows

   <x><y/><x/><y/></x>

because the y does not contain x, but not

   <y><x/><y/><x/></y>

because the y does contain x. And [ x y. ] allows

   <x><y/><x/><y/></x>

but not

   <y><x/></y>

So [ x y ] means

but [ x y; ] adds the constraint

So ";" is used to break out of the recursion allowed in a [ ] group.

Intuitively, this is like first making a big list of every element allowed and putting them all in one choice group. This gives us a complete definition of every allowed element: it defines the namespace and catches spelling errors. Next, if there are elements that can start the document, move them out to the front (or copy them if they can reappear). Now the schema validates the top-level elements too. Next, if there are some elements that can only appear as the last elements in a content model (e.g. the z in (x, y, z) or the b and c in (a, (b | c)*)), then move these out to a group at the end. Now we have validation for elements in simple mixed content. Continue factoring until done.

So given the following schema:

   <hook:order>A B. C</hook:order>

then the following documents are valid


  <A/>

  <A><B/></A>

  <A><C/></A>

  <A><B/><C/></A>

  <A><C><C/></C></A>

  <A><A/></A>

But not

  <B/>

  <C><B/></C>

  <B><A/></B>

  <A><C/><B/></A>

  <A><C/><C/><B/></A>

  <A><B><B/></B></A>
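
The examples above can be checked mechanically. What follows is a minimal streaming sketch in Python of one reading of the rules, looking only at first children and next siblings; it is not a reference implementation. The names positions, check and is_valid are hypothetical, and the schema is written out by hand as a list of items, each a list of (name, flag) pairs. Two assumptions are made that the text does not spell out: the top element must match the first item of the list, and a repeated name is matched greedily against its earliest usable occurrence. The schema attributes and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

# "A B. C" written out by hand: one (name, flag) pair per ungrouped item.
SCHEMA = [[("A", "")], [("B", ".")], [("C", "")]]

def positions(schema):
    """Map each name to its (item index, flag) occurrences, in schema order."""
    table = {}
    for index, item in enumerate(schema):
        for name, flag in item:
            table.setdefault(name, []).append((index, flag))
    return table

def check(elem, bound, table):
    """Return the list position matched by elem, or None if it is invalid.

    bound is the position of the parent (for a first child) or of the
    previous sibling; elem may not match an earlier item than that.
    """
    for index, flag in table.get(elem.tag, []):
        if index >= bound:
            break  # greedy: take the earliest occurrence that fits
    else:
        return None  # unknown name, or it only occurs too early in the list
    if flag == ".":
        # "." means no subelements and nothing but whitespace inside.
        if len(elem) or (elem.text or "").strip():
            return None
        return index
    # ";" forbids children from the element's own group, so start one later.
    child_bound = index + 1 if flag == ";" else index
    for child in elem:
        child_bound = check(child, child_bound, table)
        if child_bound is None:
            return None
    return index

def is_valid(xml_text, schema=SCHEMA):
    root = ET.fromstring(xml_text)
    # Assumption: the top element must match the first item of the list.
    return check(root, 0, positions(schema)) == 0
```

Under these assumptions the sketch accepts the six valid documents and rejects the six invalid ones listed above. Given the schema [[("x", ""), ("y", ";")]], it also reproduces the earlier [ x y; ] example.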

It is quite possible that there are languages which exhibit orders that cannot be usefully captured. In those cases, a hook schema still can show the top element, all names in the namespace, and which elements must be empty.

Example

The following example is a Hook schema for XHTML Basic:

 <hook:order targetNamespace="http://www.w3.org/1999/xhtml" >
  html head  [ title; meta. link. base. ]   body
  [ a br. blockquote caption; div  dl; h1; h2; h3; h4; h5; h6;  
	img. ol; p; pre; table; ul; ]  
  [ tr;  dt; dd; li; ]  td 
  [ a br. blockquote div  form img. ol; ul; li; ]  
  [ input; label; select; textarea; ]  [ option. ]
  [ abbr acronym address cite code dfn em kbd q samp span strong var object; ] 
  param 
 </hook:order>

I think this schema captures a lot of the containment relationships, though it probably contains some mistakes. It will not detect what may be a common XHTML problem, where omit-end-tag HTML elements like <body> are converted to <body />. However, it will detect problems like <meta> not being converted to an empty tag and so spuriously including other head elements.

The next example is RSS.

 <hook:order  targetNamespace="http://purl.org/rss/1.0/" >
  channel   title link image items item
  title link url description textinput.
 </hook:order>

A Hook schema for the well-known Purchase Order example would be:

 <hook:order  targetNamespace="..." >
  PurchaseOrder  [comment; ShipTo; ]  Name Street City State Zip  
  ShipDate [ comment; Items; ] Item productName quantity price comment
 </hook:order>             

This is a much more successful example! Note that every valid PO document will also be valid against this schema, and that the schema validates all sequence requirements. What it won't catch is an end-tag in the wrong place with respect to what should be a sibling. So it seems that Hook may be good for validating datagrams of this kind.

Following is a schema for Schematron 1.5:

 <hook:order  targetNamespace="http://www.ascc.net/xml/schematron" >
  schema  ns title p phase active p pattern rule   [ assert; report; key.] 
  diagnostics  diagnostic [ name. dir; emph; value-of. ]
 </hook:order>

Again, this is pretty good: there is a good amount of order to capture. The "diagnostics diagnostic" could also come before or after rule.

In all four cases above the schema is under 400 characters, so it looks like each would be retrieved in the first packet group from a server.

Comments

Hook seems to suit languages that have large flat bottoms, languages with specific requirements early on in each content model, languages with specific elements that do not re-occur in different contexts with different priorities, languages with attributes that are not vital or will be checked by other mechanisms.

Hook would seem useful as a coarse-grained but ultra-terse validation language.

If we say that validation is to catch the errors that are most likely to happen, and the most likely errors are spelling errors, children in the wrong order, and missing required parents, then Hook catches most of them.

How much would this help an interactive editor? It would know which elements can start, but for new documents it would present too many choices. However, when editing existing documents it would cull the available list pretty well, because it would know what the current level was. It would also know which elements are empty.

It would be nice to signal order by < but too much markup would be required.


Formalization

Joe English has posted interesting material regarding formalisms for Hook, an algorithm for implementing it, and other material. See the XMLHACK.COM item.


Why Hook?

The name Hook comes from the supposed hook shape made when drawing this on a parse tree: tracing back along the previous siblings, then up the ancestors.


Related Resources for Hook 0.2

Well known URI

The well known URI for Hook is http://www.ascc.net/xml/hook.

Root namespace URI

http://www.ascc.net/xml/hook is the namespace of the root element of a Hook schema.


Copyright 2001 (C) Rick Jelliffe

There is no Hook 0.2 software from me. However, if you make some, please consider making that software available under the conditions of the zlib/libpng license (the least restrictive). Comments, fixes and upgrades welcome: email ricko@topologi.com

Acknowledgements

Thanks to Joe English and John Cowan for pointing out various issues.