This version: February 5, 2002
Previous version: Feb 15, 2001
This document is a RDDL Resource Directory Description for the Hook 0.2 validation language: an XHTML document with special XLinks that locate various resources useful for Hook.
The Hook validation language is a thought experiment in minimalism in XML schema languages. The purpose of such a minimal language would be to provide useful but ultra-terse success/fail validation for basic incoming QA, especially of datagrams. It is like a checksum for a schema.
The validation it performs can be characterized as "Does this element have a feasible name, ancestry, previous siblings and contents?", there being some tradeoff in how fully the later criteria are tested.
Let us start with the following technical criteria:
A Hook schema is an element containing a list of element names, some of which may be grouped by square brackets. This list represents a certain ordering of the names and validation consists of checking conformity to this ordering.
The DTD for the language is
<!ELEMENT hook:order ( #PCDATA )>
<!ATTLIST hook:order
    xmlns:hook      CDATA            #FIXED "http://www.ascc.net/xml/hook"
    targetNamespace CDATA            #IMPLIED
    friendly        ( true | false ) "true"
    short           ( true | false ) "false"
    top             ( true | false ) "true" >
The order element has the following grammar, where s is one or more whitespace characters (or string-start or string-end) and NCName is an XML name with no colons.
s ( ( NCName "."? s ) | ( "[" s ( NCName ( "." | ";" )? s )+ "]" s ) )+
The order element specifies an ordering of elements; elements grouped by square brackets are at the same level in that order.
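The grammar above can be turned into a small parser. The following is only a sketch in Python, not part of Hook itself: the names Item and parse_order are made up for illustration, the name pattern is an ASCII-only approximation of NCName, and a trailing "." or ";" is always read as a flag (an assumption, since an NCName may itself end in a full stop). It is also lenient about whitespace next to the square brackets.

```python
import re
from typing import List, NamedTuple, Optional


class Item(NamedTuple):
    name: str
    empty: bool         # trailing "." : element may have no contents
    nonrecursive: bool  # trailing ";" : element cannot contain its own group


# ASCII-only approximation of an XML NCName (an assumption of this sketch).
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")


def parse_order(text: str) -> List[List[Item]]:
    """Return the schema as a list of levels; each level is a list of Items."""
    levels: List[List[Item]] = []
    group: Optional[List[Item]] = None  # the currently open [ ] group, if any
    for tok in re.findall(r"\[|\]|[^\s\[\]]+", text):
        if tok == "[":
            if group is not None:
                raise ValueError("groups cannot nest")
            group = []
        elif tok == "]":
            if not group:
                raise ValueError("unmatched ']' or empty group")
            levels.append(group)
            group = None
        else:
            flag = tok[-1] if tok[-1] in ".;" else ""
            name = tok[:-1] if flag else tok
            if not _NAME.match(name):
                raise ValueError(f"not a name: {tok!r}")
            if flag == ";" and group is None:
                raise ValueError("';' is only allowed inside a group")
            item = Item(name, flag == ".", flag == ";")
            if group is None:
                levels.append([item])  # an ungrouped name is a level of its own
            else:
                group.append(item)
    if group is not None:
        raise ValueError("unclosed '['")
    if not levels:
        raise ValueError("empty schema")
    return levels
```

Representing an ungrouped name as a one-item level keeps later code uniform: every position in the order is simply a list of items.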
Validation proceeds through the document in document (streaming) order. For each element, it checks that every previous-sibling element at the same level, and then each ancestor element, is ordered according to the list order (ignoring intermediate list items, but failing if any element has no corresponding item in the schema). A name may appear more than once. (Actually, an implementation only needs to look at the first child and/or next sibling to perform validation, but explaining it this way around may make the syntax easier to understand.)
A full stop (period) on an element indicates that the element must have no contents (no subelements, and only zero or more whitespace characters): this is almost the same as EMPTY or data content. A semi-colon indicates that the element is not recursive: the named element cannot immediately contain any elements from the same group. (It can still be followed by elements of the same group.) A semi-colon in a group at the end of a schema thus indicates that only simple content is possible.
Normally [ x y ] allows
but [ x y; ] allows
because the y does not contain x, but not
because the y does contain x. And [ x y. ] allows
So [ x y ] means
but [ x y; ] adds the constraint
So ";" is used to break out of the recursion allowed in a [ ] group.
Intuitively, this is like first making a big list of every element allowed, putting them all in a choice group. This gives us a complete definition of every allowed element: it defines the namespace and catches spelling errors. Next, if there are some elements that can start, move them out to the front (or copy them if they can reappear). Now the schema validates the top-level elements too. Next, if there are some elements that can only appear as the last elements in a content model (e.g. the z in (x, y, z) or the b and c in (a, (b | c)*)), then move these out to a group at the end. Now we have validation for elements in simple mixed content. Continue factoring until done.
So given the following schema:
<hook:order>A B. C</hook:order>
then the following documents are valid
<A/> <A><B/></A> <A><C/></A> <A><B/><C/></A> <A><C><C/></C></A> <A><A/></A>
but the following documents are invalid
<B/> <C><B/></C> <B><A/></B> <A><C/><B/></A> <A><C/><C/><B/></A> <A><B><B/></B></A>
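To make the rules concrete, here is a minimal sketch in Python of one way a validator could work, tried against the schema A B. C and documents like those listed above. It is an interpretation, not a reference implementation, and it rests on several assumptions: top="true" is read as "the root must match the first item"; only local names are compared, so targetNamespace is ignored; the friendly and short attributes are ignored; and the whole tree is checked recursively rather than in streaming fashion. The function names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET


def parse_order(text):
    """Parse an order string into levels of (name, flag) pairs (no error checks)."""
    levels, group = [], None
    for tok in re.findall(r"\[|\]|[^\s\[\]]+", text):
        if tok == "[":
            group = []
        elif tok == "]":
            levels.append(group)
            group = None
        else:
            flag = tok[-1] if tok[-1] in ".;" else ""
            item = (tok[:-1] if flag else tok, flag)
            if group is None:
                levels.append([item])
            else:
                group.append(item)
    return levels


def local_name(elem):
    return elem.tag.rsplit("}", 1)[-1]  # drop any {namespace} prefix


def validate(order, xml_text, top=True):
    levels = parse_order(order)
    root = ET.fromstring(xml_text)
    memo = {}

    def fits(elem, g):
        """Can elem be matched to some item at level g?"""
        key = (elem, g)
        if key not in memo:
            memo[key] = any(
                name == local_name(elem) and content_ok(elem, g, flag)
                for name, flag in levels[g]
            )
        return memo[key]

    def content_ok(elem, g, flag):
        children = list(elem)
        if flag == ".":  # no subelements, whitespace-only text
            return not children and not (elem.text or "").strip()
        # ";" forbids children from the element's own level; otherwise allow it.
        floor = g + 1 if flag == ";" else g
        for child in children:
            # Earliest level that fits; the next sibling may not go below it.
            floor = next(
                (h for h in range(floor, len(levels)) if fits(child, h)), None
            )
            if floor is None:
                return False
        return True

    if top:  # assumption: top="true" means the root is the first item
        return fits(root, 0)
    return any(fits(root, g) for g in range(len(levels)))


print(validate("A B. C", "<A><B/><C/></A>"))  # True
print(validate("A B. C", "<A><C/><B/></A>"))  # False: B follows C
```

Picking the earliest level that fits each child is enough here, because whether a child fits a level depends only on its own subtree, and an earlier choice never restricts the siblings that follow.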
It is quite possible that there are languages which exhibit orders that cannot be usefully captured. In those cases, a hook schema still can show the top element, all names in the namespace, and which elements must be empty.
The following example is a hook schema for XHTML Basic
<hook:order targetNamespace="http://www.w3.org/1999/xhtml" > html head [ title; meta. link. base. ] body [ a br. blockquote caption; div dl; h1; h2; h3; h4; h5; h6; img. ol; p; pre; table; ul; ] [ tr; dt; dd; li; ] td [ a br. blockquote div form img. ol; ul; li; ] [ input; label; select; textarea; ] [ option. ] [ abbr acronym address cite code dfn em kbd q samp span strong var object; ] param </hook:order>
This schema captures a lot of the containment relationships reasonably well, I think, though it probably contains some mistakes. But it will not detect what may be a common XHTML problem, where omit-end-tag HTML elements like <body> are converted to <body />. However, it will detect problems like <meta> not being converted to an empty tag and so spuriously including other head elements.
The next example is RSS.
<hook:order targetNamespace="http://purl.org/rss/1.0/" > channel title link image items item title link url description textinput. </hook:order>
A Hook schema for the well-known Purchase Order example would be:
<hook:order targetNamespace="..." > PurchaseOrder [comment; ShipTo; ] Name Street City State Zip ShipDate [ comment; Items; ] Item productName quantity price comment </hook:order>
This is a much more successful example! Note that every valid PO document will also be valid against this schema, and that the schema validates all sequence requirements. What it won't catch is an end-tag in the wrong place with respect to what should be a sibling. So it seems that Hook may be good for validating datagrams of this kind.
Following is a schema for Schematron 1.5
<hook:order targetNamespace="http://www.ascc.net/xml/schematron" > schema ns title p phase active p pattern rule [ assert; report; key. ] diagnostics diagnostic [ name. dir; emph; value-of. ] </hook:order>
Again, this is pretty good: there is a good amount of order to capture. The "diagnostics diagnostic" could also come before or after rule.
In all four cases above the character count is less than 400 characters, so it looks like they would be retrieved in the first packet group from a server.
Hook seems to suit languages that have large flat bottoms, languages with specific requirements early on in each content model, languages with specific elements that do not re-occur in different contexts with different priorities, languages with attributes that are not vital or will be checked by other mechanisms.
Hook would seem useful as a coarse-grained but ultra-terse validation language.
If we say that validation is to catch the errors that are most likely to happen, and the most likely errors are spelling errors, children in the wrong order, and missing required parents, then Hook catches most of them.
How much would this help an interactive editor? It would know which elements can start, but for new documents it would present too many choices. However, when editing existing documents it would cull the available list pretty well, because it would know what the current level was. It would also know which elements are empty.
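As a rough illustration of that culling, the sketch below (Python, with invented names) takes an already-parsed schema as a list of levels of (name, flag) pairs, plus the level and flag at which the current element was matched, and lists the names an editor might still offer. It only gives an upper bound on what is allowed; the hand-written LEVELS value corresponds to the schema A B. C.

```python
# Hand-written levels for the schema "A B. C": one (name, flag) pair per level.
LEVELS = [[("A", "")], [("B", ".")], [("C", "")]]


def menu(levels, level, flag=""):
    """Names an editor could offer for an element matched at `level`."""

    def names(start):
        # Every distinct name from `start` onwards, in schema order.
        seen = []
        for group in levels[start:]:
            for name, _ in group:
                if name not in seen:
                    seen.append(name)
        return seen

    if flag == ".":
        children = []  # an empty element takes no children
    else:
        # ";" excludes the element's own level from its children.
        children = names(level + 1 if flag == ";" else level)
    # A following sibling may stay at the same level or move later.
    return {"children": children, "following": names(level)}


print(menu(LEVELS, 1, "."))  # {'children': [], 'following': ['B', 'C']}
```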
It would be nice to signal order by < but too much markup would be required.
Joe English has posted interesting material regarding formalisms for Hook, an algorithm for implementing it, and other material. See the XMLHACK.COM item.
The name Hook comes from the supposed hook shape drawn on a parse tree by tracing the previous siblings and then going up through the ancestors.
The well-known URI for Hook is http://www.ascc.net/xml/hook, which is the namespace of the root element of a Hook program.
Copyright 2001 (C) Rick Jelliffe
There is no Hook 0.2 software from me; however, if you make some, please consider making that software available under the conditions of the zlib/libpng license (the least restrictive). Comments, fixes and upgrades welcome: email email@example.com
Thanks to Joe English and John Cowan for pointing out various issues.