Academia Sinica Computing Centre
Regular-expression content-models can be processed in various ways to obtain constraints which are easier to implement. Weak validation has some attractive properties that may coincide with the nature of many XML documents.
Some content models are difficult to validate: in particular, content models containing many "&" operators may be subject to combinatorial explosions if validated using a conventional automaton.
Conversely, validators implemented using XSL, as proposed in my note "Using XSL as a Validation Language", can readily validate content models containing many "&" operators. However, they may suffer a combinatorial explosion from content models that use the various repeating and optional operators.
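To make the contrast concrete, here is a minimal Python sketch. It assumes a naive automaton-style treatment that expands an "&" group into every ordering it permits, and compares that with an order-free check of the kind a rule-based validator could make. The function names `expand_all_group` and `check_all_group` are hypothetical, chosen only for this illustration.

```python
from itertools import permutations
from math import factorial


def expand_all_group(names):
    """Expand (n1 & n2 & ...) into every ordered sequence it permits.

    This is the naive expansion: n distinct names give n! sequences.
    """
    return [list(p) for p in permutations(names)]


def check_all_group(children, names):
    """Order-free check: the children are exactly the names, in any order."""
    return sorted(children) == sorted(names)


# Four names already expand to 24 orderings; ten would give 3628800.
orderings = expand_all_group(["a", "b", "c", "d"])
print(len(orderings), factorial(10))

# The order-free check needs no expansion at all.
print(check_all_group(["c", "a", "d", "b"], ["a", "b", "c", "d"]))
```

The expansion grows factorially with the number of names in the group, while the order-free check only sorts and compares two short lists.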
An alternative approach has been mooted in the XML Schema draft, which is to allow "open" and "closed" schemas.
This is an interesting idea; it means that we can interpret content model schemas in different but useful ways.
One way that I think may be promising is to allow forms of weak validation.
One form of weaker constraint that can be extracted from a content model is a list of all element types that are always required. For example, taking the following model:
<!ELEMENT eg ( a, b, a?, ( c | ( f, b, c)), (d | e)* )>
a weak content model would be (if #ANY means a single element):
<!ELEMENT eg ( a , b, #ANY*, c, #ANY* )>
or (if #ANYSEQ means any string of elements in any following positions)
<!ELEMENT eg ( a , b, (#ANYSEQ* & c ) )>
where that content model would be interpreted as "very open" (that is the function of the #ANY or #ANYSEQ tokens). The leftmost consecutive required element types are specified using "," but after the first optionality or grouping indicator, the "&" connector is used. As mentioned, this kind of content model is trivial for XSL validation.
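The extraction of always-required element types can be sketched in Python as follows. This is only an illustration under stated assumptions: the content model is taken to be already parsed into a small tree, and the class names (`Name`, `Seq`, `Choice`, `Rep`) and the function `required` are hypothetical, not part of any XML tool.

```python
from dataclasses import dataclass


@dataclass
class Name:
    name: str          # a single element type, e.g. "a"


@dataclass
class Seq:
    items: list        # (x, y, ...)


@dataclass
class Choice:
    items: list        # (x | y | ...)


@dataclass
class Rep:
    item: object
    op: str            # one of "?", "*", "+"


def required(model):
    """Return the element types present in every instance of the model."""
    if isinstance(model, Name):
        return {model.name}
    if isinstance(model, Seq):
        # Everything required by any member of a sequence is required.
        out = set()
        for item in model.items:
            out |= required(item)
        return out
    if isinstance(model, Choice):
        # Only what every branch requires is required by the choice.
        return set.intersection(*(required(item) for item in model.items))
    if isinstance(model, Rep):
        # "?" and "*" allow zero occurrences, so they require nothing.
        return required(model.item) if model.op == "+" else set()
    raise TypeError(f"unknown content model node: {model!r}")


# ( a, b, a?, ( c | ( f, b, c)), (d | e)* )
eg = Seq([
    Name("a"), Name("b"), Rep(Name("a"), "?"),
    Choice([Name("c"), Seq([Name("f"), Name("b"), Name("c")])]),
    Rep(Choice([Name("d"), Name("e")]), "*"),
])
print(sorted(required(eg)))  # ['a', 'b', 'c']
```

For the eg model, the result is a, b and c, which are the element types that remain in the weak content models shown earlier.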
Why would this be useful?
It would be useful if XML had a kind of weak validator that handled very-open content models with only "," or "&" connectors: such a validator checks "always-required" element types only. It could use information from XML markup declarations and XML Schema, yet be trivial to implement.
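A minimal sketch of such a weak validator, in Python, might look like the following. It assumes the very-open model has already been reduced to two lists: the leading element types joined by "," and the remaining required types joined by "&". The function name `weak_validate` and its parameters are hypothetical.

```python
def weak_validate(children, prefix, anywhere):
    """Weakly validate the child element types of one element.

    children: child element type names, in document order.
    prefix:   types that must appear first, in this order (the "," part).
    anywhere: types that must each occur at least once after the prefix,
              in any position (the "&" part); anything else is allowed.
    """
    if children[:len(prefix)] != list(prefix):
        return False
    rest = set(children[len(prefix):])
    return all(name in rest for name in anywhere)


# Very-open model for eg: a, b first, then c somewhere among anything.
print(weak_validate(["a", "b", "c"], ["a", "b"], ["c"]))            # True
print(weak_validate(["a", "b", "a", "f", "b", "c", "d"], ["a", "b"], ["c"]))  # True
print(weak_validate(["a", "b", "d"], ["a", "b"], ["c"]))            # False
```

Note that a document such as a, b, x, c also passes, even though x is not declared in the full model: that is the price of a very-open interpretation.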
The weak content models above raise an interesting issue: are the following three content models different?
( (a)+ & b )
( (a+) & b )
( a & a* & b )
Also worth considering is the issue: how can validation schemas assist the authors of in-progress documents? It seems that the distinction between valid and invalid is too extreme to be useful during authoring. I note that the FrameMaker+SGML structured editor presents the document creator with several interesting choices in this regard:
So we can identify several different strengths of validity:
The fourth type, potential validity, is where no impossible elements are present. It can be checked by weak validation, where a model like
(a, b?, (c | d), (b)*)
is replaced by the unambiguous
(a?, (b | c | d)* )
or the ambiguous
(a?, b?, ( c| d)?, b* )
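As a rough illustration, the unambiguous weak model (a?, (b | c | d)* ) can be checked with an ordinary regular expression over the sequence of child element type names. The Python sketch below assumes that element type names contain no spaces, and the names `POTENTIAL` and `potentially_valid` are hypothetical.

```python
import re

# (a?, (b | c | d)* ) over a string of names, each followed by one space.
POTENTIAL = re.compile(r"(a )?((b|c|d) )*")


def potentially_valid(children):
    """True if no child is impossible under the weak model."""
    text = "".join(name + " " for name in children)
    return POTENTIAL.fullmatch(text) is not None


print(potentially_valid(["a", "c"]))   # True: also valid under the full model
print(potentially_valid(["c", "b"]))   # True: the required a can still be added
print(potentially_valid(["a", "e"]))   # False: e can never appear
print(potentially_valid(["b", "a"]))   # False: a cannot follow b
```

An in-progress document that passes this check may still be incomplete under the original model; the check only reports elements that could never be made right by adding more content.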
Copyright (C) 1999 Rick Jelliffe. Please feel free to publish this in any way you like, but try to update it to the most recent version, and keep my name on it.