The Schematron

An XML Structure Validation Language using Patterns in Trees

Version: 2000-04-23

DTD and Schema

This is the basic structure of an instance:

Here is the minimum DTD for The Schematron. What could be simpler!
(The most current DTD and any DTDs in development are given at the end of this page.)

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.0a//EN -->
<!ELEMENT schema  ( title?, pattern+ )>
<!ELEMENT assert  ( #PCDATA )>
<!ELEMENT pattern ( rule+ )>
<!ELEMENT report  ( #PCDATA)>
<!ELEMENT rule    ( assert | report  )+>
<!ELEMENT title   ( #PCDATA )>
<!ATTLIST schema  ns      CDATA #IMPLIED >
<!ATTLIST assert  test    CDATA #REQUIRED >
<!ATTLIST pattern name    CDATA #REQUIRED
                  see     CDATA #IMPLIED > 
<!ATTLIST report  test    CDATA #REQUIRED >
<!ATTLIST rule    context CDATA #REQUIRED >

Here is the equivalent Schematron schema for that DTD (in the following, the content models are open, which means that any other elements can be used too):

<schema>
    <title>Demonstration Patterns for the Schematron Itself</title>
    <pattern name="The Open Schematron DTD 1.0">
        <rule context="schema">
            <assert test="pattern">A schema element should contain at least one pattern element.</assert>
        </rule>
        <rule context="pattern">
            <assert test="rule">A pattern element should contain at least one rule element.</assert>
            <assert test="@name">A pattern element should have an attribute called name.</assert>
         </rule>
        <rule context="rule">
            <assert test="assert | report ">A rule element should contain at least one assert or report element.</assert>
            <assert test="@context">A rule element should have an attribute called context.
            This should be an XPath for selecting nodes to make assertions and reports about.</assert>
        </rule>
        <rule context="assert">
            <assert test="@test">An assert element should have an attribute called test.
            This should be an XSLT expression.</assert>
        </rule>
        <rule context="report">
            <assert test="@test">A report element should have an attribute called test.
            This should be an XSLT expression.</assert>
        </rule>
    </pattern>
</schema>
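
To see what these assertions catch, consider the deliberately faulty schema document below. It is only an illustrative sketch: the title text, the path `title`, and the `note` element are hypothetical. Checked against the open schema above, the pattern element should fail the assertion on @name and the rule element should fail the assertion on @context. The `note` element is not reported, because the content models are open.

```xml
<schema>
    <title>A deliberately faulty schema (hypothetical example)</title>
    <!-- No name attribute: fails the assertion test="@name" for pattern -->
    <pattern>
        <!-- No context attribute: fails the assertion test="@context" for rule -->
        <rule>
            <assert test="title">A chapter should have a title.</assert>
            <!-- Not declared in the DTD: the open schema lets it pass -->
            <note>Not part of The Schematron.</note>
        </rule>
    </pattern>
</schema>
```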

Let us enhance the schema so that the content models are closed, which means that no other elements can be used:

<schema>
    <pattern name="The Closed Schematron DTD 1.0a">
        <rule context="schema">
            <assert test="count(*) = count(pattern  | title)">Unexpected element(s) found: a schema element should contain only title and pattern elements.</assert>
            <assert test="pattern">A schema element should contain at least one pattern element.</assert>	
            <report test="phase">The element phase is only used in the 1.2 DTD</report>
        </rule>
        <rule context="pattern">
            <assert test="count(*) = count(rule)">Unexpected element(s) found: A pattern element should contain only rule elements.</assert>
            <assert test="rule">A pattern element should contain at least one rule element.</assert>
            <assert test="@name">A pattern element should have an attribute called name.</assert>
         </rule>
        <rule context="rule">
            <assert test="count(*) = count(assert | report ) ">Unexpected element(s) found: a rule element should contain only assert and report elements.</assert>
            <assert test="assert | report ">A rule element should contain at least one assert or report element.</assert>
            <assert test="@context">A rule element should have an attribute called context.
            This should be an XPath for selecting nodes to make assertions and reports about.</assert>
            <report test="key">The element key is only used in the 1.2 DTD</report>
        </rule>
        <rule context="assert">
            <assert test="@test">An assert element should have an attribute called test.
            This should be an XSLT expression.</assert>
            <report test="name">The element name is only used in the 1.1 DTD</report>
        </rule>
        <rule context="report">
            <assert test="@test">A report element should have an attribute called test.
            This should be an XSLT expression.</assert>
            <report test="name">The element name is only used in the 1.1 DTD</report>
        </rule>
    </pattern>
</schema>
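
As a small illustration of the difference, the hypothetical document below satisfies an open reading of the content models but not the closed one. The `note` element and the paths are invented for the example. The schema element has three child elements, so count(*) is 3, while count(pattern | title) is 2, and the first assertion in the schema rule fails.

```xml
<schema>
    <title>Closed content model check (hypothetical example)</title>
    <!-- note is not declared: count(*) = 3 but count(pattern | title) = 2 -->
    <note>An element that the DTD does not allow here.</note>
    <pattern name="Chapters">
        <rule context="chapter">
            <assert test="title">A chapter should have a title.</assert>
        </rule>
    </pattern>
</schema>
```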

There are many kinds of extra assertions that we might like to make in addition: for example, it is important to test that attribute and element values are not empty. The following schema contains various kinds of assertions that are not possible with DTDs. It could be combined with the previous schema, or simply used in conjunction with a standard DTD (or another content-model-based schema language).

<schema>
    <pattern name="Schematron Roles are in Namespaces">
        <!-- This is a good example of something that cannot be done with DTDs -->
	<rule context="assert[@role]">
		<assert test="/schema[@ns]">A role attribute should not be used if there is no
		namespace being defined</assert>
	</rule>
	<rule context="report[@role]">
		<assert test="/schema[@ns]">A role attribute should not be used if there is no
		namespace being defined</assert>
	</rule>
    </pattern>
    <pattern name="Schematron Extras">
         <rule context="pattern">
            <assert test="parent::schema">A pattern element should be within a schema element.</assert>
            <assert test="string-length(@name) &gt; 0">A pattern element's name attribute should contain some text.</assert>
            <report test="count(rule) &gt; 4000">Warning: This implementation of The Schematron 
		only allows 4000 rules per pattern. Split the current pattern into two.</report>
        </rule>
        <rule context="rule">
            <assert test="parent::pattern">A rule element should be within a pattern element.</assert>
            <assert test="string-length(@context) &gt; 0">A rule element's context attribute should contain an XPath.</assert>
        </rule>
        <rule context="assert">
            <assert test="parent::rule">An assert element should be within a rule element.</assert>
            <assert test="string-length(@test) &gt; 0">An assert element's test attribute should contain an XPath expression.</assert>
            <assert test="string-length(text()) &gt; 0">An assert element should contain some explanatory text.</assert>
        </rule>
        <rule context="report">
            <assert test="parent::rule">A report element should be within a rule element.</assert>
            <assert test="string-length(@test) &gt; 0">A report element's test attribute should contain an XPath expression.</assert>
            <assert test="string-length(text()) &gt; 0">A report element should contain some explanatory text.</assert>
        </rule>
    </pattern>
    <pattern name="Miscellaneous examples of useful things">
        <rule context="schema">
            <report test="ancestor::schema">A schema cannot contain a schema</report>
        </rule>
        <rule context='rule[@context="html"]'>
            <report test="*">This schema seems to be for HTML.</report>
        </rule>
    </pattern>
</schema>

Note that this schema tests not only that the elements contain the correct content, but also that the elements have the correct parent. Also, this schema can make string-based checks on text in attributes and elements. The Schematron DTD allows very specific error messages that can be tailored to the exact schema.
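
For instance, the hypothetical document below is valid against the 1.0a DTD, yet two of the string-based assertions in the "Schematron Extras" pattern should fail on it. The path `chapter` and the test `title` are invented for the example.

```xml
<schema>
    <!-- name is present, as the DTD requires, but it is empty:
         the assertion string-length(@name) > 0 fails -->
    <pattern name="">
        <rule context="chapter">
            <!-- No explanatory text:
                 the assertion string-length(text()) > 0 fails -->
            <assert test="title"></assert>
        </rule>
    </pattern>
</schema>
```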


Heuristics for Schematron from DTDs

I am preparing a note on XML Schemas versus DTDs and Schematron.

Here is a post I sent to XML-DEV about a method of deriving a Schematron schema from a content model. The method fails only on some kinds of groupings, and I don't think these are an important or common class in practice, so it looks like Schematron is effectively more powerful than content models. Formally it may not be a complete superset. On any pragmatic measure, however, the things that Schematron does in addition to a grammar are far more useful and important than the thing in which it is perhaps less powerful, which relates only to elements or groups repeated at various places within a content model with different occurrence constraints.

On the issue of XML Schemas being compiled into Schematron, I am
interested in knowing if anyone can come up with content models that
could not be validated by a Schematron schema automatically generated
from an XML Schema so that, for every unique particle in an element
complex type:
        1) an assertion statement is created for all the allowed successors of
that particle within that parent
        2) an assertion statement is created for all the allowed predecessors
of that particle within that parent
        3) an assertion statement is created giving the effective minOccurs of
that element within the whole type
        4) an assertion statement is created giving the effective maxOccurs of
that element within the whole type.

These simply derivable rules seem to validate most of the constraints in
most content models. But I can see a family of cases where these don't
capture everything: that is the case of particles repeated in different
contexts.  For example, a content model like
        ( a{3-4}, b, a{5-6} ) 
would by the above translation rules have the assertions
 <rule context="x/a">
  <assert 
   test="following-sibling::b or following-sibling::a or
position()=count(parent::*/*)"
  >Allowed successors...</assert>
  <assert 
   test="preceding-sibling::b or preceding-sibling::a or position()=1"
  >Allowed predecessors...</assert>
 </rule>
 <rule context="x/b">
  <assert 
   test="following-sibling::a"
  >Allowed successors...</assert>
  <assert 
   test="preceding-sibling::a"
  >Allowed predecessors...</assert>
 </rule>
 <rule context="x">
  <assert test="count(a) &gt; 7 and count(a) &lt; 11"
  >(max and min on a)</assert>
  <assert test="count(b) = 1"
  >(max and min on b)</assert>
 </rule>
but that corresponds to a slightly weaker content model:
   ( a,  (( b, a{7-10}) 
     | ( a,  (( b, a{6-9}) 
      | ( a,  (( b, a{5-8}) 
       | ( a,  (( b, a{4-7}) 
        |( a,  (( b, a{3-6}) 
          |( a,  (( b, a{2-5}) 
            |( a,   b, a{1-4}) 
      ))))))))))))
i.e. 
        ( a{1-7}, b, a{1-9} ) where a>7 and a<11

If we can find any convenient way to represent these kinds of grouping
constraints (and other similar ones) then it is possible that the
approach based on assertions on two-step path models is more powerful
than grammars (for modeling constraints, which is only one of the things
that a schema language can be for: a schema language can also allow
naming of structures present according to some analytical paradigm such
as "type" or "pattern"). (Of course, if allowed an infinite number of
subcontexts within an assert, that would give us a better purchase (in
the mountaineering sense) but I am trying to resist that if possible.)
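
The weakness described in the post can be seen with a small instance. The sketch below is hypothetical and reads the position() tests as "is the first child" and "is the last child", which appears to be their intent. It has one a before the b and seven after it. The derived assertions are all satisfied: there are eight a elements and one b, the b has an a on each side, and every a has an allowed neighbour or sits at an end. The original content model ( a{3-4}, b, a{5-6} ) would nevertheless reject it, because only one a precedes the b.

```xml
<!-- Hypothetical instance: one a, then b, then seven a elements -->
<x>
    <a/>
    <b/>
    <a/><a/><a/><a/><a/><a/><a/>
</x>
```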
    

One immediate extension is to have both total max and min occurs values and local max and min occurs values for each element. That makes (a{3-5},b,a{5-7}) into (a{3-7}, b, a{3-7}) where 8 <= count(a) <= 12.

Note that there is another version of the heuristic, which would involve first making the lists as above, then repeating them with successor and predecessor lists of length 2..n each, where n is the maximum number of unique particles in the content model. This would catch more of the constraints of long content models made with many permutations of some small vocabulary of elements.


1.0a DTD (Initial)

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.0a//EN -->
<!ELEMENT schema  ( title?, pattern+ )>
<!ELEMENT assert  ( #PCDATA )> 
<!ELEMENT pattern ( rule+ )> 
<!ELEMENT report  ( #PCDATA )>
<!ELEMENT rule    ( assert | report  )+>
<!ELEMENT title   ( #PCDATA )>
<!ATTLIST schema  ns      CDATA #IMPLIED  >
<!ATTLIST assert  test    CDATA #REQUIRED >  
<!ATTLIST pattern name    CDATA #REQUIRED
                  see     CDATA #IMPLIED > 
<!ATTLIST report  test    CDATA #REQUIRED >
<!ATTLIST rule    context CDATA #REQUIRED >

1.1 DTD (Current)

Add <name> element for more specific messages, and various ID attributes for referencing and to give role-names to parts of patterns.

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.1//EN -->
<!ELEMENT schema  ( title?, pattern+ )>
<!ELEMENT assert  ( #PCDATA | name )*> 
<!ELEMENT name     EMPTY >
<!ELEMENT pattern ( rule+ )> 
<!ELEMENT report  ( #PCDATA | name )*>
<!ELEMENT rule    ( assert | report   )+>
<!ELEMENT title   ( #PCDATA ) >
<!ATTLIST schema  ns      CDATA #IMPLIED  >
<!ATTLIST assert  test    CDATA #REQUIRED
                  role    ID    #IMPLIED > 
<!ATTLIST name    path    CDATA #IMPLIED >
	<!-- Schematrons should implement '.' 
	as the default value for path -->
<!ATTLIST pattern name    CDATA #REQUIRED
                  see     CDATA #IMPLIED
                  id      ID    #IMPLIED > 
<!ATTLIST report  test    CDATA #REQUIRED
                  role    ID    #IMPLIED >
<!ATTLIST rule    context CDATA #REQUIRED
                  role    ID    #IMPLIED >
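
As a sketch of how the 1.1 additions might be used, the hypothetical schema below puts a name element inside the message text and gives role and id values to parts of the pattern. The element names `chapter`, `appendix` and `title`, and all the ID values, are invented for the example. It assumes that an implementation replaces each name element with the name of the node selected by its path.

```xml
<schema>
    <title>Using name for more specific messages (hypothetical example)</title>
    <pattern name="Titles" id="titles">
        <rule context="chapter | appendix" role="division">
            <!-- name with no path attribute: '.' is the default, so the
                 element being checked is the one named in the message -->
            <assert test="title" role="has-title">A <name/> element should contain a title.</assert>
            <!-- name with an explicit path: the second one names the parent -->
            <report test="parent::chapter" role="nested">A <name/> element should not be nested inside a <name path=".."/> element.</report>
        </rule>
    </pattern>
</schema>
```

Because role and id are declared as ID attributes, each value has to be unique within the schema document.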

1.3 DTD (In Development)

Add <phase/> element to allow dynamic schemas; <key/> element to allow graph validation; fpi attributes to allow SGML Formal Public Identifiers for better management; diagnostics (formerly hints) to allow prescriptions of possible causes of a validation failure (this may help keep assert statements phrased as assertions rather than as error messages); ns for adding namespaces; p element for nicer messages to the user about the schema; emph element and icon attribute for better visual presentation. Remove ns attribute on schema and add xmlns attribute. Add subject attribute to report and assert, based on Dan Connolly's idea, to allow strict identification of the subject of the assertion, for use by RDF etc. Status: <key/> and fpi are now implemented. p and phase just need to be hooks in the skeleton. subject can just be an attribute.

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.3b//EN -->
<!-- Data types -->
<!ENTITY % URI  "CDATA" >
<!ENTITY % PATH "CDATA" >
<!ENTITY % EXPR "CDATA" >
<!ENTITY % FPI  "CDATA" >
<!-- Element declarations -->
<!ELEMENT schema  ( title?, ns*, phase*, p*, pattern+ , p*, diagnostics )>
<!ELEMENT assert  ( #PCDATA | name | emph )*>

<!ELEMENT emph     ( #PCDATA )>
<!ELEMENT diagnostic (#PCDATA | value-of | emph )* >
<!ELEMENT diagnostics ( diagnostic+ )>
<!ELEMENT key      EMPTY >
<!ELEMENT name     EMPTY >
<!ELEMENT ns       EMPTY >
<!ELEMENT p       ( #PCDATA ) >
<!ELEMENT pattern ( p*, rule+ )>
<!ELEMENT phase   ( #PCDATA ) >
<!ELEMENT report  ( #PCDATA | name )*>
<!ELEMENT rule    ( assert | report | key )+>
<!ELEMENT title   ( #PCDATA ) >
<!ELEMENT value-of EMPTY >
<!-- Attribute declarations -->
<!ATTLIST schema  xmlns %URI; #FIXED "http://www.ascc.net/xml/schematron"
                  fpi     %FPI; #IMPLIED
                  defaultPhase IDREF #IMPLIED
                  icon    %URI; #IMPLIED
                  xml:lang NMTOKEN #IMPLIED >
<!ATTLIST assert  test    %EXPR; #REQUIRED
                  role    ID    #IMPLIED 
		  diagnostics IDREFS #IMPLIED
                  icon    %URI; #IMPLIED
                  subject %PATH; #IMPLIED  >
<!ATTLIST diagnostic id      ID    #REQUIRED >
<!ATTLIST key     name    NMTOKEN #REQUIRED
                  path    %PATH; #REQUIRED 
                  icon    %URI;  #IMPLIED >
<!ATTLIST name    path    %PATH; #IMPLIED >
	<!-- Schematrons should implement '.' 
	as the default value for path -->
<!ATTLIST pattern name    CDATA #REQUIRED
                  see     %URI; #IMPLIED
                  id      ID    #IMPLIED 
                  icon    %URI; #IMPLIED>
<!ATTLIST ns      uri     %URI; #REQUIRED
                  prefix  NMTOKEN #IMPLIED >
<!ATTLIST phase   id      ID    #REQUIRED
                  fpi     %FPI; #IMPLIED
                  activePatterns IDREFS #REQUIRED 
                  icon    %URI; #IMPLIED >
<!ATTLIST report  test    %EXPR; #REQUIRED
                  role    ID    #IMPLIED 
                  diagnostics IDREFS    #IMPLIED
                  icon    %URI; #IMPLIED
                  subject %PATH; #IMPLIED >
<!ATTLIST rule    context %PATH; #REQUIRED
                  role    ID    #IMPLIED >
<!ATTLIST value-of  select %PATH; #REQUIRED >
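
The following hypothetical schema sketches how the in-development elements might fit together under this DTD. Everything specific in it is an assumption made for illustration: the XHTML paths, the ID values, the phase descriptions, and the idea that a diagnostic's value-of is evaluated relative to the node that failed. How an implementation selects a phase is not specified on this page.

```xml
<!-- xmlns is not written out here: the DTD declares it #FIXED,
     so a parser that reads the DTD supplies it -->
<schema xml:lang="en">
    <title>Image and link checks (hypothetical example)</title>
    <!-- Declares the prefix h for use in paths -->
    <ns uri="http://www.w3.org/1999/xhtml" prefix="h"/>
    <!-- Each phase lists the IDs of the patterns that are active in it -->
    <phase id="draft" activePatterns="images">Checks for early drafts</phase>
    <phase id="final" activePatterns="images links">Checks before publication</phase>
    <p>Patterns for a small XHTML house style.</p>
    <pattern name="Images" id="images">
        <rule context="h:img">
            <assert test="@alt" diagnostics="d-alt">An <emph>img</emph> element should have an alt attribute.</assert>
        </rule>
    </pattern>
    <pattern name="Links" id="links">
        <rule context="h:a">
            <assert test="@href or @name" diagnostics="d-anchor">An anchor should have an href or a name attribute.</assert>
        </rule>
    </pattern>
    <!-- Diagnostics are referenced by ID from the assert elements above -->
    <diagnostics>
        <diagnostic id="d-alt">The image <value-of select="@src"/> has no alternative text.</diagnostic>
        <diagnostic id="d-anchor">The anchor may be left over from an editing tool.</diagnostic>
    </diagnostics>
</schema>
```

Keeping the possible causes in diagnostic elements leaves each assert phrased as a plain assertion, as the notes above suggest.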

Copyright (C) Rick Jelliffe, Academia Sinica Computing Centre. The Schematron software and this page are available for any public use, under the conditions of the GPL or MPL, but please mention our names in any documentation or About screens for any products that use it. Comments, fixes and upgrades welcome: email ricko@gate.sinica.edu.tw