Academia Sinica Computing Centre
One of the longest-running debates in high technology has been about the benefits of integration versus modularity; it has been joined by related debates on open versus proprietary systems, on tight code versus layered code, on fixed versus extensible designs, and on externally prescribed versus self-describing formats. Most of all, the issues center on the support of organic plurality: making the same framework support a universe of slightly different user needs, configurations and process topologies, and giving users the ability to select and alter these components on their systems without being locked into a brand family (where the brands include ISO and W3C, let alone any closed consortia providing a standards facade for big players).
TCP/IP is an example of such an architecture. Rather than being based on a systematic networking model, such as OSI, TCP/IP systems merely build in mechanisms to allow one level to select the next. This allows great flexibility and a constructive anarchy. The organic nature of TCP/IP (it has shown it can grow naturally) is directly related to its support of plurality. MIME and HTTP similarly succeed because they allow this plurality.
The W3C has been at the forefront of supporting these kinds of modular, layered, open, extensible, self-describing architectures through the WWW. The use of MIME types, content negotiation, XML's processing instructions, the profile attribute of HTML's head element, and RDF are all examples of this approach.
Data kidnap is where some technology gratuitously requires or provides markup that makes it difficult to use the data with another technology. Generic markup (the separation of processing and style from logical structure, and the use of that logical structure as the framework on which processing and style are keyed) prevents data kidnap.
Data kidnap is a way to provide systems that look open: modular, layered, extensible, self-describing, etc., but which do not allow organic plurality. These systems may as well be monolithic: integrated, tightly-coded, fixed, prescribed by strangers.
For example, if a browser supported only one kind of graphic, one set of tags and one stylesheet language, and those graphic formats, tags and stylesheet languages were not ubiquitously available in other browsers, we could say that it captured the data: the data has to be written for that browser. The problem of data kidnap is of course best known on the web in its guise of incompatible proprietary extensions to HTML.
To prevent data kidnap, every new technology or application class (or layer) introduced onto the WWW must be preceded by a mechanism, as an intermediate layer, to allow alternatives to that technology. And the technology itself should handle extensions gracefully, to allow incremental change and experimentation.
For example, a major class of applications is the stylesheet. The XML Stylesheet PI allows multiple stylesheets to be registered for the same document. Without any mechanism like this, the data would be captured by whichever single stylesheet technology it was written for. (There is no need to enter into any discussion of the merits of PIs at this point; I would only point out that in XML they serve less as holders for actual processing instructions, and more as headers that allow appropriate handlers to be selected and invoked for an application. PIs serve a slightly different function in practice in SGML systems, so some of the SGML-derived blanket condemnation of PIs may need to be reconsidered.)
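In practice the registration looks like this (the file names are illustrative; the text/xsl media type shown is the value browsers of the time used, though it was never formally registered):

```xml
<?xml version="1.0"?>
<!-- Two registered stylesheets: a browser that understands CSS can take
     the first; one that prefers XSL can take the alternate. -->
<?xml-stylesheet href="plain.css" type="text/css" title="Plain"?>
<?xml-stylesheet href="fancy.xsl" type="text/xsl" title="Fancy" alternate="yes"?>
<doc>...</doc>
```

Neither stylesheet language captures the data: the browser, or the user, selects among the registered alternatives.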
I do not distinguish here between internal and external mechanisms, but it seems perverse to use external mechanisms when markup is available. Markup is not available for data transport (we do not want to parse every document we send) but it is available for other uses of documents. As mentioned above, external mechanisms merely defer the data kidnap issue by one layer; they do not address it.
Data kidnap can occur even when widely-implemented specifications are used: Netscape may support CSS and RDF while Microsoft may support XSL and XML-Data.
In XML we can expect a few more years of draft standards in many areas. This makes the data kidnap issue even stronger: it is not just alternative technologies and graceful extensions that should be allowed, but different drafts of the same technology. It would be nice to be able to match the stylesheet with the draft.
There are other issues related to data kidnap that I will not go into now: data lockout is one. This is where a schema specification uses some feature which is gratuitously outside the power of some other core application or standard. For example, RDF allows any number of attributes with names based on their number: _1, _2, etc. This locks out using DTDs (or almost every other schema language) for providing a complete schematic description. Of course, DTDs can be made that will handle most examples, by predefining the number of attributes required by guesswork (see my RDF.DTD for an example); in the case of RDF it forced them back into using a BNF notation which incompletely captures XML and so is inaccurate in any case.
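The guesswork can be sketched like this (the declarations below are illustrative, not taken from RDF.DTD; the cutoff at three is arbitrary, which is exactly the problem):

```xml
<!-- A DTD can only predeclare a fixed quota of RDF's numbered names.
     Any instance that uses _4 or beyond falls outside the declaration
     and cannot be completely validated. -->
<!ATTLIST rdf:Description
  _1 CDATA #IMPLIED
  _2 CDATA #IMPLIED
  _3 CDATA #IMPLIED
>
```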
Almost all data is created with the target applications in mind; this is only prudent. So there will always be some structures that are more suitable for one use than another; this may even verge on data lockout or data kidnap. Other applications may require information that is not present in the source data; I am not claiming that all data should be equally usable by all applications! The key word is gratuitous: making dependencies when they are not required.
Unless the W3C as part of its standard procedure also provides mechanisms and guidelines to allow plurality for each emerging application area, vendors have no choice but to introduce these gratuitous dependencies.
At the moment the most obvious locus for data kidnap is the schema. If the data description language is vendor or platform specific and there is no mechanism to allow plurality, then the data has been captured by the vendor or platform.
An example is BizTalk, a technology which I think is very exciting. It is basically a routing wrapper for data, and it specifies that XML-Data should be used for the schema. There is nothing wrong with that. However, the W3C has not provided any mechanism for selecting other schemas, so BizTalk uses the namespace URI as the key. There is nothing particularly wrong with that either, as long as BizTalk developers realize that the URI is not the name of the schema but the name used to key a particular schema in a particular language.
But the namespace URI is also used by the XML Schema draft to select a schema. If the applications are not written to support content negotiation or its equivalent, then the data has been captured. In this case, the data has been captured by the schema, but the symptom is that the namespace URI must be altered if the same data is to appear in a BizTalk body and as data for XML schema; this violates layering.
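The symptom can be sketched with hypothetical URIs: the data is unchanged, but its universal name must be rewritten just to key a different schema language:

```xml
<!-- Inside a BizTalk body, the namespace URI keys an XML-Data schema... -->
<po:Order xmlns:po="http://example.com/schemas/order.xdr">...</po:Order>

<!-- ...but an XML Schema processor keys its schema from the URI too,
     so the same data needs a different "universal" name. -->
<po:Order xmlns:po="http://example.com/schemas/order.xsd">...</po:Order>
```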
It also shows a loophole in (the interpretation of) namespaces. I think a namespace URI that needs to be altered to fit a particular schema language (when the data does not need to be altered and the abstract schema is also the same) is not a universal name at all: it is not a name that is distinct from processing. Others, however, think that universality does not imply that when we move the data to a different processing pipeline we can keep the same namespace URI: to them a universal name only needs to be distinct from all other names, even if that means it is specific to one single usage.
However, one of the main reasons for namespaces was to convert local identifiers (element type names, etc.) into public identifiers; if they are only fit for a single usage, that converts them back into local identifiers. Namespaces disconnect names from a schema, freeing them to be used publicly in other contexts: they should not be tightly coupled again.
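For example, using the Dublin Core namespace, the local name title becomes a public identifier that any document can reuse without committing to any one schema language:

```xml
<report xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- dc:title is a public name: the namespace says what it means,
       not which schema language or processing pipeline must consume it. -->
  <dc:title>Quarterly Results</dc:title>
</report>
```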
There is a tie-up with content negotiation. Content negotiation is simple: the request for a resource is accompanied by a list of preferred formats and the resource is returned in the most preferred format available. However, not all resources are kept on the same server, and not all server operators can or want to reference alternative resources at other sites; we cannot expect a company to provide pointers to alternative technologies that are strategic to a competitor.
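For reference, content negotiation works at the HTTP level: the client lists the formats it prefers, with quality values, and the server returns the best match it holds (the resource name here is illustrative):

```http
GET /catalogue HTTP/1.1
Host: example.com
Accept: application/xml, text/html;q=0.8

HTTP/1.1 200 OK
Content-Type: application/xml
```

Note that the negotiation is confined to the variants one server holds; it cannot point the client at a competitor's alternative.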
So we can say that content negotiation is just that: negotiation of content. To prevent data kidnapping, we also need some kind of application negotiation. This is what plurality mechanisms in markup promote.
Another species of data kidnap is workflow kidnap. This is where the data is tied to some particular process flow: for example, if there is no convenient way for a process to annotate data, it becomes more difficult to process the document in a pipeline approach. This was a very big flaw in SGML, and especially in SGML systems in which DTD registration was difficult. (I would say that it also explains the usefulness of OmniMark, in which it is very easy to customize DTDs for particular parts of a pipeline.)
The XML Schema draft seems to perpetuate this idea. The flaw is that a document is not treated organically, as something that moves around like an amoeba: growing, dividing, assimilating, dying. Instead, a document has a schema; the schema becomes a monolith. It means that every stage of a document's life gets a new schema, rather than each stage having various parts of a rich schema applicable to it. If one stage in a process validates its input and does not alter the data, there is no need for the next stage to revalidate it, for example. Each stage will have particular validation requirements.
The way to overcome workflow kidnap is to make sure that the mechanism for declaring schemas (or other applications) provides a phase attribute. A process can then know that, for example, at its phase in the workflow a particular attribute is #REQUIRED, but at previous stages it has been defaulted to #FIXED "".
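A hypothetical shape for such phase-aware declarations (the PI name and pseudo-attributes here are illustrative, not from any specification):

```xml
<?xml-schema href="authoring.dtd" phase="authoring"?>
<?xml-schema href="delivery.dtd"  phase="delivery"?>
<!-- A process at the delivery phase validates against delivery.dtd,
     where an attribute may be #REQUIRED even though the authoring-phase
     DTD defaulted it to #FIXED "". No stage needs to rename or rewrap
     the data to get its own validation requirements. -->
```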
Viewing documents organically brings out a very important consideration: that a schema application is just another application. A schema is metadata is data. (In the case where a document is used by only one process, or by a closed number of processes where the schema includes phase-specific declarations, this consideration is not particularly useful. But in the general case it is material.)
In recent notes I have brought out several different species of schema other than open or closed content models on elements: XSL tree patterns; regular expressions on IDREFs and other attributes; name-independent validation; and weak validation. These could all be useful, and I do not think I have fully explored every case. Not allowing parallel or piecemeal validation, using different schema languages working on different underlying categories of validation, promotes workflow kidnap.
Let me emphasize again that these mechanisms work to create a level playing field for plurality; they do not deny that market power and the legitimate needs of specialized applications are also real factors at work. It is always possible to write a wrapper and a script to use any data, or to transform it. But these are complications that we should not force on new entrants to a market.
The appropriate XML syntax for the mechanism would be a processing instruction along the lines of the XML Stylesheet PI.
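One hypothetical shape for such a PI, modeled on the xml-stylesheet PI (every name and pseudo-attribute value here is illustrative, not a proposal from any working group):

```xml
<?xml-application href="order.xsd" type="application/xml-schema"
   title="Validation" alternate="yes" phase="delivery"?>
<!-- As with xml-stylesheet, several such PIs could be registered,
     letting the processor negotiate which application, which schema
     language, and which phase applies. -->
```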
Furthermore, it should be deprecated to conflate a namespace URI with the name of a specific application.
The best and the most famous articles relating to organic plurality are:
Copyright (C) 1999 Rick Jelliffe. Please feel free to publish this in any way you like, but try to update it to the most recent version, and keep my name on it.