XHTML: HTML-in-XML

The HTML files on this site (Chinese XML Now!) follow the W3C's proposed format for future HTML: XHTML (formerly codenamed Voyager). XHTML uses the HTML DTD, but with XML syntax: HTML-in-XML.

XHTML has some additional requirements that XML content models cannot check: the element exclusions (for example, an a element must not contain another a element). So we have made available an Exclusion Validator for XHTML, written in XSL.

How We Make This Site

We edit our files as ordinary HTML files. Then we run a script that uses Dave Raggett's excellent tidy program to convert the files into well-formed HTML and to add XML headers. Next we check that tidy has generated well-formed XML, using James Clark's xmlwf tool. At times we also use OmniMark Light Edition and the GNU utility sed.

Tidy, xmlwf, OmniMark Light Edition, and the GNU tools are all available free over the Internet. We have also created a custom version of tidy for Chinese that handles Big5 and UTF-8 with better Chinese line-breaking (it also handles Shift_JIS better).

Another program that looks interesting is John Walker's demoroniser, which combines a transcoder with some tag rearrangement, but we are not using it.

Checking XHTML

In each directory we put two shell scripts, generate.sh and check.sh, together with a config.txt file for tidy:


#!/bin/sh

# generate.sh: tidy each page in place, using the settings in config.txt
for i in XSLvalidation checklist faq-phrase-index
do
        /project3/xml/bin/tidy -config config.txt "$i.html"
done

// config.txt for HTML tidy for English XHTML files
write-back: yes
indent: auto
indent-spaces: 2
wrap: 70
markup: yes
clean: yes
output-xml: yes 
input-xml: no 
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: no 
quote-ampersand: yes
break-before-br: no
uppercase-tags: no 
uppercase-attributes: no 
smart-indent: yes
output-xhtml: yes
char-encoding: utf8
language: en
ncr: yes


#!/bin/sh

# check.sh: confirm that each tidied page is well-formed XML
for i in XSLvalidation checklist faq-phrase-index
do
        /project3/xml/bin/xmlwf "$i.html"
done

Currently we do not check the validity of our XHTML documents. So far, the main problems this lets through are missing or duplicate ID attributes. The XML documents that we process with OmniMark usually must be valid, however.
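
Short of full validation, duplicate IDs can be caught with a much cruder check. The following is a minimal sketch and not one of the site's scripts: dup_ids is a hypothetical helper, and it assumes that tidy has already normalized attributes to the double-quoted id="..." form and that grep supports the -o option (GNU and BSD grep do). It works on text rather than on parsed XML, so an id="..." string in ordinary text would be counted too.

```shell
#!/bin/sh

# dup_ids FILE: print every id value that occurs more than once in FILE.
# Hypothetical helper; a text-level check, not a validator.
dup_ids() {
    # Extract each id="..." attribute, strip it down to the bare value,
    # then keep only the values that appear on more than one line.
    grep -oE '(^|[[:space:]])id="[^"]*"' "$1" |
        sed 's/^[[:space:]]*id="\(.*\)"$/\1/' |
        sort | uniq -d
}
```

Calling dup_ids checklist.html prints one line per repeated id value and prints nothing when every id is unique. Missing IDs cannot be found this way; that still needs a validating parser.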

We will be integrating these scripts into a nightly batch file so that the files are tidied automatically. This seems to provide a simple way to introduce users gradually to the ideas of document well-formedness and validity.
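
One possible shape for such a batch file is sketched below. It is an illustration under stated assumptions, not the site's actual script: run_nightly is a hypothetical name, every directory is assumed to hold a generate.sh and a check.sh as described above, and check.sh is assumed to print nothing when all files are well-formed (xmlwf reports a problem by printing a line about it). The exit status of generate.sh is deliberately ignored, since tidy may exit with a nonzero status for mere warnings.

```shell
#!/bin/sh

# run_nightly ROOT: tidy and check every directory below ROOT that
# contains a generate.sh. Hypothetical wrapper, not an original script.
# Assumption: check.sh is silent when every file is well-formed.
run_nightly() {
    root=$1
    find "$root" -type f -name generate.sh | (
        status=0
        while IFS= read -r script; do
            dir=$(dirname "$script")
            # Tidy the files; the exit status is ignored on purpose.
            (cd "$dir" && sh ./generate.sh) </dev/null >/dev/null 2>&1
            # Any output from check.sh is treated as a failure report.
            report=$(cd "$dir" && sh ./check.sh 2>&1 </dev/null)
            if [ -n "$report" ]; then
                printf 'NOT WELL-FORMED in %s:\n%s\n' "$dir" "$report"
                status=1
            fi
        done
        # Nonzero if at least one directory produced a report.
        exit "$status"
    )
}
```

Called from a nightly cron job on the root directory of the site, run_nightly prints a report only for the directories that need attention and returns a nonzero status if there are any, so the output can be mailed to whoever maintains those pages.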