
Chinese XML Now! Test Files

1999-04-26

The test files are all small XML files for testing purposes. Each file establishes a single useful fact about the software you are testing. The files are available with a .xml or a .txt extension.

There is also a combined test file, with all the tests in a single resource. This is good for a quick overview. All the files are also available in a single tar distribution.

If you are accessing these files from an XML Web-browser, do not panic if the browser cannot accept some of the documents. Some browsers may not have a "default style" defined; XML documents may appear strange or mangled. Other browsers may only accept XML documents which have a stylesheet. Furthermore, if your browser uses a font which does not have Han ideograms, then the characters will appear as a white box or a space. These are all legitimate behaviours.

However, if the browser interprets the XML markup incorrectly, or does not accept Han ideograms in element names, or does not treat numeric character references as ISO 10646 characters, these are signs of a bug. If you are using beta code, please do not be too frustrated: the vendor is trying to find these kinds of problems.
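The checks described above can be tried against any parser. What follows is only a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document, the element names (文, dec, hex, raw) and the function name check_ncr_handling are assumptions made for illustration and are not taken from the test files themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: U+4E2D written as a decimal NCR, a hexadecimal NCR and
# directly, inside an element whose type name is itself a Han ideogram.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<文 attr="&#20013;">'
    '<dec>&#20013;</dec><hex>&#x4E2D;</hex><raw>中</raw>'
    '</文>'
).encode("utf-8")


def check_ncr_handling(data: bytes) -> dict:
    """Parse the document and report what the parser made of each form."""
    root = ET.fromstring(data)
    return {
        "root_tag": root.tag,               # Han ideogram as element type name
        "decimal": root.findtext("dec"),    # decimal NCR in data
        "hexadecimal": root.findtext("hex"),  # hexadecimal NCR in data
        "direct": root.findtext("raw"),     # character encoded directly
        "attribute": root.get("attr"),      # decimal NCR in attribute value
    }


result = check_ncr_handling(SAMPLE)
# A conforming parser resolves all four forms to the same ISO 10646 character.
print(result)
```

If the four values differ, or the parse fails on the element name, that points at the kind of bug described above rather than at a font or stylesheet limitation.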

The files are sent out from our web server with the MIME content-type

(Our web server does not send MIME charset information, as far as we know.)

Well-formed XML, No Stylesheet, No DOCTYPE Declaration, No Namespace

Test UTF-8 Big5 GB2312
0) ASCII-codes-only WF file with encoding name in all uppercase or lowercase.
1) ASCII-codes-only WF file with encoding name using recommended form.
2) ASCII-codes-only WF file with decimal NCR in data.
3) ASCII-codes-only WF file with hexadecimal NCR in data.
4) ASCII-codes-only WF file with decimal NCR in attribute value.
5) ASCII-codes-only WF file with hexadecimal NCR in attribute value.
6) ASCII-codes-only WF file with decimal NCR in CDATA section.
7) ASCII-codes-only WF file with hexadecimal NCR in CDATA section.
8) The file includes one ideographic character encoded directly in data.
9) The file includes a more troublesome ideographic character encoded directly in data.
10) The file includes one ideographic character encoded directly in an attribute.
11) The file includes one ideographic character encoded directly in an element type name.
12) The file includes one ideographic character encoded directly in an ID attribute and later in an IDREF attribute.

(Note there is no test for XML PIs and Comments)
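Each test in this group exists in three encodings, and the only difference between the variants should be the encoding declaration and the bytes used for non-ASCII characters. As a rough illustration, assuming a hypothetical helper make_test_document and a one-element document that is not one of the actual test files, the same character can be serialized three ways in Python like this:

```python
# Hypothetical helper, not part of the test suite: builds a tiny well-formed
# document whose encoding declaration matches the bytes actually written.
def make_test_document(encoding: str, text: str = "中") -> bytes:
    doc = f'<?xml version="1.0" encoding="{encoding}"?>\n<doc>{text}</doc>\n'
    return doc.encode(encoding)


ENCODINGS = ("UTF-8", "Big5", "GB2312")
variants = {name: make_test_document(name) for name in ENCODINGS}

for name, data in variants.items():
    # Round trip: decoding with the declared encoding restores the character.
    assert "中" in data.decode(name)
    print(name, len(data), "bytes")
```

A parser that honours the encoding declaration should report identical character data for all three variants, even though the underlying bytes differ.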

Well-formed XML, Stylesheet, No DOCTYPE Declaration, No Namespace

The following tests all invoke a CSS style sheet. This style sheet has styles for elements that use Han ideograms (hanzi, kanji) as their Element Type Name (GI). Thus, this is really a test of whether the CSS implementation supports Native Language Markup. Note that the CSS style sheet is just plain ASCII; the Big5, GB2312 and UTF-8 test documents all use this entity.

Test UTF-8 Big5 GB2312
0) ASCII-codes-only WF file with encoding name in all uppercase or lowercase.
1) ASCII-codes-only WF file with encoding name using recommended form.
2) ASCII-codes-only WF file with decimal NCR in data.
3) ASCII-codes-only WF file with hexadecimal NCR in data.
4) ASCII-codes-only WF file with decimal NCR in attribute value.
5) ASCII-codes-only WF file with hexadecimal NCR in attribute value.
6) ASCII-codes-only WF file with decimal NCR in CDATA section.
7) ASCII-codes-only WF file with hexadecimal NCR in CDATA section.
8) The file includes one ideographic character encoded directly in data.
9) The file includes a more troublesome ideographic character encoded directly in data.
10) The file includes one ideographic character encoded directly in an attribute.
11) The file includes one ideographic character encoded directly in an element type name.
12) The file includes one ideographic character encoded directly in an ID attribute and later in an IDREF attribute.

(Note there is no test for XML PIs and Comments)

Well-formed XML, No Stylesheet, DOCTYPE Declaration, No Namespace

The following tests all include a DOCTYPE declaration. The markup declarations referenced by the SYSTEM identifier in that DOCTYPE declaration contain, among other things, an element type declaration for an element with a Han ideogram (hanzi, kanji) as its Element Type Name (GI). Note that the entity containing the markup declarations is itself encoded in UTF-8 only; the Big5, GB2312 and UTF-8 test documents all use this entity.

Test UTF-8 Big5 GB2312
0) ASCII-codes-only WF file with encoding name in all uppercase or lowercase.
1) ASCII-codes-only WF file with encoding name using recommended form.
2) ASCII-codes-only WF file with decimal NCR in data.
3) ASCII-codes-only WF file with hexadecimal NCR in data.
4) ASCII-codes-only WF file with decimal NCR in attribute value.
5) ASCII-codes-only WF file with hexadecimal NCR in attribute value.
6) ASCII-codes-only WF file with decimal NCR in CDATA section.
7) ASCII-codes-only WF file with hexadecimal NCR in CDATA section.
8) The file includes one ideographic character encoded directly in data.
9) The file includes a more troublesome ideographic character encoded directly in data.
10) The file includes one ideographic character encoded directly in an attribute.
11) The file includes one ideographic character encoded directly in an element type name.
12) The file includes one ideographic character encoded directly in an ID attribute and later in an IDREF attribute.

(Note there is no test for XML PIs and Comments)

Well-formed XML, No Stylesheet, No DOCTYPE Declaration, Namespace

The namespace referenced by the xmlns attribute in these tests is just a name. It does not resolve to a schema definition file.
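To make the point concrete, here is a small sketch in Python showing that a namespace-aware parser treats the xmlns value purely as an identifier and never tries to fetch it. The namespace name http://example.org/ns/zh-test and the element names are invented for this example; they are not the ones used in the test files.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace name: it only labels the vocabulary, nothing is
# downloaded from it and it need not resolve to any resource.
NS = "http://example.org/ns/zh-test"

data = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f'<文 xmlns="{NS}"><字>中</字></文>'
).encode("utf-8")

root = ET.fromstring(data)

# ElementTree reports qualified names in "{namespace-name}local-name" form.
print(root.tag)
child = root.find(f"{{{NS}}}字")
print(child.text)
```

The parse succeeds with no network access at all, which is the behaviour these tests expect.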

Test UTF-8 Big5 GB2312
0) ASCII-codes-only WF file with encoding name in all uppercase or lowercase.
1) ASCII-codes-only WF file with encoding name using recommended form.
2) ASCII-codes-only WF file with decimal NCR in data.
3) ASCII-codes-only WF file with hexadecimal NCR in data.
4) ASCII-codes-only WF file with decimal NCR in attribute value.
5) ASCII-codes-only WF file with hexadecimal NCR in attribute value.
6) ASCII-codes-only WF file with decimal NCR in CDATA section.
7) ASCII-codes-only WF file with hexadecimal NCR in CDATA section.
8) The file includes one ideographic character encoded directly in data.
9) The file includes a more troublesome ideographic character encoded directly in data.
10) The file includes one ideographic character encoded directly in an attribute.
11) The file includes one ideographic character encoded directly in an element type name.
12) The file includes one ideographic character encoded directly in an ID attribute and later in an IDREF attribute.

(Note there is no test for XML PIs and Comments)

Combined Tests.

Well-formed XML, Stylesheet, DOCTYPE Declaration, No Namespace Declarations, xml:lang attributes, standalone.

This test includes all the previous test cases. It is a good quick test of the most important basic requirements for Chinese documents.

Test UTF-8 Big5 GB2312
13) This file has all the test cases of test files 1 to 12.

Related Test Files

Test UTF-8 Big5 GB2312
Charles Muller's Dictionary of East Asian Buddhist Terms is a multilingual resource. Excellent for demonstrating. XML instances, DTD, XSL stylesheet.
FujiXerox Japanese document: xml
A test of the html:lang and xml:lang attributes. Chinese and English. HTML-in-XML.
A test of the xml:lang attributes. Chinese and English. WF XML.
Chinese and English, with xml:lang attribute. English - Chinese XML Glossary (en, zh); English - Chinese XML Glossary (zh, en); English - Chinese XML Glossary (zh, en)

There are some HTML test files of various languages and encodings, including MIME header labelling of the "charset", at the Vancouver Webpages' Using Multiple Languages in HTML.



TAR Distribution

The test directories and files are also all available in the following file: http://www.ascc.net/xml/zh-xml-test.tar.gz (file size = 70K). The file is in UNIX tar and GNU gzip format, but many PC-based archivers (e.g. WinZIP) should accept the format. Note that some archivers will rename the file zh-xml-test_tar.gz; you may have to rename it back to .tar by hand in order to untar it.
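If no suitable archiver is at hand, the archive can also be inspected from a script. The following is a minimal sketch using Python's standard tarfile module; the function name list_archive is an assumption for illustration. Because the format is selected by the explicit mode rather than by the file extension, it should not matter whether the download was renamed.

```python
import tarfile


def list_archive(path: str) -> list:
    """Return the member names of a gzip-compressed tar archive."""
    # Mode "r:gz" reads the file as tar + gzip based on its content,
    # whatever name or extension the file currently has.
    with tarfile.open(path, mode="r:gz") as archive:
        return archive.getnames()


# Example call (the path is wherever you saved the download):
# for name in list_archive("zh-xml-test_tar.gz"):
#     print(name)
```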

If you use the tar distribution, please note that the images and out-going links will be incorrect. These test files may be corrected and updated occasionally, so, where appropriate, please link to this site rather than putting up your own duplicate on the Internet.



Finally, here is an HTTP header echo service, which lets you see what information your browser is sending. The service is Pascal's Header Echo.

Academia Sinica
