
Chinese XML FAQ

This FAQ is about using XML for Chinese-language documents. For general English-language XML or SGML FAQs see:

1. What is XML?

XML (eXtensible Markup Language) is a simple language for marking up structures in text documents. It is based on an International Standard: Standard Generalized Markup Language (SGML), ISO 8879:1986, from the International Organization for Standardization (ISO). It looks like HTML. You can create and use your own tags and document structures with it. You can also use it for serializing data from databases.

2. Who developed it?

ISO SGML originally came out of IBM, but has had a lot of input from different companies. XML is developed by the World Wide Web Consortium (W3C); the director of W3C is Tim Berners-Lee, who invented the WWW. The XML project leader was Jon Bosak, chief Information Architect at Sun Microsystems. The XML specification was co-written by representatives of Netscape, Microsoft, and a large academic project called the Text Encoding Initiative (TEI). The W3C Special Interest Group (W3C XML SIG) had representatives from over 100 different companies and invited experts.

3. Which companies are using it or supporting it?

Microsoft, Netscape, Sun, IBM, Corel, Adobe, Oracle, RealAudio...almost everyone.

4. How do I know if I am using XML?

Most XML is hidden; you write your own specialist markup language using it. There are many specialist markup languages built using XML now. For example,

5. Can I use Chinese Data?

Yes. All conforming XML processors must support ISO 10646 characters. ISO 10646 is a big character set that is an ISO standard. It includes the characters from Big5 and GB2312.

However, XML is just starting. So most XML software has not been tested on Chinese data yet (December, 1998).

6. Can I use Chinese element names?

(1998-01-04) Yes. All conforming XML processors must allow you to use Han ideographs for element names. But you can only use the characters defined in the character set you are using; you cannot use numeric character references in names.

There is a particular problem with Big5. See Question 18.

7. Can I use Big5?

Maybe. A conforming XML processor must support UTF-8 and UTF-16 (Unicode). But XML can be encoded using almost any character set. It is the parser-writer's decision which other character set encodings to support.

If you use Big5, every file ("parseable entity") used in your XML document must start with the following header
<?xml version="1.0" encoding="Big5"?>

Why is it important to set the character set correctly? XML will be used for commercial information. If your XML document is not labelled with the correct character set information, an XML processor will reject it. XML moves away from character set guessing (i.e., what HTML does) to explicit markup of character sets.
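As a rough illustration of explicit labelling, the following Python sketch reads the encoding name from the XML declaration instead of guessing, and fails if the runtime has no codec of that name. The function name declared_encoding is hypothetical. The byte-level match assumes an ASCII-family encoding such as Big5, GB2312 or UTF-8 (it does not handle UTF-16), and a real XML processor does considerably more.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data. Assumption: the document uses an ASCII-family encoding,
# so the declaration itself can be read byte by byte.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECLARATION.match(data)
    name = match.group(2).decode("ascii") if match else default
    # codecs.lookup raises LookupError when Python has no codec of that name,
    # which mirrors a processor rejecting an encoding it does not support.
    codecs.lookup(name)
    return name


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="Big5"?><doc/>'))
```

A document with no declaration falls back to UTF-8 here, which is one of the two encodings every conforming processor must support.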

There is a particular problem with Big5. See Question 18.

8. Can I use GB2312?

Maybe. A conforming XML processor must support UTF-8 and UTF-16 (Unicode). But XML can be encoded using almost any character set. It is the parser-writer's decision which encodings to support.

If you use GB2312, every file ("parseable entity") used in your XML document must start with the following header
<?xml version="1.0" encoding="GB2312"?>

9. How do I know what character sets the XML software supports?

The "Chinese XML Now!" project at Academia Sinica, Taipei, has some information. Also, some manufacturers may use the Chinese Numberplate logo. In any case, you can always convert your XML document into UTF-8 and use any XML software.
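The conversion to UTF-8 can be sketched in a few lines of Python. This is a minimal example, not a complete tool: the function name to_utf8 is hypothetical, the caller is assumed to know the source encoding, and Python's built-in big5 and gb2312 codecs are assumed to match the tables your documents really use.

```python
import re
import xml.etree.ElementTree as ET


def to_utf8(xml_bytes: bytes, source_encoding: str) -> bytes:
    """Re-encode an XML document as UTF-8 and relabel its declaration."""
    text = xml_bytes.decode(source_encoding)
    # Rewrite the encoding name in the XML declaration, if there is one,
    # so that the label matches the new bytes.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2',
        r'\g<1>\g<2>UTF-8\g<2>',
        text,
        count=1,
    )
    return text.encode("utf-8")


if __name__ == "__main__":
    big5_doc = '<?xml version="1.0" encoding="Big5"?>\n<書名>中文</書名>'.encode("big5")
    root = ET.fromstring(to_utf8(big5_doc, "big5"))
    print(root.tag, root.text)
```

Relabelling the declaration matters as much as converting the bytes: a UTF-8 file that still claims to be Big5 is exactly the mislabelled document that a processor should reject.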

10. What if I am using Big5 and the character I need is not there?

You can use any character defined by the ISO 10646 universal character set. If Big5 (or GB2312) does not have that character, you can use a "numeric character reference". This looks like
&#x5ABC;
where the "5ABC" is a hexadecimal number giving the ISO 10646 number. Windows NT's "Character Map" utility lets you see all the ISO 10646 characters in any font.
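For illustration, here is a small Python sketch that builds such a reference from a character and checks that a parser turns it back into the same character. The helper name hex_ncr is hypothetical.

```python
import xml.etree.ElementTree as ET


def hex_ncr(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


if __name__ == "__main__":
    ch = chr(0x5ABC)
    print(hex_ncr(ch))  # prints &#x5ABC;
    # A parser resolves the reference back to the same character.
    assert ET.fromstring(f"<p>{hex_ncr(ch)}</p>").text == ch
```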

If the character is a variant on an ISO 10646 character, you can make up an element type (and attribute) so that it will display properly. For example, you could borrow the "SPAN" element type from HTML and then use a stylesheet (e.g. Cascading Stylesheets: CSS) to select the font you need for that character.

If the character is not in ISO 10646, then you have to use the private-use character area. This is not an improvement on what you have to do now! We hope some better system will be included in XML and ISO 10646 in the future.

11. What is the best free XML software for Chinese at the moment?

For Chinese XML, the best at the moment (December 1998) is probably Internet Explorer 5.0 beta. The best XML parser is probably IBM's XML Parser for Java. The best XML/SGML parser is probably James Clark's SP software (C++). An XML version of Perl is coming too!

12. What about Web systems that "Transcode"?

Some Japanese Web servers, proxies or browsers automatically convert between Japanese character sets (e.g., between Shift-JIS and EUC-J). This also apparently occurs in some other languages (e.g. Russian.) As far as we know, this "transcoding" does not happen automatically for Chinese Web servers (e.g., between Big5 and GB2312; between the "traditional" characters and the simplified characters.)

Big5 to GB2312 conversion is not perfect: some characters are missing. It is possible to create an XML-aware "lossless" transcoder: a transcoder that converts unavailable characters into Numeric Character References (NCRs). (We have made some in Academia Sinica Computing Centre, for example.)
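The idea can be sketched with Python's standard codecs, whose "xmlcharrefreplace" error handler writes a decimal NCR for every character the target encoding lacks. This is not the Academia Sinica transcoder, only a minimal sketch: the function name is hypothetical, it does not update the encoding declaration, and it assumes that element and attribute names contain only characters available in the target encoding (NCRs are not allowed in names).

```python
import xml.etree.ElementTree as ET


def transcode_lossless(data: bytes, source: str, target: str) -> bytes:
    """Re-encode XML text, replacing unavailable characters with NCRs."""
    text = data.decode(source)
    # Characters missing from the target character set become &#NNNN;
    # instead of being dropped or replaced by a question mark.
    return text.encode(target, errors="xmlcharrefreplace")


if __name__ == "__main__":
    big5 = "<p>中書</p>".encode("big5")
    gb = transcode_lossless(big5, "big5", "gb2312")
    print(gb)
    # Parsing the result recovers the original text, so nothing was lost.
    assert ET.fromstring(gb.decode("gb2312")).text == "中書"
```

In this example the traditional character 書 has no GB2312 code, so it travels as a reference while 中 is converted directly.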

In order to prevent transcoding, you can send the document using the MIME type application/xml. That is supposed to prevent the document being transcoded. If you are using Apache, then the following may be useful to look at:
AddType application/xml XML xml
or
ForceType application/xml
or
DefaultType application/xml

However, please note that for HTML, the popular Web browsers use lots of tricks to guess which character set encoding is used. This will probably continue even with HTML-in-XML (also known as "Voyager"). So using application/xml will prevent proxies from transcoding, but the receiving system may have "lossy" transcoding built-in anyway! (See question 19.)

13. What if my Web Server sends the wrong charset?

Many (most) web servers do not send the correct charset in the HTTP/MIME headers. In fact, many Web Servers do not allow you to specify the character set at all!

Here are some guidelines:--

14. How do I send the correct MIME/HTTP headers using Apache .htaccess files? (Not finished)

If you are using a recent version of the Apache server, then your Webmaster must give you "AllowOverride FileInfo" permission. Then you can put a file called .htaccess in any directory. (Note that in MIME terminology, "encoding" means "compression". In the XML encoding header, "encoding" means "coded character set".) Here are some lines that may be useful:

DefaultLanguage zh
AddType application/xml XML xml

Why is it important to set the language correctly? Because it can help with searching later.

Why do you want to send XML with the MIME type application/xml rather than text/xml? Because then the file will not be transcoded. Transcoding can make characters disappear when going from Big5 to GB2312. (See also question 12)
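If the documents are served by a small script rather than by Apache, the same extension-to-type mapping can be sketched with Python's standard mimetypes module. This is only an illustration: the function name xml_content_type is hypothetical, and it shows how the header value is chosen, not a complete server.

```python
import mimetypes


def xml_content_type(filename: str) -> str:
    """Pick the MIME type for a file, forcing .xml to application/xml."""
    types = mimetypes.MimeTypes()
    # Same intent as the Apache AddType line: .xml files are sent as
    # application/xml rather than text/xml.
    types.add_type("application/xml", ".xml")
    mime, _compression = types.guess_type(filename)
    return mime or "application/octet-stream"


if __name__ == "__main__":
    print(xml_content_type("faq.xml"))
```

The second value returned by guess_type is the MIME "encoding" in the compression sense (for example gzip), not the character set.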

15. What is the standard attribute xml:lang for?

Every XML element can have an attribute called xml:lang. It lets you set the language you are using. You can use this to help searching and typesetting. Put this attribute on the top-level element in your Chinese XML document. The values you can use for Chinese include:

Of course, this attribute is very simple, but it is important to label all your documents with what language they use. Then a Chinese Web-Robot can automatically add your text to a WWW index, and a Western Web-Robot will know that it should not add the information. Or an automatic translation service can be invoked. Some words are used differently in different Chinese locales (e.g. Taiwan and China) so it can help with automated translation and searching too.

15a. Can I mix different kinds of Chinese in the same document?

(1998-12-15)

Yes. Every XML element can take an attribute called "xml:lang" which says which language the element is. This is not the character encoding (e.g. Big5 or GB2312), but the language: for example

        <p xml:lang="zh-TW">...</p>
means that the element is in Chinese, as used in Taiwan. By implication, the element p should use traditional characters.
        <p xml:lang="zh-HK">...</p>
means that the element is in Hong Kong Chinese.
        <p xml:lang="en-SG">...</p>
means that the element is in Singapore English.
        <p xml:lang="zh-CN-YUH">...<z xml:lang="en">blah;</z>...</p>
means that the element p is in Cantonese Chinese ("YUH" is the code for Cantonese in the SIL Ethnologue: http://www.sil.org/ethnologue/countries/Chin.html ) but the subelement z is in English. (Some characters are used phonetically; these kinds of characters are dialect-specific and unreadable outside the dialect.) By implication, the element p should use simplified characters.
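A program that uses this markup has to work out the language in effect for each element: an element without its own xml:lang takes the value from its nearest ancestor that has one. The following Python sketch shows one way to do that with the standard ElementTree module; the function name effective_languages and the sample document are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(elem, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)


if __name__ == "__main__":
    doc = ET.fromstring(
        '<doc xml:lang="zh-TW">'
        '<p>...<z xml:lang="en">blah</z>...</p>'
        '<p xml:lang="zh-HK">...</p>'
        "</doc>"
    )
    for tag, lang in effective_languages(doc):
        print(tag, lang)
```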

Of course, you can also invent your own attributes to do anything you like:

        <p xml:lang="zh-HK-simplified" traditional="OK">...</p>
means that the element contains data in Hong Kong Chinese, but it should use simplified characters. But the attribute specification 'traditional="OK"' is your attribute: you can use it to say that it is OK to also use the traditional glyph (image).

In XML, you use markup to describe all the interesting information about the data. Then you write a program or stylesheet or report generator to implement what you need, using the markup.

16. I heard that Unicode is not a good character set for Chinese!

The Unicode Consortium is a group of companies (including the Japanese company Justsystem, and companies with large Japanese joint operations, like (Fuji-)Xerox) that decided to make a big character set containing all the world's characters. They took the ISO character set ISO 10646 and added other information: standard names and characteristics. Unicode includes all the characters from GB2312 and (probably) all the characters from Big5, plus many other characters. (ISO 10646 has several encodings: UTF-8 is 8-bit and UTF-16 is 16-bit. Unicode is a form of UTF-16.)

So Unicode is better than Big5 and GB2312: it has more characters.

But, there are problems with the ISO 10646 encodings:

So XML files do not have to be encoded in UTF-8 or UTF-16. You can use Big5 or GB2312. But not much XML software supports the Chinese character sets. So it is good advice to move to UTF-8 or UTF-16 in the long run.

17. Why does software xxx work with Big5: the documentation says it does not?

Big5 is a "7-bit unsafe" "ASCII-family" coded character set.

This means that there is a lot of XML software which will work with XML documents encoded in Big5. But it is an accident, because, strictly, if an XML-system does not understand the encoding given in the XML encoding header, the XML processor should signal an error. In particular, such systems will probably not handle numeric character references (NCR) correctly (See question 10). But they may be useful anyway, of course, even if they are non-conforming.

There is a particular problem with Big5. See Question 18.

18. Some Big5 documents fail with strange errors? Why?

The second byte of Big5 characters can cause problems on some systems. Big5 is not "8-bit safe" (see Question 17.)

The problems will show up only on systems which do not convert the Big5-encoded documents into an 8-bit safe internal format (e.g., Unicode, or UTF-8 or UTF-16.) On these systems, some bytes of the Big5 code will be interpreted as the wrong characters.

The first problem occurs when you are using Native Language Markup (e.g., you are using Han ideographs for element names, attribute names, ID attributes, etc.) There is no way to fix this problem. If you must use that kind of software, then you must avoid using, in markup, any Big5 character whose second byte is not a valid name character.

The second problem occurs in the very rare case that you are using one of the following Han ideographs in a CDATA section, and that character is followed by the string "]>". To fix this problem, you can split the CDATA section into two CDATA sections, and sandwich the naughty character between them. The following characters all have the byte 5D as their second byte (in Big5): this is ASCII "]".

兡也包因沘氓侷柵苗孫孫財
崧淫設弼琶跑愍窟榜蒸奭稽
霄瓢館縲擻鼕孃魔釁佉沎岠
狋垚柛胅娭涘罞偟惈牻荺傒
焱菏酡廅滘絺赩塴榗箂踃嬁
澕蓴醊獧螗餟燱螬駸礑鎞瀧
鄿瀯騬醹躕鱕

Warning: We are still checking these characters.

(Note: If you cannot see all the characters, then see question 19.)
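A list like the one above can be cross-checked mechanically. The following Python sketch tests whether a character's Big5 encoding has 5D (ASCII "]") as its second byte. It relies on Python's built-in big5 codec, which is an assumption: that table may differ from the Big5 variant your software uses, and the function names are hypothetical.

```python
def second_byte_is(ch: str, byte: int = 0x5D) -> bool:
    """True if ch is a two-byte Big5 character whose second byte equals byte."""
    try:
        encoded = ch.encode("big5")
    except UnicodeEncodeError:
        return False  # not representable in this Big5 table
    return len(encoded) == 2 and encoded[1] == byte


def risky_characters(text: str, byte: int = 0x5D) -> list[str]:
    """Return the characters of text that carry the given second byte."""
    return [ch for ch in text if second_byte_is(ch, byte)]


if __name__ == "__main__":
    # Scan some sample text for characters that end in the "]" byte.
    print(risky_characters("中文也可以"))
```

The same check with a different byte value can be used to look for characters whose second byte is not a valid name character.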

19. I cannot see all the characters on my HTML browser! Why?

If you cannot see all the characters, then

Try changing the "Encoding" menu item (e.g. to Big5 or UTF-8); it is under a different menu on different browsers.

20. What is Big5/GCCS, EUDC, and Big5plus?

(1998-12-31)

EUDC (Extended User-defined Characters) is the general name used in Hong Kong for standard sets of user-defined characters (sometimes called by the Japanese term gaiji). They include R&D EUDC, HKUST EUDC and GCCS EUDC.

Big5 was developed in Taiwan. The "traditional" characters are also used in Hong Kong. But Hong Kong also uses other characters which are rarer in Taiwan. The Hong Kong Government has made the Government Chinese Character Set (GCCS), which is Big5 plus an additional 3049 characters. It seems to be in widespread use.

Taiwan standards committees have also recently added an extra 7000 or so characters to Big5, calling it Big5Plus. We cannot be very sure how much it is being used.

Big5/GCCS, EUDC and Big5plus do not have registered IANA encoding names for the Internet. So be careful.

For future interoperability, it is important that all WWW software gives the correct headers in the HTML and XML files. How can you trust your e-commerce data if you cannot know the character set? If you use these new versions of Big5, always put an extra comment or processing instruction at the head of the document to document it. For XML, we suggest that you put, as the second tag in the document, an XML processing instruction with target "ascc-hint" and an attribute "non-IANA".

        <?xml version="1.0" encoding="Big5" ?>
        <?ascc-hint non-IANA="Big5plus" ?>
      
and
        <?xml version="1.0" encoding="Big5" ?>
        <?ascc-hint non-IANA="GCCS" ?>
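Software that knows about this convention could look for the hint before choosing a conversion table. The Python sketch below is one hypothetical way to read it, using a regular expression on text that has already been decoded. The function name read_ascc_hint is made up, and "ascc-hint" is only the suggestion made in this FAQ, not a registered or standard instruction.

```python
import re
from typing import Optional

# Matches <?ascc-hint non-IANA="..." ?> with either kind of quote.
_HINT = re.compile(r'<\?ascc-hint\s+non-IANA\s*=\s*(["\'])(.*?)\1\s*\?>')


def read_ascc_hint(text: str) -> Optional[str]:
    """Return the non-IANA character set name from the hint, if present."""
    match = _HINT.search(text)
    return match.group(2) if match else None


if __name__ == "__main__":
    head = (
        '<?xml version="1.0" encoding="Big5" ?>\n'
        '<?ascc-hint non-IANA="Big5plus" ?>\n'
        "<doc/>"
    )
    print(read_ascc_hint(head))  # prints Big5plus
```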
      

Where can I get more Information?

Try the Chinese XML Now! page at Academia Sinica, Taipei.

What is the "Chinese XML Now!" project?

This is a small project from the Computing Center at Academia Sinica, Taipei. It aims at providing information to developers of Chinese XML Software. It is sometimes difficult for non-Chinese-reading software developers to find useful information on the WWW; and when the project began, there was not much Chinese information on XML either.

The project tries to support material equally in English and Chinese, and in UTF-8, Big5 and GB2312.

Who can I contact about this FAQ?

We welcome corrections, questions and ideas. The contact for the English language material is Rick Jelliffe: ricko@gate.sinica.edu.tw. The contact for Chinese language is Chin-Tang Chang: ctchang@gate.sinica.edu.tw.


Contributions

Thanks for corrections to Sidney Lu, John Cowan (plus apologies for the previous typo in his name), and Toshinori Numata.


Cataloging Information (Dublin Core)

<DC:TITLE       xml:lang="en">The Chinese XML FAQ (English version) </DC:TITLE>
<DC:CREATOR                  >Rick Jelliffe </DC:CREATOR>
<DC:CONTRIBUTOR xml:lang="zh-TW-Lt">Chin-Tang Chang</DC:CONTRIBUTOR>
<DC:SUBJECT     xml:lang="en">XML, SGML, Chinese, FAQ,
                              Big5, GB2312, Unicode, ISO 10646, UTF-8, UTF-16, 
                              Apache, Voyager </DC:SUBJECT>
<DC:DESCRIPTION xml:lang="en">Frequently Asked Questions about using XML for Chinese </DC:DESCRIPTION>
<DC:PUBLISHER   xml:lang="en">Computing Centre, Academia Sinica, Taiwan </DC:PUBLISHER>
<DC:TYPE        xml:lang="en">Text.Article </DC:TYPE>
<DC:DATE                     >1998-01-14 </DC:DATE>
<DC:RIGHTS                   >http://www.ascc.net/xml/en/utf-8/legal.html </DC:RIGHTS>