This FAQ is about using XML for Chinese-language documents. For general English-language XML or SGML FAQs see:
XML (eXtensible Markup Language) is a simple language for marking up structures in text documents. It is based on an International Standard -- Standard Generalized Markup Language (SGML) -- International Organization for Standardization (ISO) ISO 8879:1986. It looks like HTML. You can create and use your own tags and document structures with it. You can also use it for serializing from databases.
ISO SGML originally came out of IBM, but has had a lot of input from different companies. XML is developed by the World Wide Web Consortium (W3C); the director of W3C is Tim Berners-Lee, who invented the WWW. The XML project leader was Jon Bosak, chief Information Architect at Sun Microsystems. The XML specification was co-written by representatives of Netscape, Microsoft, and a large academic project called the Text Encoding Initiative (TEI). The W3C Special Interest Group (W3C XML SIG) had representatives from over 100 different companies and invited experts.
Microsoft, Netscape, Sun, IBM, Corel, Adobe, Oracle, RealAudio...almost everyone.
Most XML is hidden; you write your own specialist markup language using it. There are many specialist markup languages built using XML now. For example,
Yes. All conforming XML processors must support ISO 10646 characters. ISO 10646 is a big character set that is an ISO standard. These include the characters from Big5 and GB2312.
However, XML is just starting. So most XML software has not been tested on Chinese data yet (December, 1999).
Yes. All conforming XML processors must allow you to use Han ideographs for element names. But you can only use the characters defined in the character set you are using; you cannot use numeric character references in names.
There is a particular problem with Big5. See Question B.12.
XHTML is the name given to HTML, when it uses XML syntax. XML syntax is stricter than old HTML: you cannot leave out most end-tags, it is case-sensitive, and you must have quotes on attribute values.
An XML document is well-formed if its syntax is correct: all the required delimiters are correct, and end-tags correspond to their start-tags. All XML processors are required to inform the user if a document is not well-formed; processing will probably halt. (HTML browsers forgive syntax errors; XML processors do not.)
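The strictness described above can be seen with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (a non-validating parser); the helper name is_well_formed is only an illustration, not part of any XML tool mentioned in this FAQ.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>你好</p>"))          # end-tag matches start-tag
print(is_well_formed("<p>你好"))              # missing end-tag
print(is_well_formed("<p class=x>你好</p>"))  # attribute value without quotes
```

Only the first document is accepted; an HTML browser would probably have displayed all three.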
An XML document is valid if its element structure conforms to the "content model" of the optional "markup declarations" in the "prolog" of the document. There are special tools called validators which can help you.
A content model is like an "assert" statement in C or C++ programming languages: it lets you check that the XML document does have the structure you expect or require it to have. For example, some software may expect that the element "HTML:img" has an attribute "src". If you can make assertions about the structure of the document, there are fewer cases you need to program for.
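To make the "assert" analogy concrete, here is a small hand-written check in Python. It is not DTD validation (Python's standard library does not include a validating parser); it only illustrates the kind of structural assertion a content model lets a validator make for you. The function name images_missing_src and the unprefixed element name img are assumptions chosen for the sketch.

```python
import xml.etree.ElementTree as ET

def images_missing_src(xml_text: str) -> list:
    """Return every img element that lacks the src attribute."""
    root = ET.fromstring(xml_text)
    return [img for img in root.iter("img") if "src" not in img.attrib]

doc = '<doc><img src="a.png"/><img alt="no source"/></doc>'
bad = images_missing_src(doc)
print(len(bad))   # 1: the second img has no src attribute
```

With a DTD and a validator, the same rule is declared once in the markup declarations, and software that receives a valid document does not need to repeat the check.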
Maybe. A conforming XML processor must support UTF-8 and UTF-16 (Unicode). But XML can be encoded using almost any character set. It is the parser-writer's decision which other character set encodings to support.
If you use Big5, every file ("parseable entity") used in your XML document must start with the following header: <?xml version="1.0" encoding="Big5"?>
Why is it important to set the character set correctly? XML will be used for commercial information. If your XML document is not labelled with the correct character set information, an XML processor will reject it. XML moves away from character set guessing (i.e., what HTML does) to explicit markup of character sets.
There is a particular problem with Big5. See Question B.12.
Maybe. A conforming XML processor must support UTF-8 and UTF-16 (Unicode). But XML can be encoded using almost any character set. It is the parser-writer's decision which encodings to support.
If you use GB2312, every file ("parseable entity") used in your XML document must start with the following header: <?xml version="1.0" encoding="GB2312"?>
All XML software must support UTF-8 (and therefore ASCII) and UTF-16 (Unicode). Much XML software will also support other character encodings. In any case, you can always convert ("transcode") your XML document into UTF-8 and use any XML software.
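Transcoding to UTF-8 can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name transcode_to_utf8 is made up for the example, the caller already knows the source encoding, and the encoding declaration (if present) is on the first line in the usual form. The declaration has to be rewritten as well, otherwise the label would no longer match the bytes.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode data from source_encoding, relabel the declaration as UTF-8,
    and return the document re-encoded as UTF-8."""
    text = data.decode(source_encoding)
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")

big5_doc = '<?xml version="1.0" encoding="Big5"?><p>中文</p>'.encode("big5")
utf8_doc = transcode_to_utf8(big5_doc, "big5")
print(ET.fromstring(utf8_doc).text)   # 中文
```

The same function works for GB2312 input by passing "gb2312" as the source encoding.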
You can use any character defined by the ISO 10646 universal character set. If Big5 (or GB2312) does not have that character, you can use a "numeric character reference". This looks like &#x5ABC; where the "5ABC" is a hexadecimal number giving the ISO 10646 number. Windows NT's "Character Map" utility lets you see all the ISO 10646 characters in any font.
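As a small illustration of numeric character references, the Python sketch below builds a hexadecimal reference from a character and shows a parser resolving one back. The helper name to_ncr is an assumption for the example.

```python
import xml.etree.ElementTree as ET

def to_ncr(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return "&#x{:X};".format(ord(ch))

# The parser replaces the reference with the character it names.
elem = ET.fromstring("<p>&#x5ABC;</p>")
print(elem.text == chr(0x5ABC))   # True
print(to_ncr(chr(0x5ABC)))        # &#x5ABC;
```

Because the reference itself uses only ASCII characters, it survives in a document stored in Big5, GB2312, or any other ASCII-family encoding.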
If the character is a variant on an ISO 10646 character, you can make up an element type (and attribute) so that it will display properly. For example, you could borrow the "SPAN" element type from HTML and then use a stylesheet (e.g. Cascading Stylesheets: CSS) to select the font you need for that character.
If the character is not in ISO 10646, then you have to use the private-use character area. This is not an improvement on what you have to do now! We hope some better system will be included in XML and ISO 10646 in the future.
Some Japanese Web servers, proxies or browsers automatically convert between Japanese character sets (e.g., between Shift-JIS and EUC-J). This also apparently occurs in some other languages (e.g., Russian). As far as we know, this "transcoding" does not happen automatically for Chinese Web servers (e.g., between Big5 and GB2312; between the "traditional" characters and the simplified characters).
Big5 to GB2312 conversion is not perfect: some characters are missing. It is possible to create an XML-aware "lossless" transcoder--this is a transcoder that will convert unavailable characters into Numeric Character References (NCRs). (We have made some in Academia Sinica Computing Centre, for example.)
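The core idea of such a transcoder can be sketched with Python's built-in "xmlcharrefreplace" codec error handler, which writes any character the target encoding lacks as a decimal numeric character reference. This is not the Academia Sinica tool, only an illustration; the function name is made up. It is also not fully XML-aware: it does not rewrite the encoding declaration, does not map traditional forms to simplified ones, and would produce invalid output if an unavailable character appeared in an element name, a comment, or a CDATA section, where references are not recognized.

```python
def big5_to_gb2312_lossless(data: bytes) -> bytes:
    """Re-encode Big5 bytes as GB2312; characters missing from GB2312
    become decimal numeric character references instead of being lost."""
    text = data.decode("big5")
    return text.encode("gb2312", errors="xmlcharrefreplace")

source = "中龍".encode("big5")   # 龍 (traditional) has no GB2312 code
result = big5_to_gb2312_lossless(source)
print(result.decode("gb2312"))   # 中&#40845;
```

An XML processor reading the GB2312 output resolves the reference, so the character data is the same as in the Big5 original.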
In order to prevent transcoding, you can send the document using the MIME type application/xml. That is supposed to prevent the document being transcoded. If you are using Apache, then the following may be useful to look at: AddType application/xml XML xml or ForceType application/xml or DefaultType application/xml
However, please note that for HTML, the popular Web browsers use lots of tricks to guess which character set encoding is used. This will probably continue even with HTML-in-XML (also known as "Voyager"). So using application/xml will prevent proxies from transcoding, but the receiving system may have "lossy" transcoding built-in anyway! (See question B.13.)
Many (most) web servers do not send the correct charset in the HTTP/MIME headers. In fact, many Web Servers do not allow you to specify the character set at all!
Here are some guidelines:--
If you are using a recent version of the Apache server, then your Webmaster must give you "AllowOverride FileInfo" permission. Then you can put a file called .htaccess in any directory. (Note that in MIME terminology, "encoding" means "compression". In the XML encoding header, "encoding" means "coded character set".) Here are some lines that may be useful--
AddType application/xml XML xml
AddType "text/xml; charset=Big5" XML xml
Why is it important to set the language correctly? Because it can help with searching later.
Why do you want to send XML with the MIME type application/xml rather than text/xml? Because then the file will not be transcoded. Transcoding can make characters disappear when going from Big5 to GB2312. (See also question B.6)
Every XML element can have an attribute called xml:lang. It lets you set the language you are using. You can use this to help searching and typesetting. Put this attribute on the top-level element in your Chinese XML document. The values you can use for Chinese include:
Of course, this attribute is very simple, but it is important to label all your documents with what language they use. Then a Chinese Web-Robot can automatically add your text to a WWW index, and a Western Web-Robot will know that it should not add the information. Or an automatic translation service can be invoked. Some words are used differently in different Chinese locales (e.g. Taiwan and China) so it can help with automated translation and searching too.
Yes. Every XML element can take an attribute called "xml:lang" which says which language the element is. This is not the character encoding (e.g. Big5 or GB2312), but the language: for example
<p xml:lang="zh-TW">...</p>
means that the element is in Chinese, as used in Taiwan. By implication, the element p should use traditional characters.
<p xml:lang="zh-HK">...</p>
means that the element is in Hong Kong Chinese.
<p xml:lang="en-SG">...</p>
means that the element is in Singapore English.
<p xml:lang="zh-CN-YUH"....<z xml:lang="en">blah</z>...</p>
means that the element p is in Cantonese Chinese ("YUH" is the code for Cantonese in the SIL Ethnologue: http://www.sil.org/ethnologue/countries/Chin.html ) but the subelement z is in English. (Some characters are used phonetically; these kinds of characters are dialect-specific and unreadable outside the dialect.) By implication, the element p should use simplified characters.
Of course, you can also invent your own attributes to do anything you like:
<p xml:lang="zh-HK-simplified" traditional="OK">...</p>
means that the element contains data in Hong Kong Chinese, but it should use simplified characters. But the attribute specification 'traditional="OK"' is your attribute: you can use it to say that it is OK to also use the traditional glyph (image).
In XML, you use markup to describe all the interesting information about the data. Then you write a program or stylesheet or report generator to implement what you need, using the markup.
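As an illustration of a program using this markup, the following Python sketch reports the language that applies to each element, letting an element without its own xml:lang take the value from its nearest ancestor. The function name effective_languages and the element q in the sample are assumptions made for the example; the namespace URI is the one reserved for the xml: prefix.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(elem, inherited=None):
    """Yield (tag, language) for elem and all of its descendants."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)

doc = ET.fromstring(
    '<p xml:lang="zh-TW">中文<z xml:lang="en">blah</z><q>文字</q></p>'
)
print(list(effective_languages(doc)))
# [('p', 'zh-TW'), ('z', 'en'), ('q', 'zh-TW')]
```

A search robot or typesetting program could use the same walk to decide which text to index or which font to select.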
The Unicode Consortium is a group of companies (including the Japanese company Justsystem, and companies with large Japanese joint operations, like (Fuji-)Xerox) that decided to make a big character set which had all the world's characters. They took the ISO character set ISO 10646 and added other information: standard names and characteristics. Unicode includes all the characters from GB2312 and (probably) all the characters from Big5. Plus it includes many other characters. (ISO 10646 has several encodings: UTF-8 is 8-bit and UTF-16 is 16-bit. Unicode is a form of UTF-16.)
So Unicode is better than Big5 and GB2312--it has more characters.
But, there are problems with the ISO 10646 encodings:
So XML files do not have to be encoded in UTF-8 or UTF-16. You can use Big5 or GB2312. But not much XML software supports the Chinese character sets. So it is good advice to move to UTF-8 or UTF-16 in the long run.
Big5 is a "7-bit unsafe" "ASCII-family" coded character set.
This means that there is a lot of XML software which will work with XML documents encoded in Big5. But it is an accident, because, strictly, if an XML-system does not understand the encoding given in the XML encoding header, the XML processor should signal an error. In particular, such systems will probably not handle numeric character references (NCR) correctly (See question B.4). But they may be useful anyway, of course, even if they are non-conforming.
There is a particular problem with Big5. See Question B.12.
The second byte of Big5 characters can cause problems on some systems. Big5 is not "8-bit safe" (see Question B.11.)
The problems will show up only on systems which do not convert the Big5-encoded documents into an 8-bit safe internal format (e.g., Unicode, or UTF-8 or UTF-16.) On these systems, some bytes of the Big5 code will be interpreted as the wrong characters.
The first problem occurs when you are using Native Language Markup (e.g., you are using Han Ideograms for element names, attribute names, ID attributes, etc.) There is no way to fix this problem. If you must use that kind of software, then you must avoid using (as in markup) any Big5 character whose second byte is not a valid name character.
The second problem occurs in the very rare case that you are using one of the following Han ideographs in a CDATA section, and that character is followed by the string "]>". To fix this problem, you can split the CDATA section into two CDATA sections, and sandwich the naughty character between them. The following characters all have the byte 5D as their second byte (in Big5): this is ASCII "]".
兡也包因沘氓侷柵苗孫孫財 崧淫設弼琶跑愍窟榜蒸奭稽 霄瓢館縲擻鼕孃魔釁佉沎岠 狋垚柛胅娭涘罞偟惈牻荺傒 焱菏酡廅滘絺赩塴榗箂踃嬁 澕蓴醊獧螗餟燱螬駸礑鎞瀧 鄿瀯騬醹躕鱕
(Note: If you cannot see all the characters, then see question B.13.)
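The CDATA problem can be reproduced at the byte level. The Python sketch below assumes a hypothetical byte-oriented scanner that looks for the CDATA terminator without respecting Big5 character boundaries; the function names are made up for the illustration. It uses the character 也, whose Big5 code ends in the byte 5D.

```python
def big5_trail_byte(ch: str) -> int:
    """Return the second byte of a two-byte Big5 character."""
    encoded = ch.encode("big5")
    if len(encoded) != 2:
        raise ValueError("not a two-byte Big5 character")
    return encoded[1]

def naive_cdata_end(data: bytes) -> int:
    """Locate the CDATA terminator by raw byte search, as a scanner
    that ignores Big5 character boundaries would."""
    return data.find(b"]]>")

print(hex(big5_trail_byte("也")))   # 0x5d, the same byte as ASCII "]"

# The character is followed by "]>", so its second byte completes a false "]]>".
risky = "<![CDATA[a也]>b]]>".encode("big5")
print(naive_cdata_end(risky) < risky.rfind(b"]]>"))   # True: found too early

# The fix: two CDATA sections with the character sandwiched between them.
split = "<![CDATA[a]]>也<![CDATA[]>b]]>".encode("big5")
print(split.count(b"]]>"))   # 2: one real terminator per section
```

A processor that first converts the document to Unicode never sees the false terminator, because it matches characters rather than bytes.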
If you cannot see all the characters, then try changing the "Encoding" menu item (e.g. to Big5 or UTF-8): it is under a different menu on different browsers.
EUDC (Extended User-defined Characters) is the general name used in Hong Kong for standard sets of user-defined characters (sometimes called by the Japanese term gaiji). They include R&D EUDC, HKUST EUDC and GCCS EUDC.
Big5 was developed in Taiwan. The "traditional" characters are also used in Hong Kong. But Hong Kong also uses other characters which are rarer in Taiwan. The Hong Kong Government has made the Government Chinese Character Set (GCCS), which is Big5 plus an additional 3049 characters. It seems to be in widespread use.
Taiwan standards committees have also recently added an extra 7000 or so characters to Big5, calling it Big5Plus. We cannot be very sure how much it is being used.
Big5/GCCS, EUDC and Big5plus do not have registered IANA encoding names for the Internet. So be careful.
For future interoperability, it is important that all WWW software gives the correct headers in the HTML and XML files. How can you trust your e-commerce data if you cannot know the character set? If you use these new versions of Big5, always put an extra comment or processing instruction at the head of the document to document it. For XML, we suggest that you put, as the second tag in the document, an XML processing instruction with target "ascc-hint" and an attribute "non-IANA".
<?xml version="1.0" encoding="Big5" ?> <?ascc-hint non-IANA="Big5plus" ?>
<?xml version="1.0" encoding="Big5" ?> <?ascc-hint non-IANA="GCCS" ?>
For Chinese XML, the best browser at the moment (April 1999) is probably Internet Explorer 5.0. The best XML parser is probably IBM's XML Parser for Java. The best XML/SGML parser is probably James Clark's SP software (C++). An XML version of Perl is coming too!
For a listing of current XML tools, see Robin Cover's web pages at http://www.oasis-open.org/cover/xml.html#xmlSoftware. (1999-04-13)
There are many XML tools which tell you whether an XML entity is well-formed. There are fewer tools which also check document validity against a DTD: Microsoft has a useful tool available at http://www.microsoft.com/xml/ (under "XML & XSL Demos" and "XML Validator"). (1999-04-13)
For a listing of current validation tools, see Robin Cover's web pages at http://www.oasis-open.org/cover/xml.html#xmlValResources.
All XSL tools are experimental betas, at the moment. The tools from James Clark (XT) and from IBM (LotusXSL) are probably the best. (1999-04-13)
For a listing of current XSL tools, see Robin Cover's web pages at http://www.oasis-open.org/cover/xsl.html#xslSoftware.
All XHTML tools are experimental betas, at the moment. Dave Raggett's tidy program at http://www.w3.org/People/Raggett/ can help convert HTML to XHTML. (1999-04-13)
Try the Chinese XML Now! page at Academia Sinica, Taipei.
This is a small project from the Computing Center at Academia Sinica, Taipei. It aims at providing information to developers of Chinese XML Software. It is sometimes difficult for non-Chinese-reading software developers to find useful information on the WWW; and when the project began, there was not much Chinese information on XML either.
The project tries to support material equally in English and Chinese, and in UTF-8, Big5 and GB2312.
Thanks for corrections to Sidney Lu, John Cowan (plus apologies for the previous typo in his name), and Toshinori Numata
<DC:TITLE xml:lang="en">The Chinese XML FAQ (English version) </DC:TITLE>
<DC:CREATOR >Rick Jelliffe </DC:CREATOR>
<DC:CONTRIBUTOR xml:lang="zh-TW-Lt">Chin-Tang Chang</DC:CONTRIBUTOR>
<DC:SUBJECT xml:lang="en">XML, SGML, Chinese, FAQ, Big5, GB2312, Unicode, ISO 10646, UTF-8, UTF-16, Apache, Voyager </DC:SUBJECT>
<DC:DESCRIPTION xml:lang="en">Frequently Asked Questions about using XML for Chinese </DC:DESCRIPTION>
<DC:PUBLISHER xml:lang="en">Computing Centre, Academia Sinica, Taiwan </DC:PUBLISHER>
<DC:TYPE xml:lang="en">Text.Article </DC:TYPE>
<DC:DATE >1999-04-10 </DC:DATE>
<DC:RIGHTS >http://www.ascc.net/xml/en/utf-8/legal.html </DC:RIGHTS>