<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<?xml:stylesheet type="text/css" href="qaml.css" ?>
<!-- DTD removed because IE5 freezes with it!!! -->
<!--DOCTYPE faq SYSTEM "http://www.ascc.net/xml/resource/qaml-xml.dtd" -->
 
<faq>
  <head>
  
    <title>
      Chinese XML FAQ
    </title>
    <maintain><name>Rick Jelliffe</name>
        <email>ricko@gate.sinica.edu.tw</email>
        </maintain>
    <hdr>
       <type>Programming</type>
       <content>XML, SGML, Chinese, FAQ, Big5, 
    GB2312, Unicode, ISO 10646, UTF-8, UTF-16, Apache, Voyager</content>
    </hdr>
    <archive href="http://xml.ascc.net/xml/en/utf-8/faq.xml"/>
  </head>

  <body xml:lang="en" >                            
    <section id="intro" class="intro">
         <logo href="../../graphics/xml.gif" 
         alt="The XML Logo (from the XML FAQ)" />
      <title>
        Chinese XML FAQ
      </title>
 <p>
      This FAQ is about using XML for Chinese-language documents.
      For general English-language XML or SGML FAQs see:
    </p>
    <div class="ul">
      <p class="li">
        John Lamp and Dave Megginson's <link href= 
        "http://lamp.man.deakin.edu.au/sgml/sgmlfaq.txt">
        comp.text.sgml FAQ</link> at <span class="tt">
        http://lamp.man.deakin.edu.au/sgml/sgmlfaq.txt</span>;
      </p>
      <p class="li">
        Peter Flynn's <link href="http://www.ucc.ie/xml/">XML FAQ</link>
        at <span class="tt">http://www.ucc.ie/xml/</span> (this is also available
        in Japanese, Korean and Spanish: does anyone want to
        translate it into Chinese?).
      </p>
      </div>
        <p>An XHTML Chinese version of this FAQ is also available in the following
        character encodings:
        <link href="http://xml.ascc.net/xml/zh/big5/faq.html">Big 5</link>,
        <link href="http://xml.ascc.net/xml/zh/utf-8/faq.html">UTF-8</link>, and
        <link href="http://xml.ascc.net/xml/zh/gb/faq.html">GB2312</link>.
      </p>
        </section>
       <section id="sect-1">
    <title>A. XML Questions</title>
    <qna  id="zh_xml_q1">
      <q>
        A.1. What is XML?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="eXtensible Markup Language">
      <p>
        <link href="http://www.w3.org/xml/">XML</link> (eXtensible Markup
        Language) is a simple language for marking up structures in
        text documents. It is based on an International Standard, the
        Standard Generalized Markup Language (<span  class="b">SGML</span >),
        published by the International Organization for Standardization
        (<span  class="b">ISO</span >) as ISO 8879:1986. It looks like <link href=
        "http://www.w3.org/MarkUp/">HTML</link>, but you can create and
        use your own tags and document structures with it. You can
        also use it for serializing data from databases.
      </p>
    </a></qna>
    <qna  id="zh_xml_q2">
      <q>
        A.2. Who developed it?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="World Wide Web Consortium">
      <p>
        ISO SGML originally came out of <span  class="b">IBM</span >, but has had a
        lot of input from different companies. XML is developed by
        the World Wide Web Consortium (<span  class="b">W3C</span >); the director of
        W3C is Tim Berners-Lee, who invented the WWW. The XML
        project leader was Jon Bosak, chief Information Architect
        at <span  class="b">Sun</span > Microsystems. The XML specification was
        co-written by representatives of <span  class="b">Netscape</span >, <span  class="b">
        Microsoft</span >, and a large academic project called the <span class="i">
        Text Encoding Initiative</span > (<span  class="b">TEI</span >). The W3C XML Special
        Interest Group (XML SIG) had representatives from over
        100 different companies, plus invited experts.
      </p>
    </a></qna>
    <qna  id="zh_xml_q3">
      <q>
        A.3. Which companies are using it or supporting it?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Almost everyone">
      <p>
        Microsoft, Netscape, Sun, IBM, Corel, Adobe, Oracle,
        RealAudio...almost everyone.
      </p>
    </a></qna>
    <qna  id="zh_xml_q4">
      <q>
        A.4. How do I know if I am using XML?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Usually hidden" >
      <p>
        Most XML is hidden; you write your own specialist markup
        language using it. There are many specialist markup
        languages built using XML now. For example,
      </p>
      <div class="ul">
        <p class="li">
          RealAudio's new RealPlayer uses W3C <span  class="b">SMIL</span >
          (Synchronized Multimedia Integration Language);
        </p>
        <p class="li">
          Netscape's "What's Related" uses W3C <span  class="b">RDF</span > (Resource
          Description Framework);
        </p>
        <p class="li">
          Microsoft's Internet Explorer "Channels" uses <span  class="b">CDF</span >
          (Channel Definition Format).
        </p>
        </div>
 
    </a></qna>
    <qna  id="zh_xml_q5">
      <q>
        A.5. Can I use Chinese Data?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Yes">
      <p>
        Yes. All conforming XML processors must support <span  class="b">ISO
        10646</span > characters. ISO 10646 is a very large character set
        defined by an ISO standard. It includes the characters from Big5
        and GB2312.
      </p>
      <p>
        However, XML is just starting. So most XML software has not
        been tested on Chinese data yet (December, 1998).
      </p>
    </a></qna>
    <qna  id="zh_xml_q6" date="1999-01-04" >
      <q>
        A.6. Can I use Chinese element names?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Yes">
      <p>
        Yes. All conforming XML processors must allow
        you to use Han ideographs for element names. But you can
        only use the characters defined in the character set you
        are using; you cannot use numeric character references in
        names.
      </p>
      <p>
        There is a particular problem with Big5. See Question B.12.
      </p>
    </a></qna>         
    
    <qna id="zh_xml_qa7" date="1999-04-06">
      <q>
        A.7. What is XHTML?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="eXtensible HyperText Markup Language" >
      <p>XHTML is the name given to HTML, when it uses XML syntax.
      XML syntax is stricter than old HTML: you cannot leave out
      most end-tags, it is case-sensitive, and you must have
      quotes on attribute values.
      </p>
      </a>
      </qna>
    <qna id="zh_xml_q8" date="1999-04-06">
      <q>
        A.8. What is Well-formed and Valid?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Terms relating to correct syntax" >
      <p>An XML document is well-formed if its syntax is correct:
      all the required delimiters are correct, and end-tags correspond
      to their start-tags. All XML processors are required to inform
      the user if a document is not well-formed; processing will probably
      halt. (HTML browsers forgive syntax errors; XML processors do not.)
      </p>
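      <p>
        As a rough illustration of the well-formedness check, here is a
        minimal sketch in Python using the standard library parser. The
        function name and the two sample documents are made up for this
        example. Note that this parser only checks well-formedness; it
        does not validate.
      </p>
      <p class="pre"><![CDATA[
```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reports the error instead of guessing, unlike an HTML browser.
        return False


good = "<faq><q>What is XML?</q></faq>"
bad = "<faq><q>What is XML?</faq>"  # the end-tag for q is missing

print(is_well_formed(good))
print(is_well_formed(bad))
```
]]></p>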
      <p>An XML document is valid if its element structure conforms to
      the "content model" of the optional "markup declarations" in the "prolog"
      of the document. There are special tools called validators which can
      help you. 
      </p>
      <p>A content model is like an "assert" statement in C or C++ programming
      languages: it lets you check that the XML document does have the structure
      you expect or require it to have. For example, some software may expect
      that the element "HTML:img" has an attribute "src". If you can make assertions
      about the structure of the document, there are fewer cases you need to program
      for.
      </p>
      </a>
      </qna>
    
    </section>
       
    <section id="sect-2">
    <title>B. Character Sets and Encodings</title>
    <qna id="zh_xml_q7">
      <q>
        B.1. Can I use Big5?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="maybe" >
      <p>
        Maybe. A conforming XML processor must support UTF-8 and
        UTF-16 (Unicode). But XML can be encoded using almost any
        character set. It is the parser-writer's decision which
        other character set encodings to support. When a processor
        does support Big5, there are two layers:
      </p>
      <div class="ul"> 
        <p class="li">
          Outside the XML processor, the document is encoded in
          Big5.
        </p>
        <p class="li">
          Inside the XML system, the document is encoded as
          ISO10646. (ISO10646 is the character set that Java uses.
          It is the same character set as Unicode.)
        </p>
 
        </div>
      <p>If you use Big5, every file ("parsed entity") used in your
      XML document <span  class="b">must</span > start with the following header:
      <span class="tt"><![CDATA[<?xml version="1.0" encoding="Big5"?>]]></span>
      </p>
      <p>
        Why is it important to set the character set correctly? XML
        will be used for commercial information. If your XML
        document is not labelled with the correct character set
        information, an XML processor will reject it. XML moves
        away from character set <span  class="b">guessing</span > (i.e., what HTML
        does) to <span  class="b">explicit markup</span > of character sets.
      </p>
      <p>
        There is a particular problem with Big5. See Question B.12.
      </p>
    </a></qna>
    <qna  id="zh_xml_q8">
      <q>
        B.2. Can I use GB2312?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Maybe">
      <p>
        Maybe. A conforming XML processor must support UTF-8 and
        UTF-16 (Unicode). But XML can be encoded using almost any
        character set. It is the parser-writer's decision which
        other encodings to support. When a processor does support
        GB2312, there are two layers:
      </p>
       <div class="ul">
        <p class="li">
          Outside the XML processor, the document is encoded in
          GB2312 (e.g. the EUC encoding of GB2312 and ASCII, also
          called cn-euc);
        </p>
        <p class="li">
          Inside the XML system, the document is encoded as
          ISO10646. (ISO10646 is the character set that Java uses.
          It is the same character set as Unicode.)
        </p>
      </div> 
      <p>
        If you use GB2312, every file ("parsed entity") used in
        your XML document <span  class="b">must</span > start with the following
        header: <span class="tt"><![CDATA[<?xml version="1.0" encoding="GB2312"?>]]></span>
      </p>
    </a></qna>
    <qna  id="zh_xml_q9" date="1999-04-10">
      <q>
        B.3. How do I know what character sets the XML software
        supports?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="UTF-8 is always supported">
        <p>All XML software must support UTF-8 (and therefore ASCII) and
        UTF-16 (Unicode). Much XML software will also support other character
        encodings. In any case, you can always
        convert ("transcode") your XML document into UTF-8 and then use any XML
        software.</p>
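      <p>
        The transcoding step could be sketched as follows in Python. This
        is only an illustration under simple assumptions: the function
        name and the sample document are invented here, the whole file
        fits in memory, and the encoding declaration (if present) sits at
        the very start of the file.
      </p>
      <p class="pre"><![CDATA[
```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of an XML declaration at the start of the text.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding=)(["\'])[A-Za-z0-9._-]+\2')


def transcode_to_utf8(data, source_encoding):
    """Decode `data` from `source_encoding` and re-encode it as UTF-8,
    rewriting the encoding declaration (if any) to match."""
    text = data.decode(source_encoding)
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>UTF-8\g<2>', text, count=1)
    return text.encode("utf-8")


# A tiny Big5 document; \u4e2d\u6587 are the two Han characters of "Chinese text".
big5_doc = '<?xml version="1.0" encoding="Big5"?><doc>\u4e2d\u6587</doc>'.encode("big5")
utf8_doc = transcode_to_utf8(big5_doc, "big5")

# Any conforming parser must accept the UTF-8 result.
root = ET.fromstring(utf8_doc)
print(root.text == "\u4e2d\u6587")
```
]]></p>
      <p>
        Rewriting the declaration matters: a file that is UTF-8 on disk
        but still says Big5 in its header is mislabelled.
      </p>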
    </a></qna>
    <qna  id="zh_xml_q10">
      <q>
        B.4. What if I am using Big5 and the character I need is not
        there?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Use a Numeric Character Reference" >
      <p>
        You can use any character defined by the ISO 10646
        universal character set. If Big5 (or GB2312) does not have
        that character, you can use a "numeric character
        reference". This looks like
        <span class="tt"><![CDATA[&#x5ABC;]]></span>, where "5ABC" is the hexadecimal
        ISO 10646 code point of the character. Windows NT's "Character Map"
        utility lets you browse the ISO 10646 characters available in a font.
      </p>
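      <p>
        A small sketch of how such a reference can be produced and read
        back, using Python and its standard library parser. The helper
        name is an assumption of this example, not part of any XML tool.
      </p>
      <p class="pre"><![CDATA[
```python
import xml.etree.ElementTree as ET


def to_ncr(ch):
    """Return the hexadecimal numeric character reference for one character."""
    return "&#x{:X};".format(ord(ch))


ncr = to_ncr("\u5abc")             # the example code point used in the text
doc = "<c>" + ncr + "</c>"          # a pure-ASCII document
parsed = ET.fromstring(doc).text    # the parser resolves the reference

print(ncr)
print(parsed == "\u5abc")
```
]]></p>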
      <p>
        If the character is a variant on an ISO10646 character, you
        can make up an element type (and attribute) so that it will
        display properly. For example, you could borrow the "SPAN"
        element type from HTML and then use a stylesheet (e.g.
        Cascading Stylesheets: CSS) to select the font you need for
        that character.
      </p>
      <p>
        If the character is not in ISO10646, then you have to use
        the private-use character area. This is not an improvement
        on what you have to do now! We hope some better system will
        be included in XML and ISO10646 in the future.
      </p>
    </a></qna>                                 
    <qna  id="zh_xml_q12">
      <q>
        B.5. What about Web systems that "Transcode"?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Configure your Web server well" >
      <p>
        Some Japanese Web servers, proxies or browsers
        automatically convert between Japanese character sets
        (e.g., between Shift-JIS and EUC-JP). This apparently also
        occurs for some other languages (e.g. Russian). As far as we
        know, this "transcoding" does not happen automatically for
        Chinese Web servers (e.g., between Big5 and GB2312, that is,
        between the "traditional" characters and the simplified
        characters).
      </p>
      <p>
        Big5 to GB2312 conversion is not perfect: some characters
        are missing. It is possible to create an <span class="i">XML-aware
        "lossless" transcoder</span >: this is a transcoder that
        converts unavailable characters into Numeric Character
        References (NCRs). (We have made some at the Academia Sinica
        Computing Centre, for example.)
      </p>
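      <p>
        The core idea of such a transcoder can be sketched with Python's
        built-in codecs. This is not the Academia Sinica tool, only an
        illustration: the function name is invented, and the sketch is
        safe only for character data and attribute values. A real
        XML-aware transcoder must also handle names, comments and CDATA
        sections (where references are not recognized) and must rewrite
        the encoding declaration.
      </p>
      <p class="pre"><![CDATA[
```python
def big5_to_gb2312_lossless(data):
    """Re-encode Big5 bytes as GB2312, turning every character that GB2312
    lacks into a decimal numeric character reference instead of dropping it."""
    text = data.decode("big5")
    return text.encode("gb2312", errors="xmlcharrefreplace")


# U+4E2D exists in both character sets; U+9F8D is a traditional form
# that GB2312 does not contain.
source = "<p>\u4e2d\u9f8d</p>".encode("big5")
result = big5_to_gb2312_lossless(source)

print(result.decode("gb2312"))
```
]]></p>
      <p>
        The missing character survives as the reference for code point
        40845 (hexadecimal 9F8D), so no information is lost.
      </p>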
      <p>
        In order to prevent transcoding, you can send the document
        using the MIME type <span class="tt">application/xml</span>. That is <span class="i">
        supposed</span > to prevent the document being transcoded. If
        you are using Apache, then the following may be useful to
        look at: <span class="tt">AddType application/xml XML xml</span>
                or
        <span class="tt">ForceType application/xml</span> or 
        <span class="tt">DefaultType application/xml</span>
      </p>
      <p>
        However, please note that for HTML, the popular Web
        browsers use lots of tricks to guess which character set
        encoding is used. This will probably continue even with
        HTML-in-XML (also known as "Voyager"). So using <span class="tt">
        application/xml</span> will prevent proxies from transcoding,
        but the receiving system may have "lossy" transcoding
        built-in anyway! (See question B.13.)
      </p>
    </a></qna>
    <qna  id="zh_xml_q13" >
      <q>
        B.6. What if my Web Server sends the wrong charset?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Configure your webserver well" >
      <p>
        Many (most) web servers do not send the correct charset in
        the HTTP/MIME headers. In fact, many Web Servers do not
        allow you to specify the character set at all!
      </p>
      <p>
        Here are some guidelines:
      </p>
        <div class="ul">
        <p class="li">
          in the future, Web Servers will look at the XML encoding
          header (but not yet; XML is new);
        </p>
        <p class="li">
          if your site only serves one encoding, make sure that
          your webserver sends that as the default;
        </p>
        <p class="li">
          if your webserver supports HTTP 1.1 content negotiation
          (e.g. Apache) and you have many different languages, the
          server will have some system for selecting files using
          language (e.g. using filenames like <span class="tt">file.xml.en</span>
          and <span class="tt">file.xml.cn</span>); or
        </p>
        <p class="li">
          use a different directory for each language, and use the <span class="tt">
          .htaccess</span> control file to set the language. (If you
          are using Apache, your Webmaster must give you
          "AllowOverride FileInfo" permission.) See the next question.
        </p>
 
        </div>
    </a></qna>
    <qna  id="zh_xml_q14">
      <q>
        B.7. How do I send the correct MIME/HTTP headers using
        Apache .htaccess files?  
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="AddType application/xml xml" >
      <p>
        If you are using a recent version of the Apache server,
        then your Webmaster must give you "AllowOverride FileInfo"
        permission. Then you can put a file called 
        <span class="tt">.htaccess</span> in any
        directory. (Note that in MIME terminology,
        "<span class="i">encoding</span >" means "<span class="i">compression</span >". In the XML
        encoding header, "<span class="i">encoding</span >" means "<span class="i">coded
        character set</span >".) Here are some lines that may be useful:
      </p>
      <p>
        <span class="tt">DefaultLanguage zh</span>
      </p>
      <p>
        <span  class="tt">AddType application/xml XML xml</span >
      </p>
      <p>or</p>
      <p>
        <span  class="tt">AddType "text/xml; charset=Big5" XML xml</span >
      </p>
      <p>
        Why is it important to set the language correctly? Because
        it can help with searching later.</p>
      <p>
        Why do you want to send XML with the MIME type <span class="tt">
        application/xml</span> rather than <span class="tt">text/xml</span>? Because
        then the file will not be transcoded. Transcoding can make
        characters disappear when going from Big5 to GB2312. (See
        also question B.5.)</p>
    </a></qna>
    <qna  id="zh_xml_q15">
      <q>
        B.8. What is the standard attribute xml:lang for?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Shows the language of an element">
      <p>
        Every XML element can have an attribute called <span class="tt">
        xml:lang</span>. It lets you set the language you are using.
        You can use this to help searching and typesetting. Put
        this attribute on the top-level element in your Chinese XML
        document. The values you can use for Chinese include:
      </p>
       <div class="ul">
        <p class="li">
           <span class="tt">xml:lang="zh"</span> for any Chinese text;
        </p>
        <p class="li">
           <span class="tt">xml:lang="zh-TW"</span> for Chinese text from Taiwan (i.e.,
          traditional characters);
        </p>
        <p class="li">
           <span class="tt">xml:lang="zh-HK"</span> for Chinese text from Hong Kong (i.e.
          probably traditional characters);
        </p>
        <p class="li">
           <span class="tt">xml:lang="zh-CN"</span> for Chinese text from China (i.e.,
          simplified characters);
        </p>
        <p class="li">
           <span class="tt">xml:lang="zh-SG"</span> for Chinese text from Singapore
        </p>
        </div>
      <p>
        Of course, this attribute is very simple, but it is
        important to label all your documents with what language
        they use. Then a Chinese Web-Robot can automatically add
        your text to a WWW index, and a Western Web-Robot will know
        that it should not add the information. Or an automatic
        translation service can be invoked. Some words are used
        differently in different Chinese locales (e.g. Taiwan and
        China) so it can help with automated translation and
        searching too.
      </p>
    </a></qna>
    <qna  id="zh_xml_q15a" date="1998-12-15">
      <q>
        B.9. Can I mix different kinds of Chinese in the same
        document?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Yes">
 
      <p>
        Yes. Every XML element can take an attribute called
        "xml:lang" which says which language the element is in. This
        is not the character encoding (e.g. Big5 or GB2312), but
        the language. For example,
      </p>
      <p class="pre"><![CDATA[<p xml:lang="zh-TW">...</p>]]></p>
      <p>
      means that the element is in Chinese, as used in Taiwan. By
      implication, the element p should use traditional characters.
      </p>
      
<p class="pre"><![CDATA[
        <p xml:lang="zh-HK">...</p>
        ]]>
</p>
<p>means that the element is in Hong Kong Chinese. </p>

<p class="pre"><![CDATA[
        <p xml:lang="en-SG">...</p>
        ]]>
</p>
<p>means that the element is in Singapore English.</p> 
<p class="pre"><![CDATA[
        <p xml:lang="zh-CN-YUH">...<z xml:lang="en">blah</z>...</p>
        ]]>
</p>
      <p>means that the element p is in Cantonese Chinese ("YUH" is
      the code for Cantonese in the SIL Ethnologue: <link href=
      "http://www.sil.org/ethnologue/countries/Chin.html">
      http://www.sil.org/ethnologue/countries/Chin.html</link>) but
      the subelement z is in English. (Some characters are used
      phonetically; these kinds of characters are dialect-specific
      and unreadable outside the dialect.) By implication, the
      element p should use simplified characters.
     </p>
      <p>
        Of course, you can also invent your own attributes to do
        anything you like:
      </p>
<p class="pre"><![CDATA[
        <p xml:lang="zh-HK-simplified" traditional="OK">...</p>
        ]]>
</p>
      <p>means that the element contains data in Hong Kong Chinese,
      but it should use simplified characters. But the attribute
      specification '<span class="tt">traditional="OK"</span>' is your attribute:
      you can use it to say that it is also OK to use the
      traditional glyph (image).</p>
      <p>
        In XML, you use markup to describe all the interesting
        information about the data. Then you write a program or
        stylesheet or report generator to implement what you need,
        using the markup.
      </p>
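      <p>
        As a sketch of "a program using the markup", the following Python
        fragment reports the language of each element in a mixed
        document. It assumes the usual rule that an element without its
        own xml:lang takes the language of its parent; the function name
        and the sample document are invented for this example.
      </p>
      <p class="pre"><![CDATA[
```python
import xml.etree.ElementTree as ET

# The parser reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, language) pairs; an element without xml:lang
    inherits the language of its parent."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<body xml:lang="zh-TW"><p>...</p>'
    '<p xml:lang="zh-HK">...<z xml:lang="en">blah</z></p></body>'
)
result = list(languages(doc))

for tag, lang in result:
    print(tag, lang)
```
]]></p>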
    </a></qna>
    <qna  id="zh_xml_q16">
      <q>
        B.10. I heard that Unicode is not a good character set for
        Chinese!
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Unicode is fine" >
      <p>
        The <link href="http://www.unicode.org/">Unicode
        Consortium</link> are a group of companies (including the
        Japanese company Justsystem, and companies with large
        Japanese joint operations, like (Fuji-)Xerox) that decided
        to make a big character set which had all the world's
        characters. They took the ISO character set ISO 10646 and
        have added other information: standard names and
        characteristics. Unicode includes all the characters from
        GB2312 and (probably) all the characters from Big5. Plus it
        includes many other characters. (ISO 10646 has several
        encodings: <span  class="b">UTF-8</span > uses 8-bit units and <span  class="b">UTF-16</span > uses
        16-bit units. When software documentation gives <span  class="b">Unicode</span > as
        an encoding name, it usually means UTF-16.)
      </p>
      <p>
        So Unicode is better than Big5 and GB2312: it has more
        characters.
      </p>
      <p>
        But, there are problems with the ISO 10646 encodings:
      </p>
      <div class="ul">
        <p class="li">
          The 16-bit encoding (UTF-16 or <span class="i">
          Unicode</span >) takes up no more space than Big5 or GB2312:
          2 bytes per Chinese character. But the 8-bit variable-length
          encoding (UTF-8) uses 3 bytes per Chinese character. This means
          that the Chinese text in an XML file may be 50% larger using
          UTF-8 than using Big5. The overall increase will be less if
          ASCII markup is used (e.g. if the DTD comes from the West),
          because ASCII characters take 1 byte in both encodings, and
          markup can be up to 50% of a document's text. In any case, the
          best way to keep file sizes down is probably compression.
        </p>
        <p class="li">
          ISO 10646 does not use the same order as any Chinese
          character set...you cannot use a simple algorithm to
          convert from Big5 and GB2312 into ISO10646. You must use
          a big table. But, on the other hand, ISO 10646 puts the
          Chinese characters into an order that may be more useful
          for sorting. And it removes duplicated characters, so
          searching may be better too. (I have been told that GBK
          is a character set which has all the ISO 10646 characters
          but keeps GB2312 characters in their same codepoints.
          That may be a good character set in some cases.)
        </p>

        </div>
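      <p>
        The size figures above can be checked with a few lines of Python.
        This is just an illustrative measurement on a made-up sample (ten
        repetitions of a two-character word, with and without a pair of
        ASCII tags); real documents will vary.
      </p>
      <p class="pre"><![CDATA[
```python
def encoded_sizes(text):
    """Return the size in bytes of `text` in each encoding."""
    return {name: len(text.encode(name)) for name in ("big5", "utf-8", "utf-16-le")}


han_only = "\u4e2d\u6587" * 10          # 20 Han characters
marked_up = "<p>" + han_only + "</p>"   # the same text with 7 ASCII markup characters

print(encoded_sizes(han_only))
print(encoded_sizes(marked_up))
```
]]></p>
      <p>
        For the Han-only sample, UTF-8 needs 60 bytes against 40 for Big5
        and UTF-16 (50% more). With the ASCII tags added, UTF-8 needs 67
        bytes against 47 for Big5, and UTF-16 grows to 54 because it
        spends 2 bytes on every ASCII character.
      </p>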
      <p>
        So XML files do not have to be encoded in UTF-8 or UTF-16.
        You can use Big5 or GB2312. But not much XML software
        supports the Chinese character sets. So it is good advice
        to move to UTF-8 or UTF-16 in the long run.
      </p>
    </a></qna>
    <qna  id="zh_xml_q17">
      <q>
        B.11. Why does software xxx work with Big5: the documentation
        says it does not?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="A happy accident" > 
      <p>
        Big5 is an "<span class="i">8-bit unsafe</span >" "<span class="i">ASCII-family</span >"
        coded character set.
      </p>
       <div class="ul">
        <p class="li">
          "ASCII-family" coded character sets (ASCII, ISO646,
          ISO8859-*, UTF-8, EUC, Big5, GB2312) are all the sets
          which have the ASCII characters in the ASCII codepoints.
          (E.g., where "A" has the codepoint 65 (0x41).) All ASCII
          characters have a value less than decimal 128 (0x80).
        </p>
        <p class="li">
          An "8-bit safe" character encoding is one in which, <span  class="b">
          if</span > a byte appears which has a value less than 128,
          <span  class="b">then</span > that byte always means the ASCII character.
          Shift-JIS and Big5 are not 8-bit safe, because the
          second byte of a multiple-byte character code can have a
          value less than 128 (0x80). The advantage of 8-bit safe
          encodings is that they are compatible with software which
          only looks at the ASCII characters for markup
          recognition.
        </p>
        <p class="li">
          A "7-bit safe" character encoding is one in which, <span  class="b">
          if</span > a byte appears which has a value less than 64
          (0x40), <span  class="b">then</span > that byte always means the ASCII
          character. Shift-JIS and Big5 are not 8-bit safe (because
          the second byte of a multiple-byte character code can
          have a value less than 128 (0x80)) but they are 7-bit
          safe (the second byte is always greater than 63 (0x3F)).
          7-bit safe encodings are compatible with software which
          only looks at the ASCII characters less than 64 (0x40)
          for delimiter recognition. In XML, all the XML delimiters
          [<![CDATA[<>&]]>%"'] have values less than 64 (0x40).
        </p>
      </div> 
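      <p>
        The definitions above can be tried out on real bytes. The
        following Python sketch looks at the second byte of two Big5
        characters; the function name and the choice of sample characters
        are assumptions made for this illustration.
      </p>
      <p class="pre"><![CDATA[
```python
# The XML delimiters named in the text, as ASCII bytes.
XML_DELIMITERS = b"<>&%\"'"


def ascii_range_trail_bytes(text, encoding="big5"):
    """Return the second bytes of two-byte characters that fall below 0x80,
    i.e. the bytes that make the encoding not 8-bit safe."""
    found = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) == 2 and encoded[1] < 0x80:
            found.append(encoded[1])
    return found


sample = "\u4e5f\u4e2d"   # U+4E5F is Big5 A4 5D; U+4E2D is Big5 A4 A4
low = ascii_range_trail_bytes(sample)

print([hex(b) for b in low])
print(all(b >= 0x40 for b in low))              # still 7-bit safe
print(all(d < 0x40 for d in XML_DELIMITERS))    # delimiters are all below 0x40
```
]]></p>
      <p>
        The first character has a second byte in the ASCII range, so
        Big5 is not 8-bit safe; but that byte is not below 0x40, so it
        cannot be mistaken for one of the delimiters listed above.
      </p>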
      <p>
        This means that there is a lot of XML software which will
        work with XML documents encoded in Big5. But it is an
        accident, because, <span class="i">strictly</span >, if an XML-system does
        not understand the encoding given in the XML encoding
        header, the XML processor should signal an error. In
        particular, such systems will probably not handle numeric
        character references (NCR) correctly (See question B.4). But
        they may be useful anyway, of course, even if they are
        non-conforming.
      </p>
      <p>
        There is a particular problem with Big5. See Question B.12.
      </p>
    </a></qna>
    <qna  id="zh_xml_q18">
      <q>
        B.12. Some Big5 documents fail with strange errors? Why?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Big5 is a poor encoding" >
      <p>
        The second byte of Big5 characters can cause problems on
        some systems. Big5 is not "8-bit safe" (see Question B.11.)
      </p>
      <p>
        The problems will show up only on systems which do not
        convert the Big5-encoded documents into an 8-bit safe
        internal format (e.g., Unicode, or UTF-8 or UTF-16.) On
        these systems, some bytes of the Big5 code will be
        interpreted as the wrong characters.
      </p>
      <p>
        The first problem occurs when you are using Native Language
        Markup (e.g., you are using Han ideographs for element
        names, attribute names, ID attributes, etc.). There is no
        way to fix this problem. If you must use that kind of
        software, then you must avoid using, in markup, any Big5
        character whose second byte is not a valid name character.
      </p>
      <p>
        The second problem occurs in the very rare case that you
        are using one of the following Han ideographs in a CDATA
        section, and that character is followed by the string
        <![CDATA["]>"]]>. To fix this problem, you can split the CDATA
        section into two CDATA sections, and sandwich the naughty
        character between them. The following characters all have
        the byte 5D as their second byte (in Big5): this is ASCII
        "]".
      </p>
      <p class="pre" xml:lang="zh-TW" xml:space="preserve" >
        &#20833;&#20063;&#21253;&#22240;&#27800;&#27667;&#20407;&#26613;&#33495;&#23403;&#23403;&#36001;     
&#23847;&#28139;&#35373;&#24380;&#29750;&#36305;&#24845;&#31391;&#27036;&#33976;&#22893;&#31293;
&#38660;&#29922;&#39208;&#32306;&#25851;&#40725;&#23363;&#39764;&#37313;&#20297;&#27790;&#23712;
&#29387;&#22426;&#26587;&#32965;&#23085;&#28056;&#32606;&#20575;&#24776;&#29307;&#33658;&#20626;
&#28977;&#33743;&#37217;&#24261;&#28376;&#32122;&#36201;&#22644;&#27031;&#31618;&#36355;&#23297;
&#28565;&#34036;&#37258;&#29543;&#34711;&#39199;&#29169;&#34732;&#39416;&#30993;&#37790;&#28711;
&#37183;&#28719;&#39468;&#37305;&#36501;&#40021;
      </p> 
      <p>
        (Note: If you cannot see all the characters, then see
        question B.13.)
      </p>
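      <p>
        The CDATA fix described above can be sketched as follows. This is
        only an illustration: the sample text is hypothetical, and
        &#20063; is used because it appears in the list above. The first
        line shows the risky form, in which the character is followed by
        "]&gt;"; the second line shows the section split in two, with the
        character as ordinary character data between the two sections:
      </p>
      <p class="pre" xml:lang="zh-TW" xml:space="preserve" >
&lt;![CDATA[a[&#20063;]&gt;0]]&gt;
&lt;![CDATA[a[]]&gt;&#20063;&lt;![CDATA[]&gt;0]]&gt;
      </p>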
    </a></qna>
    <qna   id="zh_xml_q19">
      <q>
        B.13. I cannot see all the characters on my HTML browser!
        Why?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Many reasons" >
      <p>
        If you cannot see all the characters, then
      </p>
       <div class="ul">
        <p class="li">
          your browser does not treat Numeric Character References
          correctly (according to HTML 4 or XML rules); or
        </p>
        <p class="li">
          you do not have the correct font installed or selected; or
        </p>
        <p class="li">
          your browser uses the "encoding" to determine which font
          to use, and the font it has selected does not have the
          characters.
        </p>
        </div>
   
      <p>Try changing the "Encoding" menu item (e.g. to Big5 or
      UTF-8): it is under a different menu on different browsers.</p>
    </a></qna>
    <qna   id="zh_xml_q20" date="1999-12-31">
      <q>
        B.14. What are Big5/GCCS, EUDC, and Big5plus?
      </q>
        <author>
                <name>Rick Jelliffe</name>
                <email>ricko@gate.sinica.edu.tw</email>
        </author>
        <a gist="Extended character sets" >
       
      <p>
        EUDC (Extended User-defined Characters) is the general name
        used in Hong Kong for standard sets of user-defined
        characters (sometimes called by the Japanese term <span class="i">
        gaiji</span >). They include R&amp;D EUDC, HKUST EUDC and GCCS
        EUDC.
      </p>
      <p>
        Big5 was developed in Taiwan. The "traditional" characters
        are also used in Hong Kong. But Hong Kong also uses other
        characters which are rarer in Taiwan. The Hong Kong
        Government has made the Government Chinese Character Set
        (GCCS), which is Big5 plus an additional 3049 characters.
        It seems to be in widespread use.
      </p>
      <p>
        Taiwan standards committees have also recently added an
        extra 7000 or so characters to Big5, calling it Big5plus.
        We are not sure how widely it is being used.
      </p>
      <p>
        Big5/GCCS, EUDC and Big5plus do not have registered IANA
        encoding names for the Internet. So be careful.
      </p>
      <p>
        For future interoperability, it is <span  class="b">important</span > that
        all WWW software give the correct headers in HTML and
        XML files. How can you trust your e-commerce data if you
        cannot know the character set? If you use these new
        versions of Big5, always put an extra comment or processing
        instruction at the head of the document to record which one. For
        XML, we suggest that you put, immediately after the XML
        declaration, a processing instruction with target
        "ascc-hint" and a pseudo-attribute "non-IANA".
      </p>
<p class="pre"><![CDATA[
        <?xml version="1.0" encoding="Big5" ?>
        <?ascc-hint non-IANA="Big5plus" ?>
        ]]>
      
</p>
      <p>and</p> 
<p class="pre"><![CDATA[
        <?xml version="1.0" encoding="Big5" ?>
        <?ascc-hint non-IANA="GCCS" ?>
        ]]>
</p>
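      <p>
        For HTML or XHTML files, a comparable sketch might look like the
        following. The <span class="tt">meta</span> element is the usual
        way to declare the character set there; the form of the comment is
        only our assumption, modelled on the processing instruction above,
        and is not an established convention:
      </p>
<p class="pre"><![CDATA[
        <meta http-equiv="Content-Type" content="text/html; charset=Big5" />
        <!-- ascc-hint non-IANA="GCCS" -->
        ]]>
</p>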
    </a></qna>
    </section>
    <section id="sect-3" >
    <title>C. XML Software</title>
    <qna  id="zh_xml_qc1">
      <q>
        C.1. What is the best free XML software for Chinese at the
        moment?
      </q><a gist="IE 5">
      <p>
        For Chinese XML, the best browser at the moment (April 1999) is
        probably Internet Explorer 5.0. The best XML parser is
        probably IBM's XML Parser for Java. The best XML/SGML
        parser is probably James Clark's SP software (C++). An XML
        version of Perl is coming too!
      </p>
 
      <p>For a listing of current XML tools, see Robin Cover's web pages
      at
      <link href="http://www.oasis-open.org/cover/xml.html#xmlSoftware"
      >http://www.oasis-open.org/cover/xml.html#xmlSoftware</link>. (1999-04-13)
      </p>
    </a></qna>
      
    <qna  id="zh_xml_qc2">
      <q>
        C.2. What are the best free XML validators for Chinese at the
        moment?
      </q><a gist="Microsoft">                                    
      <p>There are many XML tools which tell you whether an XML entity is
      well-formed. There are fewer tools which also check document
      validity against a DTD: Microsoft has a useful tool available at
      <link href="http://www.microsoft.com/xml/"
      >http://www.microsoft.com/xml/</link> (under "XML &amp; XSL Demos" and
      "XML Validator"). (1999-04-13)</p>
      
      <p>For a listing of current validation tools, see Robin Cover's web pages
      at
      <link href="http://www.oasis-open.org/cover/xml.html#xmlValResources"
      >http://www.oasis-open.org/cover/xml.html#xmlValResources</link>.
      </p>
    </a></qna>
    
    <qna  id="zh_xml_qc3">
      <q>
        C.3. What is the best free XSL software for Chinese at the
        moment?
      </q><a gist="XT">
      <p>                             
      At the moment, all XSL tools are experimental betas. The tools from
      James Clark (XT) and from IBM (LotusXSL) are probably the best. (1999-04-13)
      </p>
                                     
      <p>For a listing of current XSL tools, see Robin Cover's web pages
      at
      <link href="http://www.oasis-open.org/cover/xsl.html#xslSoftware"
      >http://www.oasis-open.org/cover/xsl.html#xslSoftware</link>. 
      </p>
    </a></qna>
    <qna  id="zh_xml_qc4">
      <q>
        C.4. What is the best free XHTML software for Chinese at the
        moment?
      </q><a gist="Tidy">
      <p>                             
      At the moment, all XHTML tools are experimental betas.
      Dave Raggett's tidy program at <link href="http://www.w3.org/People/Raggett/"
      >http://www.w3.org/People/Raggett/</link> can
      help convert HTML to XHTML. (1999-04-13)
      </p>
       
                                       
    </a></qna>
    
    
    
    </section>
    <section>
    <title>FAQ Information</title>
    <qna>
    <q>
      Where can I get more information?
    </q>
        <a gist="Links page">
    <p>
      Try the <link href="http://www.ascc.net/xml/">Chinese XML
      Now!</link> page at Academia Sinica, Taipei.
    </p>
    </a>
    </qna>
    <qna>
    <q>
      What is the "Chinese XML Now!" project?
    </q><a gist="Academia Sinica Computing Center, Taiwan">
    <p>
      This is a small project from the Computing Center at Academia
      Sinica, Taipei. It aims at providing information to
      developers of Chinese XML Software. It is sometimes difficult
      for non-Chinese-reading software developers to find useful
      information on the WWW; and when the project began, there was
      not much Chinese information on XML either.
    </p>
    <p>
      The project tries to support material equally in English and
      Chinese, and in UTF-8, Big5 and GB2312.
    </p>
 
    </a>
    </qna>
    <qna>
    <q>
      Who should I contact about this FAQ?
    </q><a gist="We welcome responses" >
    <p>
      We welcome corrections, questions and ideas. The contact for
      the English-language material is Rick Jelliffe: <link href= 
      "mailto:ricko@gate.sinica.edu.tw">
      ricko@gate.sinica.edu.tw</link>. The contact for the
      Chinese-language material is Chin-Tang Chang: <link href=
      "mailto:ctchang@gate.sinica.edu.tw">ctchang@gate.sinica.edu.tw</link>.
    </p>

    </a>
    </qna>
        </section>
    <section>
    <title>
      Contributions
    </title>
    <p>
      Thanks for corrections to Sidney Lu, John Cowan (plus apologies for
      the previous typo in his name), and Toshinori
      Numata.
    </p>
    </section>
    <section class="dc">
    <title>
      Cataloging Information (Dublin Core)
    </title>
<p class="pre-dc" xml:space="preserve" > <![CDATA[
<DC:TITLE       xml:lang="en">The Chinese XML FAQ (English version) </DC:TITLE>
<DC:CREATOR                  >Rick Jelliffe </DC:CREATOR>
<DC:CONTRIBUTOR xml:lang="zh-TW-Lt">Chin-Tang Chang</DC:CONTRIBUTOR>
<DC:SUBJECT     xml:lang="en">XML, SGML, Chinese, FAQ,
                              Big5, GB2312, Unicode, ISO 10646, UTF-8, UTF-16, 
                              Apache, Voyager </DC:SUBJECT>
<DC:DESCRIPTION xml:lang="en">Frequently Asked Questions about using XML for Chinese </DC:DESCRIPTION>
<DC:PUBLISHER   xml:lang="en">Computing Centre, Academia Sinica, Taiwan </DC:PUBLISHER>
<DC:TYPE        xml:lang="en">Text.Article </DC:TYPE>
<DC:DATE                     >1999-04-10 </DC:DATE>
<DC:RIGHTS                   >]]><link href="http://www.ascc.net/xml/en/utf-8/legal.html"
>http://www.ascc.net/xml/en/utf-8/legal.html</link><![CDATA[ </DC:RIGHTS>]]>
        
    
  </p>
  </section>
  </body>
</faq>

