Chinese Text Processing FAQ
Version 0.1
Rick Jelliffe (ricko@gate.sinica.edu.tw)
Keywords: localization, Chinese, internationalization, l10n, i18n, typesetting, XML, HTML, SGML, XHTML, text processing, line break, linebreak, word wrap, index, search, bigram

This FAQ gives information on localizing text processing systems to work on Chinese text. The focus is on systems which generate or use XML, HTML, SGML or XHTML.

This FAQ does not deal with locale issues such as dates, time, money, user interface language, user interface cultural conventions.

For more information on related topics, see the Chinese XML FAQ http://www.ascc.net/xml/en/utf-8/faq.html.

A. General

Is it difficult to support Chinese text processing?

No, but it depends on what you are localizing and what you are comparing it with. Chinese is probably easier to support than Japanese or Korean (though they have many similar needs) because fewer scripts are involved and the basic functionality required is easy to understand.

Where do I start?

Your software must be able to handle large character sets: Chinese has thousands of characters. (That does not mean it has to handle 16-bit characters: some Chinese encodings work in 8-bit units; see the next question.)

If your software processes XML (or XHTML), it will almost certainly use Unicode (ISO 10646) internally. It should be able to accept and generate XML (or XHTML) in UTF-8 and perhaps UCS-2 encodings.

The main XML (and XHTML) processors also support the two most popular Chinese encodings: Big5 (used in Taiwan and Hong Kong for traditional characters) and GB2312 (used in mainland China and in Singapore). They convert ("transcode") from these character encodings into Unicode characters internally.

There is more information about this in the Chinese XML FAQ http://www.ascc.net/xml/en/utf-8/faq.html.

But my program uses 8-bit characters! Can I do Chinese?

Most Chinese character sets use mixed-size encodings, so they will work to some extent with 8-bit systems. The safest thing to do is to transcode the text into UTF-8: this ensures that byte values below 128 are used only for ASCII characters.
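To see why UTF-8 is safe for 8-bit software, here is a minimal sketch of a UTF-8 encoder in C. The function name utf8_encode is hypothetical and not part of any library. Every byte it emits for a non-ASCII character has the high bit set, so a byte below 128 can only ever be a real ASCII character.

```c
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Returns the number of bytes written to out (1 to 4), or 0 if cp
   is a surrogate or lies outside the Unicode range. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte, high bit clear */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes, both >= 0x80 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* three bytes: most Chinese characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+4E2D is encoded as the three bytes 0xE4 0xB8 0xAD, none of which can be mistaken for an ASCII letter, quote or angle bracket.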

For more information, see Ken Lunde's book "Chinese, Japanese, Korean, Vietnamese Information Processing", O'Reilly ISBN 1-56592-224-7.

Western programmers need to be careful with terminology: when "character" is said, sometimes it means a code number, sometimes an encoded sequence of bytes, and sometimes the idea of the character (rather than the "glyph" in a font). So "character" does not equal "byte".

How can users enter Chinese characters in non-Chinese computer systems?

If you are not using a computer with a Chinese input method (a piece of operating system software which helps people type Chinese characters) then you have to provide some alternative.

One easy option, as a last resort, is to let users type a numeric character reference in search dialog boxes. I recommend the XML, HTML and XHTML convention, which is "&#x" followed by the hexadecimal code of the character in Unicode, followed by ";". Some Japanese dictionaries now include the Unicode codes; perhaps Chinese dictionaries will follow. In any case, it will be useful for testing.
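A search dialog could recognise such a reference with something like the following sketch. The function name parse_hex_char_ref is hypothetical, and the sketch assumes the whole input string is a single reference of the "&#x...;" form.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: parse a string of the form "&#xHHHH;" and
   return the Unicode code number, or -1 if the string is not a
   well-formed hexadecimal character reference. */
long parse_hex_char_ref(const char *s)
{
    long value = 0;
    size_t i = 3;

    if (s == NULL || s[0] != '&' || s[1] != '#' || s[2] != 'x')
        return -1;
    if (!isxdigit((unsigned char)s[i]))   /* need at least one hex digit */
        return -1;
    for (; isxdigit((unsigned char)s[i]); i++) {
        int c = tolower((unsigned char)s[i]);
        int digit = (c >= 'a') ? c - 'a' + 10 : c - '0';
        value = value * 16 + digit;
        if (value > 0x10FFFF)             /* beyond the Unicode range */
            return -1;
    }
    if (s[i] != ';' || s[i + 1] != '\0')  /* must end with ";" and nothing else */
        return -1;
    return value;
}
```

With this, typing "&#x4E2D;" into the dialog yields the code number 0x4E2D, which the program can then convert to whatever encoding it searches in.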

What else should I look for?

Check your system to make sure it does not uppercase anywhere just by using AND or OR operations. If your software was written for ASCII (7 bits) or for 8-bit characters, and your document is in Big5 rather than UTF-8, you may corrupt characters when your software tries to uppercase what it thinks is a lowercase letter but is actually one byte of a multibyte character.

Your software should never uppercase a byte string by iterating one byte at a time without first checking whether each byte is part of a multibyte character. One simple fix is to add a little switch (from the command line, or keyed by the encoding) so that uppercasing never happens if the previous byte was greater than 127. This should work for Big5 (and the Japanese Shift-JIS).
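Here is a minimal sketch of a slightly stricter variant of that switch for Big5. Instead of looking only at the previous byte, it skips each two-byte character as a unit. The function name big5_upper is hypothetical, and the sketch assumes that every byte of 0x80 or above is the first byte of a two-byte character, which is a simplification of the real Big5 byte ranges.

```c
#include <stddef.h>

/* Hypothetical helper: uppercase the ASCII letters in a Big5 byte
   string in place, leaving both bytes of every two-byte character
   untouched. Assumes any byte >= 0x80 starts a two-byte character. */
void big5_upper(unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        if (s[i] >= 0x80 && i + 1 < len) {
            i += 2;                       /* skip lead byte and trail byte */
        } else {
            if (s[i] >= 'a' && s[i] <= 'z')
                s[i] = (unsigned char)(s[i] - ('a' - 'A'));
            i++;
        }
    }
}
```

The second byte of a Big5 character can fall in the range of the ASCII lowercase letters, which is exactly the case a byte-at-a-time loop corrupts. For the bytes 0xA4 0x61 followed by "ab", this sketch leaves the first two bytes alone and uppercases only the trailing "ab".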

B. Line Breaks

How should I generate line breaks for Chinese text?

It depends on whether the text is marked-up data (XML, HTML, SGML, XHTML) or typeset text. They serve different purposes.

How should I generate line breaks for Chinese marked-up text?

Chinese text does not use spaces between words. Western word-wrap programs cause a problem in Chinese: they get to the hard line-length limit and then stick in a break. This makes the marked-up text look OK in a text editor, but when it is typeset, there can sometimes be spurious spaces. Also, Western search routines respect whitespace: if a word in Chinese text was broken by an automatic line-wrap program, searches for that word will fail.

It is best to try to break lines at Chinese punctuation characters: the lines will look uneven in the marked-up document, but when the document is typeset (or rendered to a screen) you will not have introduced any spurious spaces. Unless it is the usual way your program works, don't introduce breaks at ASCII-range (< 128) characters: that may mess up embedded scripts and URIs.

Here is a C function that tests whether a Big5 character is breakable. In this example, we are using 16-bit characters (we have made the Big5 encoded characters all 16-bit characters by padding the ASCII characters with a high byte of 0x00). It returns 0 if the character is not breakable, 1 if the line can be broken after it, and 2 if the line can be broken before it.
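The 16-bit padding described above could be done along these lines. This is only a sketch: the function name big5_to_wide is hypothetical, and it assumes that every byte of 0x80 or above is the first byte of a two-byte character.

```c
#include <stddef.h>

/* Hypothetical helper: pack a Big5 byte string into 16-bit units.
   ASCII bytes get a high byte of 0x00; a two-byte character becomes
   (lead << 8) | trail. The out array must have room for len units.
   Returns the number of 16-bit units written. */
size_t big5_to_wide(const unsigned char *in, size_t len, unsigned short *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        if (in[i] >= 0x80 && i + 1 < len) {
            out[n++] = (unsigned short)((in[i] << 8) | in[i + 1]);
            i += 2;
        } else {
            out[n++] = in[i++];           /* ASCII, or a stray final byte */
        }
    }
    return n;
}
```

After this step, one array element is always one character, so later code can test and count characters without tracking lead and trail bytes.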

Here is a C function that tests whether a UCS-2 (Unicode) character is breakable. It returns 0 if the character is not breakable, 1 if the line can be broken after it, and 2 if the line can be broken before it.

    /* The opening lines of this listing are missing from the source.
       The function name and the YES/NO macros below are assumed. */
    #define NO  0
    #define YES 1

    int ucs2_breakable(unsigned short c)
    {
        int breakable = NO;

        if (c >= 0x2000) {
            if (((c >= 0x2000) && (c <= 0x2006)) ||
                ((c >= 0x2008) && (c <= 0x2010)) ||
                ((c >= 0x2011) && (c <= 0x2046)) ||
                ((c >= 0x207D) && (c <= 0x207E)) ||
                ((c >= 0x208D) && (c <= 0x208E)) ||
                ((c >= 0x2329) && (c <= 0x232A)) ||
                ((c >= 0x3001) && (c <= 0x3003)) ||
                ((c >= 0x3008) && (c <= 0x3011)) ||
                ((c >= 0x3014) && (c <= 0x301F)) ||
                ((c >= 0xFD3E) && (c <= 0xFD3F)) ||
                ((c >= 0xFE30) && (c <= 0xFE44)) ||
                ((c >= 0xFE49) && (c <= 0xFE52)) ||
                ((c >= 0xFE54) && (c <= 0xFE61)) ||
                ((c >= 0xFE6A) && (c <= 0xFE6B)) ||
                ((c >= 0xFF01) && (c <= 0xFF03)) ||
                ((c >= 0xFF05) && (c <= 0xFF0A)) ||
                ((c >= 0xFF0C) && (c <= 0xFF0F)) ||
                ((c >= 0xFF1A) && (c <= 0xFF1B)) ||
                ((c >= 0xFF1F) && (c <= 0xFF20)) ||
                ((c >= 0xFF3B) && (c <= 0xFF3D)) ||
                ((c >= 0xFF61) && (c <= 0xFF65))) {
                breakable = YES;
            } else switch (c) {
                case 0xFE63: case 0xFE68: case 0x3030: case 0x30FB:
                case 0xFF3F: case 0xFF5B: case 0xFF5D:
                    breakable = YES;
                    break;
                default:
                    break;
            }

            /* but break before a left punctuation */
            if (breakable == YES) {
                if (((c >= 0x201A) && (c <= 0x201C)) ||
                    ((c >= 0x201E) && (c <= 0x201F))) {
                    return 2;
                } else switch (c) {
                    case 0x2018: case 0x2039: case 0x2045: case 0x207D:
                    case 0x208D: case 0x2329: case 0x3008: case 0x300A:
                    case 0x300C: case 0x300E: case 0x3010: case 0x3014:
                    case 0x3016: case 0x3018: case 0x301A: case 0x301D:
                    case 0xFD3E: case 0xFE35: case 0xFE37: case 0xFE39:
                    case 0xFE3B: case 0xFE3D: case 0xFE3F: case 0xFE41:
                    case 0xFE43: case 0xFE59: case 0xFE5B: case 0xFE5D:
                    case 0xFF08: case 0xFF3B: case 0xFF5B: case 0xFF62:
                        return 2;
                    default:
                        break;
                }
            }

            /* every other breakable character: break after it */
            if (breakable == YES)
                return 1;
        }
        return 0;
    }

How wide is a Chinese character in constant-width fonts?

Marked-up text is usually displayed in constant-width fonts. Line length calculations are based on this.

To figure out line-wrap positions with Chinese data, allow indent + (linelength - indent)/2 characters per line, because each Chinese character takes up two columns in a constant-width font. This works OK if the data is all Chinese.

If the data is mixed Chinese and ASCII (as it often is in marked-up text) then you should increment the linelength counter by 2 when you get a Chinese character.
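Both calculations can be sketched as follows for UCS-2 data. The function names are hypothetical, and the is_wide test is a rough assumption that covers only the main ideograph block, CJK punctuation and the fullwidth forms; a real program would need a fuller table.

```c
#include <stddef.h>

/* Characters per line for all-Chinese data: the indent is counted in
   single-width columns, the rest in double-width columns.
   Assumes indent <= linelength. */
size_t chinese_chars_per_line(size_t linelength, size_t indent)
{
    return indent + (linelength - indent) / 2;
}

/* Rough test (an assumption of this sketch): does this UCS-2
   character occupy two columns in a constant-width font? */
static int is_wide(unsigned short c)
{
    return (c >= 0x3000 && c <= 0x303F) ||   /* CJK symbols and punctuation */
           (c >= 0x4E00 && c <= 0x9FFF) ||   /* CJK unified ideographs */
           (c >= 0xFF01 && c <= 0xFF60);     /* fullwidth forms */
}

/* Column count for mixed Chinese and ASCII text: add 2 for each
   Chinese character and 1 for everything else. */
size_t display_columns(const unsigned short *s, size_t len)
{
    size_t i, columns = 0;

    for (i = 0; i < len; i++)
        columns += is_wide(s[i]) ? 2 : 1;
    return columns;
}
```

With an 80-column line and a 4-column indent, the first function allows 42 characters per line (4 indent characters plus 38 Chinese characters). The second counts "XML" followed by two Chinese characters as 7 columns.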

How should I generate line breaks for Chinese typeset text?

Draft-quality typesetting of Chinese is really easy: you just break the line when you cannot fit any more characters in it!

What about letter-quality Chinese publications?

For better-quality typesetting, you should also check the breakability of the character at the end of the line and of its neighbour, using the functions above.

If the breakability of the last character on the line is "break before", then the character is something like an opening bracket, and you should move it to the start of the next line. (Perhaps you should adjust the justification of the line so that it stays fully justified. There are different styles of justification possible: in general the idea is to try to maintain the grid-placement effect.) If the previous character is also a "break before", then move that too. (And if there are three of these characters...??)

If the breakability of the character after the break point is "break after", then that character is probably a closing bracket. Just like parentheses in Western text, it would look ugly at the start of a line. So keep it on the current line, and break after it. If the next character is also a "break after", then move that too. (And if there are three of these characters...??) You are allowed to increase the line length slightly to allow these to fit ("dangling punctuation").
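One way to handle runs of any length is to loop, as in the following sketch. It is an illustration, not a complete line breaker: the names toy_breakable and adjust_break are hypothetical, and the toy classifier lists only a handful of code points so that the example is self-contained. A real program would call a full breakability function instead.

```c
#include <stddef.h>

/* Toy classifier (an assumption of this sketch): 0 = not breakable,
   1 = break after, 2 = break before. Only a few characters listed. */
static int toy_breakable(unsigned short c)
{
    switch (c) {
    case 0x3008: case 0x300A: case 0x300C: case 0x300E:
    case 0x3010: case 0xFF08:
        return 2;                         /* opening brackets */
    case 0x3001: case 0x3002: case 0x3009: case 0x300B:
    case 0x300D: case 0x300F: case 0x3011: case 0xFF09: case 0xFF0C:
        return 1;                         /* closing brackets, comma, full stop */
    default:
        return 0;
    }
}

/* text[pos] is the first character that no longer fits on the line.
   Returns the index at which the next line should start. */
size_t adjust_break(const unsigned short *text, size_t len, size_t pos)
{
    size_t p = pos;

    /* "Break after" characters must not start a line: let them
       dangle on the current line, however many there are. */
    while (p < len && toy_breakable(text[p]) == 1)
        p++;
    if (p != pos)
        return p;

    /* "Break before" characters must not end a line: push them,
       however many there are, onto the next line. */
    while (p > 0 && toy_breakable(text[p - 1]) == 2)
        p--;

    /* Never return an empty line; fall back to the plain break. */
    return (p == 0) ? pos : p;
}
```

The caller still has to decide how far the line may grow for dangling punctuation and how to re-justify a line that has been shortened.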

What about for good quality Chinese publications?

The rules for typesetting good-quality Chinese text are quite similar to the Japanese rules (the "Kinsoku" rules). If you already handle Japanese line-breaking, your program will not need much more to handle Chinese as well.

The best source is Nadine Kano's book from Microsoft Press (it has "international" in the title). Also, Ken Lunde's book (see above) has a description of the Japanese rules. His book also discusses dangling punctuation and justification.

How should I treat whitespace in Chinese text?

As a general rule, inside Chinese text you should strip ASCII whitespace (space, tab, newline). It has probably been introduced into the document spuriously from markup, for example when a text processor automatically broke a line.

However, do not strip out spaces that are not ASCII whitespace: the non-breaking space and the "ideographic space" (which is the size of a Chinese ideographic character).

If the XML, HTML or XHTML document has been well marked up, then there will be a language attribute applicable to the current element (called "lang" or "xml:lang"). If its value starts with "zh" (or "ZH" in bad practice) then the language of the element is Chinese (the "zh" comes from "zhongwen", one of the ways Chinese say "Chinese" in Chinese...clear?). So a well-designed program will decide whether to strip spaces based on the language attribute.
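The attribute test might look like the sketch below. The function name lang_is_chinese is hypothetical. As a design choice of this sketch, it also requires that "zh" be followed by the end of the value or by a hyphen, so that values such as "zh-TW" match but an unrelated code that merely begins with the same two letters does not.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: does a lang or xml:lang value denote Chinese?
   Accepts "zh" in any letter case, alone or followed by a hyphen and
   a subtag (for example "zh-TW"). */
int lang_is_chinese(const char *lang)
{
    if (lang == NULL)
        return 0;
    if (tolower((unsigned char)lang[0]) != 'z' ||
        tolower((unsigned char)lang[1]) != 'h')
        return 0;
    return lang[2] == '\0' || lang[2] == '-';
}
```

A program would call this with the nearest applicable attribute value while walking the element tree, and switch space stripping on or off accordingly.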

If your data includes Western text, and it does not have language tagging, you can put in a test so that spaces are not stripped between non-Chinese characters. For example, in the simplest case of handling English, you can test whether the characters on both sides of the space are in the ASCII range; the exact comparison depends on whether you are using Big5 or Unicode.
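For UCS-2 data, the stripping rules above can be sketched like this. The function name strip_chinese_whitespace is hypothetical, and the sketch assumes that "Western" simply means a code below 0x80, which is enough for English only.

```c
#include <stddef.h>

/* ASCII whitespace only: the non-breaking space (0x00A0) and the
   ideographic space (0x3000) are deliberately not included. */
static int is_ascii_space(unsigned short c)
{
    return c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
}

/* Hypothetical helper: copy in[0..len) to out, dropping each run of
   ASCII whitespace unless the characters on both sides of the run
   are ASCII. Runs at the very start or end are dropped as well.
   The out array must have room for len units. Returns the number of
   units written. */
size_t strip_chinese_whitespace(const unsigned short *in, size_t len,
                                unsigned short *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        size_t j, k;

        if (!is_ascii_space(in[i])) {
            out[n++] = in[i++];
            continue;
        }
        /* find the end of this whitespace run */
        for (j = i; j < len && is_ascii_space(in[j]); j++)
            ;
        /* keep the run only between two ASCII characters */
        if (n > 0 && j < len && out[n - 1] < 0x80 && in[j] < 0x80) {
            for (k = i; k < j; k++)
                out[n++] = in[k];
        }
        i = j;
    }
    return n;
}
```

So the space in "a b" survives, a newline between "XML" and a Chinese character is removed, and an ideographic space between two Chinese characters is left alone.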

Cataloging Information (Dublin Core)

Title: Chinese Text Processing FAQ
Creator: Rick Jelliffe
Subject: localization, Chinese, internationalization, l10n, i18n, typesetting, XML, HTML, SGML, XHTML, text processing, line break, linebreak, word wrap, index, search, bigram
Description: Frequently Asked Questions about processing Chinese text
Publisher: Computing Centre, Academia Sinica, Taiwan
Type: Text.Article
Date: 1999-05-12
Rights: http://www.ascc.net/xml/en/utf-8/legal.html