Chinese Text Processing FAQ

Maintained by: Rick Jelliffe

Table of Contents

Chinese Text Processing FAQ

A. General

Is it difficult to support Chinese text processing?
Where do I start?
But my program uses 8-bit characters! Can I do Chinese?
How can users enter Chinese characters in non-Chinese computer systems?
What else should I look for?

B. Line Breaks

How should I generate line breaks for Chinese text?
How should I generate line breaks for Chinese marked-up text?
How wide is a Chinese character in constant-width fonts?
How should I generate line breaks for Chinese typeset text?
What about letter-quality Chinese publications?
What about good-quality Chinese publications?
How should I treat whitespaces in Chinese text?

C. Searching and Indexing

How do I search or index Chinese documents?
How do I search or index using characters?
What if I don't want to make an index, just search?
How do I search or index using words?

Cataloging Information (Dublin Core)


Chinese Text Processing FAQ

This FAQ gives information on localizing text processing systems to work on Chinese text. The focus is on systems which generate or use XML, HTML, SGML or XHTML.

This FAQ does not deal with locale issues such as dates, times, currency, user-interface language, or user-interface cultural conventions.

For more information on related topics, see the Chinese XML FAQ http://www.ascc.net/xml/en/utf-8/faq.html.

A. General

Is it difficult to support Chinese text processing?

No, but it depends on what you are localizing and what you are comparing it with. Chinese is probably easier to support than Japanese or Korean (though they have many similar needs) because there are fewer scripts involved and it is easy to understand the basic functionality required.

Where do I start?

Your software must be able to handle large character sets: Chinese has thousands of characters. (That does not mean it has to handle 16-bit characters: there are Chinese encodings built on 8-bit bytes; see the next question.)

If your system handles XML (or XHTML), it will almost certainly use Unicode (ISO 10646) internally. It should be able to accept and generate XML (or XHTML) in the UTF-8 and perhaps UCS-2 encodings.

The main XML (and XHTML) processors also support the two most popular Chinese encodings: Big5 (used in Taiwan and Hong Kong for traditional characters) and GB2312 (used in mainland China and in Singapore). They convert ("transcode") from these character encodings into Unicode characters internally.

There is more information about this in the Chinese XML FAQ http://www.ascc.net/xml/en/utf-8/faq.html.

But my program uses 8-bit characters! Can I do Chinese?

Most Chinese character sets use mixed-width (multibyte) encodings, so they will work to some extent with 8-bit systems. The safest thing to do is to transcode the text into UTF-8: this makes sure that byte values below 128 are only used for ASCII characters.
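
As an illustration of why UTF-8 is safe here, the following sketch (mine, not from the FAQ; the function name is my own) encodes a Unicode code point as UTF-8. Byte values below 128 are produced only for ASCII code points, so byte-oriented code that leaves high bytes alone cannot mistake part of a Chinese character for an ASCII letter.

/* Encode one Unicode code point (up to U+FFFF, enough for the Big5 and
   GB2312 repertoires) as UTF-8.  Returns the number of bytes written
   to buf.  A full encoder would also handle the four-byte form for
   code points above U+FFFF. */
int to_utf8(unsigned int cp, unsigned char buf[3]) {
  if (cp < 0x80) {                      /* ASCII: one byte */
    buf[0] = (unsigned char) cp;
    return 1;
  } else if (cp < 0x800) {              /* two bytes */
    buf[0] = (unsigned char) (0xC0 | (cp >> 6));
    buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
    return 2;
  } else {                              /* three bytes: the rest of the BMP */
    buf[0] = (unsigned char) (0xE0 | (cp >> 12));
    buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
  }
}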

For more information, see Ken Lunde's book "CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing", O'Reilly, ISBN 1-56592-224-7.

Western programmers need to be careful with terminology: the word "character" sometimes means a code number, sometimes an encoded sequence of bytes, and sometimes the abstract character itself (as opposed to the "glyph" that represents it in a font). So "character" does not equal "byte".

How can users enter Chinese characters in non-Chinese computer systems?

If you are not using a computer with a Chinese input method (a piece of operating system software which helps people type Chinese characters) then you have to provide some alternative.

One easy thing is this: allow users to type a numeric character reference in search dialog boxes, as a last resort. I recommend the XML, HTML and XHTML convention: "&#x" followed by the hexadecimal Unicode code of the character, followed by ";". Some Japanese dictionaries now include the Unicode codes; perhaps Chinese dictionaries will follow. In any case it is useful for testing.
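
For illustration, here is a small sketch (not from the FAQ; the function name is mine) that parses such a reference into a code point, returning -1 if the text does not start with a well-formed hexadecimal reference:

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse a hexadecimal numeric character reference such as "&#x4E2D;"
   and return the Unicode code point, or -1 if the string does not
   begin with a well-formed reference. */
long parse_hex_charref(const char *s) {
  char *end;
  long cp;

  if (strncmp(s, "&#x", 3) != 0)
    return -1;
  if (!isxdigit((unsigned char) s[3]))
    return -1;
  cp = strtol(s + 3, &end, 16);
  if (*end != ';' || cp < 0 || cp > 0x10FFFF)
    return -1;
  return cp;
}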

What else should I look for?

Check your system to make sure you don't do uppercasing anywhere just by using AND or OR operations on bytes. If your software was written for ASCII (7 bits) or for 8-bit character sets, and your document is in Big5 rather than UTF-8, you may corrupt characters when the software tries to uppercase a byte that it thinks is a lower-case letter but is actually just one byte of a multibyte character.

Your software should never uppercase a byte string by blindly iterating one byte at a time; it should first check whether it is looking at part of a multibyte character. One simple approach is to add a little switch (set from the command line, or keyed by the encoding) so that a byte is never uppercased if the previous byte was greater than 127. This works for Big5 (and for Japanese Shift-JIS).
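
Here is one possible sketch of that rule for a Big5-encoded byte string (my own illustration, not from the FAQ): uppercasing is skipped whenever the previous byte was greater than 127, so the trailing byte of a double-byte character, which may fall in the ASCII range, is never touched.

#include <ctype.h>
#include <stddef.h>

/* Uppercase an 8-bit string in place without corrupting Big5 text. */
void upcase_big5(unsigned char *s, size_t len) {
  size_t i;
  int prev_was_lead = 0;          /* was the previous byte a lead byte (> 127)? */

  for (i = 0; i < len; i++) {
    if (prev_was_lead) {
      prev_was_lead = 0;          /* trailing byte of a double-byte character: skip */
    } else if (s[i] > 127) {
      prev_was_lead = 1;          /* lead byte of a double-byte character */
    } else {
      s[i] = (unsigned char) toupper(s[i]);
    }
  }
}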

B. Line Breaks

How should I generate line breaks for Chinese text?

It depends on whether the text is marked-up data (XML, HTML, SGML, XHTML) or typeset output; they serve different purposes.

How should I generate line breaks for Chinese marked-up text?

Chinese text does not use spaces between words. Western word-wrap programs therefore cause a problem in Chinese: they reach the hard line-length limit and then insert a break. This makes the marked-up text look fine in a text editor, but when it is typeset there can be spurious spaces. Also, Western search routines respect whitespace: if a word in Chinese text has been broken by an automatic line-wrap program, searches for that word will fail.

It is best to try to break lines at Chinese punctuation characters: the lines will look uneven in the marked-up document, but when the document is typeset (or rendered to a screen) you will not have introduced any spurious spaces. Unless it is the usual way your program works, don't introduce breaks next to ASCII-range (< 127) characters: that may mess up embedded scripts and URIs.

Here is a C function that tests whether a Big5 character is breakable. In this example we are using 16-bit characters (the Big5-encoded characters have all been made 16 bits wide by padding the ASCII characters with a high byte of 0x00). It returns 0 if the character is not breakable, 1 if a break may come after it, and 2 if a break may come before it.

int isbreakableBig5(WCHAR c) {
  /* only the Big5 punctuation row 0xA1 is breakable; 'mode' and
     PREFORMATTED come from the surrounding program */
  if ((( c & 0xFF00 ) == 0xA100 ) && !( mode & PREFORMATTED )) {
    /* opening brackets have odd codes: break before them */
    if ((( c & 0x00FF ) > 0x5C ) && (( c & 0x00FF ) < 0xAD ) && (( c & 1 ) == 1))
      return ( 2 );
    return ( 1 );
  }
  return ( 0 );
}
     

Here is a C function that tests whether a UCS-2 (Unicode) character is breakable. It returns 0 if the character is not breakable, 1 if a break may come after it, and 2 if a break may come before it.

int isbreakableUnicode(WCHAR c) {
  int breakable = NO; 

  /* Break after any punctuation or space characters */ 
  if ( c >= 0x2000 ) { 
    if (((c >= 0x2000)&& ( c<= 0x2006 )) 
     || ((c >= 0x2008) && ( c<= 0x2010 )) 
     || ((c >= 0x2011) && ( c<= 0x2046 )) 
     || ((c >= 0x207D) && ( c<= 0x207E )) 
     || ((c >= 0x208D) && ( c<= 0x208E )) 
     || ((c >= 0x2329) && ( c<= 0x232A )) 
     || ((c >= 0x3001) && ( c<= 0x3003 )) 
     || ((c >= 0x3008) && ( c<= 0x3011 )) 
     || ((c >= 0x3014) && ( c<= 0x301F )) 
     || ((c >= 0xFD3E) && ( c<= 0xFD3F ))
     || ((c >= 0xFE30) && ( c<= 0xFE44 )) 
     || ((c >= 0xFE49) && ( c<= 0xFE52 )) 
     || ((c >= 0xFE54) && ( c<= 0xFE61 )) 
     || ((c >= 0xFE6A) && ( c<= 0xFE6B )) 
     || ((c >= 0xFF01) && ( c<= 0xFF03 )) 
     || ((c >= 0xFF05) && ( c<= 0xFF0A )) 
     || ((c >= 0xFF0C) && ( c<= 0xFF0F )) 
     || ((c >= 0xFF1A) && ( c<= 0xFF1B )) 
     || ((c >= 0xFF1F) && ( c<= 0xFF20 )) 
     || ((c >= 0xFF3B) && ( c<= 0xFF3D )) 
     || ((c >= 0xFF61) && ( c<= 0xFF65 ))) { 
      breakable = YES; 
    } 

    else switch (c) { 
      case 0xFE63: case 0xFE68: case 0x3030: 
      case 0x30FB: case 0xFF3F: case 0xFF5B: 
      case 0xFF5D: 
        breakable=YES; 
    } 

    /* but break before a left punctuation */ 
    if( breakable==YES ) { 
      if (((c >= 0x201A) && ( c <= 0x201C )) 
       || ((c >= 0x201E) && ( c <= 0x201F )) ) { 
       return ( 2 ); 
      } 
      else switch ( c ) { 
        case 0x2018: case 0x2039: case 0x2045: 
        case 0x207D: case 0x208D: case 0x2329: 
        case 0x3008: case 0x300A: case 0x300C: 
        case 0x300E: case 0x3010: case 0x3014: 
        case 0x3016: case 0x3018: case 0x301A: 
        case 0x301D: case 0xFD3E: case 0xFE35: 
        case 0xFE37: case 0xFE39: case 0xFE3B: 
        case 0xFE3D: case 0xFE3F: case 0xFE41: 
        case 0xFE43: case 0xFE59: case 0xFE5B: 
        case 0xFE5D: case 0xFF08: case 0xFF3B: 
        case 0xFF5B: case 0xFF62: 
          return ( 2 ); 
      } 
    } 
    if ( breakable == YES ) return ( 1 );
  } 
  return ( 0 );
}
     

How wide is a Chinese character in constant-width fonts?

Marked-up text is usually displayed in constant-width fonts. Line length calculations are based on this.

To figure out line-wrap positions with Chinese data you can either allow indent + ( linelength - indent )/2 characters per line, which works well enough if the data is all Chinese, or count columns rather than characters.

If the data is mixed Chinese and ASCII (as it often is in marked-up text), count columns: increment the line-length counter by 2 whenever you get a Chinese character, and by 1 otherwise.
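
Here is a minimal sketch of that column-counting rule (mine, not from the FAQ), assuming UCS-2 text and reusing the rough test that anything at or above U+2000 is a full-width Chinese character; a real program would use a finer classification such as Unicode's East Asian Width property.

#include <stddef.h>

typedef unsigned short WCHAR;   /* UCS-2 code unit, as in the examples above */

/* Count display columns for a UCS-2 string in a constant-width font.
   Characters at or above U+2000 count as two columns; everything
   else, including ASCII, counts as one. */
size_t display_width(const WCHAR *s, size_t len) {
  size_t i, cols = 0;
  for (i = 0; i < len; i++)
    cols += (s[i] >= 0x2000) ? 2 : 1;
  return cols;
}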

How should I generate line breaks for Chinese typeset text?

Draft-quality typesetting of Chinese is really easy: you just break the line when you cannot fit any more characters into it!

What about letter-quality Chinese publications?

For better quality typesetting, you should also check the breakability of the character at the end of the line and of its neighbour, using the functions above.

If the breakability is "break before", the character is something like an opening bracket, and you should move it to the start of the next line. (You may also want to adjust the justification of the line so that it stays fully justified; there are different styles of justification, but in general the idea is to maintain the grid-placement effect.) If the previous character is also a "break-before" character, move it too, and so on for any longer run of them.

If the breakability of the character just after the break point is "break after", that character is probably a closing bracket. Just like a closing parenthesis in Western text, it would look ugly at the start of a line, so keep it on the current line and break after it. If the next character is also a "break-after" character, keep that too, and so on for any longer run. You are allowed to increase the line length slightly to let these fit ("dangling punctuation"). A sketch of this adjustment appears below.
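
To make the procedure concrete, here is a rough sketch of the adjustment step (my own, not from the FAQ; the function name is mine). It takes a tentative break position and applies the two rules above, using the isbreakableUnicode() function from the earlier example.

#include <stddef.h>

typedef unsigned short WCHAR;            /* as in the examples above */
int isbreakableUnicode(WCHAR c);         /* defined earlier in this FAQ */

/* Adjust a tentative break position 'pos' (the break falls before s[pos]):
     - a "break before" character (opening bracket) must not end a line,
       so push it, and any run of them, down to the next line;
     - a "break after" character (closing bracket) must not start a line,
       so pull it, and any run of them, back onto the current line,
       accepting slightly "dangling" punctuation. */
size_t adjust_break(const WCHAR *s, size_t len, size_t pos) {
  while (pos > 1 && isbreakableUnicode(s[pos - 1]) == 2)
    pos--;                               /* opening bracket would end the line: move it down */
  while (pos < len && isbreakableUnicode(s[pos]) == 1)
    pos++;                               /* closing bracket would start a line: keep it here */
  return pos;
}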

What about good-quality Chinese publications?

The rules for typesetting good quality Chinese text are quite similar to the Japanese rules (the "kinsoku" rules). If you already handle Japanese line-breaking, your program will not need much more to handle Chinese as well.

The best source is Nadine Kano's book from Microsoft Press (it has "international" in the title). Ken Lunde's book (see above) also describes the Japanese rules, as well as dangling punctuation and justification.

How should I treat whitespaces in Chinese text?

As a general rule, inside Chinese text you should strip ASCII whitespace (space, tab, newline). It has probably been introduced into the document spuriously from markup, for example when a text processor automatically broke a line.

However, do not strip out spaces that are not ASCII spaces, in particular the non-breaking space and the "ideographic space" (which is the width of a Chinese ideographic character).

If the XML, HTML or XHTML document has been well marked up, there will be a language attribute applicable to the current element (called "lang" or "xml:lang"). If its value starts with "zh" (or "ZH", which is bad practice), the language of the element is Chinese (the "zh" comes from "zhong", the first syllable of one of the ways to say "Chinese" in Chinese). So a well-designed program will decide whether to strip spaces based on the language attribute.

If your data includes Western text, and it does not have language tagging, you can put in a test so that spaces are not stripped between non-Chinese characters. For example, in the simplest case of handling English, you can do a little test like if (c < 0x7F) return(DONT_STRIP); (if you are using Big5) or if (c < 0x2000) return(DONT_STRIP); (if you are using Unicode).
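
Putting these rules together, here is a rough sketch (my own, using the same simple "code point below 0x2000 means non-Chinese" test from the paragraph above) that removes a run of ASCII whitespace only when it sits between two Chinese characters. Non-ASCII spaces such as the non-breaking space and the ideographic space are never tested, so they are preserved automatically.

#include <stddef.h>

typedef unsigned short WCHAR;   /* UCS-2 code unit, as in the examples above */

/* Filter the UCS-2 text in place and return the new length.  A run of
   ASCII whitespace (space, tab, newline, carriage return) is dropped
   only when the characters on both sides are "Chinese" by the rough
   test used above (code point >= 0x2000). */
size_t strip_cjk_whitespace(WCHAR *s, size_t len) {
  size_t in = 0, out = 0;
  while (in < len) {
    if (s[in] == ' ' || s[in] == '\t' || s[in] == '\n' || s[in] == '\r') {
      size_t j = in;
      while (j < len && (s[j] == ' ' || s[j] == '\t' ||
                         s[j] == '\n' || s[j] == '\r'))
        j++;                            /* j = first character after the run */
      if (out > 0 && j < len &&
          s[out - 1] >= 0x2000 && s[j] >= 0x2000) {
        in = j;                         /* the whole run is spurious: drop it */
        continue;
      }
    }
    s[out++] = s[in++];
  }
  return out;
}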

C. Searching and Indexing

How do I search or index Chinese documents?

You must decide whether to implement searching or indexing by character or by word.

Chinese is very simple to search or index if you work by character rather than by word. It is very difficult to detect (or to get people to agree on) word boundaries, so unless you have specialists available, character-based searching or indexing will be enough.

How do I search or index using characters?

Chinese words are usually made from one to four characters; most are made from two characters.

Because there are no reliable indications of word boundaries in most Chinese text, when you search you cannot use the speed-up of skipping to the next word when a match fails; you just have to do a linear search.

Because of the large number of characters, there are many ways of indexing text which you would not attempt in English. The most straightforward is to say "one character == one word" and to index every character (i.e., a fully permuted index). You would never do this in English: imagine an index that found every occurrence of the letter "e" in a document!

You can also index on bigrams (two consecutive characters), trigrams, or even 4-grams. I read somewhere that anything more than 4-grams is not much use. The cost of these is, unfortunately, that because you don't know the word boundaries, you will get lots of spurious n-grams. On the other hand, because word boundaries are fairly subjective in Chinese (as in English: some people hyphenate, some don't) it is probably good to err on the side of having too many n-grams anyway.
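
As a simple illustration (not from the FAQ; the function name is mine), the following sketch walks UCS-2 text and reports each overlapping bigram whose two characters are both Chinese by the rough test used above; a real indexer would record the pair and its position instead of printing them.

#include <stdio.h>
#include <stddef.h>

typedef unsigned short WCHAR;   /* UCS-2 code unit */

/* Report every overlapping bigram (pair of consecutive characters)
   whose two members are both "Chinese" (code point >= 0x2000). */
void index_bigrams(const WCHAR *s, size_t len) {
  size_t i;
  for (i = 0; i + 1 < len; i++) {
    if (s[i] >= 0x2000 && s[i + 1] >= 0x2000)
      printf("bigram U+%04X U+%04X at position %lu\n",
             (unsigned) s[i], (unsigned) s[i + 1], (unsigned long) i);
  }
}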

What if I don't want to make an index, just search?

If you don't want to make an index, and if linear searching is not good enough, then there are various things you can do.

The simplest is to make a stop list for your document or collection: it tells you whether a character is used in the document at all. This gives fast, constant-time rejection whenever a character of the search string does not appear in the document. You can implement it as a small bit array: a 2 K array gives you a stop-list for 16,000 characters.
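
Here is one way such a bit array might look (a sketch of mine, not from the FAQ, assuming Big5 data and a simple mapping of my own from the two Big5 bytes to an ordinal below 16,384):

#include <stddef.h>

#define STOPLIST_BYTES 2048              /* 2 K bytes = 16,384 bits */

static unsigned char stoplist[STOPLIST_BYTES];

/* Map a Big5 character (16-bit value, lead byte in the high byte) to an
   ordinal below 16,384.  Lead bytes run 0xA1..0xF9 and trail bytes run
   0x40..0x7E and 0xA1..0xFE, i.e. at most 157 cells per row, so
   89 * 157 = 13,973 ordinals fit in the bit array.  Returns -1 for
   values outside those ranges. */
int big5_ordinal(unsigned int c) {
  unsigned int lead = (c >> 8) & 0xFF, trail = c & 0xFF;
  int cell;
  if (lead < 0xA1 || lead > 0xF9) return -1;
  if (trail >= 0x40 && trail <= 0x7E)      cell = (int) trail - 0x40;
  else if (trail >= 0xA1 && trail <= 0xFE) cell = 63 + ((int) trail - 0xA1);
  else return -1;
  return (int) (lead - 0xA1) * 157 + cell;
}

/* Record that the character occurs in the document. */
void stoplist_add(unsigned int c) {
  int n = big5_ordinal(c);
  if (n >= 0) stoplist[n >> 3] |= (unsigned char) (1u << (n & 7));
}

/* Returns nonzero if the character has been seen in the document. */
int stoplist_has(unsigned int c) {
  int n = big5_ordinal(c);
  return n >= 0 && (stoplist[n >> 3] & (1u << (n & 7))) != 0;
}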

This seems a good approach. (You can use the same trick for English: strip accents, fold case and remove punctuation, and you get a telegraph-style code of 5 bits per letter. Then index all the 10-bit bigrams of these codes: it only takes a 1 K array. Now you can tell really quickly whether your document contains "Exxon". Obviously that kind of speed-up is only worthwhile if people will be searching for unusual words: program mnemonics or foreign words.)

The next variant is to make an index giving the position of the first occurrence of each character. You don't need to store a full 16-bit address: you can divide the document into 255 blocks and store a one-byte block number per character, reserving one value to mean "not found", then start your search from that block. 16,000 characters can be indexed this way. It has the same fast, constant-time rejection when a character is not found, and better performance up to the first match. (Of course, if you are searching for a multi-character string, you start in the latest of the blocks recorded for its characters.)
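
Continuing the sketch above, such a first-occurrence table could be built like this (again my own illustration, reusing big5_ordinal() from the previous example and reserving the value 0 for "not found"):

#include <stddef.h>
#include <string.h>

int big5_ordinal(unsigned int c);        /* from the previous example */

/* firstblock[n] holds 1 + the block number (0..254) of the first
   occurrence of character ordinal n, or 0 if the character does not
   occur in the document at all. */
static unsigned char firstblock[89 * 157];

void build_first_occurrence(const unsigned char *doc, size_t doclen) {
  size_t blocksize = doclen / 255 + 1;   /* so block numbers stay in 0..254 */
  size_t i = 0;
  memset(firstblock, 0, sizeof firstblock);
  while (i + 1 < doclen) {
    if (doc[i] > 0x7F) {                 /* lead byte of a Big5 character */
      int n = big5_ordinal(((unsigned int) doc[i] << 8) | doc[i + 1]);
      if (n >= 0 && firstblock[n] == 0)
        firstblock[n] = (unsigned char) (i / blocksize + 1);
      i += 2;
    } else {
      i++;                               /* ASCII byte */
    }
  }
}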

The next wrinkle is to provide a "last occurrence" index as well, which again helps to reduce the number of linear searches that would fail.

Of course, some documents will use a more restricted vocabulary than others: I have read that 1,500 to 3,000 characters are typical vocabulary limits. Furthermore, users of GB2312 have far fewer characters available than users of Big5 and Unicode. So when you are evaluating which speed-up to pick, you should consider how big your collection is and which character set it uses.

How do I search or index using words?

Finding word boundaries in Chinese is an area of very active research. One problem is that the character-occurrence statistics of ordinary text are very different from those of proper names (people, places, jargon).

So statistical (dictionary) systems may work well in text without proper nouns, but not so well in other text.

There are other systems which use grammatical heuristics to parse the sentence and figure out word boundaries that way. These might be successful, but at a heavy processing cost.

I personally feel that the answer is neither dictionaries nor parsing but markup. When entering the data, there should either be markup of all proper nouns or markup of all word boundaries.

The latter is the simplest: when typing Chinese, put a space between words! That system works well in the West, keyboards already have space bars, and it is simple enough for people to do once they form the habit. The spaces should then be ignored by the publication system: they should be treated as "zero-width spaces".

Cataloging Information (Dublin Core)

 
<DC:TITLE       >Chinese Text Processing FAQ</DC:TITLE>
<DC:CREATOR     >Rick Jelliffe</DC:CREATOR>
<DC:SUBJECT     >localization, Chinese, internationalization, l10n, i18n,
   typesetting, XML, HTML, SGML, XHTML, text processing,
   line break, linebreak, word wrap, word wrap, index, search,
   bigram</DC:SUBJECT>
<DC:DESCRIPTION >Frequently Asked Questions about processing Chinese text</DC:DESCRIPTION>
<DC:PUBLISHER   >Computing Centre, Academia Sinica, Taiwan</DC:PUBLISHER>
<DC:TYPE        >Text.Article</DC:TYPE>
<DC:DATE        >1999-05-12</DC:DATE>
<DC:RIGHTS      >http://www.ascc.net/xml/en/utf-8/legal.html </DC:RIGHTS>