
Chinese XML Now!

Lossless Transcoders


Transcoding

A transcoder is software which converts text from one character encoding to another. If a character in the input text is not available in the output encoding, transcoders will usually:

  • strip the character out; or
  • replace it with some string like "??"; or
  • replace it with a special missing-character character.

In most cases, for example with the Java libraries, transcoder failures are not reported as errors to the caller.
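
The three behaviours listed above can be reproduced with the error handlers of Python's built-in codecs. This is only an illustrative sketch: Python is not one of the transcoders discussed here, and the sample string is an assumption. Note that Python's default handler ("strict") raises an exception, unlike the silent failures described above.

```python
# Illustrative only: Python's codec error handlers, not any transcoder
# named on this page. The sample text is made up for the example.
text = "XML 中文"

# 1. Strip characters that the output encoding cannot represent.
stripped = text.encode("ascii", errors="ignore")

# 2. Replace each unavailable character with "?".
replaced = text.encode("ascii", errors="replace")

# 3. When decoding, an undecodable byte becomes U+FFFD, the Unicode
#    replacement ("missing character") character.
marked = b"XML \xff".decode("utf-8", errors="replace")

print(stripped)       # b'XML '
print(replaced)       # b'XML ??'
print(ascii(marked))  # 'XML \ufffd'
```

In all three cases the original characters cannot be recovered from the output.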

A recoder will attempt to find the nearest character or some mnemonic if the output encoding does not have a particular character.

A lossless transcoder will instead generate a numeric character reference in place of the character. Most text formats have some form of numeric character reference, and it is increasingly common for them to express Unicode (ISO 10646) characters this way: HTML 4, XML, CSS and Java all do. However, each has a different format. xml-tcs is proof-of-concept software written to explore lossless transcoding. It is available below.
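
As a rough illustration of the idea (not of xml-tcs itself), Python's codecs offer two error handlers that emit numeric references instead of dropping characters: "xmlcharrefreplace" produces decimal XML/HTML references and "backslashreplace" produces Java-like \uXXXX escapes. The sample string is an assumption.

```python
import html

text = "XML 中文"

# Decimal numeric character references, as used in XML, HTML and SGML.
decimal_ncr = text.encode("ascii", errors="xmlcharrefreplace")

# Backslash escapes in the style of Java's \uXXXX.
backslash = text.encode("ascii", errors="backslashreplace")

print(decimal_ncr)  # b'XML &#20013;&#25991;'
print(backslash)    # b'XML \\u4e2d\\u6587'

# The transcoding is lossless: the references can be resolved again.
restored = html.unescape(decimal_ncr.decode("ascii"))
print(restored == text)  # True
```

The round trip is only unambiguous if the input did not already contain text that looks like a reference.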


Modal Lossless Transcoding

While xml-tcs is useful, it has also led us in a different direction. The first year of XML implementations has shown that programmers find it very hard to generate text with the correct delimiters. This is a problem of discipline and testing, which in turn means that it is something APIs need to help programmers with.

Furthermore, the problem is complicated because most markup languages and programming languages are modal: at certain points numeric character references are allowed, at others they are not. An XHTML document could have several appropriate modes:

  • in content (of an element or attribute);
  • in markup;
  • in a CDATA marked section;
  • in a CSS script;
  • in a JavaScript script.

In each of these modes, different delimiters and a different delimiting policy are appropriate.

We are currently working on a trial enhancement of GNU libc's iconv() function to support modal lossless transcoding. The programmer can register delimiters and transcoding behaviour (hex, decimal, halt-with-fail, strip) for each mode. Then, when generating the output data, the programmer need only set the appropriate mode first, e.g. set_iconv_mode(my_converter, CDATA);.

We hope to release this software in due course. We commend this approach for writers of transcoder libraries.
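
The modal approach described above can be sketched in a few lines of Python. This is not the iconv() enhancement: the class ModalTranscoder, its methods, the mode names and the chosen delimiters are all hypothetical, and the sketch converts one character at a time, which would not suit stateful encodings.

```python
class ModalTranscoder:
    """Hypothetical modal lossless transcoder (illustration only)."""

    POLICIES = {"hex", "decimal", "fail", "strip"}

    def __init__(self, encoding):
        self.encoding = encoding
        self.modes = {}       # mode name -> (policy, start, end)
        self.current = None

    def register_mode(self, name, policy, start="", end=""):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.modes[name] = (policy, start, end)

    def set_mode(self, name):
        if name not in self.modes:
            raise KeyError(f"mode not registered: {name}")
        self.current = name

    def convert(self, text):
        if self.current is None:
            raise RuntimeError("no mode set")
        policy, start, end = self.modes[self.current]
        out = bytearray()
        for ch in text:
            try:
                out += ch.encode(self.encoding)
            except UnicodeEncodeError:
                if policy == "fail":
                    raise          # halt-with-fail: report to the caller
                if policy == "strip":
                    continue       # drop the character
                # hex: at least four uppercase digits; decimal: unpadded
                digits = format(ord(ch), "04X" if policy == "hex" else "d")
                out += (start + digits + end).encode(self.encoding)
        return bytes(out)


tc = ModalTranscoder("ascii")
tc.register_mode("CONTENT", "hex", start="&#x", end=";")
tc.register_mode("JAVASCRIPT", "hex", start="\\u")
tc.register_mode("CDATA", "fail")

tc.set_mode("CONTENT")
in_content = tc.convert("<p>中</p>")
tc.set_mode("JAVASCRIPT")
in_script = tc.convert("'中'")

print(in_content)  # b'<p>&#x4E2D;</p>'
print(in_script)   # b"'\\u4E2D'"
```

In the hypothetical "CDATA" mode, where no reference syntax is available, the same call raises UnicodeEncodeError instead of silently losing the character.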


xml-tcs

tcs is a command-line transcoding utility from Bell Labs, associated with Plan 9. The source code is available on the WWW and is distributed with some versions of Linux. However, we have not received a response from Bell Labs concerning certain copyright questions, so we have not released the combined software here. Instead, you should download the source code from the Website (check the copyright notices), apply the patch file (using the patch utility, tar and an unzip program) and compile the source code yourself.
  • Source code from Bell Labs: tcs
  • Patch file for tcs from ASCC: xml-tcs
  • Source code for patch from Larry Wall: patch
  • Source code for GNU gzip from FSF: GNU gzip
The xml-tcs lossless transcoder utility has several options that allow different start and end delimiters for the numeric character reference (NCR).
Update 199-07-01: xml-tcs can now also generate decimal NCRs, suitable for SGML and Netscape. xml-tcs now supports the following NCR styles:
  • STRIP: no delimiter,
  • UNKNOWN: put in unknown character indicator "?" or FFFD
  • UNICODE: Unicode-style U+HHHH
  • JAVA: Java-style \uHHHH
  • JAVA_DD: Java-style \\uHHHH
  • XML: XML-style &#xHHHH;
  • XML_DD: XML-style &amp;#xHHHH;
  • SPREAD1: Old SPREAD &U-HHHH;
  • SPREAD1_DD: Old SPREAD &amp;U-HHHH;
  • SPREAD2: New SPREAD &UHHHH;
  • SPREAD2_DD: New SPREAD &amp;UHHHH;
  • CSS1: CSS1 \HHHH
  • CSS1_DD: CSS1 \\HHHH
  • CSS2: CSS2 \00HHHH (space following is delimiter)
  • CSS2_DD: CSS2 \\00HHHH (space following is delimiter)
  • SGML: SGML-, HTML (< 4) and Netscape style decimal &#DDDDDD;
  • SGML_DD: SGML-style &amp;#DDDDDD;
A manual page is available in UNIX man page format and text.
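
To make some of the formats listed above concrete, the following Python sketch formats one code point in several of the styles. It is not derived from the xml-tcs source: the table NCR_STYLES and the function format_ncr are hypothetical, and the choices of uppercase hex padded to four digits and unpadded decimal are assumptions.

```python
# Hypothetical style table; keys follow the option names listed above.
NCR_STYLES = {
    "UNICODE": "U+{cp:04X}",
    "JAVA": "\\u{cp:04X}",
    "JAVA_DD": "\\\\u{cp:04X}",
    "XML": "&#x{cp:04X};",
    "CSS1": "\\{cp:04X}",
    "SGML": "&#{cp:d};",
    "SGML_DD": "&amp;#{cp:d};",
}


def format_ncr(cp, style):
    """Format code point cp as a numeric character reference."""
    return NCR_STYLES[style].format(cp=cp)


def to_ascii(text, style):
    """Replace every non-ASCII character in text with an NCR."""
    return "".join(
        ch if ord(ch) < 128 else format_ncr(ord(ch), style) for ch in text
    )


print(format_ncr(0x4E2D, "UNICODE"))  # U+4E2D
print(format_ncr(0x4E2D, "XML"))      # &#x4E2D;
print(format_ncr(0x4E2D, "SGML"))     # &#20013;
print(to_ascii("XML 中", "JAVA"))      # XML \u4E2D
```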

dencr.c

dencr.c is a simple C program that converts decimal numeric character references (XML, HTML, SGML) into UTF-8 characters. We needed this utility because the XP program generated only ASCII characters, with NCRs for non-ASCII characters, when outputting HTML. Before we can transcode back to Big5, we have to dereference the NCRs.
Source code: dencr.c
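
The dereferencing step can be sketched in Python as follows. This is not a port of dencr.c: the function name dencr, the regular expression, and the decision to leave out-of-range references untouched are assumptions made for the example, and the input is assumed to be pure ASCII, as described above.

```python
import re

# Decimal numeric character references such as &#20013;
DECIMAL_NCR = re.compile(r"&#([0-9]+);")


def dencr(data: bytes) -> bytes:
    """Replace decimal NCRs in ASCII input with UTF-8 encoded characters."""

    def deref(match):
        cp = int(match.group(1))
        # Leave references that are not valid Unicode scalar values alone.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)

    return DECIMAL_NCR.sub(deref, data.decode("ascii")).encode("utf-8")


utf8 = dencr(b"XML &#20013;&#25991;")
print(utf8.decode("utf-8"))  # XML 中文

# Once the NCRs are resolved, the text can be transcoded to Big5.
big5 = utf8.decode("utf-8").encode("big5")
```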

Academia Sinica
