GLUE Loses User Encodings!

About GLUE


GLUE is a library of data and utilities that simplifies the task of creating transcoder software. It grows out of the Chinese XML Now! project at Academia Sinica Computing Centre and earlier work on lossless transcoders. GLUE will also allow trials of new functions required for Chinese documents.

Why GLUE?
Q. There are many excellent transcoding systems available, so why do we need GLUE?

A. Because all the current transcoders are optimised for, or targeted at, particular uses or systems:

  • Ulrich Drepper's GNU libc iconv() is integrated into the new GNU libc and is not widely available on other platforms; it is not modular;
  • Bell Labs' Plan 9 tcs has copyright and distribution restrictions;
  • The iconv() on UNIX systems has different standard configurations in different locales and brands;
  • The transcoding libraries available on other operating systems are similarly platform-dependent;
  • The transcoding libraries in Java (which are being redeveloped) miss some key functionality, in particular that they do not generate errors when an encoding error is detected;
  • The transcoding libraries of major open source applications (James Clark's SP, Netscape's Navigator) are well-decomposed but not readily severable into parts that can be used in isolation.

Furthermore:

  • The transcoders are not designed to support some particular East Asian requirements, in particular, support for user-defined characters and extensibility;
  • The transcoders are not designed to integrate into XML and other systems which use numeric character references; this affects program design and can cause missing characters during transcoding;
  • Transcoding is a complicated process that can involve both mapping tables and algorithmic mapping. Furthermore, many character encodings use a variable number of bytes per encoded character, so mapping tables (such as the excellent tables provided by the Unicode Consortium) are incomplete specifications of multibyte encodings: these tables alone cannot be used to generate complete transcoders automatically... and transcoding is a function that screams for automation!
  • It is clear that programmers who do not come from a text processing background are often sloppy about getting delimiters correct: the initial round of XML tools shows that even the largest corporations get it wrong. But perhaps rather than saying "this is the fault of the programmers" we can say "this means that the APIs being used are not at a helpful level of abstraction": one hope for the GLUE project is to develop more programmer-friendly transcoding APIs, in which delimiter handling is made more transparent to the programmer by treating it as a part of transcoding.
  • Current transcoders do not rewrite encoding declarations in headers, which would be useful for XML entities.
  • Character encodings have variants: typically there is an (inter)national standard set (e.g. ISO 8859-1), then some corporate extension (e.g. Microsoft's "ANSI" extensions), and then perhaps some private-use extensions. For example, Big5 is a de facto national standard, the Hong Kong government has its GCCS extensions, and private individuals may have further extensions on top of that. But no transcoders reflect this arrangement.
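Several of the points above (layered variants, variable-length byte sequences, strict error reporting, and numeric character references) can be illustrated with a minimal Python sketch. It is not GLUE code: the table contents, the lead-byte range, and the function names are all hypothetical toy values chosen only to show the shape of the problem.

```python
from collections import ChainMap

# All tables are hypothetical toy data: byte sequence -> Unicode code point.
STANDARD = {b"\xa4\x40": 0x4E00}   # base (inter)national standard layer
VENDOR = {b"\x88\x40": 0xE000}     # corporate extension layer
PRIVATE = {b"\xfa\x40": 0xF000}    # private user-defined characters

# Layers listed first are searched first, so an extension can override the base.
TABLE = ChainMap(PRIVATE, VENDOR, STANDARD)
REVERSE = {cp: seq for seq, cp in TABLE.items()}


def is_lead_byte(b):
    """Algorithmic part: in this toy encoding, 0x81-0xFE starts a two-byte character."""
    return 0x81 <= b <= 0xFE


def decode(data):
    """Decode bytes to a string, raising an error instead of silently dropping data."""
    out = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            seq = data[i:i + 2]
            if len(seq) < 2 or seq not in TABLE:
                raise ValueError(f"bad sequence {seq!r} at offset {i}")
            out.append(chr(TABLE[seq]))
            i += 2
        elif data[i] < 0x80:
            out.append(chr(data[i]))  # single-byte ASCII range
            i += 1
        else:
            raise ValueError(f"bad byte {data[i]:#x} at offset {i}")
    return "".join(out)


def encode(text):
    """Encode a string, writing unmappable characters as numeric character references."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out += bytes([cp])
        elif cp in REVERSE:
            out += REVERSE[cp]
        else:
            out += f"&#x{cp:X};".encode("ascii")  # nothing is lost in XML output
    return bytes(out)
```

The mapping tables alone do not say which bytes begin a two-byte character; that knowledge lives in `is_lead_byte`, which is the kind of information a table-only specification leaves out.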

The result is that the current generation of transcoders (in particular, the open source transcoders) is not suitable for someone trying to add a modest level of transcoding in a platform-independent way. A sign of this difficulty is that transcoder support in specialist languages is almost non-existent.


How GLUE?
The GLUE approach is to put all information relating to an encoding into a single XML document: names, mapping tables, multi-byte encoding detection, range-checking, transformations, and anything else that is needed. These XML documents can then be transformed into many kinds of programs and utilities, using XSL, OmniMark, Perl, Python, Java, JavaScript, Common LISP, or any language in which DOM has been implemented.
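As a rough illustration of this approach, the following Python sketch reads a small XML description of an encoding and builds a decoder from it. The element and attribute names (`encoding`, `alias`, `lead-bytes`, `map`) are invented for this example and are not the GLUE DTD; the mapping entries are toy data.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification format; NOT the actual GLUE DTD.
SPEC = """\
<encoding name="toy-2byte">
  <alias>x-toy</alias>
  <lead-bytes first="81" last="FE"/>
  <map bytes="A440" ucs="4E00"/>
  <map bytes="A441" ucs="4E59"/>
</encoding>
"""


def build_decoder(spec_xml):
    """Turn a declarative specification into a (name, decode function) pair."""
    root = ET.fromstring(spec_xml)
    lead = root.find("lead-bytes")
    first = int(lead.get("first"), 16)
    last = int(lead.get("last"), 16)
    # Mapping table: byte sequence -> character.
    table = {
        bytes.fromhex(m.get("bytes")): chr(int(m.get("ucs"), 16))
        for m in root.iter("map")
    }

    def decode(data):
        out = []
        i = 0
        while i < len(data):
            if first <= data[i] <= last:
                # Multi-byte detection comes from the specification, not the table.
                seq = data[i:i + 2]
                if seq not in table:
                    raise ValueError(f"bad sequence {seq!r} at offset {i}")
                out.append(table[seq])
                i += 2
            elif data[i] < 0x80:
                out.append(chr(data[i]))
                i += 1
            else:
                raise ValueError(f"bad byte {data[i]:#x} at offset {i}")
        return "".join(out)

    return root.get("name"), decode


name, decode = build_decoder(SPEC)
print(name, decode(b"A\xa4\x40\xa4\x41"))
```

The same document could equally be fed to a stylesheet or script that emits C or Java source rather than building the decoder in memory; the point is only that the specification stays separate from any one implementation.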

In a sense, GLUE represents a very conventional separation of specification from implementation. However, it must be admitted that writing programming language code inside other programming languages often produces messy or ugly-looking code :-)


What GLUE?
We want to provide:
  • implementation-neutral, declarative specifications of encodings (with respect to Unicode 3);
  • example transformation utilities that you can steal and customize;
  • readily portable and unencumbered versions of standard or well-known C and UNIX transcoders; the goal is portability rather than optimality;
  • implementations of the kinds of extra functionality we think it would be useful for all transcoding systems to have.

We want to encourage software systems which

  • use ISO 10646 internally and which
  • import and export text safely in any regional encoding.

GLUE Transcoding Specification Library
The specifications allow the generation of transcoders both "to" and "from" ISO 10646 unless noted. Basic Transcoder specifications are available for the following encodings and variants:

Here is the current DTD used.

When mapping tables were unavailable on ftp.unicode.org we used:

  • Microsoft tables for Big5 user-defined characters
  • Roman Czyborra's pages for the latest 8859
  • man -s5 iconv on Sun for iso646 variants
 

Academia Sinica
