GLUE Loses User Encodings!

GLUE Software (beta)

Version: 1999-07-30 seems to be good. We are not providing a zipped or tarred distribution this week.


GLUE The GLUE project now has the following available:
  • Abstract specifications for transcoders in XML.
  • Command line utilities implementing individual transcoders, and an implementation of the UNIX iconv command line utility (a.k.a. tcs). The C source is generated from the abstract specifications using XSL. Only transcoders to UTF-8 are currently available. The current source is geared to UNIX but should be trivially portable.

Projects currently underway include:

  • Transcoders from UTF-8.
  • An implementation of the standard C iconv() routine.
  • Augmentations of the transcoders to cope with specific issues found in XML and other Web documents.

Specification in XML
Individual Transcoder Utilities to UTF-8
ASCII
  • ISO 646de
  • ISO 646en
  • ISO 646es
  • ISO 646fr
  • ISO 646it
  • ISO 646sv
source & header (C program)
asciiIn (SPARC Binary)
ISO 8859-1 (Latin 1)
  • CP1252 variant (Windows "ANSI")
source & header (C program)
iso8859_1In (SPARC Binary)
ISO 8859-2 (Latin 2)
  • CP 1250 variant (Windows)
source & header (C program)
iso8859_2In (SPARC Binary)
ISO 8859-3 (Latin 3) source & header (C program)
iso8859_3In (SPARC Binary)
iso8859_3In.exe (Win32 Executable)
ISO 8859-4 (Latin 4) source & header (C program)
iso8859_4In (SPARC Binary)
iso8859_4In.exe (Win32 Executable)
ISO 8859-5 (Cyrillic) source & header (C program)
iso8859_5In (SPARC Binary)
iso8859_5In.exe (Win32 Executable)
ISO 8859-6 (Arabic) source & header (C program)
iso8859_6In (SPARC Binary)
iso8859_6In.exe (Win32 Executable)
ISO 8859-7 (Greek) source & header (C program)
iso8859_7In (SPARC Binary)
iso8859_7In.exe (Win32 Executable)
ISO 8859-8 (Hebrew) source & header (C program)
iso8859_8In (SPARC Binary)
iso8859_8In.exe (Win32 Executable)
ISO 8859-9 (Latin 5) source & header (C program)
iso8859_9In (SPARC Binary)
iso8859_9In.exe (Win32 Executable)
ISO 8859-10 (Latin 6) source & header (C program)
iso8859_10In (SPARC Binary)
iso8859_10In.exe (Win32 Executable)
ISO 8859-11 (Thai) source & header (C program)
iso8859_11In (SPARC Binary)
iso8859_11In.exe (Win32 Executable)
ISO 8859-13 (Latin 7) source & header (C program)
iso8859_13In (SPARC Binary)
iso8859_13In.exe (Win32 Executable)
ISO 8859-14 (Latin 8) source & header (C program)
iso8859_14In (SPARC Binary)
iso8859_14In.exe (Win32 Executable)
ISO 8859-15 (Latin 9) source & header (C program)
iso8859_15In (SPARC Binary)
iso8859_15In.exe (Win32 Executable)
MacRoman
  • MacRoman with Euro
source & header (C program)
macromanIn (SPARC Binary)
macromanIn.exe (Win32 Executable)
UTF-8 source & header (C program)
utf_8In (SPARC Binary)
UTF-16 (little endian)  
UTF-16 (big endian)  
Big5 (Chinese, including user-defined area) source & header (C program)
big5In (SPARC Binary)
VISCII (Vietnamese) source & header (C program)
visciiIn (SPARC Binary)
Support make.sh (shell script)
mkintranscoder.xsl (XSL)
mkcheader.xsl (XSL)
mkvariants.xsl (XSL)

transcode.h

variants.sh (generated shell script)

iconv (Bourne shell script)

regress.sh (Bourne Shell script for testing)


Note: Current Win32 executables require cygwin1.dll from http://sourceware.cygnus.com/cygwin/

Note: The generated C source code uses some GNU-specific extensions that some ANSI compilers will complain about: const in prototypes, and element initializers in arrays. The first is needed mainly to silence GCC diagnostics; the second is needed because XSL does not support array data structures, and we definitely want to initialize tables automatically.
If you need conservative ANSI C, run the Perl script ansify.pl against all .c and .h files.
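
To illustrate what "element initializers in arrays" means, here is a minimal sketch, not the generated GLUE code. The table name cp1252_high and the function cp1252_lookup are hypothetical; the three mappings shown are standard CP1252 assignments. The `[index] = value` form is accepted by GCC as an extension in C90 mode and became standard in C99. A strictly conservative ANSI compiler needs every element listed in order instead.

```c
#include <assert.h>

/* Hypothetical excerpt of a CP1252-style table. Only the listed elements
   are given; every other element is implicitly zero, which this sketch
   treats as "no special mapping". */
static const unsigned short cp1252_high[256] = {
    [0x80] = 0x20AC,  /* EURO SIGN */
    [0x85] = 0x2026,  /* HORIZONTAL ELLIPSIS */
    [0x99] = 0x2122   /* TRADE MARK SIGN */
};

/* Look up one input byte. When the table has no entry, fall back to the
   byte value itself, which is the ISO 8859-1 behaviour. */
unsigned short cp1252_lookup(int byte)
{
    unsigned short mapped;

    assert(byte >= 0 && byte <= 255);
    mapped = cp1252_high[byte];
    return mapped != 0 ? mapped : (unsigned short)byte;
}
```

Because each initializer names its own index, a generator can emit the entries in any order without first building an array, which is the property the note relies on.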


Individual Transcoder Utilities
The transcoder utilities convert from many different character sets to UTF-8. Reverse transcoders are planned for the near future. The transcoders are generated from XML using XSL scripts. Note: the code is beta and is expected to change in the near future.
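
As a rough idea of the work a to-UTF-8 transcoder does, the following is a minimal hand-written sketch for ISO 8859-1 input. It is not the code generated from the XML specifications. The function names utf8_encode and latin1_to_utf8 are assumptions made for illustration, and the encoder follows the current four-byte UTF-8 limit (U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode scalar value as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* ISO 8859-1 to UTF-8: each input byte is its own code point.
   out must have room for 2 * n bytes. Returns the bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, used = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        used += utf8_encode(in[i], out + used);
    return used;
}
```

A filter in the style of the utilities would read bytes from standard input, pass them through a routine like latin1_to_utf8, and write the result to standard output. Other 8-bit character sets would replace the identity mapping with a table lookup.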

The utilities use the name under which they are invoked to select a particular variant transcoder: for example, if you link (or copy) the iso8859_1In transcoder to the name cp1252In, it will transcode using the CP1252 character encoding. (For the names of the allowed transcoders, see the variants.sh file.)

Usage: xxxxIn < infile > outfile
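
Selecting a variant from the invocation name can be sketched as follows. This is an assumption about how such dispatch might look, not the generated code: the enum and both function names are hypothetical, and only the two program names mentioned above are handled. The sketch expects UNIX-style paths and does not strip a Win32 ".exe" suffix.

```c
#include <assert.h>
#include <string.h>

enum variant { VARIANT_ISO8859_1, VARIANT_CP1252, VARIANT_UNKNOWN };

/* Strip any directory part: "/usr/local/bin/cp1252In" -> "cp1252In". */
const char *program_basename(const char *argv0)
{
    const char *slash;

    assert(argv0 != NULL);
    slash = strrchr(argv0, '/');
    return slash != NULL ? slash + 1 : argv0;
}

/* Choose the variant from the name the program was started under,
   that is, from argv[0] as passed to main(). */
enum variant select_variant(const char *argv0)
{
    const char *name = program_basename(argv0);

    if (strcmp(name, "iso8859_1In") == 0)
        return VARIANT_ISO8859_1;
    if (strcmp(name, "cp1252In") == 0)
        return VARIANT_CP1252;
    return VARIANT_UNKNOWN;
}
```

With this scheme a single binary serves several encodings, and adding a variant on disk needs only a new link, for example `ln iso8859_1In cp1252In`.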


iconv The iconv command line utility is a front end for the individual transcoders. It is a Bourne shell script. (Edit the shell variable in the script so that it holds the correct path to the individual transcoders.)

Usage: iconv -f fromcharset -t tocharset file?
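
The front end itself is a shell script, but the lookup it has to perform can be sketched in C. Everything here is an assumption for illustration: the function find_transcoder, the table, and in particular the charset spellings on the left (the names the real script accepts are not listed on this page). The utility names on the right are taken from the list above. Since only transcoders to UTF-8 exist so far, any other target is rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct charset_entry {
    const char *charset;  /* name given with -f (assumed spelling) */
    const char *utility;  /* individual transcoder to run */
};

static const struct charset_entry charset_table[] = {
    { "ASCII",      "asciiIn" },
    { "ISO-8859-1", "iso8859_1In" },
    { "Big5",       "big5In" },
    { "VISCII",     "visciiIn" }
};

/* Return the utility that converts from `from` to `to`, or NULL when
   no such transcoder is known to this sketch. */
const char *find_transcoder(const char *from, const char *to)
{
    size_t i;

    assert(from != NULL && to != NULL);
    if (strcmp(to, "UTF-8") != 0)
        return NULL;  /* only transcoders to UTF-8 are available */
    for (i = 0; i < sizeof charset_table / sizeof charset_table[0]; i++)
        if (strcmp(from, charset_table[i].charset) == 0)
            return charset_table[i].utility;
    return NULL;
}
```

A front end would then run the returned utility from the configured directory, with the named file or standard input as its input.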


Software Quality Assurance
  • Transcoder specifications to be valid XML against charset.dtd. Status: WF
  • XSL specifications to be well-formed XML, and to generate no errors from XT. XSL as per 19990725 draft to be used. Status: OK
  • Generated C code to generate no errors or warnings with gcc -ansi -Wall -Wstrict-prototypes -Wcast-qual -Wcast-align -Wmissing-prototypes -Wredundant-decls -Wconversion -Wmissing-declarations Status: OK
  • Generated C code to be tested with lint or gcc -pedantic. Errors and causes noted to clarify portability. Create Perl script ansify.pl to convert to conservative ANSI that is accepted by lint or gcc -pedantic. Status: OK
  • Generated C code to pass through indent program with no errors. Status: OK
  • Round-trip regression tests to be implemented. Comparison using diff program to show no differences. Status: pending outgoing transcoders.
  • Comparisons with other transcoders to be implemented (tcs, iconv, trans120). Comparison using diff program to show no differences. Status: OK for BIG5 and data in ASCII range.
  • Note: For Big5 and any other transcoder with maps, it is a common technique to provide a count of the characters mapped. Because we are using explicit element initializers for arrays, this is probably not a useful test, but it may be good to do for completeness. Status: No action required
  • Transcoder specifications to include maximum and minimum values specifications, able to be used as post-condition tests after transcoding. C code to provide implementations for these, in conditional sections. Status: OK
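
The last item, range post-conditions compiled in conditionally, might look like the following sketch. The macro names GLUE_CHECK_RANGE, OUT_MIN, OUT_MAX and CHECK_OUTPUT and both functions are hypothetical, and the bounds are simply those of ISO 8859-1 (U+0000 to U+00FF), not values read from a GLUE specification.

```c
#include <assert.h>

/* Hypothetical bounds for an ISO 8859-1 transcoder: every code point
   it produces lies in U+0000..U+00FF. */
#define OUT_MIN 0x0000L
#define OUT_MAX 0x00FFL

/* True when cp lies inside the declared output range. */
int in_declared_range(long cp)
{
    return cp >= OUT_MIN && cp <= OUT_MAX;
}

/* The check is compiled in only when GLUE_CHECK_RANGE is defined,
   so a normal build pays nothing for it. */
#ifdef GLUE_CHECK_RANGE
#define CHECK_OUTPUT(cp) assert(in_declared_range(cp))
#else
#define CHECK_OUTPUT(cp) ((void)0)
#endif

/* ISO 8859-1: the byte value is the code point. */
long transcode_byte(unsigned char byte)
{
    long cp = (long)byte;

    CHECK_OUTPUT(cp);  /* post-condition on the produced code point */
    return cp;
}
```

Building with -DGLUE_CHECK_RANGE turns the post-condition on for testing; without it the macro expands to nothing.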

Academia Sinica
