XXX: eXperimental Xml leXer

XXX is experimental software for implementing XML parsers. It is currently under development: the source code is provided for browsing interest and at any given time may not work. It is available here.

There is a small paper, XXX Notation Processors, that may give a better idea of the underlying rationale for this software.

The basic idea of XXX is that XML can be parsed using a recursive descent parser made from a highly parameterized general-purpose lexical analyser. The lexical analyser corresponds to a state in a state machine; the parameters correspond to transitions between states. Thus it is possible to go from a state-machine analysis of XML to the implementation of a lexer.

Because the lexical analyser is general purpose, the same analyser may be used for scanning other languages: the XML internal and external prologs (DTDs) or CSS, for example. The lexical analyser is merely invoked with different initial functions.
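As a rough sketch of what that reuse might look like (the entry point lex() and the table names below are illustrative assumptions, not XXX's actual API; the struct lexrules rule table is defined later on this page):

  /* Sketch only: one general-purpose scanner entry point, parameterized by
     an initial rule table; only the table passed in changes per language. */
  struct entity;                 /* input entity, as used by XXX's handlers */
  struct lexrules;               /* rule table, defined below */

  extern int lex(struct entity *input, struct lexrules *initialState);

  extern struct lexrules xmlInstanceLexical;  /* hypothetical: XML instances  */
  extern struct lexrules dtdLexical;          /* hypothetical: DTDs (prologs) */
  extern struct lexrules cssLexical;          /* hypothetical: CSS            */

  int scanXML(struct entity *e) { return lex(e, &xmlInstanceLexical); }
  int scanDTD(struct entity *e) { return lex(e, &dtdLexical); }
  int scanCSS(struct entity *e) { return lex(e, &cssLexical); }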

Analysis of XML

This leads to viewing XML as having four independent syntactical layers, of which the first is the token-level syntax.

XXX addresses only this first layer. The higher layers can be addressed by regular-expression validators running on the output of the token-level syntax.

This view of parsing markup languages on a token basis seems to underlie the SGML model, which, however, had no concept of well-formedness and did not attempt to divide the layers. In SGML, the layers are not independent.

According to our analysis, XML can be tokenized by a lexer which uses the following model:

  tag ::= startDelimiter, whiteSpace?,
          ( subtag1 | subtag2 | subtag3 | subtag4 |
            subtag5 | subtag6 | subtag7 | subtag8 |
            (token whiteSpace?) | CDATA )*
          ( altEndDelimiter | endDelimiter )
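For example (our reading of the model, not taken from the XXX documentation), a start-tag such as <p class="note"> would match this pattern with "<" as the startDelimiter, "p" as a token followed by optional whitespace, the quoted attribute value handled by a subtag such as the literal state shown below, and ">" as the endDelimiter; the "/>" of an empty-element tag would be the altEndDelimiter.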

General Purpose Lexical Analyser

Rather than implementing the XML tokenizer directly, we have made a general-purpose lexical analyser (scanner) driven by tables.

struct lexrules {
  unsigned char *name;
  TXCHAR *startDelimiter;
    /* 0 means no delimiter (succeed);
       -1 is for future use to implement RCDATA and modal parsing */
  TXCHAR *endDelimiter;
    /* 0 means no delimiter (skip);
       -1 means the start delimiter is the sole delimiter */
  TXCHAR *altEndDelimiter;
    /* 0 means no delimiter (skip);
       -1 means an omissible end-delimiter */
  struct lexrules *sublexer1;  /* longest match first */
  struct lexrules *sublexer2;
  struct lexrules *sublexer3;
  struct lexrules *sublexer4;
  struct lexrules *sublexer5;
  struct lexrules *sublexer6;
  struct lexrules *sublexer7;
  struct lexrules *sublexer8;
  enum { noStrip = 0,
         strip } stripStartWhiteSpace;
  enum { CDATA = 0,
         oneTokenNoSpace,
         oneTokenThenOptSpace,
         multipleTokens } tokenize;
  enum { noEOF = 0,
         allowEOF } allowEOF;
  int (*tokenchartest)
        (struct entity *, struct lexrules *);
        /* test function for tokenizing */
  int (*whitespacechartest)
        (struct entity *, struct lexrules *);
        /* test function for whitespace handling */
  int (*taghandler)
        (struct entity *, struct lexrules *, int);
        /* handler for the tag */
  int (*datahandler)
        (struct entity *, struct lexrules *);
        /* handler for raw data */
  int (*whitespacehandler)
        (struct entity *, struct lexrules *);
        /* handler for whitespace */
  int (*tokenhandler)
        (struct entity *, struct lexrules *);
        /* handler for tokens */
};

For example, the transition table describing the characters allowed inside a literal (an attribute value) is:

static struct lexrules litLexical = { "lit",
  "\"\0\0\0", "\"\0\0\0", 0,                 /* start, end, alt-end delimiters */
  &hncrLexical, &ncrLexical, &erefLexical,   /* sublexers 1-3 */
  NULL, NULL, NULL,
  NULL, NULL,                                /* sublexers 4-8 unused */
  strip, multipleTokens, noEOF,              /* whitespace, tokenize, EOF policy */
  &nonnmtokenchar, &nonwhitespacechar, &xmltag,  /* char tests, tag handler */
  &xmldata, &xmlwhitespace, &xmltoken        /* data, whitespace, token handlers */
};
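As a further illustration (not part of the XXX source), a state for single-quoted literals could be described by the same table with only the delimiter strings changed, which is what makes the analyser general purpose; the delimiter padding is assumed to follow litLexical above:

  /* Hypothetical companion table for single-quoted literals ('...'), written
     by analogy with litLexical; only the delimiters differ, and the
     sublexers and handlers are reused unchanged. */
  static struct lexrules litAposLexical = { "litapos",
    "'\0\0\0", "'\0\0\0", 0,
    &hncrLexical, &ncrLexical, &erefLexical,
    NULL, NULL, NULL,
    NULL, NULL,
    strip, multipleTokens, noEOF,
    &nonnmtokenchar, &nonwhitespacechar, &xmltag,
    &xmldata, &xmlwhitespace, &xmltoken
  };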

Implementation

The current implementation of XXX is a simple C program.

The current implementation of the lexer is mildly unusual in that transitions are tested by calling each candidate state in sequence, rather than having the current state test the transitions.

This approach brings out the nature of the various transitions better: which transitions are stack pushes or pops, and which are substates (e.g. name tokenizing and whitespace tokenizing).

The approach is inefficient for states with many transitions (XML requires 11 possible transitions in PCDATA); however, there are trivial speedups, especially for non-Latin processing, which should make the lexer as fast as possible (i.e., one test and a branch) for PCDATA.
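A minimal sketch of that control flow (the helper names matchDelimiter and enterState are assumptions for illustration, not XXX's actual internals):

  /* Sketch only: each candidate substate is offered the input in turn,
     longest match first; the first whose start delimiter matches becomes
     the new state (a stack push into that substate). */
  extern int matchDelimiter(struct entity *, TXCHAR *);       /* hypothetical */
  extern int enterState(struct entity *, struct lexrules *);  /* hypothetical */

  static int tryTransitions(struct entity *in, struct lexrules *state)
  {
    struct lexrules *candidates[8] = {
      state->sublexer1, state->sublexer2, state->sublexer3, state->sublexer4,
      state->sublexer5, state->sublexer6, state->sublexer7, state->sublexer8
    };
    int i;
    for (i = 0; i < 8; i++) {
      if (candidates[i] && matchDelimiter(in, candidates[i]->startDelimiter))
        return enterState(in, candidates[i]);
    }
    return 0;  /* no transition matched; keep tokenizing in the current state */
  }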

What is the Purpose of XXX?

Status

The lexer works on XML instances in UTF-8, ISO 8859-n and ASCII. Encoding detection is partially in place but is not a priority. The DTD routines have been stubbed out. The next step is to add some entity handling.

Future

We want to try XXX on DTDs, CSS, URLs, ISO 8601 dates, SGML and JavaScript.

Availability

XXX is available here as a single C file. We have only compiled it under GCC on a SPARC, but there are no OS-dependencies, and external libraries are only used for file management and error reporting.

Please note, again, that this is just experimental software at the moment, so it is not appropriate for use in real systems. If you use it or upgrade it, please let us know.
