lineDataWrap: An Element Set for Line-Delimited Records

Rick Jelliffe
Academia Sinica
Taipei, Taiwan
1999-02-01

Abstract

Much of the world's data is serialized into line-delimited records. lineDataWrap is an element set to allow convenient integration of line-delimited records into XML (or SGML) documents. The record and field format can be described using attributes, and Dublin Core metadata added.

This note also gives a transformation, by which the line-delimited records can be converted by a generic tool to and from XML elements. This transformation also provides enough information for the line-oriented records to be accessed as DOM element nodes.

This approach may be useful for simple database data in which XML markup would swamp the data size (one of the XML design goals was that terseness is of minimal importance). This may occur when transmission efficiency is important, when the dataset is very large, or when the dataset is text that must remain searchable with line-based tools such as UNIX grep. The approach will also be useful for incorporating existing data into XML systems (rather like the Red Hat Package Manager, RPM).

Simple Examples

You can use these element types like this:

<ascc:delimitedLineData id="pets" fieldNames="species name" commentDelimiter="#">
# Fluffy is pregnant: make more table space!
cat Fluffy
dog Poochy
owl DesmondHutu
</ascc:delimitedLineData>
<ascc:fixedFieldLineData id="source" schema="http://www.xxx.com/assembler" 
fieldNames="number label code" tab="1 9 17" commentDelimiter=";" 
DC:TYPE="software" DC:TITLE="Assembler Fragment" >
0000001 START   ACC  1 @5       ; blah blah
0000002         STO  4 @R1
0000003         BNE  4 START
</ascc:fixedFieldLineData>

A leading and trailing newline are not significant. When cutting and pasting data into these wrappers, you should replace "&" with "&amp;", "<" with "&lt;" and ">" with "&gt;", to prevent false delimiter recognition.
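
The escaping rule above can be applied mechanically before pasting. The following is a minimal sketch in Python, assuming the data is available as a string; the function name escape_line_data is hypothetical and not part of lineDataWrap. It relies on the standard library function xml.sax.saxutils.escape, which replaces "&", "<" and ">" and leaves other characters alone.

```python
from xml.sax.saxutils import escape


def escape_line_data(text: str) -> str:
    """Escape &, < and > so pasted records are not mistaken for markup."""
    # escape() handles "&" first, so existing text such as "&lt;" in the
    # raw data becomes "&amp;lt;" and survives a round trip unchanged.
    return escape(text)


if __name__ == "__main__":
    print(escape_line_data("0000004         CMP  1 <2 & >3"))
```

Line breaks and internal spacing are left untouched, so the wrapped data stays searchable with line-based tools.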

Element Set

This section defines two related element types: delimitedLineData and fixedFieldLineData. The first is used for data organised into records by lines, where each field is delimited by a delimiter string. The second is also used for data organised into records by lines, where the fields are determined positionally, with each field occupying a fixed range of (fixed-width, spacing) characters. The delimitedLineData and fixedFieldLineData elements are declared as ANY, but may only contain ( #PCDATA ) in the normal case, or ( l* ) if the line data has been transformed (see below). (The element types x, l and comment are defined below.)

<!-- lineDataWrap: Element Types for Wrapping Line-Oriented Records   -->
<!-- Defined in http://www.ascc.net/xml/en/utf-8/lineDataWrap.html -->
<!-- Copyright (C) 1999 Academia Sinica Computing Center              -->
<!-- Permission granted to use for any purpose under GPL or MPL       -->
<!-- Created: Rick Jelliffe, ricko@gate.sinica.edu.tw, 1999-02-01     -->
<!-- FPI: +//IDN www.ascc.net//ELEMENTS lineDataWrap//EN        -->
<!ELEMENT ascc:delimitedLineData      ANY >
<!ELEMENT ascc:fixedFieldLineData     ANY >
<!ELEMENT x                       ANY >
<!ELEMENT l                       ANY >
<!ELEMENT comment                 ( #PCDATA )>

The wrapper elements use the namespace "http://www.ascc.net/xml/resource/lineDataWrapper". They must have a unique identifier, id, and may have other alias identifiers, aliases. The version attribute is the version of the element type; please do not alter it. You may use the XML namespace mechanism to add any further attributes you care to. xml:lang is the human language of the attributes and the element; if the element's contents use a different language, use the DC:LANGUAGE attribute to mark this up.

<!-- Namespace, ID and Housekeeping Attributes -->
<!ATTLIST ascc:delimitedLineData 
        xmlns:ascc CDATA    #FIXED 
                "http://www.ascc.net/xml/resource/lineDataWrapper" 
        id         ID       #REQUIRED
        aliases    IDREFS   #IMPLIED 
        version    CDATA    #FIXED "1.0" 
        xml:lang   CDATA    #IMPLIED>
<!ATTLIST ascc:fixedFieldLineData
        xmlns:ascc CDATA    #FIXED 
                "http://www.ascc.net/xml/resource/lineDataWrapper" 
        id         ID       #REQUIRED
        aliases    IDREFS   #IMPLIED 
        version    CDATA    #FIXED "1.0" 
        xml:lang   CDATA    #IMPLIED >
<!ATTLIST l
        n          NMTOKEN   #IMPLIED >

Next, we declare notation-related attributes, which a transformation processor can use to convert the fields into elements. The notation declarations are formalities; please do not alter them. The schema attribute can hold a URI for any schema declarations; the notation of the schemas is not defined here. The sepDelimiter attribute lets you specify the separator delimiter for delimited data; by default it is any blank sequence of tabs or spaces, (\s | \t)*. The tab attribute lets you specify the start column of each tabstop (counted absolutely, starting from 1); in a fixedFieldLineData element, each tabstop starts a new field, whether or not a tab character was used; for delimited data the default is a tabstop every 8 characters. The commentDelimiter attribute gives the delimiter used for comments; the default is "", meaning no comments; a comment runs from after the delimiter to the end of the line. The fieldNames attribute is a list of whitespace-separated names, one for each field; if no name is given for a field, the element name "x" may be used. The fieldNotations attribute allows you to specify a list of notations, one for each field, for example to specify that a field is in a date format; the notation CDATA is defined below.

<!-- Notation (lexical) and schema (semantic & syntactic) attributes -->
<!NOTATION CDATA
        PUBLIC
        "+//IDN www.ascc.net//NOTATION Delimited Line Data//EN"
        "http://www.ascc.net/xml/resource/lineDataWrapper#delimitedLineData" >
<!NOTATION delimitedLineData
        PUBLIC
        "+//IDN www.ascc.net//NOTATION Delimited Line Data//EN"
        "http://www.ascc.net/xml/resource/lineDataWrapper#delimitedLineData" >
<!NOTATION fixedFieldLineData
        PUBLIC
        "+//IDN www.ascc.net//NOTATION Fixed Field Line Data//EN"
        "http://www.ascc.net/xml/resource/lineDataWrapper#fixedFieldLineData" >
<!ATTLIST ascc:delimitedLineData     
        notation         NOTATION (delimitedLineData) #FIXED "delimitedLineData"
        schema           CDATA    #IMPLIED
        xml:space        (default|preserve) "preserve"
        tab              NMTOKENS    "9 17 25 33 41 49 57 65"
        sepDelimiter     CDATA    #IMPLIED
        fieldNames       NMTOKENS #IMPLIED
        fieldNotations   NMTOKENS #IMPLIED
        commentDelimiter CDATA    "" >
<!ATTLIST ascc:fixedFieldLineData  
        notation         NOTATION (fixedFieldLineData) #FIXED "fixedFieldLineData"
        xml:space        (default|preserve) "preserve"
        schema           CDATA    #IMPLIED
        tab              NMTOKENS #REQUIRED
        fieldNames       NMTOKENS #IMPLIED
        fieldNotations   NMTOKENS #IMPLIED
        commentDelimiter CDATA    "" >
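
To make the roles of the sepDelimiter, tab and commentDelimiter attributes concrete, here is a minimal sketch in Python of how a processor might cut one line into fields. The function names (split_comment, delimited_fields, fixed_fields) are hypothetical and not defined by lineDataWrap, and the sketch assumes that the comment delimiter itself is discarded and that field values are trimmed of surrounding whitespace.

```python
import re


def split_comment(line, comment_delimiter=""):
    """Return (data, comment); comment is None if no delimiter is set or found."""
    if comment_delimiter and comment_delimiter in line:
        data, comment = line.split(comment_delimiter, 1)
        return data, comment.strip()
    return line, None


def delimited_fields(line, sep=None):
    """Split on a separator string, or on runs of spaces and tabs if sep is None."""
    line = line.strip()
    if not line:
        return []
    if sep is None:
        return re.split(r"[ \t]+", line)
    return [field.strip() for field in line.split(sep)]


def fixed_fields(line, tab):
    """Cut a line at the 1-based start columns listed in tab, e.g. "1 9 17"."""
    starts = [int(column) - 1 for column in tab.split()]
    ends = starts[1:] + [None]  # the last field runs to the end of the line
    return [line[start:end].strip() for start, end in zip(starts, ends)]


if __name__ == "__main__":
    data, comment = split_comment(
        "0000001 START   ACC  1 @5       ; blah blah", ";")
    print(fixed_fields(data, "1 9 17"), comment)
    print(delimited_fields("cat Fluffy"))
```

Any text before the first tabstop is dropped by fixed_fields, so a tab value that does not start at column 1 silently skips the leading columns; a real processor may prefer to report that case.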

Finally, Dublin Core attributes are available on the elements. Refer to the Dublin Core documentation for details. Note that after transformation into XML elements, the attribute DC:FORMAT should be given the value "text/xml", or some more specific MIME type. Note also that declaration here of the DC:RELATION attribute does not preclude the use of this element in RDF.

<!-- Dublin Core attributes -->
<!ATTLIST ascc:delimitedLineData
        xmlns:DC         CDATA #FIXED "http://purl.oclc.org/dc/"
        DC:TITLE         CDATA #IMPLIED
        DC:CREATOR       CDATA #IMPLIED
        DC:CONTRIBUTOR   CDATA #IMPLIED
        DC:SUBJECT       CDATA #IMPLIED
        DC:DATE          CDATA #IMPLIED
        DC:DESCRIPTION   CDATA #IMPLIED
        DC:PUBLISHER     CDATA #IMPLIED
        DC:RIGHTS        CDATA #IMPLIED
        DC:TYPE          CDATA "dataset"
        DC:FORMAT        CDATA "text/plain"
        DC:LANGUAGE      CDATA #IMPLIED
        DC:SOURCE        CDATA #IMPLIED
        DC:IDENTIFIER    CDATA #IMPLIED 
        DC:RELATION      CDATA #IMPLIED
        DC:COVERAGE      CDATA #IMPLIED
>
<!ATTLIST ascc:fixedFieldLineData
        xmlns:DC         CDATA #FIXED "http://purl.oclc.org/dc/"
        DC:TITLE         CDATA #IMPLIED
        DC:CREATOR       CDATA #IMPLIED
        DC:CONTRIBUTOR   CDATA #IMPLIED
        DC:SUBJECT       CDATA #IMPLIED
        DC:DATE          CDATA #IMPLIED
        DC:DESCRIPTION   CDATA #IMPLIED
        DC:PUBLISHER     CDATA #IMPLIED
        DC:RIGHTS        CDATA #IMPLIED
        DC:TYPE          CDATA "dataset"
        DC:FORMAT        CDATA "text/plain"
        DC:LANGUAGE      CDATA #IMPLIED
        DC:SOURCE        CDATA #IMPLIED
        DC:IDENTIFIER    CDATA #IMPLIED 
        DC:RELATION      CDATA #IMPLIED
        DC:COVERAGE      CDATA #IMPLIED
>

Transformations into Elements

Line data marked up with the lineDataWrapper elements may be transformed into XML elements. This can occur either by preprocessing the XML data, or by postprocessing the element nodes after the document has been read into DOM. Since DOM is a logical interface, no data conversion actually needs to take place: a system could also interpret DOM requests against the text stored in lines, as a means of implementation.
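
The idea of answering element-style requests directly from the stored lines, without converting anything, can be sketched as follows. This is an illustration in Python only, not a DOM implementation: the class LineDataView and its field method are hypothetical names, and the sketch assumes delimited data with the default blank separator.

```python
class LineDataView:
    """Read-only view that answers field queries from the raw lines."""

    def __init__(self, text, field_names, comment_delimiter=""):
        # Leading and trailing newlines are treated as not significant.
        self._lines = text.strip("\n").split("\n")
        self._names = field_names.split()
        self._comment = comment_delimiter

    def __len__(self):
        """Number of lines, i.e. the number of l elements a transform would make."""
        return len(self._lines)

    def field(self, n, name):
        """Text of the field called name on line n (1-based), or None if absent."""
        line = self._lines[n - 1]
        if self._comment:
            # Ignore everything from the comment delimiter onwards.
            line = line.split(self._comment, 1)[0]
        values = line.split()
        index = self._names.index(name)
        return values[index] if index < len(values) else None


if __name__ == "__main__":
    pets = "\n# Fluffy is pregnant: make more table space!\ncat Fluffy\ndog Poochy\n"
    view = LineDataView(pets, "species name", "#")
    print(len(view), view.field(2, "name"))
```

Each query re-splits one line, which keeps memory use close to the size of the original text at the cost of repeated parsing.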

The element type l follows the TEI element of the same name. It gives a single line; whitespace between child elements is not significant. To respect line length restrictions, child elements should be separated by a newline followed by a tab character in this form. The element type comment may appear as the only or last child element of any line. The element type x may be used for fields for which no name has been specified in the fieldNames attribute.

The example given before will be transformed as follows:

<ascc:delimitedLineData id="pets" fieldNames="species name" commentDelimiter="#"
        DC:FORMAT="text/xml">
<l n="1"><comment>Fluffy is pregnant: make more table space!</comment></l>
<l n="2"><species>cat</species>
        <name>Fluffy</name></l>
<l n="3"><species>dog</species>
        <name>Poochy</name></l>
<l n="4"><species>owl</species>
        <name>DesmondHutu</name></l>
</ascc:delimitedLineData>
<ascc:fixedFieldLineData id="source" schema="http://www.xxx.com/assembler" 
fieldNames="number label code" tab="1 9 17" commentDelimiter=";" 
DC:TYPE="software" DC:TITLE="Assembler Fragment" DC:FORMAT="text/xml">
<l n="1"><number>0000001</number>
        <label>START</label>
        <code>ACC  1 @5</code>
        <comment>blah blah</comment></l>
<l n="2"><number>0000002</number>
        <label></label>
        <code>STO  4 @R1</code></l>
<l n="3"><number>0000003</number>
        <label></label>
        <code>BNE  4 START</code></l>
</ascc:fixedFieldLineData>

Note that XML software may often strip tabs, or convert them to spaces. So it may be desirable in some cases to preprocess data to replace tabs with space characters (for fixed field data) or some other delimiter character (for delimited field data). Within a field, leading and trailing whitespace are stripped; internal spaces should be preserved.
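
The transformation of delimited data can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not a reference implementation: the function name lines_to_elements is hypothetical, the default blank separator is assumed, and the comment delimiter is dropped from the comment text (as in the fixed-field example, where ";" does not appear in the comment element).

```python
import xml.etree.ElementTree as ET


def lines_to_elements(text, field_names="", comment_delimiter=""):
    """Turn delimited line data into a list of l elements, one per line."""
    names = field_names.split()
    # One leading and one trailing newline are not significant.
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    result = []
    for n, line in enumerate(text.split("\n"), start=1):
        l = ET.Element("l", {"n": str(n)})
        comment = None
        if comment_delimiter and comment_delimiter in line:
            line, comment = line.split(comment_delimiter, 1)
        # split() with no argument splits on runs of whitespace and
        # discards leading and trailing whitespace.
        for i, value in enumerate(line.split()):
            # Fields beyond the declared names fall back to element type x.
            name = names[i] if i < len(names) else "x"
            ET.SubElement(l, name).text = value
        if comment is not None:
            ET.SubElement(l, "comment").text = comment.strip()
        result.append(l)
    return result


if __name__ == "__main__":
    pets = "\n# Fluffy is pregnant: make more table space!\ncat Fluffy\n"
    for element in lines_to_elements(pets, "species name", "#"):
        print(ET.tostring(element, encoding="unicode"))
```

The serializer takes care of escaping "&", "<" and ">" in field values, so the input here is the unescaped text as an XML parser would deliver it. The sketch writes child elements without the newline-and-tab separation between them; whitespace between child elements is not significant, so a pretty-printer can add it afterwards.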

Note that the functionality of delimitedLineData is directly available to users of SGML systems which support "short references" (except for the automatic line numbering), with the appropriate markup declarations.

Availability

The most recent version of the lineDataWrap DTD is kept in http://www.ascc.net/xml/resource/lineDataWrap.dtd


Copyright (C) 1999 Rick Jelliffe. You may publish and use this paper, and the element sets within it, in any medium for any purpose, but please keep my name on it.