Chinese Metadata F.A.Q.

Maintained by: Rick Jelliffe

Table of Contents

Chinese Metadata F.A.Q.

A. General

A.1. What is Metadata?

B. WWW Metadata "Standards"

B.1. What is the Dublin Core (DC)?
B.2. What is Qualified Dublin Core (QDC)?
B.3. Can I use Dublin Core with XML or HTML?
B.4. Can I use Dublin Core for Chinese Metadata?
B.5. What is the Resource Description Framework (RDF)?
B.6. Can I use RDF with XML or HTML?
B.7. What is the safest way to use Dublin Core and RDF?
B.8. What is the Warwick Framework?
B.9. What is the Encoded Archival Description (EAD)?
B.10. What is PICS?
B.11. What is the relation between EAD, PICS, Dublin Core, RDF and the Warwick Framework?
B.12. What is MARC?
B.13. What is ANSI Z39.50?
B.14 What is TEI?
B.15. What is CIMI?
B.16. What is XPointer?
B.17. What is FGDC?
B.18. What is CML?
B.19. What is BSML?

C. Implementation

C.1 I'm confused! Do we really need all these kinds of metadata?
C.2 Can I use a CMARC Code if I really want to?
C.3 Can I use CCCII with XML?
C.4 Do we need to choose a single metadata standard?

D. XML Questions

D.1. What is XML?
D.2 Is the standard XML attribute xml:lang good enough for metadata?
D.3 How can I represent "pinyin" or "traditional" or "simplified"?

Cataloging Information (Dublin Core)

Chinese Metadata F.A.Q.

This FAQ is about using XML-based metadata (e.g., Dublin Core, RDF and EAD), including for Chinese-language metadata.

This FAQ was created to help answer questions arising in the Digital Library/Museum project at Academia Sinica, Taiwan.

A. General

A.1. What is Metadata?

Rick Jelliffe

Metadata is "data about data"; it is data for the purposes of cataloging, searching, archiving, electronic discovery, displaying, and so on. The key indication of the direction of the WWW on metadata comes from the inventor of the WWW, Tim Berners-Lee, who writes (in Metadata Architecture, at http://www.w3.org/DesignIssues/Metadata.html): "Metadata is machine understandable information about web resources or other things" but "metadata is data."

B. WWW Metadata "Standards"

For a good review of Metadata standards, see "Review of Metadata Formats" by Rachel Heery, 1996 (at http://www.oasis-open.org/cover/heery-review.html )

B.1. What is the Dublin Core (DC)?

Rick Jelliffe

The Dublin Core home page (http://purl.oclc.org/dc/index.htm) says "The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations."

At the moment, there are many different formats used to catalog information. Dublin Core provides a very simple subset (or classification) of them. In the future, we anticipate all catalogs will support access using Dublin Core metadata. Dublin Core defines 15 core element names.

For an example of Dublin Core metadata, see the end of this file.

The Dublin Core is not finalized. It is still being developed. All Dublin Core implementations are therefore, to a certain extent, experimental. Please also beware that much of the "draft" Dublin Core material is itself based on other drafts, so quite a lot of the information is unreliable.

B.2. What is Qualified Dublin Core (QDC)?

Rick Jelliffe

The simple Dublin Core elements are too simple for many users. They have started to develop subclasses of elements, based on the Dublin Core. So now there is "simple" Dublin Core, and "qualified" Dublin Core.

Note: The Qualified Dublin Core elements are not finalized. They are still being developed. All Dublin Core implementations are therefore, to a certain extent, experimental.

B.3. Can I use Dublin Core with XML or HTML?

Rick Jelliffe

Yes. Dublin Core is merely a set of field names and descriptions. It is not a file format (or notation). However, you can implement Dublin Core in several notations, including XML and HTML.
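As a rough illustration of the HTML case, the sketch below renders a few Dublin Core fields as HTML meta elements. The "DC." name prefix is a commonly used convention for Dublin Core in HTML and is an assumption here, not something this FAQ prescribes; the function name and the sample record are made up for the example.

```python
from html import escape


def dc_to_html_meta(record):
    """Render a dict of Dublin Core fields as HTML <meta> elements.

    Assumption: element names are written as "DC.<name>".
    """
    lines = []
    for name, value in record.items():
        lines.append('<meta name="DC.%s" content="%s">'
                     % (escape(name, quote=True), escape(value, quote=True)))
    return "\n".join(lines)


# Hypothetical sample record.
sample = {"title": "The Chinese Metadata FAQ", "creator": "Rick Jelliffe"}
html_meta = dc_to_html_meta(sample)
print(html_meta)
```

The same dictionary could just as well be written out as XML elements; the field names stay the same and only the notation changes.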

B.4. Can I use Dublin Core for Chinese Metadata?

Rick Jelliffe

Yes. Dublin Core is merely a set of field names and descriptions. It is not a file format. XML, SGML and HTML 4 all allow Han ideographs. There is more information on Chinese XML at the Chinese XML FAQ at http://www.ascc.net/xml/en/utf-8/faq.html. This FAQ will have specific answers to questions about Chinese metadata and Dublin Core too.
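A minimal sketch of Han text in a Dublin Core field, using Python's standard ElementTree module. The namespace URI for the Dublin Core elements and the "record" wrapper element are assumptions for illustration; the example at the end of this file uses a bare "DC:" prefix instead.

```python
import xml.etree.ElementTree as ET

# Assumption: this namespace URI for the Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ET.register_namespace("dc", DC_NS)

record = ET.Element("record")  # hypothetical wrapper element
title = ET.SubElement(record, "{%s}title" % DC_NS)
title.set(XML_LANG, "zh-TW")
title.text = "\u4e2d\u6587"  # two Han ideographs, U+4E2D U+6587

# Serialize as UTF-8 bytes, then parse back to check that nothing was lost.
data = ET.tostring(record, encoding="utf-8")
parsed_title = ET.fromstring(data).find("{%s}title" % DC_NS)
print(parsed_title.get(XML_LANG), len(parsed_title.text))
```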

B.5. What is the Resource Description Framework (RDF)?

Rick Jelliffe

The RDF home page (http://www.w3.org/RDF/Overview.html) says "RDF is designed to provide an infrastructure to support metadata across many web-based activities." It is very verbose, when used to mark up elements. RDF is also quite complex to understand.

The lesson of SGML is that there are great benefits in marking up documents generically, according to document type, rather than with specific details for each element. This is the only way to handle large data sets, in many cases. Applying this lesson to RDF, it is probable that RDF will be most beneficial when applied to element types rather than to element instances: the element type becomes a kind of macro for the RDF.

RDF is frequently mentioned in material about Dublin Core. But you can use Dublin Core without RDF.

RDF provides extra value to Dublin Core metadata. Generic RDF tools (for example, for visualization and discovery) will be able to use DC+RDF data even though they do not understand Dublin Core elements.
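To make the idea of a "generic tool" concrete, here is a small sketch that treats metadata as (resource, property, value) statements and groups them without knowing what any property means. The tuple representation, the property URIs and the second title are assumptions chosen for illustration; this is not an RDF parser.

```python
from collections import defaultdict

# Assumption: statements are plain (resource, property, value) tuples and
# the property URIs below are illustrative.
DC = "http://purl.org/dc/elements/1.1/"
statements = [
    ("http://www.ascc.net/xml/en/utf-8/dc-faq.html", DC + "title", "The Chinese Metadata FAQ"),
    ("http://www.ascc.net/xml/en/utf-8/dc-faq.html", DC + "creator", "Rick Jelliffe"),
    ("http://www.ascc.net/xml/en/utf-8/faq.html", DC + "title", "Chinese XML FAQ"),
]


def describe(triples):
    """Group property/value pairs by resource without interpreting them."""
    by_resource = defaultdict(list)
    for resource, prop, value in triples:
        by_resource[resource].append((prop, value))
    return dict(by_resource)


description = describe(statements)
for resource, pairs in description.items():
    print(resource, len(pairs))
```

The function never inspects the property names, which is the sense in which a generic tool can work with Dublin Core data it does not understand.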

B.6. Can I use RDF with XML or HTML?

Rick Jelliffe

Yes. RDF is a conceptual model to let you make assertions about elements of data. It is not a file format. However, you can implement RDF systems using XML (or SGML). You can implement some kinds of RDF using HTML too.

RDF is best for database-style "fielded records" rather than for adding to existing free text data, or data ordered by the requirements of publishing systems.

There is a DTD for RDF available at http://www.ascc.net/xml/en/utf-8/resource-index.html.

B.7. What is the safest way to use Dublin Core and RDF?

Rick Jelliffe

The Dublin Core working group on Data Models has published some guidelines which seem safe. See http://www.mailbase.ac.uk/lists/dc-datamodel/1998-09/0029.html.

These basic guidelines can be summarized:

B.8. What is the Warwick Framework?

Rick Jelliffe

The Warwick Framework is a method to bundle packages (of metadata) together (into a container). The packages can be of different formats. The Warwick Framework lets you specify the relationship between packages in the container; this helps transmission and queries. The containers can be in

The paper defining the Warwick Framework is "The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata" by Carl Lagoze, Clifford A. Lynch and Ron Daniel Jr. (see http://lists.w3.org/Archives/Public/www-disw/msg00017.html).

B.9. What is the Encoded Archival Description (EAD)?

Rick Jelliffe

Encoded Archival Description (EAD) is an XML/SGML DTD for specifying finding aids for archived objects. It started at the Library of the University of California, Berkeley, and is now being developed by the Society of American Archivists and the Library of Congress (Network Development and MARC Standards Office). See http://www.loc.gov/ead/ead.html. It was in use and testing for several years and was officially released in 1998. EAD is a traditional XML/SGML DTD, and provides very detailed fields for metadata.

The EAD is very different from the Dublin Core and RDF. Perhaps we can say that Dublin Core is an incomplete "top-down" design, while EAD is a completed "bottom-up" design.

EAD won the 1998 "Coker Prize for Description" from the Society of American Archivists. See http://www.archivists.org/awards/coker.html

A good site is the Berkeley Digital Library http://sunsite.berkeley.edu/SGML/index.html. An example of EAD in use is http://sunsite2.berkeley.edu:28008/dynaweb/oac/bampfa/@Generic__CollectionView

B.10. What is PICS?

Rick Jelliffe

PICS is the Platform for Internet Content Selection. The PICS home page (http://www.w3.org/PICS/) says PICS "enables labels (metadata) to be associated with Internet content. It was originally designed to help parents and teachers control what children access on the Internet, but it also facilitates other uses for labels, including code signing and privacy."

There are several related specifications for PICS:

PICS is thus concerned with delivering metadata for use in establishing sessions and data access, not for archiving or data manipulation.

Note that there are also some attempts to extend PICS as a more general Schema language. For example, to describe Dublin Core: http://metadata.net/dstc/DC-10-EN/schema.txt.

B.11. What is the relation between EAD, PICS, Dublin Core, RDF and the Warwick Framework?

Rick Jelliffe

EAD, Dublin Core/RDF and PICS all serve different stages of document production. Let's use the terminology of the movie industry (Ted Nelson has said that hypertext is a kind of movie-making), but applied to electronic archives and Web sites:

EAD is therefore aimed at pre-production needs--archivists need to store all the relevant information that they have, whether or not it fits into nice Dublin Core categories. (The EAD Design Principles put it this way: "The needs of public users, curatorial and reference staff, and finding aid authors were given priority in the standard's design." Furthermore, "Finding aids are not objects of study but rather tools leading to such objects." See http://www.loc.gov/ead/eaddsgn.html.)

Dublin Core is aimed at (post-)production needs--it is a simple interface for allowing interchange and access, but it does not attempt to provide any higher-level structures: it does not treat the differences in objects as significant: a stick is the same as a library.

RDF is aimed at post-production needs--providing a way to tie together information from lots of different schemas, which may have nothing to do with archiving.

PICS is aimed at session needs--deciding whether a type of information is appropriate for a particular user or situation.

The Warwick Framework is a format by which EAD, Dublin Core, RDF or PICS schemas can be bundled together and interchanged. (However, this functionality may be, to a certain extent, duplicated by RDF and perhaps EAD. I would expect RDF to win popular support, if they are competitors.)

B.12. What is MARC?

Rick Jelliffe

MARC is the Big Daddy. But he has multiple personalities. MARC is a metadata record format for libraries.

MARC records use the ISO 2709:1981 format. However, there is recent work to also provide an XML format.

B.13. What is ANSI Z39.50?

Rick Jelliffe

ANSI Z39.50 is a database query protocol, for querying a catalog about library holdings. It is very suitable for MARC data. Holding information is (typically) returned in the OPAC format. More recently, CIMI has defined a "profile" to support retrieval of museum information too. See question B.15 below.

Z39.50 uses ISO ASN.1 (Abstract Syntax Notation One) rather than SGML (see XML, question D.1). This makes it much more efficient for many small transactions.

For a good bibliography, see Lynch, Clifford A. (1994). RFC 1729, Using the Z39.50 Information Retrieval Protocol in the Internet Environment. http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1729.txt

B.14 What is TEI?

Rick Jelliffe

The Text Encoding Initiative (TEI) is "an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research, and to satisfy a broad range of uses by the language industries more generally." The TEI has made a family of DTDs for all sorts of electronic texts. These include the ability to have some metadata, in the TEI headers. A text or object may have multiple TEI headers.

In complexity or richness, the TEI headers provide more than Dublin Core, but certainly less than EAD or MARC. Perhaps Qualified Dublin Core will be similar. TEI headers seem to be intended to allow suppliers of material to give a good head start to catalogers, rather than being a relentless enumeration of every bibliographic possibility.

The TEI Home Page is http://www.uic.edu/orgs/tei/. The current specifications can be found at http://etext.lib.virginia.edu/TEI.html. The specification for the header is in Chapter 5: http://etext.virginia.edu/bin/tei-tocs?div=DIV1&id=HD

An XML version of the TEI Lite DTD can be found at the Chinese XML Now! site, on the Resources page.

B.15. What is CIMI?

Rick Jelliffe

CIMI is the Consortium for the Computer Interchange of Museum Information. Their home page is http://www.cimi.org/

CIMI's approach is to foster standards: in particular "SGML for structuring information and Z39.50 for information interchange." CIMI uses a TEI-based DTD (document type definition) for defining structured documents. It seems that they may also be keen on supporting the Dublin Core, to some extent (how?).

The CIMI material includes some good mechanisms to feed post-production or session stages: for example "wall text" to accompany objects on display. The main one is the "access point" attribute.

For querying, CIMI define a profile of Z39.50. It builds on the Library of Congress Collections Profile (see http://lcweb.loc.gov/z3950/agency/profiles/collections.html )

A history can be found at http://www.cimi.org/about/history.html

B.16. What is XPointer?

Rick Jelliffe

XPointer is a standard mechanism for locating elements in document structures using various criteria (attribute values, absolute and relative hierarchical position, etc.). It is based on W3C URI hyperlinks, on ISO HyTime hyperlink navigation models, and on TEI location syntax. It is currently in draft at http://www.w3.org/TR/. An XPointer can be the data for a query or the result of a query.

B.17. What is FGDC?

Rick Jelliffe

FGDC is the Federal Geographic Data Committee. It publishes a metadata standard for geospatial data: Content standards for digital geospatial metadata (June 8, 1994), Washington, D.C. http://geology.usgs.gov/tools/metadata/standard/metadata.html

B.18. What is CML?

Rick Jelliffe

CML is the Chemical Markup Language. It is an XML DTD for metadata for chemical documents.

B.19. What is BSML?

Rick Jelliffe

BSML is the Bioinformatic Sequence Markup Language. It is an XML DTD for metadata for genetic information. It has a rich set of presentation elements, so it may perhaps be regarded as a presentation DTD more than a metadata DTD. However, it features the ability to invoke data from many different formats.

C. Implementation

C.1 I'm confused! Do we really need all these kinds of metadata?

Rick Jelliffe

A lot of the metadata standards are duplicates. There is sometimes no convincing reason to select one standard compared to another. And, even for big standards like MARC, you may still find you need to add your own element types, because of local requirements.

C.2 Can I use a CMARC Code if I really want to?

Rick Jelliffe

Sure. You can add any kind of attributes you like to XML-based metadata. You can make up a DTD using CMARC codes in the element type names, or you can allow elements to have an attribute in which the appropriate CMARC code can be specified.
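The attribute approach might look like the following sketch, which builds a small record with Python's standard ElementTree module. The element names, the attribute name "cmarc" and the code value "200" are placeholders invented for this example; they do not come from any published DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical names: "record", "title", the "cmarc" attribute and the
# field code "200" are placeholders for illustration only.
record = ET.Element("record")
title = ET.SubElement(record, "title", {"cmarc": "200"})
title.text = "The Chinese Metadata FAQ"

# Software can then select elements by their CMARC code.
by_code = {el.get("cmarc"): el.text for el in record.iter() if el.get("cmarc")}
print(ET.tostring(record, encoding="unicode"))
```

Keeping the code in an attribute leaves the element names free to follow whichever DTD you have chosen.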

C.3 Can I use CCCII with XML?

At the moment, there is no way to use XML with CCCII without losing information about variants: XML must use ISO 10646 (Unicode) as its document character set.

However, XML is just one possible subset of SGML. It is completely legitimate to create your own subset of SGML, which follows XML in everything *except* that it uses CCCII as the document character set. Such a markup language would be called "CCCII-XML": no one has done this, but several libraries have asked about it.

C.4 Do we need to choose a single metadata standard?

Rick Jelliffe

In many cases, different metadata standards have the same data, but use different structures and names for it. A good approach is to use the most specific DTD you can, for each different class of data.

You can always map from specific to more general. It is impossible to map from general to more specific.

(You should plan to transform your data: there are many text processing tools available to help you: Perl, Python, XSL (LotusXSL, XT, Koala, etc.), Cost, OmniMark, Balise. And the DOM programming interface makes transformations available at your browser, using Java or Javascript.)
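The one-way nature of such a mapping can be seen in a short sketch. The MARC tags and their Dublin Core targets below are assumptions chosen for illustration, so check any real mapping against a published crosswalk; the sample names are made up.

```python
# Illustrative subset of a MARC-to-Dublin-Core mapping; these pairs are
# assumptions for the sketch, not an authoritative crosswalk.
MARC_TO_DC = {
    "100": "Creator",   # main entry, personal name
    "700": "Creator",   # added entry, personal name
    "245": "Title",
    "650": "Subject",
}


def to_dublin_core(marc_fields):
    """Map (tag, value) pairs to (DC element, value) pairs, dropping unmapped tags."""
    return [(MARC_TO_DC[tag], value) for tag, value in marc_fields if tag in MARC_TO_DC]


# Hypothetical record.
marc_record = [("100", "Jelliffe, Rick"), ("700", "A. N. Other"), ("245", "Chinese Metadata FAQ")]
dc_record = to_dublin_core(marc_record)
print(dc_record)
```

Both the 100 and the 700 field end up as Creator, so the distinction between them cannot be recovered from the Dublin Core record alone. That is why the specific form is the one worth storing.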

This approach of using very specific, targeted DTDs closely follows the one recommended in Light and Burnard's study Three SGML Metadata Formats: TEI, EAD and CIMI (http://hosted.ukoln.ac.uk/biblink/wp1/sgml/ ): it recommends keeping data in a specific DTD. It makes the interesting observation that the three DTDs could be combined: "one might use the EAD scheme to describe individual archival holdings down to the item level and then use TEI headers to describe individual documents, where these were deemed of sufficient importance to warrant the effort. Equally, one could embed CIMI topic descriptors within an otherwise purely TEI conformant document." (in Conclusion, 5.3 Use of schemes in combination, http://hosted.ukoln.ac.uk/biblink/wp1/sgml/conclusion.html ) This is the approach in use at the Bodleian Library at the University of Oxford.

Note. The mappings between MARC and Dublin Core (and GILS) are available at the "Crosswalk" site, http://www.loc.gov/marc/dccross.html. Use these mappings if possible. (See also Mappings between Metadata Formats, compiled by Michael Day: http://www.ukoln.ac.uk/metadata/interoperability/ )

D. XML Questions

D.1. What is XML?

Rick Jelliffe

XML (http://www.w3.org/XML/) is a simplified subset of SGML, the ISO Standard Generalized Markup Language. Most new WWW markup languages are now written using XML. See question B.3.

D.2 Is the standard XML attribute xml:lang good enough for metadata?

XML (http://www.w3.org/XML/) provides a standard attribute xml:lang, which can be used on any element to set the language. See the Chinese XML FAQ for details (http://www.ascc.net/xml/en/utf-8/faq.html#zh_xml_q15). HTML provides an equivalent attribute: lang.

The xml:lang attribute specifies the language used in an element's content (and, presumably, its attributes' values). If the element is also a link (e.g., <a href="xxx.xml" xml:lang="en">a link</a>) the attribute specifies the language of "a link", and not (except by implication) the value of the target file "xxx.xml".

This attribute uses the format of the Internet standard: RFC 1766 Tags for the Identification of Languages ( http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1766.txt ), which is best used with the following conventions:

(( "x-" lll ) | ll )( "-" CC ( "-" xx )* )? 

where

But note that "the two-character language codes of ISO 639 are recognized as being inadequate for use as SGML language attributes when tagging text" (Robin Cover, http://www.oasis-open.org/cover/iso639a.html). This means that, for metadata, the xml:lang attribute is mainly geared to providing information in a format that WWW tools will use. See question D.3 for more.
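The tag pattern given above can be checked mechanically. The sketch below turns it into a regular expression, under these assumptions (the key to the pattern is not reproduced in this text): "lll" is three lowercase letters, "ll" is two lowercase letters, "CC" is a two-letter uppercase country code, and each "xx" is a further subtag of one to eight letters or digits. RFC 1766 itself treats tags as case-insensitive, so the case rules here are a convention, not a requirement.

```python
import re

# Assumed reading of: (( "x-" lll ) | ll )( "-" CC ( "-" xx )* )?
LANG_TAG = re.compile(
    r"(?:x-[a-z]{3}|[a-z]{2})"       # "x-" lll, or ll
    r"(?:-[A-Z]{2}"                   # "-" CC
    r"(?:-[A-Za-z0-9]{1,8})*"         # ( "-" xx )*
    r")?"
)


def is_valid_lang(tag):
    """Return True if the whole tag fits the assumed pattern."""
    return LANG_TAG.fullmatch(tag) is not None


for tag in ["zh", "zh-TW", "zh-TW-CFR", "x-map-TW-ALV", "zho"]:
    print(tag, is_valid_lang(tag))
```

A bare three-letter code such as "zho" is rejected because the pattern only allows three letters after the "x-" prefix.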

Software which uses the lang attribute should match based on partial patterns, not exact matches based on the full pattern. In other words, if your software is looking for any Chinese text, it should accept "zh-TW", "zh-HK", etc., as well as simple "zh".
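Partial matching of this kind can be sketched in a few lines. The function name and the sample tags are made up for illustration; the last tag is deliberately not a Chinese tag, to show that a plain string-prefix test would be too loose.

```python
def lang_matches(tag, wanted):
    """Return True if `tag` equals `wanted` or is a more specific form of it."""
    tag, wanted = tag.lower(), wanted.lower()
    return tag == wanted or tag.startswith(wanted + "-")


# Hypothetical sample tags.
tags = ["zh", "zh-TW", "zh-HK", "en", "zhx-TW"]
chinese = [t for t in tags if lang_matches(t, "zh")]
print(chinese)
```

Matching on whole subtags (the "-" boundary) accepts "zh-TW" and "zh-HK" when looking for "zh", without also accepting an unrelated tag that merely begins with the same letters.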

ISO 639 has been extended with 2 (!) slightly different sets of 3-letter codes (see http://www.oasis-open.org/cover/iso639a.html): one based on MARC/NISO/Z39.53 codes (see http://www.oasis-open.org/cover/bib-mn.html#nisoZ3953-1994 and http://lcweb.loc.gov/marc/langann.html) and the other based on the native pronunciation of the language's name (e.g., for "Chinese", the former gives "chi" and the latter gives "zho"). These three-letter codes cannot be used in RFC 1766 attributes like xml:lang. Which three-letter code should you use? If you need backwards compatibility with MARC or Z39.53, then those codes are best: this is probably the case with many libraries. However, the other codes are not so "English biased" and may be better for future systems. (The general WWW principle of "be conservative in what you send, and generous in what you accept" means that good systems in the future should try to accept both.)

D.3 How can I represent "pinyin" or "traditional" or "simplified"?

Rick Jelliffe

There is a big difference between "language" and "script". However, WWW internationalization treats the two together. That is simpler, but is probably not good enough for serious metadata and cataloging. Many languages can be written in multiple scripts: especially languages of nations which have experienced colonization of various kinds (political, economic, cultural, religious, etc).

There is now an ISO standard for names of scripts: ISO 15924, Code for the representation of names of scripts. Information can be found at http://www.oasis-open.org/cover/related.html#iso15924. This standard was not available at the time the XML specification was written. (As of December 1998 it is "Committee Draft", which is the final stage before being accepted as a standard. See http://www.indigo.ie/egt/standards/iso15924/document/cd15924.pdf .)

For Chinese-related scripts:

3-letter code : 2-letter code : Code number : English name
Bod           : Bo            : 330         : Tibetan
Bpm           : Bp            : 285         : Bopomofo (Chinese)
Han           : Hn            : 500         : Han ideographs
Hgl           : Hg            : 420         : Hangul (Korean)
Hrg           : Hr            : 410         : Hiragana (Japanese)
Khn           : Kh            : 931         : Hgl + Han (Korean)
Jap           : Ja            : 930         : Han + Hrg + Kkn (Japanese)
Kkn           : Kn            :             : Katakana (Japanese, Okinawan)
Lat           : Lt            :             : Latin letters (e.g. for Pinyin, Vietnamese, Japanese romaji)

This standard is very helpful. But it does not provide a way to say "simplified" or "traditional".

One useful approach might be

One possible format might be an extended form of RFC 1766 Tags for the Identification of Languages ( http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1766.txt ):

Sss ( "-" lll ( "-" CC ( "-" xx )*)?)?

where

So, for example, for simplified Chinese script, writing the official (i.e., the default) Mandarin dialect:

<p xml:lang="zh-CN" script="Han-zho-CN-simplified">&#x4E2D;</p>

For Pinyin:

<p xml:lang="zh-CN" script="Lat-zho-CN-pinyin">zh&#x14D;ng</p>

For traditional Chinese script, writing the Taiwanese dialect of the Min Nan Chinese language (if that is important),

<p xml:lang="zh-TW-CFR" script="Han-zho-TW-traditional">&#x4E2D;</p>

For traditional Chinese script, writing the (Taiwanese Aboriginal Austronesian) Amis language (?does this ever happen?),

<p xml:lang="x-map-TW-ALV" 
script="Han-zho-TW-traditional">&#x4E2D;</p>

For Central Okinawan (see Ethnologue http://www.sil.org/ethnologue/countries/Japa.html), writing in katakana:

xml:lang="ja-JP-RYU" script="Kkn"
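Values in this proposed script format can be taken apart with a simple split. The sketch below assumes the parts are positional, in the order script code, three-letter language code, country code, then any number of further qualifiers; the class and function names are invented for the example.

```python
from typing import List, NamedTuple, Optional


class ScriptTag(NamedTuple):
    script: str                 # Sss, e.g. "Han"
    language: Optional[str]     # lll, e.g. "zho"
    country: Optional[str]      # CC, e.g. "TW"
    variants: List[str]         # xx..., e.g. ["traditional"]


def parse_script(value):
    """Split a script attribute value on "-" into its positional parts."""
    parts = value.split("-")
    return ScriptTag(
        script=parts[0],
        language=parts[1] if len(parts) > 1 else None,
        country=parts[2] if len(parts) > 2 else None,
        variants=parts[3:],
    )


tag = parse_script("Han-zho-TW-traditional")
print(tag.script, tag.language, tag.country, tag.variants)
```

Because every part after the script code is optional, a bare value such as "Lat" parses to a tag with no language, country or variants.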

If you use the script attribute in this form, you can use the following namespace declaration

xmlns:ascc-dcfaq="http://www.ascc.net/xml/en/utf-8/dc-faq.html"

and then use the attribute name ascc-dcfaq:script

Using namespaces, you can keep compatibility with different methods of marking up scripts and language. Until some good system comes, you may need multiple attributes.
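As a sketch of how software might read the namespaced attribute, the following uses Python's standard ElementTree module, which exposes namespaced attribute names in the form "{namespace-uri}localname". The sample element is made up for the example; the namespace URI is the one given above.

```python
import xml.etree.ElementTree as ET

ASCC_NS = "http://www.ascc.net/xml/en/utf-8/dc-faq.html"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample element using both attributes.
doc = (
    '<p xmlns:ascc-dcfaq="%s" xml:lang="zh-TW" '
    'ascc-dcfaq:script="Han-zho-TW-traditional">&#x4E2D;</p>' % ASCC_NS
)
p = ET.fromstring(doc)

# Look the attributes up by namespace URI, not by prefix.
script = p.get("{%s}script" % ASCC_NS)
lang = p.get(XML_LANG)
print(lang, script)
```

Looking the attribute up by namespace URI means the prefix chosen in a particular document does not matter, and attributes from other script-marking schemes can sit alongside it without clashing.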

Cataloging Information (Dublin Core)

<DC:TITLE       xml:lang="en">The Chinese Metadata FAQ </DC:TITLE>
<DC:CREATOR                  >Rick Jelliffe </DC:CREATOR>
<DC:SUBJECT     xml:lang="en">Dublin Core, DC, Resource Description Framework,
                              RDF, EAD, Encoded Archival Description,
                              Warwick Framework, XML, SGML, Chinese, FAQ
                              </DC:SUBJECT>
<DC:DESCRIPTION xml:lang="en">Frequently Asked Questions about using XML-based metadata, 
                              including for Chinese </DC:DESCRIPTION>
<DC:PUBLISHER   xml:lang="en">Computing Centre, Academia Sinica, Taiwan </DC:PUBLISHER>
<DC:TYPE        xml:lang="en">Text.Article </DC:TYPE>
<DC:DATE                     >1998-03-06 </DC:DATE>
<DC:RIGHTS                   >http://www.ascc.net/xml/en/utf-8/legal.html</DC:RIGHTS>