Metadata F.A.Q.

This FAQ is about using XML-based metadata (e.g., Dublin Core, RDF and EAD), including for Chinese-language metadata.

This FAQ was created to help answer questions arising in the Digital Library/Museum project at Academia Sinica, Taiwan.

A. General

A.1. What is Metadata?

Metadata is "data about data"; it is data used for cataloging, searching, archiving, electronic discovery, displaying, and so on. The key indication of the direction of the WWW on metadata comes from the inventor of the WWW, Tim Berners-Lee (in Metadata Architecture, at http://www.w3.org/DesignIssues/Metadata.html): "Metadata is machine understandable information about web resources or other things" but "metadata is data."

B. WWW Metadata "Standards"

For a good review of Metadata standards, see "Review of Metadata Formats" by Rachel Heery, 1996 (at http://www.oasis-open.org/cover/heery-review.html )

B.1. What is the Dublin Core (DC)?

The Dublin Core home page (http://purl.oclc.org/dc/index.htm) says "The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations."

At the moment, there are many different formats used to catalog information. Dublin Core provides a very simple subset (or classification) of them. In the future, we anticipate all catalogs will support access using Dublin Core metadata. Dublin Core defines 15 core element names.

For an example of Dublin Core metadata, see the end of this file.

The Dublin Core is not finalized. It is still being developed. All Dublin Core implementations are therefore, to a certain extent, experimental. Please also beware that much of the "draft" Dublin Core material is itself based on other drafts, so quite a lot of the information is unreliable.

B.2. What is Qualified Dublin Core (QDC)?

The simple Dublin Core elements are too simple for many users. Those users have started to develop subclasses of elements, based on the Dublin Core. So now there is "simple" Dublin Core, and "qualified" Dublin Core.

The Qualified Dublin Core elements are not finalized. They are still being developed. All Dublin Core implementations are therefore, to a certain extent, experimental.

B.3. Can I use Dublin Core with XML or HTML?

Yes. Dublin Core is merely a set of field names and descriptions. It is not a file format (or notation). However, you can implement Dublin Core using the following notations:

B.4. Can I use Dublin Core for Chinese Metadata?

Yes. Dublin Core is merely a set of field names and descriptions. It is not a file format. XML, SGML and HTML 4 all allow Han ideographs. There is more information on Chinese XML at the Chinese XML FAQ at http://www.ascc.net/xml/en/utf-8/faq.html. This FAQ will have specific answers to questions about Chinese metadata and Dublin Core too.
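As a rough illustration of the two answers above, the following Python sketch builds a small Dublin Core record in XML with a Chinese title. The namespace URI, the "record" wrapper element and the sample values are assumptions made for this example; none of them is defined by this FAQ.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the Dublin Core elements (not given in this FAQ).
DC_NS = "http://purl.org/dc/elements/1.1/"
# The xml: prefix is always bound to this URI by the XML Namespaces rules.
XML_NS = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("dc", DC_NS)


def make_record(fields):
    """Build a <record> element from (element name, language, text) triples."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, lang, text in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        if lang:
            child.set(f"{{{XML_NS}}}lang", lang)
        child.text = text
    return record


record = make_record([
    ("title", "zh-TW", "中文詮釋資料問答集"),  # sample value only
    ("title", "en", "The Chinese Metadata FAQ"),
    ("date", None, "1998-12-19"),
])

# encoding="unicode" returns a str, so the Han ideographs are kept as they are.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The same field names could equally be written out as HTML meta elements; only the notation changes, not the Dublin Core elements themselves.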

B.5. What is the Resource Description Framework (RDF)?

The RDF home page (http://www.w3.org/RDF/Overview.html) says "RDF is designed to provide an infrastructure to support metadata across many web-based activities." RDF is very verbose when used to mark up individual elements. It is also quite complex to understand.

The lesson of SGML is that there are great benefits in marking up documents generically, according to document type, rather than with specific details for each element. In many cases, this is the only way to handle large data sets. Applying this lesson to RDF, it is probable that RDF will be most beneficial when applied to element types rather than to element instances: the element type becomes a kind of macro for the RDF.

RDF is frequently mentioned in material about Dublin Core. But you can use Dublin Core without RDF.

RDF provides extra value to Dublin Core metadata. Generic RDF tools (for example, for visualization and discovery) will be able to use DC+RDF data even though they do not understand Dublin Core elements.
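One way to picture this is to treat each metadata statement as a (resource, property, value) triple. The sketch below is a deliberately simplified stand-in for the RDF model, not the RDF syntax itself; the Triple type, the property URIs and the sample statements are assumptions for illustration. The generic function groups statements by resource without knowing what any Dublin Core property means.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    resource: str  # the thing being described
    prop: str      # the property, named by a URI
    value: str     # the value of that property


# Assumed base URI for the property names (not given in this FAQ).
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical statements about two resources.
statements = [
    Triple("http://www.ascc.net/xml/en/utf-8/dc-faq.html", DC + "title",
           "The Chinese Metadata FAQ"),
    Triple("http://www.ascc.net/xml/en/utf-8/dc-faq.html", DC + "creator",
           "Rick Jelliffe"),
    Triple("http://www.ascc.net/xml/en/utf-8/faq.html", DC + "title",
           "Chinese XML FAQ"),
]


def describe(triples):
    """Group (property, value) pairs by resource, treating properties as opaque."""
    by_resource = defaultdict(list)
    for triple in triples:
        by_resource[triple.resource].append((triple.prop, triple.value))
    return dict(by_resource)


for resource, properties in describe(statements).items():
    print(resource)
    for prop, value in properties:
        print("   ", prop, "=", value)
```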

RDF is not finalized. It is still being developed. The Proposed Recommendation is due soon. All RDF implementations are therefore, to a certain extent, experimental.

B.6. Can I use RDF with XML or HTML?

Yes. RDF is a conceptual model to let you make assertions about elements of data. It is not a file format. However, you can implement RDF systems using XML (or SGML). You can implement some kinds of RDF using HTML too.

The proposed form of RDF, for use in XML (and SGML), makes use of wrapper elements. This has important consequences for XML (and SGML) systems:

It may be that RDF is best for database-style "fielded records" rather than for adding to existing free text data, or data ordered by the requirements of publishing systems.

B.7. What is the safest way to use Dublin Core and RDF?

The Dublin Core working group on Data Models has published some guidelines which seem safe. See http://www.mailbase.ac.uk/lists/dc-datamodel/1998-09/0029.html.

These basic guidelines can be summarized:

B.8. What is the Warwick Framework?

The Warwick Framework is a method to bundle packages (of metadata) together (into a container). The packages can be of different formats. The Warwick Framework lets you specify the relationship between packages in the container; this helps transmission and queries. The containers can be in

The paper defining the Warwick Framework is "The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata" by Carl Lagoze, Clifford A. Lynch and Ron Daniel Jr. (see http://lists.w3.org/Archives/Public/www-disw/msg00017.html).
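The container idea can be sketched as a small data structure. This is only an illustration of "packages in a container"; the class names, fields and sample values are invented for this example and are not taken from the Warwick Framework paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Package:
    """One bundle of metadata in a single format."""
    format: str   # e.g. "Dublin Core" or "EAD"
    content: str  # the metadata itself, kept opaque here


@dataclass
class Container:
    """A container aggregating packages that describe the same resource."""
    resource: str
    packages: List[Package] = field(default_factory=list)

    def add(self, package: Package) -> None:
        self.packages.append(package)

    def formats(self) -> List[str]:
        """List the formats of the packages, in the order they were added."""
        return [p.format for p in self.packages]

    def select(self, wanted_format: str) -> List[Package]:
        """Return only the packages in the requested format."""
        return [p for p in self.packages if p.format == wanted_format]


# Hypothetical usage: two packages in different formats for one object.
container = Container("http://example.org/object/1")
container.add(Package("Dublin Core", "<DC:TITLE>An example</DC:TITLE>"))
container.add(Package("EAD", "<ead>...</ead>"))
print(container.formats())
```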

B.9. What is the Encoded Archival Description (EAD)?

Encoded Archival Description (EAD) is an XML/SGML DTD for specifying finding aids for archived objects. It started at the Library of the University of California, Berkeley, and is now being developed by the Society of American Archivists and the Library of Congress (Network Development and MARC Standards Office). See http://www.loc.gov/ead/ead.html. It has been in use and testing for several years, and was officially released in 1998. EAD is a traditional XML/SGML DTD, and provides very detailed fields for metadata.

The EAD is very different from the Dublin Core and RDF. Perhaps we can say that Dublin Core is an incomplete "top-down" design, while EAD is a completed "bottom-up" design.

EAD won the 1998 "Coker Prize for Description" from the Society of American Archivists. See http://www.archivists.org/awards/coker.html

A good site is the Berkeley Digital Library http://sunsite.berkeley.edu/SGML/index.html. An example of EAD in use is http://sunsite2.berkeley.edu:28008/dynaweb/oac/bampfa/@Generic__CollectionView

B.10. What is PICS?

PICS is the Platform for Internet Content Selection. The PICS home page (http://www.w3.org/PICS/) says PICS "enables labels (metadata) to be associated with Internet content. It was originally designed to help parents and teachers control what children access on the Internet, but it also facilitates other uses for labels, including code signing and privacy."

There are several related specifications for PICS:

PICS is thus concerned with delivering metadata for use in establishing sessions and data access, not for archiving or data manipulation.

Note that there are also some attempts to extend PICS as a more general Schema language. For example, to describe Dublin Core: http://metadata.net/dstc/DC-10-EN/schema.txt.

B.11. What is the relation between EAD, PICS, Dublin Core, RDF and the Warwick Framework?

EAD, Dublin Core/RDF and PICS all serve different stages of document production. Let's use the terminology of the movie industry (Ted Nelson has said that hypertext is a kind of movie-making), but applied to electronic archives and Web sites:

EAD is therefore aimed at pre-production needs--archivists need to store all the relevant information that they have, whether or not it fits into nice Dublin Core categories. (The EAD Design Principles put it this way: "The needs of public users, curatorial and reference staff, and finding aid authors were given priority in the standard's design." Furthermore, "Finding aids are not objects of study but rather tools leading to such objects." See http://www.loc.gov/ead/eaddsgn.html.)

Dublin Core is aimed at (post-?)production needs--it is a simple interface for allowing interchange and access, but it does not attempt to provide any higher-level structures: it does not treat the differences in objects as significant: a stick is the same as a library.

RDF is aimed at post-production needs--providing a way to tie together information from lots of different schemas, which may have nothing to do with archiving.

PICS is aimed at session needs--deciding whether a type of information is appropriate for a particular user or situation.

The Warwick Framework is a format by which EAD, Dublin Core, RDF or PICS schemas can be bundled together and interchanged. (However, this functionality may be, to a certain extent, duplicated by RDF and perhaps EAD. I would expect RDF to win popular support, if they are competitors.)

(Note: Some of the material in this question was originally presented by Rick Jelliffe at seminars in Taiwan on behalf of III, Ministry of Defense, ApexSun and Adobe Systems. Used by permission.)

B.12. What is MARC?

MARC is the Big Daddy. But he has multiple personalities.

MARC records use the ISO 2709:1981 format. However, there is recent work to also provide an XML format.

B.13. What is ANSI Z39.50?

ANSI Z39.50 is a database query protocol, for querying a catalog about library holdings. It is very suitable for MARC data. Holdings information is (typically) returned in the OPAC format. More recently, CIMI has defined a "profile" to support retrieval of museum information too. See question B.15 below.

Z39.50 uses ISO ASN.1 (Abstract Syntax Notation 1) rather than SGML (See XML, question C.1.) This makes it much more efficient for many small transactions.

For a good bibliography, see Lynch, Clifford A. (1994). RFC 1729, Using the Z39.50 Information Retrieval Protocol in the Internet Environment. http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1729.txt

B.14. What is TEI?

The Text Encoding Initiative (TEI) is "an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research, and to satisfy a broad range of uses by the language industries more generally." The TEI has made a family of DTDs for all sorts of electronic texts. These include the ability to have some metadata, in the TEI headers. A text or object may have multiple TEI headers.

In complexity or richness, the TEI headers provide more than Dublin Core, but certainly less than EAD or MARC. Perhaps Qualified Dublin Core will be similar. TEI headers seem to be intended to allow suppliers of material to give a good headstart to catalogers, rather than being a relentless enumeration of every possible bibliographic detail.

The TEI Home Page is http://www.uic.edu/orgs/tei/. The current specifications can be found at http://etext.lib.virginia.edu/TEI.html. The specification for the header is in Chapter 5: http://etext.virginia.edu/bin/tei-tocs?div=DIV1&id=HD

B.15. What is CIMI?

CIMI is the Consortium for the Computer Interchange of Museum Information. Their home page is http://www.cimi.org/

CIMI's approach is to foster standards: in particular "SGML for structuring information and Z39.50 for information interchange." Specifically, CIMI uses a TEI-based DTD (document type definition) for defining structured documents. It seems that they may also be keen on supporting the Dublin Core, to some extent (how?).

The CIMI material includes some good mechanisms to feed post-production or session stages: for example "wall text" to accompany objects on display. The main one is the "access point" attribute.

For querying, CIMI defines a profile of Z39.50. It builds on the Library of Congress Collections Profile (see http://lcweb.loc.gov/z3950/agency/profiles/collections.html).

A history can be found at http://www.cimi.org/about/history.html

B.16. What is XPointer?

XPointer is a standard mechanism for locating elements in document structures using various criteria (attribute values, absolute and relative hierarchical position, etc). It is based on W3C URI hyperlinks, on ISO HyTime hyperlink navigation models, and on TEI location syntax. It is currently in draft at http://www.w3.org/TR/. An XPointer can be the data for a query or the result of a query.

B.17. What is FGDC?

FGDC is the Federal Geographic Data Committee. It has published a metadata standard for geospatial data: Content standards for digital geospatial metadata (June 8, 1994), Federal Geographic Data Committee, Washington, D.C. See http://geology.usgs.gov/tools/metadata/standard/metadata.html

B.18. What is CML?

CML is the Chemical Markup Language. It is an XML DTD for metadata for chemical documents.

B.19. What is BSML?

BSML is the Bioinformatic Sequence Markup Language. It is an XML DTD for metadata for genetic information. It has a rich set of presentation elements, so it may perhaps be regarded as a presentation DTD more than a metadata DTD. However, it features the ability to invoke data from many different formats.

B.20. What do we need for the Digital Library/Museum and other Academia Sinica metadata?

There are so many standards and semi-standards: what can we do? This is really an issue of system architecture rather than metadata, but it determines everything else.

When we look at the standards, however, we find that the information in them is, to a large extent, in common. Looked at this way, we can see that a metadata format is itself just a particular presentation of data. In this way it is no different from normal SGML or XML problems: how to make data systems in output-independent formats.

The solution which has worked well in SGML is to optimise the DTD used for each individual stage, and then to transform it at each stage to the form needed by the next stage. Very typically, information that end-users require is much simpler than information that data managers or intermediate programs require. If this approach is taken we get

All data-entry is in the pre-production format(s). There is no need for backwards transformations (i.e. from HTML back). This aids maintainability: the EAD or CIMI DTD users do not need to know anything about RDF or HTML. The Dublin Core and RDF users do not need to know anything about EAD or CIMI.

So which metadata format should the pre-production stage use: CIMI, EAD or something else? The answer is, we probably do not care! If the transformation-based infrastructure is in place, then we can support multiple specialist metadata source languages. This allows each source of metadata to concentrate on consistency and appropriateness rather than completeness (as against some universal metadata schema).

How? In general the method would be to

This architecture is very maintainable because the interfaces are clearly defined and conversions are automated. It is also very flexible, in that the data is made available in a wide range of formats. But most of all, it decouples input and output, which allows current data formats to continue to be used.

This approach of using a very specific and targeted DTD closely follows the one recommended in Light and Burnard's study Three SGML Metadata Formats: TEI, EAD and CIMI (http://hosted.ukoln.ac.uk/biblink/wp1/sgml/): it recommends keeping data in a specific DTD. It makes the interesting observation that the three DTDs could be combined: "one might use the EAD scheme to describe individual archival holdings down to the item level and then use TEI headers to describe individual documents, where these were deemed of sufficient importance to warrant the effort. Equally, one could embed CIMI topic descriptors within an otherwise purely TEI conformant document." (in Conclusion, 5.3 Use of schemes in combination, http://hosted.ukoln.ac.uk/biblink/wp1/sgml/conclusion.html) This is the approach in use at the Bodleian Library at the University of Oxford.

The approach of mapping or transforming data between DTDs is very common in the SGML electronic publishing world. Many commercial products support this kind of mapping (for example, DynaWeb, OmniMark). The largest Web site in the world for specialized data (Mead Data Central) uses this kind of approach. For mapping to RDF, the development of RDF-schema (http://www.w3.org/TR) is an example of RDF being attached to document types rather than to specific elements, so it should not be regarded as a novel approach.

Note. The mappings between MARC and Dublin Core (and GILS) are already available at the "Crosswalk" site http://www.loc.gov/marc/dccross.html. These mappings should be adopted where possible. However, the approach here has the advantage that both MARC and Dublin Core elements will largely be derived from more specific elements in the pre-production DTDs. It is trivial to go from specific to general automatically: it is just relabelling; it is impossible to go from general to specific without intelligent intervention. (See also Mappings between Metadata Formats compiled by Michael Day, http://www.ukoln.ac.uk/metadata/interoperability/)
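The "just relabelling" step can be sketched in a few lines of Python. The pre-production element names and the sample values below are hypothetical; a real mapping table would come from the DTD in use and from a published crosswalk.

```python
# Hypothetical pre-production element names mapped to Dublin Core element names.
TO_DUBLIN_CORE = {
    "painter": "CREATOR",
    "calligrapher": "CREATOR",
    "collection-name": "PUBLISHER",
}


def to_dublin_core(record):
    """Relabel (specific name, value) pairs; elements with no mapping are dropped."""
    return [
        (TO_DUBLIN_CORE[name], value)
        for name, value in record
        if name in TO_DUBLIN_CORE
    ]


# Sample record with invented values.
specific = [
    ("painter", "Painter A"),
    ("calligrapher", "Calligrapher B"),
    ("accession-number", "1234"),
]
general = to_dublin_core(specific)
print(general)
```

Both "painter" and "calligrapher" collapse into CREATOR here, which is why the reverse direction cannot be automated: nothing in the Dublin Core record says which CREATOR was which.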

Note. The kinds of issues raised by Lou Burnard concerning records for multiple forms of the same work (http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html) are best treated as issues belonging to the pre-production stage. The subsequent stages should not need to be rewritten to accommodate these kinds of issues.

C. XML Questions

C.1. What is XML?

XML ( http://www.w3.org/XML/) is a version of the ISO standard generalized markup language SGML. Most new WWW markup languages are written using XML now. See question B.3.

C.2. Is the standard XML attribute xml:lang good enough for metadata?

XML (http://www.w3.org/XML/) provides a standard attribute xml:lang, which can be used on any element to set the language. See the Chinese XML FAQ for details (http://www.ascc.net/xml/en/utf-8/faq.html#zh_xml_q15). HTML 4 provides an equivalent attribute: lang.

The xml:lang attribute specifies the language used in an element's content (and, presumably, its attributes' values). If the element is also a link (e.g., <a href="xxx.xml" xml:lang="en">a link</a>) the attribute specifies the language of "a link", and not (except by implication) the value of the target file "xxx.xml".

This attribute uses the format of the Internet standard: RFC 1766 Tags for the Identification of Languages ( http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1766.txt ), which is best used with the following conventions:

(( "x-" lll ) | ll )( "-" CC ( "-" xx )* )? 

where

But note that "the two-character language codes of ISO 639 are recognized as being inadequate for use as SGML language attributes when tagging text" (Robin Cover, http://www.oasis-open.org/cover/iso639a.html). This means that, for metadata, the xml:lang attribute is mainly geared to providing information in a format that WWW tools will use. See question C.3 for more.

Software which uses the lang attribute should match based on partial patterns, not exact matches based on the full pattern. In other words, if your software is looking for any Chinese text, it should accept "zh-TW", "zh-HK", etc., as well as simple "zh".
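A minimal sketch of such partial matching in Python follows. The function name is invented for this example, and the rule used (case-insensitive comparison of whole hyphen-separated subtags) is one reasonable reading of "partial patterns", not a rule quoted from RFC 1766.

```python
def lang_matches(tag, wanted):
    """Return True if `tag` equals `wanted` or is a more specific form of it.

    The comparison is case-insensitive and works on whole subtags, so
    "zh-TW" matches "zh" but "zhx" does not.
    """
    tag_parts = tag.lower().split("-")
    wanted_parts = wanted.lower().split("-")
    return tag_parts[:len(wanted_parts)] == wanted_parts


for tag in ["zh", "zh-TW", "zh-HK", "en", "zhx"]:
    print(tag, lang_matches(tag, "zh"))
```

Comparing subtags rather than raw string prefixes avoids accepting an unrelated code that merely begins with the same letters.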

ISO 639 has been extended with 2 (!) slightly different sets of 3-letter codes (see http://www.oasis-open.org/cover/iso639a.html): one based on MARC/NISO/Z39.53 codes (see http://www.oasis-open.org/cover/bib-mn.html#nisoZ3953-1994 and http://lcweb.loc.gov/marc/langann.html) and the other based on the native pronunciation of the language's name (e.g., for "Chinese", the former gives "chi" and the latter gives "zho"). These three-letter codes cannot be used in RFC???? attributes like xml:lang. Which three-letter code should you use? If you need backwards compatibility with MARC or Z39.53, then those codes are best: this is probably the case with many libraries. However, the other codes are not so "English biased" and may be better for future systems. (The general WWW principle of "be conservative in what you send, and generous in what you accept" means that good systems in the future should try to accept both.)

C.3. How can I represent "pinyin" or "traditional" or "simplified"?

There is a big difference between "language" and "script". However, WWW internationalization treats the two together. That is simpler, but is probably not good enough for serious metadata and cataloging. Many languages can be written in multiple scripts: especially languages of nations which have experienced colonization of various kinds (political, economic, cultural, religious, etc).

There is now an ISO standard for names of scripts: ISO 15924, Code for the representation of names of scripts (see http://www.oasis-open.org/cover/related.html#iso15924). This standard was not available at the time the XML specification was written. (As of December 1998 it is "Committee Draft", which is the final stage before being accepted as a standard. See http://www.indigo.ie/egt/standards/iso15924/document/cd15924.pdf .)

For Chinese-related scripts:

3-letter code   2-letter code   Code number   English name
Bod             Bo              330           Tibetan
Bpm             Bp              285           Bopomofo (Chinese)
Han             Hn              500           Han ideographs
Hgl             Hg              420           Hangul (Korean)
Hrg             Hr              410           Hiragana (Japanese)
Khn             Kh              931           Hgl + Han (Korean)
Jap             Ja              930           Han + Hrg + Kkn (Japanese)
Kkn             Kn                            Katakana (Japanese, Okinawan)
Lat             Lt                            Latin letters (e.g. for Pinyin, Vietnamese, Japanese romaji)

This standard is very helpful. But it does not provide a way to say "simplified" or "traditional".

(Rick: need to check the following against TEI and Chinese standards)

One useful approach might be

One possible format might be an extended form of RFC 1766 Tags for the Identification of Languages ( http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1766.txt )

Sss ( "-" lll ( "-" CC ( "-" xx )*)?)?

where

So, for example, for simplified Chinese script, writing the official (i.e., the default) Mandarin dialect:

<p xml:lang="zh-CN" script="Han-zho-CN-simplified">&#x4E2D;</p>

For Pinyin:

<p xml:lang="zh-CN" script="Lat-zho-CN-pinyin" >

For traditional Chinese script, writing the Taiwanese dialect of the Min Nan Chinese language (if that is important),

<p xml:lang="zh-TW-CFR" script="Han-zho-TW-traditional">&#x4E2D;</p>

For traditional Chinese script, writing the (Taiwanese Aboriginal Austronesian) Amis language (?does this ever happen?),

<p xml:lang="x-map-TW-ALV" script="Han-zho-TW-traditional">&#x4E2D;</p>

For Central Okinawan (see Ethnologue http://www.sil.org/ethnologue/countries/Japa.html), writing in katakana:

xml:lang="ja-JP-RYU" script="Kkn"
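Values in this proposed format are easy to take apart by program. The Python sketch below assumes a particular reading of the pattern, since the details are not fixed here: Sss is a three-letter script code with an initial capital, lll is a three-letter lowercase language code, CC is a two-letter uppercase country code, and each xx is a free-form qualifier. The function and type names are invented for this example.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ScriptValue(NamedTuple):
    script: str
    language: Optional[str]
    country: Optional[str]
    extras: Tuple[str, ...]


# Assumed reading of: Sss ( "-" lll ( "-" CC ( "-" xx )*)?)?
_PATTERN = re.compile(
    r"(?P<script>[A-Z][a-z]{2})"
    r"(?:-(?P<language>[a-z]{3})"
    r"(?:-(?P<country>[A-Z]{2})"
    r"(?P<extras>(?:-[A-Za-z0-9]+)*))?)?"
)


def parse_script(value: str) -> ScriptValue:
    """Split a script attribute value into its parts, or raise ValueError."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid script value: {value!r}")
    extras = match.group("extras") or ""
    return ScriptValue(
        script=match.group("script"),
        language=match.group("language"),
        country=match.group("country"),
        extras=tuple(part for part in extras.split("-") if part),
    )


print(parse_script("Han-zho-CN-simplified"))
print(parse_script("Lat-zho-CN-pinyin"))
print(parse_script("Han"))
```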

If you use the script attribute in this form, you can use the following namespace declaration

xmlns:ascc-dcfaq="http://www.ascc.net/xml/en/utf-8/dc-faq.html"

and then use the attribute name ascc-dcfaq:script

Using namespaces, you can keep compatibility with different methods of marking up scripts and language. Until some good system comes, you may need multiple attributes.
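As a sketch of how software might read both attributes, the following Python fragment parses a small document that uses the namespace declaration given above. The "doc" wrapper element and the sample paragraph are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the script attribute, as declared above.
SCRIPT_NS = "http://www.ascc.net/xml/en/utf-8/dc-faq.html"
# The xml: prefix is always bound to this URI by the XML Namespaces rules.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical document; <doc> is only a wrapper for the example.
document = (
    '<doc xmlns:ascc-dcfaq="http://www.ascc.net/xml/en/utf-8/dc-faq.html">'
    '<p xml:lang="zh-CN" ascc-dcfaq:script="Han-zho-CN-simplified">&#x4E2D;</p>'
    '</doc>'
)

root = ET.fromstring(document)
results = []
for p in root.iter("p"):
    # ElementTree stores namespaced attributes under "{uri}localname" keys.
    lang = p.get(f"{{{XML_NS}}}lang")
    script = p.get(f"{{{SCRIPT_NS}}}script")
    results.append((lang, script, p.text))

print(results)
```

Because the attribute is looked up by namespace URI rather than by prefix, a document that binds the same URI to a different prefix is read in exactly the same way.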

C.4. Can I use a CMARC Code if I really want to?

Sure. You can add any kind of attributes you like to XML-based metadata.

C.5. My browser keeps showing simplified characters

Some WWW browsers or servers use the locale of the user or the character encoding of the user to figure out whether to select traditional or simplified characters. In the future, more documents will use stylesheets such as CSS or XSL ( http://www.w3.org/Style/ ) which will let you specify a simplified or traditional character font, based on the value of the attribute.

Try to make all your XML and HTML data use the correct language and character set tags. More and more applications save their data as ISO 10646 text (Unicode, UTF-8, UTF-16). Unless you put in the correct markup, applications in other places on the World Wide Web will have to guess whether to use traditional or simplified fonts.

Please note that some fonts do not have all Han ideographs from ISO 10646. Try to make sure all your fonts are Unicode-compatible (or, at least, that they have all the characters in Big5/GB12345).


Cataloging Information (Dublin Core)

<DC:TITLE       xml:lang="en">The Chinese Metadata FAQ </DC:TITLE>
<DC:CREATOR                  >Rick Jelliffe </DC:CREATOR>
<DC:SUBJECT     xml:lang="en">Dublin Core, DC, Resource Description Framework,
                              RDF, EAD, Encoded Archival Description,
                              Warwick Framework, XML, SGML, Chinese, FAQ
                              </DC:SUBJECT>
<DC:DESCRIPTION xml:lang="en">Frequently Asked Questions about using XML-based metadata, 
                              including for Chinese </DC:DESCRIPTION>
<DC:PUBLISHER   xml:lang="en">Computer Center, Academia Sinica, Taiwan </DC:PUBLISHER>
<DC:TYPE        xml:lang="en">Text.Article </DC:TYPE>
<DC:DATE                     >1998-12-19 </DC:DATE>
<DC:RIGHTS                   >http://www.ascc.net/xml/en/utf-8/legal.html </DC:RIGHTS>