This FAQ is about using XML-based metadata (e.g., Dublin Core, RDF and EAD), including for Chinese-language metadata.
This FAQ was created to help answer questions arising in the Digital Library/Museum project at Academia Sinica, Taiwan.
Metadata is "data about data"; it is data for the purposes of cataloging, searching, archiving, electronic discovery, displaying, and so on. The key indication of the direction of the WWW on metadata comes from the inventor of the WWW, Tim Berners-Lee (in Metadata Architecture, at http://www.w3.org/DesignIssues/Metadata.html) "Metadata is machine understandable information about web resources or other things" but "metadata is data."
For a good review of Metadata standards, see "Review of Metadata Formats" by Rachel Heery, 1996 (at http://www.oasis-open.org/cover/heery-review.html )
The Dublin Core home page (http://purl.oclc.org/dc/index.htm) says "The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations."
At the moment, there are many different formats used to catalog information. Dublin Core provides a very simple subset (or classification) of them. In the future, we anticipate all catalogs will support access using Dublin Core metadata. Dublin Core defines 15 core element names.
For an example of Dublin Core metadata, see the end of this file.
The Dublin Core is not finalized. It is still being developed. All Dublin Core implementations are therefore, to a certain extent, experimental. Please also beware that much of the "draft" Dublin Core material is itself based on other drafts--so quite a lot of the information is unreliable.
The simple Dublin Core elements are too simple for many users. They have started to develop subclasses of elements, based on the Dublin Core. So now there is "simple" Dublin Core, and "qualified" Dublin Core.
The Qualified Dublin Core elements are not finalized. They are still being developed. All Qualified Dublin Core implementations are therefore, to a certain extent, experimental.
Yes. Dublin Core is merely a set of field names and descriptions. It is not a file format (or notation). However, you can implement Dublin Core using the following notations:
Yes. Dublin Core is merely a set of field names and descriptions. It is not a file format. XML, SGML and HTML 4 all allow Han ideographs. There is more information on Chinese XML at the Chinese XML FAQ at http://www.ascc.net/xml/en/utf-8/faq.html. This FAQ will have specific answers to questions about Chinese metadata and Dublin Core too.
The RDF home page (http://www.w3.org/RDF/Overview.html) says "RDF is designed to provide an infrastructure to support metadata across many web-based activities." It is very verbose, when used to mark up elements. RDF is also quite complex to understand.
The lesson of SGML is that there are great benefits in marking up documents generically, according to document type, rather than with specific details for each element. In many cases this is the only way to handle large data sets. Applying this lesson to RDF, it is probable that RDF will be most beneficial when applied to element types rather than to element instances: the element type becomes a kind of macro for the RDF.
RDF is frequently mentioned in material about Dublin Core. But you can use Dublin Core without RDF.
RDF provides extra value to Dublin Core metadata. Generic RDF tools (for example, for visualization and discovery) will be able to use DC+RDF data even though they do not understand Dublin Core elements.
RDF is not finalized. It is still being developed. The Proposed Recommendation is due soon. All RDF implementations are therefore, to a certain extent, experimental.
Yes. RDF is a conceptual model to let you make assertions about elements of data. It is not a file format. However, you can implement RDF systems using XML (or SGML). You can implement some kinds of RDF using HTML too.
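As a rough illustration of "assertions about data", here is a minimal Python sketch that treats each assertion as a (resource, property, value) triple. The choice of resource URL and the use of the DC: element names as property names are assumptions made for the example; this is not an RDF parser or an official RDF serialization.

```python
from collections import defaultdict

# Assumption: the FAQ's own URL is used as the resource being described.
FAQ = "http://www.ascc.net/xml/en/utf-8/dc-faq.html"

# Each assertion is a (resource, property, value) triple.
statements = [
    (FAQ, "DC:TITLE", "The Chinese Metadata FAQ"),
    (FAQ, "DC:CREATOR", "Rick Jelliffe"),
    (FAQ, "DC:DATE", "1998-12-19"),
]


def describe(resource, triples):
    """Collect every property/value pair asserted about one resource."""
    properties = defaultdict(list)
    for subject, prop, value in triples:
        if subject == resource:
            properties[prop].append(value)
    return dict(properties)
```

Note that describe() never looks at what a property name means. That is the sense in which a generic tool could work with DC+RDF data without understanding the Dublin Core elements themselves.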
The proposed form of RDF, for use in XML (and SGML), makes use of wrapper elements. This has important consequences for XML (and SGML) systems:
It may be that RDF is best for database-style "fielded records" rather than for adding to existing free text data, or data ordered by the requirements of publishing systems.
The Dublin Core working group on Data Models has published some guidelines which seem safe. See http://www.mailbase.ac.uk/lists/dc-datamodel/1998-09/0029.html.
These basic guidelines can be summarized:
The Warwick Framework is a method to bundle packages (of metadata) together (into a container). The packages can be of different formats. The Warwick Framework lets you specify the relationship between packages in the container; this helps transmission and queries. The containers can be in
The paper defining the Warwick Framework is "The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata" by Carl Lagoze, Clifford A. Lynch, Ron Daniel Jr., (refer http://lists.w3.org/Archives/Public/www-disw/msg00017.html)
Encoded Archival Description (EAD) is an XML/SGML DTD for specifying finding aids for archived objects. It started at the Library of the University of California, Berkeley, and is now being developed by the Society of American Archivists and the Network Development and MARC Standards Office of the Library of Congress. See http://www.loc.gov/ead/ead.html. It has been in use and testing for several years, and was officially released in 1998. EAD is a traditional XML/SGML DTD, and provides very detailed fields for metadata.
The EAD is very different from the Dublin Core and RDF. Perhaps we can say that Dublin Core is an incomplete "top-down" design, while EAD is a completed "bottom-up" design.
EAD won the 1998 "Coker Prize for Description" from the Society of American Archivists. See http://www.archivists.org/awards/coker.html
A good site is the Berkeley Digital Library http://sunsite.berkeley.edu/SGML/index.html. An example of EAD in use is http://sunsite2.berkeley.edu:28008/dynaweb/oac/bampfa/@Generic__CollectionView
PICS is the Platform for Internet Content Selection. The PICS home page (http://www.w3.org/PICS/) says PICS "enables labels (metadata) to be associated with Internet content. It was originally designed to help parents and teachers control what children access on the Internet, but it also facilitates other uses for labels, including code signing and privacy."
There are several related specifications for PICS :
PICS is thus concerned with delivering metadata for use in establishing sessions and data access, not for archiving or data manipulation.
Note that there are also some attempts to extend PICS as a more general Schema language. For example, to describe Dublin Core: http://metadata.net/dstc/DC-10-EN/schema.txt.
EAD, Dublin Core/RDF and PICS all serve different stages of document production. Let's use the terminology of the movie industry (Ted Nelson has said that hypertext is a kind of movie-making), but applied to electronic archive Web sites:
EAD is therefore aimed at pre-production needs--archivists need to store all the relevant information that they have, whether or not it fits into nice Dublin Core categories. (The EAD Design Principles put it this way: "The needs of public users, curatorial and reference staff, and finding aid authors were given priority in the standard's design." Furthermore, "Finding aids are not objects of study but rather tools leading to such objects." See http://www.loc.gov/ead/eaddsgn.html.)
Dublin Core is aimed at (post-?)production needs--it is a simple interface for allowing interchange and access, but it does not attempt to provide any higher-level structures: it does not treat the differences in objects as significant: a stick is the same as a library.
RDF is aimed at post-production needs--providing a way to tie together information from lots of different schemas, which may have nothing to do with archiving.
PICS is aimed at session needs--deciding whether a type of information is appropriate for a particular user or situation.
The Warwick Framework is a format by which EAD, Dublin Core, RDF or PICS schemas can be bundled together and interchanged. (However, this functionality may be, to a certain extent, duplicated by RDF and perhaps EAD. I would expect RDF to win popular support, if they are competitors.)
(Note: Some of the material in this question was originally presented by Rick Jelliffe at seminars in Taiwan on behalf of III, Ministry of Defense, ApexSun and Adobe Systems. Used by permission.)
MARC is the Big Daddy. But he has multiple personalities.
MARC records use the ISO 2709:1981 format. However, there is recent work to also provide an XML format.
ANSI Z39.50 is a database query protocol, for querying a catalog about library holdings. It is very suitable for MARC data. Holdings information is (typically) returned in the OPAC format. More recently, CIMI has defined a "profile" to support retrieval of museum information too. See question B.15 below.
Z39.50 uses ISO ASN.1 (Abstract Syntax Notation 1) rather than SGML (See XML, question C.1.) This makes it much more efficient for many small transactions.
For a good bibliography, see Lynch, Clifford A. (1994). RFC 1729, Using the Z39.50 Information Retrieval Protocol in the Internet Environment. http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1729.txt
The Text Encoding Initiative (TEI) is "an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research, and to satisfy a broad range of uses by the language industries more generally." The TEI has produced a family of DTDs for all sorts of electronic texts. These include the ability to carry some metadata, in the TEI headers. A text or object may have multiple TEI headers.
In complexity or richness, the TEI headers provide more than Dublin Core, but certainly less than EAD or MARC. Perhaps Qualified Dublin Core will be similar. TEI headers seem to be intended to allow suppliers of material to give a good headstart to catalogers, rather than being a relentless enumeration of every bibliographic possibility.
The TEI Home Page is http://www.uic.edu/orgs/tei/. The current specifications can be found at http://etext.lib.virginia.edu/TEI.html. The specification for the header is in Chapter 5: http://etext.virginia.edu/bin/tei-tocs?div=DIV1&id=HD
CIMI is the Consortium for the Computer Interchange of Museum Information. Their home page is http://www.cimi.org/
CIMI's approach is to foster standards: in particular "SGML for structuring information and Z39.50 for information interchange." CIMI uses a TEI-based DTD (document type definition) for defining structured documents. It seems that they may also be keen on supporting the Dublin Core, to some extent (how?).
The CIMI material includes some good mechanisms to feed post-production or session stages: for example "wall text" to accompany objects on display. The main one is the "access point" attribute.
For querying, CIMI define a profile of Z39.50. It builds on the Library of Congress Collections Profile (see http://lcweb.loc.gov/z3950/agency/profiles/collections.html )
A history can be found at http://www.cimi.org/about/history.html
XPointer is a standard mechanism for locating elements in document structures using various criteria (attribute values, absolute and relative hierarchical position, etc). It is based on W3C URI hyperlinks, on ISO HyTime hyperlink navigation models, and on TEI location syntax. It is currently in draft (see http://www.w3.org/TR/). An XPointer can be the data for a query or the result of a query.
Federal Geographic Data Committee. 1994. Content standards for digital geospatial metadata (June 8). Federal Geographic Data Committee. Washington, D.C. http://geology.usgs.gov/tools/metadata/standard/metadata.html
CML is the Chemical Markup Language. It is an XML DTD for metadata for chemical documents.
BSML is the Bioinformatic Sequence Markup Language. It is an XML DTD for metadata for genetic information. It has a rich set of presentation elements, so it may perhaps be regarded as a presentation DTD more than a metadata DTD. However, it features the ability to invoke data from many different formats.
There are so many standards and semi-standards: what can we do? This is really an issue of system architecture rather than metadata, but it determines everything else.
When we look at the standards, however, we find that the information in them is, to a large extent, in common. Looked at this way, we can see that a metadata format is itself just a particular presentation of data. In this way it is no different from normal SGML or XML problems: how to make data systems in output-independent formats.
The solution which has worked well in SGML is to optimise the DTD used for each individual stage, and then to transform it at each stage to the form needed by the next stage. Very typically, information that end-users require is much simpler than information that data managers or intermediate programs require. If this approach is taken we get
All data-entry is in the pre-production format(s). There is no need for backwards transformations (i.e. from HTML back). This aids maintainability: the EAD or CIMI DTD users do not need to know anything about RDF or HTML. The Dublin Core and RDF users do not need to know anything about EAD or CIMI.
So which metadata format should the pre-production stage use: CIMI, EAD, or something else? The answer is, we probably do not care! If the transformation-based infrastructure is in place, then we can support multiple specialist metadata source languages. This allows each source of metadata to concentrate on consistency and appropriateness rather than completeness (as against some universal metadata schema).
How? In general the method would be to
This architecture is very maintainable because the interfaces are clearly defined and conversions are automated. It is also very flexible, in that the data is made available in a wide range of formats. But most of all, it decouples input and output, which allows current data formats to continue to be used.
This approach of using a very specific and targeted DTD closely follows that recommended in Light and Burnard's study Three SGML Metadata Formats: TEI, EAD and CIMI (http://hosted.ukoln.ac.uk/biblink/wp1/sgml/ ): it recommends keeping data in a specific DTD. It makes the interesting observation that the three DTDs could be combined: "one might use the EAD scheme to describe individual archival holdings down to the item level and then use TEI headers to describe individual documents, where these were deemed of sufficient importance to warrant the effort. Equally, one could embed CIMI topic descriptors within an otherwise purely TEI conformant document." (in Conclusion, 5.3 Use of schemes in combination-- http://hosted.ukoln.ac.uk/biblink/wp1/sgml/conclusion.html ) This is the approach in use at the Bodleian Library at the University of Oxford.
The approach of mapping or transforming data between DTDs is very common in the SGML electronic publishing world. Many commercial products support this kind of mapping (for example, DynaWeb, OmniMark). The largest Web site in the world for specialized data (Mead Data Central) uses this kind of approach. For mapping to RDF, the development of RDF-schema ( http://www.w3.org/TR ) is an example of RDF being attached to document types rather than to specific elements, so it should not be regarded as a novel approach.
Note. The mappings between MARC and Dublin Core (and GILS) are already available at the "Crosswalk" site http://www.loc.gov/marc/dccross.html. These mappings should be adopted where possible. However, the approach here has the advantage that both MARC and Dublin Core elements will largely be derived from more specific elements in the pre-production DTDs. It is trivial to go from specific to general automatically: it is just relabelling; it is impossible to go from general to specific without intelligent intervention. (See also Mappings between Metadata Formats compiled by Michael Day http://www.ukoln.ac.uk/metadata/interoperability/ )
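The "relabelling" point can be sketched in a few lines of Python. The table below is a hypothetical crosswalk invented for this example: the left-hand names are EAD-style element names, but the mapping itself is not taken from the Library of Congress Crosswalk or from any published source.

```python
# Hypothetical crosswalk: specific pre-production element names on the left,
# general Dublin Core element names on the right.
CROSSWALK = {
    "unittitle": "TITLE",
    "persname": "CREATOR",
    "corpname": "CREATOR",
    "unitdate": "DATE",
    "abstract": "DESCRIPTION",
}


def to_dublin_core(record):
    """Relabel a {specific element: value} record as {DC element: [values]}."""
    dc = {}
    for name, value in record.items():
        target = CROSSWALK.get(name)
        if target is None:
            continue  # no general equivalent: the detail is simply dropped
        dc.setdefault(target, []).append(value)
    return dc
```

Because both persname and corpname collapse into CREATOR here, the result no longer records whether a creator was a person or an organization. That lost distinction is why the reverse direction needs intelligent intervention.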
Note. The kinds of issues raised by Lou Burnard concerning records for multiple forms of the same work ( http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html ) are best treated as issues belonging to the pre-production stage. The subsequent stages should not need to be rewritten to accommodate these kinds of issues.
XML ( http://www.w3.org/XML/) is a version of the ISO standard generalized markup language SGML. Most new WWW markup languages are written using XML now. See question B.3.
Is xml:lang good enough for metadata?
XML ( http://www.w3.org/XML/) provides a standard attribute, xml:lang, which can be used on any element to set the language. See the Chinese XML FAQ for details (http://www.ascc.net/xml/en/utf-8/faq.html#zh_xml_q15).
HTML also provides an equivalent attribute, lang (written html:lang in this FAQ).
The xml:lang attribute specifies the language used in an element's content (and, presumably, its attributes' values). If the element is also a link (e.g., <a href="xxx.xml" xml:lang="en">a link</a>) the attribute specifies the language of "a link", and not (except by implication) the language of the target file "xxx.xml".
This attribute uses the format of the Internet standard: RFC 1766 Tags for the Identification of Languages ( http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1766.txt ), which is best used with the following conventions:
(( "x-" lll ) | ll )( "-" CC ( "-" xx )* )?
where
But note that "the two-character language codes of ISO 639 are recognized as being inadequate for use as SGML language attributes when tagging text" (Robin Cover, http://www.oasis-open.org/cover/iso639a.html). This means that, for metadata, the xml:lang attribute is mainly geared to providing information in a format that WWW tools will use. See question C.3 for more.
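The tag convention given above can be checked mechanically. The following Python sketch reads the pattern under these assumptions, which are ours rather than anything stated by RFC 1766: ll is a two-letter lowercase language code, lll is a three-letter lowercase code allowed only after "x-", CC is a two-letter uppercase country code, and each xx is a further subtag of one to eight letters.

```python
import re

# Assumed reading of: (( "x-" lll ) | ll )( "-" CC ( "-" xx )* )?
LANG_TAG = re.compile(
    r"""
    (?: x-(?P<private>[a-z]{3}) | (?P<iso639>[a-z]{2}) )  # x-lll or ll
    (?: -(?P<country>[A-Z]{2})                            # optional -CC
        (?P<rest>(?:-[A-Za-z]{1,8})*)                     # zero or more -xx
    )?
    """,
    re.VERBOSE,
)


def parse_lang(tag):
    """Split a tag into its parts, or return None if it breaks the convention."""
    m = LANG_TAG.fullmatch(tag)
    if m is None:
        return None
    return {
        "language": m.group("private") or m.group("iso639"),
        "private": m.group("private") is not None,
        "country": m.group("country"),
        "subtags": [s for s in (m.group("rest") or "").split("-") if s],
    }
```

For example, parse_lang("zh-TW") yields the language "zh" and country "TW", while a bare three-letter code such as "chi" is rejected because the convention only admits three-letter codes behind the "x-" prefix. The sketch is deliberately case-sensitive; software that only needs to compare tags should fold case first.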
Software which uses the lang attribute should match based on partial patterns, not exact matches based on the full pattern. In other words, if your software is looking for any Chinese text, it should accept "zh-TW", "zh-HK", etc., as well as simple "zh".
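A minimal Python sketch of this kind of partial matching follows. The function name lang_matches is made up for the example, and comparing without regard to case is our assumption about how tolerant the software should be.

```python
def lang_matches(wanted, tag):
    """Return True if tag equals wanted or is a more specific form of it."""
    wanted = wanted.lower()
    tag = tag.lower()
    # Match on whole subtags only, so that "zh" does not match "zha".
    return tag == wanted or tag.startswith(wanted + "-")
```

So lang_matches("zh", "zh-TW") and lang_matches("zh", "zh") both succeed, while lang_matches("zh-TW", "zh") fails, because a request for Taiwan-specific text is not satisfied by text marked only as Chinese.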
ISO 639 has been extended with 2 (!) slightly different sets of 3-letter codes (see http://www.oasis-open.org/cover/iso639a.html): one based on MARC/NISO/Z39.53 codes (see http://www.oasis-open.org/cover/bib-mn.html#nisoZ3953-1994 and http://lcweb.loc.gov/marc/langann.html) and the other based on the native pronunciation of the language's name (e.g., for "Chinese", the former gives "chi" and the latter gives "zho"). These three-letter codes cannot be used in RFC 1766 attributes like xml:lang. Which three-letter code should you use? If you need backwards compatibility with MARC or Z39.53, then those codes are best: this is probably the case with many libraries. However, the other codes are not so "English biased" and may be better for future systems. (The general WWW principle of "be conservative in what you send, and generous in what you accept" means that good systems in the future should try to accept both.)
There is a big difference between "language" and "script". However, WWW internationalization treats the two together. That is simpler, but is probably not good enough for serious metadata and cataloging. Many languages can be written in multiple scripts: especially languages of nations which have experienced colonization of various kinds (political, economic, cultural, religious, etc).
There is now an ISO standard for names of scripts: ISO 15924, Code for the representation of names of scripts. Information can be found at http://www.oasis-open.org/cover/related.html#iso15924. This standard was not available at the time the XML specification was written. (As of December 1998 it is "Committee Draft", which is the final stage before being accepted as a standard. See http://www.indigo.ie/egt/standards/iso15924/document/cd15924.pdf .)
For Chinese-related scripts:
| 3-letter code | 2-letter code | Code number | English name |
| Bod | Bo | 330 | Tibetan |
| Bpm | Bp | 285 | Bopomofo (Chinese) |
| Han | Hn | 500 | Han ideographs |
| Hgl | Hg | 420 | Hangul (Korean) |
| Hrg | Hr | 410 | Hiragana (Japanese) |
| Khn | Kh | 931 | Hgl + Han (Korean) |
| Jap | Ja | 930 | Han + Hrg + Kkn (Japanese) |
| Kkn | Kn | | Katakana (Japanese, Okinawan) |
| Lat | Lt | | Latin letters (e.g. for Pinyin, Vietnamese, Japanese romaji) |
This standard is very helpful. But it does not provide a way to say "simplified" or "traditional".
(Rick: need to check the following against TEI and Chinese standards)
One useful approach might be to use the xml:lang (or html:lang) attribute to indicate locale-based characteristics (language, locale, dialect), and a script attribute to indicate specific script characteristics.
One possible format might be an extended form of RFC 1766 Tags for the Identification of Languages ( http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1766.txt )
Sss ( "-" lll ( "-" CC ( "-" xx )*)?)?
where
So, for example, for simplified Chinese script, writing the official (i.e., the default) Mandarin dialect:
<p xml:lang="zh-CN" script="Han-zho-CN-simplified">中</p>
For Pinyin:
<p xml:lang="zh-CN" script="Lat-zho-CN-pinyin" >
For traditional Chinese script, writing the Taiwanese dialect of the Min Nan Chinese language (if that is important),
<p xml:lang="zh-TW-CFR" script="Han-zho-TW-traditional">中</p>
For traditional Chinese script, writing the (Taiwanese Aboriginal Austronesian) Amis language (?does this ever happen?),
<p xml:lang="x-map-TW-ALV" script="Han-zho-TW-traditional">中</p>
For Central Okinawan (see Ethnologue http://www.sil.org/ethnologue/countries/Japa.html), writing in katakana:
xml:lang="ja-JP-RYU" script="Kkn"
If you use the script attribute in this form, you can use the following namespace declaration
xmlns:ascc-dcfaq="http://www.ascc.net/xml/en/utf-8/dc-faq.html"
and then use the attribute name
ascc-dcfaq:script
Using namespaces, you can keep compatibility with different methods of marking up scripts and language. Until some good system comes, you may need multiple attributes.
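If you adopt the script attribute format proposed above, its values are easy to take apart in software. The Python sketch below assumes that Sss is a three-letter script code written with an initial capital (as in the table of script codes), lll is a three-letter lowercase language code, CC is a two-letter uppercase country code, and each xx is a free-form qualifier such as "simplified" or "pinyin". Both the format and this reading of it are proposals, not part of any standard.

```python
import re

# Assumed reading of: Sss ( "-" lll ( "-" CC ( "-" xx )*)?)?
SCRIPT_TAG = re.compile(
    r"(?P<script>[A-Z][a-z]{2})"
    r"(?:-(?P<language>[a-z]{3})"
    r"(?:-(?P<country>[A-Z]{2})"
    r"(?P<variants>(?:-[A-Za-z0-9]+)*))?)?"
)


def parse_script(value):
    """Split a script attribute value into parts, or return None if malformed."""
    m = SCRIPT_TAG.fullmatch(value)
    if m is None:
        return None
    variants = m.group("variants") or ""
    return {
        "script": m.group("script"),
        "language": m.group("language"),
        "country": m.group("country"),
        "variants": [v for v in variants.split("-") if v],
    }
```

A stylesheet or font selector could then test for "simplified" or "traditional" in the variants list, rather than guessing from the locale or the character encoding.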
Sure. You can add any kind of attributes you like to XML-based metadata.
Some WWW browsers or servers use the locale of the user or the character encoding of the user to figure out whether to select traditional or simplified characters. In the future, more documents will use stylesheets such as CSS or XSL ( http://www.w3.org/Style/ ) which will let you specify a simplified or traditional character font, based on the value of the attribute.
Try to make all your XML and HTML data use the correct language and character set tags. More and more applications save their data in ISO 10646 text (Unicode, UTF-8, UTF-16). Unless you put in the correct markup, applications in other places on the World Wide Web will have to guess whether to use traditional or simplified fonts.
Please note that some fonts do not have all Han ideographs from ISO 10646. Try to make sure all your fonts are Unicode-compatible (or, at least, that they have all the characters in Big5/GB12345).
<DC:TITLE xml:lang="en">The Chinese Metadata FAQ </DC:TITLE>
<DC:CREATOR >Rick Jelliffe </DC:CREATOR>
<DC:SUBJECT xml:lang="en">Dublin Core, DC, Resource Description Framework,
RDF, EAD, Encoded Archival Description,
Warwick Framework, XML, SGML, Chinese, FAQ
</DC:SUBJECT>
<DC:DESCRIPTION xml:lang="en">Frequently Asked Questions about using XML-based metadata,
including for Chinese </DC:DESCRIPTION>
<DC:PUBLISHER xml:lang="en">Computer Center, Academia Sinica, Taiwan </DC:PUBLISHER>
<DC:TYPE xml:lang="en">Text.Article </DC:TYPE>
<DC:DATE >1998-12-19 </DC:DATE>
<DC:RIGHTS >http://www.ascc.net/xml/en/utf-8/legal.html </DC:RIGHTS>
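A record like the one above could be read with a standard XML parser. The following Python sketch uses the standard library's xml.etree.ElementTree. The wrapping record element and the namespace URI bound to the DC prefix are assumptions added so that the fragment parses; the sample itself declares neither.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the DC prefix; the sample does not declare one.
DC_NS = "http://purl.org/dc/elements/1.0/"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A shortened copy of the sample, wrapped in a hypothetical root element.
SAMPLE = f"""<record xmlns:DC="{DC_NS}">
  <DC:TITLE xml:lang="en">The Chinese Metadata FAQ </DC:TITLE>
  <DC:CREATOR>Rick Jelliffe </DC:CREATOR>
  <DC:DATE>1998-12-19 </DC:DATE>
</record>"""


def read_dublin_core(xml_text):
    """Return a list of (element name, language or None, text) tuples."""
    root = ET.fromstring(xml_text)
    fields = []
    for child in root:
        # child.tag looks like "{namespace}TITLE"; keep only the DC elements.
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.append((name, child.get(XML_LANG), (child.text or "").strip()))
    return fields
```

Elements without an xml:lang attribute, such as DC:DATE, come back with None as their language, which leaves the application free to apply its own default.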