XML for Classic Books

Classic works should be available in a variety of formats (including plain text, HTML, PDF and various PDA editions), for a variety of platforms (Windows, Macintosh, Linux, Palm, PocketPC and beyond), should support every important "reading" method (visual, text-to-speech), and enable automated processing (search, filter, analyze, etc.). The best way to enable these diverse goals while maintaining consistency across the various representations is to generate each from a "master" document stored in XML -- preferably using structural markup.
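
To make the "master document" idea concrete, here is a minimal sketch in Python. The tag names (book, title, author, para) are hypothetical placeholders rather than any of the vocabularies discussed below, and the two renderers are deliberately simplistic; the point is only that plain text and HTML are both derived from the same structurally marked-up source.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical master document; the tag names are illustrative only.
MASTER = """<book>
  <title>Hamlet</title>
  <author>William Shakespeare</author>
  <para>Who's there?</para>
  <para>Nay, answer me: stand, and unfold yourself.</para>
</book>"""


def to_plain_text(book):
    """Render the master document as plain text."""
    lines = [book.findtext("title").upper(), "by " + book.findtext("author"), ""]
    lines += [p.text for p in book.findall("para")]
    return "\n".join(lines)


def to_html(book):
    """Render the same master document as a minimal HTML fragment."""
    parts = ["<h1>%s</h1>" % escape(book.findtext("title"), quote=False)]
    parts.append('<p class="author">%s</p>'
                 % escape(book.findtext("author"), quote=False))
    parts += ["<p>%s</p>" % escape(p.text, quote=False)
              for p in book.findall("para")]
    return "\n".join(parts)


book = ET.fromstring(MASTER)
plain = to_plain_text(book)
html_out = to_html(book)
print(plain)
print(html_out)
```

Because both outputs come from one parsed tree, a correction made in the master file shows up in every format the next time the renderers run.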

However, XML (eXtensible Markup Language) isn't really a markup language per se; it's a "meta-language", a set of rules that lets standard tools process any XML-compliant language or vocabulary (e.g. XHTML, SVG, RSS). So the question remains: what XML vocabulary should be used to mark up classic texts?

Here are the major choices, either to adopt "as is" or to use as a starting point:

  1. XHTML: eXtensible HTML
  2. OEBPS: Open EBook Publication Structure
  3. DTB: Digital Talking Book (also called DAISY: Digital Accessible Information SYstem)
  4. TEI: Text Encoding Initiative
  5. ad hoc: custom tags designed around specific types of documents, e.g. plays, poetry, prose (e.g. using DTDs from the no-longer-active Gutenberg at HTML Writers Guild project)


The chief virtue of starting with XHTML is that it is the vocabulary most familiar to the widest audience. Rich markup for classic texts would certainly require adding tags and/or attributes, but I think it's very difficult to make a compelling case for adopting tags and attributes that serve the same purpose as those in XHTML under a different name. Note that both OEBPS and DTB make use of defined subsets of XHTML.

For metadata, the logical starting point is Dublin Core.
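
As a rough illustration of what Dublin Core metadata for a classic text might look like, the sketch below embeds a few elements from the Dublin Core element set (title, creator, language, type) in a small record and reads them back with Python's standard library. The wrapper element name (metadata) and the field values are assumptions made for this example; only the dc: element names and the namespace URI come from Dublin Core itself.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace

# Hypothetical wrapper element <metadata>; only the dc: elements are Dublin Core.
RECORD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hamlet</dc:title>
  <dc:creator>William Shakespeare</dc:creator>
  <dc:language>en</dc:language>
  <dc:type>Text</dc:type>
</metadata>"""

root = ET.fromstring(RECORD)
fields = {}
for child in root:
    # ElementTree reports namespaced tags as "{namespace}localname".
    if child.tag.startswith("{" + DC_NS + "}"):
        fields[child.tag.split("}", 1)[1]] = child.text

print(fields)
```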


"The goal of OEBPS is to provide this comprehensive support not by developing yet another standard, but by specifying subsets of well-established standards, most importantly: XML, XHTML, CSS, MIME, Dublin Core, MARC, and Unicode. In addition, OEBPS adds some specific constraints necessary for interoperability, and defines several new mechanisms (such as the “OEBPS Package”, and “fallbacks”) which were needed by publishers, but which did not have equivalents in existing standards." (source: OEBPS FAQ)

TO BE DONE: find simple example documents that illustrate OEBPS. (Suggestions welcome!)


"This standard defines the format and content of the electronic file set that comprises a digital talking book (DTB) and establishes a limited set of requirements for DTB playback devices. It uses established and new specifications to delineate the structure of DTBs whose content can range from XML text only, to text with corresponding spoken audio, to audio with little or no text. DTBs are designed to make print material accessible and navigable for blind or otherwise print-disabled persons."

TO BE DONE: find simple example documents that illustrate DTB / DAISY. (Suggestions welcome!)


"Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent."

Here's a side-by-side comparison of TEI vs. play.dtd, showing the XML tags for a small portion of Hamlet. (Play.dtd is a custom vocabulary created by Jon Bosak, one of the inventors of XML.) I list pros, cons and questions for each. Conclusion: TEI strikes me as more complex than needed, which may be somewhat of a barrier for volunteer efforts such as Project Gutenberg.
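
For readers who want a feel for the difference without opening the full comparison, here is a small sketch. The two fragments are hand-written approximations of how each vocabulary tags the opening speeches of Hamlet (sp/speaker/l in TEI, SPEECH/SPEAKER/LINE in play.dtd); they are not copied from the comparison page and have not been validated against either DTD. The helper function speeches is hypothetical, written only for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written approximations of the two vocabularies, not validated
# against either DTD and not copied from the comparison page.
TEI_FRAGMENT = """<div type="scene">
  <sp><speaker>BERNARDO</speaker><l>Who's there?</l></sp>
  <sp><speaker>FRANCISCO</speaker>
    <l>Nay, answer me: stand, and unfold yourself.</l></sp>
</div>"""

PLAY_DTD_FRAGMENT = """<SCENE>
  <SPEECH><SPEAKER>BERNARDO</SPEAKER><LINE>Who's there?</LINE></SPEECH>
  <SPEECH><SPEAKER>FRANCISCO</SPEAKER>
    <LINE>Nay, answer me: stand, and unfold yourself.</LINE></SPEECH>
</SCENE>"""


def speeches(xml_text, speech_tag, speaker_tag, line_tag):
    """Return [(speaker, [lines])] using the given tag names."""
    root = ET.fromstring(xml_text)
    return [
        (sp.findtext(speaker_tag), [ln.text for ln in sp.findall(line_tag)])
        for sp in root.iter(speech_tag)
    ]


tei = speeches(TEI_FRAGMENT, "sp", "speaker", "l")
play = speeches(PLAY_DTD_FRAGMENT, "SPEECH", "SPEAKER", "LINE")
print(tei == play)
```

For a passage this simple, both encodings carry the same structure under different names, which is part of why the choice between them comes down to complexity, tooling and familiarity rather than expressive power.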

Project Gutenberg ("PG") is experimenting with TEI. It's a great start! I'm not a fan of TEI at the moment, but I'm open to being convinced -- and I also believe that any XML is better than having no master format. Every text that gets released under the current policies is likely to require at least some manual rework in the future. Here's my specific feedback on PGTEI and a (very rough draft) PGTEI Quick Reference.

Joshua Hutchinson is exploring PGTEI on the GUTVOL-D mailing list. Please join the discussion!


Posted Oct. 28, 2004

Classicosm is a Product Architect site.
classicosm -at- product architect -dot- com (Feedback welcome!)
Copyright 2004 by Scott S. Lawton. All Rights Reserved. "Classicosm" and "A world of timeless value" are service marks owned by Scott S. Lawton.


