Feedback on PGTEI

Project Gutenberg ("PG") is experimenting with TEI. First, kudos to Marcello for his efforts! He has a complete workflow in place: convert PG text to PGTEI (providing a starting point that will usually require manual cleanup), validate, convert PGTEI to HTML, PDF, Palm, and back to text.

The following feedback is intended to be constructive. I think PG should move towards an XML "master" document as soon as possible. I'm not convinced that TEI is the most appropriate XML vocabulary; I hope the following feedback helps either to improve it or to replace it. I would guess that much of Marcello's work can be adapted to any XML vocabulary.

My feedback includes many minor issues and a few major issues. If there's any confusion over which is which, just ask!

PGTEI Vocabulary

Section 18: I strongly recommend omitting the requirement that TEX and NROFF characters must be escaped. (As far as I can tell, that's not part of TEI.) It may well be a useful optional feature; perhaps it could be turned on by including a specific processing instruction.

Consistent with the separation of structure and format, the rend attribute (e.g. rend=italic) should only be used when the format ("rendition" or "rendering") of a particular instance differs from the format of other instances of the same structural element -- or when the structural or semantic purpose of the original formatting isn't known.

&emsp;: I strongly recommend minimizing use of this "presentation" markup in favor of structural markup, e.g. rend=indent1..n or adding a new attribute indent=1..n.

quote is used in an example but apparently isn't part of TEI Lite (it's not in Appendix A). What's the story?

q: in cases where the quotation marks don't balance, it may be difficult to automatically convert quotation marks to the appropriate q.../q form, and time consuming to manually proof. Accordingly, I suggest this step be left as optional.
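As a rough illustration of how the manual proofing could be narrowed down, here is a minimal Python sketch. It is not part of the existing PGTEI tools; the function name unbalanced_paragraphs and the odd-count heuristic are assumptions made for this example. It treats blank lines as paragraph breaks and reports the paragraphs that contain an odd number of straight double quotes.

```python
import re

def unbalanced_paragraphs(text):
    """Return 0-based indices of paragraphs with an odd number of '"' characters."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [i for i, p in enumerate(paragraphs) if p.count('"') % 2 == 1]

sample = '"Hello," she said.\n\n"This speech runs on.\n\n"And ends here."'
print(unbalanced_paragraphs(sample))  # the second paragraph has no closing quote
```

An odd count isn't necessarily an error: a speech that continues across paragraphs conventionally omits the closing mark. That is exactly the kind of case where automatic conversion to q is risky and a human should decide.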

seriesStmt may be ambiguous, e.g. PG likes to keep track of "nth by this author"; yet a book may also be part of an externally defined series.

langUsage: I suggest the standard should be to omit the content of the tag (e.g. "British", which is probably more useful as "British English" or "English (British)"). This information should be generated to ensure consistency. (langUsage entries appear in the generated PGTEI and in alice.tei, but not in lmiss.tei.)

pgHeader looks like it contains information that should be described in teiHeader (though I'm new to TEI so may be wrong). alice.tei and lmiss.tei both contain pgHeader; the generated PGTEI does not.

PGTEI Examples

Having separate index tags for TOC, PDF and PDB strikes me as unnecessary and prone to error. Shouldn't the TOC one suffice for all?

In fact, the tag itself seems redundant. Shouldn't the head itself suffice? (If TEI requires it, that's another example of where I think TEI is too complex.)

It would be very useful to label every div with a type, e.g. introduction, chapter, section.
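A check for untyped divs is easy to automate. The following Python sketch is only an illustration under stated assumptions: the function name divs_without_type is hypothetical, the elements are assumed to carry no XML namespace, and the sample fragment is made up rather than taken from alice.tei or lmiss.tei.

```python
import xml.etree.ElementTree as ET

def divs_without_type(xml_text):
    """Return the <head> text (or None) of each div element lacking a type attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        # Match div as well as numbered forms such as div1; divGen is not matched.
        is_div = elem.tag == "div" or (elem.tag.startswith("div") and elem.tag[3:].isdigit())
        if is_div and "type" not in elem.attrib:
            missing.append(elem.findtext("head"))
    return missing

sample = """<body>
  <div type="chapter"><head>Chapter I</head><p>Text.</p></div>
  <div><head>Introduction</head><p>Text.</p></div>
</body>"""
print(divs_without_type(sample))
```

Reporting the head text rather than a position makes the result easier for a human reviewer to act on.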

alice.tei: reg="Carroll, Lewis" should use the complete "authority" form, which I believe is "Carroll, Lewis, 1832-1898". Note that unlike the PG website, there are no parens around the dates. Here's an illustration of paren usage: "Baum, L. Frank (Lyman Frank), 1856-1919".

alice.tei and lmiss.tei are both missing pubPlace and idno type=etext-file. (Not sure how important either is considered to be, but both are generated by the pgtext-to-pgtei converter.)

PGTEI Documentation

Kudos for actually writing documentation, and for numbering the sections to match the TEI Lite docs!

It would be nice to add hyperlinks that point to the TEI Lite docs (preferably the multi-page HTML version).

Section 6.3: Information on foreign appears in multiple places but should be gathered here; the section is currently empty.

Section 7: The paragraph beginning "In the PDF format it does not insert a footnote marker in the text" wasn't clear to me, and the example didn't help. What is it trying to say? Is the example correct as shown?

Generated HTML, PDF and Text

In the PG license, section numbers such as "1.A." should appear on the same line as the text that follows -- per the original and to avoid wasting space.

Generated HTML

There appear to be two validation errors, e.g. in the PGTEI documentation:
  Error (7/117): <SPAN> must not contain block level elements like <H1>.
  Error (379/1): The start tag for </P> can't be found.

&emsp;: translating to &#8195; is technically correct but probably not a good choice for HTML intended to work with older browsers.

In the documentation, why is "Versprich mir, Heinrich" repeated in the output, the second time in white?

In the "Faust" example, I would prefer much less whitespace after the speaker labels (though I did not check the text original, much less look for an original scan).

With the caveat that I'm not a CSS expert, I believe that the following suggestions follow the spirit of using HTML for structural markup and CSS for format:

  • Replace the style attribute in every Table of Contents entry with an appropriate class.
  • Replace <div class="eg"><pre> with a CSS class that does both.

Default CSS

The lack of space between paragraphs goes against Web conventions. (It's fine as an option but a poor choice for the default.)

The filename ("de-gnutenberg-press-1.0-persistent.css") is too long for Mac OS 9 and earlier (38 characters vs. the maximum of 31). Possibly relevant: did Windows jump from 8.3 filenames directly to "long" filenames, or was there an intermediate limit?
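For what it's worth, the length is trivial to check before a file is published. A minimal Python sketch follows; the constant and function names are made up for this example, and the only limit encoded is the 31-character one mentioned above.

```python
MAX_CLASSIC_MAC_NAME = 31  # the 31-character limit cited above

def chars_over_limit(filename, limit=MAX_CLASSIC_MAC_NAME):
    """Return how many characters the name exceeds the limit by (0 if it fits)."""
    return max(0, len(filename) - limit)

name = "de-gnutenberg-press-1.0-persistent.css"
print(len(name), chars_over_limit(name))
```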

Generated PDF

It would be useful to have at least a few user-selectable parameters, e.g. page size.

For the default page size, a width of 5.5" rather than 5.83" would accommodate US Letter as well as A4 paper. (Incidentally, using quote marks for inches is a heuristic that should be checked when converting text to XML.)
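The inch-mark heuristic could look something like the following Python sketch. This is an assumption about how such a check might be written, not a description of the existing converter; the names INCH_MARK and possible_inch_marks are hypothetical.

```python
import re

# A number (optionally with a decimal part) immediately followed by a straight double quote.
INCH_MARK = re.compile(r'\d+(?:\.\d+)?"')

def possible_inch_marks(line):
    """Return substrings such as 5.5" that may be measurements rather than quotations."""
    return INCH_MARK.findall(line)

line = 'a width of 5.5" rather than 5.83" would fit; "Yes," she said.'
print(possible_inch_marks(line))
```

The matches are only candidates for review: a quotation that happens to end in a number (a year, say) would be flagged as well.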

The default font size seems overly large for printing, though I didn't compare it to a representative sample of books.

The spacing between sections of the PG footer seems excessive.

Generated Text

It would be useful to generate a form that can be easily compared to the original PG text, even if that's an option rather than the default. For example: don't rewrap lines!

The format didn't appear to match PG standards (e.g. "underlining" the chapter titles using various characters) -- though I know very little about PG standards so I may well be wrong.


Again, kudos for including a PG-text-to-PGTEI converter. Semi-automated conversion is an important part of migrating legacy PG texts to XML.

Are the heuristics from things like GutenMark included? That would seem to be quite valuable.

Please do not rewrap lines! That makes it very difficult to compare different versions, editions, formats, markup techniques, etc.

Try to identify the table of contents. If deleting it is too drastic, just add a comment such as "If this is the table of contents, delete it. It will be generated automatically by divGen type=toc." Why? Because I suspect (hope?) that many people who review the generated TEI will be beginners.

If an Introduction is properly part of front matter, try to identify it and place the front tag after it.

In addition to processing the source text, include information from the database, e.g. LoC Class, Subject, Alternate Title.

<idno type='etext-nr'> and <idno type='etext-file'> are generated with single rather than double quotes. That's legal, but consistent usage is preferable.
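If consistency is wanted, the quotes could be normalized in a post-processing pass. Here is a minimal Python sketch under stated assumptions: the function name normalize_attr_quotes is hypothetical, attribute values are assumed not to contain "<" or ">", and values that themselves contain a double quote are left alone.

```python
import re

TAG = re.compile(r"<[^<>]+>")
# Matches name="value" or name='value'; group 2 is set only for the single-quoted form.
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:"[^"]*"|'([^']*)')""")

def _fix_attr(match):
    value = match.group(2)
    if value is None or '"' in value:
        return match.group(0)  # already double-quoted, or unsafe to convert
    return f'{match.group(1)}="{value}"'

def normalize_attr_quotes(xml_text):
    """Rewrite name='value' as name="value" inside tags, leaving text content alone."""
    return TAG.sub(lambda m: ATTR.sub(_fix_attr, m.group(0)), xml_text)

sample = "<idno type='etext-nr'>11</idno> <p>It's 'fine' here.</p>"
print(normalize_attr_quotes(sample))
```

Restricting the substitution to the inside of tags keeps apostrophes and quoted words in the running text untouched.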

If langUsage is intended to describe the language in the current document, perhaps it's worth adding a comment such as "delete any languages not present in this etext".

Rather than div rend="newpage", only a plain div was generated. (I would actually prefer more semantic markup, e.g. div type="chapter" or even a chapter tag.)

Spaces before closing p and head tags are preserved. Is that intentional?
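If the answer is no, the cleanup is a one-line substitution. The Python sketch below is illustrative only; the function name is hypothetical, and it assumes the trailing whitespace carries no meaning in p and head.

```python
import re

# Spaces or tabs directly before a closing p or head tag.
TRAILING_SPACE = re.compile(r"[ \t]+(</(?:p|head)>)")

def strip_space_before_close(xml_text):
    """Remove spaces and tabs that sit directly before </p> or </head>."""
    return TRAILING_SPACE.sub(r"\1", xml_text)

sample = "<head>Chapter I </head>\n<p>Down the Rabbit-Hole.  </p>"
print(strip_space_before_close(sample))
```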

When converting The Wonderful Wizard of Oz, several dozen instances of the following markup were incorrectly added (in every case, the text didn't appear significantly different from the surrounding paragraphs):

  • <p rend="center"> instead of just <p>
  • <q rend="pre"> instead of just <q>
  • <head> instead of <p>
  • <head type="sub"> instead of <p>

This line was incorrectly marked as a head and given its own div and index tags: "End of The Project Gutenberg Etext of The Wonderful Wizard of Oz".
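A converter could test for this kind of line before deciding that something is a head. The following Python sketch is a guess at such a test, not the converter's actual logic: the pattern and the name is_pg_end_marker are hypothetical, and only the single line quoted above comes from a real conversion; other PG texts may word their end markers differently.

```python
import re

# Assumed pattern: the line starts with "End of", an optional "the", then "Project Gutenberg".
END_MARKER = re.compile(r"^\s*End of (?:the )?Project Gutenberg", re.IGNORECASE)

def is_pg_end_marker(line):
    """Return True if the line looks like a PG end-of-text marker rather than a heading."""
    return bool(END_MARKER.match(line))

line = "End of The Project Gutenberg Etext of The Wonderful Wizard of Oz"
print(is_pg_end_marker(line))
```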


Posted Oct. 28, 2004

Classicosm is a Product Architect site.
classicosm -at- product architect -dot- com (Feedback welcome!)
Copyright 2004 by Scott S. Lawton. All Rights Reserved. "Classicosm" and "A world of timeless value" are service marks owned by Scott S. Lawton.


