Feedback on PGTEI
Project Gutenberg ("PG") is experimenting with TEI. First, kudos to Marcello for his efforts! He has a complete workflow in place: convert PG text to PGTEI (providing a starting point that will usually require manual cleanup), validate, convert PGTEI to HTML, PDF, Palm, and back to text.
The following feedback is intended to be constructive. I think PG should move towards an XML "master" document as soon as possible. I'm not convinced that TEI is the most appropriate XML vocabulary; I hope this feedback helps either to improve it or to replace it. I would guess that much of Marcello's work could be adapted to any XML vocabulary.
My feedback includes many minor issues and a few major issues. If there's any confusion over which is which, just ask!
Section 18: I strongly recommend omitting the requirement that TEX and NROFF characters must be escaped. (As far as I can tell, that's not part of TEI.) It may well be a useful optional feature; perhaps it could be turned on by including a specific processing instruction.
Consistent with the separation of structure and format, the
In fact, the tag itself seems redundant. Shouldn't the
It would be very useful to label every
alice.tei and lmiss.tei are both missing
Kudos for actually writing documentation, and for numbering the sections to match the TEI Lite docs!
It would be nice to add hyperlinks that point to the TEI Lite docs (preferably the multi-page HTML version).
Section 6.3: Information on
Section 7: The paragraph beginning "In the PDF format it does not insert a footnote marker in the text" wasn't clear to me, and the example didn't help. What is it trying to say? Is the example correct as shown?
Generated HTML, PDF and Text
In the PG license, section numbers such as "1.A." should appear on the same line as the text that follows -- per the original and to avoid wasting space.
There appear to be two validation errors, e.g. in the PGTEI documentation:
In the documentation, why is "Versprich mir, Heinrich" repeated in the output, the second time in white?
In the "Faust" example, I would prefer much less whitespace after the speaker labels (though I did not check the original text, much less look for an original scan).
With the caveat that I'm not a CSS expert, I believe that the following suggestions follow the spirit of using HTML for structural markup and CSS for format:
The lack of space between paragraphs goes against Web conventions. (It's fine as an option but a poor choice for the default.)
The filename ("de-gnutenberg-press-1.0-persistent.css") is too long for Mac OS 9 and earlier (38 chars vs. the max of 31). Possibly relevant: did Windows jump from 8.3 filenames directly to "long" filenames, or was there an intermediate limit?
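The length check is easy to automate. Here is a minimal sketch in Python; the helper name and the idea of running it over generated files are my own assumptions, not part of the PGTEI tools:

```python
# Hypothetical helper: flag filenames longer than a given limit.
# 31 is the Mac OS 9 limit cited above; other limits can be passed in.
CLASSIC_MAC_OS_MAX = 31


def exceeds_limit(filename: str, limit: int = CLASSIC_MAC_OS_MAX) -> bool:
    """Return True if the filename has more characters than the limit allows."""
    return len(filename) > limit


css_name = "de-gnutenberg-press-1.0-persistent.css"
print(len(css_name), exceeds_limit(css_name))
```

A shorter name (for example, dropping the version number) would pass the same check.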
It would be useful to have at least a few user-selectable parameters, e.g. page size.
For the default page size, a width of 5.5" rather than 5.83" would accommodate US Letter as well as A4 paper. (Incidentally, using quote marks for inches is a heuristic that should be checked when converting text to XML.)
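To spell out the arithmetic, here is a small sketch. It assumes (my guess, not something stated in the PGTEI docs) that the page width is chosen so that two pages fit side by side on a landscape sheet:

```python
MM_PER_INCH = 25.4

# Sheet sizes in inches as (short side, long side).
A4 = (210 / MM_PER_INCH, 297 / MM_PER_INCH)
US_LETTER = (8.5, 11.0)


def half_sheet_width(sheet):
    """Width available to one page when two pages share a landscape sheet."""
    return sheet[1] / 2


def fits_two_up(page_width, sheet):
    """True if a page of the given width (inches) fits in half the sheet."""
    return page_width <= half_sheet_width(sheet)


for width in (5.83, 5.5):
    print(width, "A4:", fits_two_up(width, A4), "Letter:", fits_two_up(width, US_LETTER))
```

Under that assumption, 5.83" fits on A4 only, while 5.5" fits on both sheet sizes.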
The default font size seems overly large for printing, though I didn't compare it to a representative sample of books.
The spacing between sections of the PG footer seems excessive.
It would be useful to generate a form that can be easily compared to the original PG text, even if that's an option rather than the default. For example: don't rewrap lines!
The format didn't appear to match PG standards (e.g. "underlining" the chapter titles using various characters) -- though I know very little about PG standards so I may well be wrong.
PGText to PGTEI
Again, kudos for including this. Semi-automated conversion is an important part of migrating legacy PG texts to XML.
Are the heuristics from things like GutenMark included? That would seem to be quite valuable.
Please do not rewrap lines! That makes it very difficult to compare different versions, editions, formats, markup techniques, etc.
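To illustrate the problem, the sketch below compares a one-word correction made with the original line breaks kept against the same correction followed by a rewrap. The sample text and the helper name are made up for the illustration; only the standard-library difflib and textwrap modules are used:

```python
import difflib
import textwrap


def changed_lines(a, b):
    """Count lines that ndiff reports as removed from a or added in b."""
    return sum(1 for line in difflib.ndiff(a, b) if line.startswith(("- ", "+ ")))


# Made-up sample paragraph, wrapped as it might be in a PG text file.
original = [
    "It was a bright cold day in the valley, and the",
    "river ran high after a week of steady rain. The",
    "travellers crossed at the ford before noon.",
]

# Same one-word change, original line breaks kept.
same_breaks = original[:2] + ["travelers crossed at the ford before noon."]

# Same one-word change, but the paragraph is rewrapped to a new width.
rewrapped = textwrap.wrap(" ".join(same_breaks), width=40)

print(changed_lines(original, same_breaks))
print(changed_lines(original, rewrapped))
```

With the line breaks kept, the diff touches only the line that actually changed; after rewrapping, every line of the paragraph shows up as different.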
Try to identify the table of contents. If deleting it is too drastic, just add a comment such as "If this is the table of contents, delete it. It will be generated automatically by divGen type=toc." Why? Because I suspect (hope?) that many people who review the generated TEI will be beginners.
If an Introduction is properly part of front matter, try to identify it and place the
In addition to processing the source text, include information from the database, e.g. LoC Class, Subject, Alternate Title.
If langUsage is intended to describe the language in the current document, perhaps it's worth adding a comment such as "delete any languages not present in this etext".
Spaces before closing p and head tags are preserved. Is that intentional?
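If the answer is no, the cleanup could look something like the sketch below. It is a plain regular-expression pass over the text, not a real XML parser, and the function name is hypothetical:

```python
import re

# Whitespace immediately before a closing </p> or </head> tag.
TRAILING_WS = re.compile(r"[ \t\r\n]+(</(?:p|head)>)")


def trim_before_closing_tags(xml_text: str) -> str:
    """Remove whitespace that directly precedes </p> and </head>."""
    return TRAILING_WS.sub(r"\1", xml_text)


sample = "<head>Chapter I </head>\n<p>Some text. \n</p>"
print(trim_before_closing_tags(sample))
```

Whitespace between elements (such as the newline after the closing head tag) is left alone; only the run directly before the closing tag is removed.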
When converting The Wonderful Wizard of Oz, several dozen instances of the following markup were incorrectly added (in every case, the text didn't appear significantly different from the surrounding paragraphs):
This line was incorrectly marked as a
Posted Oct. 28, 2004
Classicosm is a Product Architect site.