Metadata for Classic Books (and beyond)
A preliminary style guide.
Project Gutenberg ("PG" for short) contains over 13,000 items -- with more added every day. A new project is shooting for 1 million books! Accurate and consistent metadata such as title and author can help people find what they're looking for and discover new items of interest. For example, if someone likes one book by a certain author, it's nice to be able to see all books by that author -- which is difficult if the author's name is not consistent across books. Consistent metadata ("data about data") also simplifies the creation of automated software tools that can deliver a superior user experience. (That's one of the goals of Classicosm.)
Note: I speak only for myself, not for Project Gutenberg or Online Books! Although I've known about PG for years, I have no formal or informal association with them. This page and my revised PG metadata are my first contributions -- and may well contain errors or omissions that would not be made by someone with deeper involvement. I will gladly correct any mistakes that are brought to my attention.
There are many challenges in creating and maintaining metadata that is consistent and accurate. Information about books can be messy, e.g. the title may vary between the cover and the title page, or across different editions, different volumes in a series, the same work from different publishers or even the same publisher over time. Author names may occur with initials or spelled out, with or without formal titles, or under various pseudonyms -- which may actually represent different authors over time. In the face of this, PG depends largely on volunteers; few are likely to be trained in cataloging. (I'm certainly not.)
This page is an attempt to itemize some of the key issues, and suggest tentative standards. Although I have lots of experience working with data, I have no particular background in bibliographic information. If there are other pages that provide a better summary, please let me know! MARC may be the definitive source, but that's far too complex for my purpose, and probably for PG.
The Online Books Page
John Mark Ockerbloom maintains The Online Books Page, including an enhanced index into part of the Project Gutenberg collection (approximately 8,200 items). For example, each "parent" eText links to every "child" that appears in PG as its own eText. He has also validated many (most? all?) authors and titles against "authority" data such as this Library of Congress search page. He has graciously agreed to make his underlying data available. See my OB page for a download link, documentation, and a few suggestions.
Classicosm's PG Metadata
I have incorporated most of the "Online Books" (OB) changes, and made thousands of additional edits to the remaining records. I also found several typos and other minor errors in the OB data (details coming soon). Also, the Classicosm data has a subtitle field; the OB data does not. Any remaining differences fall into one of three categories: the OB data is better, the Classicosm data is better, or it's a judgement call with good arguments in favor of each. (This "style guide" attempts to clarify some of the judgement calls.) Alas, determining which explanation is most likely for every record requires a bit of work. To the extent that the OB data has been checked against an external authority, it's more accurate -- even if less consistent with other PG records. However, those cases are not specifically identified.
I've reviewed the non-author fields for every PG record through #13159. In order to start a discussion about appropriate "standards" or "style guide" for various fields, I've provided subset files to simplify comparison of my title and subtitle fields against the GUTINDEX data from 2004 (#10626 to #13159), and separately against the OB data.
Note that neither pgdb.txt nor catalog.rdf is appropriate for my use at present, since both are covered by the restrictive GPL. (However, others are free to compare the following against either source.) The Classicosm data below is licensed under the Creative Commons Attribution license, with the exception that either PG or OB may use the data without attribution.
Here's the first pair of files:
I'll post additional files later for simplifying the comparison of Classicosm titles to Online Books titles.
The basic file format: ID, 2 spaces, "title" (which may include other information). For the PG data, the subtitle (if any) is indented on the second line. Note that the PG and OB title fields are different. In each case, I generated a title that attempts to match the information and (to some extent) the format for PG or OB, e.g. the latter "title" includes both the subtitle and the language.
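As a rough illustration, the PG-style layout described above could be read with a few lines of Python. This is only a sketch under stated assumptions: the function name `parse_title_lines` is hypothetical, IDs are assumed to be plain digits, and the sample titles are invented placeholders rather than real catalog entries.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_title_lines(lines: Iterable[str]) -> Dict[int, Tuple[str, Optional[str]]]:
    """Map each eText ID to a (title, subtitle-or-None) pair."""
    records: Dict[int, Tuple[str, Optional[str]]] = {}
    last_id: Optional[int] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace():
            # An indented line is treated as the subtitle of the previous record.
            if last_id is not None:
                title, _ = records[last_id]
                records[last_id] = (title, line.strip())
            continue
        # A record line is: ID, two spaces, title.
        id_text, sep, title = line.partition("  ")
        if not sep or not id_text.isdigit():
            raise ValueError(f"unrecognized line: {line!r}")
        last_id = int(id_text)
        records[last_id] = (title.strip(), None)
    return records


# Invented sample data (not real PG records).
sample = [
    "10626  A Hypothetical Title",
    "       Being a Made-Up Subtitle",
    "10627  Another Hypothetical Title",
]
records = parse_title_lines(sample)
```

Keying the result by ID makes it straightforward to line up two files record by record when comparing titles.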
A three-way comparison of the GUTINDEX, OB and Classicosm data may be useful at some point, though it would be easier after various formatting issues are resolved.
Here are some of the issues I encountered when trying to create consistent metadata. Comments, constructive disagreement, additions, links, etc. are welcome. (My email address is at the bottom of the page.)
Atomic vs. Aggregate
The TITLE in both the PG and OB metadata is actually an aggregate field that contains many "atoms" of information, each of which may be useful on its own. For example: "The World's Greatest Books, Vol XII: Modern History" includes the series title, the volume number, and the actual title. The AUTHOR fields on the PG Website and in the OB data are also aggregates, containing last name, first name, etc. plus year of birth and death.
In my experience, aggregate fields are very useful for presenting data, though it's very difficult to maintain data quality when the "master copy" of the metadata includes aggregates. I suggest that PG store the data in atomic fields, and generate one or more formats for presentation as needed. As an alternative, any aggregate field in the master copy should be run through a standard parser to extract and validate the components.
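To make the atomic-versus-aggregate idea concrete, here is a minimal Python sketch. The names `TitleRecord`, `display`, and `parse_aggregate` are hypothetical, and the "Series, Vol N: Title" pattern is just one assumed presentation format taken from the example above; it is not how PG or OB actually store anything.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TitleRecord:
    """Atomic title fields; aggregates are generated from these on demand."""
    title: str                     # the actual title
    series: Optional[str] = None   # series title, if any
    volume: Optional[str] = None   # volume kept as text, e.g. "XII"

    def display(self) -> str:
        """Build one possible aggregate form: 'Series, Vol N: Title'."""
        if self.series is None:
            return self.title
        head = self.series
        if self.volume is not None:
            head += f", Vol {self.volume}"
        return f"{head}: {self.title}"


# Assumed aggregate pattern: "<series>, Vol <number>: <title>".
_AGGREGATE = re.compile(
    r"^(?P<series>.+), Vol\.? (?P<volume>[0-9IVXLCDM]+): (?P<title>.+)$"
)


def parse_aggregate(text: str) -> TitleRecord:
    """Split an aggregate title into atoms; fall back to a plain title."""
    m = _AGGREGATE.match(text)
    if m is None:
        return TitleRecord(title=text)
    return TitleRecord(title=m["title"], series=m["series"], volume=m["volume"])


rec = parse_aggregate("The World's Greatest Books, Vol XII: Modern History")
```

A round trip (parse, then `display()`) that fails to reproduce the original string is a cheap way to flag records that don't follow the expected pattern.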
There are at least 3 different types of titles that it may be useful to track: the series title, the actual title, and the uniform title.
In most cases, I standardized the records to use the series title if it exists, followed by volume identification (if any) and then the uniform title (if I knew it) or the actual title. Open question: is that a good approach in general? If so, any disagreements with particular choices I made (e.g. putting the series title in the subtitle, usually labelled with "Series:")? Or, should PG explicitly track all 3 titles (where applicable)? I prefer that approach.
OB note: I believe that the OB data uses the ACTUAL TITLE rather than the UNIFORM TITLE -- keep that in mind when comparing my data with theirs. And, if only one of the two is kept, that's certainly a reasonable choice.
Title, The Title, A Title ... or many titles
In what circumstances should a leading "a" or "the" be dropped and/or moved to the end? I see some benefit in sorting titles without the leading article, but I'm skeptical of leaving it off if it's in the actual title, and I think putting it at the end looks awkward at best (e.g. "Mayflower and Her Log, The").
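One way to get the sorting benefit without altering the stored title is to strip the article only when computing a sort key. The sketch below assumes English articles only, and `sort_key` is a hypothetical helper name.

```python
import re

# English articles only; other languages would need their own lists.
_LEADING_ARTICLE = re.compile(r"^(a|an|the)\s+", re.IGNORECASE)


def sort_key(title: str) -> str:
    """Return a case-insensitive key that ignores one leading article."""
    return _LEADING_ARTICLE.sub("", title, count=1).casefold()


titles = [
    "The Mayflower and Her Log",
    "Essays",
    "The Enemies of Books",
    "Letters of Travel",
]
ordered = sorted(titles, key=sort_key)
```

The titles themselves are left untouched, so the display form can keep the article in its natural position.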
How should multiple titles for a single eText be represented (e.g. "The Lamentable Tragedy of Locrine; and Mucedorus")?
Title and Subtitle
What rules should be used for Title Case -- in English and otherwise? When should a period be included at the end?
I made some attempt to standardize and simplify the titles of eTexts from PG itself, e.g. "quotes and images". OB did too, though often with different choices. I hope these suggestions are viewed as constructive!
Each of the above titles could have its own subtitle. Since only one title is currently kept (other than in a notes field), the TITLE field often holds the series title, while the actual title is often moved to the SUBTITLE field. That's a reasonable choice -- though with the substantial drawback that any search or display of the title field alone will omit the actual title in favor of the series title (and presumably the volume number). Excluding periodicals, about 15% of the PG entries are part of a series. It also forces any "series subtitle" out, though I found fewer than 2 dozen cases in PG's current dataset. As noted above, I included both the series title and the actual title in the TITLE field. That avoids the above drawbacks at the cost of having long titles.
Should an alternate title that's introduced by "or" be included in the TITLE or SUBTITLE? (e.g. "Boy Scouts in the Coal Caverns: or, The Light in Tunnel Six") Should punctuation be consistent with the text or across volumes?
Volume, Book, Part, Issue, Number, Chapter, Letter
Given multiple items in a series, what names should be used to describe the various components? What if the components don't include a specific term, including when the components are "artificial" subsets created by PG? For example:
Over 2,000 items have a volume label of some sort; perhaps as many as several hundred are inconsistent -- one area where I did not try to reconcile differences between the OB and PG datasets. Some labels don't even follow the simple "label + digit" format, e.g. Emily Dickinson's "Poems, Series Two" and "Poems, Third Series" (both are drawn from the text). And, don't forget other languages, e.g. "Les Quarante-Cinq, Premiere Partie" or "Faust, Der Tragoedie erster Teil".
When one eText includes several others, should that one have a label, e.g. "(Complete)"? If the individual eTexts are not listed in their title as part of the series, perhaps "(Entire Collection)" is more accurate?
PG often includes the total number of volumes, e.g. "Vol. 1 of 2". I omitted these. Thoughts?
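For the simple "label + digit" cases, a regular expression can pull out the label, the number, and an optional "of N" total, which would make it easy to keep or drop the total at presentation time. This is a sketch only: `VolumeLabel` and `find_volume_label` are hypothetical names, the label list is taken from the section heading above, and labels spelled out in words (such as "Series Two") are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class VolumeLabel(NamedTuple):
    kind: str             # e.g. "Vol", "Part", "Book"
    number: int
    total: Optional[int]  # the N in "of N", if present


_LABEL = re.compile(
    r"\b(?P<kind>Vol(?:ume)?|Book|Part|Issue|No|Number|Chapter|Letter)\.?\s+"
    r"(?P<number>\d+)(?:\s+of\s+(?P<total>\d+))?",
    re.IGNORECASE,
)


def find_volume_label(title: str) -> Optional[VolumeLabel]:
    """Return the first 'label + digits' component found, or None."""
    m = _LABEL.search(title)
    if m is None:
        return None
    total = int(m["total"]) if m["total"] else None
    return VolumeLabel(m["kind"].capitalize(), int(m["number"]), total)


label = find_volume_label("Vol. 1 of 2")
```

Titles that return `None` here but are known to belong to a series are good candidates for manual review.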
Abbreviations and punctuation for books:
Abbreviations and punctuation for periodicals:
When in Rome
In what circumstances should roman numerals be preserved? How important is consistency across related works -- especially when the printed works are not available and the original transcribers may not have reproduced the original? Or, when the printed works are inconsistent? My personal preference would be to translate all roman numerals, though I usually left them as is for now.
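If roman numerals were translated, the conversion itself is mechanical. The following is a minimal sketch with a hypothetical helper `roman_to_int`; it applies the usual subtractive rule and does not reject non-standard forms such as "IIII".

```python
_ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> int:
    """Convert a roman numeral such as 'XII' to an integer."""
    if not text:
        raise ValueError("empty numeral")
    try:
        values = [_ROMAN[ch] for ch in text.upper()]
    except KeyError as exc:
        raise ValueError(f"not a roman numeral: {text!r}") from exc
    total = 0
    for i, value in enumerate(values):
        # A smaller value directly before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


volume_number = roman_to_int("XII")
```

Storing the converted number alongside the original text would allow numeric sorting while still preserving what the transcriber wrote.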
Date or Date Range
Many titles include one or more dates or date ranges. These may relate to the series as a whole (e.g. "Mayflower and Her Log, The; July 15, 1620-May 6, 1621 — Volume 2"), or may differ for each volume in a series (e.g. "Mark Twain's Letters, Vol. 1: 1835-1866" vs. "Vol. 2: 1867-1875"). In the latter case, they may simply describe the current volume (a subtitle of sorts); in other cases, they may be appended almost as an afterthought (e.g. "Letters of Travel (1892-1913)"). Formatting question: are parens appropriate with dates in all, some, or no cases?
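If the year range were stored separately, the parenthesis question would become a pure presentation choice. Here is a rough sketch that handles only simple year-to-year ranges; `split_year_range` is a hypothetical name, and full dates such as "July 15, 1620-May 6, 1621" are intentionally left alone.

```python
import re
from typing import Optional, Tuple

# Matches "1835-1866" with or without surrounding parentheses.
_YEAR_RANGE = re.compile(r"\(?\b(\d{4})\s*-\s*(\d{4})\b\)?")


def split_year_range(title: str) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Return (title without the range, (start, end)) or (title, None)."""
    m = _YEAR_RANGE.search(title)
    if m is None:
        return title, None
    # Remove the matched range and tidy leftover separators at the edges.
    rest = (title[: m.start()] + title[m.end():]).strip(" ,:;")
    return rest, (int(m[1]), int(m[2]))


base, years = split_year_range("Letters of Travel (1892-1913)")
```

The same call on "Mark Twain's Letters, Vol. 1: 1835-1866" yields the title without the trailing range plus the pair of years, so both styles reduce to the same atomic data.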
Edition and/or Date
When is a date an edition, or vice versa? For example, OB lists #1434 as "Essays (1914 edition)" where the PG website just shows "Essays". The eText contains: "This etext was prepared from the 1914 Burns & Oates edition". Most (?) eTexts include a publication year, and many include the publisher; when should this information be captured as an "edition"? What if there's also an explicit edition label, e.g. OB lists #1302 as "The Enemies of Books (1888 edition)", though the eText contains "SECOND EDITION ... 1888"?
How important is this information? Where should it be stored and/or presented? Does formatting matter, e.g. Second or 2nd or even 2d, with the usual tradeoff between consistency with the original and consistency across related works?
Language
OB includes the language in the title, e.g. "(in German)"; the PG website shows a distinct field. Minor issues: should multiple languages be joined with comma (my preference) or "and"? What order should be used, e.g. should an English-German and German-English dictionary have the same or different order for the two languages? (I prefer the same order; I think that's more useful when sorting.)
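The two preferences stated above (comma-joined, same order regardless of the work's own ordering) can be sketched in a couple of lines. The helper name `language_label` is hypothetical, and alphabetical order is just one assumed way of fixing "the same order".

```python
from typing import Iterable


def language_label(languages: Iterable[str]) -> str:
    """Join language names with commas in a fixed (alphabetical) order."""
    return ", ".join(sorted(set(languages), key=str.casefold))


label_a = language_label(["English", "German"])
label_b = language_label(["German", "English"])
```

Both dictionaries then carry an identical language label, so they sort and group together.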
... and more ...
There are lots of other issues, especially regarding author, editor, contributor and such. I haven't finished comparing OB's author data to mine, so I'll defer these for now. In any case, the above provides plenty of material for discussion.
Posted Sept. 28, 2004
Classicosm is a Product Architect site.