
PG Data from The Online Books Page

John Mark Ockerbloom maintains The Online Books Page and has kindly made his enhanced Project Gutenberg (PG) data (gzip format, 308K) available under the Creative Commons Attribution, NonCommercial license. The following are my notes on the file; any errors and omissions are mine. The format is simple: each line holds a field name followed by one or more spaces and then the data, ending with a newline (linefeed). Most fields can be repeated; details below. Records are separated from each other by an extra newline, i.e. a blank line.
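The line format described above is simple enough to read in a few lines of code. The following is a minimal sketch, not a definitive reader: the function name parse_records and the sample text are hypothetical, the sample values are invented rather than copied from the real file, and skipping lines that start with "#" is my assumption about how the comment character is used.

```python
from collections import defaultdict


def parse_records(text):
    """Split the text into records; each record maps a field name to a list of values."""
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        if line.startswith("#"):
            # Assumption: a leading "#" marks a comment line.
            continue
        parts = line.split(None, 1)  # field name, then one or more spaces, then data
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        current[name].append(value)  # fields may repeat, so keep every value
    if current:
        records.append(dict(current))
    return records


# Hypothetical sample; the values are invented for illustration.
SAMPLE = """\
# a comment line
GREF NEW
AUTHOR Doe, Jane, 1800-1870
TITLE An Example Title
FMT text
NUMBER 12345

FMT text
NUMBER 12346
"""

records = parse_records(SAMPLE)
print(len(records))
print(records[0]["TITLE"])
```

To read the real file, something like gzip.open(path, "rt", encoding="latin-1").read() could supply the text; the encoding is a guess on my part and should be checked against the data.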

File Format

The data file contains about 8231 PG records with complete information (plus a few additional entries with decimal IDs), and another 1000 or so records that contain only the FMT field (and the ID).
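The split between full records and FMT-only records can be checked with a short tally. This is a sketch under assumptions: it expects records that have already been parsed into dictionaries mapping each field name to a list of values, it treats NUMBER as the ID field, and the names is_fmt_only and tally, along with the sample data, are hypothetical.

```python
def is_fmt_only(record):
    """True when a record carries FMT (and optionally NUMBER) and nothing else."""
    return "FMT" in record and set(record) <= {"FMT", "NUMBER"}


def tally(records):
    """Count FMT-only records versus all other records."""
    fmt_only = sum(1 for r in records if is_fmt_only(r))
    return {"fmt_only": fmt_only, "other": len(records) - fmt_only}


# Hypothetical records, invented for illustration.
sample = [
    {"AUTHOR": ["Doe, Jane"], "TITLE": ["An Example Title"],
     "FMT": ["text"], "NUMBER": ["12345"]},
    {"FMT": ["text"], "NUMBER": ["12346"]},
]

summary = tally(sample)
print(summary)
```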

Field        Count  Description
NUMBER        9260  Project Gutenberg ID (plus a few non-PG entries with decimal IDs)
TITLE         8223  includes embedded subtitle, volume, issue, language, etc.
AUTHOR        7708  format: Last, First, optional prefix/infix/suffix, optional birth/death dates; may include HTML entities for ISO characters
CONTRIBUTOR    110  may cover many authors in a collection (e.g. #1980) but NOT co-authors who work together
EREF             2  external reference; format: TEXT_for_link URL
FMT           2625  file download information
GREF          9003  Gutenberg folder reference; "NEW" is the current numeric system; older is something like "etext98/ozvrs10"
LCCN             2  Library of Congress Control Number
NOTE                only one occurrence, the info is not shown at
PREF           256  components; format: GutenbergID TEXT-for-link (typically the title or a descriptive subset that's meaningful in this context)
SERIAL         172  format: IssueNumber TITLE; facilitates linking multiple issues of magazines and such
SREF           170  "see also" reference; format: GutenbergID TEXT-for-link
#                1  comment character
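Several of the fields above pack two parts into one value: PREF and SREF start with a Gutenberg ID, SERIAL starts with an issue number, and EREF ends with a URL. The sketch below splits such values; the helper names split_first and split_last and the sample values are hypothetical, and treating the URL as the last whitespace-delimited token of an EREF value is my assumption.

```python
def split_first(value):
    """Split 'TOKEN rest of text' into (TOKEN, rest); suits PREF, SREF and SERIAL."""
    parts = value.split(None, 1)
    if not parts:
        return "", ""
    return parts[0], (parts[1] if len(parts) > 1 else "")


def split_last(value):
    """Split 'text for link URL' into (text, URL); assumes the URL is the final token."""
    parts = value.rsplit(None, 1)
    if not parts:
        return "", ""
    if len(parts) == 1:
        return "", parts[0]
    return parts[0], parts[1]


# Hypothetical values, invented for illustration.
sref = split_first("12345 An Example Title")
serial = split_first("7 An Example Magazine")
eref = split_last("Example_site http://example.org/book")
print(sref, serial, eref)
```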

The fields are approximately in the following order, though with some variation among records: GREF, EREF, SERIAL, SREF, AUTHOR, ILLUSTRATOR, TRANSLATOR, EDITOR, CONTRIBUTOR, TITLE, PREF, LCCN, NOTE, FMT, NUMBER. (See my suggestion on this below.)

The following fields have multiple values in the indicated number of records:


* Note that the counts may not be exactly right with the latest file, though I still find them helpful in understanding the data.
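Because the counts drift as the file is updated, it may be easier to recompute them than to rely on a fixed list. A minimal sketch, assuming records already parsed into dictionaries that map each field name to a list of values; the function name multi_value_counts and the sample data are hypothetical.

```python
from collections import Counter


def multi_value_counts(records):
    """For each field, count the records in which it appears more than once."""
    counts = Counter()
    for record in records:
        for name, values in record.items():
            if len(values) > 1:
                counts[name] += 1
    return counts


# Hypothetical records, invented for illustration.
sample = [
    {"AUTHOR": ["Doe, Jane", "Roe, Richard"], "TITLE": ["An Example Title"],
     "NUMBER": ["12345"]},
    {"AUTHOR": ["Doe, Jane"], "FMT": ["text", "html"], "NUMBER": ["12346"]},
    {"AUTHOR": ["Poe, Ann", "Doe, Jane"], "NUMBER": ["12347"]},
]

counts = multi_value_counts(sample)
for name, n in sorted(counts.items()):
    print(name, n)
```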

My Suggestions

  1. Make sure the data fields are in the same order for every record. That will make it easier to DIFF against data exported from another source.
  2. Add a distinct SUBTITLE field, e.g. as done in PG's GUTINDEX and in Classicosm's PG metadata.
  3. Add a NOTE field.
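Until the file itself uses a fixed order, suggestion 1 can be approximated on the reading side by rewriting each record in one canonical order before running a diff. This is a sketch only: the order below is simply the approximate order listed earlier in these notes, the function name format_record is hypothetical, and putting unknown fields last in alphabetical order is my own choice.

```python
# Canonical order, taken from the approximate field order noted above.
FIELD_ORDER = [
    "GREF", "EREF", "SERIAL", "SREF", "AUTHOR", "ILLUSTRATOR", "TRANSLATOR",
    "EDITOR", "CONTRIBUTOR", "TITLE", "PREF", "LCCN", "NOTE", "FMT", "NUMBER",
]


def format_record(record):
    """Render one record (field name -> list of values) with fields in a fixed order."""
    known = [name for name in FIELD_ORDER if name in record]
    # Fields not in the list go last, alphabetically, so output stays deterministic.
    extra = sorted(name for name in record if name not in FIELD_ORDER)
    lines = []
    for name in known + extra:
        for value in record[name]:
            lines.append(f"{name} {value}".rstrip())
    return "\n".join(lines) + "\n"


# Hypothetical record, invented for illustration.
record = {
    "NUMBER": ["12345"],
    "TITLE": ["An Example Title"],
    "AUTHOR": ["Doe, Jane"],
}

output = format_record(record)
print(output, end="")
```

Joining the formatted records with blank lines gives a file that can be compared line by line against data exported from another source.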


Updated Oct. 28, 2004

Classicosm is a Product Architect site.
classicosm -at- product architect -dot- com (Feedback welcome!)
Copyright 2004 by Scott S. Lawton. All Rights Reserved. "Classicosm" and "A world of timeless value" are service marks owned by Scott S. Lawton.


