EarlyPrint Library

The EarlyPrint Library builds on the transcriptions of the Text Creation Partnership (TCP), now released into the public domain. The TCP transcriptions are themselves based on image collections available to subscribers to Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), and the Evans-TCP transcriptions of early American imprints. EEBO contains scans of about 130,000 items, about 55,000 of which were transcribed by the TCP. The selection emphasized first editions and, barring special reasons, avoided duplication. "Book" and "duplicate" are very slippery terms, but grosso modo it can be said that the TCP archive comes close to having one copy of every ‘book’ from the first 230 years of English print in a digital and reasonably interoperable format. The ECCO and Evans transcriptions are more selective but provide significant representation of eighteenth-century British and early American (1640-1800) imprints, respectively.

How does a "complex digital surrogate for almost every book before 1700" differ from what already exists? First, it can provide new digital images that will be free and of much better quality than the EEBO images, many of which are no longer up to current standards. Second, there will be a way of fixing the many corrupt or missing words that mar a quarter of the pages in the current transcriptions. Third, more powerful metadata "under the hood" (notably word-level linguistic tagging) of the visible text make the texts more computationally tractable, whether individually or in the aggregate.

A few years ago at a Princeton conference on Research Data Life Cycle Management (RDLCM), Brian Athey, the chair of Computational Medicine at the University of Michigan, observed that "agile data integration is an engine that drives discovery." That is a very pithy phrase. Early modern books are certainly 'research data'. They have a 'life cycle' that has entered a phase in which the affordances of digital media create new challenges. For the management of such life cycles, Joshua Sosin, a papyrologist at Duke who played a major role in the Integrating Digital Papyrology project, has argued forcefully for "investing greater data control in the user community."

Clay Shirky has written eloquently about "cognitive surplus" in his book with that title. In a digital world a lot of people have lots of hours that they can spend in different ways. Most of the 60,000 texts will benefit from some attention of a housekeeping kind. Much of that attention can be given to tasks that call on patience and attention to detail rather than highly specialized professional competence, and that may be done in minutes or hours rather than days and weeks at a time.

The Renaissance Society of America and the Shakespeare Association of America have between them several thousand members, and their students are probably counted in the tens of thousands. Together they have a lot of cognitive surplus. If a little of it is spent every year on improving the EarlyPrint corpus, the cumulative effect of five years’ work will be considerable.

The most distinctive feature of this site is the annotation capability that makes it very easy for you to donate a little of your cognitive surplus and spend it on textual work that improves the documentary infrastructure of Early Modern Studies. The core of this feature is a data entry template for correcting the most common forms of textual corruption. Anybody anytime and anywhere can contribute to making the transcriptions more accurate or complete than they are now. Such corrections are subject to editorial review. Once approved, they are incorporated into the source file, with due credit given to the contributors. For more detail consult the section on how to correct transcription errors.

Working with a text in this environment will be almost as easy as "reading with a pencil", but the results are more easily shareable. If you do textual work on anything in the TCP corpus, we hope that you will find this site the easiest place to do that work. That is a "win-win" scenario: what is easiest for you is readily shareable with others. If a text you want to work on is part of the TCP corpus or meets its structural conditions, it will be easy to add it. What can be done with Early Modern texts can also be done with other corpora, notably the smaller TCP-Evans corpus of American texts before 1800. Step by step we may move towards a "Book of English", defined as

A large, growing, collaboratively curated and public domain corpus
Of printed English from its earliest modern form
With full bibliographical detail
And light but consistent structural and linguistic annotation.

What next?

There are probably about 2,000 public domain image sets that can be mapped to TCP texts, would be good enough for most scholarly purposes, and would in most cases be significantly better than their corresponding EEBO images. Over the next few months we will try to create digital combos of many of them. Unfortunately there is always the need for a pair of human hands and eyes to adjust the alignment of text and image, and we will look for routines to ease and share this burden. A growing set of digital combos will add up to a motley collection, but we hope that it will persuade users of the attractiveness of old books in that new format. This is a good moment to draw attention to the eloquent essay "Together we can FrEEBO" that John Overholt, the curator of Early Modern books at Harvard's Houghton Library, wrote a few years ago.

A group of undergraduate interns from Carleton, Grinnell, Knox, Lake Forest, Northwestern, and Ohio Wesleyan experimented recently with TCP-compatible transcriptions of English Civil War pamphlets from the Rare Book libraries of Brandeis and Queen's University in Canada, and we have incorporated their corrections into the site. It is clear that students learn a lot from such an exercise and take some pleasure in creating something that others find useful.
