From the time EEBO was first released in 1998, users and librarians have been concerned that the inconsistent spellings that occur in early modern English texts would cause users to miss many texts relevant to their research and thus limit their ability to use such resources to their full potential.
Building on research being conducted by Professor Martin Mueller at Northwestern University, the mission of the VosPos (Virtual Orthographic Standardisation and Part of Speech) Project is to develop a tool that allows both expert and non-expert users to search databases such as EEBO using modern English spellings and automatically retrieve instances of extant early modern spelling variants.
The Variant Searching functionality is the result of this ongoing research project to provide orthographic standardisation to a large archive of texts from ProQuest and the Text Creation Partnership (TCP), including EEBO. The CIC CLI Virtual Modernisation Project is an initiative of the Center for Library Initiatives (CLI) of the Committee on Institutional Cooperation (CIC). It is being supported by ProQuest and member institutions of the CIC.
A 'standardised' spelling is typically, but not always, a 'modern' word form. Thus louynge and loues map to loving and loves respectively, but loueth maps to loveth, the standard spelling in which this archaic form appears in, say, the King James Bible.
Another key part of the VosPos project is the creation of lemmatisation data, which takes the process of standardisation one step further. Lemmatisation is the linguist's term for the practice of bundling the different forms of a word under the form in which the word is likely to appear in a dictionary. Thus loves, loved, and loving are forms of the lemma love. Lemmatisation allows users to look for all variant spellings of the standard spelling love or search for the lemma love (retrieving all variant spellings of the standard spellings love, loves, loveth, loving, and loved).
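The two-step relationship described above (variant spelling to standard spelling, standard spelling to lemma) can be pictured with a small Python sketch. This is an illustration only, not the project's actual data or code: the table names, function names and the handful of entries are hypothetical, with the entries drawn from the examples on this page.

```python
# Hypothetical toy tables; the real VosPos data is far larger and is not
# reproduced here. Entries are taken from the examples in this document.
VARIANT_TO_STANDARD = {
    "louynge": "loving",
    "loues": "loves",
    "loueth": "loveth",
    "loue": "love",
}

STANDARD_TO_LEMMA = {
    "love": "love",
    "loves": "love",
    "loveth": "love",
    "loving": "love",
    "loved": "love",
}


def variants_of_standard(standard):
    """Return the standard spelling plus every variant that maps to it."""
    found = {standard}
    found.update(v for v, s in VARIANT_TO_STANDARD.items() if s == standard)
    return sorted(found)


def variants_of_lemma(lemma):
    """Return every standard form of the lemma together with its variants."""
    found = set()
    for standard, lem in STANDARD_TO_LEMMA.items():
        if lem == lemma:
            found.update(variants_of_standard(standard))
    return sorted(found)


print(variants_of_standard("love"))  # the narrower search
print(variants_of_lemma("love"))     # the broader, lemma-level search
```

In this sketch a search on the standard spelling love only picks up loue, whereas a search on the lemma love also picks up loues, loueth, louynge and the other inflected forms.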
Virtual orthographic standardisation is available to all users for EEBO and Literature Online.
Variant searching is active by default whenever Linguistics is active on the ProQuest platform. This setting can be deactivated by the library in the ProQuest Administration Module, or by individual My Research users in their Preferences. Alternatively, a user can stop variant searching for a particular search term by putting quotes around it, or by adding the truncation symbol (*) or the wildcard symbol to the word.
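The rule for when variant searching is skipped can be summarised in a few lines of Python. This is a rough sketch, not ProQuest's implementation: the function name is hypothetical, and it assumes double quotes for quoting, * for truncation and ? for the wildcard, following the je?lo?s* example given elsewhere on this page.

```python
def variant_search_applies(term: str) -> bool:
    """Return True if variant searching would apply to this search term.

    Assumptions (not confirmed by ProQuest): quoting uses double quotes,
    '*' is the truncation symbol and '?' is the wildcard symbol.
    """
    term = term.strip()
    quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    has_operator = any(ch in "*?" for ch in term)
    return not (quoted or has_operator)


for example in ['murder', '"murder"', 'murder*', 'je?lo?s*']:
    print(example, variant_search_applies(example))
```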
To display all the variants that were actually searched at the top of the results page, use Advanced Search and select the relevant option under ‘Results page options’. See example below.
If you type a search term in the search box, you will automatically retrieve all instances of your search term and its early modern variant inflected forms and spellings in EEBO. For example, if you type the word murder, when you submit your search you will retrieve all occurrences of the word murder together with its inflected forms murdered, murdering, murders and its early modern variant spellings murther, murdre, murdir and mvrder.
You will also retrieve instances of early modern spelling variants of the various inflected forms of your original search term, for instance murthred, murthrest, murdreth, murdring, murtherynge and murthers.
The process of expanding a search to include inflected forms of your original term is known as lemmatisation.
Please note: when a search expression includes truncation or wildcard operators (e.g. je?lo?s*), variant searching is not applied.
Early modern typographical conventions mean that in pre-1700 texts certain characters are often used interchangeably. For instance, the characters j and i are often exchanged, with the word juniper occasionally appearing as iuniper, and the word Ireland as Jreland. Similarly, u often appears as a v, and vice versa, such that the word love often appears as loue, whilst usurper sometimes appears as vsurper. The letter w is occasionally represented by both vv and uu, with worth appearing as both vvorth and uuorth.
In ProQuest you will automatically retrieve instances of your search term(s) in which any of these simple substitutions (i for j and vice versa, u for v and vice versa, and uu and vv for w) have taken place. Thus a search for the term woman will retrieve forms of this word featuring variant typography such as vvoman and uuoman (along with other old spellings of woman such as womanne and vvoeman).
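One simple way to picture these typographic substitutions is to reduce every word to a common key in which i/j, u/v and uu/vv/w are no longer distinguished. The Python sketch below illustrates that idea only; it is not a description of how ProQuest implements the feature, and the function names are hypothetical. It handles purely typographic variants and does not cover other old spellings such as womanne or vvoeman, which depend on the project's spelling data.

```python
def typographic_key(word: str) -> str:
    """Collapse the interchangeable characters into one canonical form.

    Double letters are handled first so that 'vv' and 'uu' become 'w'
    before single 'v' is folded into 'u' and 'j' into 'i'.
    """
    key = word.lower()
    key = key.replace("vv", "w").replace("uu", "w")
    key = key.replace("j", "i").replace("v", "u")
    return key


def same_typographic_form(a: str, b: str) -> bool:
    """Return True if two words differ only by the substitutions above."""
    return typographic_key(a) == typographic_key(b)


for modern, early in [("woman", "vvoman"), ("woman", "uuoman"),
                      ("love", "loue"), ("Ireland", "Jreland"),
                      ("woman", "womanne")]:
    print(modern, early, same_typographic_form(modern, early))
```

A key of this kind is deliberately crude: it would also treat any genuine uu or vv sequence as w, so it over-matches in a way that a curated word list would not.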
Note that it is possible that some purely typographic variants of your search terms will not be listed at the top of the Results list, though these variant forms are present in EEBO. This is because the word lists that appear on this screen only include early modern spelling and typographic variants that are present in the corpus of 13,000 keyed texts produced by the Text Creation Partnership; other typographic variants that are unique to the 146,000 bibliographic records in EEBO (i.e. typographic variants that are not present in the Text Creation Partnership collection) will not be displayed. However, the search will automatically retrieve instances of your search term(s) in which any of the typographic substitutions described above have taken place, regardless of whether these variants appear at the top of the Results list or not.
Work on the project began in the summer of 2005 with a group of Northwestern undergraduates and graduate students working under the direction of Professor Mueller. Work has now moved into a more formal phase and is being carried on as a collaborative project between Professor Mueller and staff of the Academic Technologies group at Northwestern University.
The project has also extended its scope to include part-of-speech tagging. Part-of-speech tagging is necessary to resolve ambiguities (bee, doe, etc.), but its benefits extend far beyond this practical application.
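To see why part-of-speech information helps with forms such as bee and doe, consider a lookup keyed on both the spelling and its part of speech. The following Python sketch is purely illustrative: the tag names, table and function are hypothetical and do not reflect the project's actual tag set or data.

```python
# Hypothetical entries and tag names, for illustration only.
STANDARD_BY_SPELLING_AND_POS = {
    ("bee", "verb"): "be",
    ("bee", "noun"): "bee",
    ("doe", "verb"): "do",
    ("doe", "noun"): "doe",
}


def standardise(spelling: str, pos: str) -> str:
    """Return the standard form for a spelling, given its part of speech.

    Spellings that are not in the table are returned unchanged.
    """
    return STANDARD_BY_SPELLING_AND_POS.get((spelling.lower(), pos), spelling)


print(standardise("bee", "verb"))
print(standardise("bee", "noun"))
```

Without the part-of-speech tag, the spelling bee alone cannot tell the search whether to standardise to be or to leave the word as bee.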
When completed, the project will offer virtual orthographic standardisation and part-of-speech tagging for approximately a billion words of written English from the late fifteenth through the nineteenth century, including the Text Creation Partnership's Early English Books Online (TCP) and the ProQuest full-text collections of English Poetry, English Drama (including the Folio text of Shakespeare), Early English Prose Fiction, the King James Bible, Eighteenth-Century Fiction, Nineteenth-Century Fiction, and Literary Theory.
There are roughly three million distinct spellings in this collection of texts, including approximately 500,000 foreign words (mostly Latin and French) and approximately 250,000 names. It is estimated that 750,000 spellings account for at least 99% of all word occurrences. The current version of the functionality available to EEBO users focuses on mapping the spellings of English words to their standard forms. No effort has yet been made to map the spellings of names to standard forms, a task that presents problems of its own.
Reports on work-in-progress are available as PDF files from http://panini.northwestern.edu/mmueller/vospos.pdf.
The following institutions are members of the Virtual Modernisation Project, which has supported the development of the functionality now available to all users of EEBO:
We are grateful for the efforts of the following individuals who have worked on the Virtual Modernisation Project and who have made possible the resulting enhancements to EEBO:
Martin Mueller, Professor of English & Comparative Literature, Northwestern University
Jeffrey Garrett, Assistant University Librarian for Collection Management, Northwestern University
Phil Burns, Academic Technologies, Northwestern University
Jeff Cousens, Academic Technologies, Northwestern University
John Norstad, Academic Technologies, Northwestern University