I am very pleased to announce a new version of the ARTFL electronic edition of Diderot and d'Alembert's Encyclopédie, one in which a great number of corrections and editorial additions have been incorporated. The following introduction addresses some of the editorial concerns we have faced in establishing this edition, as well as the numerous improvements we have enacted over the past 15 years. Undertaking an electronic edition of the Encyclopédie represented a daunting task. Its structure is very complex; the typographical conventions used for textual elements - from article headwords to classifications and cross-references - varied to a significant degree from volume to volume; the relationship between articles and the plate images is in no way clear or systematic. All this notwithstanding, the computer offered a host of new possibilities both for making the work accessible to the scholarly community and for navigating within the work itself. In addition, the digital medium allowed us to think in terms of a "living edition" that could be corrected, developed and improved over time. Our initial choice was to make the work accessible as quickly as possible and progressively to correct it. In order to compensate for the errors introduced during the original data capture process, we chose to make page images of the volumes available for comparison and verification. As we undertook to correct the text, we also strove to improve the search and retrieval capacities. All too often our users limit themselves to simple word and phrase searches, yet these do not always yield the most fruitful results. Using our new search and reporting features can significantly improve the user's ability to move through what Diderot himself described as the "tortuous labyrinth" that is the Encyclopédie. Looking at frequency of occurrence by article or collocation tables, for example, can provide more useful paths into the Encyclopédie than simple word searches alone.
While we have steadily made improvements over the years, this new version marks an important stage in the unfolding development of the electronic edition of Diderot and d'Alembert's monumental work. The author attributions have been verified and corrected; new searching and navigation functions have been introduced; new research and archival materials have been made available. This version includes not only the four-volume Supplement to the Encyclopédie, but also the proofs of censored articles and legal documents bound together in the so-called "18th volume." For the first time, our user community will be able to participate in the correction and improvement of the edition by using our "report error" link to inform us of errors they encounter. All these factors have contributed to our decision to make most elements of this site available not just to the scholarly community of ARTFL subscribers, but to the public at large. In the following paragraphs, I will briefly describe the evolution of ARTFL's digital Encyclopédie.
In the Beginning: Choosing an Edition
From the outset of the Encyclopédie project there were several important editorial decisions that greatly affected the initial construction and dissemination of the database. First, there was the choice of the edition. There were many editions of the Encyclopédie in various formats. We chose the first printing of the Paris edition - see our comparison of Encyclopédie editions. Richard Schwab then kindly agreed to assess the microfiche version produced by IDC (Leiden, The Netherlands) and confirmed that it reproduces a good copy of the first edition - it was from these microfiches that our contractor performed the data entry of the Encyclopédie. We were aware that many typographical errors had been introduced into the text during the data capture procedure. Unfortunately, due to the size of the Encyclopédie and its great semantic diversity, it was impossible to correct these errors by any normal spell-checking procedure. Additionally, given the fact that all identifications of textual elements - articles, authors, cross-references, etc. - were made using automated procedures based on typographical patterns, we were aware that many problems - unidentified or misidentified articles, missing author attributions, incomplete information about grammatical and knowledge categories, malfunctioning cross-references, etc. - would need to be addressed through a large-scale corrections project. These reservations aside, we thought it best to release a largely uncorrected version of the database and to work on progressively integrating both text and metadata corrections as they were made. For more on our corrections effort, see the Encyclopédie Corrections page.
One of the most complex problems we encountered in establishing this edition was in properly attributing authors to their respective articles. In the beginning, we simply tried to identify authors automatically using the authorial marks that occur in the text - e.g., (*) for Diderot, (S) for Rousseau, (O) for d'Alembert, etc. - an approach which, while mostly successful, still left many articles unattributed. Articles with multiple authors, unsigned articles, and articles by authors with no authorial mark all posed significant problems for our automatic recognizers. To address these issues we consulted the Schwab Inventory 1) to identify unsigned articles whose authorship was attributed by Schwab and 2) to correct any missed authorship information that was not included in our metadata (see below). The more than 1,500 author attributions to unsigned articles that resulted from this process are indicated by the number "5" after an author's name, e.g., Holbach5, Saint-Lambert5, Voltaire5, etc. For Diderot's articles, we have followed the Hermann edition of Diderot's complete works (Lough and Proust, eds.) in establishing the "Diderot," "Diderot2," and "Diderot3" designations. We have also verified d'Alembert's articles using outside expertise (M. Groult) - for more, see our Author Attributions page.
The new version of the Encyclopédie database (Revision 3.5, 5/2013) contains more than 650,000 modifications made to the original 1998 source files. These corrections were made using a variety of approaches, both automatic and by hand. Over the past few years we have also worked to improve and correct the Encyclopédie metadata - Article Headwords, Author Attributions, Classes of Knowledge, etc. We are aware, however, that many small textual errors - artifacts of the original data capture project - still remain in the database. In an effort to track these errors down, users can now use the "Report Error" link at the top right-hand corner of the results page to report errors directly to the ARTFL Project. These errors will be collected and applied on a quarterly basis.
Text Corrections - We have corrected errors in the text using a two-step process. First, we completed an automatic recognition/correction process that fixed most of the high-frequency errors, many of which were the result of the long-s character in 18th-century typography, which was frequently confused with "f" (e.g., semme for femme). Other commonly misrecognized characters included "er" for "cr" (deseription for description), "e" for "c" (done for donc), and "c" for "e" (cst for est). Then, using our own spell-checking mechanism, we identified possible remaining errors in the text, which we then compared to the Encyclopédie page images and hand-corrected. From 1999 to 2006 this process yielded over 450,000 corrections to the database. See our Text Corrections page.
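The first, automatic step can be pictured with a small sketch. What follows is only an illustration under stated assumptions, not ARTFL's actual correction code: the confusion pairs are taken from the examples above, while the tiny lexicon, the function names, and the two-substitution limit are hypothetical. A word is changed only when exactly one lexicon word can be reached, and ambiguous cases are left for comparison against the page images.

```python
# Confusion pairs (wrong -> right) drawn from the examples in the text.
CONFUSIONS = [("s", "f"), ("er", "cr"), ("e", "c"), ("c", "e")]


def single_substitutions(word):
    """Return every variant of `word` with exactly one confusion undone."""
    variants = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            variants.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return variants


def correct(word, lexicon, max_edits=2):
    """Return the unique lexicon word reachable within `max_edits`
    substitutions, or the word unchanged if none or several are found."""
    if word in lexicon:
        return word
    frontier = {word}
    for _ in range(max_edits):
        frontier = {v for w in frontier for v in single_substitutions(w)}
        hits = sorted(v for v in frontier if v in lexicon)
        if len(hits) == 1:
            return hits[0]
        if hits:
            return word  # ambiguous: leave for manual checking
    return word


# Hypothetical miniature lexicon, for illustration only.
LEXICON = {"femme", "description", "donc", "est", "effet"}

for bad in ["semme", "deseription", "done", "cst", "esset"]:
    print(bad, "->", correct(bad, LEXICON))
```

A real lexicon of 18th-century French would produce far more ambiguous candidates than this toy one, which is one reason a second, manual pass against the page images remains necessary.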
Metadata Corrections - Over the past 2 years we have systematically checked and corrected the Encyclopédie metadata - Article Titles, Classes of Knowledge, Authorship, etc. - verifying our original metadata against Richard Schwab's Inventory of Diderot's Encyclopédie. Any discrepancies in article title, author attributions, class of knowledge, etc. were then checked against the page images of the Encyclopédie and corrected or added where appropriate. To date more than 8,000 additions and countless corrections have been made - for more, see our Metadata Corrections page.
Recent Corrections - Over the past several years we have integrated some 5,000 new text corrections that were a direct result of the "Report Error" system. Often a single error, for example esset recognized for effet, can yield hundreds of corrections when applied to the entire database. Thus, we strongly urge our users to continue using the "Report Error" button to signal not only typos or small textual errors, but also structural errors (missing articles, misrecognized headwords, etc.), errors in the metadata (author name, headword, class of knowledge, etc.), page links, or any other error for that matter.
In November of 2009 we began the process of converting the text of the Encyclopédie into standard Unicode (UTF-8) using a light TEI-XML encoding scheme. This move is significant in two ways: First, we can coherently represent and associate an article’s metadata (author, classifications, part of speech, etc.) with the article itself, i.e., in a TEI-XML header for each article entry, rather than storing them in external databases as we have done in the past. This will additionally allow us to manipulate the metadata in the future, adding machine classifications, similar article lists, a notes section, or any other relevant information on an article-specific basis. Secondly, the move to the Unicode standard has finally made correction of the Greek passages in the Encyclopédie possible - see our Greek Corrections page.
We have also corrected many, if not all, of the structural problems that have long been an issue with the database - missing articles or misrecognized headwords, badly formatted front matter (e.g., the "Avertissements"), etc. - resulting in more than 300 new articles and sub-articles. We have also merged the various data "chunks" that made up the volumes (essentially seven 1MB sections for each volume) into 28 discrete TEI files, thus eliminating the long-standing "overlap" problem that prevented users from moving from one page to the next when the page break fell between two volume parts.
Further Corrections - We are aware that textual errors still exist and invite users to submit any error they encounter using the "Report Error" link at the top of result pages. Moving forward, we will begin to think about the very complex issue of establishing links from plate references in the text to the appropriate plate volumes. In addition, there is the issue of mathematical formulae and various tables. While the text in these tables is searchable, the best way to view these elements graphically is by consulting the page images. For the moment we see no coherent way to represent the mathematical formulae in the digitized text. While this may change as technology evolves, presently these formulae are represented only on the page images.
Cross References (Renvois) - The system of cross-references in the Encyclopédie represented one of the thorniest issues we encountered while establishing this digital edition. From the very outset, it was clear that the renvois were in no way systematically distributed in the original text - i.e., authors would often include a cross-reference to an article that had yet to be written (and perhaps never would be written), resulting in many renvois that lead to non-existent articles or articles with different headwords. We attempted to identify the renvois automatically using typographic conventions ("Voy. ART" at the end of an article for example), leading to some misrecognized links. Recently, we have corrected some 1,200 of these misrecognized renvois, the result of author names in small caps that occurred at the ends of articles, leading our recognizers to treat them as cross-references, e.g., "Cet article est de M. WATELET". We strongly encourage our users to submit any errors they encounter concerning the cross-references (e.g., misrecognized, misspelled, or broken renvois) using the "Report Error" link at the top of the search results page.
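The kind of misrecognition described above can be illustrated with a short sketch. The two regular expressions below are hypothetical and are not the recognizers actually used for this edition: one naively treats any run of capitals as a cross-reference target, while the other requires a "Voy." or "Voyez" cue before the headword. The first sample sentence is invented for the example; the second is the signature quoted above.

```python
import re

CAPS = "A-ZÀÂÄÇÉÈÊËÎÏÔÖÙÛÜ"

# Naive rule (hypothetical): any word of two or more capitals is a renvoi.
NAIVE = re.compile(rf"\b[{CAPS}]{{2,}}\b")

# Stricter rule (hypothetical): capitals must follow "Voy." or "Voyez".
ANCHORED = re.compile(rf"\bVoy(?:ez|\.)\s+([{CAPS}][{CAPS}'-]+)")

# Invented article ending followed by the signature quoted in the text.
ending = "Voyez PEINTURE. Cet article est de M. WATELET."

naive_hits = NAIVE.findall(ending)
anchored_hits = ANCHORED.findall(ending)

print(naive_hits)     # the author's name is caught as a renvoi
print(anchored_hits)  # only the headword after "Voyez" is kept
```

Even the stricter rule is only a heuristic: a renvoi whose target article was never written would still be matched and would still lead nowhere, which is why user reports remain valuable.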
The Encyclopédie database uses a modified version of the ARTFL Project's full-text search and retrieval engine, PhiloLogic. With this new version come several new search and reporting features such as collocation tables, frequency by headword reports, and a sortable keyword in context (KWIC) function.
While word and phrase searching remains the backbone of the PhiloLogic interface, making use of these new reporting features can offer alternate ways in which to navigate the sometimes overwhelming number of word/phrase occurrences that are returned. These reports are especially important to students working on the Encyclopédie and can provide them with more varied paths into this highly complex work.
The frequency by article report indicates the number of occurrences by article title in descending order of frequency with a link to the article and a link to the occurrences found within that article. For example, if you search for "Newton" you will notice that 45 of the 783 occurrences of "Newton" occur in the article "Wolstrope" - this may seem inconsequential until one realizes all of the biographical information about Newton is found in this article about his home town, a fact which may have eluded some users looking for an article about Newton with a different title.
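As a rough illustration of what such a report computes, here is a minimal sketch. It is not PhiloLogic's implementation; the function name and the three miniature "articles" are invented for the example, and the counts have nothing to do with the real figures quoted above.

```python
import re
from collections import Counter


def frequency_by_article(articles, term):
    """Count whole-word occurrences of `term` in each article and return
    (headword, count) pairs in descending order of frequency."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for headword, text in articles.items():
        n = len(pattern.findall(text))
        if n:
            counts[headword] = n
    return counts.most_common()


# Invented toy texts, for illustration only.
toy_articles = {
    "ATTRACTION": "Selon Newton, les corps s'attirent.",
    "WOLSTROPE": "Newton y naquit. Newton y vécut. On y honore Newton.",
    "ABEILLE": "Insecte qui produit le miel.",
}

report = frequency_by_article(toy_articles, "Newton")
print(report)
```

Sorting by count rather than by headword is what surfaces an unexpected article such as "Wolstrope" at the top of the list.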
Additionally, the context and relational aspect of search terms can be examined globally using the collocation table and keyword in context (KWIC) reports. Collocation tables provide users with a simple way of seeing the words with which the search terms most often co-occur, and the sortable KWIC reports allow users to sort their line-by-line results alphabetically, either to the right or left of the highlighted keyword - both reports can help users move away from examining single word occurrences and towards a broader understanding of term usage over the entire Encyclopédie.
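Both reports reduce to simple operations on a window of words around each hit. The sketch below is a simplified illustration, not PhiloLogic's code: the tokenizer, the window sizes, the function names, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def collocates(tokens, keyword, span=5):
    """Count the words occurring within `span` tokens of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


def kwic(tokens, keyword, width=3, sort_side="right"):
    """Return (left, keyword, right) contexts sorted alphabetically on the
    words to the right or to the left of the keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            lines.append((tokens[max(0, i - width):i], tok,
                          tokens[i + 1:i + 1 + width]))
    if sort_side == "right":
        lines.sort(key=lambda line: line[2])
    else:
        # Left sort compares the word nearest the keyword first.
        lines.sort(key=lambda line: line[0][::-1])
    return lines


# Invented sample sentence, for illustration only.
sample = tokenize("La raison humaine guide la raison publique et la loi")

table = collocates(sample, "raison", span=2)
lines = kwic(sample, "raison", width=2, sort_side="right")
for left, key, right in lines:
    print(" ".join(left).rjust(12), key.upper(), " ".join(right))
```

A production collocation table would normally also filter out very common function words, a step omitted here for brevity.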
In support of the Digital Encyclopédie, the ARTFL Project has begun to build an archive of eighteenth-century documents relating to the production and reception of the work, as well as several chronologies and publication histories. These include several of Diderot's letters from his internment at Vincennes, documents pertaining to the controversial publication history of the Encyclopédie, and high-resolution versions of the Encyclopedic "Arbre généalogique," "Système Figuré," and "Frontispiece." In bringing these documents together in one central location, we hope to provide our users with convenient access to extensive information that will enrich their research within the work itself. We are constantly looking for new resources to enhance our site and we invite scholars to Contact Us with ideas and materials they would like to contribute.
The "18th" Volume: A New Resource
Working with the University of Virginia's Small Special Collections Library, we are pleased to offer, for the first time, online access to Douglas Gordon's famous "18th Volume" of the Encyclopédie. This extra volume, which includes some of the earliest title pages and prefatory material of the Encyclopédie project, also includes some 284 pages of corrected article proofs, comprising 46 articles submitted by Diderot which were presumably censored or altered by the publisher Le Breton before the final printing. The existence of these proofs, along with the collected legal documents pertaining to Luneau de Boisjermain's lawsuit against the Encyclopédie's publishers, has led many to believe that this volume may have belonged to Le Breton's personal collection. We have included both the transcribed text (with indications of what was censored, added, etc.) of the censored articles as well as image links to the page proofs; from the page image interface users can also browse the entire volume. The extent of the censorship varies greatly among the 46 articles, from excised words and phrases to whole paragraphs (see "SARRASINS") and even entire articles, such as Jaucourt's "TOLERANCE." See the 18th Volume page.
Current Research and Development
The Digital Encyclopédie has been at the forefront of ARTFL's current research into data mining and machine learning techniques, serving as a test-bed from which to experiment with new techniques designed to explore large-scale digital collections. These approaches can help us better understand the rich classification scheme of the Encyclopédie as well as the dialogic construction of its content, connected to articles and outside sources through a complex system of cross-references and intertextual relations. Over the past few years, for example, we have used a variety of text similarity measures (such as the Vector Space Model and k-Nearest Neighbor) to detect the presence of "borrowed" articles from two of the Encyclopédie's Jesuit predecessors - the Dictionnaire de Trévoux and Louis Moréri's Grand dictionnaire historique.
Most recently, using a combination of Bayesian and k-NN (k-Nearest Neighbor) classifiers - much like the ones at work in your everyday spam filter - we have leveraged the classification scheme of the Encyclopédie (which in today's computer science and information retrieval terminology would be called its "ontology") to predict the classification of the 13,000 (15,000 with plate legends) articles that the editors left with no class of knowledge. The same process was then used to "reclassify" the 61,000 remaining articles, those with classes of knowledge originally assigned by the Encyclopédie’s editors but which, for our purposes, were hidden from our classifiers. The resulting machine-generated classes for all 74,000 articles have been added to the metadata of each article for search and display purposes. A little over 73% of the classified articles came back with their original classes, an astounding feat considering the size and complexity of the Encyclopédie's ontology. Thus, the remaining 27% of articles have been assigned "new" classes that may, or may not, represent the content of their articles better. They will most certainly, we hope, generate a fair amount of debate and dialogue amongst our users. To that end, we are exploring ways in which users could comment on or evaluate the machine-generated labels as well as the "Similar Article" lists outlined below.
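To give a sense of how a Bayesian classifier of this kind works, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is a textbook formulation offered under stated assumptions, not the classifier actually used in this research; the class name `NaiveBayes` and the four tiny training "articles" are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                if tok in self.vocab:
                    count = self.word_counts[label][tok]
                    score += math.log(
                        (count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set: token lists labeled with a class of knowledge.
train_docs = [
    "triangle angle ligne".split(),
    "cercle ligne angle".split(),
    "plante fleur feuille".split(),
    "racine fleur plante".split(),
]
train_labels = ["Géométrie", "Géométrie", "Botanique", "Botanique"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("angle cercle".split()))
print(model.predict("fleur racine".split()))
```

Hiding an article's editorial class, predicting it, and comparing the prediction with the original is the same evaluation idea described above, applied here to a toy scale.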
Using the same Vector Space and k-NN similarity approach from above, we have identified the 50 most "similar" articles for nearly 40,000 of the Encyclopédie’s entries (those with 60 or more words). Users are thus able to consult a select number of articles related (via the k-NN calculations) to the article they are reading, as well as a list of shared features (word stems) between any two "similar" articles. This will perhaps allow users to discover related themes, authors, articles, etc. independently of word or metadata searching; in effect "navigating" through the Encyclopédie via similar articles rather than the traditional Google-style point-and-click method of searching. With this same notion of navigation versus searching we are also experimenting with ways of representing the system of cross-references (the renvois mentioned above) independent of the text in which they occur. This level of abstraction can offer a new perspective on how articles are related via the network of links that connect them to each other, a network that according to Diderot was the most "philosophic" of the editors' organizational schemes for the Encyclopédie.
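The following sketch shows one common way to compute such similarity lists. It makes several assumptions that the text does not confirm: tf-idf term weighting, cosine similarity as the distance measure, and raw word forms rather than word stems. The function names and the three miniature articles are invented for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Map each article name to a sparse {term: tf-idf weight} vector."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    return {
        name: {t: c * math.log(n_docs / doc_freq[t])
               for t, c in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(name, vectors, k=50):
    """Return the k nearest neighbours of `name` as (article, score)."""
    scored = [(cosine(vectors[name], vec), other)
              for other, vec in vectors.items() if other != name]
    scored.sort(reverse=True)
    return [(other, score) for score, other in scored[:k]]


def shared_features(tokens_a, tokens_b):
    """List the terms two articles have in common."""
    return sorted(set(tokens_a) & set(tokens_b))


# Invented toy articles, for illustration only.
toy = {
    "CERCLE": "cercle ligne courbe centre".split(),
    "ELLIPSE": "ellipse ligne courbe foyer".split(),
    "ROSE": "rose fleur plante epine".split(),
}

vectors = tf_idf_vectors(toy)
neighbours = most_similar("CERCLE", vectors, k=1)
common = shared_features(toy["CERCLE"], toy["ELLIPSE"])
print(neighbours, common)
```

At the scale of the full Encyclopédie, comparing every pair of articles this way is costly, so a practical system would likely precompute and store each article's neighbour list.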
Our newest experiment uses sequence alignment algorithms borrowed from bioinformatics in an effort to find discrete text sequences, from several words to entire articles, that occur in the Encyclopédie and earlier works such as Montesquieu's De l'esprit des lois. It is our hope that by expanding these techniques we can come to a better understanding of the intertextual nature of the Encyclopédie, gauging not only to what extent its authors used previous sources, but also how the philosophes were themselves received and appropriated in the decades following the Encyclopédie's publication. For more on this and other ongoing research see the ARTFL-PhiloMine bibliography and the ARTFL Research Blog.
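The text does not say which alignment algorithm is used, so the sketch below takes Smith-Waterman local alignment, a classic bioinformatics method, as a stand-in and applies it to words instead of nucleotides. The scoring values, the function name, and both sample passages are assumptions invented for the example.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment over two token lists.
    Returns (best score, aligned span of `a`)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + step,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end]


# Invented sample passages, for illustration only.
later = "on dit que les rapports nécessaires dérivent de la nature".split()
earlier = "les rapports nécessaires dérivent de la nature des choses".split()

score, span = local_alignment(later, earlier)
print(score, " ".join(span))
```

Because the alignment is local, the shared passage is found even though the two texts begin and end differently, which is the property that makes this family of algorithms attractive for detecting borrowings.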
Collaborations have been an important part of the Encyclopédie Project's development, and we continue to welcome any opportunity for further collaborative enterprises in the future. Our most successful collaborations have all contributed to the various elements outlined above - bringing us new resources (University of Virginia and the "18th Volume"); translations and classifications (University of Michigan); contributions to our research and archival material, corrections and editorial advice (CNRS); and collaborative research and development (Stanford University). The collaborative atmosphere of this "living edition" will only increase in importance as this edition of the Encyclopédie reaches a far greater audience. All users are encouraged to think about ways to improve this resource, whether simply by alerting us to errors using the "Report Error" link, or through a more engaged reflection on its development. For more, see our Encyclopédie Collaborations page.
None of this would have been possible without the collaboration of a remarkable group of young humanist scholars with considerable technical capabilities. The unique makeup of this team has allowed us to strike a balance between technical innovation, textual improvement, and editorial judgment. I would like to express my enduring gratitude to the entire Development Team for all of their work.