After a series of revisions made as part of a blog post on this subject (link), we thought it might be helpful to provide an update. We have been interested in teasing out how the Vector Space Model handles small versus large articles, and in getting some sense of why particular articles are selected as similar. We therefore reran the vector space similarity function on 39,218 articles (those with 60 or more words), which took some 24 hours to complete. We excluded some 150 surface forms of words in a stopword list, all sequences of numbers (and roman numerals), as well as features (in this case word stems) found in more than 1,568 or fewer than 35 articles. This last step removed features like blanch, entend, mort, and so on. In all, we removed some 600 features, leaving 10,157 features for the similarity calculation.
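The pruning steps described above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the function names are hypothetical, the roman-numeral pattern is an assumption about how such tokens might be detected, and for simplicity the stopword check is applied to the same tokens as the frequency cut-offs, whereas in the run described above stopwords were surface forms and the cut-offs applied to stems. Only the two thresholds (35 and 1,568 articles) come from the text.

```python
import re
from collections import Counter

# Document-frequency thresholds taken from the text above.
MIN_DF = 35
MAX_DF = 1568

NUMBER_RE = re.compile(r"^\d+$")
# Naive lowercase roman-numeral pattern (an assumption). Note that it also
# catches real words that happen to be valid numerals, such as "dix".
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm]+$)m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)


def is_noise(token, stopwords):
    """True for stopwords, digit sequences, and roman numerals."""
    return (
        token in stopwords
        or NUMBER_RE.match(token) is not None
        or ROMAN_RE.match(token) is not None
    )


def document_frequencies(docs):
    """Count, for each feature, the number of documents it occurs in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def prune_features(docs, stopwords, min_df=MIN_DF, max_df=MAX_DF):
    """Return one Counter of kept features per document.

    `docs` is a list of token lists, one list per article.
    """
    cleaned = [[t for t in tokens if not is_noise(t, stopwords)] for tokens in docs]
    df = document_frequencies(cleaned)
    keep = {t for t, n in df.items() if min_df <= n <= max_df}
    return [Counter(t for t in tokens if t in keep) for tokens in cleaned]
```

Counting document frequency on sets rather than raw token lists is what makes the cut-offs refer to the number of articles a feature appears in, not its total number of occurrences.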
The number of matching terms for small articles can, of course, be very small. For example, the article "Tout-Bec" (62 words) is left with four stems [amer 1|oiseau 2|ornith 1|bec 3]. The most similar article, Rhinoceros (Hist. nat. Ornith.), matches on three stems (remember that only the main article is considered here):
word     frq1  frq2
bec      3     5
oiseau   2     2
ornith   1     1
Are these similar? Well, both are very small articles about kinds of rare birds that are notable for their beaks, one with a very large beak and one that looks as if it has two or more beaks. It is also important to note that "ornith" (the class of knowledge), which appears in both, is picked up in this example. The next article down (Pipeliene) matches on:
word     frq1  frq2
amer     1     1
bec      3     1
oiseau   2     2
The third most similar article in this example is "Connoissance des Oiseaux par le bec & par les pattes.", a plate legend with, as you would expect, lots of beaks. It matches on two stems, bec and oiseau.
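To make the matching tables above concrete, here is a small sketch of how the shared stems and a similarity score could be computed from two stem-frequency vectors. It assumes cosine similarity over raw stem counts, which is a common choice for a vector space model but is only an assumption here; the function names are hypothetical. The "Tout-Bec" counts are those given above. For Rhinoceros only the three matching counts are known from the table, so the second vector is deliberately incomplete, and a cosine score computed from it would overstate the real one.

```python
import math
from collections import Counter


def shared_stems(query, other):
    """Rows of (stem, freq in query, freq in other) for stems found in both."""
    return sorted((s, query[s], other[s]) for s in query.keys() & other.keys())


def cosine(query, other):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(query[s] * other[s] for s in query.keys() & other.keys())
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_o = math.sqrt(sum(v * v for v in other.values()))
    return dot / (norm_q * norm_o) if norm_q and norm_o else 0.0


# Stems of "Tout-Bec" as reported above.
tout_bec = Counter({"amer": 1, "oiseau": 2, "ornith": 1, "bec": 3})
# Only the matching stems of Rhinoceros are known from the table; its other
# stems are omitted, so this vector is partial.
rhinoceros_matched = Counter({"bec": 5, "oiseau": 2, "ornith": 1})

for stem, frq1, frq2 in shared_stems(tout_bec, rhinoceros_matched):
    print(stem, frq1, frq2)
```

With only four stems in the query, a single shared stem such as bec accounts for most of the dot product, which is one way of seeing why very short articles match on so little.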
It seems that the size of the query article, now that we have eliminated many function words and other extraneous data, has a significant impact. The larger the article, the more possible matches you will get (Zipf's Law applies here). Longer articles tend to be most similar to other longer articles, and shorter articles match better to shorter ones. So "similarity" here would appear to be a function of both the relative frequencies of common features and the length of the articles. We saw this in our original comparison of the Encyclopédie and the Dictionnaire de Trévoux, where we had built in some restrictions on size and compared only articles with the same first letter rather than all to all. As far as we can tell, the kind of feature pruning shown here does not have a significant impact on larger articles.
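The kind of restriction mentioned above (limiting comparisons by size and by first letter) might look something like the following sketch. The `Article` class, the `candidates` function, and the `max_ratio` value are all hypothetical; the text does not say what size limits were actually used in the Trévoux comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    headword: str
    n_words: int


def candidates(query, articles, max_ratio=3.0, same_letter=True):
    """Articles worth comparing to `query` under two simple restrictions.

    max_ratio:   largest allowed ratio between the longer and the shorter
                 article (an illustrative value, not one from the text).
    same_letter: if True, keep only headwords sharing the query's first letter.
    """
    kept = []
    for art in articles:
        if art is query:
            continue
        longer = max(art.n_words, query.n_words)
        shorter = min(art.n_words, query.n_words)
        if shorter == 0 or longer / shorter > max_ratio:
            continue
        if same_letter and art.headword[:1].upper() != query.headword[:1].upper():
            continue
        kept.append(art)
    return kept
```

Filtering before scoring also cuts the number of pairwise comparisons, which matters when a full all-to-all run takes on the order of a day.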
User feedback could be valuable in determining just how many features, and what kinds of features, are required to get more interesting matches. For any pair, we could store the VSM score, the sizes, and the matching features along with the user rating of the match. That might generate some actionable data for future applications.
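As a rough idea of what such a stored record could contain, here is a minimal sketch. The class name, field names, rating scale, and JSON-lines storage are all assumptions made for illustration; nothing like this exists in the current system.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class MatchFeedback:
    query_id: str                # headword or id of the query article
    match_id: str                # headword or id of the proposed similar article
    vsm_score: float             # similarity score reported by the model
    query_size: int              # size of the query article, in words
    match_size: int              # size of the matched article, in words
    # stem -> [frequency in query, frequency in match]
    matching_features: Dict[str, List[int]]
    user_rating: Optional[int] = None  # e.g. 1 (poor) to 5 (good); scale is hypothetical


def to_json_line(record):
    """Serialise one record as a single JSON line, keeping accented characters."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Keeping the sizes and matching features next to the rating would make it possible later to ask whether poorly rated matches cluster among very short articles or among pairs sharing only two or three stems.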
[Aside: In some cases, similar passages lead to possibly related plates and legends. Cadrature, for example, links to numerous plate legends dealing with clockmaking.]