SherlockHolmes: An R Program to Analyze the Hidden Structure of Sherlock Holmes Stories by Statistical Pattern Analysis of Concordances
Barry Zeeberg
Motivation
Although Arthur Conan
Doyle was best known for his 60 Sherlock Holmes stories, he was a
prolific writer of many other works
[https://en.wikipedia.org/wiki/Arthur_Conan_Doyle].
For many decades, I have
been interested in the Sherlock Holmes stories. I have also had an
interest in the Dutch artist Johannes Vermeer, the American artist
Edward Hopper, and the American bluesman Robert Johnson. Perhaps it is
just some peculiarity of my own subjective perception, but for each of
these I tend to distinguish, e.g., “real Vermeers”
from “fake Vermeers.” I do not
mean “fake” in the sense of a forgery. I mean that certain of the
Vermeer paintings strike me as representing why he is so
highly-regarded, and others are more or less “pedestrian” (as a math
professor used to say about calculus proofs that were fairly
routine).
I found that some of the
Sherlock Holmes stories seemed more like stories about something else,
and Sherlock was just added as an afterthought. Without any actual
knowledge of the subject, I assumed that Sherlock was very popular, and
Conan Doyle could just use Sherlock as a “bait” to get more readership
of his “other” stories.
It would be rather
tedious to read each story and tabulate how much of a story was really
Sherlock detecting, and how much was seemingly thousands of pages about
soldiers in India or the KKK. It occurred to me that I could write an R
program to perform a concordance that might shed some light on the
matter in an objective manner. It is always gratifying to “validate” my
subjective biases using an objective procedure :-).
The idea that I had was
that Watson was usually present in the stories, and he would either
address Sherlock directly, calling him “Holmes,” or he would mention
“Holmes did or said such and such.” The point is that, thanks to the
presence of Watson as his chronicler, the literal string “Holmes” could
be used as a proxy for the presence or activity of Sherlock. I am not
sure that this could be done as successfully in
general.
Technical Methods
All analyses are performed by invoking
Sherlock(titles,texts,patterns,toupper,odir,minl=100,P=0.00001,verbose=FALSE)
In two of the figures
presented below (Figures 5 and 6), I used an excellent R language
package “dpseg: Piecewise Linear
Segmentation by Dynamic Programming” by Rainer Machne and Peter Stadler. This package implements
piecewise linear regression modeling, and enables a quantitative
analysis of the results in those 2 figures. Dr. Machne kindly provided
me with the source code for the plotting function, which allowed me to
make several minor custom modifications.
Literary Methods
An excellent single text
file containing all of the stories is readily available online
[https://sherlock-holm.es/stories/plain-text/cano.txt]. A small
amount of manual editing was required to prepare this file for automated
analysis (mostly consisting of deleting some passages that were not part
of the story, and some chapter subtitles,
etc.).
The full version
(contents.txt and processed_download.txt), and an abbreviated version
(contents3.txt and processed_download3.txt) of the titles and text files
for the Sherlock Holmes stories are provided in
/inst/extdata.
The 60 Sherlock Holmes
stories comprise 4 novels and 56 short stories. My recollection
is that the 4 novels are the most flagrant in not being “real” Sherlock
Holmes stories, although there were plenty of short stories for which
this is also true. There was one short story “The Adventure of the Blue
Carbuncle,” which I remember had a very high degree of participation by
Holmes. I certainly expect that the analyses described below will be
consistent with these subjective impressions.
I performed 2 types of
analyses.
The first type counted
the number of times that the search pattern “Holmes” appeared, in
comparison with the total number of words, in each story. A low ratio of
“Holmes”/total could indicate a “fake” Sherlock story.
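The first type of analysis can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code inside Sherlock(): the function name holmes_fraction is hypothetical, and counting words by splitting on whitespace is my assumption about how the total might be obtained.

```r
# Illustrative sketch (hypothetical helper, not part of the package):
# count literal occurrences of a pattern and compare with the word total.
holmes_fraction <- function(lines, pattern = "Holmes") {
  # gregexpr() returns -1 for a line with no match, so count positions > 0
  hits <- gregexpr(pattern, lines, fixed = TRUE)
  n_hits <- sum(vapply(hits, function(m) sum(m > 0), integer(1)))

  # Assumption: a "word" is any run of non-whitespace characters
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  n_words <- sum(nzchar(words))

  c(hits = n_hits, words = n_words, fraction = n_hits / n_words)
}

# Made-up text, used only to exercise the function
story <- c('"My dear Holmes," said I, "this is too much."',
           "Holmes laughed and Holmes rose.",
           "We walked on in silence.")
res <- holmes_fraction(story)
print(res)
```

Because the match is a literal string, “Holmes,” and “Holmes's” both count as hits, which is the intended behavior for a proxy of this kind.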
The second type
constructed a cumulative distribution of “Holmes” and of total words.
For example, if the pattern “Holmes” appeared twice in the first line of
the story, not at all in the second, and once in the third, the pattern cumulative
distribution would be 2 2 3. The total words cumulative distribution
would be something like 10 21 29. If Holmes appeared throughout the whole
story, we would see a cumulative distribution like
e.g. 2 4 5 8 9 . . . 20 25 32.
If Holmes appeared only at the end of the story, we would see a
cumulative distribution like
e.g. 0 0 0 0 . . . 2 4 5 8 9.
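The cumulative distributions described above can be sketched in base R as a per-line count followed by cumsum(). Again, this is a sketch under assumptions rather than the package's implementation: the function name cumulative_counts is hypothetical, the three lines of text are made up, and words are assumed to be whitespace-delimited.

```r
# Illustrative sketch (hypothetical helper, not part of the package):
# cumulative pattern hits and cumulative word totals, line by line.
cumulative_counts <- function(lines, pattern = "Holmes") {
  hits_per_line <- vapply(gregexpr(pattern, lines, fixed = TRUE),
                          function(m) sum(m > 0), integer(1))
  # Assumption: words are runs of non-whitespace characters
  words_per_line <- vapply(strsplit(trimws(lines), "[[:space:]]+"),
                           function(w) sum(nzchar(w)), integer(1))
  data.frame(pattern = cumsum(hits_per_line),
             words   = cumsum(words_per_line))
}

# Made-up lines with 2, 0 and 1 hits and 10, 11 and 8 words
lines <- c("Holmes looked up and Holmes smiled at me very kindly",
           "I had called upon my friend one morning in the autumn",
           "and found Holmes deep in conversation with someone")
cum <- cumulative_counts(lines)
print(cum)
```

Plotting cum$pattern against cum$words would give a curve that rises steadily when the hits are spread through the story, and stays flat and then rises late when they are concentrated at the end.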
We can see whether there
is a relationship between the results of the 2 types of
analyses.
We can also see whether
there is a “typical” result for the 2 types of analyses that holds for
“most” of the stories, with just a few stories that deviate. Or perhaps
there is no “typical,” and each story has its own characteristic
analysis.
Results and Discussion
Basic Word Count Analysis
Table 1 shows that the
fraction (number of instances of “Holmes”/total number of words) covers
a more than 18-fold range, from a low of 0.00066 for “The Musgrave Ritual”
to a high of 0.01209 for “The Adventure of the Three Gables.” This large
range is consistent with the hypothesis that Holmes is but a minor
character in a number of the stories.
Table 1. Fraction values
for the search pattern “Holmes,” across all 60 Sherlock Holmes
stories.
| Title | Words | Fraction |
|---|---|---|
| A Study In Scarlet | 43167 | 0.002220 |
| The Sign of the Four | 42915 | 0.003150 |
| A Scandal in Bohemia | 8512 | 0.005640 |
| The Red-Headed League | 9098 | 0.005830 |
| A Case of Identity | 6971 | 0.006600 |
| The Boscombe Valley Mystery | 9614 | 0.004890 |
| The Five Orange Pips | 7312 | 0.003420 |
| The Man with the Twisted Lip | 9192 | 0.003150 |
| The Adventure of the Blue Carbuncle | 7805 | 0.004870 |
| The Adventure of the Speckled Band | 9801 | 0.005710 |
| The Adventure of the Engineer’s Thumb | 8281 | 0.001690 |
| The Adventure of the Noble Bachelor | 8100 | 0.004200 |
| The Adventure of the Beryl Coronet | 9674 | 0.002890 |
| The Adventure of the Copper Beeches | 9943 | 0.004320 |
| Silver Blaze | 9573 | 0.005330 |
| The Yellow Face | 7497 | 0.002530 |
| The Stock-Broker’s Clerk | 6782 | 0.003830 |
| The "Gloria Scott" | 7835 | 0.001020 |
| The Musgrave Ritual | 7568 | 0.000660 |
| The Reigate Squires | 7196 | 0.007370 |
| The Crooked Man | 7126 | 0.001540 |
| The Resident Patient | 6607 | 0.005900 |
| The Greek Interpreter | 6996 | 0.004150 |
| The Naval Treaty | 12603 | 0.005320 |
| The Final Problem | 7155 | 0.004190 |
| The Adventure of the Empty House | 8689 | 0.004600 |
| The Adventure of the Norwood Builder | 9213 | 0.006950 |
| The Adventure of the Dancing Men | 9702 | 0.006290 |
| The Adventure of the Solitary Cyclist | 7824 | 0.006260 |
| The Adventure of the Priory School | 11458 | 0.007510 |
| The Adventure of Black Peter | 8098 | 0.006790 |
| The Adventure of Charles Augustus Milverton | 6699 | 0.008360 |
| The Adventure of the Six Napoleons | 8319 | 0.006970 |
| The Adventure of the Three Students | 6456 | 0.007590 |
| The Adventure of the Golden Pince-Nez | 8921 | 0.006500 |
| The Adventure of the Missing Three-Quarter | 8011 | 0.006490 |
| The Adventure of the Abbey Grange | 9141 | 0.004490 |
| The Adventure of the Second Stain | 9621 | 0.008320 |
| The Hound of the Baskervilles | 59015 | 0.003250 |
| The Valley Of Fear | 57480 | 0.002610 |
| The Adventure of Wisteria Lodge | 11375 | 0.005450 |
| The Adventure of the Cardboard Box | 8510 | 0.003170 |
| The Adventure of the Red Circle | 7277 | 0.004260 |
| The Adventure of the Bruce-Partington Plans | 10668 | 0.005810 |
| The Adventure of the Dying Detective | 5769 | 0.008670 |
| The Disappearance of Lady Frances Carfax | 7665 | 0.007050 |
| The Adventure of the Devil’s Foot | 9968 | 0.005920 |
| His Last Bow | 6054 | 0.003630 |
| The Illustrious Client | 9731 | 0.006170 |
| The Blanched Soldier | 7705 | 0.001690 |
| The Adventure Of The Mazarin Stone | 5639 | 0.009040 |
| The Adventure of the Three Gables | 6039 | 0.012090 |
| The Adventure of the Sussex Vampire | 5957 | 0.007550 |
| The Adventure of the Three Garridebs | 6184 | 0.008090 |
| The Problem of Thor Bridge | 9569 | 0.006170 |
| The Adventure of the Creeping Man | 7646 | 0.008240 |
| The Adventure of the Lion’s Mane | 7171 | 0.001950 |
| The Adventure of the Veiled Lodger | 4457 | 0.004940 |
| The Adventure of Shoscombe Old Place | 6230 | 0.008190 |
| The Adventure of the Retired Colourman | 5498 | 0.007640 |
The 4 novels (“A Study in
Scarlet,” “The Valley of Fear,” “The Sign of the Four,” and “The Hound
of the Baskervilles”) have fractions among the 14 lowest, but are on an
even footing with a substantial number of the short stories. This
observation is consistent with the hypothesis that the longer novels are
mostly a ploy to tell a long non-Holmesian story, but a good number of
the short stories also were used for that purpose.
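The two numerical statements made about Table 1 (the more than 18-fold range, and the 4 novels falling among the 14 lowest fractions) can be checked with a few lines of base R. The fractions below are transcribed from Table 1 in row order; the variable names are mine, and ties are ranked with the "min" method, which is a choice I made for this illustration.

```r
# Fractions transcribed from Table 1, in the table's row order
fraction <- c(
  0.002220, 0.003150, 0.005640, 0.005830, 0.006600, 0.004890,
  0.003420, 0.003150, 0.004870, 0.005710, 0.001690, 0.004200,
  0.002890, 0.004320, 0.005330, 0.002530, 0.003830, 0.001020,
  0.000660, 0.007370, 0.001540, 0.005900, 0.004150, 0.005320,
  0.004190, 0.004600, 0.006950, 0.006290, 0.006260, 0.007510,
  0.006790, 0.008360, 0.006970, 0.007590, 0.006500, 0.006490,
  0.004490, 0.008320, 0.003250, 0.002610, 0.005450, 0.003170,
  0.004260, 0.005810, 0.008670, 0.007050, 0.005920, 0.003630,
  0.006170, 0.001690, 0.009040, 0.012090, 0.007550, 0.008090,
  0.006170, 0.008240, 0.001950, 0.004940, 0.008190, 0.007640
)

# Row positions of the 4 novels in Table 1
novels <- c("A Study In Scarlet" = 1, "The Sign of the Four" = 2,
            "The Hound of the Baskervilles" = 39, "The Valley Of Fear" = 40)

# Rank 1 is the lowest fraction; tied values share the smaller rank
ranks <- rank(fraction, ties.method = "min")
novel_ranks <- ranks[novels]
names(novel_ranks) <- names(novels)
print(sort(novel_ranks))

# Ratio of the highest to the lowest fraction
fold_range <- max(fraction) / min(fraction)
print(round(fold_range, 1))
```

The largest of the four ranks is 14 (“The Hound of the Baskervilles”), and the ratio of the extremes is a little over 18, in agreement with the text.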
In the interest of full
disclosure, I recall that “The Adventure of the Blue Carbuncle” was a
story that featured Holmes actively pursuing clues, and I would
have expected it to be at the top of the range of fractions. Yet its
fraction is roughly in the middle of the range.
The data of Table 1 are
displayed as a histogram (Figure 1), illustrating a roughly normal
distribution of fraction values.
Figure 1. Histogram of
fraction values for the search pattern “Holmes,” across all 60 Sherlock
Holmes stories.
We can perhaps somewhat
arbitrarily divide the fraction values into 3 types:
0.00066 <= low < 0.004
0.004 <= normal < 0.008
0.008 <= high <= 0.01209
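This three-way division can be expressed with the base R function cut(). The sketch below is only an illustration: the function name classify_fraction is hypothetical, and the break points are the somewhat arbitrary ones listed above.

```r
# Illustrative sketch (hypothetical helper): label a fraction as
# low, normal or high using the break points given in the text.
classify_fraction <- function(fraction) {
  cut(fraction,
      breaks = c(0.00066, 0.004, 0.008, 0.01209),
      labels = c("low", "normal", "high"),
      right = FALSE,          # intervals are closed on the left: [a, b)
      include.lowest = TRUE)  # with right = FALSE, also include the top value
}

# Three fractions taken from Table 1
examples <- c("The Musgrave Ritual" = 0.000660,
              "The Adventure of the Blue Carbuncle" = 0.004870,
              "The Adventure of the Three Gables" = 0.012090)
classes <- data.frame(title = names(examples),
                      type = as.character(classify_fraction(examples)))
print(classes)
```

Values outside the range 0.00066 to 0.01209 would be returned as NA, which is acceptable here because those two numbers are the observed minimum and maximum.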
Another way to look at
the same data is a scatter plot of fraction values as a function of the
total number of words (Figure 2).
Figure 2. Scatter plot of
fraction values as a function of the total number of words in the story
for the search pattern “Holmes,” across all 60 Sherlock Holmes
stories.
As expected, we can
clearly see a pattern for the 4 novels in the lower right corner.
However, the short stories do not display a discernible
pattern.
It is interesting that
the fraction values tend to increase in accord with the publication date
(Figure 3).
Figure 3. Scatter plot of
fraction values as a function of the chronological order for the search
pattern “Holmes,” across all 60 Sherlock Holmes
stories.
However, the large amount
of scatter in the data prevents this correlation from achieving statistical
significance. Perhaps Conan Doyle eventually started feeling some
remorse over “cheating” his loyal readers. The trend is consistent with
2 of the 4 novels being written as the first 2 stories. The other 2
novels were written around the middle chronologically, and like the
first 2, they were written one after the other.
Cumulative Distribution Analysis
The results of the
cumulative distribution analysis are given in a series of 60 graphs, one
for each story. Let us first take the 2 most extreme stories, as they
might be expected to most clearly show distinct characteristic
behaviors.
“The Musgrave Ritual”
(Figure 4) exhibited the lowest overall fraction value
(0.00066).
Figure 4. Cumulative
distribution analysis for the search pattern “Holmes” in “The Musgrave
Ritual.”
This story had such a low
number of instances of “Holmes” that I was worried that the program had
made a mistake, so I examined this story directly. Yes, there really
were just 5 instances of “Holmes.” The cumulative analysis (Figure 4)
shows that after the first 2000 words (or around 25% of the story),
“Holmes” is only mentioned once. This is consistent with the final 75%
of the story not really being so much a Sherlock Holmes story as it is a
story about a family ritual.
“The Adventure of the
Three Gables” (Figure 5) exhibited the highest overall fraction value
(0.012090).
Figure 5. Cumulative
distribution analysis for the search pattern “Holmes” in “The Adventure
of the Three Gables.”
The cumulative graph is
so different from that for “The Musgrave Ritual” that it is almost hard
to believe that the 2 stories were written by the same author. The
cumulative graph for “The Adventure of the Three Gables” shows an
uninterrupted presence of Holmes throughout the whole
story.
Remarkably, I
remembered that an annotator had questioned whether one of the stories was
written by Conan Doyle, because it contained some racist epithets by
Holmes that were totally contrary to his character. Believe it or not,
I just now looked up the name of that story, and it is in fact “The
Adventure of the Three Gables” (see, e.g., [https://lesliesklinger.com/2020/07/07/the-elephant-in-the-room/]).
Although I mentioned that
the cumulative graph was very different from that for “The Musgrave
Ritual,” there are many other stories with cumulative graphs that are
qualitatively essentially identical to that for “The Adventure of the
Three Gables,” so it cannot be ruled out as an authentic Conan Doyle
story on that basis.
I had mentioned above
that there was one short story, “The Adventure of the Blue Carbuncle”
(Figure 6), which I remember had a very consistent degree of
participation by Holmes. This recollection is borne out by the
cumulative distribution analysis. The fraction value is 0.004870.
According to the histogram (Figure 1), this value is around the mean for
all 60 stories. The cumulative distribution graph (Figure 6) shows a
consistent presence of “Holmes” throughout the entire story. Apparently the moderate number of mentions of
“Holmes” was distributed evenly through the story.
Figure 6. Cumulative
distribution analysis for the search pattern “Holmes” in “The Adventure
of the Blue Carbuncle.”
Enhanced Features
In order to keep this
initial description more comprehensible, I did not present certain
enhanced features that were added to the package after the manuscript
was completed. These features will be presented in a subsequent
manuscript.
These
include: