About Us

Name: Rob
Biography
Name: filia_evae
Location: philadelphia, PA
Loading...

Create Your Own Blog Find Other Townhall Blogs

Comments

Just how many monkeys = Shakespeare?

A recent blogger has announced that a few million simulated monkeys really could reproduce Shakespeare. This is such a hoary chestnut, that of course, everyone had to go and read just exactly what the fellow actually did, if only to ridicule it. Here's how he describes his project,
Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote...
For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in ASCII between a and z.  It uses Sean Luke’s Mersenne Twister to make sure I have fast, random, well behaved monkeys.  Once the monkey’s output is mapped, it is passed to the reducer which runs the characters through a Bloom Field membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from Project Gutenberg.
The bit about well-behaved monkeys is necessary, because the actual experiment reported,
In 2003, scientists at Paignton Zoo and the University of Plymouth, in Devon in England reported that they had left a computer keyboard in the enclosure of six Sulawesi Crested Macaques for a month; not only did the monkeys produce nothing but five pages consisting largely of the letter S, (the full text may be found here:), they started by attacking the keyboard with a stone, and continued by urinating and defecating on it.
Which would be my reaction too, if at two years old I had been given a toy that was tactilely boring, visually structured for no apparent reason, acoustically monotone, hopelessly inedible and smelled like old coffee. But people aren't as rational as monkeys, and so they spend hours with that thing, banging on it interminably, writing parables about monkeys on keyboards when everyone knows it is really about people. But is it true that a genius monkey could type nine letters of Shakespeare?

Well, of course, but fitting those 9 letter fragments back into a jigsaw puzzle of Shakespeare takes a bona fide human. As one blogger put it, this is just Dawkin's Weasel program all over again--comparing a partially completed solution to the final solution, and modifying only the parts that are wrong. Not a very random way of using those monkeys at all! Imagine taking an MedCat exam where the professor told you which multiple choice problems were wrong and to go back and change them. Even without knowing anything, how long would it take you to score a 100%?

But what would it take to get Shakespeare without any human intervention at all? How long long would it take to collect, say, a page of Shakespeare, say 500 words or 2500 letters completely randomly?

I wouldn't hold my breath for the genius monkeys. This one has been around for a long, long time. From the 2003 internet we can find the software for running it on our own computer. And the blog forums report the progress (cached versions from original site that went down 7 years ago). This comment from Evan Kirschbaum at HP Laboratories:
That seems about right. I'm getting 6 characters about every four or five seconds and there appear to be about 70 characters in the set, so a back of the envelope calculation shows that you should get 7 letters after about 5 minutes, 8 letters after about 6 hours, and 9 letters after about 18 days. (And as I was typing that, I got my first seven letter match: "1. When" from the beginning of Macbeth .) Ten letters would take about three and a half years, and you won't live to see eleven letters (239 years). If a million of us work on this, it'll take two hours to get eleven letters, 14 months to get thirteen, and 82 years to get fourteen. (Got my second seven letter match: "DukeE." from Measure for Measure .)
Okay, how does this work? I'm not sure how Evan justified his calculation, but it seems he was working with the "doubling time", how long does it take to get one more letter is proportionally constant. So he's assuming the ratio Time-for-6-letters / Time-for-5-letters = Time-7 / Time-6.  Let's see how that works out.  His 6-letters is 4.5 sec; 7-letters is 300s (= 5 min);  8-letters is 21600s (= 6 hr); 9-letters is 1555200s (=18 days). Then taking ratios we have 7letter/6letter = 66; 8lett/7lett=72; 9lett/8lett=72; So it looks like his doubling ratio was about 72.

Why ~72? Because he assumed that each letter was drawn from a set of 70 characters, so the information per letter was 70= 26.1 or 6.1 bits. Therefore adding a letter required 6 doublings, or a factor slightly more than 64. But did each letter really add that much information?

Claude Shannon, the inventor of Information Theory, addressed this same problem in one of his little papers that shook the world. He asked a related question: If I erase every other letter (or more) can people reconstruct the sentence? By doing this, he was able to figure out how much information was encoded in the first letter, the 2nd letter and so forth. The answer is that the first letter has some 4 or so bits of information decreasing smoothly to about 1.7 bits of information by the time you reach the 10th letter. You can check that out with this web applet.

Then the probabilities are easy to calculate--just add up the bits. Not having his paper in front of me, let me just pretend it is the sequence 4 + 3 + 3 + 3 + 3 +2 +2 + 2 + 2 = 24 bits in 9 letters. Then 1 in 2^24 are the number of equivalent bits of information in the string, which is 1.68 x 10^7 tries, or 16 million tests. If his circa 2002 computer operates at 100 MHz, and does a thousand comparisons per second, then 16 million tests is completed in 160 seconds or about 3 minutes.

Why the difference?

Well, for one thing, Kirschenbaum was calculating the information from a 70 character alphabet, and Shannon reduced it to 27. Already that is a difference of 6.2 bits versus 4.7 bits. But more importantly, Shannon is asking how much information English speakers assign to each letter, which is a lot less than what the computer assigns to each letter. That's just another way of saying that English text can be compressed by at least a factor of 2. That is, 26 letters of the alphabet plus a space (using only caps) is roughly 4.75 bits of information, but real speakers of English find only 1.7 bits of information after the first few letters. That's because English readers know the rules about words starting with X, and a U after a Q, etc. All those rules mean that the next letter is not really 1 in 27, but more like 1 in 3. So a decent compression algorithm should be able to squeeze English down by 1.7/4.7= 35%.

Back to our Million Monkeys algorithm. When it converts all of Shakespeare into 9-character strings to make the comparison with the randomly generated strings, how many strings is that? The Shakespeare corpus has 884,647 words, which are probably a tad over 5 letters each, for about 4.8 million letters. So does that mean there are approximately 4.8 million 9-letter strings we can extract from his works? No, because those 4.8 million character strings from Shakespeare are all restricted by the rules of English, and so many of them will be identical. Using Shannon's method of calculating above we will have about 224 possible strings, or only 16 million potential strings. This is larger than Shakespeare's corpus, so perhaps the duplicates aren't so big a problem. Let's say this reduces Shakespeare to 3 million unique character strings. Then the calculation using Shannon's method (27 characters = 26 capitals and a space), we get the following statistics:

5 letters = 275 / 3 million = 4.78
6 letters = 276 / 3 million = 129
7 letters = 277 / 3 million = 3500
8 letters = 278 / 3 million = 94,000
9 letters = 279 / 3 million = 2,500,000

Notice that the comparison to Shakespeare's corpus is a constant that cancels out when ratios are taken. This is what makes Kirschenbaum's method work. But if 6 letters took 5 seconds for Kirschenbaum, 7 letters ought to have taken 2.5 minutes. Since Kirschenbaum saw two 7-letter options spring up while he was typing that text, it suggests that his factor 72 was too large, and he should have been using something a bit smaller, maybe closer to Shannon's 27.

So to answer our question about achieving a page of Shakespeare of 2500 letters, we can now calculate that:
272500 / 3 million = 8.5 x 103571 which is longer than the age of the universe by a long shot. If we want Shakespeare any time soon, we'll have to hire some of Dawkin's weasels.
====
HT: Jonathan, Jonathan and Jack.
Email ItEmail It | Print ItPrint It | CommentsComments (0) | TrackbacksTrackbacks (0) | Flag as offensiveFlag as Offensive