Damn! You just put a crucial document through your paper shredder. It was a passage from Moby Dick that you wanted to share with a friend in medical school. You managed to salvage some bits of paper from the shredder with these words (in random order):
into his same should through had like with out which a down they great and were so not the
Well, these words should be enough to find the passage, which was definitely fewer than 500 words long and ended with a hilarious suggestion for how to improve medical education.
This might be useful: http://www.gutenberg.org/cache/epub/2701/pg2701.txt
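One way to start the search is a sliding window over the text. The sketch below is only a starting point, not a solution: it assumes you have already loaded the book into a string yourself, and the helper names `tokenize` and `shortest_window` are made up for illustration. It returns the shortest stretch of at most `max_len` words that contains every salvaged word. Since the salvaged words are all common, expect to combine this with the other clue (how the passage ends) to pin down the right one.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation.
    (Simplification: apostrophes split words, so "don't" becomes "don", "t".)"""
    return re.findall(r"[a-z]+", text.lower())

def shortest_window(words, needed, max_len=500):
    """Return (start, end) of the shortest slice words[start:end] that
    contains every word in `needed` and is at most max_len words long,
    or None if no such slice exists."""
    needed = set(needed)
    have = Counter()          # how often each needed word is in the window
    missing = len(needed)     # needed words not yet in the window
    best = None
    left = 0
    for right, w in enumerate(words):
        if w in needed:
            if have[w] == 0:
                missing -= 1
            have[w] += 1
        # Shrink from the left for as long as the window still covers everything.
        while missing == 0:
            length = right - left + 1
            if length <= max_len and (best is None or length < best[1] - best[0]):
                best = (left, right + 1)
            lw = words[left]
            if lw in needed:
                have[lw] -= 1
                if have[lw] == 0:
                    missing += 1
            left += 1
    return best

sample = tokenize("The cat sat. A dog and a cat ran; the dog sat down.")
print(shortest_window(sample, {"dog", "cat"}))
```

For the real puzzle you would pass the tokenized book and the salvaged words split on whitespace, keeping `max_len=500`.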
Now do the reverse. What is the optimal moby set for the first 500 words of the book? (I mean the passage that begins with 'Call me Ishmael.')
A moby set is a set of words (ignoring capitalization and punctuation) that co-occur in a passage and in no other passage of equal length in the text, so those words are a unique marker for that passage. The optimal moby set for a passage is the one with the lowest moby score: the sum of the inverse counts of its words, where a word's count is the number of times it occurs in the whole text. So the optimal moby set is a small set of high-frequency words that uniquely defines a passage. The smaller the moby score, the better. For example:
{'call','me','ishmael'}
This is a valid moby set for the first 500 words of the book, because those 3 words co-occur only in the first 500 words of the book. But it is far from optimal: its moby score is relatively large, mainly because 'ishmael' occurs only twice in the entire book. Each word contributes the inverse of its count in the book, and the moby score is the sum:
sum( {1.0/53, 1.0/339, 1.0/2} ) ≈ 0.52
The optimal moby score is much closer to zero.
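Here is a minimal sketch of the score calculation, using only the three counts quoted above. The function name `moby_score` is made up for illustration, and for the real puzzle you would build `counts` from the whole book rather than typing it in by hand.

```python
from collections import Counter

def moby_score(word_set, counts):
    """Sum of inverse counts for the words in word_set.
    Assumes every word actually occurs in the text (count > 0)."""
    return sum(1.0 / counts[w] for w in word_set)

# Counts taken from the example in the text, not recomputed from the book.
counts = Counter({"call": 53, "me": 339, "ishmael": 2})

score = moby_score({"call", "me", "ishmael"}, counts)
print(round(score, 2))  # 0.52
```

Note how the rare word dominates: 'ishmael' alone contributes 0.5 of the total.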
(Note that you must capture all sentences that correspond to the first 500 words, i.e. one should not need to know that one is looking for a 500-word passage. Rather, the moby set should uniquely correspond to the sentences that fit within the first 500 words of the book.)
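If you want to test a candidate set, something like the brute-force check below might help. It rests on assumptions that go beyond the definition above: passages are treated as fixed-length word windows, windows that overlap the target passage are skipped, and sentence boundaries are ignored entirely. The names `tokenize` and `is_moby_set` are hypothetical, and the loop is slow on a whole book.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def is_moby_set(words, candidate, start, length):
    """True if every word in `candidate` occurs in words[start:start+length]
    and no window of the same length lying entirely outside that passage
    contains all of them. (Assumption: overlapping windows are ignored.)"""
    candidate = set(candidate)
    if not candidate <= set(words[start:start + length]):
        return False
    for i in range(len(words) - length + 1):
        overlaps_target = i < start + length and i + length > start
        if overlaps_target:
            continue
        if candidate <= set(words[i:i + length]):
            return False
    return True

# Toy text, not the book: the "passage" is the first 3 words.
toy = tokenize("Call me Ishmael. Some years ago, never mind how long, call me maybe.")
print(is_moby_set(toy, {"call", "me", "ishmael"}, 0, 3))
print(is_moby_set(toy, {"call", "me"}, 0, 3))
```

In the toy text the first call succeeds and the second fails, because 'call' and 'me' co-occur again near the end.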
And finally, to take your mind off white whales, consider this puzzle:
Cross out nine letters from 'naisnienlgeltetweorrsd' such that a single word remains. (Don't get mad!)
If you have a solution you'd like to share, see the Solutions page for instructions.