Next: 4.3.1 Assumptions and Conventions
Up: 4. Using Frequencies for
Previous: 4.2 Decryption and Frequencies:
4.3 Recognizing Unscrambled Frequencies
To use the idea ``unscramble frequencies to crack a cryptosystem'', we
need to study unscrambled frequencies (frequencies of plaintext) to
look for recognizable structures/patterns that identify frequencies as being
unscrambled.
So, let us look at plaintexts from a wide range of areas: If we look
at a narrow range, then we might draw conclusions that apply to only a
narrow range. So, let us consider the following three plaintexts:
- Genesis from the 1970 edition of The New English Bible,
available at
http://etext.library.cornell.edu/cgi-bin/bie-idx?type=HTML&byte=116542360&rgn=book
- An edition from 1597 of William Shakespeare's Romeo and Juliet,
available at
http://etext.library.cornell.edu/cgi-bin/shakeEd-idx?type=HTML&byte=186196093&rgn=play
- A slightly old version of the Announcements page for CS100,
available at
http://courses.cs.cornell.edu/cs100/2000sp/oldannounce.txt
Each of these documents contains a corpus or large plaintext. To compute frequency tables, we must deal with both upper-
and lower-case letters, punctuation, and possibly accents. We defer
the details of what we do to Section 4.3.1, and move immediately
to Figure 20 for the unigram frequencies of these corpuses.
Figure 20:
Unigram Frequencies for Three English Corpuses
 |
Amazing! These wildly different plaintexts all have very similar
frequencies! (Although not shown for space reasons, the bigram
frequencies are also very similar.) This might lead us to postulate that
English has frequencies that are intrinsic or inherent to its very
nature, and indeed this is the case: English has intrinsic
frequencies that large plaintext is very likely to ``closely''
approximate. Therefore, a good approximation to intrinsic frequencies
can be obtained from just about any corpus.
If you try other natural languages, you will discover that they, too,
have intrinsic frequencies. This means that as long as we continue
to use nothing specific to English we will develop an algorithm that
works for natural languages besides English.
Recall that we studied plaintext frequencies to find a way to
recognize ``unscrambled frequencies''. Since we have seen that
frequencies of large plaintext are ``close'' to intrinsic frequencies,
we can use
that as our criterion: If frequencies are ``close'' to intrinsic
frequencies, then we will consider/assume them to be ``unscrambled''.
Thus, unscramble frequencies means
bring ``close'' to intrinsic frequencies.
This has three immediate consequences:
- We need to figure out how to define what ``close'' means precisely
enough for use in a computer program.
- We need a way to compute intrinsic frequencies.
But as we've already observed,
we can simply use the frequencies from just about any large plaintext,
since it is very likely to have frequencies that closely approximate
intrinsic frequencies. A large plaintext used in this
fashion is called training text because it ``teaches'' our program
what the right answer is.
- We need medium to large size ciphertext so that its unscrambled
frequencies are close to intrinsic frequencies: Small plaintext,
like words or brief sentences, tends to not contain enough
letter combinations to be representative of intrinsic
frequencies.4
Roadmap |
Section 3.7 |
Encipher plaintext
scramble frequencies. |
Section 3.7 |
Decipher ciphertext
unscramble frequencies. |
Section 4.2 |
(Hope)
Unscramble frequencies
decipher ciphertext |
Section 4.3 |
Unscramble = Bring ``close'' to intrinsic frequencies |
|
Approximate intrinsic frequencies with training text |
|
Assume ciphertext is medium to large so that unscrambled frequencies resemble intrinsic frequencies |
Section 4.3.1 |
Q: How do we compute frequencies for real texts, e.g. deal with punctuation? |
Section 4.4 |
Q: How do we measure ``closeness'' or distance? |
Section 5 |
Q: What are legal and effective ways to rearrange frequencies? |
Next: 4.3.1 Assumptions and Conventions
Up: 4. Using Frequencies for
Previous: 4.2 Decryption and Frequencies:
Thomas Yan
2000-05-01